Medline Analysis

From NCBO Wiki
Revision as of 17:10, 5 March 2010 by Rong (talk | contribs)

Introduction

The Unified Medical Language System (UMLS) Metathesaurus is the most widely used underlying source for biomedical natural language processing (NLP) systems, even though it was not designed as a terminology for NLP tasks. However, the performance of these systems is not satisfactory. In this study, we systematically analyzed UMLS terms by examining their occurrences in over 18 million MEDLINE abstracts written in natural language. Our goals are threefold: 1. analyze UMLS term frequency and syntactic distribution in MEDLINE; 2. build an automatically filtered UMLS Metathesaurus based on the MEDLINE analysis; 3. build an augmented UMLS Metathesaurus in which each term is associated with its MEDLINE frequency and syntactic distribution statistics. The automatically filtered and augmented UMLS Metathesaurus can be used to improve the efficiency and precision of UMLS-based information retrieval and NLP tasks. After automatic MEDLINE filtering, the augmented UMLS contains 518,835 terms, roughly 13% of the original terms. Each term in the augmented UMLS is associated with a vector of syntactic distribution statistics and its MEDLINE frequency.

Code repository: <on our g-forge server> .. https://bmir-gforge.stanford.edu/gf/project/

Data and Methods

A total of 18,413,784 abstracts published in MEDLINE from 1965 to 2009 were split into 96,374,837 sentences. Each sentence was parsed with the Stanford Parser to generate a parse tree. We used the publicly available information retrieval library, Lucene, to create an index on the sentences and their corresponding parse trees. The UMLS 2009AB version was used in our study; it includes 5,175,449 distinct English strings and 2,120,271 concepts. The term frequency (sentence level) and document frequency (abstract level) were calculated by counting the occurrences of each UMLS term in all the MEDLINE sentences and abstracts. The tf-idf (term frequency-inverse document frequency) of each UMLS term was calculated as follows: tf_idf = (1 + log10(tf)) * log(N/df), where tf is the term frequency, df is the document frequency, and N is the total number of abstracts (18,413,784). The syntactic types and frequencies for each term were collected from all the parse trees in which the term appears. Each term was assigned a vector of syntactic types and probabilities.
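The tf-idf formula above can be sketched in a few lines of Python. This is only an illustration, not the project's Java code: the function name `tf_idf` is hypothetical, the base of the logarithm in the idf factor is not stated in the text and is assumed here to be 10, and returning 0.0 for terms that never occur is also an assumption.

```python
import math

# Total number of MEDLINE abstracts reported in the text.
N_ABSTRACTS = 18_413_784


def tf_idf(tf, df, n_docs=N_ABSTRACTS):
    """Return (1 + log10(tf)) * log10(n_docs / df).

    tf: number of sentences containing the term (sentence level)
    df: number of abstracts containing the term (abstract level)
    The base-10 logarithm in the idf factor is an assumption.
    """
    if tf <= 0 or df <= 0:
        # Assumption: a term that never occurs gets a score of 0.
        return 0.0
    return (1 + math.log10(tf)) * math.log10(n_docs / df)
```

For example, a term found in 100 sentences spread over 10 abstracts of a 1,000-abstract collection would score (1 + 2) * 2 = 6 under these assumptions.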

Data and data format

1. data directory on biox2:

Medline sentences: /scratch/users/xurong/medline_all/sentences

Medline parsetrees: /scratch/users/xurong/medline_all/trees

Both the sentences and trees directories contain 100 files. The file in which a sentence or parse tree is stored is determined by the last two digits of the PMID. For example, all sentences from the abstract with PMID 13062500 are stored in the file /scratch/users/xurong/medline_all/sentences/00.txt
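The sharding rule can be expressed as a small helper. This is a minimal sketch: the function name `shard_file` is hypothetical, and it assumes that files in the trees directory follow the same NN.txt naming as the sentences directory and that a single-digit PMID is zero-padded to two digits; neither detail is confirmed by the text above.

```python
BASE_DIR = "/scratch/users/xurong/medline_all"


def shard_file(pmid, kind="sentences"):
    """Return the path of the file holding the given PMID.

    kind is "sentences" or "trees". The NN.txt naming for trees and
    the zero-padding of short PMIDs are assumptions.
    """
    if kind not in ("sentences", "trees"):
        raise ValueError("kind must be 'sentences' or 'trees'")
    # The last two digits of the PMID select one of the 100 files.
    suffix = int(pmid) % 100
    return f"{BASE_DIR}/{kind}/{suffix:02d}.txt"
```

With this helper, `shard_file(13062500)` gives the 00.txt path used in the example above.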


2. sentence file format: (pmid_sentenceID|sentence|year)

13062100_0|Further studies on the formation of adrenaline and noradrenaline in the body|1953

13062100_0 means that the sentence "Further studies on the formation of adrenaline and noradrenaline in the body" comes from the abstract with PMID 13062100 and is its title. Sentence IDs start at 0 for the title and increase by one for each following sentence of the abstract.
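A line in this format can be split into its fields as sketched below. The names `SentenceRecord` and `parse_sentence_line` are hypothetical. The sketch splits on the first and the last `|` so that a `|` inside the sentence text would not break the record; whether such sentences occur in the data is not known.

```python
from typing import NamedTuple


class SentenceRecord(NamedTuple):
    pmid: str
    sentence_id: int  # 0 is the title
    text: str
    year: int


def parse_sentence_line(line):
    """Parse one 'pmid_sentenceID|sentence|year' line."""
    key, rest = line.rstrip("\n").split("|", 1)
    text, year = rest.rsplit("|", 1)
    pmid, sentence_id = key.split("_", 1)
    return SentenceRecord(pmid, int(sentence_id), text, int(year))
```

A record whose `sentence_id` is 0 is the title of the abstract.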

3. parse tree file format: pmid_sentenceID|parse tree

1306200_1|(ROOT [165.455] (S [165.352] (NP [4.654] (PRP [3.405] We)) (VP [150.901] (VBN [7.607] reviewed) (NP [45.822] (NP [26.386] (DT [0.650] the) (JJ [13.702] neuroimaging) (NNS [7.227] studies)) (PP [19.028] (IN [0.669] of) (NP [17.957] (CD [6.498] 150) (NNS [7.159] patients)))) (PP [91.214] (IN [2.595] with) (NP [87.790] (NP [40.279] (JJ [10.955] cavernous) (NNS [11.334] sinus) (NNS [9.944] tumors)) (VP [45.296] (VBN [7.882] operated) (PRT [3.235] (RP [3.173] on)) (PP [28.914] (IN [4.860] during) (NP [23.382] (DT [3.222] an) (JJ [11.114] 8-year) (NN [6.078] period)))))))))
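To show how such a line can be read, the sketch below separates the key from the tree and pulls out the leaf nodes, i.e. the (part-of-speech tag, word) pairs, skipping the bracketed scores. It is a regex-based illustration with the hypothetical name `parse_tree_line`, not the method used by the project, and it does not rebuild the full tree or the phrase-level syntactic types.

```python
import re

# A leaf looks like "(TAG [score] word)"; inner nodes have "(" after the score.
LEAF_PATTERN = re.compile(r"\(([^\s()]+) \[[^\]]+\] ([^\s()]+)\)")


def parse_tree_line(line):
    """Split 'pmid_sentenceID|parse tree' and list the (tag, word) leaves."""
    key, tree = line.rstrip("\n").split("|", 1)
    leaves = LEAF_PATTERN.findall(tree)
    return key, leaves
```

Applied to the example line above, the leaves begin with ("PRP", "We") and ("VBN", "reviewed").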

Program

1. code: /home/xurong/java/src/medline_analysis

2. how to compile: [xurong@frontend1 medline_analysis]$ ant

3. how to run:

1. medline count:

input: the terms to be counted

output: term|pmid_sentenceID

sh scripts/t.sh -Xmx2048m medline_analysis.MedlineCountFinder find_single /scratch/users/xurong/medline_all/sentences/00.txt 00.input 00.output stopwords_2.txt not

[xurong@frontend1 medline_analysis]$ more 00.input

breast cancer

[xurong@frontend1 medline_analysis]$ more 00.output

breast cancer|4114100_0

breast cancer|14486400_0

breast cancer|13886800_0

breast cancer|168100_1

breast cancer|10791000_0
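Output in the term|pmid_sentenceID format can be aggregated into the sentence-level term frequency and abstract-level document frequency described in Data and Methods. The sketch below is an illustration only: the function name `count_frequencies` is hypothetical, and it assumes one output line per matching sentence, with repeated lines counted once.

```python
from collections import defaultdict


def count_frequencies(lines):
    """Map each term to (term frequency, document frequency).

    Term frequency is the number of distinct sentences containing the
    term; document frequency is the number of distinct PMIDs.
    """
    sentences_by_term = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Split on the last "|" in case the term itself contains one.
        term, sentence_key = line.rsplit("|", 1)
        sentences_by_term[term].add(sentence_key)

    frequencies = {}
    for term, keys in sentences_by_term.items():
        pmids = {key.split("_", 1)[0] for key in keys}
        frequencies[term] = (len(keys), len(pmids))
    return frequencies
```

For the five output lines shown above, this gives a term frequency of 5 and a document frequency of 5 for "breast cancer", since every hit comes from a different abstract.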