UMLS Concepts and Terms in Clinical Notes: Large-scale Corpus Analysis



Ontologies such as SNOMED-CT, MeSH, or RxNorm may be used to identify terms in text as normalized concepts, which in turn makes the content of medical language comparable at high throughput. However, systems that intelligently process clinical and biomedical data are often slowed down and distracted by the sheer size of the ontologies they use, and may therefore gain efficiency from customizable filtering. Based on the occurrences of terms in a 51-million-document corpus of clinical notes from Mayo Clinic, this study computes a suite of statistics, with distributional characteristics broken down by source ontology, semantic type, and syntactic type. These statistics suggest empirically based criteria for filtering a lexicon in the clinical domain; a small, intelligent lexicon would make near real-time processing of clinical text a possibility.



Stephen T. Wu is a Research Associate and Instructor in Medical Informatics at Mayo Clinic. With a background in Electrical Engineering (BS/MS) and statistical Natural Language Processing (PhD), Dr. Wu joined the Mayo NLP program in July 2010. His interests include computational semantics and its application to real-world clinical and epidemiological problems, including the discovery and modeling of semantic content in clinical text. To that end, Dr. Wu has conducted comparative, large-scale studies on the semantic output of NLP systems in both the clinical and biomedical domains. He is also developing evidence-based ontological resources. His research in general-domain NLP has included broad-coverage distributional semantics, speech interfaces sensitive to ontological context, and cognitively motivated models of language.