Difference between revisions of "Importing UMLS To Virtual Appliance"

From NCBO Wiki
Revision as of 17:24, 4 December 2015


The NCBO Virtual Appliance supports OBO and OWL ontology formats but not UMLS in its native form. To bridge this gap, we have developed a project called UMLS2RDF that transforms UMLS ontologies into OWL/RDF.

UMLS2RDF is a Python script that connects to a UMLS MySQL installation and extracts the UMLS ontologies in a format that the Appliance can work with.

Install UMLS MySQL

To import UMLS ontologies, a local installation of the UMLS MySQL release needs to be available. Please refer to the UMLS documentation for instructions on how to install the UMLS MySQL distribution.

Install UMLS2RDF

  1. First clone the github project:
    git clone https://github.com/ncbo/umls2rdf/
  2. Install the MySQL Python driver. We recommend using pip for this:
    pip install MySQL-python

Configure UMLS2RDF

UMLS2RDF has two configuration files:

  1. conf.py, where the database configuration (host, name, user and password) and the output folder are specified.
  2. umls.conf, which lists the UMLS ontologies to extract. This is a comma-separated file with the following 4 fields:
    1. SAB
    2. This is legacy. Any value works.
    3. Output file name.
    4. Conversion strategy. Accepted values (load_on_codes, load_on_cuis).
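As a minimal sketch of how such a line could be read, the hypothetical helper below (it is not part of UMLS2RDF) takes the first field as the SAB and the last two fields as the output file name and the conversion strategy, so the legacy field is ignored when it is present. The names `OntologyEntry` and `parse_umls_conf_line` are assumptions made for illustration.

```python
from typing import NamedTuple

# The two strategy names listed above.
VALID_STRATEGIES = {"load_on_codes", "load_on_cuis"}


class OntologyEntry(NamedTuple):
    sab: str
    output_file: str
    strategy: str


def parse_umls_conf_line(line: str) -> OntologyEntry:
    """Parse one comma-separated umls.conf line (hypothetical helper)."""
    fields = [field.strip() for field in line.strip().split(",")]
    if len(fields) < 3:
        raise ValueError(f"expected at least 3 fields, got {len(fields)}: {line!r}")
    # First field is the SAB; the last two are output file and strategy.
    sab, output_file, strategy = fields[0], fields[-2], fields[-1]
    if strategy not in VALID_STRATEGIES:
        raise ValueError(f"unknown conversion strategy: {strategy!r}")
    return OntologyEntry(sab, output_file, strategy)
```

Checking the strategy name before starting a long extraction run catches a typo in umls.conf early.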


With load_on_codes, the ontology is transformed using the identifiers of its original source: the Class IDs are constructed from the MRCONSO.CODE field. With load_on_cuis, the Class IDs are constructed from the CUIs instead.
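The difference between the two strategies can be illustrated with a small sketch. It assumes a Class ID is the base URI followed by the SAB and the chosen identifier; the exact URI layout produced by UMLS2RDF is not stated here, so treat the format, the function name `class_id` and the sample row as assumptions.

```python
from urllib.parse import quote

# Same value as UMLS_BASE_URI in the sample conf.py.
UMLS_BASE_URI = "http://purl.bioontology.org/ontology/"


def class_id(row: dict, sab: str, strategy: str) -> str:
    """Build an illustrative Class ID from one MRCONSO row (assumed layout)."""
    if strategy == "load_on_codes":
        local_id = row["CODE"]  # identifier from the original source
    elif strategy == "load_on_cuis":
        local_id = row["CUI"]   # UMLS concept unique identifier
    else:
        raise ValueError(f"unknown conversion strategy: {strategy!r}")
    # Percent-encode the identifier so it is safe inside a URI.
    return f"{UMLS_BASE_URI}{sab}/{quote(local_id, safe='')}"
```

For the same row, the two strategies yield different Class IDs, so an ontology should be regenerated in full if the strategy is changed.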

In our configuration file, you can see the settings used by our production system. These are all the UMLS ontologies that are publicly available in BioPortal.

Run UMLS2RDF

Once the configuration files are set up, run the command:

python umls2rdf.py

Depending on how many ontologies are extracted, the run time can range from a few minutes to four hours. This process is memory intensive: transforming the largest UMLS ontologies (e.g., SNOMED) requires at least 16 GB of RAM.

Upload files to the NCBO Virtual Appliance

The output files will be located in the folder specified in conf.py. Use the BioPortal Web form available in your appliance to submit the extracted ontologies. IMPORTANT: The ontology format in the submission form should be UMLS.

Hardware Considerations

NCBO dedicates a fair amount of resources (powerful servers) to handle a good portion of UMLS ontologies. Some of the UMLS ontologies contain millions of classes. To import the largest UMLS ontologies (e.g., RXNORM or SNOMEDCT), users will have to run the Appliance in a powerful dedicated environment with 8 GB of RAM and 5 GB of hard disk space available.
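A quick preflight check along these lines can be sketched as follows. The thresholds come from the paragraph above (read as binary gigabytes, which is an assumption); the function names are hypothetical, and `total_ram_bytes` relies on `os.sysconf` names that are available on Linux but not on every platform.

```python
import os
import shutil

GIB = 1024 ** 3
MIN_RAM_BYTES = 8 * GIB   # 8 GB RAM, as stated above
MIN_DISK_BYTES = 5 * GIB  # 5 GB free disk space, as stated above


def meets_requirements(total_ram: int, free_disk: int) -> bool:
    """Return True when both RAM and free disk meet the stated minimums."""
    return total_ram >= MIN_RAM_BYTES and free_disk >= MIN_DISK_BYTES


def free_disk_bytes(path: str = ".") -> int:
    """Free bytes on the filesystem that contains `path`."""
    return shutil.disk_usage(path).free


def total_ram_bytes() -> int:
    """Physical memory in bytes (Linux-only assumption)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
```

On a Linux host, `meets_requirements(total_ram_bytes(), free_disk_bytes("/"))` gives a rough answer before attempting a large import.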

Example Workflow

This workflow for importing UMLS data was contributed by Vincent Emonet and is reproduced here without testing by the NCBO team.

Install UMLS using mmsys

  • Download everything from https://www.nlm.nih.gov/research/umls/licensedcontent/umlsknowledgesources.html
  • Unzip mmsys.zip
  • Put the following files in the now unzipped mmsys directory:
    • 2015AB-1-meta.nlm
    • 2015AB-2-meta.nlm
    • 2015AB-otherks.nlm
    • mmsys.zip (why not?)
    • 2015AB.CHK
  • Run ./run_linux.sh (or run.bat or run_mac.sh, depending on your platform)
  • Select "Install UMLS"
  • Source: path to mmsys directory and Destination: path of a directory where the subset file will be generated
  • Semantic Network -> Choose Database Load Scripts: MySQL 5.6 (to generate the MySQL load script for the semantic network subset, aka STY in BioPortal)
  • Select "New configuration..."
  • Select Default Subset: select the default subset you want (this choice is not really important, since you can specifically choose each thesaurus in the next step)
  • Go to the "Output Options" tab > Write Database Load Scripts
    • Select database > MySQL 5.6
  • Go to the "Source List" and select the sources (aka ontologies in BioPortal) you want
    • Hold Ctrl to select several sources
    • Be careful: an option defines whether the selected sources are excluded from or included in the subset
  • Then click "Done" in the menu at the top of the window
  • Wait for UMLS to install
  • Note: mmsys will generate RRF files. You can use the mmsys software to browse them, but here they are only used to populate a SQL database
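Before launching mmsys, it can help to confirm that the release files listed above are in place. The sketch below is a hypothetical helper (the name `missing_release_files` is an assumption) that reports which of the 2015AB files named in this workflow are absent from the mmsys directory.

```python
from pathlib import Path

# File names taken from the list in this workflow (2015AB release).
REQUIRED_FILES = [
    "2015AB-1-meta.nlm",
    "2015AB-2-meta.nlm",
    "2015AB-otherks.nlm",
    "mmsys.zip",
    "2015AB.CHK",
]


def missing_release_files(mmsys_dir: str) -> list:
    """Return the required file names that are not present in mmsys_dir."""
    base = Path(mmsys_dir)
    return [name for name in REQUIRED_FILES if not (base / name).is_file()]
```

An empty list means all five files were found. For a different UMLS release, the file names in `REQUIRED_FILES` would need to be adjusted.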

Load subset in MySQL

  • Open a MySQL prompt and create the database:
    mysql> create database umls2015ab;
  • Set the database character encoding to UTF-8:
    mysql> ALTER DATABASE umls2015ab CHARACTER SET utf8 COLLATE utf8_unicode_ci;
  • Go to the META directory inside the 2015AB directory (where UMLS was installed)
  • Open the populate_mysql_db.sh script and replace the first lines with your MySQL settings:
    • Note: the script appends "/bin/mysql" to MYSQL_HOME to locate the mysql binary

    MYSQL_HOME=/usr
    user=<username>
    password=<password>
    db_name=umls2015ab

  • Then run ./populate_mysql_db.sh
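Once the load script has finished, a row count on a table such as MRCONSO is a simple way to confirm that data arrived. The sketch below is a hypothetical helper that works with any Python DB-API 2.0 connection; it is an illustration rather than part of the UMLS tooling.

```python
import re


def table_row_count(conn, table: str) -> int:
    """Count the rows of `table` using any DB-API 2.0 connection."""
    # Table names cannot be bound as query parameters, so accept only
    # plain identifiers before interpolating the name into the SQL text.
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", table):
        raise ValueError(f"unexpected table name: {table!r}")
    cur = conn.cursor()
    try:
        cur.execute(f"SELECT COUNT(*) FROM {table}")
        return cur.fetchone()[0]
    finally:
        cur.close()
```

With the MySQL-python driver, the connection could be created with `MySQLdb.connect(host="localhost", user="<username>", passwd="<password>", db="umls2015ab")`. A count of zero suggests the load did not complete.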

Generate RDF from MySQL

  • Clone https://github.com/ncbo/umls2rdf
  • Rename conf_sample.py to conf.py and configure access to your database (example for a local database):

    #Folder to dump the RDF files.
    OUTPUT_FOLDER = "output"

    #DB Config
    DB_HOST = "localhost"
    DB_NAME = "umls2015ab"
    DB_USER = "root"
    DB_PASS = "<password>"

    UMLS_VERSION = "2015ab"
    UMLS_BASE_URI = "http://purl.bioontology.org/ontology/"
    
  • Define the ontology you want to generate in umls.conf
    • Example for LNC-RU-RU: "LNC-RU-RU,LNC-RU-RU.ttl,load_on_codes"
    • To find the name used for your ontology in the database, open MySQL, select the UMLS database and run the following query: "select distinct SAB from MRCONSO;"
  • Run ./umls2rdf.py
  • Collect the .ttl files from the output directory
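The SAB lookup and the umls.conf entries from the steps above can be combined in a short sketch. Both functions are hypothetical helpers that accept any Python DB-API 2.0 connection. The generated lines follow the three-field LNC-RU-RU example in this workflow, which is an assumption about the layout, so compare them with the umls.conf shipped in your checkout before using them.

```python
def list_sabs(conn) -> list:
    """Return the distinct source abbreviations (SAB) found in MRCONSO."""
    cur = conn.cursor()
    try:
        cur.execute("SELECT DISTINCT SAB FROM MRCONSO")
        return sorted(row[0] for row in cur.fetchall())
    finally:
        cur.close()


def umls_conf_lines(sabs, strategy: str = "load_on_codes") -> list:
    """Build one umls.conf line per SAB, modelled on the example above."""
    return [f"{sab},{sab}.ttl,{strategy}" for sab in sabs]
```

Printing `"\n".join(umls_conf_lines(list_sabs(conn)))` gives a starting point that can be trimmed to the ontologies actually needed.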