Annotator Dataset Workflow Howto

From NCBO Wiki

Chapter 1: Synchronizing Data with BioPortal

  • The synchronization with BioPortal data should be performed regularly (currently scheduled bi-weekly or on-demand).
  • The synchronization (Incremental Update) should be done in an environment separate from Staging and PROD.
Environment for OBS Data Population
Instance of tomcat : ncbodev-obs
Instance of DB : ncbodev-obsdb1.sunet
  • The steps are as follows:
  1. Remove outdated ontologies from the OBS Database (e.g. older versions of ontologies that are no longer in BioPortal). Invoking this Restlet removes all the outdated ontology data and the associated entities such as concepts, terms, relations, semantic types and hierarchy information.

    Query: Get List of ontologies to be removed (old): http://ncbodev-obs:8080/obs_hibernate/admin/ontologies/list/old
    Run: http://ncbodev-obs:8080/obs_hibernate/admin/ontologies/remove

  2. Add new ontologies from BioPortal to OBS. Invoking this Restlet adds all the new ontology data and the associated entities such as concepts, terms, relations, semantic types and hierarchy information.

    Query: Get List of ontologies to be added (new): http://ncbodev-obs:8080/obs_hibernate/admin/ontologies/list/new
    Run: http://ncbodev-obs:8080/obs_hibernate/admin/ontologies/add

  3. Populate Concepts (For details, please refer to Chapter 2.1)

    http://ncbodev-obs:8080/obs_hibernate/loaderBigConcepts/all

  4. Populate Hierarchy (For details, please refer to Chapter 2.2)

    http://ncbodev-obs:8080/obs_hibernate/loaderBigPaths/all

    To monitor progress and errors:

    1. Check the "status" field in the obs_ontology table in the OBS DB (ncbodev-obsdb1.sunet).
    2. Check the Tomcat log (ncbodev-obs: /usr/local/tomcat5/logs).
  5. Create Dictionary: step 3 ("Populate Concepts") must be complete before running this step (For details, please refer to Chapter 4)
  6. Create Mapping Data (For details, please refer to Chapter 6)
  • When the update is complete, a snapshot of the DB should be copied (or replicated) to Staging/PROD.
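
The order of the Restlet calls in steps 1 to 4 can be sketched as a small dry-run script. This is only an illustration under assumptions: the function names, the OBS_BASE and DRY_RUN variables, and the use of curl are not part of OBS. This page also does not say whether a Restlet call returns only after the work is finished, so before a real run the "status" field should be checked between the loader steps rather than relying on the call order alone.

```shell
#!/usr/bin/env bash
# Dry-run sketch of the Chapter 1 call order (hypothetical helper, not part of OBS).
# OBS_BASE defaults to the host and context path used on this page.
OBS_BASE="${OBS_BASE:-http://ncbodev-obs:8080/obs_hibernate}"

# Print the Restlet URLs in the order the steps list them, one per line.
obs_sync_plan() {
  printf '%s\n' \
    "$OBS_BASE/admin/ontologies/list/old" \
    "$OBS_BASE/admin/ontologies/remove" \
    "$OBS_BASE/admin/ontologies/list/new" \
    "$OBS_BASE/admin/ontologies/add" \
    "$OBS_BASE/loaderBigConcepts/all" \
    "$OBS_BASE/loaderBigPaths/all"
}

# Walk the plan in order. With DRY_RUN=1 (the default in this sketch) the
# URLs are only printed; otherwise each one is requested with curl and the
# function stops at the first failed HTTP request.
obs_sync_run() {
  for url in $(obs_sync_plan); do
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "would call: $url"
    else
      curl --fail --silent --show-error "$url" || return 1
    fi
  done
}

obs_sync_run
```

Steps 5 and 6 (dictionary and mapping data) are left out here because they have their own prerequisites, described in Chapters 4 and 6.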

Chapter 2: Data Population - Concepts and Hierarchy

Introduction: The OBS application pulls the ontology and concept data from BioPortal via the BioPortal REST services, extracts and computes the hierarchy information, and stores the result in the OBS Database. The computed data is then accessible via the OBS REST services. The BioPortal REST URL is specified in build.properties.

bp.url.base= http://rest.bioontology.org

Data population is divided into two parts – 1. Concepts and 2. Hierarchy. Please see below for the details.

Concepts and Direct Relation (level = 1)

Pre-requisite in "status":

The ontology should be in valid status ("status = 3") in the obs_ontology table in the OBS Database to start this process (i.e. this is the initial status from BioPortal when the ontology is successfully parsed). This also serves as a safety lock to avoid launching population again for ontologies that are already being populated or have already been populated.

Restlet calls available:

  1. For all ontologies: populate all ontologies with valid status ("status = 3") in OBS DB
    http://ncbodev-obs:8080/obs_hibernate/loaderBigConcepts/all
  2. For a specific ontology : populate the given ontology if "status = 3". Otherwise the data population process will complain that the status is invalid in the tomcat log and exit.
    http://ncbodev-obs:8080/obs_hibernate/loaderBigConcepts/{ontology_version_id}

Error Handling:

The error is logged both in tomcat catalina.out and DB (obs_error_queue table).

  1. Case 1: BioPortal REST Service is Down

    When you kick off the process (an OBS Restlet call, either via web browser or shell script), it first checks whether the BioPortal REST service is alive. If the service is down, an error message saying so is written to the tomcat log, but the ontology "status" field is not changed. Nothing is changed in the OBS database, so no clean up is needed; simply kick off the process again when the BioPortal REST service is back up. (See "Error Handling" under "Monitoring the progress" for the scenarios in which the "status" changes.)

  2. Case 2: Critical Error

    If the error is critical (a RuntimeException, etc.), the "status" of the ontology in the obs_ontology table becomes "99" and the data population process halts.

  3. Case 3: Non-critical Error

    If the error is not critical, the "status" of the ontology in the obs_ontology table becomes "99", but the data population process continues to populate the rest of the data. An example of a non-critical error is a discrepancy between the data from two different BioPortal Restlet calls, e.g. some of the root concepts returned by the getRootConcepts call are missing from the list of concepts returned by getAllConcepts.

In Cases 2 and 3, data clean up may be necessary. Please see "Error Handling" under "Monitoring the progress", and Chapter 3: Data Clean up.

Internal Users:

Currently OBS uses the "Clone" BioPortal REST Services instead of the BP PROD URL ("Clone" BioPortal is a snapshot of the most recent BioPortal PROD data). This way, OBS is not affected by a BioPortal PROD server restart or shutdown, and it does not degrade the performance of BioPortal PROD. Please use the following URL:

bp.url.base=http://ncboprod-core2.stanford.edu:8080

Indirect Relation Hierarchy (level > 1)

Pre-requisite in "status":

The ontology should be in valid status ("status = 14") in obs_ontology table in OBS Database to start this process (i.e. "loaderBigConcepts" should have been completed for the given ontology). This is a safety lock to avoid launching population again for the ontologies already in the process of population or already populated.

Pre-requisite in configuration:

Tomcat should be configured to run the stored procedures via JDBC. Please refer to Appendix C. (Tomcat - pre-requisite).

Restlet calls available:

  1. For all ontologies: populate all ontologies with "status = 14"
    http://ncbodev-obs:8080/obs_hibernate/loaderBigPaths/all
  2. For a specific ontology : populate the given ontology if "status = 14". Otherwise the data population process will complain that the status is invalid in the tomcat log and exit.
    http://ncbodev-obs:8080/obs_hibernate/loaderBigPaths/{ontology_version_id}
    e.g. http://ncbodev-obs:8080/obs_hibernate/loaderBigPaths/40671
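
Both loaders share the same URL shape, so a small helper can build the URL for either one. This is a minimal sketch, not an OBS tool: the function name obs_loader_url and the OBS_BASE variable are hypothetical, and the check that an ontology version id is purely numeric is an assumption based on the example ids shown on this page.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build a loader Restlet URL from the pieces on this page.
OBS_BASE="${OBS_BASE:-http://ncbodev-obs:8080/obs_hibernate}"

# obs_loader_url LOADER TARGET
#   LOADER: loaderBigConcepts or loaderBigPaths
#   TARGET: "all" or an ontology version id (assumed numeric, e.g. 40671)
obs_loader_url() {
  case "$1" in
    loaderBigConcepts|loaderBigPaths) ;;
    *) echo "unknown loader: $1" >&2; return 1 ;;
  esac
  case "$2" in
    all) ;;
    ''|*[!0-9]*) echo "invalid ontology version id: $2" >&2; return 1 ;;
  esac
  printf '%s/%s/%s\n' "$OBS_BASE" "$1" "$2"
}

# Prints the same URL as the loaderBigPaths example above.
obs_loader_url loaderBigPaths 40671
```

Rejecting anything other than "all" or digits keeps a mistyped id from reaching the server, where the only feedback would be a message in the tomcat log.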

Monitoring the progress

To monitor the progress, please see "status" field in obs_ontology table.

Restlet             Status required to begin   Status: start -> finish
loaderBigConcepts   3                          11 -> 14
loaderBigPaths      14                         21 -> 28
(The status value in the obs_ontology table changes as the process progresses.)
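
The status values above can be turned into a small lookup for use in monitoring scripts. This is a sketch under assumptions: the function names are hypothetical, only the values named on this page (3, 11, 14, 21, 28 and the error value 99) are interpreted, and any other value is reported as unknown because this page does not document the intermediate steps.

```shell
#!/usr/bin/env bash
# Hypothetical helpers that interpret the "status" field of obs_ontology.

# Describe a status value. Only values documented on this page are known.
obs_status_phase() {
  case "$1" in
    3)  echo "ready for loaderBigConcepts" ;;
    11) echo "loaderBigConcepts started" ;;
    14) echo "loaderBigConcepts finished; ready for loaderBigPaths" ;;
    21) echo "loaderBigPaths started" ;;
    28) echo "loaderBigPaths finished" ;;
    99) echo "error" ;;
    *)  echo "unknown or intermediate status" ;;
  esac
}

# Succeed only if LOADER may be started for an ontology with STATUS,
# mirroring the "status required to begin" column.
obs_can_start() {
  case "$1:$2" in
    loaderBigConcepts:3|loaderBigPaths:14) return 0 ;;
    *) return 1 ;;
  esac
}

for s in 3 11 14 21 28 99; do
  printf '%s: %s\n' "$s" "$(obs_status_phase "$s")"
done
```

For example, `obs_can_start loaderBigPaths 14 && echo ok` prints "ok", while the same call with status 3 prints nothing.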

Error Handling:

  • If there is any error, the status will be set to 99.
  • If a critical error occurs (e.g. BioPortal REST services down), the population process will stop and exit. If the error is NOT critical (e.g. the semantic type description is not found in the semantic lookup table), the process sets the status to 99 and logs the error (in both the Tomcat log and the DB table obs_error_queue), but continues with the population.
  • To restart the process, run the clean up script first. It resets the status to "3" (Concept clean up) or "14" (Hierarchy clean up). (See Chapter 3: Data Clean up)

Chapter 3: Data Clean up

The source is located in obs_hibernate/db/sql/obs_db_cleanup.sql. DO NOT RUN the entire script, since it is just a compiled list of commands.

/* 
	----------------------------------------------------------------------- 
 	to clean up or undo everything on one ontology, just run #1 and #2	
 	-----------------------------------------------------------------------
 	1. Rollback BPConceptManager. 											
 	Run this to rollback to initial state - to undo 'loaderConcepts' restlet	
*/

set @var_ontology_version_id := '39545'; 


delete a.* from obs_relation a, obs_concept b, obs_ontology c 
	where a.level = 1 and a.concept_id = b.id and b.ontology_id = c.id and c.local_ontology_id = @var_ontology_version_id;

delete a.* from obs_term a, obs_concept b, obs_ontology c 
	where a.concept_id = b.id and b.ontology_id = c.id and c.local_ontology_id = @var_ontology_version_id;
	
delete a.* from obs_semantic_type a, obs_concept b,  obs_ontology c   
	where a.concept_id = b.id and b.ontology_id = c.id and c.local_ontology_id = @var_ontology_version_id;

delete a.* from obs_concept a, obs_ontology b
	where a.ontology_id = b.id and b.local_ontology_id = @var_ontology_version_id;
	
UPDATE obs_ontology set status = 3 where local_ontology_id = @var_ontology_version_id;

/* 
	----------------------------------------------------------------------- 	
 	2. Rollback BPPathManager. 											
 	Run this to rollback - to undo 'loaderPaths' restlet  		 			
*/

set @var_ontology_version_id := '39545'; 

delete a.* from obs_relation a, obs_concept b, obs_ontology c 
	where a.concept_id = b.id and b.ontology_id = c.id and c.local_ontology_id = @var_ontology_version_id and a.level > 1;
	
delete a.* from obs_path_to_root a, obs_concept b, obs_ontology c 
	where a.concept_id = b.id and b.ontology_id = c.id and c.local_ontology_id = @var_ontology_version_id;
	
delete a.* from obs_path_to_leaf a, obs_concept b, obs_ontology c 
	where a.concept_id = b.id and b.ontology_id = c.id and c.local_ontology_id = @var_ontology_version_id;
	
UPDATE obs_ontology set status = 14 where local_ontology_id = @var_ontology_version_id;		
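
Because the script must not be run as a whole, one option is to print only the statements for one rollback and one ontology, then review them before running anything. The sketch below is an illustration only: obs_cleanup_sql is a hypothetical helper, the SQL is copied from the two sections above, and the numeric check on the id is an assumption based on the example ids on this page. Nothing is executed against the database.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the rollback SQL for ONE ontology version.
# obs_cleanup_sql PHASE ONTOLOGY_VERSION_ID
#   PHASE: "concepts" (section 1 above) or "paths" (section 2 above)
obs_cleanup_sql() {
  case "$1" in
    concepts|paths) ;;
    *) echo "unknown phase: $1" >&2; return 1 ;;
  esac
  case "$2" in
    ''|*[!0-9]*) echo "invalid ontology version id: $2" >&2; return 1 ;;
  esac
  echo "set @var_ontology_version_id := '$2';"
  if [ "$1" = "paths" ]; then
    cat <<'SQL'
delete a.* from obs_relation a, obs_concept b, obs_ontology c
  where a.concept_id = b.id and b.ontology_id = c.id and c.local_ontology_id = @var_ontology_version_id and a.level > 1;
delete a.* from obs_path_to_root a, obs_concept b, obs_ontology c
  where a.concept_id = b.id and b.ontology_id = c.id and c.local_ontology_id = @var_ontology_version_id;
delete a.* from obs_path_to_leaf a, obs_concept b, obs_ontology c
  where a.concept_id = b.id and b.ontology_id = c.id and c.local_ontology_id = @var_ontology_version_id;
UPDATE obs_ontology set status = 14 where local_ontology_id = @var_ontology_version_id;
SQL
  else
    cat <<'SQL'
delete a.* from obs_relation a, obs_concept b, obs_ontology c
  where a.level = 1 and a.concept_id = b.id and b.ontology_id = c.id and c.local_ontology_id = @var_ontology_version_id;
delete a.* from obs_term a, obs_concept b, obs_ontology c
  where a.concept_id = b.id and b.ontology_id = c.id and c.local_ontology_id = @var_ontology_version_id;
delete a.* from obs_semantic_type a, obs_concept b, obs_ontology c
  where a.concept_id = b.id and b.ontology_id = c.id and c.local_ontology_id = @var_ontology_version_id;
delete a.* from obs_concept a, obs_ontology b
  where a.ontology_id = b.id and b.local_ontology_id = @var_ontology_version_id;
UPDATE obs_ontology set status = 3 where local_ontology_id = @var_ontology_version_id;
SQL
  fi
}

# Print the hierarchy rollback for the example id used above.
obs_cleanup_sql paths 39545
```

The printed statements could then be piped into a MySQL client after review. One point to keep in mind when undoing both phases: the path deletes join through obs_concept, so they only match rows while the concepts still exist, which suggests running the "paths" rollback before the "concepts" rollback.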

Chapter 4: Create Dictionary File

All terms are written to a dictionary file (the data comes from the obs_term table).

  • Location: The directory location is specified in build.properties
    # Dictionary File path
    obs.dictionary.path=${deploy.path}/obs_hibernate/WEB-INF/resources/dictionary/

  • Default limit is specified in ApplicationConstants.java
    public static final String DEFAULT_OFFSET_LIMIT_FOR_DICTIONARY_FILE = "50000";

  • Default stop is currently set to 1000000
  • Restlet call:
    http://ncbodev-obs4:8080/obs_hibernate/createDictionary/0
    

Chapter 5: MGREP Server

Every time the dictionary files are regenerated, they should be redeployed to the MGREP server.

  • Location: The MGREP URL is specified in build.properties
    # Mgrep URL
    obs.mgrep.url=ncbo-mgrep2.sunet:55555
    

Chapter 6: Mapping Data Population

Mapping data (the source data of obs_map) comes from the BioPortal UI Database, not from the BioPortal REST Services. The source data (the "mappings" table in the BioPortal UI DB) was copied to the OBS DB as the table "mappings" and is extracted directly from there (this data might be available from the BioPortal REST services in the future).

Data Source: "mappings" table
Data Target: "obs_map" table
  1. Step 1: Remove all data from obs_map
     truncate table obs_map
  2. Step 2: Run the following
     http://ncbodev-obs:8080/obs_hibernate/loaderMapping
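
The two steps above can be chained so that the loader is only called if the truncate succeeded. This is a minimal sketch under assumptions: the function name and the OBS_* and DRY_RUN variables are hypothetical, the host, user and database name are taken from Appendix C, and the mysql and curl clients are assumed to be installed. By default it only prints the commands it would run.

```shell
#!/usr/bin/env bash
# Hypothetical helper for the mapping reload; defaults come from this page.
OBS_BASE="${OBS_BASE:-http://ncbodev-obs:8080/obs_hibernate}"
OBS_DB_HOST="${OBS_DB_HOST:-ncbodev-obsdb1.sunet}"
OBS_DB_NAME="${OBS_DB_NAME:-obs_hibernate}"
OBS_DB_USER="${OBS_DB_USER:-obs_hiber_api}"

obs_reload_mappings() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    # Dry run (default in this sketch): show the two commands, run nothing.
    echo "mysql -h $OBS_DB_HOST -u $OBS_DB_USER -p $OBS_DB_NAME -e 'truncate table obs_map'"
    echo "curl --fail --silent --show-error $OBS_BASE/loaderMapping"
    return 0
  fi
  # Step 1: empty obs_map (mysql prompts for the password because of -p).
  mysql -h "$OBS_DB_HOST" -u "$OBS_DB_USER" -p "$OBS_DB_NAME" \
    -e 'truncate table obs_map' || return 1
  # Step 2: call the loader only after the truncate succeeded.
  curl --fail --silent --show-error "$OBS_BASE/loaderMapping"
}

obs_reload_mappings
```

Setting DRY_RUN=0 in the environment before calling the function switches it from printing to executing.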


Appendix A: HTML TEST pages

http://ncbodev-obs:8080/test_obs.html (OBS)
http://ncbodev-obs:8080/test_oba.html (Annotator)

For Annotator User Guide, please refer to:
http://www.bioontology.org/wiki/index.php/Annotator_User_Guide

Appendix B: Useful SQL Queries for Monitoring and Validation

The source is located in obs_hibernate/db/sql/obs_db_cleanup.sql (the same script used for the DB clean up).

Number of concepts for a specific Ontology

select count(*) from obs_concept a, obs_ontology b
	where a.ontology_id = b.id and b.local_ontology_id = '40261';

Total number of relations or path_to_root/leaf rows for a specific ontology (run repeatedly to see the progress by how fast the number grows)

	
select count(*) from obs_relation a, obs_concept b, obs_ontology c
  where a.concept_id = b.id and b.ontology_id = c.id and c.local_ontology_id = '40133' and a.level > 1;

select a.*, b.local_concept_id from obs_path_to_root a, obs_concept b, obs_ontology c 
	where a.concept_id = b.id and b.ontology_id = c.id and c.local_ontology_id = '40483';

Ontologies that have the status "Ready" (28) but have no concepts in obs_concept table

SELECT o.*, c.ontology_id
FROM obs_ontology o
LEFT OUTER JOIN (
    SELECT DISTINCT ontology_id FROM obs_concept
) c ON o.id = c.ontology_id
WHERE o.status = 28 AND c.ontology_id IS NULL;

Appendix C: OBS Server/DB Information

DB

  • Host: ncbodev-obsdb1
  • Connection String: username = obs_hiber_api
    • please see the build.properties in tomcat section

Tomcat

  • Host: ncbodev3 or ncbodev4
  • Pre-requisite for populating hierarchy ("loaderBigPaths"): this process uses stored procedures, so the permission must be set on the JDBC connection. The JDBC connection is specified in build.properties as below:
    obs.jdbc.url=jdbc:mysql://ncbodev-obsdb1.sunet:3306/obs_hibernate?noAccessToProcedureBodies=true
    
    build.properties (/apps/bmir.apps/obs/svn/trunk): this file holds the application properties, including the DB connection strings and other configuration. To be able to populate hierarchy data (which involves executing stored procedures), the permission must be set as a parameter in the JDBC URL, as shown above.
  • Build: Run ./build_obs.sh
    (get the latest code and restart server)
    cd /apps/bmir.apps/obs/svn/trunk
    sudo svn update
    sudo /sbin/service tomcat6 stop
    sudo /sbin/service tomcat6 status
    echo "====================== ant build ======================"
    sudo ant clean
    sudo ant deploywar
    sudo /sbin/service tomcat6 start