Difference between revisions of "Annotator User Guide"

From NCBO Wiki

Revision as of 15:50, 5 May 2009

Sample HTTP Client for the Annotator Restlet

HTML http://ncbo-obs-prod1:8080/test_oba.html


Service endpoint

POST your requests to http://ncbo-obs-prod1.stanford.edu:8080/obs_hibernate/annotator

Restlet Service Parameters

The Annotator Restlet service offers a set of parameters that allow a user to customize the annotator workflow and filter the result. To do so, the user can specify a set of ontologies and a specific set of semantic types. In addition, the two steps of the annotation workflow can be parameterized.

The Annotator Restlet service level agreement (e.g., response time) depends on the selected components, as each consumes resources at a different level. For example, the is_a transitive closure takes a long time to process, even when using a pre-computed hierarchy table. As another example, an annotation with wholeWordOnly=false will take significantly longer than one with wholeWordOnly=true.

Please see below for the list of parameters and the possible values.


longestOnly {true, false} default: false Specifies whether or not the concept recognition step (done with the University of Michigan Mgrep tool) must match only the longest words when several concepts match an expression.
  • For example: If longestOnly=true, the phrase 'skin neoplasms' will match the concepts NCI/C0037286 (Skin Neoplasms) and NCI/C0027651 (Neoplasms) only. If longestOnly=false, the concept NCI/C1123023 (Skin) will also match.
wholeWordOnly {true, false} default: true Specifies whether or not the concept recognition step must match whole words only when several concepts match a given word.
  • For example: If wholeWordOnly=true, the phrase 'neoplasms' will match the concept NCI/C0027651 (Neoplasms) only. If wholeWordOnly=false, the concept NCI/C1551054 (S) or the concept NCI/C0242536 (ASM) will also match (~80 concepts in NCI).
  • Note that the concept recognition step does not consider text case.
stopWords {stopWord1,...,stopWordN} default: empty (i.e. none) Specifies the list of stop words to use.
withDefaultStopWords {true, false} default: false Specifies whether or not to use the default stop words. The default stop word list is available from the sample HTML page. If set to true, this overrides the value of stopWords given by the user.
scored {true, false} default: true Specifies whether or not the annotations are scored. A score is a number assigned to an annotation that reflects the accuracy of the annotation. The higher the score, the better the annotation. The scoring algorithm gives a specific weight to an annotation according to the context of this annotation. For instance, an annotation done by matching a concept preferred name will be given a higher weight than an annotation done by matching a concept synonym or an annotation done with a level 3 parent in the is_a hierarchy. Details on the scoring algorithm are given in section Scoring algorithm.
  • For example, the phrase 'melanoma' is annotated both with the concept NCI/C0025202 (melanoma) and the concept NCI/C1522102 (Mouse Melanoma). The former annotation is scored 10 whereas the latter is scored 8.
ontologiesToExpand {localOntology1,...,localOntologyN} default: empty (i.e. all ontologies) Specifies the list of ontologies to use for expansion in the annotation process. The list of ontologies that can be used is available in the sample HTML page. The values are separated by commas (without spaces).
  • For example, SNOMEDCT,NCI,13578,36625,MSH.
ontologiesToKeepInResult {localOntology1,...,localOntologyN} default: empty (i.e. all ontologies) Specifies the list of ontologies to keep in the result of the annotation process. The list of ontologies that can be used is available in the sample HTML page. The values are separated by commas (without spaces).
  • For example, SNOMEDCT,NCI,MSH.
semanticTypes {semanticType1,...,semanticTypeN} default: empty (i.e. all semanticTypes) Specifies the list of semantic types to use in the annotation process. The list of semantic types that can be used is available at the /obs/semanticTypes URL. Note that the restriction to semantic types is also applied during the semantic expansion steps.
  • For example, T047,T048,T191.
levelMax {integer} default: 0 Specifies the minimum (resp. maximum) level a parent concept must have to be considered for the is_a semantic closure expansion step.
  • For example, an annotation done with levelMin=1 & levelMax=3 will expand a direct annotation done with a concept up to the 3rd level parent in the is_a hierarchy for this concept. An annotation done with levelMin=0 & levelMax=0 is equivalent to disabling the is_a transitive closure expansion step.
mappingTypes {null,mappingType1,...,mappingTypeN} default: empty (i.e. all mappingTypes) Specifies the list of mapping types to use during the mapping expansion step. The list of mapping types that can be used is available at the /obs/mappingTypes URL. The current list is described in section Mapping types.
  • For example, from-mrrel,Human.
  • Note that the use of the keyword "null" in the mappingTypes list disables the mapping expansion component. Note also that the mapping expansion is limited by other parameters such as ontologiesToExpand and ontologiesToKeepInResult.
textToAnnotate Specifies the text to be annotated.
format {xml,text,tabDelimited} default: xml Specifies the desired format of the response from Annotator Restlet.
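
The parameters above can be combined into a single POST request. The following is a minimal Python sketch, not an official client: it assumes the service accepts the parameters as a form-encoded body, which this guide does not state explicitly. The helper name build_annotator_request is hypothetical. The request is only built, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Endpoint quoted in the "Service endpoint" section of this guide.
ANNOTATOR_URL = "http://ncbo-obs-prod1.stanford.edu:8080/obs_hibernate/annotator"


def build_annotator_request(text_to_annotate, **params):
    """Build (but do not send) a POST request for the Annotator service.

    Assumption: parameters travel as a form-encoded request body.
    """
    fields = {"textToAnnotate": text_to_annotate}
    for name, value in params.items():
        if isinstance(value, bool):
            # The guide writes booleans as lowercase true/false.
            value = "true" if value else "false"
        elif isinstance(value, (list, tuple)):
            # List values are comma-separated without spaces.
            value = ",".join(str(item) for item in value)
        fields[name] = str(value)
    body = urlencode(fields).encode("utf-8")
    return Request(ANNOTATOR_URL, data=body, method="POST")


request = build_annotator_request(
    "Melanoma is a malignant tumor of melanocytes.",
    longestOnly=False,
    wholeWordOnly=True,
    scored=True,
    ontologiesToKeepInResult=["SNOMEDCT", "NCI", "MSH"],
    levelMax=3,
    format="tabDelimited",
)
# To send it: urllib.request.urlopen(request).read()
```

Leaving a parameter out of the call leaves it out of the body, so the service-side defaults listed above apply.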


Restlet Service Response

  • ObaResult is the main object returned by the Annotator Restlet service. text refers to the piece of text originally sent to the service, while resultID identifies the result. The properties annotations, dictionary, ontologies, statistics and parameters are defined below. An ObaResult provides functions to export its content in different forms. There are three formats for the response from the Annotator Restlet service:
  • XML: returns an XML representation of the result bean.
  • Text: returns the result content as plain text.
  • TabDelimited: a shorter version of the "Text" format. It returns only the annotations, not the full result content (no statistics, etc.). The format of the tab-delimited file is: score \t conceptId \t preferredName \t synonyms (separated by ' /// ') \t semanticType (separated by ' /// ') \t contextName \t isDirect \t other context information (e.g., childConceptID, mappedConceptID, level, mappingType) (separated by ' /// ').
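
As an illustration of the TabDelimited layout described above, here is a minimal Python sketch of a line parser. The class and function names are hypothetical, the sample line is made up for the example (only the concept identifiers come from this guide), and the assumption that isDirect is serialized as true/false is not confirmed by the guide.

```python
from dataclasses import dataclass
from typing import List

MULTI_VALUE_SEPARATOR = "///"  # multi-valued fields are separated by ' /// '


@dataclass
class TabAnnotation:
    score: float
    concept_id: str
    preferred_name: str
    synonyms: List[str]
    semantic_types: List[str]
    context_name: str
    is_direct: bool
    context_info: List[str]


def _split_multi(field):
    """Split a ' /// '-separated field, dropping empty parts."""
    return [part.strip() for part in field.split(MULTI_VALUE_SEPARATOR) if part.strip()]


def parse_tab_delimited_line(line):
    """Parse one annotation line of the tabDelimited response format."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 7:
        raise ValueError("expected at least 7 tab-separated fields, got %d" % len(fields))
    fields += [""] * (8 - len(fields))  # trailing context information may be absent
    return TabAnnotation(
        score=float(fields[0]),
        concept_id=fields[1],
        preferred_name=fields[2],
        synonyms=_split_multi(fields[3]),
        semantic_types=_split_multi(fields[4]),
        context_name=fields[5],
        # Assumption: isDirect is written as "true" or "false".
        is_direct=fields[6].strip().lower() == "true",
        context_info=_split_multi(fields[7]),
    )


# Made-up sample line; the score and semantic type are illustrative only.
sample = "8\tNCI/C1302746\tMelanocytic Neoplasm\t\tT191\tISA_CLOSURE\tfalse\tNCI/C0025202 /// 1"
annotation = parse_tab_delimited_line(sample)
```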

annotatorResultBean

resultID
dictionary Dictionary contains the metadata (not the content) of the dictionary used for a result. dictionaryID, dictionaryName, and dictionaryDate identify the dictionary on the server side and give information about its content. Dictionary versioning is strongly linked to the evolution of the ontologies used. Each time ontologies change, the dictionary is updated. All the dictionary information may be useful for comparing results of the Annotator Restlet service over time.
statistics Statistics contains information on the number of annotations done for a given context. The contextName keyword identifies the type of context and nbAnnotation is the number of annotations of this type.
parameters Parameters summarizes all the parameters specified by the user when requesting the Annotator Restlet service. Those parameters are described in section Service parameters.
ontologies Ontology is a representation of an ontology in the Annotator Restlet service ontology model represented in Figure 1. To keep the model simple, we provide only the global ontology identifier (localOntologyID), the name (ontologyName) and the version (ontologyVersion). This information comes from the original repositories (UMLS/BioPortal) and might help the user to select the right ontology to use. When an ontology is used in the annotation, a result has a set of OntologyUsed, which specify 2 other properties: nbAnnotation, the number of annotations that have been made with concepts from this ontology, and score, the sum of all the scores of the annotations done with concepts from this ontology (if parameter scored=true). Therefore, the ontology with the highest score is the most accurate ontology to annotate the given text.
annotations Annotation is a representation of one annotation. An annotation has a score which represents the accuracy of the annotation computed by the scoring algorithm (if the scored=true parameter was chosen, otherwise score=-1). An annotation is done with a concept in a context.
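
The two OntologyUsed properties described above (nbAnnotation and score) can be sketched as a simple aggregation over annotations. This is an illustrative Python sketch under the assumption that scored=true; the function name and the (ontology, score) input shape are hypothetical and not part of the service.

```python
from collections import defaultdict


def summarize_ontologies(annotations):
    """Aggregate (localOntologyID, score) pairs per ontology.

    Returns {ontology: {"nbAnnotation": count, "score": sum of scores}}.
    Assumes scored=true, i.e. every score is a real score and not -1.
    """
    summary = defaultdict(lambda: {"nbAnnotation": 0, "score": 0})
    for ontology_id, score in annotations:
        entry = summary[ontology_id]
        entry["nbAnnotation"] += 1
        entry["score"] += score
    return dict(summary)


# Scores 10 and 8 follow the 'melanoma' example in the parameter list;
# the MSH annotation is invented for illustration.
ontologies_used = summarize_ontologies([("NCI", 10), ("NCI", 8), ("MSH", 10)])

# The ontology with the highest summed score.
best_ontology = max(ontologies_used, key=lambda name: ontologies_used[name]["score"])
```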


  • Concept is a representation of an ontology concept in the Annotator Restlet service ontology model represented in Figure 1. localConceptID globally identifies the concept in its original repository. preferredName is the label or preferred term for this concept (as assigned by the original repository). synonyms is a set of possible terms that represent the concept but are not preferred. isTopLevel specifies if the concept is a root concept in its ontology. localOntologyID identifies the ontology in which the concept is defined. localSemanticTypesIDs is the set of the semantic types of the concept (assigned by UMLS + T000 and T999).


  • Context is a representation of the context in which an annotation has been done. It specifies whether it is a direct or expanded annotation and gives details about the origin of the annotation. contextName identifies the type of context. The context properties vary with the type of context. There are 4 possible contexts identified by their contextName:
  • MGREP, used to identify direct annotations done with the Mgrep concept recognizer. An MGREP context has 3 properties: termName identifies the expression (preferred name or synonyms) that was matched by Mgrep. from and to specify the character index in the given text for the matched expression.
  • ISA_CLOSURE, used to identify expanded annotations done with the is_a transitive closure expansion component. An ISA_CLOSURE context has 2 properties: childConceptID identifies the concept from which the annotation was derived. level specifies the distance in the is_a hierarchy between the annotating concept and the concept from which the annotation was derived. For example, if a direct annotation with NCI/C0025202 (melanoma) was done, the is_a transitive closure component may expand it to another annotation with NCI/C1302746 (Melanocytic Neoplasm) because the latter is a direct parent (i.e., level 1) concept of the former. The ISA_CLOSURE annotation generated will have the following properties {NCI/C0025202, 1}.
  • MAPPING, used to identify expanded annotations done with the mapping expansion component. A MAPPING context has 2 properties: mappedConceptID identifies the concept from which the annotation was derived. mappingType specifies the type of mapping.
  • DISTANCE, used to identify expanded annotations done with the semantic distance expansion component. A DISTANCE context has 2 properties: relatedConceptID identifies the concept from which the annotation was derived. distance specifies the distance (as an integer) between the 2 concepts.
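
To make the ISA_CLOSURE context and the levelMin/levelMax parameters concrete, here is a small Python sketch of an is_a expansion over a toy parent map. It is not the service's implementation: the function name, the dictionary-based hierarchy, the breadth-first traversal and the placeholder grandparent identifier are all assumptions made for illustration.

```python
def expand_isa_closure(concept_id, parents, level_min=1, level_max=3):
    """Return ISA_CLOSURE contexts for the ancestors of a directly annotated concept.

    parents maps a concept ID to the list of its direct is_a parents.
    Only ancestors whose level lies in [level_min, level_max] are kept.
    """
    expanded = []
    frontier = [concept_id]
    seen = {concept_id}
    level = 0
    while frontier and level < level_max:
        level += 1
        next_frontier = []
        for current in frontier:
            for parent in parents.get(current, []):
                if parent in seen:
                    continue  # keep the shortest level for each ancestor
                seen.add(parent)
                next_frontier.append(parent)
                if level >= level_min:
                    expanded.append({
                        "conceptID": parent,
                        "contextName": "ISA_CLOSURE",
                        "childConceptID": concept_id,
                        "level": level,
                    })
        frontier = next_frontier
    return expanded


# Toy hierarchy: the first edge is the example from this guide,
# the second parent is a placeholder and not a real concept ID.
toy_parents = {
    "NCI/C0025202": ["NCI/C1302746"],          # melanoma is_a Melanocytic Neoplasm
    "NCI/C1302746": ["NCI/PLACEHOLDER_PARENT"],
}
contexts = expand_isa_closure("NCI/C0025202", toy_parents, level_min=1, level_max=1)
```

With level_max=0 the loop never runs and no expanded annotation is produced, which mirrors the statement that levelMin=0 & levelMax=0 disables the expansion step.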