Annotator Optimizing and Troubleshooting

From NCBO Wiki

If you have a lot of material to annotate, or are having trouble getting your annotation jobs to run, this page provides some tips on making things better.

Caveat: The information in this page was produced by analysis, and may not reflect real-world experience. Your experience and test results are welcome to improve this content.

Optimizing the Annotator

Let's assume you have thousands or tens of thousands (or more!) of full-text articles for which you wish to retrieve and parse the annotations. Assuming it takes about 20 seconds to parse and organize the response for each article, what's the right way to organize and optimize the job for the Annotator?

Annotation API Optimizations and Recommendations

BioPortal has a fair amount of compute power, but it can definitely be overwhelmed if it receives enough requests. (This is especially true for the Recommender, which has maybe 100 times the compute requirements of the Annotator.) Organizing your API requests appropriately is therefore important for you and for other BioPortal users. If the constraints suggested by this article mean your annotations will take too long, you will need to set up a BioPortal Virtual Appliance in your own organization to accomplish your goals.

Optimizing Your Query

We'll assume that you want to make your query execute as quickly as possible. How can you set it up to make that happen?

The key adjustments to improve efficiency are (a) select the ontologies to annotate with; (b) set options to minimize the number of annotations; and (c) as a special case, select specific UMLS semantic types to annotate with.

By selecting the specific ontologies that you want to use in the annotation, you will reduce the processing time considerably compared with the very large set of ontologies the Annotator checks otherwise.

In the API, the only options that need to be changed for optimal results are exclude_synonyms and longest_only; the other settings are already appropriate. (In the UI, the options to check to minimize the number of annotations are Match Longest Only (excluding shorter phrases within long ones), Exclude Synonyms, and Exclude Numbers. Other options should remain unset, and the Match Ancestors option should be none.) Of course, these changes limit the annotations you will get, so they must be configured to meet your needs.

Setting the UMLS semantic type further constrains the items that are mapped, but this produces a very specific result.
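As an illustration of the three adjustments above, here is a minimal sketch that builds a request URL without sending it. The option names exclude_synonyms and longest_only come from this page; the endpoint path and the text, apikey, ontologies, and semantic_types parameter names are assumptions that should be checked against the API documentation, and the ontology acronyms and semantic type in the usage line are placeholders only.

```python
from urllib.parse import urlencode

# Assumed endpoint; confirm against the API documentation before use.
ANNOTATOR_URL = "http://data.bioontology.org/annotator"


def build_annotator_url(text, api_key, ontologies=(), semantic_types=()):
    """Return a GET URL for one annotation request with result-limiting options set."""
    params = {
        "text": text,                # assumed parameter name
        "apikey": api_key,           # assumed parameter name
        "longest_only": "true",      # option named on this page
        "exclude_synonyms": "true",  # option named on this page
    }
    if ontologies:
        # Restrict matching to the listed ontologies (assumed parameter name).
        params["ontologies"] = ",".join(ontologies)
    if semantic_types:
        # Optional further restriction by UMLS semantic type (assumed parameter name).
        params["semantic_types"] = ",".join(semantic_types)
    return ANNOTATOR_URL + "?" + urlencode(params)


# Placeholder values, for illustration only.
print(build_annotator_url("Melanoma is a malignant tumor", "YOUR_API_KEY",
                          ontologies=["NCIT", "MESH"], semantic_types=["T191"]))
```

Building the URL separately from sending it makes it easy to paste the same string into curl or a browser when troubleshooting.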

Identifying Your True Rate Limit for Your Query

BioPortal has a rate limit on API requests for a particular key; this is currently set to 15 requests per second. However, this limit does not reflect BioPortal's ability to process complex requests. Submitting 15 annotation requests per second of one page each might be enough to slow down or stop BioPortal over time, not to mention inconveniencing many other users.

To stay within BioPortal's capacity, we recommend that you monitor the response time for your requests, and adjust the flow of requests to keep that response time close to normal. To measure the normal response time, submit ten different annotation requests, each one after the previous one's results are received. Measure the average time for a single annotation request.
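The baseline measurement described above could be sketched as follows. The send argument is a hypothetical callable, not part of any BioPortal client: it is assumed to submit one annotation request and block until the response has been received.

```python
import time


def measure_baseline(send, texts):
    """Send requests one at a time and return the mean response time in seconds.

    `send` is a hypothetical callable that submits one annotation request
    and returns only after the full response has arrived.
    """
    durations = []
    for text in texts:
        start = time.perf_counter()
        send(text)  # the next request starts only after this one returns
        durations.append(time.perf_counter() - start)
    return sum(durations) / len(durations)
```

Calling measure_baseline(send, ten_texts) with ten different texts gives the "normal" response time used in the steps that follow.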

Now, to arrive at a 'typical' throughput that BioPortal can sustain, do the following. Submit several annotation requests as a batch (starting with a small number), without waiting for each response, or submit them from different clients. Submit 10 such batches, each only after the previous batch is fully processed. Check whether the average response time for the annotations in these 10 batches is significantly slower than the normal response time. If it isn't, you can double the batch size (up to 15, to avoid the rate limit) and try the experiment again.

Once the average response time goes up by, say, more than 50% (an estimate), you are likely to be loading BioPortal faster than it can process your requests. If you keep placing your requests at that rate, BioPortal will eventually run out of resources, and until then all other users will be heavily impacted. So scale back the number of requests you submit per batch, and wait for each batch to complete before submitting another.
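The batch-doubling experiment can be sketched with a thread pool. This is one possible arrangement under stated assumptions, not a tested BioPortal client: send is a hypothetical blocking callable for one request, baseline is the normal response time you measured, and the 1.5 slowdown factor mirrors the rough 50% estimate given here.

```python
import time
from concurrent.futures import ThreadPoolExecutor

MAX_BATCH = 15  # do not exceed the per-key rate limit


def timed(send, text):
    """Return the response time in seconds of one blocking request."""
    start = time.perf_counter()
    send(text)
    return time.perf_counter() - start


def mean_batch_time(send, texts, batch_size, rounds=10):
    """Mean per-request response time over `rounds` batches sent back to back."""
    durations = []
    with ThreadPoolExecutor(max_workers=batch_size) as pool:
        for r in range(rounds):
            batch = [texts[(r * batch_size + i) % len(texts)]
                     for i in range(batch_size)]
            # extend() consumes the map iterator, so the next batch starts
            # only after every request in this batch has finished.
            durations.extend(pool.map(lambda t: timed(send, t), batch))
    return sum(durations) / len(durations)


def find_batch_size(send, texts, baseline, slowdown=1.5, rounds=10):
    """Double the batch size until responses slow past slowdown * baseline.

    Returns the largest batch size tried that stayed under the threshold
    (1 if even a batch of 2 was too slow).
    """
    size, best = 2, 1
    while True:
        if mean_batch_time(send, texts, size, rounds) > slowdown * baseline:
            return best
        best = size
        if size >= MAX_BATCH:
            return best
        size = min(size * 2, MAX_BATCH)
```

The value returned is a starting point for the batch size; it reflects load at the time of the experiment and may need to be rechecked later.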

To optimize further, dynamically monitor the response time for every request and the total number of requests outstanding. If the response time increases significantly above the normal value, reduce the number of requests you have outstanding; if the response time stays normal, you can increase the number of requests outstanding (up to 15/second), sending a request whenever the number outstanding is less than the limit.
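A minimal sketch of such a feedback loop is shown below. The class name and the specific policy (halve the allowance after a slow response, raise it by one after a normal response) are assumptions chosen for illustration; this page only says to reduce or increase the number outstanding. As a simplification, the cap of 15 is applied to outstanding requests rather than to requests per second, and the class is not thread-safe, so it should be driven from a single thread or event loop.

```python
class OutstandingLimiter:
    """Adapts the allowed number of outstanding requests to observed response times."""

    def __init__(self, baseline, slowdown=1.5, max_limit=15):
        self.baseline = baseline    # normal single-request response time, in seconds
        self.slowdown = slowdown    # assumed tolerance before backing off
        self.max_limit = max_limit  # simplification: cap on outstanding requests
        self.limit = 1              # current allowance
        self.outstanding = 0        # requests sent but not yet answered

    def can_send(self):
        """True when another request may be sent now."""
        return self.outstanding < self.limit

    def on_send(self):
        """Record that a request has been sent."""
        self.outstanding += 1

    def on_response(self, elapsed):
        """Record a response and adjust the allowance from its response time."""
        self.outstanding -= 1
        if elapsed > self.slowdown * self.baseline:
            self.limit = max(1, self.limit // 2)              # back off quickly
        else:
            self.limit = min(self.max_limit, self.limit + 1)  # probe upward slowly
```

In use, the sending loop checks can_send() before each request, calls on_send() when it sends, and calls on_response(elapsed) with the measured time when each response arrives.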

Note that for particularly big requests, like a 20-page document in the Recommender, BioPortal might not even be able to process one request per second without using up its resources, and you will have to spread out the requests further.

On the other hand, if you have trivial requests, it might be that BioPortal can keep up with no problem. In this case, using 2 API keys to submit requests faster than 15/second might be acceptable, but please consult with the BioPortal staff before doing this so that we can monitor the system.

Note that initially (in 2009, see [1]), the service responded on average in 1.8 seconds for a mean input word count of 180 words, and in 2.3 seconds for a mean input word count of 280 words. When simulating 10 simultaneous users, the response time was between 4.5 and 5.0 seconds for 280 words. Response times are likely to be significantly shorter today.

Other factors to consider:

  • Sending your requests late at night in the US is likely to minimize impact to other UI users, and may provide faster responses.
  • You will need to take into account your own processing of the BioPortal response when calculating the overall throughput you can sustain.
  • A slow transmission line between you and BioPortal may cause requests or responses to back up, but this is unlikely to be a real scenario for the typical user.
  • Any latency in the network should not affect the overall throughput.

Annotation API Troubleshooting

If you are getting errors when using the API, a simple first test is to try the UI version of the Annotator and see if it works. If it works with the sample text, try it with your text, to make sure the specific text does not trigger an issue involving a particular ontology.

If the UI version does not work with the sample text, or the Annotator page or whole site is down, send a report to support@bioontology.org. If it's the weekend, or you are particularly eager, try the UI version after some time (15 minutes, or an hour) has passed; sometimes the system will recover itself, or we will go on-line and see the issue.

If the UI version works with the sample text, but not your text, there may be an ontology that is failing. Try performing the annotations with a single ontology; if that works, it indicates an ontology issue, which will probably require expert support to fix. Again, send a report to support@bioontology.org, and consider whether you want to troubleshoot by looking for the failing ontology or ontologies with a divide-by-2 strategy of ontology selection.
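The divide-by-2 strategy can be sketched as a simple bisection. The annotate_ok argument is a hypothetical callable that you would write yourself: it runs the annotation of your text restricted to the given list of ontologies and returns True on success. The sketch assumes the failure is reproducible and caused by individual ontologies; if several are failing, it finds one of them.

```python
def find_failing_ontology(ontologies, annotate_ok):
    """Return one ontology that makes the annotation fail, or None if all work.

    `annotate_ok(subset)` is a hypothetical callable that annotates your text
    using only `subset` and returns True when the request succeeds.
    """
    candidates = list(ontologies)
    if not candidates or annotate_ok(candidates):
        return None
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        # Keep whichever half still reproduces the failure.
        candidates = half if not annotate_ok(half) else candidates[len(half):]
    return candidates[0]
```

Each pass halves the candidate list, so even a few hundred ontologies need only around ten annotation requests to isolate one failure.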

If the UI version is fully working, but your API call is not, try the call using curl or via https in the browser (note the sample https string at the end of the returned content), verifying that your API key is the same as the one in your BioPortal profile. If this works, it suggests your code may have an issue in presenting the API call to BioPortal, or that BioPortal may be rate-limiting your requests. You can test the latter by spreading out your queries, perhaps echoing each one to the console to confirm the expected rate.
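One way to spread out queries and echo each one is a small pacing generator, sketched below. The function name and its clock and sleep parameters are illustrative choices (the parameters exist only so the timing can be replaced in tests); nothing here is specific to BioPortal.

```python
import time


def paced(items, per_second, clock=time.monotonic, sleep=time.sleep):
    """Yield items no faster than `per_second`, printing when each is released."""
    interval = 1.0 / per_second
    start = clock()
    next_time = start
    for n, item in enumerate(items, 1):
        delay = next_time - clock()
        if delay > 0:
            sleep(delay)
        # Echo the release time so the actual rate can be checked by eye.
        print(f"{clock() - start:7.3f}s  request {n}")
        yield item
        # Never schedule in the past, so slow processing does not cause a burst.
        next_time = max(next_time + interval, clock())
```

For example, `for url in paced(urls, per_second=2): ...` releases at most two requests per second and prints a timestamp for each, which makes it easy to see whether errors disappear at the lower rate.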

If a single curl or https call does not work, look at the returned error code for a clue. These error codes should accurately reflect the reason the system has rejected the request. If you do not understand the error code, please contact us via support@bioontology.org. Even if we do not see your post, someone else may see it and offer advice.

If you are not reaching the BioPortal system at all, and there is also a problem accessing related services that are not at bioportal.bioontology.org (for example, http://data.bioontology.org/documentation), the problem is likely in the network between the two systems. Send a report for this as well; if it is under our control, we will get it fixed.

References

[1] Shah et al. BMC Bioinformatics. 2009 Sep 17;10 Suppl 9:S14 (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2745685/)
