Scripts and hacks for curation

When you curate scientific literature, there are lots of little tasks and procedures and requirements that can all add up into a big inefficient mess if they’re not integrated very well into your workflow. Some of these things are in-house and so you have some level of control over how they are handled (depending, sometimes, on engineering or management), while for others you are at the mercy of the wilderness that is science. It would be great, for example, if all abstracts contained enough information for us to evaluate whether the rest of the article is worth reading – or, in many cases, worth buying in order to read it.

Photo by hbart on Flickr

Photo by hbart on Flickr

In the context of genetic association literature, it would also be great if all abstracts used standard database identifiers for SNPs (i.e. rs #s), as this unambiguously defines the variant; used standard means of reporting the association (e.g. odds ratios with 95% confidence intervals and p-values); and mentioned pertinent aspects of the study, such as the population, number of cases and controls, and adjustments for confounders and multiple comparisons. For me, this qualifies as “enough information to evaluate whether the rest of the article is relevant.” I hope that when they do not mention things like the corrected p-values, it is not because their p-values were not significant. When articles cost up to $90 a pop… well, let’s not get me started.

But I digress. The point is that there are lots of things that could make curation a challenge, and consequently there are lots of things that could make curation easier. Standardization of abstracts, while it doesn’t make for juicy reading, makes going through high volumes of abstracts easier (and machine-accessible). Linking articles from journal websites to PubMed would also be useful, as PubMed serves as a portal to many other resources. Currently, almost all journals use digital object identifiers (DOIs), which are unique pointers to objects on the web. But PubMed IDs (PMIDs), like digital object identifiers (DOIs), are a little simpler, and provide a lot of useful functionality through integration with NCBI’s many databases. You can imagine all the little scripts and hacks you could come up with to improve the curation process, using greasemonkey scripts on Firefox, bookmarklets on any browser, and even web apps.

One somewhat mundane task we often have to do is search for a paper on PubMed to get the PMID. This is straightforward given we already know the authors, title, journal, etc, but still kind of a pain. Fortunately, PubMed allows you to search by DOI, which almost all publishers provide. So a slight improvement is to use the DOI as the search term in PubMed, as this will return the exact result if the DOI exists in the database. But you still have to open up a new browser window or navigate to PubMed and copy and paste the DOI into the search bar. To reduce the number of steps even further, we can use a simple bookmarklet containing a bit of javascript (if it looks cut off, you can still double-click copy and paste it):

javascript:var%20t;%20try%20%7B%20t=%20((window.getSelection%20&&%20window.getSelection())%20%7C%7C%20(document.getSelection%20&&%20document.getSelection())%20%7C%7C%20(document.selection%20&&%20document.selection.createRange%20&&%20document.selection.createRange().text));%20%7D%20catch(e)%20%7B%20%20t%20=%20%22%22;%20%7D;%20location.href='http://www.ncbi.nlm.nih.gov/sites/entrez?db=pubmed&term='+t+'%5BAID%5D';

This script extracts whatever text you’ve highlighted on a page and attempts to search PubMed using it as the DOI. So obviously it will only work if you’ve highlighted something and that something is a DOI, and that DOI is in PubMed. But assuming you do and it is, it will send you directly to the PubMed entry for that paper. Save the script as a browser bookmark, put the bookmark in your bookmarks bar, and whenever you’re on an article webpage (or RSS feed) and want to see the PubMed entry for that article, just highlight the DOI and click the bookmark. (Cameron wrote up a pipe on Yahoo!Pipes a while ago that does something similar, which inspired this bookmarklet.)

Clearly even this simple hack can be improved – it would be nice, perhaps, to have it return the PMID in an alert box so you can make a note and then continue doing whatever you were doing, rather than being sent away to PubMed (this might make use of AJAX?). It would be nice if you didn’t have to highlight, but the script would look for and extract the DOI from the page automatically. And I’m sure you could add even more bells and whistles, within reason.

My latest hackneyed…. “hack-need”… is to be able to identify follow-up studies for a particular genetic association. If you read a paper with PMID X saying SNP A is significantly associated with a disease, it would be really useful to know when future studies look into that association and either replicate or contradict the finding. Hopefully when they do so, they cite PMID X and/or mention SNP A. Essentially, I’d like to query PubMed for papers that cite a given PMID or SNP (via rs #). Ideally, I could do this in batch for many PMIDs and many SNPs automatically, and have each query return only results that are newer than the previous query (or query date). Then I set the script running behind the scenes, process the results using another script, and maybe have it send me an email with a list of new PMIDs to look into every week. Can world domination be far behind?*

Seriously though, I am looking for tips on how to do this follow-up identification thing, so any help appreciated. Pierre has given me some useful hints for how to search PubMed for papers citing a given rs #, and it would be great if this could be modified with dates:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=snp&id=1802710&db=pubmed&reldate=7

(insert your favorite rs # as the id)
Update: reldate limits results to those within a number of days immediately preceding today’s date; could also use mindate and maxdate to specify a date range.

* Why stop there, you might ask? I could write a script that downloads the abstracts, “reads” them, filters out the irrelevant ones, summarizes the important information, and populates curation reports. But then I’d be out of a bloody job…

7 Responses to Scripts and hacks for curation

  1. cariaso says:

    http://pipes.yahoo.com/pipes/pipe.info?_id=ao63itmU3BGqh1xwJphxuA

    is the yahoo pipe I use to find articles for SNPedia. It looks for anything which mentions “snp” or “variation” or … and then runs it through a regex which looks for a valid rs#.

    you could also just watch the RecentChanges on SNPedia via
    http://snpedia.com/index.php?title=Special:RecentChanges&feed=rss

    which will also find SNPs from OMIM and genome.gov.

    If you do manage to put yourself out of a job, you’ll be a billionaire within the week. Text processing remains hard, despite the awesomeness of
    http://gnosis.cx/TPiP/

    just be glad that SNPs have such easy names (rs###) because looking for gene names is about 1000x harder.

    To the extent that you are willing, able, and allowed
    You are in a unique position to clarify some of these
    http://snpedia.com/index.php/Category:23andMe_SNP

    welcome to the party.

  2. cariaso says:

    no offense to the gnosis book, but
    http://nltk.googlecode.com/
    was actually the link I’d intended.

  3. Will Dampier says:

    I’ve had to create a similar set of scripts to hack everything together in trying to curate a list of articles linking a specific gene to a specific disease. Since I work with microarray data I need to do this in very large batches. I cobbled together a nice framework using python, the Natural Language ToolKit and a whole lot of chewing-gum. I used the pubmed API to retrieve abstracts that match any of the gene names and any of the disease names. Then I use Named Entity Recognition from the NLTK to weed out false positives. I still get plenty of false-positives, but its much less than I got with just a pubmed search.

  4. Pingback: Scripts and hacks for curation « I was lost but now I live here Scripts Rss

  5. Chris Lasher says:

    I’m fairly certain the LibX plugin http://libx.org/ hunts for DOIs on a page and provides annotations of them. (The VT LibX plugin, at least, usually puts a Virginia Tech logo by a DOI if it’s available from our library.) If you have the inclination to poke around their source code, it is open, and might provide you hints about how to make the improvements you imagined.

  6. Firas says:

    Biomedical Informatics grad student here. I am building hybrid NLP/machine learning classifiers for finding papers in PubMed in a domain that is very similar to yours. Once I’m done running the experiments, I want to release some of the code that I’ve written in open source. It’s not fancy, but I’m sure someone somewhere can find useful ways to add to it. It’s basically a web app written in CakePHP (like Ruby on Rails for PHP) that allows different kinds of queries to PubMed and then does bibliography management stuff, including ability to index based on NLP-extracted features.

    I’d be more motivated to do start an open source project if other people are interested.

  7. Hey Shirley. A similar question just came up on the BioNLP mailing list. PubMed Central might be able to get you part of the way there?

    You can get free, open, programmatic access to lists of all PubMed papers cited by any paper in PubMed Central. Admittedly this limits the set in two ways (the cited papers need to have a PubMed ID, and the citing papers need to be in PubMed Central), but it can nonetheless be sufficient for many projects. Especially true as the number of papers in PubMed Central continues to rise. It also sets you up to filter the lists in all sorts of powerful ways, using the other Entrez tools.

    eLink is the key to programmatic access for citations, just as with your SNP example.

    For example, this URL returns a list of all five PubMed Central IDs that cite PubMed article ID 17375194 :
    http://www.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=pubmed&id=17375194&db=pmc

    This is the same list you see on the lower right “Cited by” list on the main PubMed page for the article:
    http://www.ncbi.nlm.nih.gov/pubmed/17375194

    I’ve done quite a bit of this sort of PubMed Central citation extraction in Python. I’ve been using an updated version of the EUtils python library and have some home-spun code as well (to be open source, but I haven’t posted it yet). If you are interested in more info or the code I’ve been using, let me know and I’d be glad to help (though with a bit of lag time, since I’m in the midst of moving to Vancouver Canada this week and am surrounded by moving boxes! )

Leave a comment