Scripts and hacks for curation

When you curate scientific literature, there are lots of little tasks and procedures and requirements that can all add up into a big inefficient mess if they’re not integrated very well into your workflow. Some of these things are in-house and so you have some level of control over how they are handled (depending, sometimes, on engineering or management), while for others you are at the mercy of the wilderness that is science. It would be great, for example, if all abstracts contained enough information for us to evaluate whether the rest of the article is worth reading – or, in many cases, worth buying in order to read it.

Photo by hbart on Flickr

In the context of genetic association literature, it would also be great if all abstracts used standard database identifiers for SNPs (i.e. rs #s), as this unambiguously defines the variant; used standard means of reporting the association (e.g. odds ratios with 95% confidence intervals and p-values); and mentioned pertinent aspects of the study, such as the population, number of cases and controls, and adjustments for confounders and multiple comparisons. For me, this qualifies as “enough information to evaluate whether the rest of the article is relevant.” I hope that when they do not mention things like the corrected p-values, it is not because their p-values were not significant. When articles cost up to $90 a pop… well, let’s not get me started.

But I digress. The point is that there are lots of things that make curation a challenge, and consequently lots of things that could make it easier. Standardization of abstracts, while it doesn’t make for juicy reading, makes going through high volumes of abstracts easier (and makes them machine-accessible). Linking articles from journal websites to PubMed would also be useful, as PubMed serves as a portal to many other resources. Currently, almost all journals use digital object identifiers (DOIs), which are unique pointers to objects on the web. PubMed IDs (PMIDs) are unique identifiers too, but they are a little simpler and provide a lot of useful functionality through integration with NCBI’s many databases. You can imagine all the little scripts and hacks you could come up with to improve the curation process, using Greasemonkey scripts on Firefox, bookmarklets on any browser, and even web apps.

One somewhat mundane task we often have to do is search for a paper on PubMed to get the PMID. This is straightforward given we already know the authors, title, journal, etc., but still kind of a pain. Fortunately, PubMed allows you to search by DOI, which almost all publishers provide. So a slight improvement is to use the DOI as the search term in PubMed, as this will return the exact result if the DOI exists in the database. But you still have to open up a new browser window or navigate to PubMed, then copy and paste the DOI into the search bar. To reduce the number of steps even further, we can use a simple bookmarklet containing a bit of JavaScript (if it looks cut off, you can still double-click, copy, and paste it):

javascript:var%20t;%20try%20%7B%20t=%20((window.getSelection%20&&%20window.getSelection())%20%7C%7C%20(document.getSelection%20&&%20document.getSelection())%20%7C%7C%20(document.selection%20&&%20document.selection.createRange%20&&%20document.selection.createRange().text));%20%7D%20catch(e)%20%7B%20%20t%20=%20%22%22;%20%7D;%20location.href='http://www.ncbi.nlm.nih.gov/sites/entrez?db=pubmed&term='+t+'%5BAID%5D';

This script extracts whatever text you’ve highlighted on a page and attempts to search PubMed using it as the DOI. So obviously it will only work if you’ve highlighted something, that something is a DOI, and that DOI is in PubMed. But assuming all of that holds, it will send you directly to the PubMed entry for that paper. Save the script as a browser bookmark, put the bookmark in your bookmarks bar, and whenever you’re on an article webpage (or RSS feed) and want to see the PubMed entry for that article, just highlight the DOI and click the bookmark. (Cameron wrote up a pipe on Yahoo! Pipes a while ago that does something similar, which inspired this bookmarklet.)

Clearly even this simple hack can be improved – it would be nice, perhaps, to have it return the PMID in an alert box so you can make a note and then continue doing whatever you were doing, rather than being sent away to PubMed (this might make use of AJAX?). It would be nice if you didn’t have to highlight, but the script would look for and extract the DOI from the page automatically. And I’m sure you could add even more bells and whistles, within reason.
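As a rough sketch of those two improvements outside the browser, the Python below pulls a DOI-looking string out of page text and builds a query that returns the PMID as XML instead of redirecting you to PubMed. It assumes the E-utilities esearch.fcgi endpoint and reuses the [AID] tag from the bookmarklet. The DOI pattern is a loose guess rather than an official grammar, and the function names are made up for illustration.

```python
import re
import urllib.parse
import xml.etree.ElementTree as ET

# Assumed endpoint: the E-utilities search service (not used in the post itself).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Loose DOI pattern (an assumption): "10.", a 4-9 digit prefix, a slash,
# then anything up to whitespace, a quote, or an angle bracket.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+")


def find_doi(page_text):
    """Return the first DOI-looking string in page_text, or None."""
    match = DOI_PATTERN.search(page_text)
    if match is None:
        return None
    # Trailing punctuation usually belongs to the sentence, not the DOI.
    return match.group(0).rstrip(".,;)")


def esearch_url(doi):
    """Build a PubMed search URL using the same [AID] tag as the bookmarklet."""
    query = urllib.parse.urlencode({"db": "pubmed", "term": doi + "[AID]"})
    return ESEARCH + "?" + query


def parse_pmids(esearch_xml):
    """Pull the PMIDs out of an esearch XML response (IdList/Id elements)."""
    root = ET.fromstring(esearch_xml)
    return [node.text for node in root.findall("./IdList/Id")]
```

To use it for real, you would fetch esearch_url(doi) with something like urllib.request.urlopen and hand the response body to parse_pmids. An empty list means the DOI is not in PubMed.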

My latest hackneyed…. “hack-need”… is to be able to identify follow-up studies for a particular genetic association. If you read a paper with PMID X saying SNP A is significantly associated with a disease, it would be really useful to know when future studies look into that association and either replicate or contradict the finding. Hopefully when they do so, they cite PMID X and/or mention SNP A. Essentially, I’d like to query PubMed for papers that cite a given PMID or SNP (via rs #). Ideally, I could do this in batch for many PMIDs and many SNPs automatically, and have each query return only results that are newer than the previous query (or query date). Then I set the script running behind the scenes, process the results using another script, and maybe have it send me an email with a list of new PMIDs to look into every week. Can world domination be far behind?*

Seriously though, I am looking for tips on how to do this follow-up identification thing, so any help appreciated. Pierre has given me some useful hints for how to search PubMed for papers citing a given rs #, and it would be great if this could be modified with dates:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=snp&id=1802710&db=pubmed&reldate=7

(insert your favorite rs # as the id)
Update: reldate limits results to those within a number of days immediately preceding today’s date; could also use mindate and maxdate to specify a date range.
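Here is one way the batch version might look in Python, as a sketch only. It builds the same elink URL for each rs #, sets reldate from the number of days since the last run, and keeps only PMIDs that have not been seen before. The XML layout assumed in parse_linked_pmids (LinkSet/LinkSetDb/Link/Id), the YYYY/MM/DD date format, and all function names are assumptions to check against a real response.

```python
import datetime
import urllib.parse
import xml.etree.ElementTree as ET

ELINK = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi"


def days_since(last_run, today):
    """Days to pass as reldate so the window reaches back to the last run."""
    return max((today - last_run).days, 1)


def elink_url(rs_number, reldate=None, mindate=None, maxdate=None):
    """Build the SNP-to-PubMed link URL, optionally limited by date."""
    params = {"dbfrom": "snp", "id": str(rs_number), "db": "pubmed"}
    if reldate is not None:
        params["reldate"] = str(reldate)
    if mindate is not None and maxdate is not None:
        # Assumed date format: YYYY/MM/DD; both ends of the range are given.
        params["mindate"] = mindate
        params["maxdate"] = maxdate
    return ELINK + "?" + urllib.parse.urlencode(params, safe="/")


def parse_linked_pmids(elink_xml):
    """Collect linked PMIDs from an elink XML response (assumed layout)."""
    root = ET.fromstring(elink_xml)
    return [node.text for node in root.findall("./LinkSet/LinkSetDb/Link/Id")]


def new_pmids(linked, seen):
    """Keep only PMIDs not already in seen, preserving order."""
    return [pmid for pmid in linked if pmid not in seen]
```

A weekly job could loop over a list of rs #s, fetch each elink_url(rs, reldate=days_since(last_run, datetime.date.today())), and email whatever new_pmids returns.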

* Why stop there, you might ask? I could write a script that downloads the abstracts, “reads” them, filters out the irrelevant ones, summarizes the important information, and populates curation reports. But then I’d be out of a bloody job…

Fun Mac OS X command: say

Group meetings in the Altman lab often kick off with a Unix or computing tip. These range from examples of built-in but lesser known utilities that make our lives at the command line easier, to scripting hacks, to full-fledged applications you download and install.

At the last group meeting I attended, the presenter showed us a fun little command that comes with Mac OS X, called ‘say’. This command basically does what you think it does – it says whatever comes after it. Here’s a simple example:

shwu$ say hello world

The default voice is whatever is set as the default in your system (usually a female voice, unless you’ve changed it), but there are many others you can use by setting the -v parameter:

shwu$ say -v Agnes "this is another woman's voice"
shwu$ say -v Bruce "this is a man's voice"

Some are especially fun, like “Bad News”, Bubbles, “Pipe Organ”, Trinoids, and Zarvox. Others are a little weird, like Albert and Whisper. And then there are ones you just shouldn’t use if you’re home alone at night – Hysterical and Deranged, for example. A more complete list can be found here.

The ‘say’ command isn’t just for amusing yourself, though the tricks you could play on people remotely are endless. You can also use it in conjunction with other commands or in scripts:

shwu$ python -c "print('stuff')" && say done printing stuff || say you have a bug in your script

will say ‘done printing stuff’, whereas if I’d left out one of the single quotes in the python command it would have said ‘you have a bug in your script’ instead. This is great for when you start a script running and turn your attention to YouTube videos, er, other work, but want to be notified when your script either finishes or encounters an error.
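The same finish-or-fail notification can be wrapped in a small Python helper, sketched below. It runs a command, checks the exit code, and passes a message to ‘say’. The helper names and the message wording are made up for illustration, and the speak argument exists only so you can swap in something else on a machine without ‘say’.

```python
import subprocess
import sys


def python_command(code):
    """Argument list equivalent to: python -c "<code>"."""
    return [sys.executable, "-c", code]


def announcement(returncode, task="script"):
    """Pick the message: exit code 0 means success, anything else a failure."""
    if returncode == 0:
        return "done running " + task
    return "you have a bug in your " + task


def say(message):
    """Speak a message with the Mac OS X say command."""
    subprocess.run(["say", message])


def run_and_say(command, speak=say):
    """Run command (a list of arguments), announce the outcome, return the exit code."""
    result = subprocess.run(command)
    speak(announcement(result.returncode))
    return result.returncode
```

Calling run_and_say(python_command("print('stuff')")) on a Mac should speak the success message. Drop one of the quotes and you get the other one.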

Bench scientists can get in on the fun, too. Suppose you have a complicated pipetting protocol that specifies different amounts of different things in different places. A long list can be cumbersome to print out or read, so why not ‘say’ it instead? (Actually, while you can specify a file for it to say using -f, I’m not sure how you would specify pauses if you had it read your aliquots from a text file… so you might need to create a script that wraps all the aliquot amounts in ‘say’ commands with pauses in between, and then put all that in another script… anyway, it would be pretty cool and all your lab mates would be jealous. Or maybe they’d just think you’re strange.)
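One possible shape for that wrapper script, as a sketch: treat each non-empty line of the protocol as a step, hand it to ‘say’, and sleep between steps. The one-step-per-line format, the five-second default pause, and the function names are all assumptions. The run and sleep arguments are there so the loop can be tried without a Mac.

```python
import subprocess
import time


def protocol_steps(text):
    """Split protocol text into non-empty steps, one per line."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def say_command(step, voice=None):
    """Build the argument list for say, with an optional -v voice."""
    command = ["say"]
    if voice is not None:
        command += ["-v", voice]
    command.append(step)
    return command


def read_protocol(text, pause_seconds=5, voice=None,
                  run=subprocess.run, sleep=time.sleep):
    """Speak each step in turn, pausing after every one."""
    for step in protocol_steps(text):
        run(say_command(step, voice))
        sleep(pause_seconds)
```

With the protocol saved in a text file, read_protocol(open("protocol.txt").read(), pause_seconds=10) would read it out one step at a time (the file name here is just a placeholder).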