Scripts and hacks for curation
September 16, 2009 7 Comments
When you curate scientific literature, there are lots of little tasks and procedures and requirements that can all add up into a big inefficient mess if they’re not integrated very well into your workflow. Some of these things are in-house and so you have some level of control over how they are handled (depending, sometimes, on engineering or management), while for others you are at the mercy of the wilderness that is science. It would be great, for example, if all abstracts contained enough information for us to evaluate whether the rest of the article is worth reading – or, in many cases, worth buying in order to read it.In the context of genetic association literature, it would also be great if all abstracts used standard database identifiers for SNPs (i.e. rs #s), as this unambiguously defines the variant; used standard means of reporting the association (e.g. odds ratios with 95% confidence intervals and p-values); and mentioned pertinent aspects of the study, such as the population, number of cases and controls, and adjustments for confounders and multiple comparisons. For me, this qualifies as “enough information to evaluate whether the rest of the article is relevant.” I hope that when they do not mention things like the corrected p-values, it is not because their p-values were not significant. When articles cost up to $90 a pop… well, let’s not get me started.
But I digress. The point is that there are lots of things that could make curation a challenge, and consequently there are lots of things that could make curation easier. Standardization of abstracts, while it doesn’t make for juicy reading, makes going through high volumes of abstracts easier (and machine-accessible). Linking articles from journal websites to PubMed would also be useful, as PubMed serves as a portal to many other resources. Currently, almost all journals use digital object identifiers (DOIs), which are unique pointers to objects on the web. But PubMed IDs (PMIDs), like digital object identifiers (DOIs), are a little simpler, and provide a lot of useful functionality through integration with NCBI’s many databases. You can imagine all the little scripts and hacks you could come up with to improve the curation process, using greasemonkey scripts on Firefox, bookmarklets on any browser, and even web apps.
This script extracts whatever text you’ve highlighted on a page and attempts to search PubMed using it as the DOI. So obviously it will only work if you’ve highlighted something and that something is a DOI, and that DOI is in PubMed. But assuming you do and it is, it will send you directly to the PubMed entry for that paper. Save the script as a browser bookmark, put the bookmark in your bookmarks bar, and whenever you’re on an article webpage (or RSS feed) and want to see the PubMed entry for that article, just highlight the DOI and click the bookmark. (Cameron wrote up a pipe on Yahoo!Pipes a while ago that does something similar, which inspired this bookmarklet.)
Clearly even this simple hack can be improved – it would be nice, perhaps, to have it return the PMID in an alert box so you can make a note and then continue doing whatever you were doing, rather than being sent away to PubMed (this might make use of AJAX?). It would be nice if you didn’t have to highlight, but the script would look for and extract the DOI from the page automatically. And I’m sure you could add even more bells and whistles, within reason.
My latest hackneyed…. “hack-need”… is to be able to identify follow-up studies for a particular genetic association. If you read a paper with PMID X saying SNP A is significantly associated with a disease, it would be really useful to know when future studies look into that association and either replicate or contradict the finding. Hopefully when they do so, they cite PMID X and/or mention SNP A. Essentially, I’d like to query PubMed for papers that cite a given PMID or SNP (via rs #). Ideally, I could do this in batch for many PMIDs and many SNPs automatically, and have each query return only results that are newer than the previous query (or query date). Then I set the script running behind the scenes, process the results using another script, and maybe have it send me an email with a list of new PMIDs to look into every week. Can world domination be far behind?*
Seriously though, I am looking for tips on how to do this follow-up identification thing, so any help appreciated. Pierre has given me some useful hints for how to search PubMed for papers citing a given rs #, and
it would be great if this could be modified with dates:
(insert your favorite rs # as the id)
reldate limits results to those within a number of days immediately preceding today’s date; could also use
maxdate to specify a date range.
* Why stop there, you might ask? I could write a script that downloads the abstracts, “reads” them, filters out the irrelevant ones, summarizes the important information, and populates curation reports. But then I’d be out of a bloody job…