Scripts and hacks for curation

When you curate scientific literature, there are lots of little tasks and procedures and requirements that can all add up into a big inefficient mess if they’re not integrated very well into your workflow. Some of these things are in-house and so you have some level of control over how they are handled (depending, sometimes, on engineering or management), while for others you are at the mercy of the wilderness that is science. It would be great, for example, if all abstracts contained enough information for us to evaluate whether the rest of the article is worth reading – or, in many cases, worth buying in order to read it.

Photo by hbart on Flickr

In the context of genetic association literature, it would also be great if all abstracts used standard database identifiers for SNPs (i.e. rs #s), as this unambiguously defines the variant; used standard means of reporting the association (e.g. odds ratios with 95% confidence intervals and p-values); and mentioned pertinent aspects of the study, such as the population, number of cases and controls, and adjustments for confounders and multiple comparisons. For me, this qualifies as “enough information to evaluate whether the rest of the article is relevant.” I hope that when they do not mention things like the corrected p-values, it is not because their p-values were not significant. When articles cost up to $90 a pop… well, let’s not get me started.

But I digress. The point is that there are lots of things that could make curation a challenge, and consequently there are lots of things that could make curation easier. Standardization of abstracts, while it doesn’t make for juicy reading, makes going through high volumes of abstracts easier (and makes them machine-accessible). Linking articles from journal websites to PubMed would also be useful, as PubMed serves as a portal to many other resources. Currently, almost all journals use digital object identifiers (DOIs), which are unique pointers to objects on the web. PubMed IDs (PMIDs) are also unique identifiers, but they are a little simpler and provide a lot of useful functionality through integration with NCBI’s many databases. You can imagine all the little scripts and hacks you could come up with to improve the curation process, using Greasemonkey scripts on Firefox, bookmarklets on any browser, and even web apps.

One somewhat mundane task we often have to do is search for a paper on PubMed to get the PMID. This is straightforward given we already know the authors, title, journal, etc., but still kind of a pain. Fortunately, PubMed allows you to search by DOI, which almost all publishers provide. So a slight improvement is to use the DOI as the search term in PubMed, as this will return the exact result if the DOI exists in the database. But you still have to open up a new browser window or navigate to PubMed and copy and paste the DOI into the search bar. To reduce the number of steps even further, we can use a simple bookmarklet containing a bit of JavaScript.


This script extracts whatever text you’ve highlighted on a page and attempts to search PubMed using it as the DOI. So obviously it will only work if you’ve highlighted something and that something is a DOI, and that DOI is in PubMed. But assuming you do and it is, it will send you directly to the PubMed entry for that paper. Save the script as a browser bookmark, put the bookmark in your bookmarks bar, and whenever you’re on an article webpage (or RSS feed) and want to see the PubMed entry for that article, just highlight the DOI and click the bookmark. (Cameron wrote up a pipe on Yahoo!Pipes a while ago that does something similar, which inspired this bookmarklet.)
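For illustration, here is a minimal sketch of a bookmarklet along these lines. It is not necessarily the script described above. The PubMed search URL and its "term" parameter are assumptions, and the function name pubmedUrlForDoi is made up for this example.

```javascript
// Build a PubMed search URL from highlighted text that should contain a DOI.
// ASSUMPTION: the PubMed search endpoint and its "term" parameter.
function pubmedUrlForDoi(selectedText) {
  // Drop surrounding whitespace and a leading "doi:" or doi.org resolver prefix.
  const doi = selectedText
    .trim()
    .replace(/^(doi:\s*|https?:\/\/(dx\.)?doi\.org\/)/i, '');
  if (!doi) return null; // nothing was highlighted
  return 'https://pubmed.ncbi.nlm.nih.gov/?term=' + encodeURIComponent(doi);
}

// In a browser, send the current selection to PubMed.
if (typeof window !== 'undefined') {
  const url = pubmedUrlForDoi(String(window.getSelection()));
  if (url) window.location.href = url;
}
```

To use something like this as a bookmark, the code would need to be collapsed onto one line and wrapped as javascript:(function(){ ... })(); in the bookmark's address field.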

Clearly even this simple hack can be improved – it would be nice, perhaps, to have it return the PMID in an alert box so you can make a note and then continue doing whatever you were doing, rather than being sent away to PubMed (this might make use of AJAX?). It would be nice if you didn’t have to highlight, but the script would look for and extract the DOI from the page automatically. And I’m sure you could add even more bells and whistles, within reason.
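Both improvements could be sketched roughly as follows, assuming NCBI's E-utilities ESearch service with JSON output. The DOI pattern is a heuristic rather than a full DOI grammar, the helper names are invented for this example, and whether a browser will allow the cross-site request from an arbitrary journal page is also an assumption.

```javascript
// Find the first DOI-looking string in a page's text (heuristic pattern).
function findDoi(pageText) {
  const match = pageText.match(/\b10\.\d{4,9}\/[^\s"<>]+/);
  // Trim trailing punctuation that usually belongs to the sentence, not the DOI.
  return match ? match[0].replace(/[.,;)\]]+$/, '') : null;
}

// ESearch URL that looks the DOI up in PubMed and asks for a JSON reply.
function esearchUrlForDoi(doi) {
  return 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi' +
    '?db=pubmed&retmode=json&term=' + encodeURIComponent(doi);
}

// Pull the first PMID out of an ESearch JSON reply, or null if there is none.
function firstPmid(esearchJson) {
  const result = esearchJson.esearchresult;
  const ids = (result && result.idlist) || [];
  return ids.length ? ids[0] : null;
}

// Browser-only: show the PMID in an alert box instead of leaving the page.
async function alertPmid() {
  const doi = findDoi(document.body.innerText);
  if (!doi) { alert('No DOI found on this page'); return; }
  const response = await fetch(esearchUrlForDoi(doi));
  const pmid = firstPmid(await response.json());
  alert(pmid ? 'PMID: ' + pmid : 'DOI not found in PubMed');
}
```

Calling alertPmid() from a bookmarklet would cover both wishes, though a page that lists several DOIs (a reference list, say) would need a smarter choice than "the first match".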

My latest hackneyed…. “hack-need”… is to be able to identify follow-up studies for a particular genetic association. If you read a paper with PMID X saying SNP A is significantly associated with a disease, it would be really useful to know when future studies look into that association and either replicate or contradict the finding. Hopefully when they do so, they cite PMID X and/or mention SNP A. Essentially, I’d like to query PubMed for papers that cite a given PMID or SNP (via rs #). Ideally, I could do this in batch for many PMIDs and many SNPs automatically, and have each query return only results that are newer than the previous query (or query date). Then I set the script running behind the scenes, process the results using another script, and maybe have it send me an email with a list of new PMIDs to look into every week. Can world domination be far behind?*
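The "only results newer than the previous query" part can be sketched independently of how the PMIDs are fetched. The snippet below assumes each query (a PMID or an rs #) is a key in an in-memory map of PMIDs already reported. That storage and the function name are hypothetical, and a real script would persist the map between runs.

```javascript
// seen: Map from a query string to the Set of PMIDs already reported for it.
// Returns only the PMIDs not seen before, and records them as seen.
function diffNewPmids(seen, query, currentIds) {
  const known = seen.get(query) || new Set();
  const fresh = currentIds.filter((id) => !known.has(id));
  seen.set(query, new Set([...known, ...currentIds]));
  return fresh;
}

// Example: two weekly runs for the same (placeholder) rs #.
const seen = new Map();
const week1 = diffNewPmids(seen, 'rs123456', ['111', '222']);
const week2 = diffNewPmids(seen, 'rs123456', ['222', '333']);
console.log(week1, week2);
```

The list returned for each query is what would go into the weekly email.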

Seriously though, I am looking for tips on how to do this follow-up identification thing, so any help appreciated. Pierre has given me some useful hints for how to search PubMed for papers citing a given rs #, and it would be great if this could be modified with dates:

(insert your favorite rs # as the id)
Update: reldate limits results to those within a number of days immediately preceding today’s date; could also use mindate and maxdate to specify a date range.
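As a rough sketch of the date options, the function below builds an E-utilities ESearch URL with either reldate or a mindate/maxdate range. It is not Pierre's query: it runs a plain text search of PubMed for the rs # rather than following database links, the choice of "edat" (Entrez date) as the datetype is an assumption, and the rs # shown is a placeholder.

```javascript
const ESEARCH = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi';

// dates: { reldate: 7 } for the last 7 days, or
//        { mindate: 'YYYY/MM/DD', maxdate: 'YYYY/MM/DD' } for an explicit range.
function pubmedSearchUrl(term, dates = {}) {
  const params = new URLSearchParams({
    db: 'pubmed',
    term: term,
    retmode: 'json',
    datetype: 'edat', // ASSUMPTION: limit by the date the record entered PubMed
  });
  if (dates.reldate) {
    params.set('reldate', String(dates.reldate));
  } else if (dates.mindate && dates.maxdate) {
    params.set('mindate', dates.mindate);
    params.set('maxdate', dates.maxdate);
  }
  return ESEARCH + '?' + params.toString();
}

// Placeholder rs #; papers mentioning it that were added in the past week.
console.log(pubmedSearchUrl('rs123456', { reldate: 7 }));
```

Setting reldate to the number of days since the last run gives a crude version of "newer than the previous query".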

* Why stop there, you might ask? I could write a script that downloads the abstracts, “reads” them, filters out the irrelevant ones, summarizes the important information, and populates curation reports. But then I’d be out of a bloody job…


Labmeeting releases a Firefox plug-in

We’ve been hearing a lot about reference managers lately, and I have another post brewing that probably goes into way more detail than anyone needs, but in the meantime, here’s a quick rundown of one particular tool: Labmeeting’s new Firefox plug-in.

It’s still in beta (or alpha?) and therefore behind a registration wall according to Mozilla’s policy, but blog reviews are supposed to help move it out of beta so here we go.

First, what is Labmeeting? At the most basic level, it’s online software for managing and sharing your reference collection. On top of this, it allows you to search for new papers, share documents and have discussions with your lab (hence the name Labmeeting), and read the papers in your collection, all within the browser. Your paper collection is available to you wherever there’s internet access.

One thing that’s been missing from most tools is a way to automatically add items to your library without disrupting your normal workflow. Many people still find most of their papers through PubMed (old habits die hard), and it would be inefficient to navigate to and redo the search in Labmeeting or upload a downloaded paper. Bookmarklets for posting a new item help, but these still take you away from the page you were on so that you can enter tags and confirm the submission manually.

Labmeeting attempts to do this one step better with their plug-in, which, in theory, allows you to add references to your library with a minimum of interruption. After registering for Mozilla (a bit of a pain, thus the move to get it out of beta), I downloaded and installed the plug-in, which placed a little button on the top of my browser window.

Now, when I search or go to PubMed and navigate to a particular paper, I should be able to add it to my collection with just a click of this button, no questions asked (unless the answer’s not obvious). I gave this a whirl but it didn’t work quite as well as I’d hoped. For example, I searched for “microblogging” and was routed straight to the abstract page for the ISMB microblogging paper. But when I clicked “send to labmeeting”, it gave me this:


It says, “could not find PubMed Record… try navigating to an individual citation page…” Hmm. What could be more “individual citation-ey” than this page? At first I thought it might be because the PubMed record is “in process” rather than “indexed”, but if I take their suggestion to “highlight a record id” and then click “send”, the posting is successful.


Note that if there’s no link to a PDF or you don’t have access to the journal, it gives you a handy error message.


But back to the problem. Maybe it’s still possible that I tripped it up with an in-process PubMed record, or the “online only” publishers behave differently. But the same thing happened when I tried adding articles from paper journals from a few years ago that are obviously indexed. So I’m not sure what I’m supposed to say here in this review, other than that the plug-in is a decent idea, but it needs some work. I can see how easy it would make adding papers to your collection, if it only did what I thought it was supposed to do.

I feel like I’m missing something, or doing something wrong, because how can it be that the plug-in has trouble with every. single. paper. I tried? (Also, the three reviews of the plug-in on Mozilla were very favorable. Different plug-in?) Go to PubMed. Navigate to an article. Click the button. Right? Here, let me try a few more random papers just to be sure. Nope, same problem. And I made sure they weren’t “Epubs” just in case. But the point is that the vast majority, if not all, of the PubMed abstracts I fed to the plug-in failed because it couldn’t find the PubMed record, and if it’s because of online publishing, or in-process records, or anything else, then that’s a problem they need to fix, because that’s a heck of a lot of records.

I admit that I’m writing this in the wee hours so this is coming out a bit harsher than I’d like, especially because they asked me to write a review. But if it doesn’t work for me, I think that’s an important piece of information. Because either someone on Labmeeting will tell me what I did wrong (in which case they might need to tell everyone who downloads it, if what I did was logical), or it’s a legitimate issue that they need to address.

So here’s my summary, cons first and then the one pro.


  • Doesn’t work.
  • Doesn’t appear to work with proxy authentication (so I can’t access any closed-access PDFs unless I’m connected to a network with a subscription; even then it’s not clear it would work because of above – I’ll check later).
  • Only “works” with Firefox.
  • Requires 2 steps to install (download, and restart browser – which is a pain).
  • Only “works” with PubMed. It would be nice if you could post an article to your collection from that article’s webpage (on the journal website, for example), among other possibilities.


  • The idea of being able to add papers to your collection without disrupting your workflow is a good one.

Bottom line: Either I missed the memo or the plug-in needs a lot of work. Assuming it did work, there are still some drawbacks that make it less than ideal for a broad audience (PubMed only, Firefox only, no proxy). But maybe they’re not going for a broad audience – just the academic/biomedical research community that uses Firefox from within their institutional network. Or maybe they’ll roll out some new versions that will take care of some of these issues (like the not working one). Because a button like this that could handle a proxy server, worked cross-browser, and allowed postings from multiple web resources would be pretty sweet indeed.