When you curate scientific literature, there are lots of little tasks, procedures, and requirements that can add up to a big inefficient mess if they’re not integrated well into your workflow. Some of these things are in-house and so you have some level of control over how they are handled (depending, sometimes, on engineering or management), while for others you are at the mercy of the wilderness that is science. It would be great, for example, if all abstracts contained enough information for us to evaluate whether the rest of the article is worth reading – or, in many cases, worth buying in order to read.

Photo by hbart on Flickr
In the context of genetic association literature, it would also be great if all abstracts used standard database identifiers for SNPs (i.e. rs #s), as this unambiguously defines the variant; used standard means of reporting the association (e.g. odds ratios with 95% confidence intervals and p-values); and mentioned pertinent aspects of the study, such as the population, number of cases and controls, and adjustments for confounders and multiple comparisons. For me, this qualifies as “enough information to evaluate whether the rest of the article is relevant.” I hope that when they do not mention things like the corrected p-values, it is not because their p-values were not significant. When articles cost up to $90 a pop… well, let’s not get me started.
But I digress. The point is that there are lots of things that could make curation a challenge, and consequently there are lots of things that could make curation easier. Standardization of abstracts, while it doesn’t make for juicy reading, makes going through high volumes of abstracts easier (and machine-accessible). Linking articles from journal websites to PubMed would also be useful, as PubMed serves as a portal to many other resources. Currently, almost all journals use digital object identifiers (DOIs), which are unique pointers to objects on the web. PubMed IDs (PMIDs), like DOIs, are unique identifiers, but they are a little simpler and provide a lot of useful functionality through integration with NCBI’s many databases. You can imagine all the little scripts and hacks you could come up with to improve the curation process, using Greasemonkey scripts on Firefox, bookmarklets on any browser, and even web apps.
One somewhat mundane task we often have to do is search for a paper on PubMed to get the PMID. This is straightforward given we already know the authors, title, journal, etc., but still kind of a pain. Fortunately, PubMed allows you to search by DOI, which almost all publishers provide. So a slight improvement is to use the DOI as the search term in PubMed, as this will return the exact result if the DOI exists in the database. But you still have to open up a new browser window or navigate to PubMed and copy and paste the DOI into the search bar. To reduce the number of steps even further, we can use a simple bookmarklet containing a bit of JavaScript (if it looks cut off, you can still double-click, copy, and paste it):
javascript:var%20t;%20try%20%7B%20t=%20((window.getSelection%20&&%20window.getSelection())%20%7C%7C%20(document.getSelection%20&&%20document.getSelection())%20%7C%7C%20(document.selection%20&&%20document.selection.createRange%20&&%20document.selection.createRange().text));%20%7D%20catch(e)%20%7B%20%20t%20=%20%22%22;%20%7D;%20location.href='http://www.ncbi.nlm.nih.gov/sites/entrez?db=pubmed&term='+t+'%5BAID%5D';
This script extracts whatever text you’ve highlighted on a page and attempts to search PubMed using it as the DOI. So obviously it will only work if you’ve highlighted something and that something is a DOI, and that DOI is in PubMed. But assuming you do and it is, it will send you directly to the PubMed entry for that paper. Save the script as a browser bookmark, put the bookmark in your bookmarks bar, and whenever you’re on an article webpage (or RSS feed) and want to see the PubMed entry for that article, just highlight the DOI and click the bookmark. (Cameron wrote up a pipe on Yahoo!Pipes a while ago that does something similar, which inspired this bookmarklet.)
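For anyone who would rather read the logic than the percent-encoded one-liner, here is a rough, unencoded sketch of the same idea. The function name pubmedUrlForDoi and the example DOI are made up for illustration, and the whitespace trimming and encodeURIComponent call are my own additions rather than part of the bookmarklet above.

```javascript
// Build the PubMed search URL for a DOI, mirroring the bookmarklet above.
// The [AID] field tag asks PubMed to match the term against article identifiers.
// Assumption: trimming and URL-encoding are additions the original does not do.
function pubmedUrlForDoi(selectedText) {
  const doi = String(selectedText).trim();
  return "http://www.ncbi.nlm.nih.gov/sites/entrez?db=pubmed&term=" +
    encodeURIComponent(doi + "[AID]");
}

// In a browser, the bookmarklet body would then be roughly:
//   location.href = pubmedUrlForDoi(window.getSelection());

// Hypothetical DOI, used only to show the shape of the resulting URL.
console.log(pubmedUrlForDoi(" 10.1000/example.doi "));
```

Keeping the URL building in a plain function means it can be checked outside the browser before it gets squeezed into a bookmark.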
Clearly even this simple hack can be improved – it would be nice, perhaps, to have it return the PMID in an alert box so you can make a note and then continue doing whatever you were doing, rather than being sent away to PubMed (this might make use of AJAX?). It would also be nice if you didn’t have to highlight anything and the script instead looked for and extracted the DOI from the page automatically. And I’m sure you could add even more bells and whistles, within reason.
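Those two improvements might look something like the sketch below. Everything here is an assumption on my part: the DOI pattern is a heuristic rather than a full specification, the function names are invented, and the lookup uses the E-utilities ESearch endpoint with retmode=json, whose response I am assuming carries the PMIDs in esearchresult.idlist. Whether a browser will allow the cross-site request from an arbitrary journal page is something to verify before relying on it.

```javascript
// Heuristic DOI pattern: "10.", a 4-9 digit prefix, "/", then a suffix.
// Assumption: this covers common DOIs but is not a complete definition.
const DOI_PATTERN = /\b10\.\d{4,9}\/[^\s"'<>]+/;

// Return the first DOI-like string found in a block of text, or null.
function extractDoi(text) {
  const match = DOI_PATTERN.exec(text);
  // Strip trailing punctuation that usually belongs to the sentence.
  // (This would also clip a DOI that genuinely ends in one of these.)
  return match ? match[0].replace(/[.,;)]+$/, "") : null;
}

// Build an ESearch request that looks the DOI up in PubMed and asks for JSON.
function esearchUrlForDoi(doi) {
  return "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi" +
    "?db=pubmed&retmode=json&term=" + encodeURIComponent(doi + "[AID]");
}

// Pull the list of PMIDs out of a parsed ESearch JSON response.
function pmidsFromEsearch(response) {
  const result = response && response.esearchresult;
  return result && Array.isArray(result.idlist) ? result.idlist : [];
}

// Hypothetical bookmarklet body (browser only, not run here):
//   const doi = extractDoi(document.body.innerText);
//   fetch(esearchUrlForDoi(doi)).then((r) => r.json())
//     .then((j) => alert("PMID: " + pmidsFromEsearch(j).join(", ")));
```

Splitting it into three small functions keeps the guesswork (the DOI pattern) separate from the parts that just follow the E-utilities URL format.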
My latest hackneyed…. “hack-need”… is to be able to identify follow-up studies for a particular genetic association. If you read a paper with PMID X saying SNP A is significantly associated with a disease, it would be really useful to know when future studies look into that association and either replicate or contradict the finding. Hopefully when they do so, they cite PMID X and/or mention SNP A. Essentially, I’d like to query PubMed for papers that cite a given PMID or SNP (via rs #). Ideally, I could do this in batch for many PMIDs and many SNPs automatically, and have each query return only results that are newer than the previous query (or query date). Then I set the script running behind the scenes, process the results using another script, and maybe have it send me an email with a list of new PMIDs to look into every week. Can world domination be far behind?*
Seriously though, I am looking for tips on how to do this follow-up identification thing, so any help is appreciated. Pierre has given me some useful hints for how to search PubMed for papers citing a given rs #, and it would be great if this could be modified with dates:
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=snp&id=1802710&db=pubmed&reldate=7
(insert your favorite rs # as the id)
Update: reldate limits results to those within a number of days immediately preceding today’s date; could also use mindate and maxdate to specify a date range.
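As a starting point for the batch version, here is a minimal sketch that builds the ELink URL above for any rs # and keeps track of which PMIDs are new since the last run. The function names and the PMIDs in the usage lines are hypothetical, and I have not confirmed that the date parameters take effect for every source database, so treat that part as something to check against the E-utilities documentation.

```javascript
const ELINK = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi";

// Build an ELink URL from dbSNP to PubMed for one rs number (digits only).
// Pass either { reldate } (days before today) or { mindate, maxdate }.
function elinkUrlForSnp(rsNumber, dates = {}) {
  const params = new URLSearchParams({
    dbfrom: "snp",
    id: String(rsNumber),
    db: "pubmed",
  });
  for (const key of ["reldate", "mindate", "maxdate"]) {
    if (dates[key] !== undefined) params.set(key, String(dates[key]));
  }
  return ELINK + "?" + params.toString();
}

// Keep only the PMIDs that were not already returned by an earlier run.
function newPmids(current, alreadySeen) {
  const seen = new Set(alreadySeen.map(String));
  return current.map(String).filter((pmid) => !seen.has(pmid));
}

// One URL per SNP for a weekly check; the second rs number is made up.
const weeklyUrls = [1802710, 1234567].map((rs) => elinkUrlForSnp(rs, { reldate: 7 }));
console.log(weeklyUrls.join("\n"));

// Hypothetical PMIDs: only those unseen in the previous run are reported.
console.log(newPmids(["111", "222", "333"], ["222"]));
```

Storing the previously seen PMIDs and diffing against them is a fallback in case the date filters turn out not to be enough on their own.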
* Why stop there, you might ask? I could write a script that downloads the abstracts, “reads” them, filters out the irrelevant ones, summarizes the important information, and populates curation reports. But then I’d be out of a bloody job…
A brief analysis of commenting at BMC, PLoS, and BMJ
November 18, 2009 by shwu
As announced on FriendFeed and Twitter, a writing collaboration between me and the inimitable Cameron Neylon has just been published at PLoS Biology, “Article-level metrics and the evolution of scientific impact”! (Loosely based on a blog post from several months ago.)
One of the many issues Cameron and I touched on was the problem of commenting. Most people probably aren’t aware of the problem; after all, commenting is alive and well on the internet in most places you look! But click over to PLoS or BioMed Central (BMC) and the comment sections are the digital equivalent of rolling tumbleweed.
As we mention briefly in the article, comments have great potential for improving science. For one thing, they’re a form of peer review, but without the month-long wait and seemingly arbitrary review criteria. Readers, authors, and other evaluators can also get a sense of what people think about the article. The ideal is certainly tantalizing: vigorous, rigorous debates over the finer scientific points as well as the overarching conclusions, with participation from both experts in the field and informed laypeople, always with intelligence and civility!!!1!11!!one!! But let’s not kid ourselves: the worst-case scenario is all too easy to imagine and would probably look something like the discussions over at YouTube.
And this would be positively urbane. (From PhD comics)
Clearly, few scientists would find such discussions illuminating or useful as a means of evaluating the impact of a particular paper. But the idea of using comments to attain immediate feedback (and, verily, to give it) still beckons. Ironically, the real problem seems to lie not in keeping undesirables out, but in enticing potential contributors in.
Despite having commenting platforms up for several years, the vast majority of articles published in PLoS and BioMed Central have no comments, and those that do average in the low single digits. Only a handful of articles at both publishers have comment numbers in the double digits.
What types of articles are these? Looking at BMC’s top commented articles, about 80% are commentaries, editorials, correspondence, etc.; in other words, not primary research. If you add in the odd case report or study protocol, that percentage only goes down to 67%. (These numbers were taken from a list of most commented articles as of July 2008, looking at the 12 articles with more than five comments. Obviously, the list may have changed since that time, but I’m guessing not significantly, as very few of the most highly accessed articles on BMC have any comments and comment numbers for the 12 articles I looked at have not gone up for the most part.)
If we instead turn to the top 100 most highly accessed articles at BMC in the last 30 days (data collected around Oct 30), we see a slightly different picture. Here, the balance is approximately 60:40 in favor of research articles. The number of articles in the top 100 that have comments, however, is somewhat depressing: only 12 have any comments at all, and only two of these have more than two. The good news is that half of the commented articles are research (disproportionately more than for non-research), and research articles averaged 4.3 comments compared to 1.5 for front matter. Taking a quick glance at the all-time most highly accessed articles reinforces the observation that the number of views does not seem to correlate with the number of comments or with time since publication.
Turning our attention to a sampling of highly commented papers at PLoS, we see a slightly different pattern. Although the article with the most comments is in PLoS Medicine, the median number of comments per article in PLoS Medicine is lower than that for PLoS ONE. This is to be expected given the fact that PLoS ONE publications have been designed to be highly interactive from the start.
Reassuringly, “Research” articles tend to have more comments than non-research articles, though this may be due to an outlier. The number of highly commented articles also seems to be fairly even for both types overall, though this balance is skewed by the fact that PLoS ONE only publishes primary research. PLoS Comp Bio publishes both types of articles and the highly commented ones tend to be front matter. Articles with comments at PLoS Medicine, on the other hand, seem to be more evenly distributed between research and front matter.
Interestingly, articles with comments tend to be in the medical or public health fields at both BMC and PLoS. If you go to BMJ, a popular medical journal that also allows comments (“Rapid Responses”), you’ll actually find fairly lively discourse. Yet a bias towards “front matter” like editorials and news items is apparent: out of 159 articles commented on between 9/30/09 and 10/30/09, only 31% were research-type articles (for which I included “Research”, “Methods”, “Analysis”, and “Practice” articles).
The good news is that research articles in BMJ tended to have more comments than front matter articles, though not by much — three per research article commented on in that time period vs. two per front matter article. Research articles also showed a slight trend towards more comments over time which the front matter articles did not exhibit.
* Note that the sample sizes for all analyses are very small and thus any conclusions are purely speculative.
In the next post, I’ll offer some very informal and more or less unsubstantiated explanations for these observations. One final observation is that none of the three publishers I looked at allows browsing or ranking of articles based on number of comments as far as I could tell, although with PLoS I could certainly go through the ALM data that they’ve kindly made available. Still, I would imagine that having a browsable list of the most commented articles would be very useful just for the casual reader.