In memoriam: Warren DeLano



PyMOL has starred on many journal covers

On Tuesday, November 3rd, the scientific community suffered a great loss with the passing of Warren DeLano. Most people know him as the creator of PyMOL, a popular and extremely powerful molecular visualization tool, but most – including myself, until recently – may not know all of the other unique qualities that made Warren a mentor, collaborator, inspiration and friend to many. And by making PyMOL open source, Warren demonstrated his generosity and ensured that his work would continue to help future generations of scientists.


Posts in the pipeline, and in the meantime

October’s been a busy month and so I haven’t had much time to post. But busy means interesting, and so I have lots of things to write about; the writing just doesn’t really get done. Some of the posts I have in the pipeline, mostly just as titles with sparse notes to remind myself what they mean:

The commenting conundrum: about where and why scientists do or don’t comment on scientific articles.

Responding to “them”: about the whos, whats, wheres, whens, whys, and hows of criticism and responding (or not) to it; mostly on the web but also off.

A detailed look into PLoS’s article-level metrics data: it’s open, so why not? And the results might just surprise you.

Thoughts from Science Commons Salon: with the amount of brainpower in that room, I’m surprised it didn’t explode. In fact, I’m surprised the whole town of Mountain View hasn’t exploded from sheer intellect yet.

So yeah, plenty to write about, sometime. I saw Pete Binfield of PLoS at the SC Salon and he joked that I was falling behind, reposting things that he’d posted a whole four days ago. Makes me want to start the Slow Blog movement…

Those posts will probably keep simmering for a little while. In the meantime, I haven’t been completely idle – in the last three weeks, I’ve written three blog posts for 23andMe’s Spittoon on genetic association studies on glaucoma, bone mineral density, and blood-related traits. Another one is set to come out early next week. So if you haven’t been tuning in regularly to the Spittoon, now you know where else to find me!

Scripts and hacks for curation

When you curate scientific literature, there are lots of little tasks and procedures and requirements that can all add up into a big inefficient mess if they’re not integrated very well into your workflow. Some of these things are in-house and so you have some level of control over how they are handled (depending, sometimes, on engineering or management), while for others you are at the mercy of the wilderness that is science. It would be great, for example, if all abstracts contained enough information for us to evaluate whether the rest of the article is worth reading – or, in many cases, worth buying in order to read it.

Photo by hbart on Flickr


In the context of genetic association literature, it would also be great if all abstracts used standard database identifiers for SNPs (i.e. rs #s), as this unambiguously defines the variant; used standard means of reporting the association (e.g. odds ratios with 95% confidence intervals and p-values); and mentioned pertinent aspects of the study, such as the population, number of cases and controls, and adjustments for confounders and multiple comparisons. For me, this qualifies as “enough information to evaluate whether the rest of the article is relevant.” I hope that when they do not mention things like the corrected p-values, it is not because their p-values were not significant. When articles cost up to $90 a pop… well, let’s not get me started.

But I digress. The point is that there are lots of things that could make curation a challenge, and consequently there are lots of things that could make curation easier. Standardization of abstracts, while it doesn’t make for juicy reading, makes going through high volumes of abstracts easier (and machine-accessible). Linking articles from journal websites to PubMed would also be useful, as PubMed serves as a portal to many other resources. Currently, almost all journals use digital object identifiers (DOIs), which are unique pointers to objects on the web. But PubMed IDs (PMIDs) are a little simpler than DOIs, and provide a lot of useful functionality through integration with NCBI’s many databases. You can imagine all the little scripts and hacks you could come up with to improve the curation process, using Greasemonkey scripts on Firefox, bookmarklets on any browser, and even web apps.

One somewhat mundane task we often have to do is search for a paper on PubMed to get the PMID. This is straightforward given we already know the authors, title, journal, etc, but still kind of a pain. Fortunately, PubMed allows you to search by DOI, which almost all publishers provide. So a slight improvement is to use the DOI as the search term in PubMed, as this will return the exact result if the DOI exists in the database. But you still have to open up a new browser window or navigate to PubMed and copy and paste the DOI into the search bar. To reduce the number of steps even further, we can use a simple bookmarklet containing a bit of javascript (if it looks cut off, you can still double-click copy and paste it):


This script extracts whatever text you’ve highlighted on a page and attempts to search PubMed using it as the DOI. So obviously it will only work if you’ve highlighted something and that something is a DOI, and that DOI is in PubMed. But assuming you do and it is, it will send you directly to the PubMed entry for that paper. Save the script as a browser bookmark, put the bookmark in your bookmarks bar, and whenever you’re on an article webpage (or RSS feed) and want to see the PubMed entry for that article, just highlight the DOI and click the bookmark. (Cameron wrote up a pipe on Yahoo!Pipes a while ago that does something similar, which inspired this bookmarklet.)
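For anyone who wants to roll their own, here is a minimal sketch of the idea described above. It is not the original bookmarklet: the function name `pubmedUrlForDoi`, the DOI pattern, and the PubMed search address are all my assumptions, so adjust them to whatever PubMed currently accepts.

```javascript
// Assumed search address; the original bookmarklet may have used another one.
const PUBMED_SEARCH = 'https://pubmed.ncbi.nlm.nih.gov/?term=';

// Turn highlighted text into a PubMed search URL.
// Returns null when the text does not look like a DOI
// (here: "10.", a 4-9 digit registrant code, a slash, then a suffix).
function pubmedUrlForDoi(selectedText) {
  const doi = String(selectedText).trim().replace(/^doi:\s*/i, '');
  if (!/^10\.\d{4,9}\/\S+$/.test(doi)) return null;
  return PUBMED_SEARCH + encodeURIComponent(doi);
}

// Browser-only part: this is what a bookmarklet would actually run.
if (typeof window !== 'undefined') {
  const url = pubmedUrlForDoi(window.getSelection().toString());
  if (url) {
    window.location.href = url; // jump to the PubMed search result
  } else {
    alert('Highlight a DOI first.');
  }
}
```

To use it as a bookmarklet, the whole thing would be squeezed onto one line, wrapped in `javascript:(function(){ ... })();` and saved as the address of a bookmark.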

Clearly even this simple hack can be improved – it would be nice, perhaps, to have it return the PMID in an alert box so you can make a note and then continue doing whatever you were doing, rather than being sent away to PubMed (this might make use of AJAX?). It would be nice if you didn’t have to highlight, but the script would look for and extract the DOI from the page automatically. And I’m sure you could add even more bells and whistles, within reason.

My latest hackneyed…. “hack-need”… is to be able to identify follow-up studies for a particular genetic association. If you read a paper with PMID X saying SNP A is significantly associated with a disease, it would be really useful to know when future studies look into that association and either replicate or contradict the finding. Hopefully when they do so, they cite PMID X and/or mention SNP A. Essentially, I’d like to query PubMed for papers that cite a given PMID or SNP (via rs #). Ideally, I could do this in batch for many PMIDs and many SNPs automatically, and have each query return only results that are newer than the previous query (or query date). Then I set the script running behind the scenes, process the results using another script, and maybe have it send me an email with a list of new PMIDs to look into every week. Can world domination be far behind?*
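As a starting point for the "who cites PMID X since I last checked" half of this, here is a rough sketch that only builds the query URLs. It assumes NCBI's ELink utility with the `pubmed_pubmed_citedin` link name (which, as far as I know, only covers citing articles that are in PubMed Central), and it assumes the `reldate` and `datetype` parameters are honored for this kind of link, which should be checked against the E-utilities documentation. The PMIDs and dates are placeholders, and `daysSince` and `citedInUrl` are names I made up.

```javascript
const ELINK = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi';

// Whole days between the previous query and now (never less than 1).
function daysSince(lastQuery, now = new Date()) {
  const ms = now.getTime() - lastQuery.getTime();
  return Math.max(1, Math.ceil(ms / 86400000));
}

// URL asking for PubMed records that cite `pmid`,
// limited to the last `reldate` days (assumed to be supported here).
function citedInUrl(pmid, reldate) {
  const params = new URLSearchParams({
    dbfrom: 'pubmed',
    db: 'pubmed',
    linkname: 'pubmed_pubmed_citedin',
    id: String(pmid),
    reldate: String(reldate),
    datetype: 'edat',
  });
  return `${ELINK}?${params}`;
}

// Batch version: one URL per watched paper (placeholder PMIDs and dates).
const watchedPmids = ['12345678', '23456789'];
const lastQuery = new Date('2009-10-01T00:00:00Z');
const today = new Date('2009-10-08T00:00:00Z');
const batchUrls = watchedPmids.map((p) => citedInUrl(p, daysSince(lastQuery, today)));
console.log(batchUrls.join('\n'));
```

Fetching the URLs, parsing the XML that comes back, and emailing the new PMIDs would be the job of the second script.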

Seriously though, I am looking for tips on how to do this follow-up identification thing, so any help appreciated. Pierre has given me some useful hints for how to search PubMed for papers citing a given rs #, and it would be great if this could be modified with dates:

(insert your favorite rs # as the id)
Update: reldate limits results to those within a number of days immediately preceding today’s date; could also use mindate and maxdate to specify a date range.
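To make the date options concrete, here is a small sketch that builds such a query with either `reldate` or a `mindate`/`maxdate` range. It is my own guess at the shape of the call, not Pierre's URL: it assumes an ELink request from dbSNP to PubMed where the id is the rs number without the "rs" prefix, and whether the date limits actually apply to this link type is something to verify in the E-utilities documentation. The function name `snpToPubmedUrl` is made up.

```javascript
const ELINK_URL = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi';

// Build a dbSNP -> PubMed link query for one rs number.
// `dates` may hold { reldate } or { mindate, maxdate } (YYYY/MM/DD strings).
function snpToPubmedUrl(rsId, dates = {}) {
  const id = String(rsId).trim().replace(/^rs/i, ''); // assumed: id is the bare number
  if (!/^\d+$/.test(id)) throw new Error(`Not an rs number: ${rsId}`);
  const params = new URLSearchParams({ dbfrom: 'snp', db: 'pubmed', id });
  if (dates.reldate) {
    params.set('reldate', String(dates.reldate)); // last N days
  }
  if (dates.mindate && dates.maxdate) {
    params.set('mindate', dates.mindate); // explicit date range
    params.set('maxdate', dates.maxdate);
  }
  return `${ELINK_URL}?${params}`;
}

console.log(snpToPubmedUrl('rs671', { reldate: 30 }));
console.log(snpToPubmedUrl('rs671', { mindate: '2009/01/01', maxdate: '2009/10/31' }));
```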

* Why stop there, you might ask? I could write a script that downloads the abstracts, “reads” them, filters out the irrelevant ones, summarizes the important information, and populates curation reports. But then I’d be out of a bloody job…

New job and curation 101

It’s been several weeks now since I started working at 23andMe, a personal genomics company located in Mountain View, CA. Perhaps not coincidentally, it’s also been several weeks since I last blogged. The transition hasn’t been difficult, but it did take some getting used to, mentally and physically. I mean, leaving for work by 8:30am? Regular hours? Commuting??

Ok, so I really have nothing to complain about. 8:30 isn’t that early, and I could shave half an hour off each end of my commute if I didn’t choose to take advantage of bike-friendly roads, good weather, and a company-sponsored free train pass (OMG benefits!?). All in all, things are pretty much fantastic. The work environment is friendly, flexible, and laid-back; we have plenty of food and drink to keep us fueled throughout the day, and regular workouts/yoga if we need to get fired up or mellowed down (and to keep the “Free Food 15” at bay). Plus, personal genomics is a super interesting and rapidly evolving industry, so there’s really never a dull moment.

So what is personal genomics, anyway? We’ve known for a while that genetics – the sequence of DNA inside our cells – plays an important role in our form and functioning. Many diseases are caused by changes in DNA (often in genes, parts of DNA that code for proteins) that alter the normal functioning of cells, though not all genetic differences lead to negative changes. (Genetics can also tell us about ancestry – who is related to whom and the history of populations – but I won’t be addressing that in this post.) Where it gets personal is when you apply it to individuals, such as when someone gets a genetic test to determine whether they have or are at risk of developing or passing on a particular disease. Where it gets genomics is when we use high-throughput technologies to do what is essentially thousands of genetics tests at once. Put them together, and you get personal genomics.

How do we know what genetic “pieces” correspond to what conditions or diseases? The general strategy is to compare the DNA of a whole bunch of individuals that have that condition (cases) to a whole bunch of individuals that don’t (controls). As long as both groups are similar save for their case-control status, any significant genetic differences between them should have something to do with that condition. We call this a genetic association.

It turns out that there are millions of single locations in the human genome where the exact sequence of the DNA might differ between two people, and these places, called single nucleotide polymorphisms, or SNPs, can contribute to differences we can observe, such as whether you flush when you drink alcohol or how easily you put on weight. 23andMe’s personal genomics kit determines what your sequence is for a representative subset of SNPs. Many are already known to be associated with certain conditions, and new research is being done every day to uncover more and more of these associations.

So what exactly do I do at 23andMe? My official job title is “Scientist, Content Curation”. Curation, I’ve found, is not very familiar to most people. Most people probably know that there is such a thing as a museum curator, but might not know what they do. Hardly anyone has ever heard of scientific curation. (And I thought explaining what I was studying as a grad student was hard! Biomedical informatics, anyone?)

But it’s really not that complicated. The essence of curation is almost always the same: the selection, acquisition, and management of content. What that content is differs depending on the field – for example, an art curator might look for and organize artwork for exhibition in a gallery, while a curator in the “Ancient Civilizations” department of a museum may be in charge of acquiring, managing, and presenting archaeological artifacts.

In science, curation involves organization of scientific knowledge and data. An area where this has been especially important is the life sciences, as the amount of information being generated by high-throughput experiments, large-scale projects, and scholarly publishing has skyrocketed. In order to manage this information and render it useful to others, the field of biocuration was born. Any database that organizes scientific knowledge – UniProt (the Universal Protein resource), FlyBase (database for that very important model organism, Drosophila), PharmGKB (a database focused on how genes and drugs interact), etc – depends on curators to keep the information up to date and easy to use.

And so it is with 23andMe. The genetic testing kit is one part of the product, but the other part is information – what knowledge is there about associations between the SNPs on our platform and health traits or conditions? What does your particular data mean? The science is far from exhausted on this subject, and in order to stay up to date with the research, 23andMe spends a lot of effort on curating the scientific literature for new genetic associations and presenting the information on our website for our customers.

Day to day, this means that we keep track of papers recently published in scientific journals, skim through to find ones that may have promising findings, and then vet these more thoroughly to see if they pass our stringent scientific standards. If they do, we extract the bits of information we need and put the bits together in reports that will eventually become part of the content on the website. It’s a job that definitely benefits from an organized system and an eye for detail – as well as a sense of curiosity.

After three weeks on the job, I think I’m starting to get the hang of the day-to-day work. Since my work is even more directly tied to the literature than it was as a graduate student in academia, I’m also developing an enhanced awareness of issues surrounding scientific publishing – those related to standardization and metadata, publication bias towards positive results, and closed vs. open access. The hardest aspect of transitioning from academia to industry hasn’t been the regular schedule, or the work environment, or the work itself; it’s been getting used to being on the other side of the pay-wall of scientific journals.

But that’s a rant for another time. ;)

A Real Guitar Hero – Sungha Jung 12 Year Old Prodigy Fingerstyle Guitarist at

This kid is truly talented… the style, skill, originality, I can’t think of anything he’s missing. I’d definitely buy an album of his arrangements!

Sydney International Food Festival: Flags

Whimsical, clever, beautiful and delicious – what more could you want? Food is a common denominator across the world!

How personal genomics is rocking the boat

I’ve been doing some reading on personal genomics, direct-to-consumer genetic tests, and personalized medicine lately, in an effort to steep myself in the science and issues prior to starting work in this field. Today, I read an opinion piece by R. J. Carlson titled “The disruptive nature of personalized medicine technologies: implications for the health care system,” [1] that was especially interesting. Rather than expound on the usual arguments for or against consumer genomics, it laid out several important areas where personalized medicine and genomics technologies would disrupt the current system, often with brutal honesty.

Clearly, one of these areas is private health insurance. Describing private health insurance as “a hybrid of economic ruthlessness and utilitarian social policy … is supposed to perform the social policy role that the public sector can’t or won’t, and that is to ration,” Carlson points out some sobering scenarios. One is that the Genetic Information Non-discrimination Act (GINA) covers only the underwriting process, and does not guard against denial of coverage or steep increases in premiums once a genetically-suggested condition manifests. Another is the moral and social dilemma posed by the knowledge – on either side – that those with “demonstrably superior health” are subsidizing care for those with “known genetic risk”. And, given the increasing knowledge we’ll have about health risks, it would be ridiculous not to use any of it in designing insurance packages. Carlson doesn’t paint this as a negative thing, necessarily, but instead calls on public policy to “facilitate the constructive uses of these data by shaping financial and access reforms to the genomics medicine that is arriving.”

The debate over health insurance is fairly familiar, however. What Carlson makes very clear in the rest of the piece is that personal genomics takes medicine in a fundamentally different direction than where it has been going for the last half century. Traditional modern medicine has focused on mechanism and reductionism, finding what’s wrong and fixing it, and applying that knowledge to new cases of the same thing. We use the fact that humans are more or less similar to enact standards of care.

But personalized medicine focuses on the differences between people and treats every patient as a unique case. This leads to two natural consequences: it makes medical care more costly, and it renders the standardization of medical practice obsolete, if not impossible. Of course, personalized medicine could conceivably be more cost-effective through better preventative care, but this is only if significant effort goes towards realizing this potential. And although I hadn’t thought about personal genomics in the context of evidence-based medicine, it’s not hard to see the conflict:

There’s the rub: to be effective, a personalized medicine must build on our ever more definitive differences, defying standardization for the very long haul, if ever. Measuring quality in health care under a genomics model is crudely analogous to measuring automobile fuel efficiency when every automobile is assembled from a wide array of materially different but functionally interchangeable parts, performs differently on every trip, and changes in performance with the moods and capacities of every driver.

This article captures the nuances of some very interesting challenges facing health care in response to genomics technologies with a view that is both realistic and optimistic. Carlson recognizes that the era of medical paternalism is giving way to democratization of health information, and we must adapt our policies to reflect this. Indeed, he argues that without active and careful management of this process, we may very well sabotage our ability to reap any rewards from this technology.

Definitely worth a read, and worth thinking about.

[1] Carlson RJ. (2009) The disruptive nature of personalized medicine technologies: implications for the health care system. Public Health Genomics 12(3):180-184.
DOI: 10.1159/000189631