Moving targets in proteomics

Photo by cmatsuoka on Flickr

Proteomics – the large-scale study of proteins – often doesn’t get as much attention as its older sibling, genomics. But its potential applications are just as important, if not more so, because proteomics deals with the physical nuts and bolts of biological systems – the proteins – rather than the blueprints. To be sure, DNA plays a huge role in many different ways, but the macromolecules bustling around inside the cell deserve their own special credit.

Most of us working in bioinformatics hear much more about genomics than proteomics on a regular basis. The field of genomics is more mature, and its problems have been around long enough to be fairly well-defined. Proteomics, however, is messy in a number of ways. While the genome remains more or less constant across conditions, the proteome can change drastically depending on when you assay it and where the sample comes from. The same gene sequence can produce multiple different proteins (through alternative splicing, for example). Proteins can also undergo many chemical modifications that change their function. On top of this, identifying specific proteins in high-throughput assays is still a challenge.

Yet the potential benefits of proteomics research are tantalizing. For instance, we could develop non-invasive diagnostics – instead of performing tissue biopsies, we could collect samples of blood or urine to identify whether someone has a particular disease, is at risk for organ rejection after a transplant, or is responding well to a treatment. So what some proteomics researchers are trying to do is identify potential biomarkers for a condition. They do this by studying the differences between the sample population of interest – burn patients who suffered multiple organ failure, say – and an appropriate control population – burn patients who did not suffer multiple organ failure.

Example mass spec data

In a typical scenario, the samples might be analyzed using mass spectrometry, where proteins are broken up into smaller fragments (peptides) whose masses are measured and used to infer the identities of the original proteins. In most cases the peptides map unambiguously to specific proteins, but a fraction cannot be resolved with complete certainty and is usually excluded from the analysis. (This is separate from issues regarding the methods used to identify the proteins, the quality of the raw data, and difficulties with the instrumentation itself, which I won’t be discussing here.) Once we’ve identified the proteins in the sample, we can associate them with the intensity levels of the detected peptides, producing an expression vector (of sorts) for each sample analyzed.
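To make the "expression vector (of sorts)" concrete, here is a minimal sketch in Python. It assumes each detected peptide arrives as a tuple of sequence, candidate proteins, and intensity, and that a protein's value is simply the sum of its unambiguous peptide intensities. Both the input layout and the summing rule are assumptions for illustration (real pipelines use a variety of roll-up methods), and the names `protein_intensities`, `P1`, `P2`, and `P3` are hypothetical.

```python
from collections import defaultdict


def protein_intensities(peptides):
    """Build one sample's expression vector from its detected peptides.

    `peptides` is a list of (sequence, candidate_proteins, intensity) tuples.
    Peptides that map to more than one protein (or to none) are excluded,
    mirroring the "usually excluded from the analysis" step in the text.
    Summing intensities per protein is an assumption made for this sketch.
    """
    totals = defaultdict(float)
    for _sequence, proteins, intensity in peptides:
        if len(proteins) != 1:
            continue  # ambiguous or unmapped peptide: skip it
        totals[proteins[0]] += intensity
    return dict(totals)


# Hypothetical peptides for a single sample.
sample = [
    ("PEPTIDEA", ["P1"], 100.0),
    ("PEPTIDEB", ["P1"], 50.0),
    ("PEPTIDEC", ["P2"], 30.0),
    ("PEPTIDED", ["P2", "P3"], 80.0),  # maps to two proteins, so excluded
]

vector = protein_intensities(sample)
print(vector)
```

Note that P3 never makes it into the vector here: its only evidence was an ambiguous peptide, so for this sample it is simply absent rather than zero.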

Given these expression vectors, we want to do the same things we do to gene expression data – namely, normalize the data and find proteins showing statistically significant differential expression between sample groups. Unfortunately, we can’t just apply the same methods; the data, although equivalent in theory, behaves quite differently in practice. Missing values, merely an annoyance in gene expression data, can be a serious problem in protein expression data, for reasons tied to the general difficulties of studying proteins.


To understand why, consider this: mass spectrometry is a great way to assay a large number of proteins in a high-throughput fashion, but there is no way to know ahead of time which proteins we will find. We don’t have a pre-determined array of possibilities to fill in; rather, we construct the array from the experimental data. With multiple samples, the array has the proteins detected anywhere in the experiment along one axis and the samples along the other, and each cell holds the intensity of a particular protein in a particular sample. As we do this for greater numbers of samples, the proportion of missing values actually increases. This is probably due to low-abundance proteins: each additional sample increases the likelihood that a low-abundance protein will be detected. When such a protein is detected, it is added to the data matrix, and the entries for that protein in all the other samples become missing values.

So how do we handle these missing values? Should they be converted to zeroes? Set to a “minimum detection value”? Set to something else? If the lack of detection in the other samples really is due to low abundance, then one of the first two solutions makes sense.
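The way the matrix grows, and the two simple fill strategies mentioned above, can be sketched in a few lines of Python. Everything here is illustrative: the sample names, protein names, and intensities are made up, the function names are hypothetical, and the "minimum detection value" is approximated by the smallest intensity observed anywhere in the matrix, which is an assumption rather than an instrument-derived limit.

```python
def build_matrix(samples):
    """Build a protein-by-sample matrix from per-sample expression vectors.

    `samples` maps sample name -> {protein: intensity}. The rows are the
    union of proteins seen in any sample; undetected cells are None.
    """
    proteins = sorted({p for vec in samples.values() for p in vec})
    return {p: {s: vec.get(p) for s, vec in samples.items()} for p in proteins}


def missing_fraction(matrix):
    """Fraction of cells in the matrix that are missing (None)."""
    cells = [v for row in matrix.values() for v in row.values()]
    return sum(v is None for v in cells) / len(cells)


def fill_missing(matrix, strategy="zero"):
    """Replace missing cells with 0.0 ("zero") or the smallest observed
    intensity ("min"), a stand-in for a minimum detection value."""
    if strategy == "zero":
        fill = 0.0
    elif strategy == "min":
        fill = min(v for row in matrix.values()
                   for v in row.values() if v is not None)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return {p: {s: fill if v is None else v for s, v in row.items()}
            for p, row in matrix.items()}


# Two samples that detect the same proteins: no missing values.
two = {"s1": {"A": 10.0, "B": 5.0},
       "s2": {"A": 12.0, "B": 6.0}}

# A third sample picks up low-abundance protein C, which adds a new row
# that is empty for s1 and s2.
three = dict(two, s3={"A": 11.0, "B": 4.0, "C": 0.5})

frac_two = missing_fraction(build_matrix(two))
frac_three = missing_fraction(build_matrix(three))
filled_min = fill_missing(build_matrix(three), strategy="min")
filled_zero = fill_missing(build_matrix(three), strategy="zero")
print(frac_two, frac_three)
```

Adding the third sample takes the matrix from four fully observed cells to nine cells with two missing, even though nothing about the first two samples changed. Which fill is appropriate still depends on why the value is missing, which is exactly the question the code cannot answer.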

This seemingly mundane example made me realize two things. One is that in proteomics we are essentially trying to study a moving target. This is extremely hard. We go from working in one dimension to working in four – and we can’t do it in just one step. Things happen in proteomics assays that don’t happen in genomics assays and we have to develop a whole new set of tools. This is the second lesson: no matter how much informaticians want to generalize, we can’t always apply the same methods blindly. Even seemingly simple problems like missing values can have different causes and different consequences, depending on the biological context.

Note: I have no background in proteomics and mass spectrometry other than the general science grapevine and a short talk I heard today which inspired this post. If I described anything inaccurately, please help by leaving a comment. :)

Further reading:

McHugh L, Arthur JW (2008) Computational Methods for Protein Identification from Mass Spectrometry Data. PLoS Comput Biol 4(2): e12. doi:10.1371/journal.pcbi.0040012

4 Responses to Moving targets in proteomics

  1. Deepak says:

    The problem with proteomics, having lived around it, is that it’s a lot less mature. Instruments are expensive, temperamental, and the data can be noisy and hard to collect for the reasons you describe. The uptake has definitely taken a lot longer than anticipated, although the last couple of years have made me more optimistic (a couple of HUPOs ago, it was kinda depressing)

  2. shwu says:

    Deepak, yes, for sure. I glossed over pretty much all of the technical aspects to focus just on the fact that the data itself is different and harder to interpret.

    Of course, any time we increase the scale, we increase the complexity because it’s so much harder to isolate what we’re trying to study. Same for when we increase the physical scale – going from genes to proteins, to cells, to tissues, to organisms. But we’ll get there.

  3. Deepak says:

    We will. I think the “industry” learned some hard lessons in the proteomics space and got a reality check. But instruments are stabilizing, so I think the next few years should be very cool, especially with things like SILAC

  4. Wilka says:

    The huge amount of data being generated is certainly a problem, and we need better ways of collecting and analysing the data. However, there’s also a surprising number of people who still don’t see a problem with it. Thankfully that is starting to change as more people realise that the problems are bigger than they first thought. But I’ve spoken to people who almost seem to think it’s a good thing to get different results from doing the same thing – “It means you get more answers!”

    It doesn’t help that newer mass spec machines are generating even more data (100s of gigabytes from a single sample) before we’ve figured out a sensible way to deal with the amount of data generated from the older machines. Also, the amount of noise in the data is sometimes extremely high – so just gathering more of it isn’t going to be much use to anyone.
