Moving targets in proteomics
February 18, 2009
Proteomics – the large-scale study of proteins – often doesn’t get as much attention as its older sibling, genomics. But its potential applications are just as important, if not more so, because proteomics deals with the physical nuts and bolts of biological systems – the proteins themselves – rather than the blueprints. To be sure, DNA plays a huge role in many different ways, but the macromolecules bustling around inside the cell deserve their own special credit.
Most of us working in bioinformatics hear much more about genomics than proteomics on a regular basis. The field of genomics is more mature and the problems have been around long enough to be fairly well-defined. Proteomics, however, is messy in a number of ways. While the genome remains more or less constant across conditions, the proteome can change drastically depending on when you assay it and from where. The same gene sequence can produce multiple different proteins. Proteins can also undergo many chemical modifications which change their function. On top of this, identifying specific proteins in high-throughput assays is still a challenge.
Yet the potential benefits of proteomics research are tantalizing. For instance, we could develop non-invasive diagnostics – instead of performing tissue biopsies, we could collect samples of blood or urine to identify whether someone has a particular disease, is at risk for organ rejection after a transplant, or is responding well to a treatment. So what some proteomics researchers are trying to do is identify potential biomarkers for a condition. They do this by studying the differences between the sample population of interest – burn patients who suffered multiple organ failure, say – and an appropriate control population – burn patients who did not suffer multiple organ failure.
In a typical scenario, the samples might be analyzed using mass spectrometry, where proteins are broken up into smaller fragments (peptides) whose masses are measured and used to infer the identities of the original proteins. In most cases the peptides map unambiguously to specific proteins, but a fraction cannot be resolved with complete certainty and are usually excluded from the analysis. (This is separate from issues regarding the methods used to identify the proteins, the quality of the raw data, and difficulties with the instrumentation itself, which I won’t be discussing here.) Once we’ve identified the proteins in the sample, we can associate them with the intensity levels of the detected peptides, producing an expression vector (of sorts) for each sample analyzed.
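To make that last step concrete, here is a minimal sketch of rolling peptide-level intensities up into a per-sample protein expression vector. The peptide sequences, accession names, and intensities are all invented for illustration; real pipelines handle ambiguity in more sophisticated ways, but the sketch mirrors the simple practice described above of dropping peptides that map to more than one protein:

```python
from collections import defaultdict

# Hypothetical peptide-level results: (peptide, candidate proteins, intensity).
# All identifiers and numbers are made up for illustration.
peptide_hits = [
    ("LVNELTEFAK", {"ALBU_HUMAN"}, 1.8e6),
    ("QTALVELVK",  {"ALBU_HUMAN"}, 9.2e5),
    ("DLGEENFK",   {"ALBU_HUMAN", "FETA_HUMAN"}, 4.1e5),  # ambiguous -> excluded
    ("AEFAEVSK",   {"TRFE_HUMAN"}, 3.3e5),
]

def expression_vector(hits):
    """Sum intensities of unambiguously mapped peptides per protein."""
    totals = defaultdict(float)
    for _, proteins, intensity in hits:
        if len(proteins) == 1:        # keep only peptides mapping to one protein
            (protein,) = proteins
            totals[protein] += intensity
    return dict(totals)

vector = expression_vector(peptide_hits)
```

Here the ambiguous peptide contributes nothing, so the vector covers only the two cleanly identified proteins.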
Given these expression vectors, we want to do the same things we do to gene expression data – namely, normalize the data and find proteins showing statistically significant differential expression between sample groups. Unfortunately, we can’t just apply the same methods; the data, although identical in theory, is actually quite different in practice. Missing values, merely an annoyance in gene expression data, can be a serious problem in protein expression data, and the reason why is related to the general difficulties of studying proteins.
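As a toy illustration of the normalization step, here is a sketch (all numbers invented) that rescales each sample so its median intensity matches across samples – one of several reasonable choices – computing medians over observed values only so that missing entries are skipped rather than treated as zeroes:

```python
# Toy protein-by-sample intensity matrix; None marks a missing value.
matrix = {
    "P1": [100.0, 210.0, 95.0],
    "P2": [50.0,  None,  48.0],
    "P3": [200.0, 420.0, None],
}
n_samples = 3

def median(xs):
    s = sorted(xs)
    mid = len(s) // 2
    return s[mid] if len(s) % 2 else (s[mid - 1] + s[mid]) / 2

# Per-sample medians, computed over observed (non-missing) values only.
sample_medians = []
for j in range(n_samples):
    col = [row[j] for row in matrix.values() if row[j] is not None]
    sample_medians.append(median(col))

# Rescale every sample's observed values to a common target median;
# missing values stay missing rather than being silently filled in.
target = median(sample_medians)
normalized = {
    p: [v * target / sample_medians[j] if v is not None else None
        for j, v in enumerate(row)]
    for p, row in matrix.items()
}
```

Note how the missing values propagate through unchanged – whether and how to fill them in is exactly the question the next paragraph takes up.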
To understand why, consider this: mass spectrometry is a great way to assay a large number of proteins in high throughput, but there is no way to know ahead of time which proteins we expect to find. We don’t have a pre-determined array of possibilities to fill in; rather, we construct the array from the experimental data. With multiple samples, the array consists of every protein detected anywhere in the experiment along one axis and samples along the other, with the intensity of a particular protein in a particular sample filling in the matrix. As we add more samples, the proportion of missing values actually increases. Low-abundance proteins are the likely culprit: each additional sample increases the chance that some low-abundance protein is detected for the first time. When that happens, the protein is added to the data matrix, and the entries for every sample in which it was not detected become missing values. So how do we handle these missing values? Should they be converted to zeroes? Set to a “minimum detection value”? Set to something else? If the lack of detection in the other samples really is due to low abundance, then one of the first two solutions makes sense.
This seemingly mundane example made me realize two things. One is that in proteomics we are essentially trying to study a moving target. This is extremely hard. We go from working in one dimension to working in four – and we can’t do it in just one step. Things happen in proteomics assays that don’t happen in genomics assays and we have to develop a whole new set of tools. This is the second lesson: no matter how much informaticians want to generalize, we can’t always apply the same methods blindly. Even seemingly simple problems like missing values can have different causes and different consequences, depending on the biological context.
Note: I have no background in proteomics and mass spectrometry other than the general science grapevine and a short talk I heard today which inspired this post. If I described anything inaccurately, please help by leaving a comment. :)
McHugh L, Arthur JW (2008) Computational Methods for Protein Identification from Mass Spectrometry Data. PLoS Comput Biol 4(2): e12. doi:10.1371/journal.pcbi.0040012