Clustering protein microenvironments – best metrics?
August 27, 2008
My day job has me working on various aspects of protein function prediction and annotation: most prominently, building 3D models of functional sites from 1D sequence motifs that can be used to predict function in protein structures, and most recently, using the literature to annotate clusters of microenvironments derived from protein structures for use in building models of novel functional sites. Right now I'm continuing to work on the cluster annotation problem, but I'm also trying to collect a set of high-confidence function predictions on newly solved structural genomics targets, in the hopes that we can secure an experimental collaborator and a PSI grant to validate them.
Sometimes evaluating the predictions is easy – there are only a few for that protein structure, and maybe you can rule them out because the model itself has low performance or the scores of the predictions don't fall within the range of known examples. But other times there are dozens of predictions from different models for the same structure: some look good, some look close to good, and some are dubious. When a structure hits so many models you immediately think two things: 1) that protein does SOMETHING, all right! 2) how do we make heads or tails of these predictions?
One thing we've thought of is to group the models by similarity. If the majority of the predictions come from a group of models that is similar in some way, that can provide some rationale for the plurality of predictions and boost your confidence in that general class of function. But how to group the models? This turns out to be somewhat nontrivial, as the functions the models represent include both catalytic sites and ligand binding sites, some of which cannot be classified together (e.g. using EC numbers). Just knowing what the function is (oxidoreductase vs. cation binding vs. protease, etc.) can help, but only generally; in any case, these approaches seem too informal and arbitrary.
So I thought about using the data in the models themselves, which is more formal (though not necessarily any less arbitrary). Each model consists of a matrix listing features (physicochemical properties at various distances from the functional site center) along with whether that feature is significantly enriched, significantly depleted, or not significantly different in positive examples compared to negative examples. I could represent each model as a vector and use any of a whole suite of similarity metrics to determine the distance between any two models, which could then be used to arrange the models in a hierarchy. Now the questions are – how to represent the models? What metric is most appropriate? And what clustering method to use? So far I am converting the models into vectors of 1’s (if that property is significantly enriched), -1’s (if that property is significantly depleted), and 0’s (if that property is not significant), and have tried a number of the metrics available in the Cluster 3.0 program, but I am not sure how to assess the results. There are also different options for the hierarchical clustering – single linkage, complete linkage, average linkage, etc.
Right now I have very little intuition as to the best representation for the models, or the most appropriate distance metric and clustering method. Or maybe manual grouping is the way to go after all (but as a bioinformatics person, I resist that notion). Anyone else have any insight?
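One heuristic for the "how to assess the results" problem, which I haven't settled on but which is cheap to try: the cophenetic correlation coefficient measures how faithfully a dendrogram preserves the original pairwise distances, so it gives one rough way to compare linkage methods on the same data. A sketch on random stand-in data (real models would replace the random matrix):

```python
# Compare linkage methods by cophenetic correlation: how well do the
# dendrogram's merge heights correlate with the original distances?
# The data here are random {-1, 0, 1} vectors standing in for models.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, cophenet

rng = np.random.default_rng(0)
X = rng.choice([-1, 0, 1], size=(20, 30))  # 20 fake models, 30 features
D = pdist(X, metric='cityblock')

for method in ('single', 'complete', 'average'):
    Z = linkage(D, method=method)
    c, _ = cophenet(Z, D)  # correlation in [-1, 1]; higher = more faithful
    print(f'{method}: {c:.3f}')
```

This doesn't tell you which clustering is biologically meaningful, of course – only which hierarchy distorts the chosen distance metric the least – but it at least turns "which linkage?" into something quantifiable.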