While many scientific disciplines face huge difficulties when trying
to experimentally validate theoretical predictions, protein modelling
is in a fortunate situation: since 1994, the biennial CASP ('Critical
Assessment of Structure Prediction') contests (Moult et al. 2007)
have provided an ideal opportunity to evaluate the accuracy of the
many available protein structure prediction methods. During each CASP
season (lasting about four months once every two years), about 200
research groups try to predict the structures of ~100 proteins, the
CASP targets. The target sequences are provided to CASP by structural
biology labs just before the corresponding structures are solved. The
predictions are thus true blind predictions, which makes it possible
to measure performance on realistic test cases and to locate areas of
progress as well as unsolved problems.
CASP regularly shows that the eight homology modelling steps summarized here yield reliable models in many cases, from which many structural and functional insights can be derived. Unfortunately, these eight steps alone are not sufficient to solve the protein structure prediction problem via homology modelling, even once enough templates become available.
The figure shows CASP8 targets T0498 and T0499: both proteins are 56 amino acids long, 53 of which are conserved (95% sequence identity). Still, the two structures are entirely different; just three point mutations completely change the fold. While this is an extreme example of human protein engineering (He et al. 2008), naturally occurring proteins with similar sequences also often show surprising structural diversity (Kosloff and Kolodny 2008), causing classic homology modelling to fail. The prion protein (Prusiner 1998) and other amyloid-forming proteins provide an even more dramatic case: here, 100% identical sequences can adopt two totally different structures. Clearly, the homology modelling problem is tightly intertwined with the more general protein folding problem itself. Even if a close template is available, there can always be structurally diverging regions, which are either expected because of a poor local alignment, or caused unexpectedly by critical point mutations or widely differing crystal packing contacts.
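The identity figure quoted above follows directly from the counts: 53 identical positions out of 56 give about 94.6%, which rounds to 95%. The short Python sketch below shows this arithmetic for two gap-free, equal-length sequences. The function name `percent_identity` and the toy sequences are hypothetical placeholders chosen for illustration; they are not the actual T0498 and T0499 sequences.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percentage of positions with identical residues.

    Assumes the two sequences are already aligned without gaps
    and therefore have the same length.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)


# Toy placeholder sequences (NOT the real CASP targets):
# 56 residues, differing at exactly three positions.
toy_a = "A" * 56
toy_b = "A" * 53 + "GGG"

print(round(percent_identity(toy_a, toy_b), 1))  # 53/56 -> 94.6
```

As the two CASP8 targets illustrate, such a high identity value says nothing about which three positions differ, which is why it cannot by itself guarantee a shared fold.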
The only way to handle these difficult cases is to apply more general ab initio folding algorithms, which do not depend on template structures, but try to simulate the complete folding process starting from the stretched-out conformation. As it turns out, this 'one-algorithm-for-everything' approach is currently the most successful one at CASP (Chivian et al. 2003; Pandit et al. 2006): if available, it uses known templates (or fragments thereof) only to guide the search, but does not depend on them. As a side effect, this makes it possible to build hybrid models that combine the best parts of multiple templates.
Despite these encouraging developments, the protein folding problem is far from solved. The best models are still built by those who got the alignment right in the first place, which unfortunately implies that structural diversity is often missed: one cannot yet ignore the difficult-to-align regions and simply predict them with ab initio folding instead. The sequence alignment problem will thus remain an active research field for years to come.
Noteworthy progress has been made in model optimization, which aims to bridge the structural gap between the initial model and the target. In the early days of CASP, predictors were well advised to keep the backbone of their model fixed (the 'frozen core approach'), simply because the danger of degrading the model was too large. The situation is quite different today: force field accuracy (Krieger et al. 2004) and sampling efficiency (Misura et al. 2006) have improved to a level that allows well-performing methods like Modeller-CSA (Joo et al. 2008), Rosetta (Chivian et al. 2003), undertaker (Vriend 1990), and YASARA (Krieger et al. 2009) to free all atoms during refinement, often moving models considerably closer to the target.
While homology modelling currently focuses on the protein in a model, other entities such as carbohydrates, small molecules, and ions also make up important parts of certain proteins and protein complexes. For instance, the zinc ions in so-called zinc fingers are important for the stability of the protein, and a common protein like haemoglobin would be useless without its haem groups and the iron atoms therein. Carbohydrates in glycoproteins perform numerous functions, ranging from providing stability to signalling and labelling for intracellular transport (Lütteke 2009). The many roles of non-protein entities make it obvious that homology modelling should look beyond the protein. A complete model should thus be more than a 3D representation of an amino acid sequence. One major challenge for homology modelling is recognizing binding sites for non-protein entities.
Drug docking software (e.g. Rarey et al. 1996; Nabuurs et al. 2007) can be used to detect the binding sites of compounds such as haem groups or co-enzymes. However, relevant biological information is needed to select the compounds that may be bound to the protein. Copying the binding site from the template structure is the simplest method, but it does not work for ab initio folding models. For such models, spectroscopic analysis of the protein can provide insight into which compounds are bound. This approach is not limited to homology modelling; X-ray crystallography can also benefit from spectroscopic analysis of a protein to identify a bound compound (Chen et al. 2002).
Incorporating ions can be an additional step of the modelling process.
Nayal and Di Cera (1996) have suggested a method to detect sodium binding
sites in protein structures which can be extended to detect various
other ion binding sites. Of course, any additional experimental data
can guide this ion site detection process. In particular, tightly bound
functional ions that co-purify with the protein can be detected by
means of spectroscopic analysis. A significant number of PDB files
have bound ions or water molecules that were erroneously assigned.
We have observed
The power of force-field-based model optimization methods can be reduced significantly when such assignment errors involve a difference in ionic charge. It is therefore very important to (experimentally) validate the ions in template structures when these are important for the final homology model.
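To make the idea of detecting ion binding sites from a structure more concrete, the following Python sketch scores candidate positions with a simple bond-valence sum over nearby oxygen atoms, in the spirit of the valence-based screening mentioned above. It is a minimal illustration and not the published implementation: the parameter values `R0_NA_O` and `N_NA_O`, the distance cutoff, the acceptance threshold, and all function names are assumptions made for this example and should be checked against the original publication before any real use.

```python
import math

# ASSUMED bond-valence parameters for Na+...O contacts (illustrative only).
R0_NA_O = 1.622  # reference distance in Angstrom (assumption)
N_NA_O = 4.29    # exponent (assumption)


def valence(point, oxygens, r0=R0_NA_O, n=N_NA_O, cutoff=4.0):
    """Bond-valence sum at `point` over oxygen positions within `cutoff`.

    Each oxygen at distance d contributes (d / r0) ** (-n).
    Coordinates are (x, y, z) tuples in Angstrom.
    """
    total = 0.0
    for oxygen in oxygens:
        d = math.dist(point, oxygen)
        if 0.0 < d <= cutoff:
            total += (d / r0) ** (-n)
    return total


def candidate_sites(points, oxygens, threshold=1.0):
    """Return the points whose valence sum reaches `threshold`.

    A threshold near 1.0 (the formal charge of Na+) is an assumption.
    """
    return [p for p in points if valence(p, oxygens) >= threshold]


# Hypothetical test geometry: six oxygens in an octahedron, 2.4 Angstrom
# from the origin, roughly mimicking an ideal coordination shell.
octahedron = [
    (2.4, 0.0, 0.0), (-2.4, 0.0, 0.0),
    (0.0, 2.4, 0.0), (0.0, -2.4, 0.0),
    (0.0, 0.0, 2.4), (0.0, 0.0, -2.4),
]

# The centre of the shell is accepted; a distant point is rejected.
print(candidate_sites([(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)], octahedron))
```

In practice the candidate points would come from a grid over the model or from the positions of water molecules in the template, and the same scheme could be adapted to other ions by substituting their parameters. A score like this can also flag template waters that may in fact be ions, which ties in with the validation step recommended above.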
Carbohydrates can be modelled at the final stage of the homology modelling process, but this does not always reflect the protein folding process. Carbohydrates are not only added post-translationally, but also co-translationally, while the protein is still being synthesized by the ribosome. They are important in the protein folding process and in the detection of misfolded proteins (Parodi 2002). It may therefore prove interesting to add the necessary carbohydrates to the unfolded protein before ab initio folding. Apart from their role in protein folding, carbohydrates are sometimes important in the oligomerisation of proteins. For instance, the neuraminidase protein from influenza shows different glycosylation states in its monomeric, dimeric, and tetrameric states. The carbohydrates in the tetrameric state provide extra stability and, in the case of the Spanish flu virus, resistance to trypsin digestion, leading to increased virulence (Wu et al. 2009).
This shows the vital (and sometimes lethal) importance of considering carbohydrates in homology models.