Bioinformatics Seminars

Canon

In the seminar series and the associated practicals we try to focus on concepts and not on dumb facts. Nevertheless, knowing a few 'things' by heart is unavoidable, and this page is an attempt to list those 'things'. This page holds many of the terms introduced in the bioinformatics seminar series for second-year students, but also a series of terms, words, and concepts that I think every second-year science student should simply know, independent of the courses he or she has followed.

This list is not complete, so anything not on this list can still be part of the exam. The list is there to help you, but it is no more than that.

Sequence analysis

Two sequences are called homologous if they evolved from a common ancestor. In the sequence analysis world we often use the program CLUSTAL to align sequences, and the software BLAST to find homologous sequences in databases. SwissProt is the protein sequence database with well-curated and well-annotated protein sequences. The words BLAST, CLUSTAL, SwissProt, and curation are not exam material, but I guess it helps if you at least recognize the words at exam time.

Residues have characteristics like hydrophobicity or charge. You do not need to learn the amino acids by heart, but it often helps to know the special ones like cysteine (that can form a cysteine bridge), or aspartic acid and glutamic acid that, as the name already suggests, are acids and thus good at binding positively charged ions like calcium.

Sometimes residues have functional roles, like binding a ligand, a co-factor, a drug, a partner in a multimeric contact, or being part of the active site.

When looking at a Multiple Sequence Alignment (MSA) you can see patterns that have an evolutionary meaning. Examples can be found in correlated mutations, and in high or low sequence identity and/or variability. The latter can be expressed in multiple ways, for example as sequence entropy. Sometimes residues correlate with external factors; in that case they are probably involved in the function related to that external factor.
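As a small illustration of how variability can be expressed as sequence entropy, the sketch below computes the Shannon entropy of every column in an alignment. The function names and the toy alignment are invented for this example, and ignoring gaps is just one possible convention.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (in bits) of one alignment column; gaps are ignored."""
    residues = [aa for aa in column if aa != "-"]
    if not residues:
        return 0.0
    counts = Counter(residues)
    total = len(residues)
    # sum of p * log2(1/p) over all residue types seen in this column
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def alignment_entropies(msa):
    """Entropy per column for a list of equal-length aligned sequences."""
    return [column_entropy(col) for col in zip(*msa)]


# Toy alignment, invented for illustration only.
toy_msa = [
    "MKCAD",
    "MRCAE",
    "MKCGD",
    "MHCAE",
]
entropies = alignment_entropies(toy_msa)
```

A fully conserved column (the M and the C here) gets entropy 0, while a column with more residue types gets a higher value.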

Homology Modelling

The basic concept of homology modelling is that you have the sequence of a protein with unknown structure, and you have a homologous protein for which the structure is known. The latter we call the template for the homology modelling process. After obtaining and optimising the sequence alignment, the conserved residues are kept in full, but for the non-conserved ones the side chains are removed so that only the backbone is left. At this stage deletions are introduced. These normally are observed in loops and seldom in areas with a regular secondary structure like helix or strand. Deletions and insertions tend to occur much more often at the protein's surface than in its hydrophobic core. Often rotamer libraries are used to predict the side chain conformations of the residues that need to be newly introduced because they are not conserved between the template and the model structure.
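The bookkeeping behind these steps can be sketched in a few lines. The code below takes a pairwise alignment of template and model sequence and decides, per column, what would happen to the residue. The function name, the action labels, and the toy alignment are assumptions made for this example; real modelling software does far more than this.

```python
def plan_residues(template_aln, model_aln):
    """Return one action per alignment column.

    'keep'    : residue is conserved, copy the full template residue
    'rebuild' : residue differs, keep the backbone and rebuild the side chain
    'insert'  : gap in the template, the model residue must be built from scratch
    'delete'  : gap in the model, the template residue is dropped
    """
    if len(template_aln) != len(model_aln):
        raise ValueError("aligned sequences must have the same length")
    actions = []
    for t, m in zip(template_aln, model_aln):
        if t == "-" and m == "-":
            continue  # column with gaps only, nothing to do
        if t == "-":
            actions.append("insert")
        elif m == "-":
            actions.append("delete")
        elif t == m:
            actions.append("keep")
        else:
            actions.append("rebuild")
    return actions


# Toy alignment, invented for illustration only.
plan = plan_residues("ACD-EFG", "ACNWE-G")
```

The residues marked 'rebuild' and 'insert' are the ones for which a rotamer library would be consulted.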

Template structures can be determined experimentally with either X-ray or NMR. X-ray tends to produce protein structures that are more precise and more accurate, but X-ray requires that you have a crystal of your macromolecule, and that causes artefactual crystal contacts. The quality of X-ray structures is related to the resolution of the reflection data; 1.0 Ångström resolution is very good, while 3.0 Ångström is poor. The quality of NMR structures is determined by the number and quality of the so-called NOEs. NMR has some technical problems that make it difficult to solve very large structures, so NMR spectroscopists often solve big proteins one domain at a time. X-ray crystallography doesn't allow you to see parts of the protein that are mobile. NMR has difficulties seeing ions, while with X-ray you often don't see which ion is actually there. Structures solved by X-ray or NMR must be deposited in the PDB.

The Sander and Schneider plot explains when homology modelling is possible (and when not) given the length of the alignment and the percentage sequence identity.
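As a rough sketch of how the plot is used: the curve is commonly quoted as a threshold t(L) = 290.15 * L^(-0.562) percent identity for alignment lengths L between 10 and 80, and a flat 24.8% for longer alignments. Treat these constants as an assumption to check against the original publication; the function names are invented for this example.

```python
def identity_threshold(alignment_length):
    """Percentage identity above which structural similarity is expected.

    Constants are the commonly quoted form of the Sander and Schneider
    curve (an assumption here, not taken from this page).
    """
    if alignment_length < 10:
        raise ValueError("alignment too short for the curve to apply")
    if alignment_length >= 80:
        return 24.8
    return 290.15 * alignment_length ** -0.562


def modelling_looks_feasible(alignment_length, percent_identity):
    """True if the alignment lies above the threshold curve."""
    return percent_identity > identity_threshold(alignment_length)
```

For example, `modelling_looks_feasible(150, 40.0)` is true, while a short alignment of 30 residues at 30% identity falls below the curve.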

After the model has been completed, it can be optimized with molecular dynamics, and should be validated with software that has been designed for just that purpose.

Structure validation

When we validate protein structures or models, we look for things that are unlikely. That means that we first need to know what is likely, and determine an average and standard deviation for the likely values. We can now define a Z-score for any event as the number of standard deviations that that event deviates from the average. Often these standard values are determined using PDB files solved at high resolution, but when possible we prefer to use small molecule structures from the CSD because they are solved with much higher accuracy and precision than PDB files.

The difference in experimental methods used to determine structures causes a difference in the way we should validate those structures. In X-ray crystallography one obtains reflections. Each reflection holds information about every atom in the structure. A Fourier transformation converts the reflections into electron density and the atoms can then be placed in that density. If one reflection is wrong, all atoms will become a tiny bit wrong. With NMR, on the other hand, one mainly measures NOEs, and these hold information about the distance between two atoms. In this case, a single wrong NOE causes a local error in the final structure.
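The global effect of one wrong reflection can be made visible with a one-dimensional toy model. The sketch below builds a tiny 'density' with two atoms, Fourier transforms it, corrupts a single coefficient (together with its symmetry mate, so that the density stays real), and transforms back. Everything here (the density values, the size of the error) is invented for illustration; real crystallographic data are three-dimensional and the phases are not measured directly.

```python
import cmath


def dft(values):
    """Plain discrete Fourier transform (density -> 'reflections')."""
    n = len(values)
    return [sum(values[x] * cmath.exp(-2j * cmath.pi * k * x / n)
                for x in range(n)) for k in range(n)]


def inverse_dft(coeffs):
    """Inverse transform ('reflections' -> density), real part only."""
    n = len(coeffs)
    return [sum(coeffs[k] * cmath.exp(2j * cmath.pi * k * x / n)
                for k in range(n)).real / n for x in range(n)]


# Toy 1D density with two 'atoms'.
density = [0.0, 0.0, 5.0, 0.0, 0.0, 0.0, 3.0, 0.0]
reflections = dft(density)

# Corrupt one reflection (amplitude 0.8, phase 0.3 rad) and its mate.
error = cmath.rect(0.8, 0.3)
corrupted = list(reflections)
corrupted[1] += error
corrupted[-1] += error.conjugate()

wrong_density = inverse_dft(corrupted)
shifts = [abs(a - b) for a, b in zip(density, wrong_density)]
changed_points = sum(1 for s in shifts if s > 1e-9)
max_shift = max(shifts)
```

Every grid point changes a little, and none changes a lot, which is the opposite of the local damage a single wrong NOE does.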

In structure validation we often use statistics on experimental data to figure out what is right and what is wrong, and normally we do that using force field technology. A very simple force field is the Engh and Huber force field for determining the quality of bond lengths. For each bond length we know the ideal average and standard deviation from a study of structures in the CSD. So if we now want to score one bond length, we can determine its Z-score (i.e. the number of standard deviations that that bond length deviates from the ideal average). That Z-score relates to the chance that that bond length is true. And using Boltzmann's law we can then determine a pseudo energy value for that bond length.
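A minimal sketch of this scoring, assuming the bond lengths follow a normal distribution: the Z-score follows directly from the average and standard deviation, and inverting Boltzmann's law (E = -kT ln P, with P taken relative to the most likely value) gives a pseudo energy of z²/2 in units of kT. The ideal values below are placeholders for illustration, not the actual Engh and Huber parameters.

```python
def z_score(observed, ideal_mean, ideal_sd):
    """Number of standard deviations the observation is away from ideal."""
    return (observed - ideal_mean) / ideal_sd


def pseudo_energy(z, kt=1.0):
    """Boltzmann inversion of a Gaussian: -kT * ln(exp(-z*z/2)) = kT * z*z/2."""
    return 0.5 * kt * z * z


# Placeholder 'ideal' values, for illustration only.
IDEAL_MEAN = 1.330  # Angstrom
IDEAL_SD = 0.014    # Angstrom

z = z_score(1.372, IDEAL_MEAN, IDEAL_SD)
energy = pseudo_energy(z)
```

A bond that is three standard deviations too long thus costs 4.5 kT, while a perfectly ideal bond costs nothing.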

Force Field

There are good historical reasons for calling a force field a force field. Today that name occasionally sounds crazy as often neither forces nor a field are involved. Sorry, live with it.

A force field is a set of data with an associated algorithm that allows you to score a situation, fact, or event, and often also to predict the future of that situation, fact, or event.

Force fields come in several flavours.

Molecular force fields run from quantum chemistry to molecular dynamics, with everything in between that compromises between accuracy and speed.
Self-consistent field methods, also called finite element methods, tend to be used in electrostatic calculations.
Simple force fields like the one made by Engh and Huber come in two variants: a) based on an arbitrary truth (like using the CSD to determine things and scoring PDB files with it); b) based on self, i.e. score a series of PDB files to find what is normal for contacts, surfaces, ion binding, etc., and score other PDB files using that realisation of "normal".
Non-molecular force fields like the one Chou and Fasman made to predict the secondary structure of proteins.
Other, exotic, force fields.

Molecular dynamics

Molecular dynamics is a computer technique that can be used to simulate what motions a protein (or, more generally, a macromolecule) can make.

In molecular dynamics Newton's laws of motion are worked out using a very much simplified force field for the atomic interactions. This force field generally includes torsion angles, bond angles, bond lengths, electrostatic interactions, and Van der Waals interactions. These all have their associated formulas that relate their actual value to forces and to an energy. It is just as important to realize which forces are incorporated in these force fields as it is to know which ones are not being used.
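To give an idea of what such formulas look like, here is a sketch of three textbook functional forms: a harmonic bond term, a Lennard-Jones term for the Van der Waals interaction, and a Coulomb term for electrostatics. The parameter names and the unit-free constants are assumptions for this example; every real force field has its own functional forms and parameter tables.

```python
def bond_energy(r, r0, k):
    """Harmonic bond stretching energy around the ideal length r0."""
    return 0.5 * k * (r - r0) ** 2


def bond_force(r, r0, k):
    """Force along the bond: minus the derivative of bond_energy."""
    return -k * (r - r0)


def lennard_jones(r, epsilon, sigma):
    """Van der Waals energy: repulsive at short range, attractive further out."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)


def coulomb(r, q1, q2, k_e=1.0):
    """Electrostatic energy between two point charges (k_e sets the units)."""
    return k_e * q1 * q2 / r
```

Torsion and bond angle terms follow the same pattern: a simple formula that turns a geometric value into an energy, and its derivative that gives the force.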

Some important aspects of molecular dynamics are the time step, the initial speed that separates molecular dynamics from energy minimisation, the things done to speed up the simulations so that near-realistic simulation times can be achieved, and the fact that in silico one can do alchemy, like thermodynamic cycles to estimate, for example, binding constants for ligands or the stabilisation obtained from a mutant.
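The role of the time step and of the speed can be sketched with one particle on a spring. The code below uses the velocity Verlet scheme (one common integrator, chosen here as an assumption; the seminars do not prescribe one) for the dynamics, and a plain steepest descent for the minimisation. All names and numbers are invented for this example.

```python
import math


def force(x, k=1.0):
    """Force of a harmonic spring centred at x = 0."""
    return -k * x


def total_energy(x, v, k=1.0, m=1.0):
    """Kinetic plus potential energy of the particle."""
    return 0.5 * m * v * v + 0.5 * k * x * x


def velocity_verlet(x, v, dt, steps, m=1.0):
    """Molecular dynamics: positions and velocities advance by dt per step."""
    f = force(x)
    for _ in range(steps):
        v_half = v + 0.5 * dt * f / m
        x = x + dt * v_half
        f = force(x)
        v = v_half + 0.5 * dt * f / m
    return x, v


def steepest_descent(x, step_size, steps):
    """Energy minimisation: no velocities, just walk downhill along the force."""
    for _ in range(steps):
        x = x + step_size * force(x)
    return x


x_md, v_md = velocity_verlet(x=1.0, v=0.0, dt=0.01, steps=1000)
x_min = steepest_descent(x=1.0, step_size=0.1, steps=200)
```

With speeds the particle keeps oscillating and the total energy stays close to its starting value; without speeds it simply slides to the minimum at x = 0. Making dt much larger makes the integration inaccurate and eventually unstable, which is why the time step matters.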

Please realize that energy is the integral of the force over distance (and vice versa, the force is the derivative of the energy with respect to position, with a minus sign for a potential energy).
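Written out for a single coordinate x, and using a harmonic bond with force constant k and ideal length x_0 as an assumed example, the relation between potential energy and force reads:

```latex
E(x) = -\int_{x_0}^{x} F(x')\,\mathrm{d}x' ,
\qquad
F(x) = -\frac{\mathrm{d}E}{\mathrm{d}x}

% Example: harmonic bond
E(x) = \tfrac{1}{2}\,k\,(x - x_0)^2
\quad\Longrightarrow\quad
F(x) = -k\,(x - x_0)
```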

Drug docking

Drug docking is one aspect of the much wider concept of drug design. Drug design traditionally consists of a series of steps that include:

Choosing a disease
Finding the target (mostly a protein)
Searching for an initial binder, the so-called lead compound. This is the step where docking comes in, often as one step in virtual screening.
Optimising the lead
Testing on animals
Testing on healthy humans
Testing on patients
Marketing, bribing doctors, paying bonuses

Flexibility is a crucial aspect of docking. This can be the flexibility of the ligand or the flexibility of the protein. The latter cannot yet be taken into account fully because of CPU time limitations, but rotamer searches can help, and sometimes B-factors and alternate positions in X-ray structures can also suggest where alternate rotamers should be used in the docking process.

Summary

This page is a first attempt at making a canon for the bioinformatics seminar series. Undoubtedly this canon is still missing a number of terms. Feel free to mail me suggestions for words or terms to include.

This canon does not describe what you should know for the exam. That is much more than is written here, of course, but it functions a bit as a dictionary for terms used in the seminars and practicals that you might want to look up while studying. Except for a few things that were explicitly mentioned, all terms described in this canon are likely to show up in the exam.