Report on Murthy structures

WHAT_CHECK

The WHAT_CHECK software offers a great many validation options. Few of these are useful for detecting fraud, because when I started writing WHAT IF (of which WHAT_CHECK is one menu, mainly written by Rob Hooft, a brilliant theoretical crystallographer) I was young and foolish enough to think that fraud couldn't happen in a world as pure as the scientific one.

WHAT_CHECK is one option of the WHAT IF software. For WHAT IF an extensive website exists. The WHAT_CHECK website holds much information about WHAT_CHECK, and we are in the process of making a course on using WHAT_CHECK. Because of this the WHAT_CHECK site will be in shambles for a while, but feel free to come back to it occasionally and see how far we have got.

The WHAT_CHECK options can be classified in many ways. One way could, for example, be:

Geometric errors.

For things like bond lengths, bond angles, etc., we know the real values with great accuracy and precision thanks to small-molecule crystallography. For example, we know that the length of a Cα-Cβ bond in alanine is 1.521±0.033 Ångström. In due time we might find out that it should actually be 1.523±0.021, but that will do little harm to present-day conclusions. There is a twenty-year-long body of evidence to show that σ-values (i.e. standard deviations) on parameters we 'know' only get smaller. So, something we today call a 5σ deviation might in the year 2011 have become a 7σ deviation, but there is no chance it will become a 4σ deviation. As the accuracy and precision of these parameters are much higher than what can be achieved today by protein crystallography, comparisons with such parameters provide a gold standard.
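As a small illustration of what such a comparison amounts to, the sketch below expresses an observed bond length as a number of standard deviations from the reference value. The mean and σ are the alanine Cα-Cβ values quoted above; the function name and the observed length are hypothetical and chosen only for illustration, and this is not WHAT_CHECK's actual code.

```python
# Reference values quoted in the text for the alanine C-alpha to C-beta bond.
CA_CB_MEAN = 1.521   # Angstrom
CA_CB_SIGMA = 0.033  # Angstrom


def bond_z_score(observed, mean=CA_CB_MEAN, sigma=CA_CB_SIGMA):
    """Return how many standard deviations the observed length lies from the mean."""
    return (observed - mean) / sigma


# Hypothetical observed bond length (Angstrom), made up for this example.
observed_length = 1.686
z = bond_z_score(observed_length)
print(f"observed {observed_length:.3f} A deviates by {z:+.1f} sigma")
```

A bond that is 0.165 Ångström too long is thus roughly a 5σ deviation with today's σ, and would become a larger one if σ shrinks.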

Structural improbabilities.

There is no absolute standard for things like rotamer distributions, atomic clashes, or atomic packing distributions. In such cases we resort to internal calibration. By that I mean that we set up a scoring method and run it over 500-1000 high-quality X-ray structures. This gives us a score distribution. If that score distribution is adequately close to a normal distribution, it can be used for checking purposes: the score for any other protein (obtained, of course, with the same scoring method as used for the 500-1000 calibration structures) can be expressed as the number of standard deviations it lies away from the mean of the calibration distribution.
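The calibration idea can be sketched in a few lines. This is a minimal sketch and not the WHAT_CHECK implementation: the function names are hypothetical, the reference scores are made-up numbers standing in for the 500-1000 real calibration structures, and the test for normality of the score distribution is left out.

```python
import statistics


def calibrate(reference_scores):
    """Return the mean and sample standard deviation of the calibration scores."""
    mean = statistics.fmean(reference_scores)
    sigma = statistics.stdev(reference_scores)
    return mean, sigma


def z_score(score, mean, sigma):
    """Express a score as standard deviations away from the calibration mean."""
    return (score - mean) / sigma


# Made-up scores; a real calibration set would hold 500-1000 structures.
reference_scores = [0.8, 1.0, 1.2, 0.9, 1.1]
mean, sigma = calibrate(reference_scores)

# Hypothetical score for a protein that is being checked.
print(f"Z = {z_score(0.5, mean, sigma):+.2f}")
```

The only requirement is that the protein being checked is scored with exactly the same method as the calibration set.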

Other checks.

WHAT_CHECK knows about 100 checks that deal with symmetry-related problems. We warn about nomenclature errors. We report missing atoms and a whole series of about 25 other administrative problems. We warn about wrong residues, about flipped His, Asn, and Gln side chains, etc. All in all there are a little over 250 checks in this category. None of these checks, however, is good at detecting fraud.

A word about σ

In WHAT_CHECK we attach Z-scores to values. A Z-score is the number of standard deviations by which an observed value deviates from the expected mean. So, if we say that some observation is a 4σ deviation, we take a (very small) chance that we call something wrong that is correct. The table below (which I extracted from Wikipedia) gives, for a normally distributed quantity, the fraction of values that fall within the given number of σ of the mean, and so indicates how certain we are that something is actually wrong when we call it wrong.

1σ 68.3%
2σ 95.4%
3σ 99.7%
4σ 99.994%
5σ 99.99994%
6σ 99.9999998%
7σ 99.9999999997%

So, if something deviates from normal by 5σ we take a chance of 0.00006% that we call something wrong that is actually right.
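The percentages in the table follow from the normal distribution and can be recomputed with the error function. The sketch below uses only Python's standard math module; the function names are my own, and it assumes the quantity being checked really is normally distributed.

```python
import math


def fraction_within(n_sigma):
    """Fraction of a normal distribution within n_sigma of the mean (two-sided)."""
    return math.erf(n_sigma / math.sqrt(2))


def fraction_outside(n_sigma):
    """Fraction of a normal distribution more than n_sigma from the mean.

    erfc is used instead of 1 - erf to avoid rounding loss for large n_sigma.
    """
    return math.erfc(n_sigma / math.sqrt(2))


for n in range(1, 8):
    inside = 100 * fraction_within(n)
    outside = 100 * fraction_outside(n)
    print(f"{n} sigma: within {inside:.10f}%, outside {outside:.2e}%")
```

The "outside" column is the chance of calling something wrong that is actually right.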

It should be kept in mind that the number of such deviations we expect to find in one protein is roughly the number of opportunities for the event multiplied by the chance of each occurrence. So, if we have 1000 bonds in a protein, we very roughly expect three of those bonds to deviate by 3σ (1000 × 0.3% ≈ 3). Finding three such events is therefore normal. Finding 17 of them, on the other hand, is very, very abnormal.
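To put rough numbers on "normal" and "very, very abnormal", the sketch below computes the expected number of 3σ outliers among 1000 bonds and the binomial chance of finding at least a given number of them. It assumes that the bonds deviate independently of each other and follow a normal distribution, which is only an approximation for a real structure; the function names are hypothetical.

```python
import math


def two_sided_tail(n_sigma):
    """Chance that a normally distributed value lies more than n_sigma from the mean."""
    return math.erfc(n_sigma / math.sqrt(2))


def prob_at_least(k, n, p):
    """Binomial chance of k or more events in n independent trials of chance p.

    The upper tail is summed directly so that very small results keep precision.
    """
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


n_bonds = 1000                 # number of bonds, as in the example in the text
p_outlier = two_sided_tail(3)  # chance that one correct bond deviates by 3 sigma or more

print(f"expected number of 3-sigma outliers: {n_bonds * p_outlier:.1f}")
print(f"chance of at least 3 outliers:  {prob_at_least(3, n_bonds, p_outlier):.2f}")
print(f"chance of at least 17 outliers: {prob_at_least(17, n_bonds, p_outlier):.1e}")
```

Under these assumptions three outliers turn up in about half of all correct structures, whereas seventeen or more should essentially never occur by chance.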