Answer to a question in a bioinformatics course

Answer:

1) A Z-score is the number of standard deviations by which an observation lies away from the mean: Z = (observation - mean) / standard deviation. So, if the C=O distance has been observed to be 1.232 +/- 0.023 Ångström, then a C=O bond with a length of 1.255 Ångström has a Z-score of (1.255 - 1.232) / 0.023 = 1.0. Obviously, one must be certain about the reference data 1.232 +/- 0.023 Ångström, because if they are wrong everything that follows will be wrong. That is why E&H used the CSD (remember what that is?): data in the CSD are so much more precise than data in the PDB that, for PDB structure validation, the CSD-derived data can for all practical purposes be called correct.
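
As a small illustration of the definition in point 1, the calculation could be written as below. This is only a sketch; the function name `z_score` and its parameter names are made up for this example, and the numbers are the C=O values quoted above.

```python
def z_score(observed, mean, sd):
    """Number of standard deviations that `observed` lies away from `mean`."""
    return (observed - mean) / sd


# Reference C=O values quoted in the text: 1.232 +/- 0.023 Angstrom.
z = z_score(1.255, mean=1.232, sd=0.023)
print(f"{z:.2f}")  # prints 1.00
```

A bond shorter than the reference mean gets a negative Z-score, which is why the validation step looks at |Z| rather than Z.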

2) So, the data in the E&H FF consist of CSD-derived bond lengths with their standard deviations.

3) The validation algorithm is simple:
a) Measure all bond lengths in the protein that you want to validate;
b) Compute the Z-score of each bond length;
c) Report any bond length with a |Z| > 4.0 (or 3.0 if you want to get picky);
d) Determine the RMS of all Z-scores (the RMS-Z) and report if the RMS-Z deviates significantly from 1.0. (The latter is a minor detail that is useful to know: if the bond lengths are normally distributed around the reference mean with the reference standard deviation, then their RMS-Z is 1.0; please talk with the assistants if you don't understand this.)
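
Steps a) to d) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the actual validation software: the `REFERENCE` table, the function `validate_bonds` and the example bond lengths are all hypothetical, and only the C=O reference values (1.232 +/- 0.023 Ångström) come from the text above. Step a), measuring the bond lengths in a real structure, is assumed to have been done already.

```python
import math

# Hypothetical reference table: bond type -> (mean, standard deviation) in Angstrom.
# Only the C=O entry is taken from the text; a real table would hold all bond types.
REFERENCE = {"C=O": (1.232, 0.023)}


def validate_bonds(bonds, reference=REFERENCE, cutoff=4.0):
    """Return (outliers, rms_z) for a list of (label, bond_type, length) tuples."""
    if not bonds:
        raise ValueError("no bond lengths to validate")
    z_scores = []
    outliers = []
    for label, bond_type, length in bonds:
        mean, sd = reference[bond_type]
        z = (length - mean) / sd           # step b: Z-score of this bond length
        z_scores.append(z)
        if abs(z) > cutoff:                # step c: report |Z| above the cutoff
            outliers.append((label, z))
    # step d: root mean square of all Z-scores
    rms_z = math.sqrt(sum(z * z for z in z_scores) / len(z_scores))
    return outliers, rms_z


# Made-up bond lengths, for illustration only (step a is assumed done).
bonds = [
    ("res 1", "C=O", 1.232),
    ("res 2", "C=O", 1.255),
    ("res 3", "C=O", 1.209),
    ("res 4", "C=O", 1.347),
]
outliers, rms_z = validate_bonds(bonds)
for label, z in outliers:
    print(f"{label}: Z = {z:.1f}")
print(f"RMS-Z = {rms_z:.2f}")
```

With these made-up numbers only the last bond is flagged, and the RMS-Z comes out well above 1.0 because that single outlier dominates the sum of squares. Passing `cutoff=3.0` gives the pickier variant mentioned in step c).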