- Validation related

The ′quality′ of the protein structures used in bioinformatics ′experiments′ is sometimes of crucial importance. Today′s PDB is big enough that most studies need not worry about losing a few thousand files. The structure validation Lists help you decide which files to include in your training and test sets.

This section is not yet ready. The .sco and .qua Lists databases
require that the PDBREPORT database is up-to-date, but PDBREPORT is
presently being redone, something that might take till the second week of
November. Sorry. But don't worry; they will come.

Motivation

The quality of protein structures in the PDB is heterogeneous. Some proteins will crystallize only in tiny crystals with high mosaicity. Many crystal structures were solved by a master's student for whom this was the first (and, unfortunately, often the only) work in this field. The Validation section of the Lists is a first attempt at giving you a handle on protein structure quality. Be aware, though, that by the quality of a structure we mean neither how well the structure was solved nor how low the Rfree became; rather, we judge how likely it is that including this PDB file in a protein structure bioinformatics study will improve the correctness of the final conclusions drawn.

Touw et al. illustrated that structures of better quality can often also lead to conclusions of better quality in biomedical studies that are based on the analysis of just one or a few protein structures:

New Biological Insights from Better Structure Models.
Touw WG, Joosten RP, Vriend G.
J Mol Biol. 2016 Mar 27;428(6):1375-1393.

The following Lists are available in this section:

Code   Type of data
sco    Gives structures a grade from 1-10
qua    Packing quality, coarse
nqa    Packing quality, fine
flp    Backbone peptide flips
rot    Rotamer normality
c12    Χ12 normality

Grades for structures (sco)

Totally neglecting sensitivity issues, we have given all protein structures a grade from 1-10. 1 means catastrophe; 10 means perfect.
This score weighs resolution, the presence of protein, the length of the largest contiguous chain of connected residues, the presence of UNK residues, the presence of Cα-only residues, the number of residues with missing atoms, the number of ′things′ without known topology, the Ramachandran plot score, and the coarse packing score (DACA).

Again, this score is not a judgement of the crystallographer who deposited the structure. A 27 amino acid protein, solved at 0.7 Ångström resolution, in which WHAT_CHECK cannot find a single error still gets a low score because it is not very representative of the protein universe, and thus not a good candidate for the training/test set of a protein structure bioinformatics study. The file SCORES.html holds the list of scores, sorted by ′quality′.

Packing quality, coarse (qua)

WHAT IF knows two ways of determining the packing normality of proteins. The old-style packing quality, called DACA by Gerard Kleywegt, measures packing in a very general way; it checks whether residues are at the right location:
J. Appl. Cryst. (1993). 26, 47-60.
Quality control of protein models: directional atomic contact analysis
G. Vriend and C. Sander
(Open source did not exist yet in 1993, so the PDF is here.)

I think we could have done a better job in terms of normalisation etc. But remember (or perhaps you cannot) that back in those days the PDB had a hundred times fewer entries than today. Further, at the time DACA was conceived, WHAT IF did not yet have its symmetry modules, so DACA scores must be interpreted with packing effects from outside the protein in mind: residues involved in crystal packing or in binding ligands will score artificially low.

′qua′-Lists typically look like:

    1 ILE (   1 )A     -   -4.35
    2 THR (   2 )A     -   -4.43
    3 GLY (   3 )A     -   -0.21
    4 THR (   4 )A     S   -0.54
    5 SER (   5 )A     S   -3.21
    6 THR (   6 )A     S    0.94
    ...

in which the scores are per residue. The normalized per-protein scores are included in the ′sco′-Lists.
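
For readers who want to process such a list themselves, here is a minimal Python sketch of a line parser. The column interpretation is my assumption, read off the example above and not taken from a format specification: sequential number, residue name, PDB residue number in parentheses, chain identifier, a one-character secondary-structure symbol, and the score. The names `ResidueScore` and `parse_score_line` are hypothetical. Residue numbers with insertion codes are not handled.

```python
import re
from typing import NamedTuple, Optional

# Assumed layout (inferred from the example lines, not from a specification):
# sequential number, residue name, "( PDB number )", chain, symbol, score.
LINE_RE = re.compile(
    r"^\s*(\d+)\s+(\w{3})\s+\(\s*(-?\d+)\s*\)(\S)\s+(\S)\s+(-?\d+\.\d+)\s*$"
)


class ResidueScore(NamedTuple):
    seq_number: int   # running number in the list
    residue: str      # three-letter residue name
    pdb_number: int   # residue number as given between parentheses
    chain: str        # chain identifier
    symbol: str       # one-character column, assumed to be secondary structure
    score: float      # per-residue score


def parse_score_line(line: str) -> Optional[ResidueScore]:
    """Return the parsed record, or None for lines that do not fit the layout."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    seq, res, num, chain, symbol, score = m.groups()
    return ResidueScore(int(seq), res, int(num), chain, symbol, float(score))


example = """\
    1 ILE (   1 )A     -   -4.35
    2 THR (   2 )A     -   -4.43
    3 GLY (   3 )A     -   -0.21
"""
records = [r for r in map(parse_score_line, example.splitlines()) if r]
for r in records:
    print(r.chain, r.pdb_number, r.residue, r.score)
```

Returning None for unparseable lines keeps the loop simple; a stricter reader might prefer to raise an error so that format surprises do not go unnoticed.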

Peptide flips (flp)

Touw et al. first hand-refined a few hundred structures with a special focus on so-called peptide flips (situations in which the C=O and N-H between two Cαs need rotating by 180 degrees, or need to undergo a cis-trans isomerisation). Using these data, software was written to check the whole PDB. Newly discovered putative flips were then hand-checked and optionally added to the training set. This process continued for a while. See:
Detection of trans-cis flips and peptide-plane flips in protein structures.
Touw WG, Joosten RP, Vriend G.
Acta Crystallogr D Biol Crystallogr. 2015 Aug;71(Pt 8):1604-1614.
The PDF is here

The final product (a random forest method) is now part of the WHAT_CHECK part of WHAT IF.

A ′flp′-Lists typically looks like:

   19 GLY (  19 )A       TT+   Unlikely
   21 GLY (  21 )A       TT+   Unlikely
   61 PHE (  61 )A       TC-   Likely
   94 LEU (  94 )A       TC-   Very likely
  140 SER ( 140 )A       TT+   Likely
  141 GLY ( 141 )A       TT+   Somewhat likely
  173 ASP ( 173 )A       TT+   Unlikely
  174 MET ( 174 )A       TT+   Somewhat likely
  190 ASP ( 190 )A       TT+   Unlikely

It is highly likely that there will be far fewer peptide flips in PDB_REDO files than in PDB files, as the flp software is an integral part of the PDB_REDO pipeline.
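
If you want to pull only the more probable flips out of a ′flp′ list, something like the following Python sketch could do. It rests on two assumptions of mine: that the columns are as they appear in the example above (with the three-character code simply copied, not interpreted), and that the four verdicts seen there rank from "Unlikely" up to "Very likely". The names `parse_flp_line` and `probable_flips` are hypothetical.

```python
import re

# Assumed layout: sequential number, residue name, "( PDB number )", chain,
# a short code (kept verbatim, not interpreted here), and a verdict text.
FLP_RE = re.compile(
    r"^\s*(\d+)\s+(\w{3})\s+\(\s*(-?\d+)\s*\)(\S)\s+(\S+)\s+(\S.*?)\s*$"
)

# Assumed ordering of the verdicts seen in the example, least to most likely.
VERDICT_RANK = {"Unlikely": 0, "Somewhat likely": 1, "Likely": 2, "Very likely": 3}


def parse_flp_line(line):
    """Return a dict for a line that fits the assumed layout, else None."""
    m = FLP_RE.match(line)
    if m is None:
        return None
    seq, res, num, chain, code, verdict = m.groups()
    return {
        "seq_number": int(seq),
        "residue": res,
        "pdb_number": int(num),
        "chain": chain,
        "code": code,
        "verdict": verdict,
    }


def probable_flips(lines, minimum="Likely"):
    """Keep records whose verdict ranks at or above `minimum`."""
    threshold = VERDICT_RANK[minimum]
    hits = []
    for line in lines:
        rec = parse_flp_line(line)
        # Unknown verdict texts get rank -1 and are therefore dropped.
        if rec is not None and VERDICT_RANK.get(rec["verdict"], -1) >= threshold:
            hits.append(rec)
    return hits


example = """\
   19 GLY (  19 )A       TT+   Unlikely
   61 PHE (  61 )A       TC-   Likely
   94 LEU (  94 )A       TC-   Very likely
  140 SER ( 140 )A       TT+   Likely
  141 GLY ( 141 )A       TT+   Somewhat likely
"""
flagged = probable_flips(example.splitlines())
for rec in flagged:
    print(rec["chain"], rec["pdb_number"], rec["residue"], rec["code"], rec["verdict"])
```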

Packing quality, fine (nqa)

A few years after DACA was introduced, Rob Hooft was the first to realize that DACA largely looks at the distribution of hydrophobic and hydrophilic residues over the protein′s core and surface. Rob then set out to make a better packing validator. In the end, the two packing modules do different things. While DACA asks the question "Does this residue belong here?", the fine packing quality module asks the question "Given that the residue sits here, does it sit here optimally?".

DACA works better for NMR structures, low-resolution structures, and other situations where whole strands might be shifted by one position or where similarly big errors occur, while the newer, fine packing quality module is the method of choice for high-resolution structures.

I guess that at 1.5 Ångström resolution or better, the results from both packing modules become meaningless, as at that resolution only a blind horse can make errors at the level that the packing modules are sensitive to.

A ′nqa′-Lists typically looks like:

   17 ASN (  12 )A     T    0.11
   18 GLY (  13 )A     H   -0.28
   19 GLY (  14 )A     H   -0.37
   20 ILE (  15 )A     H    0.74
   21 THR (  16 )A     H    0.11
   22 ASP (  17 )A     H    0.58
   23 MET (  18 )A     H    0.00
   24 LEU (  19 )A     H   -0.01

The value of rotamers

No matter how well crystallographers do their work, and no matter how hard the PDB_REDO team works, there will always remain the problem that crystals form only when the proteins make crystal contacts. There is nothing we can do about that, except using structures solved by NMR or EM instead of crystal structures, but that has other disadvantages.

Residues involved in crystal-packing have -by necessity- a different conformation than they would have had if there was no crystal contact. We will one of these days make some data available that people can use to study this for themselves. For now, the rotamer-related Lists can be used to get an impression of the magnitude of the influence of crystal-packing artefacts.

Rotamer normality (rot)

Chinea et al. showed the power of rotamers for modelling and structure validation:
The use of position-specific rotamers in model building by homology.
Chinea G, Padron G, Hooft RW, Sander C, Vriend G.
Proteins. 1995 Nov;23(3):415-421.
The ′rot′-Lists database holds for each protein the rotamer normality score for each amino acid. Be aware, though, that a rotamer normality score can be determined only for a residue that has two intact, covalently bound residues present at either end in the chain.

A ′rot′-Lists typically looks like:

    1 MET (   1 )A     - -999.90
    2 ILE (   2 )A     S -999.90
    3 SER (   3 )A     S    0.46
    4 LEU (   4 )A     S    0.52
    5 ASN (   5 )A     S    0.53
    6 GLY (   6 )A     S -999.00
    7 TYR (   7 )A     S    0.47

in which a score of -999.9 indicates that the rotamer normality for that residue could not be determined.
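
When averaging such scores, the undetermined residues must be left out, or the result is dominated by the placeholder values. Below is a minimal Python sketch. Because the example above shows both -999.90 and -999.00, the sketch assumes that any value at or below -999 marks an undetermined score; that cut-off, and the function name `determined_scores`, are my assumptions.

```python
from statistics import mean

# Assumption: any score at or below this value is a "could not be determined"
# marker (the example list contains both -999.90 and -999.00).
MISSING_CUTOFF = -999.0


def determined_scores(lines):
    """Collect the last column of each line as a float, skipping markers."""
    scores = []
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue
        try:
            value = float(tokens[-1])
        except ValueError:
            continue  # not a data line (for example a "..." continuation)
        if value > MISSING_CUTOFF:
            scores.append(value)
    return scores


example = """\
    1 MET (   1 )A     - -999.90
    2 ILE (   2 )A     S -999.90
    3 SER (   3 )A     S    0.46
    4 LEU (   4 )A     S    0.52
    5 ASN (   5 )A     S    0.53
    6 GLY (   6 )A     S -999.00
    7 TYR (   7 )A     S    0.47
"""
scores = determined_scores(example.splitlines())
print(len(scores), "residues with a score; mean =", round(mean(scores), 3))
```

Whether a plain mean is a sensible per-protein summary is a separate question; the point here is only that the markers are filtered out before any statistics are computed.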

Χ12 normality (c12)

Whereas Chinea's rotamer method looks at the normality of entire side chains, the c12 Χ12 normality looks only at the correlation between Χ1 and Χ2. The rot-method and the c12-method overlap, but are also sufficiently different that it is worth using both when determining the ′quality′ of a PDB file.

A ′c12′-Lists typically looks like:

    4 LEU (   4 )A     S   -1.33
    5 ASN (   5 )A     S   -0.85
    6 GLY (   6 )A     S -999.90
    7 TYR (   7 )A     S   -0.35
    8 GLY (   8 )A     S -999.90
    9 ARG (   9 )A     S   -1.08
   10 PHE (  10 )A     S   -0.30

in which a score of -999.9 indicates that the Χ12 normality could not be determined for that residue (which is normally caused by the residue not having a Χ1 and Χ2).
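
Using the rot and c12 scores together comes down to matching residues between the two lists. The Python sketch below joins them on chain, PDB residue number, and residue name, and keeps only residues for which both scores could be determined. The column layout, the cut-off of -999 for undetermined values, and the name `read_scores` are assumptions of mine, based only on the example lists shown in this section.

```python
import re

# Assumed layout: sequential number, residue name, "( PDB number )", chain,
# a one-character symbol, and the score.
LINE_RE = re.compile(
    r"^\s*\d+\s+(\w{3})\s+\(\s*(-?\d+)\s*\)(\S)\s+\S\s+(-?\d+\.\d+)\s*$"
)


def read_scores(text):
    """Map (chain, PDB number, residue name) to score, skipping undetermined values."""
    scores = {}
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m is None:
            continue
        name, number, chain, value = m.groups()
        if float(value) > -999.0:  # assumed cut-off for "not determined"
            scores[(chain, int(number), name)] = float(value)
    return scores


rot_text = """\
    3 SER (   3 )A     S    0.46
    4 LEU (   4 )A     S    0.52
    5 ASN (   5 )A     S    0.53
    6 GLY (   6 )A     S -999.00
    7 TYR (   7 )A     S    0.47
"""
c12_text = """\
    4 LEU (   4 )A     S   -1.33
    5 ASN (   5 )A     S   -0.85
    6 GLY (   6 )A     S -999.90
    7 TYR (   7 )A     S   -0.35
    8 GLY (   8 )A     S -999.90
"""
rot = read_scores(rot_text)
c12 = read_scores(c12_text)

# Residues present, with a determined score, in both lists.
both = {key: (rot[key], c12[key]) for key in sorted(rot.keys() & c12.keys())}
for (chain, number, name), (rot_score, c12_score) in both.items():
    print(chain, number, name, rot_score, c12_score)
```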