Alphafold2 report

The LR-score

The LR-score was originally implemented to catch NMR structures that consist largely of unstructured loops with perhaps a few turns in the middle. In the case of NMR studies, these structures do make a bit of sense as they teach us about natively unfolded proteins, but from the point of view of a homology modeller looking for a template, they are useless.

Why the name

This explains the name of the score: a lone ranger, or lone residue, is a residue without contacts with other residues. Lone Residue -> LR.

Implementation

The LR-score calculation follows a few steps:

For all residues in a protein, all residue-residue contacts are counted. Symmetry-related molecules are not taken into account here. Contacts are only counted with residues that are at least five residues away in the sequence.
In the resulting row of numbers, which typically range from zero to around ten, all stretches of five or fewer zeros are removed by setting them to one.
The total number of zeros left is divided by the total number of amino acids and multiplied by 100%.
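
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the original implementation. The contact definition is an assumption: one representative atom per residue and an 8.0 Å distance cutoff, neither of which is specified in this report. The function names count_contacts and lr_score are made up for this example.

```python
import math


def count_contacts(coords, cutoff=8.0, min_separation=5):
    """Count, per residue, the contacts with residues that are at least
    `min_separation` positions away in the sequence.

    `coords` holds one (x, y, z) tuple per residue. The single-atom
    representation and the 8.0 A cutoff are assumptions of this sketch.
    """
    n = len(coords)
    counts = [0] * n
    for i in range(n):
        for j in range(i + min_separation, n):
            if math.dist(coords[i], coords[j]) <= cutoff:
                counts[i] += 1
                counts[j] += 1
    return counts


def lr_score(counts, max_short_run=5):
    """Return the LR-score (in percent) for a row of contact counts."""
    if not counts:
        raise ValueError("need at least one residue")
    values = list(counts)
    n = len(values)
    i = 0
    while i < n:
        if values[i] != 0:
            i += 1
            continue
        # Find the end of this stretch of zeros.
        j = i
        while j < n and values[j] == 0:
            j += 1
        # Short stretches (five or fewer zeros) are set to one.
        if j - i <= max_short_run:
            for k in range(i, j):
                values[k] = 1
        i = j
    # The zeros that are left, as a percentage of all residues.
    return 100.0 * values.count(0) / n
```

With these assumptions, a fully extended chain such as `[(3.8 * i, 0.0, 0.0) for i in range(10)]` has no contacts at all and therefore scores 100%, in line with the behaviour described for extended proteins.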

The resulting number falls between 0.0% and 100%. 100% is found for fully extended proteins or proteins that consist of just one long helix. I ran this 'algorithm' over all Alphafold models for more than 20K human proteins, and made a primitive histogram of these 20K LR-scores. The result seemed somewhat surprising. More than a third of all models (about 37%) have an LR-score of 50% or worse:

   0.000 -  10.000 (  698)    =====
  10.000 -  20.000 ( 3196)    =====================
  20.000 -  30.000 ( 4511)    ==============================
  30.000 -  40.000 ( 3464)    =======================
  40.000 -  50.000 ( 2432)    ================
  50.000 -  60.000 ( 1894)    =============
  60.000 -  70.000 ( 1467)    ==========
  70.000 -  80.000 ( 1459)    ==========
  80.000 -  90.000 ( 1548)    ==========
  90.000 - 100.000 ( 2081)    ==============
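
A histogram in this layout can be produced with a short Python function. This is only a sketch: the function name histogram_lines is made up, and scaling the longest bar to 30 characters is an assumption that happens to be consistent with the bar lengths shown above.

```python
def histogram_lines(scores, n_bins=10, bar_width=30):
    """Return one text line per bin for LR-scores between 0 and 100."""
    bin_size = 100.0 / n_bins
    counts = [0] * n_bins
    for score in scores:
        # A score of exactly 100% is placed in the last bin.
        index = min(int(score // bin_size), n_bins - 1)
        counts[index] += 1
    top = max(counts) if any(counts) else 1
    lines = []
    for b, count in enumerate(counts):
        # Bars are scaled so that the fullest bin gets `bar_width` characters.
        bar = "=" * round(bar_width * count / top)
        line = "%8.3f - %7.3f (%5d)    %s" % (
            b * bin_size, (b + 1) * bin_size, count, bar)
        lines.append(line.rstrip())
    return lines
```

Calling `print("\n".join(histogram_lines(scores)))` on a list of LR-scores prints the ten bins.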

Worried that there was a bug, I ran the same algorithm over 10K randomly selected X-ray PDB files and got:

   0.000 -  10.000 ( 9654)    ==============================
  10.000 -  20.000 ( 1054)    ===
  20.000 -  30.000 (   97)
  30.000 -  40.000 (   17)
  40.000 -  50.000 (   13)
  50.000 -  60.000 (    5)
  60.000 -  70.000 (    1) 3s4r
  70.000 -  80.000 (    3) 1fav 2ymk 4lh9
  80.000 -  90.000 (    0)
  90.000 - 100.000 (    1) 1nyh

I checked a few of the high (=bad) scoring PDB files:

1nyh is one long helix.
1fav consists of two helices, a short one packed on the middle part of the long one.
2ymk consists of three helices hanging in space, but the complete molecule is only obtained when the symmetry matrices are applied.
4lh9 consists of one helix and a strand that barely touches the helix. However, both secondary structure elements make extensive symmetry contacts.
3s4r consists of two very long helices that just touch at one side.

So, there is nothing wrong with these files, but they are useless as modelling templates, unless you want to model a close homolog.

The point is that 99% of the X-ray PDB files score in the lower two bins, while for the Alphafold models this isn't even 20%.
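
These percentages can be checked directly from the bin counts in the two histograms. The snippet below is a small worked calculation, with the counts copied from the tables above. The helper name percent_in_bins is made up for this example.

```python
# Bin counts copied from the two histograms (0-10%, 10-20%, ..., 90-100%).
alphafold_bins = [698, 3196, 4511, 3464, 2432, 1894, 1467, 1459, 1548, 2081]
xray_bins = [9654, 1054, 97, 17, 13, 5, 1, 3, 0, 1]


def percent_in_bins(bins, first, last):
    """Percentage of all entries that fall in bins[first:last]."""
    return 100.0 * sum(bins[first:last]) / sum(bins)


# Fraction of structures in the lower two bins (LR-score below 20%).
print("X-ray, lower two bins:     %.1f%%" % percent_in_bins(xray_bins, 0, 2))
print("Alphafold, lower two bins: %.1f%%" % percent_in_bins(alphafold_bins, 0, 2))
# Fraction of Alphafold models with an LR-score of 50% or worse.
print("Alphafold, 50%% or worse:   %.1f%%" % percent_in_bins(alphafold_bins, 5, 10))
```

This gives 98.7% for the X-ray files and 17.1% for the Alphafold models in the lower two bins, and 37.1% of the Alphafold models at 50% or worse.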