AlphaFold2 report

The story

When the 2020 CASP results came out, AlphaFold had done so well that the protein folding problem was called 'solved'. And now that we look at several tens of thousands of models, we see that the majority are actually, hmm, well, very bad to totally useless.
So, what is going on? In this section I will dream out loud about my ideas on this topic. So, this section will be low on facts and observations and high on speculation.

What do we start with?

Cutting all possible corners, it can be stated that AlphaFold (AF) is based on correlated mutations observed in Multiple Sequence Alignments (MSAs). In that respect it does what Sander and Marks, and Jones, pioneered a couple of years ago. AF does it better, though, because its makers a) can do every step that requires massive CPU effort better than their competitors, and b) do some smart neural network things.

Correlated mutations

So, to see how AF and its predecessor methods work, we need to understand correlated mutations.

Vriend's first two rules of bioinformatics state that:

If it is conserved, it is important
If it is very conserved, it is very important

And the converse then reads that important residues remain conserved and less important residues can mutate. Several years of mutation studies on neutral proteases by (mainly) Venema, Vd Burg, Eijsink, and Vriend have shown that most residues in these proteases can actually be mutated without any consequence for the enzyme's stability or function. And indeed, hundreds of protein engineering studies can also be interpreted as "nature has no problems with stability; if you make a bad mutation at position A, you can compensate for it at position B". We showed an example in which a destabilizing mutation at one location could be compensated by other mutations up to 50 Å away in the structure. So, if a residue mutates, its direct neighbour in the structure will only have to mutate too if the contact between those two residues is of functional importance. And protein stability is NOT functionally important.

Which contacts are then important enough to either be conserved or, upon mutation of one contact partner, to require that the other contact partner mutates too? Before I can discuss this, you need to know a few other things that we have researched over the years and that I will summarise in two text boxes below. In the first box, I discuss work by Janne Bibbe on GPCRs.

GPCRs are receptors that consist of seven transmembrane helices and a bunch
of domains that stick out at both sides of the membrane. All GPCRs bind a
ligand between the helices near the extracellular side, and this ligand
binding leads to a cascade of molecular motions that lead to the escape of
a sodium ion into the cytosol and to G-protein binding and activation. The
essence of this story is that the signal must traverse through the core of
the molecule, and that can ONLY be done by movements of amino acids. There
are hundreds of, often radically different, ligands that can all bind their
own GPCR, but there are only a few G-proteins. So, GPCRs have a wide variety
of residue types at the positions surrounding the ligand binding pocket, but
the sodium escape switch consists nearly always of the same residue types.
Clearly, the contacts made by residues along this path are of functional
importance, and thus must follow Vriend's rules of bioinformatics. And in
doing so, they will start to show correlated mutational behaviour in MSAs.
Janne beautifully showed that the most conserved aspect of the sequence
motifs in GPCRs that over the years were shown related to activity is neither
their sequence motifs, nor their structural aspects, but rather the location
of weak spots in helices that then allow for the signal to traverse from the
ligand to the sodium ion and to the G-protein. And although brilliant in
itself, this conclusion is not important for the AF story.

The second box summarises the work by Laerte Oliveira on sequence entropy and sequence variability as observed in columns in MSAs. This work is discussed more extensively here; below I will only discuss the final conclusions.

Laerte made big MSAs for five protein families that were very well studied at
that time. He determined for each residue position in a protein the entropy
and the variability in the MSA column for that position. He made a two
dimensional plot of sequence entropy versus sequence variability and in this
plot he determined five sectors that he called 11, 12, 22, 23, and 33. I will
discuss them in a strange order. The reason for that should be clear at the
end of the list.
a. Box 11 holds the most conserved residues. They have low entropy and low
 variability and are almost without exception found in the main active site.
b. Box 12 holds residues with equally low variability but higher entropy. In
 other words they seem equally conserved, but the possibilities that evolution
 found out for these residue positions are more evenly spread over the members
 of the MSA family. These residues were always found in direct contact
 with Box 11 (active site) residues.
c. Box 33 holds the residues with the highest variability and the highest
 entropy. For these residues Laerte could seldom find mutation study results
 in the literature, and he jokingly stated that these residues were reserved by
 nature for future use.
d. Box 23 holds residues with about the same high sequence entropy as Box 33,
 but with lower variability. These residues were almost without exception
 found in regulatory sites or modulator sites. So, for GPCRs or Nuclear Hormone
 Receptors the Box 23 residues were found in the ligand binding sites.
e. And that leaves us with the Box 22 residues. These have the same variability
 as the Box 23 (modulator site) residues, and the same entropy as the Box 12
 (active site support) residues. And, ... Box 22 residues are mostly physically
 located between the active site and the modulator site, often contacting
 either Box 12 or Box 23 residues (or very occasionally both).
Daniel Rademaker showed that Entropy and Variability are indeed the two optimal
parameters to numerically describe the mutational patterns observed in columns
of an MSA.
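
To make these two numbers concrete, here is a minimal Python sketch that computes them per MSA column. It assumes that entropy means the Shannon entropy of the residue frequencies in a column and that variability means the number of different residue types observed in that column. The exact definitions and the boundaries between the five Boxes are not given in this text, so the sketch stops at the two raw numbers. The function names and the gap characters are illustrative choices.

```python
from collections import Counter
from math import log

GAP_CHARS = "-."  # assumption: gaps are written as '-' or '.'


def column_entropy_variability(column):
    """Return (entropy, variability) for one MSA column given as a string."""
    residues = [r for r in column.upper() if r not in GAP_CHARS]
    n = len(residues)
    counts = Counter(residues)
    # Shannon entropy in nats; log(n / c) keeps every term non-negative.
    entropy = sum((c / n) * log(n / c) for c in counts.values())
    # Variability: how many different residue types occur in this column.
    variability = len(counts)
    return entropy, variability


def msa_entropy_variability(msa):
    """msa: list of equal-length aligned sequences. One (E, V) pair per column."""
    return [column_entropy_variability("".join(col)) for col in zip(*msa)]


if __name__ == "__main__":
    toy_msa = ["ACD", "ACE", "AGF", "AGH"]  # made-up alignment, four sequences
    for pos, (e, v) in enumerate(msa_entropy_variability(toy_msa), start=1):
        print(f"column {pos}: entropy={e:.3f} variability={v}")
```

In the toy alignment the first column is fully conserved (one residue type, zero entropy), while the third column has four residue types spread evenly and therefore the highest entropy of the three.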

A-B-C contacts

Let's discuss three residue positions in an MSA, creatively called A, B, and C. If A correlates with B and B with C, then you will observe that A correlates with C too. The problem now is that A does not need to contact C. A, B, and C can form a triangle in the structure, or be located on a straight line. For Sander and Marks, and Jones (and the many who followed in their footsteps) this was a problem. Many groups have thought of simple or complex, crude or elegant, well working or rather failing methods to figure out if A and C are a genuine contact or not.
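
The A-B-C effect can be reproduced with a toy example. The sketch below uses plain mutual information between columns as the correlation measure, which is a stand-in chosen for simplicity and not the method used by any of the groups mentioned. The residue pairings and the 3-out-of-4 coupling strength are made up. In the toy data, C is generated from B only and never looks at A.

```python
from collections import Counter
from math import log2


def mutual_information(col_x, col_y):
    """Mutual information (in bits) between two MSA columns of equal length."""
    n = len(col_x)
    count_x, count_y = Counter(col_x), Counter(col_y)
    count_xy = Counter(zip(col_x, col_y))
    return sum(
        (c / n) * log2(c * n / (count_x[x] * count_y[y]))
        for (x, y), c in count_xy.items()
    )


def toy_chain_columns():
    """Build 32 toy sequences: B follows A, and C follows B, 3 times out of 4."""
    partner_b = {"K": "D", "E": "R"}  # hypothetical A-B pairing
    partner_c = {"D": "L", "R": "F"}  # hypothetical B-C pairing
    col_a, col_b, col_c = [], [], []
    for a in "KE":
        other_a = "E" if a == "K" else "K"
        for i in range(16):
            # B takes the partner of A in 12 of the 16 sequences.
            b = partner_b[a] if i < 12 else partner_b[other_a]
            other_b = "R" if b == "D" else "D"
            # C takes the partner of B in 3 of every 4 sequences, ignoring A.
            c = partner_c[b] if i % 4 != 3 else partner_c[other_b]
            col_a.append(a)
            col_b.append(b)
            col_c.append(c)
    return col_a, col_b, col_c


if __name__ == "__main__":
    a, b, c = toy_chain_columns()
    print(f"MI(A,B) = {mutual_information(a, b):.4f} bits")
    print(f"MI(B,C) = {mutual_information(b, c):.4f} bits")
    print(f"MI(A,C) = {mutual_information(a, c):.4f} bits")
```

With this toy data the A-C value comes out lower than the A-B and B-C values, but clearly above zero, even though A and C were never coupled directly. That indirect signal is what makes it hard to decide whether A and C are a genuine contact.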

I haven't done any work on this topic, but I encourage you to check if entropy and variability can help untangle the A-B-C problem. After all, you expect contacts (or at least close proximities) between residues in adjacent Boxes in the EV-plot. Actually, I would expect more contacts between residues in adjacent Boxes than between residues observed in the same Box. If you get results feel free to publish, but don't forget to mention this website.
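
A first check of this idea could look like the Python sketch below. It assumes that you already have a Box label for every residue and a list of residue pairs that are in contact in a known structure; both inputs, and the function name, are hypothetical. It also assumes that 'adjacent' means one step apart in the order 11, 12, 22, 23, 33.

```python
from collections import Counter

# Assumed ordering of the Boxes; adjacent Boxes are one step apart.
BOX_PATH = ["11", "12", "22", "23", "33"]


def tally_contacts_by_box_step(box_of, contacts):
    """Count contacts by how many steps apart the two Boxes are.

    box_of:   dict mapping residue number -> Box label ("11", "12", ...)
    contacts: iterable of (residue_i, residue_j) pairs that are in contact
    Returns a Counter: 0 = same Box, 1 = adjacent Boxes, 2+ = further apart.
    """
    index = {box: i for i, box in enumerate(BOX_PATH)}
    tally = Counter()
    for res_i, res_j in contacts:
        step = abs(index[box_of[res_i]] - index[box_of[res_j]])
        tally[step] += 1
    return tally


if __name__ == "__main__":
    # Made-up labels and contacts, only to show the bookkeeping.
    box_of = {1: "11", 2: "12", 3: "22", 4: "23", 5: "33", 6: "12"}
    contacts = [(1, 2), (2, 3), (2, 6), (1, 5), (3, 4)]
    for step, n in sorted(tally_contacts_by_box_step(box_of, contacts).items()):
        print(f"{step} step(s) apart: {n} contact(s)")
```

Raw counts are only a start. For a fair comparison they should be divided by the number of residue pairs that are possible for each step size, because the Boxes will not hold equal numbers of residues.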

Now, so what?

Well, the two text boxes above tell us that if a protein has an active function that requires an active site, and it is the type of activity that requires regulation at the cellular level, then there will be Boxes 11 to 33 and there will be residues that contact each other along the path Box11->Box12->Box22->Box23->Box33, and vice versa. As these contacts are of functional importance, they will have to adhere to Vriend's rules of bioinformatics and they will thus show correlated behaviour.

So, the structures of enzymes and membrane receptors will be predictable from the information that can be extracted from a good MSA (with perhaps some other information needed too, but the essence will be in the MSA). Other proteins that do their work by binding something, like an actin-binding protein, or perhaps also viral coat proteins, some ribosomal proteins, proteins that form the tubulin (and other) highways in cells, membrane rigidity proteins, etc., will not have internal contacts that are of functional importance, and will thus not show correlated mutation behaviour for the residue pairs for which it would be most important (for structure prediction) to know that they make a contact.

And so far, anecdotal evidence on AF results supports this thinking.

I suggest you use the AF models for your research, but only after you have looked at them with a molecular visualiser (I can strongly advise that you use YASARA, but I also strongly advised that everybody should get a corona vaccine, and you see what that brought us). Further, keep looking at this website. I am not Google, so I need to use my one computer for months on end to get it done, but within a year I hope to have WHAT_CHECK reports ready for all AF models.

I can also imagine that we improve many models by cutting out big sections with a high LR factor. It might even be better to see if the AF per-residue predicted accuracy (that seems to be encoded in the B-factor field; but I haven't found anything on this yet) can be used to determine which parts of AF models are potentially useful and which parts are not. For this purpose I made a raw-data table (ruwe.data; warning: this file is 430 Mbyte big). If anybody is good with R and graphics, feel free to contact me and I'll explain what the numbers in this file mean.
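
As a rough illustration of that last idea, the Python sketch below reads the B-factor field of the CA atoms from PDB-format ATOM lines and reports the stretches of consecutive residues that score at or above a cut-off. It assumes that the B-factor field indeed holds a per-residue confidence on a 0 to 100 scale where higher is better. The cut-off of 70 and the function names are illustrative assumptions, not values derived from the data discussed here.

```python
def ca_bfactors(pdb_lines):
    """Return [(chain, residue_number, b_factor)] for the CA atom of each residue."""
    scores = []
    for line in pdb_lines:
        # Fixed PDB columns: atom name 13-16, chain 22, residue number 23-26,
        # B-factor 61-66 (1-based), hence the 0-based slices below.
        if line[:6] == "ATOM  " and line[12:16].strip() == "CA":
            scores.append((line[21], int(line[22:26]), float(line[60:66])))
    return scores


def confident_segments(scores, threshold=70.0, min_len=1):
    """Group consecutive residues with score >= threshold into (chain, start, end)."""
    segments = []
    current = None  # [chain, start, end] of the segment being built
    for chain, resseq, score in scores:
        ok = score >= threshold
        if ok and current and current[0] == chain and resseq == current[2] + 1:
            current[2] = resseq  # extend the running segment
        else:
            if current:
                segments.append(tuple(current))
            current = [chain, resseq, resseq] if ok else None
    if current:
        segments.append(tuple(current))
    return [s for s in segments if s[2] - s[1] + 1 >= min_len]


if __name__ == "__main__":
    import sys

    # Usage (file name is whatever model you downloaded): python script.py model.pdb
    with open(sys.argv[1]) as handle:
        for chain, start, end in confident_segments(ca_bfactors(handle), min_len=10):
            print(f"chain {chain}: residues {start}-{end}")
```

The min_len parameter is there because a handful of isolated high-scoring residues is probably less useful than a long, uninterrupted stretch.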

Disclaimer

The text on this website was written by Gert Vriend, who is the only one responsible for this text. And Gert has not done extensive research to draw his conclusions but merely stared at the wall thinking about the great future he once had in front of him while finishing a bottle of single malt.