Entropy versus variability

Introduction

Entropy-Variability analysis is a technique that was invented by Laerte Oliveira at the beginning of the 21st century. The idea is that the entropy and the variability observed for a column in an MSA behave differently depending on the function of the residue involved.

The start

In 2003, two back-to-back articles in PROTEINS: Structure, Function, and Genetics introduced the world to the concept of EVA (Entropy-Variability Analysis):

Identification of Functionally Conserved Residues With the Use of Entropy-Variability Plots. PDF.
This article explains the concepts of EVA.
Sequence Analysis Reveals How G Protein-Coupled Receptors Transduce the Signal to the G Protein. PDF.
This article applies EVA to a large, manually curated MSA (multiple sequence alignment) of Class-A GPCRs (G Protein-Coupled Receptors). Amazingly, the analysis revealed the importance of the sodium site for GPCR activation, a fact with which the GPCR research community is still grappling today.

Figure 1. In this movie I discuss how the sequence variation that can be observed in a column of an MSA can be described in terms of entropy and variability. (Click on the image to start the movie.)

The five boxes in the EV-plot

Figure 2. This is one of the original 2003 EV plots. The plot shows variability on the abscissa and entropy on the ordinate. Each little square represents one column in an MSA.
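To make the two axes of the plot concrete, here is a minimal sketch of how both values could be computed for a single MSA column. It assumes that entropy is the Shannon entropy (natural logarithm) of the residue frequencies in the column, and that variability is the number of different residue types seen in the column, optionally counting only types above a minimum frequency. These definitions, the function names, and the `min_fraction` parameter are assumptions made for illustration; the 2003 articles give the exact definitions.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residue types


def _residues(column):
    """Keep only standard residues; gaps and unknown symbols are ignored (assumption)."""
    return [r for r in column.upper() if r in AMINO_ACIDS]


def column_entropy(column):
    """Shannon entropy (natural log) of the residue frequencies in one column."""
    residues = _residues(column)
    if not residues:
        return 0.0
    total = len(residues)
    counts = Counter(residues)
    return sum(-(n / total) * math.log(n / total) for n in counts.values())


def column_variability(column, min_fraction=0.0):
    """Number of residue types whose frequency is at least min_fraction."""
    residues = _residues(column)
    if not residues:
        return 0
    total = len(residues)
    counts = Counter(residues)
    return sum(1 for n in counts.values() if n / total >= min_fraction)


# One column of a toy alignment, written as a string with one letter per sequence.
column = "AAAC-"
print(column_entropy(column), column_variability(column))
```

A fully conserved column gets entropy 0 and variability 1. Note that, under these definitions, the entropy of a column can never exceed the natural logarithm of its variability.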

The part of the plot that holds the data is divided into five sectors that Laerte called 11, 12, 22, 23, and 33. The lines between these sectors are drawn somewhat arbitrarily. However, moving the borders a bit up or down, left or right tends not to change the conclusions you will draw very much. It seems important, though, to draw the box borders in such a way that each residue position (column in an MSA) falls in one of these five sectors. Laerte Oliveira studied five classes of proteins in great detail and assigned a functional role to each residue in each sequence in each of these five classes. The relation between residue function and box number was then found to be:

Box11 Residues in the main active site. These can, for example, be the catalytic residues in an enzyme, or the secondary messenger binding site in a cell surface receptor.
Box12 These residues are in the direct 3D vicinity of the Box11 residues, and often sit in the structure between Box11 and Box22 residues.
Box22 These residues communicate between the main active site (often via Box12 residues) and a regulatory site. The regulatory site regulates the main activity. In GPCRs, for example, the G-protein binding site is the main site (Box11) while the ligand binding site is the regulatory site (Box23).
Box23 These residues make up the regulatory site or sites. In GPCRs this is the ligand binding pocket, but I guess that residues that make up an allosteric binding site that occurs at the same 3D position in a series of GPCR families will also fall in this box. Most enzymes have some on-off mechanism (like the calcium site in many trypsins), and residues involved in switching an enzyme on and off will be found in Box23.
Box33 These residues do nothing. Laerte joked that these were reserved by evolution for future use.
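The box assignment described above can be sketched as a simple lookup. In this sketch the first digit of a box name is taken to be the entropy level and the second digit the variability level (1 = low, 2 = middle, 3 = high). That reading of the box names is an assumption, and the cut-off values are purely hypothetical placeholders, since the real borders are drawn by eye on the plot. The function names are also made up for this example.

```python
# Functional roles per box, summarised from the list above.
BOX_ROLES = {
    "11": "main active site",
    "12": "direct 3D vicinity of the main active site",
    "22": "communication between main active site and regulatory site",
    "23": "regulatory site(s)",
    "33": "no assigned function",
}


def level(value, low_cut, high_cut):
    """Map a value to level 1, 2 or 3 using two cut-offs."""
    if value <= low_cut:
        return 1
    if value <= high_cut:
        return 2
    return 3


def assign_box(entropy, variability,
               entropy_cuts=(0.5, 1.0),      # hypothetical borders
               variability_cuts=(4, 10)):    # hypothetical borders
    """Return the box name for one column, or None if it falls outside the five boxes."""
    label = f"{level(entropy, *entropy_cuts)}{level(variability, *variability_cuts)}"
    return label if label in BOX_ROLES else None


box = assign_box(entropy=0.8, variability=15)
print(box, BOX_ROLES.get(box))
```

A return value of `None` signals a position that falls outside all five boxes, which is a hint that the borders should be shifted until every position lands in one of them.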

What did we do

We creatively borrowed the idea for an autoencoder from: Hinton, G. E., & Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786), 504-507. We applied it to the ~7,000,000 columns in all HSSP MSAs (made by Coos Baakman) for human proteins, reducing the 20-dimensional vectors of residue frequencies in the columns of these MSAs to just two dimensions. These two dimensions were observed to be Entropy and Variability...
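As an illustration of the input data only, the sketch below turns each column of a toy alignment into a 20-dimensional vector of residue frequencies. The function names, the residue ordering, and the handling of gaps (simply ignored) are assumptions for this example. It does not reproduce the HSSP processing or the autoencoder itself.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # fixed order of the 20 vector dimensions


def column_frequencies(column):
    """Return the 20 residue frequencies of one column; gaps are ignored (assumption)."""
    residues = [r for r in column.upper() if r in AMINO_ACIDS]
    if not residues:
        return [0.0] * len(AMINO_ACIDS)
    return [residues.count(aa) / len(residues) for aa in AMINO_ACIDS]


def msa_to_vectors(sequences):
    """Convert aligned, equal-length sequences into one frequency vector per column."""
    return [column_frequencies("".join(col)) for col in zip(*sequences)]


# Toy alignment of three sequences and three columns.
vectors = msa_to_vectors(["AC-", "AD-", "AC-"])
print(len(vectors), len(vectors[0]))
```

Each such vector would be one training example for a dimensionality-reduction method that compresses 20 numbers down to 2.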
Figure 1 is hyperlinked to a short seminar that explains some MSA analysis ideas, including Entropy and Variability.