Entropy, Variability

For every column p in an MSA, the entropy Sp and the variability Vp are calculated. The location of the point (Vp,Sp) in an EV plot is related to the function of the residue at that position, as described on the previous page for Boxes 11, 12, 22, 23, and 33.

Entropy

Figure 3. The entropy Sp for column p in an MSA is calculated as minus the sum over the 20 amino acid types i of fp,i·ln(fp,i), that is, Sp = -Σi fp,i·ln(fp,i), which has the form of the Shannon entropy function. fp,i is the frequency of residue type i (A,C,D,...V,W,Y : 1,2,3,...18,19,20) at sequence position p (column p) in the MSA.

Nowadays, we would take the logarithm to base 20 rather than the natural logarithm, so that the values end up between 0 and 1. The two choices differ only by a constant factor: dividing Sp by ln(20) gives the base-20 value.
fp,i ranges from 0.0 when residue type i is not present in column p of the MSA to 1.0 when residue type i is fully conserved. A residue type with fp,i = 0.0 contributes nothing to the sum. Sp ranges from 0.0 for a fully conserved position to ln(20) (about 3.0) when all 20 amino acid types i are observed in column p with equal frequency.
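
The entropy calculation can be sketched in a few lines of Python. This is a minimal illustration, not the original implementation: the function name column_entropy and the base parameter are hypothetical, a column is assumed to be given as a string of one-letter residue codes, and gaps or non-standard symbols are assumed to be ignored (the text above does not say how they are treated).

```python
import math
from collections import Counter

# The 20 standard amino acid types in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def column_entropy(column, base=math.e):
    """Return Sp = -sum_i f_i * log(f_i) for one MSA column.

    column: string with one character per aligned sequence.
    base:   logarithm base; math.e gives values in [0, ln(20)],
            20 gives values in [0, 1].
    Assumption: gaps and non-standard symbols are skipped.
    """
    residues = [c for c in column.upper() if c in AMINO_ACIDS]
    if not residues:
        return 0.0
    total = len(residues)
    entropy = 0.0
    # Only residue types that occur are visited, so the
    # f = 0 terms (which contribute nothing) never arise.
    for count in Counter(residues).values():
        f = count / total
        entropy -= f * math.log(f)
    # Changing the base only rescales by a constant factor.
    return entropy / math.log(base)
```

Calling column_entropy(column, base=20) rescales the result to the 0-to-1 range mentioned above, because the base only changes the value by the constant factor 1/ln(20).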

Variability

The variability Vp of column p in an MSA is defined as the number of different amino acid types i that are observed in column p of the MSA. Laerte, 20 years ago, demanded that a residue type be present in at least 0.5% of the sequences in an MSA to be counted (so, residue type i must be observed at position p in 1 or more of every 200 aligned sequences). He made this rule to avoid errors caused by the many sequencing errors that were still made in those days. I guess that nowadays sequencing errors at this 0.5% level can be forgotten, but the 0.5% cutoff is still useful to deal with the occasional totally wrong protein that accidentally ended up in the MSA.
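
The variability count with its 0.5% cutoff could look roughly like this in Python. Again, this is only a sketch under stated assumptions: the function name column_variability and the min_fraction parameter are hypothetical, gaps and non-standard symbols are assumed to be ignored, and the frequency is assumed to be taken relative to the residues actually present in the column.

```python
from collections import Counter

# The 20 standard amino acid types in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def column_variability(column, min_fraction=0.005):
    """Return Vp: the number of residue types seen in one MSA column.

    A residue type is counted only if it makes up at least
    min_fraction (default 0.5%) of the residues in the column.
    Assumption: gaps and non-standard symbols are skipped and
    do not count towards the column total.
    """
    residues = [c for c in column.upper() if c in AMINO_ACIDS]
    if not residues:
        return 0
    total = len(residues)
    counts = Counter(residues)
    return sum(1 for n in counts.values() if n / total >= min_fraction)
```

With the default cutoff, a single stray residue among 200 sequences is still counted, but the same single residue among 1000 sequences is not, which is how the rule filters out the occasional wrong sequence in a large alignment.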