Question, method

We asked what the best general way would be to reduce the dimensionality of the data in MSAs. After performing the data reduction, we realised that the two-dimensional representation matches EV plots rather well.

The data

We decided that the MSAs for ~27K human sequences in the HSSP database would make a sufficiently large, diverse, and representative dataset.

Figure 4. The method to reduce the 20-dimensional alignment data to just two dimensions is explained in the movie. (Click on the image to start the movie.)

The use of the data determines the type of result

The whole process of encoding and decoding can be done in two very different ways:

  1. Using frequency vectors sorted by the frequencies
  2. Using frequency vectors sorted by the amino acid types
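The two input representations listed above can be illustrated with a minimal sketch. It only builds the 20-dimensional frequency vector for a single alignment column in both orderings; it does not reproduce the encoding and decoding step itself. The amino acid order in `HSSP_ORDER` is an assumption about the HSSP profile layout and should be checked against the actual files. The function names and the example column are hypothetical.

```python
from collections import Counter

# Assumed column order of the HSSP sequence profile (verify against your files).
HSSP_ORDER = "VLIMFWYGAPSTCHRKQEND"


def frequency_vector(column):
    """Return the 20 amino acid frequencies of one alignment column.

    The result is ordered by amino acid type (HSSP_ORDER). Gaps and any
    other non-standard characters are ignored.
    """
    counts = Counter(c for c in column.upper() if c in HSSP_ORDER)
    total = sum(counts.values())
    if total == 0:
        # Column holds no standard residues: return an all-zero vector.
        return [0.0] * len(HSSP_ORDER)
    return [counts[aa] / total for aa in HSSP_ORDER]


def sorted_by_frequency(vector):
    """Return the same frequencies, largest first (amino acid identity is lost)."""
    return sorted(vector, reverse=True)


# Hypothetical alignment column: one residue per aligned sequence, '-' is a gap.
column = "LLLLIIVM-L"

by_type = frequency_vector(column)        # option 2: sorted by amino acid type
by_freq = sorted_by_frequency(by_type)    # option 1: sorted by frequency
```

Sorting by frequency discards which amino acid occupies which position and keeps only the shape of the distribution. Keeping the amino acid order preserves residue identity, so any relationship between residue types can only show up in the second representation.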

Figure 5. The result of encoding frequency vectors sorted by frequency.

Figure 6. When the frequency vectors are not sorted by frequency but are left in the order HSSP uses for the amino acid types, a very different result is obtained. This second experiment yields information about the relationships between amino acid characteristics.

On the next page you will find both a typed and a spoken seminar discussing these results.