The amino acids

Aligning

EU name: ALIGN

(From: ../EUDIR ) (Date: Aug 24 2016 ../EUDI)

After completing the "Aligning" section you will:
Be able to make (small) alignments by hand
Be able to judge the quality of sequence alignments
Know how to predict secondary structure of sequence fragments and use this information to optimize alignments
Know what "threading" is.
Know that a good sequence alignment is necessary for carrying over information between proteins.
Know that putting amino acids below each other in a sequence alignment implies that you predict that they are on equivalent positions in both proteins.
Know under which conditions you can reliably transfer information.
Use all structural information available to you (measured or predicted) to optimize your sequence alignment.

Aligning sequences by hand

The most powerful weapon in the bioinformaticist's armoury is sequence alignment. Why?

Let's think about an alignment. It is a representation of a whole series of evolutionary events, which left traces in the sequences. Things that are more likely to happen during evolution (mutation of an asparagine or a serine, conservation of a tryptophan or a cysteine bridge) should be most prominently observed in your alignment.

What kind of things are important? Let's give a few examples:

it is much easier to mutate than to insert or delete (indel)
once nature decided on an indel, its length is not so important, but longer indels may be more difficult to make than shorter ones
active site residues don't mutate
residues tend to mutate into similar residues (e.g. V <-> I; S <-> T; etc)
residues mutate more easily to residues encoded by similar codons
cysteines that sit in cysteine bridges don't mutate easily
surface residues mutate more easily than core residues
core residues mutate more easily when they make fewer contacts
it is hard to mutate a glycine that sits somewhere with torsion angles that other residues cannot have
and so on...
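The first two rules in this list are what alignment programs commonly encode as an affine gap penalty: opening an indel is expensive, making it longer costs only a little extra. Below is a minimal sketch in Python. The function name gap_penalty and the values 10.0 and 0.5 are illustrative assumptions, not values used in this course.

```python
def gap_penalty(length, gap_open=10.0, gap_extend=0.5):
    """Cost of one indel of `length` residues under an affine scheme.

    gap_open and gap_extend are made-up illustrative values.
    """
    if length <= 0:
        return 0.0
    # The first gapped position pays the opening cost,
    # every further position only the (smaller) extension cost.
    return gap_open + gap_extend * (length - 1)
```

With these assumed values, one indel of three residues is much cheaper than three separate indels of one residue each, which mirrors the idea that the decision to make an indel matters more than its length.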

We shall now start working on sequence alignments. We shall steadily add one rule after another, and learn a few new physicochemical properties of amino acids at the same time.

We already discussed that there are two kinds of sequence alignments. The one kind tries to align those residues that have a common ancestor. The other kind tries to align those residues that fall on top of each other when the corresponding structures get superposed three-dimensionally.

In this course we are mainly interested in this latter type of alignment. Obviously, if two residues sit at similar positions in similar structures, they are likely to have similar physico-chemical properties. So, let's start using everything we learned about amino acids in some "real" alignments.

As we are bioinformaticians, we are not just going to run an alignment program and look at the result. No, we are going to think about it and use all kinds of additional information. We can recognize several levels of sophistication in the information we can use:

No additional knowledge can be found. So, it is just you and the residues
Knowledge about proteolytic cleavage sites
Knowledge about metal binding residues
Knowledge about antibody escape mutants
Knowledge about ligand binding
Knowledge from CD, fluorescence, etc.
Knowledge about a homologous structure
Knowledge about the secondary structure
Knowledge about the predicted secondary structure
Etcetera

Residue characteristics and sequence alignment

For each of the following examples, work out which is the better alignment: the right or the left. No additional knowledge is available. The secondary structure of CPISRT or FRCW cannot be predicted reliably.

Question 1: In each case, which is the better alignment, right or left (and why)? The first four are not so difficult, but after that...

CPISRTWASIFRCW    CPISRTWASIFRCW
CPISRT---LFRCW    CPISRTL---FRCW

CPISRTRASEFRCW    CPISRTRASEFRCW
CPISRTK---FRCW    CPISRT---KFRCW

CPISRTIASNFRCW    CPISRTIASNFRCW
CPISRTH---FRCW    CPISRT---HFRCW

CPISRTEASDFRCW    CPISRTEASDFRCW
CPISRT---NFRCW    CPISRTN---FRCW

CPISRTSASIFRCW    CPISRTSASIFRCW
CPISRT---TFRCW    CPISRTT---FRCW

CPISRTGASIFRCW    CPISRTGASIFRCW
CPISRTA---FRCW    CPISRT---AFRCW

CPISRTEASNFRCW    CPISRTEASNFRCW
CPISRTQ---FRCW    CPISRT---QFRCW

CPISRTFASTFRCW    CPISRTFASTFRCW
CPISRT---YFRCW    CPISRTY---FRCW
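In each of these pairs the two candidate alignments differ in only a few columns, so the question reduces to which residue pairing you find more plausible. The following sketch lists the columns in which two candidates disagree. The function name differing_columns is a hypothetical helper, and the code assumes that all three rows have the same length and that the top row has no gaps.

```python
def differing_columns(top, left_bottom, right_bottom):
    """List the columns where two candidate alignments of the same
    bottom sequence pair the top residue differently.

    Returns tuples (column index, top residue, left partner, right partner);
    a partner of '-' means the top residue faces a gap.
    """
    if not (len(top) == len(left_bottom) == len(right_bottom)):
        raise ValueError("all rows must have the same length")
    return [
        (k, top[k], left_bottom[k], right_bottom[k])
        for k in range(len(top))
        if left_bottom[k] != right_bottom[k]
    ]


# The first pair of Question 1.
for column in differing_columns("CPISRTWASIFRCW",
                                "CPISRT---LFRCW",
                                "CPISRTL---FRCW"):
    print(column)
```

The helper does not judge anything; it only shows which pairings you have to compare using what you know about the residues involved.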

Answer

EU name: SECSTR

(From: ../EUDIR ) (Date: Aug 24 2016 ../EUDI)

Secondary structure prediction

Sometimes the secondary structure of one or more of the sequences is known. This can either be the secondary structure as derived from a PDB file (which holds 3D coordinates) or it can be a predicted secondary structure. In this section we will look at a coarse way to predict secondary structure. In the sections thereafter we will use predicted secondary structure characteristics to make better alignments. Before we use this information let's look at some aspects of secondary structure. You know that secondary structure elements fall in four categories: helix, strand, turn, and the rest. If you look at the Chou and Fasman parameters (and other useful data) you will see that there is a relation between residue type and secondary structure. As always in bioinformatics, the rules suggested by these parameters aren't exactly hard and fast, and exceptions abound. Nonetheless, they do make some sense, so we shall study them.
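To get a feeling for how such parameters are used, here is a deliberately coarse sketch in Python. It is not the full Chou-Fasman procedure (which works with nucleation windows and extension rules); it only averages helix and strand propensities over a short fragment. The numbers are the values commonly quoted from Chou and Fasman's 1978 table and are an assumption here: the table linked from this course may differ. The function name coarse_class and the threshold of 1.0 are also assumptions.

```python
# Commonly quoted Chou-Fasman propensities (assumed; check against the course table).
HELIX = {"E": 1.51, "M": 1.45, "A": 1.42, "L": 1.21, "K": 1.16, "F": 1.13,
         "Q": 1.11, "W": 1.08, "I": 1.08, "V": 1.06, "D": 1.01, "H": 1.00,
         "R": 0.98, "T": 0.83, "S": 0.77, "C": 0.70, "Y": 0.69, "N": 0.67,
         "P": 0.57, "G": 0.57}
STRAND = {"V": 1.70, "I": 1.60, "Y": 1.47, "F": 1.38, "W": 1.37, "L": 1.30,
          "C": 1.19, "T": 1.19, "Q": 1.10, "M": 1.05, "R": 0.93, "N": 0.89,
          "H": 0.87, "A": 0.83, "S": 0.75, "G": 0.75, "K": 0.74, "P": 0.55,
          "D": 0.54, "E": 0.37}


def mean_propensities(fragment):
    """Average helix and strand propensity over a fragment."""
    helix = sum(HELIX[res] for res in fragment) / len(fragment)
    strand = sum(STRAND[res] for res in fragment) / len(fragment)
    return helix, strand


def coarse_class(fragment, threshold=1.0):
    """Very rough call: 'helix', 'strand' or 'neither' for the whole fragment."""
    helix, strand = mean_propensities(fragment)
    if max(helix, strand) <= threshold:
        return "neither"
    return "helix" if helix > strand else "strand"
```

A fragment-wide average cannot find a helix followed by a strand, and it will happily call a loop a helix if it contains a few strong helix formers, so treat its output as a starting point for your own reasoning and not as an answer.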

Supplemental material

Question 2: Using the Chou-Fasman parameters, predict the secondary structure of the following sequences:

ALMEILAQAARA
ELMKIAQLAKRGP
SNPAELLQALMKGS
TVEITFKI
VVICETTWYVEVT
VTITVEGPKITVE
SRGGEPTRHEAKE
ELLALKLLTVTVT (a loop/turn of at least one residue is needed between helix and strand)

Answer

Question 3: And now, using everything you have learned so far, select from each of these pairs the better helix:

ALQLNMQAKALL
ANQLLMQAAKLL

ARAAEALLQAAE
AEAAEALLQAAK

ALLLAALLLAL
AAEALAKALLR

Answer

Question 4: And now, using everything you have learned so far, select from each of these pairs the better strands:

VVKISVTIKSG
LLKISLTIILI

VVTTVVTTVVTT
VTVTVTVTVTVT

VVICFFWIIFVI
VKICFKSIYVRI

VKITFEITVEIR
IRVTWRGTINIE

Answer

EU name: STRALI

(From: ../EUDIR ) (Date: Aug 24 2016 ../EUDI)

The role of structure in sequence alignment

Look at the sequences:

S G V S P D Q L A A L K L I L E L A L K
G T S L E T A L L M Q I A Q K L I A G

In both cases it is clear that the left part of the sequence does not have a regular structure, whereas the right part is helical. That is the additional information available: an easy-to-predict secondary structure. By shifting the sequences back and forth, we can find several alignments, some reasonable and some poor. But we know both contain the N-terminus of a helix. So let's find the ends of the helices. For that we go back to the table:

         -4   -3   -2   -1    1    2    3    4    5  total
          -    -    -    -    H    H    H    H    H
  ALA    143  148   99   58  189  205  187  241  268 1538
  CYS     24   31   29   22   14   17   18   33   17  205
  ASP     98  110  121  260   98  197  167   49   86 1186
  GLU     91  100   71   71  152  287  269   70  147 1258
  PHE     53   70   90   29   68   46   49  107   65  577
  GLY    207  246  166  192   96  127   99   65   60 1258
  HIS     48   50   39   46   28   36   38   24   30  339
  ILE     94   81  133   19   79   45   68  161   99  779
  LYS     99   98   80   46   98  105   69   80  154  829
  LEU    105  111  188   50  140   84  113  281  209 1281
  MET     37   20   51   13   26   22   54   61   67  351
  ASN    103   83   89  206   46   62   55   37   77  758
  PRO    143  136  121   99  240   78   40    0    0  857
  GLN     48   58   40   38   83   93  124   76  101  661
  ARG     82   63   59   51   71   75   61  114  109  685
  SER    112  128   98  292  105  126   99   48   76 1084
  THR    106   99  119  253   91   80  115   72   67 1002
  VAL    141  107  132   37  117   74  120  208  120 1056
  TRP     29   25   29   14   30   26   28   30   29  240
  TYR     66   65   75   33   58   44   56   72   48  517
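As an aside, the counts in this table can also be used in a small program. The sketch below is one possible way to do that and is not the by-hand procedure of this course: it converts each count into the fraction of that residue type's observations found at that position, and scores a candidate helix start by the mean fraction over the window -4 to 5. The names cap_score and best_helix_start and the choice of a mean as score are assumptions made for illustration.

```python
# Counts copied from the table above; columns are positions -4,-3,-2,-1,1,2,3,4,5.
CAP_COUNTS = {
    "A": [143, 148, 99, 58, 189, 205, 187, 241, 268],
    "C": [24, 31, 29, 22, 14, 17, 18, 33, 17],
    "D": [98, 110, 121, 260, 98, 197, 167, 49, 86],
    "E": [91, 100, 71, 71, 152, 287, 269, 70, 147],
    "F": [53, 70, 90, 29, 68, 46, 49, 107, 65],
    "G": [207, 246, 166, 192, 96, 127, 99, 65, 60],
    "H": [48, 50, 39, 46, 28, 36, 38, 24, 30],
    "I": [94, 81, 133, 19, 79, 45, 68, 161, 99],
    "K": [99, 98, 80, 46, 98, 105, 69, 80, 154],
    "L": [105, 111, 188, 50, 140, 84, 113, 281, 209],
    "M": [37, 20, 51, 13, 26, 22, 54, 61, 67],
    "N": [103, 83, 89, 206, 46, 62, 55, 37, 77],
    "P": [143, 136, 121, 99, 240, 78, 40, 0, 0],
    "Q": [48, 58, 40, 38, 83, 93, 124, 76, 101],
    "R": [82, 63, 59, 51, 71, 75, 61, 114, 109],
    "S": [112, 128, 98, 292, 105, 126, 99, 48, 76],
    "T": [106, 99, 119, 253, 91, 80, 115, 72, 67],
    "V": [141, 107, 132, 37, 117, 74, 120, 208, 120],
    "W": [29, 25, 29, 14, 30, 26, 28, 30, 29],
    "Y": [66, 65, 75, 33, 58, 44, 56, 72, 48],
}


def cap_score(seq, start):
    """Mean per-residue fraction when helix position 1 is seq[start] (0-based).

    Window positions that fall off the N-terminus are skipped.
    """
    fractions = []
    for column in range(9):
        index = start - 4 + column  # column 4 corresponds to helix position 1
        if 0 <= index < len(seq):
            counts = CAP_COUNTS[seq[index]]
            fractions.append(counts[column] / sum(counts))
    return sum(fractions) / len(fractions)


def best_helix_start(seq):
    """0-based index of the best-scoring helix start (positions 1..5 must fit)."""
    candidates = range(0, len(seq) - 4)
    return max(candidates, key=lambda start: cap_score(seq, start))
```

Calling best_helix_start on a sequence written without spaces returns a 0-based index, so add one to compare it with your own by-hand result. A mean of fractions is only one of many possible scores; a log-odds score would be a reasonable alternative.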

We now use this table and indicate preferred positions of residues relative to the first position of the helix.

 S G V S P D Q L A A L K L I L E L A L K
-1-4-4-1-4-1 3-2 1 1-2 2
  -3-2  -3 2 5 1 2 2 1 5
     4  -2 3   4 3 3 4
         1     5 4 4 5
                 5 5
 G T S L E T A L L M Q I A Q K L I A G
-4-1-1-2 2-1 1-2
-3 3   1 3 3 2 1
       4     3 4
       5     4 5
             5

So, the optimal path that puts each residue as much as possible at its preferred position is (the position assigned to each residue is written below it):

 S G V S P D Q L A A L K L I L E L A L K
-4-3-2-1 1 2 3 4 5
 G T S L E T A L L M Q I A Q K L I A G
-3-2-1 1 2 3 4 5

So, in the top sequence the helix starts with PDQ and in the bottom sequence with LET. In total only two residues are not optimally happy with this arrangement (which two, and why isn't this so bad after all?). The optimal alignment thus must be:

S G V S P D Q L A A L K L I L E L A L K
- G T S L E T A L L M Q I A Q K L I A G

And that would be difficult to find using an alignment program. Clustal will do it right, but with only three identities it would be unhappy with the result. Checking these helix cap propensities gives you much confidence in this alignment.
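The number of identities in such a hand-made alignment is easy to check. A minimal sketch, assuming the two rows are written without spaces, have equal length, and use '-' for gaps; the function name count_identities is a hypothetical helper:

```python
def count_identities(row_a, row_b):
    """Count columns in which both rows hold the same residue (gaps never count)."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have the same length")
    return sum(1 for a, b in zip(row_a, row_b) if a == b and a != "-")


top = "SGVSPDQLAALKLILELALK"
bottom = "-GTSLETALLMQIAQKLIAG"
print(count_identities(top, bottom))  # prints 3
```

Three identities over twenty columns is far too few to trust on sequence grounds alone, which is why the helix cap preferences carry the argument here.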

Question 5: Using the same ideas as the example given on the web-site just above this question, align:

NHSGPPSTSGPAQLLAKALEIALK
PGISAEMVALKALLEALQALELLLR

PS: no need to try Clustal on this one; it will get it wrong!

Answer

EU name: STRAL2

(From: ../EUDIR ) (Date: Oct 6 2017 ../EUDI)

The role of structure in sequence alignment (2)

In the previous example we used the predicted N-terminal start of a helix as additional information, and we had to find out the relations between residue types and secondary structure to predict this helix end. In the next example we will use the predicted secondary structure of two β-hairpins to aid with their alignment.

First, try to align the sequences TCTVTSNSITCT (A) and TCTVSTCT (B).
When using only the sequence information this may seem straightforward.

But how would you align A and B if not only the sequence but also the structure of A was known? Or, in other words: "Let's pretend that you want to predict the structure of B, with all the information you have on A."

Two possible alignments, B1 and B2 are shown below:

A   TCTVTSNSITCT    A   TCTVTSNSITCT
B1  TCTVS----TCT    B2  TCTV----STCT

Question 6:

Align the sequences A and B (see the text above) by hand, or with Clustal. Explain the result.
The bridge between the two cysteines forces the gap to fall between the two Cs. (Why?)
Think about the physico-chemical properties of S, T and I, and based on that decide which is the better alignment, red (B1) or green (B2).
Draw a schematic β-hairpin.
Indicate which residues stick out into the solvent, and which stick into the protein core.
Map sequence A (TCTVTSNSITCT) onto this hairpin.
Which residues are direct neighbours across the strand?
Which residues can be "removed" without leaving a big gaping hole?
Can you explain why the green (B2) alignment is better?
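The bookkeeping behind these steps can be sketched in a few lines of Python. The sketch assumes an idealised two-stranded antiparallel hairpin in which the two bridged cysteines sit directly opposite each other, so that residue a faces residue i + j - a when i and j are the cysteine positions. That geometry and the function names are assumptions made for illustration; a real hairpin can have a different register.

```python
def cross_strand_pairs(seq, i, j):
    """Pairs (a, b) of 0-based positions facing each other in an idealised
    hairpin where positions i and j are direct cross-strand neighbours."""
    total = i + j
    pairs = []
    for a in range(len(seq)):
        b = total - a
        if a < b < len(seq):
            pairs.append((a, b))
    return pairs


def gapped_positions(aligned_row):
    """Columns holding a gap; these equal positions in A when A has no gaps."""
    return {k for k, char in enumerate(aligned_row) if char == "-"}


def removes_whole_pairs(seq, i, j, deleted):
    """True if every cross-strand pair is either kept or deleted as a whole."""
    return all((a in deleted) == (b in deleted)
               for a, b in cross_strand_pairs(seq, i, j))


A = "TCTVTSNSITCT"   # cysteines at 0-based positions 1 and 10
for name, row in (("B1", "TCTVS----TCT"), ("B2", "TCTV----STCT")):
    print(name, removes_whole_pairs(A, 1, 10, gapped_positions(row)))
```

An alignment that deletes only complete cross-strand pairs shortens the hairpin without leaving one residue facing a hole, which is the structural argument the questions above lead to.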

Answer

One last example:

Question 7: Predict the secondary structure of the following two sequences:

CWEALALLAELALAAMKGSTPNGS
CWEALALLLEALMRGTTPNGG

Align these two sequences by taking into account the secondary structure prediction. Then make an alignment as you would expect a computer program to do it. Which of these two alignments is better and why?

Answer

So, if it is not clear by now that structural knowledge can help to fine-tune the alignment, you are in trouble...