After completing the "Aligning" section you will: |
The most powerful weapon in the bioinformaticist's armoury is sequence alignment. Why?
Let's think about an alignment. It is a representation of a whole series of evolutionary events, which left traces in the sequences. Things that are more likely to happen during evolution (mutation of an asparagine or a serine, conservation of a tryptophan or a cysteine bridge) should be most prominently observed in your alignment.
What kind of things are important? Let's give a few examples:
We shall now start working on sequence alignments. We shall steadily add one rule after another, and learn a few new physicochemical properties of amino acids at the same time.
We already discussed that there are two kinds of sequence alignments. The one kind tries to align those residues that have a common ancestor. The other kind tries to align those residues that fall on top of each other when the corresponding structures get superposed three-dimensionally.
In this course we are mainly interested in this latter type of alignment. Obviously, if two residues sit at similar positions in similar structures, they are likely to have similar physicochemical properties. So, let's start using everything we learned about amino acids in some "real" alignments.
As we are bioinformaticians, we are not just going to run an alignment program and look at the result. No, we are going to think about it and use all kinds of additional information. We can recognize several levels of sophistication in the information we can use:
For each of the following examples, work out which is the better alignment: the right or the left. No additional knowledge is available. The secondary structure of CPISRT or FRCW cannot be predicted reliably.
Question 1: In each case, which is the better alignment, the left or the right (and why)? The first four are not so difficult, but after that....
Left              Right
CPISRTWASIFRCW    CPISRTWASIFRCW
CPISRT---LFRCW    CPISRTL---FRCW

CPISRTRASEFRCW    CPISRTRASEFRCW
CPISRTK---FRCW    CPISRT---KFRCW

CPISRTIASNFRCW    CPISRTIASNFRCW
CPISRTH---FRCW    CPISRT---HFRCW

CPISRTEASDFRCW    CPISRTEASDFRCW
CPISRT---NFRCW    CPISRTN---FRCW

CPISRTSASIFRCW    CPISRTSASIFRCW
CPISRT---TFRCW    CPISRTT---FRCW

CPISRTGASIFRCW    CPISRTGASIFRCW
CPISRTA---FRCW    CPISRT---AFRCW

CPISRTEASNFRCW    CPISRTEASNFRCW
CPISRTQ---FRCW    CPISRT---QFRCW

CPISRTFASTFRCW    CPISRTFASTFRCW
CPISRT---YFRCW    CPISRTY---FRCW
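One way to make the comparison concrete is to score each candidate alignment column by column. The sketch below is only an illustration: the residue groups and the 2/1/0 scores are hypothetical choices made for this example, not values from the course, and such a coarse scheme will not settle the harder pairs.

```python
# Hypothetical, deliberately coarse residue groups (an assumption for this sketch).
GROUPS = ["AVLIM", "FWY", "STNQ", "KRH", "DE", "C", "G", "P"]


def group_of(residue):
    """Return the index of the group that contains the residue."""
    for index, members in enumerate(GROUPS):
        if residue in members:
            return index
    raise ValueError(f"unknown residue: {residue!r}")


def column_score(a, b):
    """Score one column: 2 for identity, 1 for same group, 0 otherwise."""
    if a == "-" or b == "-":
        return 0  # gaps cost the same in both candidates, so they are ignored
    if a == b:
        return 2
    return 1 if group_of(a) == group_of(b) else 0


def alignment_score(top, bottom):
    """Sum the column scores of two equally long, already aligned strings."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    return sum(column_score(a, b) for a, b in zip(top, bottom))


# First pair of Question 1: is L better placed under I (left) or under W (right)?
left = alignment_score("CPISRTWASIFRCW", "CPISRT---LFRCW")
right = alignment_score("CPISRTWASIFRCW", "CPISRTL---FRCW")
print(left, right)
```

With these assumed groups the left alignment scores one point higher, because L and I fall in the same group while L and W do not. The interesting cases are the ones where such a simple table gives a tie or the wrong answer.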
Answer
Sometimes the secondary structure of one or more of the sequences is known. This can either be the secondary structure as derived from a PDB file (which holds 3D coordinates) or it can be a predicted secondary structure. In this section we will look at a coarse way to predict secondary structure. In the sections thereafter we will use predicted secondary structure characteristics to make better alignments.
Before we use this information, let's look at some aspects of secondary structure. You know that secondary structure elements fall into four categories: helix, strand, turn, and the rest. If you look at the Chou and Fasman parameters (and other useful data) you will see that there is a relation between residue type and secondary structure. As always in bioinformatics, the rules suggested by these parameters aren't exactly hard and fast, and exceptions abound. Nonetheless, they do make some sense, so we shall study them.
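To see how such parameters can be used, here is a minimal sketch that averages helix and strand propensities over a short peptide. The numbers are the values commonly quoted from Chou and Fasman (1978) and are an assumption here; check them against the course's own table. The function names and the cut-off of 1.0 are hypothetical, and this crude average is not the full Chou-Fasman procedure (it has no nucleation or extension rules).

```python
# Helix propensities, as commonly quoted from Chou and Fasman (1978); assumed values.
HELIX = {
    "E": 1.51, "M": 1.45, "A": 1.42, "L": 1.21, "K": 1.16, "F": 1.13, "Q": 1.11,
    "W": 1.08, "I": 1.08, "V": 1.06, "D": 1.01, "H": 1.00, "R": 0.98, "T": 0.83,
    "S": 0.77, "C": 0.70, "Y": 0.69, "N": 0.67, "P": 0.57, "G": 0.57,
}

# Strand propensities from the same source; assumed values.
STRAND = {
    "V": 1.70, "I": 1.60, "Y": 1.47, "F": 1.38, "W": 1.37, "L": 1.30, "C": 1.19,
    "T": 1.19, "Q": 1.10, "M": 1.05, "R": 0.93, "N": 0.89, "H": 0.87, "A": 0.83,
    "S": 0.75, "G": 0.75, "K": 0.74, "P": 0.55, "D": 0.54, "E": 0.37,
}


def mean_propensity(sequence, table):
    """Average the propensity of every residue in the sequence."""
    return sum(table[residue] for residue in sequence) / len(sequence)


def crude_guess(sequence):
    """Label a short peptide as 'helix', 'strand' or 'neither'."""
    helix = mean_propensity(sequence, HELIX)
    strand = mean_propensity(sequence, STRAND)
    if max(helix, strand) < 1.0:  # hypothetical cut-off: no clear preference
        return "neither"
    return "helix" if helix >= strand else "strand"


for peptide in ("AEAAEALLQAAK", "VVTTVVTTVVTT", "GSTPNGS"):
    print(peptide, crude_guess(peptide))
```

Averages like these only tell you what a stretch of residues tends to prefer. They say nothing about where an element starts or ends, which is exactly the kind of detail the later examples rely on.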
Supplemental material NOTE
Question 2: Using the Chou-Fasman parameters, predict the secondary structure of the following sequences:
Answer
Question 3: And now, using everything you have learned so far, select from each of these pairs the better helix:
ALQLNMQAKALL ANQLLMQAAKLL |
ARAAEALLQAAE AEAAEALLQAAK |
ALLLAALLLAL AAEALAKALLR |
Answer
Question 4: And now, using everything you have learned so far, select from each of these pairs the better strand:
VVKISVTIKSG LLKISLTIILI |
VVTTVVTTVVTT VTVTVTVTVTVT |
VVICFFWIIFVI VKICFKSIYVRI |
VKITFEITVEIR IRVTWRGTINIE |
Answer
EU name: STRALI
(From: ../EUDIR )
(Date: Aug 24 2016 ../EUDI)
Look at the sequences:
S G V S P D Q L A A L K L I L E L A L K
G T S L E T A L L M Q I A Q K L I A G
In both cases it is clear that the left part of the sequence does not have a regular structure, whereas the right part is helical. That is the additional information available here: a secondary structure that is easy to predict. By shifting the sequences back and forth, we can find several alignments of varying quality. But we know that both sequences contain the N-terminus of a helix. So let's find where the helices start. For that we go back to the table:
       -4   -3   -2   -1    1    2    3    4    5  total
        -    -    -    -    H    H    H    H    H
ALA   143  148   99   58  189  205  187  241  268   1538
CYS    24   31   29   22   14   17   18   33   17    205
ASP    98  110  121  260   98  197  167   49   86   1186
GLU    91  100   71   71  152  287  269   70  147   1258
PHE    53   70   90   29   68   46   49  107   65    577
GLY   207  246  166  192   96  127   99   65   60   1258
HIS    48   50   39   46   28   36   38   24   30    339
ILE    94   81  133   19   79   45   68  161   99    779
LYS    99   98   80   46   98  105   69   80  154    829
LEU   105  111  188   50  140   84  113  281  209   1281
MET    37   20   51   13   26   22   54   61   67    351
ASN   103   83   89  206   46   62   55   37   77    758
PRO   143  136  121   99  240   78   40    0    0    857
GLN    48   58   40   38   83   93  124   76  101    661
ARG    82   63   59   51   71   75   61  114  109    685
SER   112  128   98  292  105  126   99   48   76   1084
THR   106   99  119  253   91   80  115   72   67   1002
VAL   141  107  132   37  117   74  120  208  120   1056
TRP    29   25   29   14   30   26   28   30   29    240
TYR    66   65   75   33   58   44   56   72   48    517
We now use this table and indicate preferred positions of residues relative to the first position of the helix.
S G V S P D Q L A A L K L I L E L A L K
-1-4-4-1-4-1 3-2 1 1-2 2 -3-2 -3 2 5 1 2 2 1 5 4 -2 3 4 3 3 4 1 5 4 4 5 5 5
G T S L E T A L L M Q I A Q K L I A G
-4-1-1-2 2-1 1-2 -3 3 1 3 3 2 1 4 3 4 5 4 5 5
So, the optimal path, which puts each residue as close as possible to its preferred position, is:
S G V S P D Q L A A L K L I L E L A L K
-1-4-4-1-4-1 3-2 1 1-2 2 -3-2 -3 2 5 1 2 2 1 5 4 -2 3 4 3 3 4 1 5 4 4 5 5 5
G T S L E T A L L M Q I A Q K L I A G
-4-1-1-2 2-1 1-2 -3 3 1 3 3 2 1 4 3 4 5 4 5 5
So, in the top sequence the helix starts with PDQ and in the bottom sequence with LET. In total only two residues are not optimally happy with this arrangement (which two, and why isn't this so bad after all?). The optimal alignment thus must be:
S G V S P D Q L A A L K L I L E L A L K
- G T S L E T A L L M Q I A Q K L I A G
And that would be difficult to find using an alignment program. Clustal will do it right, but with only three identities it would be unhappy with the result. Checking these helix cap propensities gives you considerably more confidence in this alignment.
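The reasoning of this example can be mimicked in a few lines of code. The sketch below copies the counts from the table above and, for every possible helix start, adds up the counts of the residues that would then sit at positions -4 to 5. Summing raw counts is a simplifying assumption made for this sketch (it is not necessarily how the preferred positions above were derived), and the function names are hypothetical.

```python
POSITIONS = [-4, -3, -2, -1, 1, 2, 3, 4, 5]

# Counts per residue type at each position, copied from the table above.
COUNTS = {
    "A": [143, 148, 99, 58, 189, 205, 187, 241, 268],
    "C": [24, 31, 29, 22, 14, 17, 18, 33, 17],
    "D": [98, 110, 121, 260, 98, 197, 167, 49, 86],
    "E": [91, 100, 71, 71, 152, 287, 269, 70, 147],
    "F": [53, 70, 90, 29, 68, 46, 49, 107, 65],
    "G": [207, 246, 166, 192, 96, 127, 99, 65, 60],
    "H": [48, 50, 39, 46, 28, 36, 38, 24, 30],
    "I": [94, 81, 133, 19, 79, 45, 68, 161, 99],
    "K": [99, 98, 80, 46, 98, 105, 69, 80, 154],
    "L": [105, 111, 188, 50, 140, 84, 113, 281, 209],
    "M": [37, 20, 51, 13, 26, 22, 54, 61, 67],
    "N": [103, 83, 89, 206, 46, 62, 55, 37, 77],
    "P": [143, 136, 121, 99, 240, 78, 40, 0, 0],
    "Q": [48, 58, 40, 38, 83, 93, 124, 76, 101],
    "R": [82, 63, 59, 51, 71, 75, 61, 114, 109],
    "S": [112, 128, 98, 292, 105, 126, 99, 48, 76],
    "T": [106, 99, 119, 253, 91, 80, 115, 72, 67],
    "V": [141, 107, 132, 37, 117, 74, 120, 208, 120],
    "W": [29, 25, 29, 14, 30, 26, 28, 30, 29],
    "Y": [66, 65, 75, 33, 58, 44, 56, 72, 48],
}


def start_score(sequence, start):
    """Sum the table counts, assuming the helix begins at index `start`."""
    total = 0
    for column, position in enumerate(POSITIONS):
        # Position 1 is the first helical residue; there is no position 0.
        offset = position - 1 if position > 0 else position
        index = start + offset
        if 0 <= index < len(sequence):  # skip positions that fall off the ends
            total += COUNTS[sequence[index]][column]
    return total


def best_helix_start(sequence):
    """Return the index whose surrounding residues fit the table best."""
    return max(range(len(sequence)), key=lambda s: start_score(sequence, s))


def align_on_helix_start(top, bottom):
    """Pad with gaps so that the two predicted helix starts coincide."""
    shift = best_helix_start(top) - best_helix_start(bottom)
    top_padded = "-" * max(-shift, 0) + top
    bottom_padded = "-" * max(shift, 0) + bottom
    width = max(len(top_padded), len(bottom_padded))
    return top_padded.ljust(width, "-"), bottom_padded.ljust(width, "-")


top = "SGVSPDQLAALKLILELALK"
bottom = "GTSLETALLMQIAQKLIAG"
for line in align_on_helix_start(top, bottom):
    print(line)
```

For these two sequences the highest sums fall on the P of PDQ and the L of LET, so the sketch reproduces the alignment shown above. Raw counts favour abundant residues such as Ala and Leu, so for other sequences it would be worth dividing each count by the row total first.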
Question 5: Using the same ideas as the example given on the web-site just above this question, align:
NHSGPPSTSGPAQLLAKALEIALK
PGISAEMVALKALLEALQALELLLR
P.S. There is no need to try Clustal on this one; it will get it wrong!
Answer
EU name: STRAL2
(From: ../EUDIR )
(Date: Oct 6 2017 ../EUDI)
In the previous example we used the predicted N-terminal start of a helix as additional information, and we had to find out the relations between residue types and secondary structure to predict this helix end. In the next example we will use the predicted secondary structure of two β-hairpins to aid with their alignment.
First, try to align the sequences TCTVTSNSITCT (A) and TCTVSTCT (B).
When using only the sequence information this may seem straightforward.
But how would you align A and B if not only the sequence but also the structure of A was known? Or, in other words: "Let's pretend that you want to predict the structure of B, with all the information you have on A."
Two possible alignments, B1 and B2, are shown below:
A   TCTVTSNSITCT    A   TCTVTSNSITCT
B1  TCTVS----TCT    B2  TCTV----STCT
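A quick count shows why sequence information alone cannot choose between B1 and B2. The helper below is a minimal sketch with a hypothetical name; it simply counts identical residues in two aligned strings.

```python
def count_identities(top, bottom):
    """Count columns in which both aligned sequences carry the same residue."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    return sum(1 for a, b in zip(top, bottom) if a == b and a != "-")


a = "TCTVTSNSITCT"
b1 = "TCTVS----TCT"
b2 = "TCTV----STCT"
print(count_identities(a, b1), count_identities(a, b2))
```

Both alignments have the same number of identities and the same gap length, so any preference has to come from what is known about the structure of A.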
Question 6:
Answer
One last example:
Question 7: Predict the secondary structure of the following two sequences:
CWEALALLAELALAAMKGSTPNGS
CWEALALLLEALMRGTTPNGG
Align these two sequences by taking into account the secondary structure prediction. Then make an alignment as you would expect a computer program to do it. Which of these two alignments is better and why?
Answer
So, if it is not clear by now that structural knowledge can help to fine-tune the alignment, you are in trouble...