Structure Function Bioinformatics

Force fields

We have already discussed MD a bit. You can find a summary of the MD seminar-like movie in the seminars list.

Today we are going to think about making a force field ourselves.

All atoms attract each other. Period. That looks a very wrong remark, but it isn't. There is an attractive force between all pairs of atoms. However, there are additional, repulsive, forces that can be stronger in certain distance ranges. Repulsion is seen between equally charged atoms, or when two atoms come too close to each other. Although it is not correct, it is generally assumed that the attractive force behaves as 1/R⁶ with the inter atomic distance R and the repulsive force at short distances goes with 1/R¹² (actually these are potential energies, and not forces, but people in the Molecular dynamics field like to call a set of energies a force field; don't ask me why). These two effects have an opposite sign, which results in the plot shown at the below.

Figure 51. The atoms seem to be happiest at a certain distance away from each other. Closer to each other their happiness decreases very rapidly (with 1/R¹²) and further away their happiness decreases too, but only with 1/R⁶.

Knowledge of the distance of optimal happiness is important for our understanding of protein structures. The distances of optimal happiness is generally referred to as R₀.

Question 51: Go back to the seminar on molecular dynamics. Find all formulas that deal with the relation between geometric relations between atoms and their contribution to the overall energy of the protein. Write down the five sub-parts of the formula that every molecular dynamics program seems to have included in the source code.
Which (part of) the force field equations describes figure 51.
Look at the formula for figure 51 again. What is the Energy value when I fill in 1000.0 Ångström for rij?

Answer

A big experiment. We will determine a force field (parameter).

In this experiment, we will all together determine one parameter of a force field, namely the distance between the donor and the acceptor in a hydrogen bond.

There are many ways of doing this. Please think and come up with at least three really different methods.

A crude method that is doable within this course is to estimate the most occurring short distance between a pair of atoms. I.e. if you measure for an inter-atomic distance between two carbons: 3.93 3.95 4.40 3.98 4.02 3.81 3.94 4.00 3.95 and 3.94 it is reasonable to assume that the corresponding R₀ is around 3.94.

The fact that you also find longer distances is simply the result of the fact that there are more atoms in a protein than just those two you are looking at. Not all atoms and groups can have optimal pairwise packing.

Each team should take one protein structure that was solved by X-ray with a resolution of 1.2 Ångström or better.

Question 52: How do you find structures solved at better than 1.2 Ångström resolution (this is not simple, but MRS can do it, fight with the MRS help...)?
Why can we not use NMR structures?
Why should we use structures solved at 1.2 Ångström or better?

Answer

Tabulate at least 10 close donor-acceptor distances. Close is defined as clearly less than 4.5 Ångström. Try not to find the closest or the least close distances. Just take any you can find. Your choice of atom pairs will influence the final outcome. So, make sure you think about this problem before blindly starting. Neither take any β-strand - β-strand helices nor take any intra-helical hydrogen bonds (why not, actually?).

When done, ask some computer guru to make a histogram with bins of 0.1 Ångström and add up in each bin the numbers of H-bonds found in total by all groups in the course. Think about the curve that you expect. Think about a way to determine R₀ from the curve.

Question 53:

Think of a few very different ways to arrive at R₀ values.
Summarize the algorithm we chose here, in this experiment.

Answer

Supplemental material

Non-molecular force fields

Force fields are everywhere. In the movie The Elevator, an intelligent elevator goes beserk. So obviously the software designers for this elevator went one step to far. The idea to make an elevator smart so that it spends less electricity and gets people faster where they want to be is of course based on a force field.

Question 55: How would you design the software for the elevator in the Radboud parking house?
Why is this a much more difficult task than makeing a force field to predict the secondary structure of proteins?

Answer

And now for the real thing.

Lets make a force field to predict transmembrane helices in G protein-coupled receptors (GPCRs).

To do so we need data to set-up the force field. Normally, we would take a large set of membrane proteins with know structure, but as this practical is only meant to illustrate the concepts, we will use just one protein to generate the data. This protein is an opsin, so for the purists, we are going to design a computer program to predict transmembrane helices in opsins. But we will nevertheless try it out on 'something else' for validation purposes.

So, today we will not only design and implement a force field to predict transmembrane helices in GPCRs, but we will also test our force field. To achieve this double goal, we will use data only from one GPCR to set-up the force field, and we use a totally different GPCR to see how well the force field works.

But first a short intro in GPCRs:

Figure 52. The structure of opsin; the first GPCR for which the structure was solved. Nearly all GPCRs have a series of residues in common. The side chains of these conserved residues are shown in this figure. These conserved residues allow us to align GPCR sequences very easily. In the exercise files you find the YASARA scene opsin.sce. This scene produced this figure, and in this scene all residues have the same numbers as in the GPCRDB.

In the box below, you find the secondary structure determination of the opsin file shown in figure 52. In this box the conserved residues are numbered for easy reference.

                                                                 130
                                                                   |
    1 -   60 MNGTEGPNFYVPFSNKTGVVRSPFEAPQYYLAEPWQFSMLAAYMFLLIMLGFPINFLTLY
    1 -   60   TSS TT SSTT  TTT    TTTT  333T HHHHHHHHHHHHHHHHHHHHHHHHHHH
 
                             220 224                        305
                               |   |                          |
   61 -  120 VTVQHKKLRTPLNYILLNLAVADLFMVFGGFTTTLYTSLHGYFVFGPTGCNLEGFFATLG
   61 -  120 HHHHTTT  THHHHHHHHHHHHHHHHHHHTHHHHHHHHHHTT TTHHHHHHHHHHHHHHH
 
                         340                       420
                           |                         |
  121 -  180 GEIALWSLVVLAIERYVVVCKPMSNFRFGENHAIMGVAFTWVMALACAAPPLVGWSRYIP
  121 -  180 HHHHHHHHHHHHHHHHHHHT  TTT    HHHHHHHHHHHHHHHHHHHTHHHHTTT SSS
 
                                             520     528
                                               |       |
  181 -  235 EGMQCSCGIDYYTPHEETNNESFVIYMFVVHFIIPLIVIFFCYGQLVFTVKEAAA
  181 -  235 STTTTSSSS  T   TTTTHHHHHHHHHHHTTHHHHHHHHHHHHHTTTT  T
 
                                      620
                                        |
  236 -  295 SATTQKAEKEVTRMVIIMVIAFLICWLPYAGVAFYIFTHQGSDFGPIFMTIPAFFAKTSA
  236 -  295   TTTTHHHHHHHHHTHHHHHHHHHHHHHHHHHHHHHHTTT    HHHHHHHHHHHHHHH
 
              730
                |
  296 -  323 VYNPVIYIMMNKQFRNCMVTTLCCGKNPSTTVSKTETSQVAPA
  296 -  323 HHHHHHHHHH HHHHHHHHHHHTTT    TTTTT

Question 56: Do the helices that you see with YASARA, and the helices indicated in the box agree? And if they don't, why?

Answer

The GPCR force field

And now the important part. Work in groups of three or four students, and come up with a plan for the force field design. The elements will be:
1) Which data do you need?
2) What will be your null-model? In other words, which two things are you going to compare?
3) How are you going to get the data?
4) How are you going to make the counting easy? (We are not going to manually count the thousands of residues we need to get a reliable answer).
5) Which formula are you going to use to convert the data into something that can be used predictive?
6) What should the software look like that does the prediction in the end?

When working on this project, discuss frequently with the assistant about what you are doing!

All the counting has already been done for you, albeit that the counting results have a somewhat clumsy format: the HSSP format. HSSP files consist of a header:

HSSP       HOMOLOGY DERIVED SECONDARY STRUCTURE OF PROTEINS , VERSION 1.1 2001
PDBID      1f88
DATE       file generated on 17-May-10
SEQBASE    UniProtKB/Swiss-Prot(20-Apr-2010) UniProtKB/TrEMBL(20-Apr-2010)
PARAMETER  SMIN: -0.5  SMAX:  1.0
PARAMETER  gap-open:  3.0 gap-elongation:  0.1
PARAMETER  conservation weights: YES
PARAMETER  InDels in secondary structure allowed: YES
PARAMETER  alignments sorted according to :DISTANCE
THRESHOLD  according to: t(L)=(290.15 * L ** -0.562) +  5
REFERENCE  Sander C., Schneider R. : Database of homology-derived protein structures. Proteins, 9:56-68 (1991).
CONTACT    Maintained at www.CMBI.ru.nl by Elmar.Krieger.cmbi.umcn.nl
AVAILABLE  Free academic use. Commercial users must apply for license.
Etcetera.

The section that lists all aligned sequences:

## PROTEINS : EMBL/SWISSPROT identifier and alignment statistics
  NR.    ID         STRID   %IDE %WSIM IFIR ILAS JFIR JLAS LALI NGAP LGAP LSEQ2 ACCNUM     PROTEIN
    1 : OPSD_BOVIN  3DQB    1.00  1.00  565  648  243  326   84    0    0  348  P02699     RecName: Full=Rhodopsin;
    2 : OPSD_BOVIN  3DQB    1.00  1.00  342  483    1  142  142    0    0  348  P02699     RecName: Full=Rhodopsin;
    3 : OPSD_BOVIN  3DQB    1.00  1.00    1  235    1  235  235    0    0  348  P02699     RecName: Full=Rhodopsin;
    4 : OPSD_BOVIN  3DQB    1.00  1.00  237  324  240  327   88    0    0  348  P02699     RecName: Full=Rhodopsin;
    5 : OPSD_BOVIN  3DQB    1.00  1.00  485  563  148  226   79    0    0  348  P02699     RecName: Full=Rhodopsin;
    6 : OPSD_SHEEP          0.99  1.00  342  483    1  142  142    0    0  348  P02700     RecName: Full=Rhodopsin;
    7 : OPSD_FELCA          0.99  0.99  342  483    1  142  142    0    0  348  Q95KU1     RecName: Full=Rhodopsin;
    8 : OPSD_PHOGR          0.99  0.99  342  483    1  142  142    0    0  348  O62795     RecName: Full=Rhodopsin;
    9 : OPSD_PHOVI          0.99  0.99  342  483    1  142  142    0    0  348  O62794     RecName: Full=Rhodopsin;
   10 : Q2KND7_URSMA        0.98  0.99  358  483    1  126  126    0    0  322  Q2KND7     SubName: Full=Rhodopsin;
   11 : Q2KND6_PHOHI        0.98  0.99  358  483    1  126  126    0    0  324  Q2KND6     SubName: Full=Rhodopsin;
   12 : OPSD_CANFA          0.98  0.99  342  483    1  142  142    0    0  348  P32308     RecName: Full=Rhodopsin;
Etcetera

The alignment part is important to find the residue numbers back for the helices (I coloured the secondary structure assignments red; a H in this column, obviously, indicates a helix).

## ALIGNMENTS    1 -   70
 SeqNo  PDBNo AA STRUCTURE BP1 BP2  ACC NOCC  VAR  ....:....1....:....2....:....3....:....4....:....5....:....6....:....7
     1    1 A M              0   0  192  292    3    M                        M    M     MMMMM M     MM   MMM    M MMMM
     2    2 A N        +     0   0   72  430   27    N                        N    N     NNNNN N     NN   NNN    N NNNN
     3    3 A G  S    S-     0   0    5  431   33    G                        G    G     GGGGG G     GG   GGG    G GGGG
     4    4 A T  E     -A   11   0A  28  453   23    T                        T    T     TTTTT T     TT   TTT    T TTTT
     5    5 A E  E     -A   10   0A 112  465   21    E                        E    E     EEEEE E     EE   EEE    E EEEE
     6    6 A G        -     0   0   10  474   21    G                        G    G     GGGGG G     GG   GGG    G GGGG
     7    7 A P  S    S+     0   0  108  466   51    P                        P    P     PPPPP P     PP   PPP    P PPPP
     8    8 A N  S    S+     0   0   72  493   51    N                        N    N     NNNNN N     NN   NNN    N DNNN
     9    9 A F        -     0   0   18  503   17    F                        F    F     FFFFF F     FF   FFF    F FFFF
    10   10 A Y  E     -A    5   0A  10  518   39    Y                        Y    Y     YYYYY Y     YY   YYY    Y YYYY
    11   11 A V  E     -A    4   0A   0  547   32    V                        V    V     VVVVV V     VVV  VVV    V IVVV
    12   12 A P  H    S+     0   0    6  525   33    P                        P    P     PPPPP P     PPP  PPP    P PPPP
    13   13 A F  H    S-     0   0   61  605   34    F                        F    F     FFFFF F     FFF  FFF    F MFFF
    14   14 A S  H     -     0   0   45  605   42    S                        S    S     SSSSS S     SSS  SSS    S SSSS
    15   15 A N  H >   +     0   0   68  609   20    N                        N    N     NNNNN N     NNNHHNNN    N NNNN

And the last section holds the profile (see the sequence alignment seminar and practical) from which you can get an idea about the counting statistics.

## SEQUENCE PROFILE AND ENTROPY
 SeqNo PDBNo   V   L   I   M   F   W   Y   G   A   P   S   T   C   H   R   K   Q   E   N   D  NOCC NDEL NINS ENTROPY RELENT WEIGHT
    1    1 A   3   2   0  95   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   292    0    0   0.234      8  0.94
    2    2 A   0   0   0   0   0   0   0   0   0   0   5   0   0   0   3   0   1  11  65  15   430    0    0   1.153     38  0.47
    3    3 A   0  10   0   1   0   0   0  69   0   0  14   0   0   0   0   3   0   0   0   0   431    0    0   1.057     35  0.36
    4    4 A   0   0   0   3   0   0   0   0   0   9   1  78   0   0   0   0   2   4   0   2   453    0    0   0.912     30  0.56
    5    5 A   0   1   0   0   0   0   0   0   0   1   3   0   0   0   5   0  10  78   0   1   465    0    0   0.890     30  0.62
    6    6 A   0   0   0   0   0   0   0  70   3   2   7   0   0   0   0   0   0   7   0  11   474    0    0   1.062     35  0.56
    7    7 A   0   1   2   0  13   0   0   0   0  34  17   1   0   0   0  14   2   9   4   3   466   17    1   1.943     65  0.10
    8    8 A   0   0  10   0  13   7  10   0   5   0   0   0   0   0   0   5   0   5  33  11   493   17    0   2.036     68  0.10
    9    9 A   0   1  10   1  80   1   1   0   0   0   0   0   0   4   0   0   0   1   0   1   503   17    0   0.817     27  0.65
Etcetera.

Obviously, we would like to have the counts for each residue type inside the helices and inbetween the helices. Unfortunately, the profile gives you alignment occupancies and percentages per residue type. The Web server called teller can be used to convert the HSSP profile values into raw amino acid counts. Just cut-n-paste the values listed in the box into the 'teller' software, and check that you get (and understand):

aminozuur aantal
A 39
C 0
D 204
E 533
F 527
G 629
H 20
I 109
K 103
L 63
M 300
N 461
P 213
Q 69
R 36
S 213
T 358
V 9
W 40
Y 54

Now count what you want to count and get the force field parameters. (I am sorry, the 20 divisions and natural logarithms you will have to do by pocket calculator).

First check your force field in terms of plausibility. Do the hydrophobic residues have a high potential for TM helices, and do Arg, Lys, Glu, and ASP indeed score very low as you would expect? If not dicuss with the assistant what went wrong and fix it.

We did not make any software to score things with your force field, but you can see what happens if you write the potentials next to the amino acids in this short sequence:

N G T E G P N F Y V P F S N K T G V V R S P F E A P Q Y Y L A
 
E P W Q F S M L A A Y M F L L I M L G F P I N F L T L Y V T V
 
Q H K K L R T P L N Y

And if all goes well, you will find one TM helix...

The final one.

Question 57: How would you write software to predict if a residue is an active site residue or not?
Don't forget that active sites are always at the surface of the protein.
Think of the data-set you need.
Think of the null-model.
Think of the real model, and how to convert counting into scoring.
Think about testing the method.

Answer