The sequence format

The pre-aligned protein sequences should be provided in naked PIR format, which is explained below line by line:

  1. >P1;text
    where 'text' can be any (short) text you want
  2. title
    where 'title' is a title line (not an empty line!!!!)
  3. sequence
    The sequence has to be given in one-letter code, with at most 80 characters per line. No empty lines, numbers or other non-residue characters are allowed in between. Insertions should be indicated with a dash ("-").
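
The three rules above can be checked before submitting a file. What follows is only a minimal sketch in Python, not the check the server itself performs. The function name check_naked_pir is made up for this illustration, and the restriction to upper-case letters is an assumption (the text above only says "one letter code").

```python
import re

MAX_LINE_LENGTH = 80  # limit stated in the format description above


def check_naked_pir(text):
    """Return a list of problems found in a naked PIR sequence file."""
    lines = text.split("\n")
    # Ignore the single empty string produced by a trailing newline.
    if lines and lines[-1] == "":
        lines.pop()
    if len(lines) < 3:
        return ["need a >P1; line, a title line and at least one sequence line"]

    problems = []
    if not lines[0].startswith(">P1;"):
        problems.append("line 1 does not start with '>P1;'")
    if lines[1] == "":
        problems.append("line 2 (the title line) is empty")
    for number, line in enumerate(lines[2:], start=3):
        if line == "":
            problems.append(f"line {number} is empty")
            continue
        if len(line) > MAX_LINE_LENGTH:
            problems.append(f"line {number} is longer than {MAX_LINE_LENGTH} characters")
        # Assumption: residues are upper-case letters, gaps are dashes.
        if not re.fullmatch(r"[A-Z-]+", line):
            problems.append(f"line {number} contains something other than residues and dashes")
    return problems
```

An empty list means that none of these particular problems was found. Note that a forgotten title line cannot be detected this way, because the first sequence line is then simply read as the title.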

Example model sequence file:
Example template sequence file:

The structure format

The PDB format coordinate file should hold exactly those residues that are given in the PIR-style sequence file that you gave as template sequence file. So, if the template sequence holds 183 residue characters (not counting dashes), then the PDB file should hold coordinates for exactly 183 residues. In practice, things will of course work the other way around: if the structure has 183 residues, then the template sequence must have 183 residues. If the sequence that you want to model holds only 100 amino acids, then you should of course not delete 83 residues from the structure file, but add 83 dashes ("-") to the sequence of the model.
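
The residue count rule can be verified with a few lines of Python. This is a sketch under two assumptions that the text above does not spell out: only ATOM records are counted (HETATM records are ignored), and a residue is identified by its chain identifier, residue number and insertion code as found in the fixed PDB columns. Both function names are invented for this example.

```python
def count_pir_residues(pir_text):
    """Count residue characters (everything except dashes) in a naked PIR file."""
    sequence_lines = pir_text.splitlines()[2:]  # skip the >P1; line and the title line
    return sum(1 for line in sequence_lines for ch in line.strip() if ch != "-")


def count_pdb_residues(pdb_text):
    """Count distinct residues in the ATOM records of a PDB file."""
    seen = set()
    for line in pdb_text.splitlines():
        if line.startswith("ATOM"):
            # Fixed PDB columns: chain ID (22), residue number (23-26),
            # insertion code (27). Slices are 0-based, hence the shift by one.
            seen.add((line[21:22], line[22:26], line[26:27]))
    return len(seen)
```

If count_pir_residues for the template sequence file and count_pdb_residues for the coordinate file give different numbers, the two files do not describe the same set of residues.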

Some notes on the alignment

It was already discussed above that both sequences should start with

>P1;text
title line
in which 'text' can be anything and the 'title line' needs no real title, but must be present (put at least one space on it).

How does one now get one alignment in two sequence files? Suppose the alignment looks like:

and the PDB file contains the following residues:
the two sequence files should look like:
template sequence

model sequence

Why? Well, first of all remember that you are not aligning two sequences, but you want to map the sequence of the model on the template structure. The sequence of the template is only there to tell the modelling program where the gaps will be.

So, the little HCHCHC fragment at the N-terminal end of the structure must also be present in the 'old' sequence. The C-terminal D in the template sequence is actually not present in the structure and should thus be replaced by a - in the 'old' sequence file.

Why are the two sequence lines not equally long? Well, that really is not important; as long as no line is longer than 80 characters, everything is OK. The two sequence files would also be OK if they were given as:

template sequence

model sequence
And even the following files would be acceptable:
template sequence


- - - - - G F H D F G H K L M N V C W E R T Y I P L K H

Some notes on the results

There are three insertions in the model sequence relative to the template. The server will NOT model the two insertions for which there really is no template structure. So the residues ER and AC will not be found in the model structure that the server returns to you. The insertion G at the beginning of the model sequence will be modelled because it really is not an insertion relative to the template structure. In the model this G will occupy the position of the last C of the HCHCHC motif found at the start of the template structure.

There is one deletion to be made in the template. The residues HY occur in the structure, but no equivalent residues are present in the model sequence. The present version of the model server will in this case leave a gap in the model structure. In the near future we will add some software to automatically close the gap if this can be done in a straightforward manner. Please keep in mind that the technology is still missing (not only in our lab, but everywhere) to do this gap-closing really well.
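
To see in advance which residues will be modelled, which will be skipped and where gaps will appear, one can walk through the alignment column by column. The sketch below assumes that the sequence lines of each file have first been joined into one string and that the two strings are equally long; the text does not state this explicitly. The function name classify_columns is hypothetical.

```python
def classify_columns(template_seq, model_seq):
    """Split an aligned pair into modelled, not-modelled and gap positions."""
    if len(template_seq) != len(model_seq):
        raise ValueError("aligned sequences must have the same length")
    modelled, not_modelled, gaps = [], [], []
    for column, (t, m) in enumerate(zip(template_seq, model_seq), start=1):
        if t == "-" and m == "-":
            continue  # nothing on either side
        if t == "-":
            # Insertion: model residue without template coordinates.
            not_modelled.append((column, m))
        elif m == "-":
            # Deletion: template residue without a model equivalent.
            gaps.append((column, t))
        else:
            modelled.append((column, m))
    return modelled, not_modelled, gaps
```

Each returned list holds (column, residue) pairs, with columns counted from 1. The second list corresponds to insertions such as ER and AC above, the third to deletions such as HY.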

What can all go wrong?

What can all go wrong? Well, everything! The error we have seen most often so far is that people produce sequence files with MS Word or a similar text editor. Those programs often leave funny characters such as ^M in the file. The server has no means of dealing with this problem.

One Israeli portal does something funny with empty title lines. Other sites or country portals might have similar problems, so always put at least one space on the title line.

Some people call it a bug in the program if there is a gap in the structure to be modelled. However, if the alignment is wrong, a deletion will end up at the wrong spot, and a big gap will be the result.

Insertions are not yet modelled. In the future we will also model short insertions, but presently we are lacking the CPU power to provide this service, and there are no good algorithms on the market that reliably build insertions.

Another nice error to make is to put the model sequence all on one line that is longer than 80 characters. The characters beyond position 80 are NOT read, and only WHAT IF knows what the resulting model will look like.
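
A long sequence can be cut into lines of at most 80 characters automatically. The following is a small Python sketch; the function names and the choice to replace an empty title by a single space (following the advice on title lines above) are assumptions made for this example.

```python
def wrap_sequence(sequence, width=80):
    """Cut one long sequence string into pieces of at most `width` characters."""
    return [sequence[i:i + width] for i in range(0, len(sequence), width)]


def format_naked_pir(text, title, sequence, width=80):
    """Build the file contents: >P1; line, title line, wrapped sequence lines."""
    if not title:
        title = " "  # keep the title line present even without a real title
    lines = [">P1;" + text, title] + wrap_sequence(sequence, width)
    return "\n".join(lines) + "\n"
```

The result uses plain Unix line breaks, so when it is written to disk from Python the file should be opened with newline="\n" to stop the platform from changing them.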

A nice error that you can look at for hours without seeing it is if you forget the title line!

How to submit things from a Mac

Well, that is a problem. The Mac puts funny characters at the end of each line. These characters confuse the hell out of WHAT IF. There are two solutions. The best one is to buy a real computer. A Mac is not the right platform for 3D protein research. The second solution was found by Kurt Giles who wrote:
"For example, so far, the only way I've found of doing it on Mac is using BBEdit. Use BBEdit or BBEdit Lite, when saving the file choose 'Save As...'. Select the 'Options' button, and change 'Line Breaks' from 'Macintosh' to either 'Unix' or 'DOS'".

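The same conversion can be done without a special editor. The ^M mentioned earlier is the carriage return character, which old Macintosh files use as line break and which DOS files put in front of every newline. The Python sketch below rewrites both to plain Unix newlines. The function name is made up for this example, and it is an assumption that Unix line breaks are what the server handles best.

```python
def to_unix_line_endings(data):
    """Turn DOS (CR LF) and old Macintosh (CR) line breaks into Unix (LF)."""
    # Replace the two-byte DOS break first, so its CR is not converted twice.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


def convert_file(source_path, target_path):
    """Read a file as raw bytes and write a copy with Unix line breaks."""
    with open(source_path, "rb") as source:
        data = source.read()
    with open(target_path, "wb") as target:
        target.write(to_unix_line_endings(data))
```

The files are opened in binary mode on purpose, so that Python itself does not touch the line breaks while reading or writing.
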
Picture of modelling process

Last updated Oct 1 2007