The pre-aligned protein sequences should be provided in naked PIR format, which is explained below line by line. Example model sequence file:
>P1;model
model
---GRCELAAAMK---LDNYRGYSLGNWVCAAKFESNFNSQAT
NRNTDGSTD--VLQINSR--*

Example template sequence file:
>P1;template
template
KVYGRCELAAAMKRLGLDNYRGYSLGNWVCAAKFESNFNTHATNRNTD
GSTDYGILQINSRWW*
The PDB format coordinate file should hold exactly those residues that are given in the PIR-style sequence file that you gave as template sequence file. So, if the template sequence holds 183 residue characters (not counting dashes), then the PDB file should hold coordinates for exactly 183 residues. In practice, things will of course work the other way around: if the structure has 183 residues, then the template sequence has 183 residues. If the sequence that you want to model holds only 100 amino acids, then you should of course not delete 83 residues from the structure file, but add 83 '-' signs to the sequence of the model.
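The residue-count rule above can be checked before submitting. The following is only a sketch: the function names are made up for this example, it assumes that the first two lines of the sequence file are the '>P1;' line and the title line, and it counts a residue as a unique combination of chain, residue number and insertion code in the ATOM records of the PDB file.

```python
def count_pir_residues(pir_text):
    """Count residue characters in a PIR-style sequence file.

    Assumes line 1 is the '>P1;text' line and line 2 is the title line.
    Dashes, spaces and the closing '*' are not counted.
    """
    lines = pir_text.splitlines()
    sequence = "".join(lines[2:])
    return sum(1 for ch in sequence if ch.isalpha())


def count_pdb_residues(pdb_text):
    """Count distinct residues found in the ATOM records of a PDB file."""
    seen = set()
    for line in pdb_text.splitlines():
        if line.startswith("ATOM"):
            # Fixed PDB columns: chain ID in column 22, residue number in
            # columns 23-26, insertion code in column 27.
            seen.add((line[21:22], line[22:26], line[26:27]))
    return len(seen)
```

If the two counts differ, the template sequence file and the structure file do not describe the same set of residues.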
It was already discussed above that both sequences should start with
>P1;text
title line

in which 'text' can be anything and the 'title line' can be left empty.
How does one now get one alignment in two sequence files? Suppose the alignment looks like:
template -ASDFGHKLFNVCW--TYIPLKHGFRTREDSACVNMPIYHWRFEDWSCFVGNHHKLPIKLHYTRTREDS--VNMPIYHTRFED
model    GFHDFGHKLMNVCWERTYIPLKHGFRSREDSLCVNMPIYHTRFEDWGCFVGNHMKLPIKL--TRTREDSACVNMPIYHTR---

and the PDB file contains the following residues:
HCHCHCASDFGHKLFNVCWTYIPLKHGFRTREDSACVNMPIYHWRFEDWSCFVGNHHKLPIKLHYTRTREDSVNMPIYHTRFE

the two sequence files should look like:
>P1;old
template sequence
HCHCHCASDFGHKLFNVCW--TYIPLKHGFRTREDSACVNM
PIYHWRFEDWSCFVGNHHKLPIKLHYTRTREDS--VNMPIYHTRFE

>P1;new
model sequence
-----GFHDFGHKLMNVCWERTYIPLKHGFRSREDSLCVNM
PIYHTRFEDWGCFVGNHMKLPIKL--TRTREDSACVNMPIYHTR--
Why? Well, first of all remember that you are not aligning two sequences, but you want to map the sequence of the model on the template structure. The sequence of the template is only there to tell the modelling program where the gaps will be.
So, the little HCHCHC fragment at the N-terminal end of the structure must also be present in the 'old' sequence. The C-terminal D in the template sequence is actually not present in the structure and should thus be replaced by a - in the 'old' sequence file.
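The bookkeeping described above can be sketched in a few lines of Python. This is not what the server does internally; it is an illustration under loud assumptions: the function name is hypothetical, the structure is assumed to differ from the aligned template only by extra residues at its start and missing residues at its end, and model residues that precede the aligned template are laid onto the extra N-terminal structure residues, as was done for the G above.

```python
def build_old_and_new(template_aln, model_aln, pdb_seq):
    """Return the ('old', 'new') sequence strings for one pairwise alignment."""
    if len(template_aln) != len(model_aln):
        raise ValueError("aligned sequences must be equally long")
    ungapped = template_aln.replace("-", "")

    # Assumption: the structure may start earlier and stop earlier than the
    # aligned template sequence, but is otherwise identical to it.
    offset = next(
        (i for i in range(len(pdb_seq)) if ungapped.startswith(pdb_seq[i:])),
        None,
    )
    if offset is None:
        raise ValueError("structure residues do not match the template sequence")
    extra_head = pdb_seq[:offset]
    n_missing = len(ungapped) - (len(pdb_seq) - offset)

    # Blank out C-terminal template residues that the structure lacks.
    columns = [list(pair) for pair in zip(template_aln, model_aln)]
    for column in reversed(columns):
        if n_missing == 0:
            break
        if column[0] != "-":
            column[0] = "-"
            n_missing -= 1
    # A column with two dashes carries no information, so drop it.
    columns = [c for c in columns if c != ["-", "-"]]

    # Leading model residues without a template partner are shifted onto the
    # extra N-terminal residues of the structure.
    lead = 0
    while lead < len(columns) and columns[lead][0] == "-":
        lead += 1
    head_model = "".join(c[1] for c in columns[:lead])
    width = max(len(extra_head), len(head_model))
    old = extra_head.rjust(width, "-") + "".join(c[0] for c in columns[lead:])
    new = head_model.rjust(width, "-") + "".join(c[1] for c in columns[lead:])
    return old, new
```

Called with the template line, the model line and the PDB residues of the example, this should give back the 'old' and 'new' sequences shown above, still to be split over lines of at most 80 characters. Anything more complicated, such as residues missing in the middle of the structure, needs handling by hand.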
Why are the two sequence lines not equally long? Well, that really is not important; as long as no line is longer than 80 characters, everything is OK. The two sequence files would also be OK if they were given as:
>P1;old
template sequence
HCHCHCASDFGHKLFNVCW--TYIPLKHGFRTREDSACVNMPIY
HWRFEDWSCFVGNHHKLPIKLHYTRTREDS--VNMPIYHTRFE

>P1;new
model sequence
-----GFHDFGHKLMNVCWERTYIPLKHGFRSREDSLCVNM
PIYHTRFEDWGCFVGNHMKLPIKL--TRTREDSACVNMPIYHTR--

And even the following files would be acceptable:
>P1;old
template sequence
HCHCHC
ASDFGH
KLFNVC
W--TY
IPLK
HGF
RTR
EDSA
CVNMP
IYHWRF
EDWSCFV
GNHHKLPI
KLHYTRTRE
DS--VNMPIYHTRFE

>P1;new

- - - - - G F H D F G H K L M N V C W E R T Y I P L K H
GFRSREDSLCVNM
PIYHTRFEDWGCFVGNH
MKLPIKL--TRTREDSACVNMPIYHTR--
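Since the layout of the sequence lines is free as long as no line exceeds 80 characters, the safest route is to let a small script write the file. The helper below is a hypothetical sketch, not part of the server; its name and its default width of 60 are arbitrary choices. It writes a single space when the title is empty, following the advice given further down.

```python
def format_pir(code, title, sequence, width=60):
    """Return the text of a naked PIR file with sequence lines of 'width' characters."""
    if not 1 <= width <= 80:
        raise ValueError("width must be between 1 and 80")
    # Drop spaces and line breaks so the input may be laid out in any way.
    residues = "".join(sequence.split())
    chunks = [residues[i:i + width] for i in range(0, len(residues), width)]
    # An empty title is written as one space rather than as an empty line.
    return "\n".join([f">P1;{code}", title or " "] + chunks) + "\n"
```

For example, format_pir("new", "model sequence", aligned_model) returns text that can be written to the model sequence file, where aligned_model stands for your own aligned sequence with dashes.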
There are three insertions in the model sequence relative to the template. The server will NOT model the two insertions for which there really is no template structure. So the residues ER and AC will not be found in the model structure that the server returns to you. The insertion G at the beginning of the model sequence will be modelled because it really is not an insertion relative to the template structure. In the model this G will occupy the position of the last C of the HCHCHC motif found at the start of the template structure.
There is one deletion to be made in the template. The residues HY occur in the structure, but no equivalent residues are present in the model sequence. The present version of the model server will in this case leave a gap in the model structure. In the near future we will add some software to automatically close the gap if this can be done in a straightforward manner. Please keep in mind that the technology is still missing (not only in our lab, but everywhere) to do this gap-closing really well.
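To see in advance which residues will be skipped and where gaps will appear, the two sequence lines can be compared column by column. The sketch below uses a hypothetical function name and assumes that both sequences are given as single strings of equal length, with line breaks already removed. It reports runs of model residues that face dashes in the 'old' sequence (insertions) and runs of template residues that face dashes inside the 'new' sequence (deletions). Dashes at either end of the 'new' sequence are not reported, because unmatched ends do not leave a gap in the model.

```python
import re


def find_insertions_and_deletions(old, new):
    """Return (insertions, deletions) as lists of (start_column, residues)."""
    if len(old) != len(new):
        raise ValueError("both sequences must be equally long")

    # Insertions: model residues opposite dashes in the template line.
    insertions = [
        (m.start(), new[m.start():m.end()]) for m in re.finditer(r"-+", old)
    ]

    # Deletions: template residues opposite dashes inside the model line.
    first = len(new) - len(new.lstrip("-"))
    last = len(new.rstrip("-"))
    deletions = [
        (m.start(), old[m.start():m.end()])
        for m in re.finditer(r"-+", new)
        if m.start() >= first and m.end() <= last
    ]
    return insertions, deletions
```

Column numbers start at 0. For the example files, the insertions found this way should be ER and AC, and the only deletion HY.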
What can all go wrong? Well, everything! The error we have seen most often so far is that people produce sequence files with MS Word or a similar text editor. Those programs often leave funny characters such as ^M in the file. The server has no means of dealing with this problem.
One Israeli portal does something funny with empty title lines. Other sites or country portals might have similar problems, so always put at least one space on the title line.
Some people call it a bug in the program if there is a gap in the structure to be modelled. However, if the alignment is wrong, a deletion will end up at the wrong spot, and a big gap will be the result.
Insertions are not yet modelled. In the future we will also model short insertions, but presently we are lacking the CPU power to provide this service, and there are no good algorithms on the market that reliably build insertions.
Another nice error to make is to put the model sequence all on one line that is longer than 80 characters. The characters beyond position 80 are NOT read, and only WHAT IF knows what the resulting model will look like.
A nice error that you can look at for hours without seeing it is if you forget the title line!
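Most of the errors listed above can be caught with a quick check before the file is submitted. The following is a rough sketch, not the check the server itself performs; the function name is invented, and the test for a forgotten title line is only a guess based on line 2 looking like sequence (ten or more capitals, dashes or '*').

```python
import re

MAX_LINE = 80  # characters beyond position 80 are not read


def check_sequence_file(text):
    """Return a list of problems found in the text of one sequence file."""
    problems = []
    if "\r" in text:
        problems.append("carriage returns (^M) found; save with Unix line breaks")
    lines = text.replace("\r\n", "\n").replace("\r", "\n").split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # ignore the final line break
    if not lines or not lines[0].startswith(">P1;"):
        problems.append("first line must start with '>P1;'")
    if len(lines) < 3:
        problems.append("expected a '>P1;' line, a title line and sequence lines")
        return problems

    title = lines[1]
    if title == "":
        problems.append("title line is empty; put at least one space on it")
    elif re.fullmatch(r"[A-Z*-]{10,}", title):
        # Heuristic only: a title written in capitals would trigger this too.
        problems.append("line 2 looks like sequence; the title line may be missing")

    for number, line in enumerate(lines[2:], start=3):
        if len(line) > MAX_LINE:
            problems.append(f"line {number} is longer than {MAX_LINE} characters")
        if re.search(r"[^A-Za-z*\- ]", line):
            problems.append(f"line {number} holds unexpected characters")
    return problems
```

An empty list means that none of these particular problems was found; it does not say anything about the quality of the alignment.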
Well, that is a problem. The Mac puts funny characters at the end of
each line. These characters confuse the hell out of WHAT IF. There are
two solutions. The best one is to buy a real computer. A Mac is not
the right platform for 3D protein research. The second solution was
found by Kurt Giles, who wrote:
"For example, so far, the only way I've found of doing it on Mac is using BBEdit. Use BBEdit or BBEdit Lite, when saving the file choose 'Save As...'. Select the 'Options' button, and change 'Line Breaks' from 'Macintosh' to either 'Unix' or 'DOS'".
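Where no such editor is at hand, the line breaks can also be converted with a few lines of Python. This is a minimal sketch with made-up function names; it assumes that the funny characters are classic Macintosh (carriage return) or DOS (carriage return plus line feed) line breaks and rewrites them as Unix line breaks.

```python
from pathlib import Path


def to_unix_line_breaks(data):
    """Replace DOS and classic Macintosh line breaks in 'data' (bytes) by Unix ones."""
    # Handle the two-byte DOS form first so it does not become two line breaks.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


def convert_file(path):
    """Rewrite the file at 'path' in place with Unix line breaks."""
    file_path = Path(path)
    file_path.write_bytes(to_unix_line_breaks(file_path.read_bytes()))
```

Keep a copy of the original file, since convert_file overwrites it.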