HSSP

Explanation

Every HSSP file contains a series of blocks of information:

The header block provides meta data for the HSSP file
The sequence and coverage information
The actual alignment
The alignment profile
The so-called insertion list
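
The block structure listed above can be recovered mechanically, because every block after the header starts with a "## " marker line and the file ends with "//". The following is a minimal sketch in Python, not part of the HSSP tooling. The function name split_blocks and the block keys are assumptions made for illustration, and the sample lines are placeholders rather than real file content.

```python
import re

# Matches marker lines such as "## PROTEINS : identifier ..." or
# "## ALIGNMENTS    1 -   64" and captures only the block title.
MARKER = re.compile(r"## ([A-Z ]+?)\s*(?::|\d|$)")

def split_blocks(lines):
    """Group the lines of an HSSP file by block (hypothetical helper)."""
    blocks = {"HEADER": []}      # the header block has no "##" marker
    current = "HEADER"
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("//"):    # end-of-file marker
            break
        match = MARKER.match(line)
        if match:
            current = match.group(1)
            blocks.setdefault(current, [])
            continue
        blocks[current].append(line)
    return blocks

# Placeholder input: only the marker lines are taken from a real file.
sample = [
    "PDBID      1CRN",
    "## PROTEINS : identifier and alignment statistics",
    "(sequence list rows)",
    "## ALIGNMENTS    1 -   64",
    "(alignment rows)",
    "## SEQUENCE PROFILE AND ENTROPY",
    "(profile rows)",
    "## INSERTION LIST",
    "(insertion rows)",
    "//",
]
for title, rows in split_blocks(sample).items():
    print(title, len(rows))
```

Lines from marker lines with the same title are collected under one key, so a file with more than one "## ALIGNMENTS" marker would end up with all alignment rows in a single list.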

The information contained in HSSP files will be explained using the HSSP file for 1crn (crambin) as an example.

The header block with meta data

HSSP       HOMOLOGY DERIVED SECONDARY STRUCTURE OF PROTEINS , VERSION 3.0 2017
PDBID      1CRN
THRESHOLD  according to: t(L)=(290.15 * L ** -0.562) + 5
REFERENCE  Sander C., Schneider R. : Database of homology-derived protein structures. Proteins, 9:56-68 (1991).
CONTACT    Maintained at http://www.cmbi.umcn.nl/ by Coos Baakman 
DATE       file generated on 2013-02-04
HEADER     PLANT PROTEIN                           30-APR-81   1CRN
COMPND      MOLECULE: CRAMBIN;
SOURCE      ORGANISM_SCIENTIFIC: CRAMBE HISPANICA SUBSP. ABYSSINICA;
AUTHOR     W.A.HENDRICKSON,M.M.TEETER
DBREF      1CRN A    1    46  UNP    P01542   CRAM_CRAAB       1     46
SEQLENGTH    46
NCHAIN        1 chain(s) in 1CRN data set
NALIGN       64
NOTATION : ID: EMBL/SWISSPROT identifier of the aligned (homologous) protein
NOTATION : STRID: if the 3-D structure of the aligned protein is known, then STRID is the Protein Data Bank identifier as taken
NOTATION : from the database reference or DR-line of the EMBL/SWISSPROT entry
NOTATION : %IDE: percentage of residue identity of the alignment
NOTATION : %SIM (%WSIM):  (weighted) similarity of the alignment
NOTATION : IFIR/ILAS: first and last residue of the alignment in the test sequence
NOTATION : JFIR/JLAS: first and last residue of the alignment in the alignend protein
NOTATION : LALI: length of the alignment excluding insertions and deletions
NOTATION : NGAP: number of insertions and deletions in the alignment
NOTATION : LGAP: total length of all insertions and deletions
NOTATION : LSEQ2: length of the entire sequence of the aligned protein
NOTATION : ACCNUM: SwissProt accession number
NOTATION : PROTEIN: one-line description of aligned protein
NOTATION : SeqNo,PDBNo,AA,STRUCTURE,BP1,BP2,ACC: sequential and PDB residue numbers, amino acid (lower case = Cys), secondary
NOTATION : structure, bridge partners, solvent exposure as in DSSP (Kabsch and Sander, Biopolymers 22, 2577-2637(1983)
NOTATION : VAR: sequence variability on a scale of 0-100 as derived from the NALIGN alignments
NOTATION : pair of lower case characters (AvaK) in the alignend sequence bracket a point of insertion in this sequence
NOTATION : dots (....) in the alignend sequence indicate points of deletion in this sequence
NOTATION : SEQUENCE PROFILE: relative frequency of an amino acid type at each position. Asx and Glx are in their
NOTATION : acid/amide form in proportion to their database frequencies
NOTATION : NOCC: number of aligned sequences spanning this position (including the test sequence)
NOTATION : NDEL: number of sequences with a deletion in the test protein at this position
NOTATION : NINS: number of sequences with an insertion in the test protein at this position
NOTATION : ENTROPY: entropy measure of sequence variability at this position
NOTATION : RELENT: relative entropy, i.e.  entropy normalized to the range 0-100
NOTATION : WEIGHT: conservation weight

Most of these lines are self-explanatory:

HSSP This is the file header, including versioning information
PDBID Indicates the PDB file for which this HSSP file has been made
THRESHOLD When Sander and Schneider designed the HSSP concept they first determined the relation between alignment length, percentage sequence identity, and the significance of the alignment. This led to the so-called Sander and Schneider curve. This curve roughly follows t(L)=(290.15 * L ** -0.562), in which L is the alignment length and t(L) is the percentage sequence identity at which the odds are 50-50 that the alignment is relevant. In the HSSP files there is an additional +5 at the end of the formula, indicating that 5% more sequence identity is needed for any sequence to enter the alignment. So HSSP tries to stay on the safe side.
REFERENCE The original HSSP reference, but we would like you to cite:
Nucleic Acids Research 2011 January; 39(Database issue): D411-D419.
A series of PDB related databases for everyday needs.
Robbie P. Joosten, Tim A.H. te Beek, Elmar Krieger, Maarten L. Hekkelman,
Rob W.W. Hooft, Reinhard Schneider, Chris Sander, and Gert Vriend.
CONTACT Contact address of the HSSP author.
DATE File generation date.
HEADER Copied from the PDB file
COMPND Copied from the PDB file
SOURCE Copied from the PDB file
AUTHOR Copied from the PDB file
DBREF Copied from the PDB file
SEQLENGTH Length of the protein sequence/structure used in the alignment
NCHAIN Number of chains found in the PDB file
NALIGN Number of sequences aligned
NOTATION These lines explain the content of the rest of the header block / meta data
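
The THRESHOLD formula can be evaluated directly. The sketch below is only an illustration of the formula as printed in the header. The function name hssp_threshold and the margin parameter are assumptions, and comparing the result against %IDE from the sequence list (which is written as a fraction) is our reading of how the threshold is applied.

```python
def hssp_threshold(length, margin=5.0):
    """Minimum percentage identity t(L) for an alignment of `length` residues.

    Implements t(L) = (290.15 * L ** -0.562) + 5 from the THRESHOLD line;
    pass margin=0.0 to get the bare curve without the +5 safety margin.
    """
    return 290.15 * length ** -0.562 + margin

# Alignment 64 of the 1crn example has LALI = 45 and %IDE = 0.40.
lali, ide = 45, 0.40
required = hssp_threshold(lali)
print(f"required {required:.1f}%, observed {100 * ide:.1f}%")
print("passes" if 100 * ide >= required else "fails")
```

For the 45-residue alignments at the bottom of the crambin sequence list the threshold comes out at a little over 39%, so their 40% identity is just enough to be included.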

The sequence list

## PROTEINS : identifier and alignment statistics
  NR.    ID         STRID   %IDE %WSIM IFIR ILAS JFIR JLAS LALI NGAP LGAP LSEQ2 ACCNUM     PROTEIN
    1 : CRAM_CRAAB  1YV8    0.98  1.00    1   46    1   46   46    0    0   46  P01542     Crambin OS=Crambe hispanica subsp. abyssinica GN=THI2 PE=1 SV=2
    2 : Q9S979_CRAAB        0.86  0.93    3   46    9   52   44    0    0  118  Q9S979     Crambin=THIONIN variant THI2CA5 (Precursor) OS=Crambe hispanica subsp. abyssinica PE=4 SV=1
    3 : Q9S976_CRAAB        0.57  0.82    2   45   26   69   44    0    0  134  Q9S976     Crambin=THIONIN variant THI2CA10 (Precursor) OS=Crambe hispanica subsp. abyssinica PE=4 SV=1
    4 : Q43227_TULGE        0.56  0.78    2   46   14   58   45    0    0  112  Q43227     Thionin class 1 (Precursor) OS=Tulipa gesneriana GN=Thi1-4 PE=2 SV=1
.....
   62 : I1H3P5_BRADI        0.40  0.60    2   46   30   74   45    0    0  135  I1H3P5     Uncharacterized protein OS=Brachypodium distachyon GN=BRADI1G57296 PE=4 SV=1
   63 : Q9S9D7_HORVU        0.40  0.71    2   46   30   74   45    0    0  137  Q9S9D7     Thionin OS=Hordeum vulgare PE=4 SV=1
   64 : THN6_HORVU          0.40  0.71    2   46   30   74   45    0    0  137  P09618     Leaf-specific thionin BTH6 OS=Hordeum vulgare PE=2 SV=3
## ALIGNMENTS    1 -   64

This block holds the metadata for each aligned sequence, plus the key alignment statistics that are explained in the NOTATION records of the first block. The ID column holds the name of the sequence, the STRID column holds the identifier of the corresponding PDB file (if one exists), and so on. Note that, although the columns are labelled %IDE and %WSIM, the values are written as fractions (0.98 means 98%).
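
A row of the sequence list can be turned into named fields roughly as follows. This is a sketch under assumptions: it splits on whitespace instead of using fixed column positions, it assumes that ACCNUM is always present, and it recognises an empty STRID by the fact that the next token (%IDE) contains a decimal point. The function name parse_protein_row is hypothetical.

```python
INT_FIELDS = ["ifir", "ilas", "jfir", "jlas", "lali", "ngap", "lgap", "lseq2"]

def parse_protein_row(line):
    """Parse one row of the '## PROTEINS' block into a dict."""
    number, rest = line.split(":", 1)      # "    1 " and the remainder
    tokens = rest.split()
    # STRID may be blank. %IDE always contains ".", a PDB identifier never does.
    if "." in tokens[1]:
        strid, stats = None, tokens[1:]
    else:
        strid, stats = tokens[1], tokens[2:]
    row = {"nr": int(number), "id": tokens[0], "strid": strid}
    row["ide"] = float(stats[0])           # fraction, not percentage
    row["wsim"] = float(stats[1])
    for name, value in zip(INT_FIELDS, stats[2:10]):
        row[name] = int(value)
    row["accnum"] = stats[10]
    row["protein"] = " ".join(stats[11:])  # free-text description
    return row

line = ("    2 : Q9S979_CRAAB        0.86  0.93    3   46    9   52   44"
        "    0    0  118  Q9S979     Crambin=THIONIN variant THI2CA5 (Precursor)")
row = parse_protein_row(line)
print(row["id"], row["strid"], row["ide"], row["lseq2"])
```

Whitespace splitting is chosen here because it survives copy-and-paste damage to the column layout. A parser for production use would more likely rely on the fixed column positions of the real file.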

The actual alignment

## ALIGNMENTS    1 -   64
 SeqNo  PDBNo AA STRUCTURE BP1 BP2  ACC NOCC  VAR  ....:....1....:....2....:....3....:....4....:....5....:....6....:....7 CHAIN AUTHCHAIN
     1    1 A T              0   0   75    2    0  T                                                                          A         A
     2    2 A T  E     -A   34   0A  21   61   10  T SSSSSTSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS  SSSSS SSSSSSSSSSSSSSSS           A         A
     3    3 A a  E     -A   33   0A   0   65    0  CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC           A         A
     4    4 A b        -     0   0    0   65    4  CCCCCCCCCCFCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC           A         A
     5    5 A P  S    S+     0   0   54   65   51  PPPPKPPKKRPPPPPRRPPPPPPPPPKPKKPRKKPPPPPPPPKPPPPPKKRKPKPPPKKKKKKK           A         A
     6    6 A S  S  > S-     0   0   49   65   60  SNSSDTSDDNSSNNSNNTTSSSSSSSNSNNSNNNSNSSSSRSTSSSSRTTTSNNTNSNNNNNDD           A         A
     7    7 A I  H  > S+     0   0  120   65   30  ITITITIDTTTTTTTTTMPTTTTTTTTTTTTTTTTTTTTTTETTTTTTTTTTTTKTTTTTTTTT           A         A
....
    44   44 A Y  G <  S+     0   0   68   48   18  YYYYYYLYYYYYYYWYYYYW      YYYY YYYWY    Y F    YYLHWYYL YYYYYYYY           A         A
    45   45 A A    <         0   0   71   46   47  AAPPP PPPPDSPPTEE PB      PPPP PPPNP    T P    TPRPPPPP VPPPPPPP           A         A
    46   46 A N              0   0   76   41   54  NN KK  KKK KKKNKK HH      KKKK RKKHK    H      HKKKKKK  HKSSSSKK           A         A
## SEQUENCE PROFILE AND ENTROPY

The most confusing aspect of the alignment is the vertical orientation of the individual sequences: each row is one position of the PDB sequence, and each aligned sequence occupies one column. The columns up to and including ACC are copied from the corresponding DSSP file. NOCC and VAR are explained in the header block.
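
To read one aligned sequence in the usual horizontal orientation, its column has to be collected from top to bottom. The sketch below assumes that the alignment characters start at 0-based offset 51, which is where "....:....1" begins in the column-title row of the example above. The function name aligned_sequence is hypothetical, and the three sample rows are simplified stand-ins in which the DSSP columns are replaced by padding.

```python
ALIGNMENT_START = 51  # 0-based offset of "....:....1" in the title row (assumed)

def aligned_sequence(rows, ali_no, start=ALIGNMENT_START):
    """Return aligned sequence `ali_no` (1-based within the block) as one string.

    Rows that are too short to reach the column count as blanks, because
    trailing blanks are often stripped from text files.
    """
    column = start + ali_no - 1
    return "".join(row[column] if len(row) > column else " " for row in rows)

# Simplified stand-ins for positions 1-3 of the 1crn alignment block.
rows = [
    "     1    1 A T".ljust(ALIGNMENT_START) + "T",
    "     2    2 A T".ljust(ALIGNMENT_START) + "T SSSS",
    "     3    3 A a".ljust(ALIGNMENT_START) + "CCCCCC",
]
for ali_no in (1, 2, 3):
    print(ali_no, repr(aligned_sequence(rows, ali_no)))
```

In the output, leading blanks show where an aligned sequence has not started yet, which agrees with the IFIR values 1, 3 and 2 given for the first three sequences in the sequence list.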

The profile

## SEQUENCE PROFILE AND ENTROPY
 SeqNo PDBNo   V   L   I   M   F   W   Y   G   A   P   S   T   C   H   R   K   Q   E   N   D  NOCC NDEL NINS ENTROPY RELENT WEIGHT  CHAIN AUTHCHAIN
    1    1 A   0   0   0   0   0   0   0   0   0   0   0 100   0   0   0   0   0   0   0   0     2    0    0   0.000      0  1.00       A         A
    2    2 A   0   0   0   0   0   0   0   0   0   0  95   5   0   0   0   0   0   0   0   0    61    0    0   0.196      6  0.90       A         A
    3    3 A   0   0   0   0   0   0   0   0   0   0   0   0 100   0   0   0   0   0   0   0    65    0    0   0.000      0  1.00       A         A
    4    4 A   0   0   0   0   2   0   0   0   0   0   0   0  98   0   0   0   0   0   0   0    65    0    0   0.079      2  0.96       A         A
....
   45   45 A   2   0   0   0   0   0   0   0   7  70   2   7   0   0   2   0   0   4   2   2    46    0    0   1.161     38  0.53       A         A
   46   46 A   0   0   0   0   0   0   0   0   0   0  10   0   0  15   2  63   0   0  10   0    41    0    0   1.115     37  0.45       A         A
## INSERTION LIST

The profile holds, for each amino acid type, its percentage among the residues observed at that position. Be aware that these are frequencies scaled to 100. In the crambin example you see 100 for T (threonine) at position 1, but if you look in the actual alignment you see that this is 100% of just one aligned residue, because only the first sequence has a residue at this position. NOCC is nevertheless 2 at this position because the query sequence (from 1crn.pdb) is also part of the alignment.
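
The relation between an alignment row and its profile row can be checked by hand. The sketch below counts the residues of one alignment row together with the query residue and scales the counts to 100. Several things here are assumptions rather than documented behaviour: the function name, the use of rounding to whole percentages, and the natural-log Shannon entropy. For position 2 of the crambin example that entropy reproduces the 0.196 listed in the ENTROPY column, which suggests, but does not prove, that the column is computed this way.

```python
import math
from collections import Counter

def profile_for_position(query_residue, aligned_row):
    """Return (percentages, nocc, entropy) for one alignment position."""
    # In the AA column a lower-case letter denotes a cysteine (see NOTATION).
    query = "C" if query_residue.islower() else query_residue
    # In the aligned sequences lower case only marks an insertion point;
    # blanks and dots carry no residue and are skipped.
    residues = [query] + [c.upper() for c in aligned_row if c.isalpha()]
    counts = Counter(residues)
    nocc = len(residues)
    percentages = {aa: round(100 * n / nocc) for aa, n in counts.items()}
    entropy = sum(n / nocc * math.log(nocc / n) for n in counts.values())
    return percentages, nocc, entropy

# Alignment row for position 2 of 1crn, written out run by run.
row2 = "T " + "S" * 5 + "T" + "S" * 32 + "  " + "S" * 5 + " " + "S" * 16
percentages, nocc, entropy = profile_for_position("T", row2)
print(percentages, nocc, round(entropy, 3))
```

Applied to position 1, where the row contains a single T, the same function gives 100 for T with NOCC equal to 2, matching the first row of the profile.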

The insertion block

HSSP alignments rigorously follow the sequence of the PDB file. That is easy to do in the case of deletions: a deletion is represented by a period if it lies in the middle of the sequence and by a blank at the termini. Insertions are more complicated, because throwing them away means throwing away information, and that always hurts.

In the section of the alignment listed in the box below, you find two pairs of lower-case residues, gi and gr. Each pair is read vertically: the g sits in the row for position 20 and the i or r sits directly beneath it in the row for position 21 (alignment columns 10 and 11). If you study the whole example HSSP file for 1crn you will see that these are not the only lower-case pairs. Each such pair indicates that there is an insertion between its two residues.

    19   19 A P  T 3 5S-     0   0  109   65   65  PPPPPPPPPPPTTTRPPLTTTTTTTTATAAAPAATGAAAATYGAAAATATGAAATGAALAALAA
    20   20 A G  T < 5 +     0   0   52   65    4  GGGGGGGGGggGGGLGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
    21   21 A T      < -     0   0   39   64   53  TTTTTTTTTirTAA.TTTSTAATAATGSGGGTGGAGGGGGAAGGGGGAGATAGGTGGGGGGGGG
    22   22 A P    >>  -     0   0   83   65   43  PAPPPPPPPSPSPPPPPPTSSSSSSSSSSSSPSSSSSSSSSSASSSSSSSPPSSSSSSTSSTSS

The actual sequences of those insertions are found in the last block of the HSSP file, for example:

## INSERTION LIST
 AliNo  IPOS  JPOS   Len Sequence
    10    20    21     1 gTi
    11    20    39     1 gCr
//
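
The insertion list shown above can be read into records as sketched below. The field names and the function name parse_insertions are assumptions. The interpretation of the Sequence column (two lower-case flanking residues with the inserted residues in upper case between them) follows from comparing gTi and gCr with the gi and gr pairs in the alignment. Only single-line entries as in this example are handled.

```python
def parse_insertions(lines):
    """Parse the rows of the '## INSERTION LIST' block into dicts."""
    records = []
    for line in lines:
        fields = line.split()
        # Skip the marker line, the column-title row and the "//" terminator.
        if len(fields) != 5 or not fields[0].isdigit():
            continue
        ali_no, ipos, jpos, length = map(int, fields[:4])
        flanked = fields[4]
        records.append({
            "ali_no": ali_no,                    # which aligned sequence
            "ipos": ipos,                        # position in the PDB sequence
            "jpos": jpos,                        # position in the aligned sequence
            "len": length,                       # number of inserted residues
            "flanks": flanked[0] + flanked[-1],  # the lower-case pair
            "inserted": flanked[1:-1],           # residues between the pair
        })
    return records

block = [
    "## INSERTION LIST",
    " AliNo  IPOS  JPOS   Len Sequence",
    "    10    20    21     1 gTi",
    "    11    20    39     1 gCr",
    "//",
]
for record in parse_insertions(block):
    print(record["ali_no"], record["flanks"], record["inserted"])
```

With these records, the lower-case pair found in column ali_no of the alignment at position ipos can be replaced by the full flanked sequence to recover the unaligned residues.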