Intro Bioinformatics

• SwissProt

EU name: SWISS

(Date: 9 Aug 24 2016 SWISS)

After completing the SwissProt section and the corresponding MRS exercises you will:
Know that the SwissProt database is a highly curated database of protein sequences. Each entry in SwissProt contains a well-annotated, validated, and hyperlinked protein sequence.
Know two very important fields of the SwissProt database entries: the cross-references to all other databases and the Features section describing all important sequence elements in the protein sequence.

SwissProt is the brainchild of Amos Bairoch. It is a database of well-annotated, carefully checked protein sequences. SwissProt uses a large team of internal and external annotation and curation experts to ensure that its data are of high quality. The price paid for this quality is, of course, completeness: because every entry is checked by curators, SwissProt cannot contain every known protein sequence. Nevertheless, it should always be the first sequence database you visit when you need sequence data.

How to obtain a SwissProt file

Figure 19. You can obtain SwissProt files using MRS (see the list of useful links). You can also get access to SwissProt files directly from the SwissProt pages.

What does a SwissProt file look like

The file in the supplementary material listed below is the SwissProt file for Crambin. Further down we will explain some of the more important records. SwissProt files are so-called keyword-organised flat-files. That means, the file is human readable (which tends to be called a flat-file or an ASCII file) and every line starts with a keyword (in SwissProt that is a two letter code). These keywords explain what kind of data follow.
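The keyword-organised layout can be illustrated with a minimal Python sketch. It assumes that the two-letter code sits in the first two columns and that the data start in column 6, as in the Crambin lines shown in this section; the function name split_record is only illustrative.

```python
def split_record(line):
    """Split one flat-file line into (two-letter code, data)."""
    code = line[:2]            # columns 1-2 hold the keyword
    data = line[5:].rstrip()   # assumption: data start in column 6
    return code, data


crambin_lines = [
    "ID   CRAM_CRAAB     STANDARD;      PRT;    46 AA.",
    "AC   P01542;",
    "DE   CRAMBIN.",
]

records = [split_record(line) for line in crambin_lines]
for code, data in records:
    print(code, "->", data)
```

Because every line announces its own content, a program can pick out the records it needs and ignore the rest.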

Supplemental material

If you find a sequence using the MRS server or one of the SwissProt search engines, the file will be hyperlinked to many other databases. A few of the lines in the file are more important than others and will be discussed below. For each of them you will see the line extracted from the Crambin file in the supplemental material, the description as given at the SwissProt WWW site, and our comments.

The ID record

ID   CRAM_CRAAB     STANDARD;      PRT;    46 AA.

The ID line

The ID (IDentification) line is always the first line of an entry. The general form of the ID line is:

ID   ENTRY_NAME   DATA_CLASS; MOLECULE_TYPE; SEQUENCE_LENGTH.

The first item on the ID line is the entry name of the sequence. This name is a useful means of identifying a sequence. The entry name consists of up to ten uppercase alphanumeric characters. SwissProt uses a general purpose naming convention that can be symbolized as X_Y, where:

X is a mnemonic code of at most 4 alphanumeric characters representing the protein name. Examples: B2MG is for Beta-2-microglobulin, HBA is for Hemoglobin alpha chain and INS is for Insulin;
The '_' sign serves as a separator;
Y is a mnemonic species identification code of at most 5 alphanumeric characters representing the biological source of the protein. This code is generally made of the first three letters of the genus and the first two letters of the species. Examples: PSEPU is for Pseudomonas putida and NAJNI is for Naja nivea.

However, for the species most commonly encountered in the database, self-explanatory codes are used.
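As a rough sketch, the ID line format described above can be taken apart in Python as follows. The function name parse_id_line and the dictionary keys are our own choices, and the code assumes the four-item layout shown here (other releases of the database may use a different ID line layout).

```python
def parse_id_line(line):
    """Parse 'ID   ENTRY_NAME   DATA_CLASS; MOLECULE_TYPE; SEQUENCE_LENGTH.'"""
    tokens = line.split()
    entry_name = tokens[1]
    # The entry name follows the X_Y convention: protein code, '_', species code.
    protein_code, species_code = entry_name.split("_", 1)
    return {
        "entry_name": entry_name,
        "protein_code": protein_code,
        "species_code": species_code,
        "data_class": tokens[2].rstrip(";"),
        "molecule_type": tokens[3].rstrip(";"),
        "length": int(tokens[4]),  # the number in front of 'AA.'
    }


id_info = parse_id_line("ID   CRAM_CRAAB     STANDARD;      PRT;    46 AA.")
print(id_info)
```

For the Crambin entry this separates the protein code CRAM from the species code CRAAB and gives a sequence length of 46.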

Supplemental material

The AC record

AC   P01542;

The AC (ACcession number) line lists the accession number(s) associated with an entry. The format of the AC line is:

AC   AC_number_1;[ AC_number_2;]...[ AC_number_N;]

Semicolons separate the accession numbers and terminate the list. If necessary, more than one AC line can be used. The purpose of accession numbers is to provide a stable way of identifying entries from release to release of the database. It is sometimes necessary for reasons of consistency to change the names of the entries, for example, to ensure that related entries have similar names. However, an accession number is always conserved, and therefore allows unambiguous citation of SwissProt entries.

Usually there is only one accession code per sequence. The accession code is your only unique, long term guaranteed way to get at the data. Accession codes should be used when referring to a sequence in a publication.
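A small Python sketch of how the accession numbers could be collected from one or more AC lines is given below. The function name parse_ac_lines is illustrative, and the accession numbers X00001 and X00002 in the second call are made up purely to show the multi-number case.

```python
def parse_ac_lines(lines):
    """Collect all accession numbers from the AC lines of an entry."""
    accessions = []
    for line in lines:
        if not line.startswith("AC"):
            continue
        # Semicolons separate the numbers and also terminate the list.
        for item in line[5:].split(";"):
            item = item.strip()
            if item:
                accessions.append(item)
    return accessions


crambin_ac = parse_ac_lines(["AC   P01542;"])
# Hypothetical accession numbers, spread over two AC lines.
several_ac = parse_ac_lines(["AC   X00001; X00002;", "AC   P01542;"])
print(crambin_ac, several_ac)
```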

The DE record

DE   CRAMBIN.

The DE (DEscription) lines contain general descriptive information about the stored sequence. This information is generally sufficient to identify the protein precisely. The format of the DE line is:

DE   DESCRIPTION.

The description is given in plain English (using US-spelling) and is free-format. In cases where more than one DE line is required, the text is only divided between words and only the last DE line is terminated by a period (full stop). The description always starts with the proposed 'official name' of the protein. Synonyms are indicated between brackets. Example:

DE   ANNEXIN V (LIPOCORTIN V) (ENDONEXIN II) (CALPHOBINDIN I) (CBP-I)

In summary, the DE line holds the 'official' name of the molecule. DE lines are also a good way to find molecules if you only have a common molecule name available.
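The DE conventions described above (official name first, synonyms between brackets) can be sketched in Python like this. The function name parse_de_lines is our own, and the sketch assumes that the official name itself contains no brackets, which will not hold for every entry.

```python
import re


def parse_de_lines(lines):
    """Return (official_name, synonyms) from the DE lines of an entry."""
    # Text is only divided between words, so the lines can be joined with a space.
    text = " ".join(line[5:].strip() for line in lines if line.startswith("DE"))
    text = text.rstrip(".")
    synonyms = re.findall(r"\(([^()]*)\)", text)
    official_name = text.split("(", 1)[0].strip()
    return official_name, synonyms


annexin = parse_de_lines(
    ["DE   ANNEXIN V (LIPOCORTIN V) (ENDONEXIN II) (CALPHOBINDIN I) (CBP-I)"]
)
crambin = parse_de_lines(["DE   CRAMBIN."])
print(annexin)
print(crambin)
```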

References

RN   [1]
RP   SEQUENCE.
RX   MEDLINE; 82046542.
RA   TEETER M.M., MAZER J.A., L'ITALIEN J.J.;
RT   "Primary structure of the hydrophobic plant protein crambin.";
RL   Biochemistry 20:5437-5443(1981).

The RN (Reference Number) line gives a sequential number to each reference citation in an entry. This number is used to indicate the reference in comments and feature table notes. The format of the RN line is:

RN   [N]

where 'N' denotes the n-th reference for this entry. The reference number is always enclosed in square brackets.

The RP (Reference Position) line describes the extent of the work carried out by the authors of the reference cited. The format of the RP line is:

RP   COMMENT.

The RC (Reference Comment) lines are optional lines which are used to store comments relevant to the reference cited.

The RX (Reference cross-reference) line is an optional line which is used to indicate the identifier assigned to a specific reference in a bibliographic database. The format of the RX line is:

RX   BIBLIOGRAPHIC_DATABASE_NAME; IDENTIFIER.

Where the valid bibliographic database names and their associated identifier are:

Name: MEDLINE
Database: Medline from the National Library of Medicine (NLM)
Identifier: Eight-digit Medline Unique Identifier (UID)

Example of RX line:

RX   MEDLINE; 82046542.

The RA (Reference Author) lines list the authors of the paper (or other work) cited. All of the authors are included, and are listed in the order given in the paper.

The RT (Reference Title) lines give the title of the paper (or other work) cited as exactly as possible given the limitations of the computer character set.

The RL (Reference Location) lines contain the conventional citation information for the reference.
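The reference lines described above can be grouped per citation with a short Python sketch. It assumes that every reference starts with an RN line and that the other R* lines belong to the most recent RN; the function name parse_references is illustrative.

```python
def parse_references(lines):
    """Group the R* lines of an entry into one dictionary per reference."""
    references = []
    for line in lines:
        code, data = line[:2], line[5:].strip()
        if code == "RN":
            references.append({"RN": data})  # a new citation starts here
        elif code in ("RP", "RC", "RX", "RA", "RT", "RL") and references:
            current = references[-1]
            # Continuation lines with the same code are joined with a space.
            current[code] = (current.get(code, "") + " " + data).strip()
    return references


crambin_refs = parse_references([
    "RN   [1]",
    "RP   SEQUENCE.",
    "RX   MEDLINE; 82046542.",
    "RA   TEETER M.M., MAZER J.A., L'ITALIEN J.J.;",
    'RT   "Primary structure of the hydrophobic plant protein crambin.";',
    "RL   Biochemistry 20:5437-5443(1981).",
])

first = crambin_refs[0]
database, identifier = [part.strip() for part in first["RX"].rstrip(".").split(";")]
authors = [name.strip() for name in first["RA"].rstrip(";").split(",")]
print(database, identifier, authors)
```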

The DR record

DR   PIR; A01805; KECX.
DR   PDB; 1CRN; 16-APR-87.
DR   PDB; 1CBN; 31-JAN-94.
DR   PDB; 1CCM; 31-OCT-93.
DR   PDB; 1CCN; 31-JAN-94.
DR   PDB; 1CNR; 31-AUG-94.
DR   PDB; 1AB1; 12-AUG-97.
DR   PFAM; PF00321; plant_thionins; 1.
DR   PROSITE; PS00271; THIONIN; 1.

The DR (Database cross-Reference) lines are used as pointers to information related to SwissProt entries and found in data collections other than SwissProt. For example, if the X-ray crystallographic atomic coordinates of a sequence are stored in the Brookhaven Protein Data Bank (PDB) there will be DR line(s) pointing to the corresponding entry(ies) in that database. For a sequence translated from a nucleotide sequence there exist DR lines pointing to the relevant entries in the EMBL/GenBank/DDBJ database which correspond to the DNA or RNA sequence(s) from which it was translated. The format of the DR line is:

DR   DATABASE_IDENTIFIER; PRIMARY_IDENTIFIER; SECONDARY_IDENTIFIER.

The first item on the DR line, the 'DATABASE_IDENTIFIER', is the abbreviated name of the data collection to which reference is made. The currently defined database identifiers are listed in the supplemental material.

Supplemental material

The pointers to other databases are normally hyperlinked when you find a SwissProt file via MRS or another WWW-based search engine.
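When you work with the plain file rather than the hyperlinked page, the DR lines can be turned into a lookup table with a few lines of Python. This is only a sketch: the function name parse_dr_lines is our own, and everything after the primary identifier is kept as an uninterpreted list because its meaning differs per database.

```python
from collections import defaultdict


def parse_dr_lines(lines):
    """Map each database identifier to its (primary id, other fields) pairs."""
    xrefs = defaultdict(list)
    for line in lines:
        if not line.startswith("DR"):
            continue
        fields = [f.strip() for f in line[5:].rstrip().rstrip(".").split(";")]
        database, primary = fields[0], fields[1]
        xrefs[database].append((primary, fields[2:]))
    return dict(xrefs)


crambin_xrefs = parse_dr_lines([
    "DR   PIR; A01805; KECX.",
    "DR   PDB; 1CRN; 16-APR-87.",
    "DR   PDB; 1CBN; 31-JAN-94.",
    "DR   PDB; 1CCM; 31-OCT-93.",
    "DR   PDB; 1CCN; 31-JAN-94.",
    "DR   PDB; 1CNR; 31-AUG-94.",
    "DR   PDB; 1AB1; 12-AUG-97.",
    "DR   PFAM; PF00321; plant_thionins; 1.",
    "DR   PROSITE; PS00271; THIONIN; 1.",
])

pdb_ids = [primary for primary, _ in crambin_xrefs["PDB"]]
print(pdb_ids)
```

For Crambin this yields the six PDB entries listed in the DR lines, which is a quick way to find all available structures of a protein.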

The FT record

FT   DISULFID      3     40
FT   DISULFID      4     32
FT   DISULFID     16     26
FT   VARIANT      22     22       P -> S.
FT   VARIANT      25     25       I -> L.
FT   STRAND        2      3
FT   HELIX         7     16
FT   TURN         17     19
FT   HELIX        23     30
FT   TURN         31     31
FT   STRAND       33     34
FT   TURN         42     43

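Most FT (Feature Table) records are self-explanatory. Each line gives a feature key (for example DISULFID, VARIANT or HELIX), the first and last residue position of the feature, and an optional description. For DISULFID records the two positions are the two cysteines joined by the disulfide bond.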
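The FT lines can be checked against the sequence with a short Python sketch. It assumes whitespace-separated columns (key, first position, last position, optional description) and 1-based residue numbering; the function name parse_ft_line is illustrative, and the sequence is the one from the SQ record of the Crambin entry.

```python
def parse_ft_line(line):
    """Return (key, first, last, description) for one FT line."""
    parts = line.split(None, 4)
    key, first, last = parts[1], int(parts[2]), int(parts[3])
    description = parts[4].strip() if len(parts) > 4 else ""
    return key, first, last, description


sequence = (
    "TTCCPSIVAR" "SNFNVCRLPG" "TPEAICATYT" "GCIIIPGATC" "PGDYAN"
)

features = [parse_ft_line(line) for line in [
    "FT   DISULFID      3     40",
    "FT   DISULFID      4     32",
    "FT   DISULFID     16     26",
    "FT   VARIANT      22     22       P -> S.",
    "FT   VARIANT      25     25       I -> L.",
    "FT   HELIX         7     16",
]]

# Both ends of a disulfide bond should be cysteines (positions are 1-based).
disulfide_ok = [
    sequence[first - 1] == "C" and sequence[last - 1] == "C"
    for key, first, last, _ in features if key == "DISULFID"
]
# The residue before '->' in a VARIANT should match the stored sequence.
variant_ok = [
    sequence[first - 1] == description.split("->")[0].strip()
    for key, first, _, description in features if key == "VARIANT"
]
print(disulfide_ok, variant_ok)
```

Such consistency checks are a simple way to convince yourself that you read the positions in the feature table correctly.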

The SQ record (the actual sequence)

SQ   SEQUENCE   46 AA;  4736 MW;  F6ADE458 CRC32;
     TTCCPSIVAR SNFNVCRLPG TPEAICATYT GCIIIPGATC PGDYAN

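One would almost forget, but the SwissProt file also contains the sequence itself. The SQ line summarises the sequence (its length in amino acids, its molecular weight, and a checksum), and the residues follow on the next line(s). The exact format of this sequence part of the file can depend on the search engine used to retrieve the entry.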
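A minimal Python sketch for reading the SQ record as it is shown above: it assumes the header layout of this particular example (length, molecular weight, checksum) and residue lines written in blocks separated by spaces. The function name parse_sq_record is our own, and other retrieval systems may present this part differently.

```python
def parse_sq_record(header, sequence_lines):
    """Parse the SQ header and join the residue blocks into one string."""
    tokens = header.split()
    info = {
        "length": int(tokens[2]),            # number in front of 'AA;'
        "molecular_weight": int(tokens[4]),  # number in front of 'MW;'
        "checksum": tokens[6],               # value in front of 'CRC32;'
    }
    # Remove all whitespace between and around the residue blocks.
    sequence = "".join("".join(line.split()) for line in sequence_lines)
    return info, sequence


sq_info, crambin_seq = parse_sq_record(
    "SQ   SEQUENCE   46 AA;  4736 MW;  F6ADE458 CRC32;",
    ["     TTCCPSIVAR SNFNVCRLPG TPEAICATYT GCIIIPGATC PGDYAN"],
)

# The length stated in the header should match the residues actually read.
length_matches = len(crambin_seq) == sq_info["length"]
print(sq_info, crambin_seq, length_matches)
```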