After completing the SwissProt section and the corresponding MRS exercises you will:
SwissProt is the brain-child of Amos Bairoch. SwissProt is a database of well annotated, carefully checked protein sequences. SwissProt uses a large team of internal and external annotation and curation experts to ensure that their data are of high quality. The price paid for this quality is, of course, completeness. There is no way SwissProt can be 'complete'. Nevertheless, it should always be your first sequence database to visit when you need sequence data.
Figure 19. You can obtain SwissProt files using MRS (see the list of useful links). You can also get access to SwissProt files directly from the SwissProt pages.
The file in the supplementary material listed below is the SwissProt file for Crambin. Further down we will explain some of the more important records. SwissProt files are so-called keyword-organised flat-files. That means that the file is human-readable (which tends to be called a flat-file or an ASCII file) and that every line starts with a keyword (in SwissProt that is a two-letter code). These keywords explain what kind of data follow.
Supplemental material

If you find a sequence using the MRS server or one of the SwissProt search engines, the file will be hyperlinked to many other databases. A few of the line types in a SwissProt file are more important than others and will be discussed below. For each of them you will see the line extracted from the Crambin file in the supplemental material, the description as given at the SwissProt WWW site, and our comments.
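The keyword-organised layout can be illustrated with a few lines of Python. The following is only a minimal sketch, not an official parser: the function name `group_by_line_code` is our own, and we assume that the two-letter code is separated from the data by whitespace. Lines that do not start with a two-letter code (such as the raw sequence data) are simply skipped.

```python
from collections import defaultdict


def group_by_line_code(text):
    """Group the data part of each line under its two-letter line code."""
    records = defaultdict(list)
    for line in text.splitlines():
        parts = line.split(None, 1)
        if not parts:
            continue  # skip blank lines
        code = parts[0]
        # Assumption: only two-letter uppercase tokens are line codes;
        # anything else (e.g. sequence data) is ignored in this sketch.
        if len(code) != 2 or not code.isupper():
            continue
        data = parts[1].strip() if len(parts) > 1 else ""
        records[code].append(data)
    return dict(records)


# A few lines taken from the Crambin entry discussed in this section.
sample = """\
ID CRAM_CRAAB STANDARD; PRT; 46 AA.
AC P01542;
DE CRAMBIN.
DR PDB; 1CRN; 16-APR-87.
DR PDB; 1CBN; 31-JAN-94.
"""

records = group_by_line_code(sample)
print(records["AC"])        # ['P01542;']
print(len(records["DR"]))   # 2
```

Grouping by line code keeps repeated line types (such as the DR lines) together, which is convenient for the per-line parsing discussed below.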
ID CRAM_CRAAB STANDARD; PRT; 46 AA.
The ID (IDentification) line is always the first line of an entry. The general form of the ID line is:
ID ENTRY_NAME DATA_CLASS; MOLECULE_TYPE; SEQUENCE_LENGTH.
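The first item on the ID line is the entry name of the sequence. This name is a useful means of identifying a sequence. The entry name consists of up to ten uppercase alphanumeric characters. SwissProt uses a general purpose naming convention that can be symbolized as X_Y, where X is a mnemonic code of at most four characters representing the protein name, and Y is a mnemonic code of at most five characters representing the species the protein comes from. The species code is generally made of the first three letters of the genus and the first two letters of the species; in CRAM_CRAAB, CRAM stands for crambin and CRAAB for Crambe abyssinica.
However, for species most commonly encountered in the database, self-explanatory codes are used.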
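As an illustration of the general form of the ID line, the sketch below pulls the four items out of the Crambin ID line and splits the entry name at the underscore. The names `IdLine`, `parse_id_line` and `split_entry_name` are ours, and the regular expression assumes the ID line looks exactly like the general form given above.

```python
import re
from typing import NamedTuple


class IdLine(NamedTuple):
    entry_name: str
    data_class: str
    molecule_type: str
    length: int


# ID ENTRY_NAME DATA_CLASS; MOLECULE_TYPE; SEQUENCE_LENGTH.
ID_PATTERN = re.compile(r"^ID\s+(\S+)\s+(\w+);\s+(\w+);\s+(\d+)\s+AA\.$")


def parse_id_line(line):
    """Split an ID line into entry name, data class, molecule type and length."""
    match = ID_PATTERN.match(line.strip())
    if match is None:
        raise ValueError(f"not a recognisable ID line: {line!r}")
    name, data_class, molecule_type, length = match.groups()
    return IdLine(name, data_class, molecule_type, int(length))


def split_entry_name(entry_name):
    """Split an X_Y entry name into its X and Y parts."""
    x, _, y = entry_name.partition("_")
    return x, y


id_line = parse_id_line("ID CRAM_CRAAB STANDARD; PRT; 46 AA.")
print(id_line.entry_name, id_line.length)     # CRAM_CRAAB 46
print(split_entry_name(id_line.entry_name))   # ('CRAM', 'CRAAB')
```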
Supplemental material
AC P01542;
The AC (ACcession number) line lists the accession number(s) associated with an entry. The format of the AC line is:
AC AC_number_1;[ AC_number_2;]...[ AC_number_N;]
Semicolons separate the accession numbers and terminate the list. If necessary, more than one AC line can be used. The purpose of accession numbers is to provide a stable way of identifying entries from release to release of the database. It is sometimes necessary for reasons of consistency to change the names of the entries, for example, to ensure that related entries have similar names. However, an accession number is always conserved, and therefore allows unambiguous citation of SwissProt entries.
Usually there is only one accession code per sequence. The accession code is the only unique identifier that is guaranteed to give you access to the data in the long term. Accession codes should therefore be used when referring to a sequence in a publication.
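Because semicolons both separate and terminate the accession numbers, and because more than one AC line may occur, collecting them takes only a few lines. This is a sketch under those assumptions; `parse_ac_lines` is a name we made up, and the second call simply reuses the placeholders from the format line above.

```python
def parse_ac_lines(lines):
    """Collect the accession numbers from one or more AC lines, in order."""
    accessions = []
    for line in lines:
        parts = line.split(None, 1)
        if len(parts) < 2 or parts[0] != "AC":
            continue  # not an AC line, or an AC line without data
        for item in parts[1].split(";"):
            item = item.strip()
            if item:  # the terminating semicolon leaves an empty last item
                accessions.append(item)
    return accessions


print(parse_ac_lines(["AC P01542;"]))
# ['P01542']
print(parse_ac_lines(["AC AC_number_1; AC_number_2;", "AC AC_number_3;"]))
# ['AC_number_1', 'AC_number_2', 'AC_number_3']
```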
DE CRAMBIN.
The DE (DEscription) lines contain general descriptive information about the stored sequence. This information is generally sufficient to identify the protein precisely. The format of the DE line is:
DE DESCRIPTION.
The description is given in plain English (using US-spelling) and is free-format. In cases where more than one DE line is required, the text is only divided between words and only the last DE line is terminated by a period (full stop). The description always starts with the proposed 'official name' of the protein. Synonyms are indicated between brackets. Example:
DE ANNEXIN V (LIPOCORTIN V) (ENDONEXIN II) (CALPHOBINDIN I) (CBP-I)
In summary, the DE line holds the 'official' name of the molecule. DE lines are also a good way to find molecules if you only have a common molecule name available.
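The rules above (the official name comes first, synonyms follow between brackets, and only the last DE line ends with a period) suggest a simple way to separate the name from its synonyms. The sketch below assumes that brackets are not nested; `parse_description` is our own helper name.

```python
import re


def parse_description(de_lines):
    """Return (official_name, synonyms) from one or more DE lines."""
    pieces = []
    for line in de_lines:
        parts = line.split(None, 1)
        if len(parts) == 2 and parts[0] == "DE":
            pieces.append(parts[1].strip())
    # DE text is only divided between words, so joining with a space is safe.
    text = " ".join(pieces).rstrip(".").strip()
    # Assumption: synonyms are simple, non-nested bracketed groups.
    synonyms = re.findall(r"\(([^()]*)\)", text)
    name = text.split("(", 1)[0].strip()
    return name, synonyms


name, synonyms = parse_description(
    ["DE ANNEXIN V (LIPOCORTIN V) (ENDONEXIN II) (CALPHOBINDIN I) (CBP-I)"]
)
print(name)      # ANNEXIN V
print(synonyms)  # ['LIPOCORTIN V', 'ENDONEXIN II', 'CALPHOBINDIN I', 'CBP-I']
```

Searching on both the name and the synonyms is what makes DE lines useful when only a common molecule name is known.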
RN [1]
RP SEQUENCE.
RX MEDLINE; 82046542.
RA TEETER M.M., MAZER J.A., L'ITALIEN J.J.;
RT "Primary structure of the hydrophobic plant protein crambin.";
RL Biochemistry 20:5437-5443(1981).
The RN (Reference Number) line gives a sequential number to each reference citation in an entry. This number is used to indicate the reference in comments and feature table notes. The format of the RN line is:
RN [N]
where 'N' denotes the n-th reference for this entry. The reference number is always enclosed in square brackets.
The RP (Reference Position) line describes the extent of the work carried out by the authors of the reference cited. The format of the RP line is:
RP COMMENT.
The RC (Reference Comment) lines are optional lines which are used to store comments relevant to the reference cited.
The RX (Reference cross-reference) line is an optional line which is used to indicate the identifier assigned to a specific reference in a bibliographic database. The format of the RX line is:
RX BIBLIOGRAPHIC_DATABASE_NAME; IDENTIFIER.
where the valid bibliographic database names and their associated identifiers are:
Name: MEDLINE
Database: Medline from the National Library of Medicine (NLM)
Identifier: Eight-digit Medline Unique Identifier (UID)
Example of RX line:
RX MEDLINE; 82046542.
The RA (Reference Author) lines list the authors of the paper (or other work) cited. All of the authors are included, and are listed in the order given in the paper.
The RT (Reference Title) lines give the title of the paper (or other work) cited as exactly as possible given the limitations of the computer character set.
The RL (Reference Location) lines contain the conventional citation information for the reference.
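Since every reference starts with an RN line, the R-lines of an entry can be grouped into one record per citation. The following sketch does this with plain dictionaries; the function names are ours, and we assume that repeated lines with the same code within one reference are continuation lines that can be joined with a space.

```python
REFERENCE_CODES = {"RP", "RC", "RX", "RA", "RT", "RL"}


def parse_references(lines):
    """Group R-lines into one dictionary per reference, keyed by line code."""
    references = []
    current = None
    for line in lines:
        parts = line.split(None, 1)
        if len(parts) < 2:
            continue
        code, data = parts[0], parts[1].strip()
        if code == "RN":
            # RN [N] opens a new reference; N is enclosed in square brackets.
            current = {"RN": int(data.strip("[]"))}
            references.append(current)
        elif code in REFERENCE_CODES and current is not None:
            current[code] = f"{current[code]} {data}" if code in current else data
    return references


def medline_uid(reference):
    """Return the Medline UID from the RX line, or None if there is none."""
    database, _, identifier = reference.get("RX", "").rstrip(".").partition(";")
    return identifier.strip() if database.strip() == "MEDLINE" else None


crambin_reference = [
    "RN [1]",
    "RP SEQUENCE.",
    "RX MEDLINE; 82046542.",
    "RA TEETER M.M., MAZER J.A., L'ITALIEN J.J.;",
    'RT "Primary structure of the hydrophobic plant protein crambin.";',
    "RL Biochemistry 20:5437-5443(1981).",
]

references = parse_references(crambin_reference)
print(references[0]["RN"], references[0]["RL"])
print(medline_uid(references[0]))  # 82046542
```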
DR PIR; A01805; KECX.
DR PDB; 1CRN; 16-APR-87.
DR PDB; 1CBN; 31-JAN-94.
DR PDB; 1CCM; 31-OCT-93.
DR PDB; 1CCN; 31-JAN-94.
DR PDB; 1CNR; 31-AUG-94.
DR PDB; 1AB1; 12-AUG-97.
DR PFAM; PF00321; plant_thionins; 1.
DR PROSITE; PS00271; THIONIN; 1.
The DR (Database cross-Reference) lines are used as pointers to information related to SwissProt entries and found in data collections other than SwissProt. For example, if the X-ray crystallographic atomic coordinates of a sequence are stored in the Brookhaven Protein Data Bank (PDB) there will be DR line(s) pointing to the corresponding entry(ies) in that database. For a sequence translated from a nucleotide sequence there exist DR lines pointing to the relevant entries in the EMBL/GenBank/DDBJ database which correspond to the DNA or RNA sequence(s) from which it was translated. The format of the DR line is:
DR DATABASE_IDENTIFIER; PRIMARY_IDENTIFIER; SECONDARY_IDENTIFIER.
The first item on the DR line, the 'DATABASE_IDENTIFIER', is the abbreviated name of the data collection to which reference is made. The currently defined database identifiers are listed in the supplemental material.
Supplemental material

The pointers to other databases are normally hyperlinked when you find a SwissProt file via SRS or another WWW-based search engine.
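When you work with a plain flat-file instead of a hyperlinked page, the DR lines can be split into their identifiers by hand. The sketch below follows the DR format given above; `CrossRef` and `parse_dr_line` are hypothetical names, and any fields beyond the second (as in the PFAM and PROSITE lines) are simply kept together as the secondary identifier.

```python
from typing import NamedTuple


class CrossRef(NamedTuple):
    database: str
    primary: str
    secondary: str


def parse_dr_line(line):
    """Split a DR line into database, primary and secondary identifier."""
    parts = line.split(None, 1)
    if len(parts) < 2 or parts[0] != "DR":
        raise ValueError(f"not a DR line: {line!r}")
    fields = [field.strip() for field in parts[1].strip().rstrip(".").split(";")]
    if len(fields) < 2:
        raise ValueError(f"too few fields in DR line: {line!r}")
    # Anything after the primary identifier is kept as one string.
    return CrossRef(fields[0], fields[1], "; ".join(fields[2:]))


crambin_dr_lines = [
    "DR PIR; A01805; KECX.",
    "DR PDB; 1CRN; 16-APR-87.",
    "DR PDB; 1CBN; 31-JAN-94.",
    "DR PDB; 1CCM; 31-OCT-93.",
    "DR PDB; 1CCN; 31-JAN-94.",
    "DR PDB; 1CNR; 31-AUG-94.",
    "DR PDB; 1AB1; 12-AUG-97.",
    "DR PFAM; PF00321; plant_thionins; 1.",
    "DR PROSITE; PS00271; THIONIN; 1.",
]

cross_refs = [parse_dr_line(line) for line in crambin_dr_lines]
pdb_entries = [ref.primary for ref in cross_refs if ref.database == "PDB"]
print(pdb_entries)  # ['1CRN', '1CBN', '1CCM', '1CCN', '1CNR', '1AB1']
```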
FT DISULFID 3 40
FT DISULFID 4 32
FT DISULFID 16 26
FT VARIANT 22 22 P -> S.
FT VARIANT 25 25 I -> L.
FT STRAND 2 3
FT HELIX 7 16
FT TURN 17 19
FT HELIX 23 30
FT TURN 31 31
FT STRAND 33 34
FT TURN 42 43
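Most FT records are self-explanatory: each gives a feature key, followed by the first and the last residue position the feature applies to and, optionally, a description. For example, the DISULFID record with positions 3 and 40 describes a disulfide bond between residues 3 and 40.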
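As a small consistency check, the FT records of Crambin can be compared with its sequence. The sketch below assumes that an FT line consists of a feature key, two residue positions and an optional description, and that positions count residues from 1; the names `Feature`, `parse_ft_line` and `residue_at` are ours.

```python
from typing import NamedTuple


class Feature(NamedTuple):
    key: str
    start: int
    end: int
    description: str


def parse_ft_line(line):
    """Split an FT line into key, start, end and optional description."""
    parts = line.strip().split(None, 4)
    if len(parts) < 4 or parts[0] != "FT":
        raise ValueError(f"not a recognisable FT line: {line!r}")
    description = parts[4] if len(parts) > 4 else ""
    return Feature(parts[1], int(parts[2]), int(parts[3]), description)


def residue_at(sequence, position):
    """Return the residue at a 1-based position (assumed FT numbering)."""
    return sequence[position - 1]


# Sequence and a few FT lines from the Crambin entry.
crambin = "TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN"
ft_lines = [
    "FT DISULFID 3 40",
    "FT DISULFID 4 32",
    "FT DISULFID 16 26",
    "FT VARIANT 22 22 P -> S.",
    "FT HELIX 7 16",
]

features = [parse_ft_line(line) for line in ft_lines]
for feature in features:
    if feature.key == "DISULFID":
        # Both ends of a disulfide bond should be cysteines (C).
        print(feature.start, residue_at(crambin, feature.start),
              feature.end, residue_at(crambin, feature.end))
```

For the three DISULFID records both positions are indeed cysteines, which supports the assumption about the numbering.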
SQ SEQUENCE 46 AA; 4736 MW; F6ADE458 CRC32;
TTCCPSIVAR SNFNVCRLPG TPEAICATYT GCIIIPGATC PGDYAN
One would almost forget, but the SwissProt file does also contain a sequence. The exact layout of the sequence part of the file can depend on the search engine used to retrieve the entry.
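Because the layout of the sequence part can vary, it is safest to strip all whitespace from the sequence lines and to compare the result with the length given on the SQ line. The sketch below does this for the Crambin entry; `parse_sequence` is our own name, and the regular expression only assumes the "SQ SEQUENCE n AA; m MW;" start of the header shown above.

```python
import re


def parse_sequence(lines):
    """Return (declared_length, declared_weight, sequence) from the SQ block."""
    header = re.match(r"SQ\s+SEQUENCE\s+(\d+)\s+AA;\s+(\d+)\s+MW;", lines[0].strip())
    if header is None:
        raise ValueError(f"not a recognisable SQ line: {lines[0]!r}")
    declared_length, declared_weight = int(header.group(1)), int(header.group(2))
    # Remove the spaces between blocks of residues and join all data lines.
    sequence = "".join("".join(line.split()) for line in lines[1:])
    return declared_length, declared_weight, sequence


sq_block = [
    "SQ SEQUENCE 46 AA; 4736 MW; F6ADE458 CRC32;",
    "TTCCPSIVAR SNFNVCRLPG TPEAICATYT GCIIIPGATC PGDYAN",
]

declared_length, declared_weight, sequence = parse_sequence(sq_block)
print(sequence)
print(declared_length == len(sequence))  # True
```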