Intro Bioinformatics

• EMBL

EU name: EMBLDB

After completing the EMBL section and the corresponding MRS exercises you will:
Know that EMBL is a database with nucleotide sequences (DNA, RNA).
Know that each EMBL entry is a keyword-driven flatfile.
Know that the size of the EMBL database grows exponentially.
Understand the records that are being used in an EMBL file.

The EMBL nucleotide database has a confusing name, because it shares its name with the European Molecular Biology Laboratory (EMBL). This is no accident: the EMBL started the EMBL database. Nowadays, the EMBL database is maintained at the EBI, which collaborates with GenBank (USA) and DDBJ (Japan).

What does an EMBL file look like?

The supplemental material holds an EMBL entry.

Supplemental material

The nomenclature and abbreviations for records in the file look very much like those used by SwissProt. That is not surprising because the same people were involved. Some of the more important records will be described below.
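Because every line of an entry starts with a two-letter code, an entry can be split into records with very little code. The following is a minimal sketch, not an official parser. It assumes, based on the sample lines shown in this section, that the code occupies the first two columns and the content starts at column six. The function name `group_by_line_code` and the key `"seq"` for the code-less sequence lines are made up for this illustration.

```python
from collections import defaultdict


def group_by_line_code(entry_text):
    """Group the lines of one flatfile entry by their two-letter line code."""
    records = defaultdict(list)
    for line in entry_text.splitlines():
        if not line.strip():
            continue  # skip empty lines
        code = line[:2]
        if not code.strip():
            # Sequence lines start with blanks instead of a code;
            # "seq" is a hypothetical key chosen for this sketch.
            code = "seq"
        # Assumption: the content starts at column six (index 5).
        records[code].append(line[5:].strip())
    return dict(records)


sample = """ID   BPTIIICH   standard; RNA; INV; 226 BP.
AC   X82313;
DE   B.pahangi mRNA for type III collagen homologue;
RN   [1]
RX   MEDLINE; 95364849.
"""

records = group_by_line_code(sample)
print(records["AC"])  # ['X82313;']
```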

The ID record

ID   BPTIIICH   standard; RNA; INV; 226 BP.

This is the entry name. The remaining fields (data class, molecule type, taxonomic division, and sequence length) are not needed in this course.
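For readers who are curious anyway, the ID line can be taken apart as sketched below. The field names (data class, molecule type, taxonomic division, length) follow the older EMBL ID line layout used in this example. The function name and dictionary keys are chosen for illustration only, and newer releases of the database use a different ID line layout that this sketch does not handle.

```python
def parse_id_line(line):
    """Split an old-style EMBL ID line into its fields."""
    body = line[5:].strip().rstrip(".")
    # The entry name is separated from the rest by whitespace.
    name, rest = body.split(None, 1)
    # The remaining fields are separated by semicolons.
    data_class, molecule, division, length = [f.strip() for f in rest.split(";")]
    return {
        "name": name,
        "data_class": data_class,
        "molecule": molecule,
        "division": division,
        "length": int(length.split()[0]),  # "226 BP" becomes 226
    }


fields = parse_id_line("ID   BPTIIICH   standard; RNA; INV; 226 BP.")
print(fields["name"], fields["length"])  # BPTIIICH 226
```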

The AC record

AC   X82313;

This is the accession code or accession number. The purpose of accession numbers is to provide a stable way of identifying entries from release to release of the database. It is sometimes necessary for reasons of consistency to change the names of the entries, for example, to ensure that related entries have similar names. However, an accession number is always conserved, and therefore allows unambiguous citation of entries.
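Extracting the accession number from an AC record could look like the sketch below. It assumes that accession numbers are separated by semicolons and that an entry may list more than one, in which case the first is treated as the primary one. The function name `parse_accessions` is hypothetical.

```python
def parse_accessions(ac_lines):
    """Collect all accession numbers from the AC lines of one entry."""
    accessions = []
    for line in ac_lines:
        # Drop the line code, then split the content on semicolons.
        for item in line[5:].split(";"):
            item = item.strip()
            if item:
                accessions.append(item)
    return accessions


accessions = parse_accessions(["AC   X82313;"])
primary = accessions[0]
print(primary)  # X82313
```

Storing and citing the accession number rather than the entry name keeps a reference valid even if the entry is renamed in a later release.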

The DE record

DE   B.pahangi mRNA for type III collagen homologue;

This is the name of the molecule. In principle, this name should match the corresponding DE record in SwissProt. A human reader will indeed recognise the two as the same, but a computer might have trouble finding the SwissProt DE record given the EMBL DE record, or vice versa. In each of the following example pairs the SwissProt DE record comes first, with the EMBL DE record immediately below.

DE   100 KD PROTEIN (EC 6.3.2.-).
DE   R.norvegicus mRNA for 100 kDa protein

DE   CYSTATIN RELATED PROTEIN 2 PRECURSOR (PROSTATIC 22 KD GLYCOPROTEIN P22K15).
DE   Rat prostatic 22-kD glycoprotein mRNA, complete cds.

DE   3-HYDROXYANTHRANILATE 3,4-DIOXYGENASE (EC 1.13.11.6) (3-HAO) (3-HYDROXYANTHRANILIC ACID
DE   DIOXYGENASE) (3-HYDROXYANTHRANILATE OXYGENASE).
DE   Rat mRNA for 3-hydroxyanthranilate 3,4-dioxygenase, complete cds.

DE   5-HYDROXYTRYPTAMINE 1A RECEPTOR (5-HT-1A) (SEROTONIN RECEPTOR) (5-HT1A).
DE   Rat 5-hydroxytryptamine-1a receptor (5-HT-1a) gene, complete cds.
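The difficulty for computers can be made concrete with the last pair above. In the sketch below, an exact string comparison fails even after lower-casing, while a crude word-overlap score (the Jaccard index on alphanumeric tokens) shows that the two descriptions share a good part of their vocabulary. The scoring method is only an illustration chosen for this example. It is not how SwissProt and EMBL entries are actually linked; explicit cross-references such as the /db_xref qualifier in the FT record serve that purpose.

```python
import re


def tokens(de_line):
    """Return the set of lower-cased alphanumeric words in a DE line."""
    return set(re.findall(r"[a-z0-9]+", de_line[5:].lower()))


def word_overlap(de_a, de_b):
    """Jaccard index: shared words divided by all distinct words."""
    a, b = tokens(de_a), tokens(de_b)
    return len(a & b) / len(a | b)


swissprot = "DE   5-HYDROXYTRYPTAMINE 1A RECEPTOR (5-HT-1A) (SEROTONIN RECEPTOR) (5-HT1A)."
embl = "DE   Rat 5-hydroxytryptamine-1a receptor (5-HT-1a) gene, complete cds."

# Exact comparison fails, even when upper and lower case are ignored.
print(swissprot[5:].lower() == embl[5:].lower())  # False

# The word overlap is well above zero but far from a perfect 1.0.
print(round(word_overlap(swissprot, embl), 2))
```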

Literature

RN   [1]
RX   MEDLINE; 95364849.
RA   Martin S.A.M., Thompson F.J., Devaney E.;
RT   "The construction of spliced leader cDNA libraries from the filarial
RT   nematode Brugia pahangi.";
RL   Mol. Biochem. Parasitol. 70:241-245(1995).

The literature references part is similar to the one in SwissProt.

The FT record

FT   source          1..226
FT                   /db_xref="taxon:6280"
FT                   /organism="Brugia pahangi"
FT                   /dev_stage="adult"
FT                   /clone_lib="SL1 cDNA library"
FT   CDS             58..>226
FT                   /db_xref="SPTREMBL:Q17277"
FT                   /product="type III collagen homologue"
FT                   /protein_id="CAA57756.1"
FT                   /translation="MIQACPPKGERGVAGERDPPGVKGVRGPQGEMGPPGREGDVGLPG
FT                   MPGPRDQWDRR"

The FT records indicate the source of the genetic material and the part of the genetic material that is translated into protein. Be aware that this translation is often based on a prediction made by a computer program and not on experimental determinations.
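The CDS feature of this entry can be checked by hand or with a few lines of code. The sketch below reads the location 58..>226, cuts that region out of the sequence, and translates it with the standard genetic code. It assumes that the location is a simple start..end range, and that a leading ">" on the end position means the coding region continues beyond the last base of the entry, so no stop codon is expected. Joined or complementary locations are not handled. The function names are made up for this example.

```python
import re
from itertools import product

# Standard genetic code, with codons ordered t, c, a, g at each position.
BASES = "tcag"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def parse_simple_location(location):
    """Parse 'start..end' with optional '<' or '>' partial markers."""
    match = re.fullmatch(r"(<?)(\d+)\.\.(>?)(\d+)", location)
    if match is None:
        raise ValueError("unsupported location: " + location)
    start, end = int(match.group(2)), int(match.group(4))
    return start, end, bool(match.group(1)), bool(match.group(3))


def translate(dna):
    """Translate complete codons; a trailing partial codon is ignored."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


SEQUENCE_LINES = """
attgtcaaca ccaatgcagc aaaacattct ttatcactga ttttctgcgc tattttcatg
attcaagctt gtccacctaa gggagagaga ggagttgcag gagagaggga ccccccagga
gtgaaaggag tgagaggacc tcaaggagag atgggaccac ctggaagaga aggcgatgta
ggattgccag gtatgcctgg accgagagac caatgggacc gcaggt
"""
sequence = "".join(SEQUENCE_LINES.split())

# The /translation qualifier from the FT record, joined into one string.
TRANSLATION = "MIQACPPKGERGVAGERDPPGVKGVRGPQGEMGPPGREGDVGLPGMPGPRDQWDRR"

start, end, partial_start, partial_end = parse_simple_location("58..>226")
# Locations are 1-based and inclusive; Python slices are 0-based.
protein = translate(sequence[start - 1:end])
print(protein == TRANSLATION)  # True
```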

The SQ record

SQ   Sequence 226 BP; 67 A; 46 C; 69 G; 44 T; 0 other;
     attgtcaaca ccaatgcagc aaaacattct ttatcactga ttttctgcgc tattttcatg        60
     attcaagctt gtccacctaa gggagagaga ggagttgcag gagagaggga ccccccagga       120
     gtgaaaggag tgagaggacc tcaaggagag atgggaccac ctggaagaga aggcgatgta       180
     ggattgccag gtatgcctgg accgagagac caatgggacc gcaggt                      226

The SQ record gives the sequence length and the base composition (the number of A, C, G and T residues), and is followed by the actual sequence.
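The counts in the SQ header can be verified against the sequence lines, which is a useful sanity check after copying or editing an entry. The sketch below assumes the header layout shown above and that the sequence lines contain only lower-case bases, spaces, and a running position number at the end of each line. The function names are chosen for this illustration.

```python
import re
from collections import Counter

SQ_BLOCK = """SQ   Sequence 226 BP; 67 A; 46 C; 69 G; 44 T; 0 other;
     attgtcaaca ccaatgcagc aaaacattct ttatcactga ttttctgcgc tattttcatg        60
     attcaagctt gtccacctaa gggagagaga ggagttgcag gagagaggga ccccccagga       120
     gtgaaaggag tgagaggacc tcaaggagag atgggaccac ctggaagaga aggcgatgta       180
     ggattgccag gtatgcctgg accgagagac caatgggacc gcaggt                      226
"""


def parse_sq_header(line):
    """Return the stated length and the stated count per base."""
    length = int(re.search(r"Sequence (\d+) BP", line).group(1))
    stated = {
        base.lower(): int(count)
        for count, base in re.findall(r"(\d+) ([ACGT]|other);", line)
    }
    return length, stated


def read_sequence(lines):
    """Join the sequence lines, dropping spaces and position numbers."""
    return re.sub(r"[^a-z]", "", "".join(lines).lower())


lines = SQ_BLOCK.splitlines()
stated_length, stated = parse_sq_header(lines[0])
sequence = read_sequence(lines[1:])

observed = Counter(sequence)
# Anything that is not a, c, g or t is counted as "other".
observed_other = len(sequence) - sum(observed[b] for b in "acgt")

print(len(sequence) == stated_length)                 # True
print(all(observed[b] == stated[b] for b in "acgt"))  # True
print(observed_other == stated["other"])              # True
```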