Data:

EU name: BIOROX

(Date: Aug 24 2016 BIOROX )

After completing this section you will:
Know the most important data types in bioinformatics: sequences (protein and DNA) and structures.
Know the 3 major data collections in bioinformatics: SwissProt, EMBL and PDB.
Know the 5 data elements all good databases minimally contain.

EU name: BIODAT

(From: ../EUDIR ) (Date: Jan 27 17:59 ../EUDI)

The three major databases

In this course we will mainly use data from three databases. Be aware, though, that there are thousands of databases available to you! The three databases we will most often look at are:

Figure 10. SwissProt is a well-curated database of protein sequences.

Figure 11. PDB is more a databank than a database. The PDB was started by people at the Brookhaven National Laboratory. Nowadays the PDB is kept at Rutgers, with mirror systems at the EBI and in Japan. It holds macromolecular structures solved mainly by X-ray or NMR. These are mainly protein structures, but also DNA, RNA, and all kinds of complexes.

Figure 12. EMBL is not only the name of a research institute, but it is also the name of the international depository for nucleic acid sequences, the EMBL database.

Some other databases

Although we will mainly use SwissProt, the PDB, and the EMBL database, we will also briefly use OMIM and Prosite, and you need to know that UniProt exists. These databases will be discussed later in the course and are mentioned here just for the sake of completeness.

The OMIM databank is the brainchild of Victor A. McKusick. This databank holds descriptions of phenotypes for a whole series of disease-causing SNPs / mutations in the human genome. OMIM stands for Online Mendelian Inheritance in Man.

The Prosite database holds information about sequence patterns that indicate potential post-translational modification sites, cleavage sites, active sites, etc.

Figure 13. UniProt is a much larger protein sequence database that one normally should use if SwissProt doesn't hold what you are looking for. During this course SwissProt alone will always be enough to answer the questions.

Discussion of the data

All good databases should, in principle, contain the following five data elements:

Figure 14. Unique identifier, or accession code
The unique identifier, or accession code, is needed to track data through the years. When the PDB started, all entries were kept on punched cards, and much of the annotation was in a filing cabinet. One had to call the PDB to ask for a photocopy of the additional information. Because the PDB uses the four-letter code as unique identifier (in the PDB the file name and the unique identifier are the same), we can still retrieve these very old data and all the information around them.

Figure 15. Name of depositor
The name of the depositor is needed to give proper credit, but also to know whom to contact in case of problems or questions.

Figure 16. Literature references
Literature references help track down the nitty-gritty data that normally is only available in the materials and methods section of the article.

Figure 17. Deposition date
The deposition date is useful for patent claims, but also provides some help to the users. E.g., the first refinement program became available in the early eighties. So, coordinates deposited before 1980 are surely unrefined, and that is important knowledge.

Figure 18. The real data
It seems trivial that database entries contain data. But it isn't. There is a conflict between industrial secrets and patents on the one side and the dissemination of science on the other. In the past it was permissible to publish articles about a sequence or structure without depositing the data in a database. It was also possible to deposit an entry that contained no data. This practice is no longer allowed.
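
The five data elements can be pictured as the fields of one record. The sketch below is only an illustration: the class name, the field names and the example values are invented here and do not correspond to the layout of SwissProt, EMBL or the PDB. It shows how one could check whether an entry lacks one of the five elements, for instance an entry that was deposited without the real data.

```python
from dataclasses import dataclass, fields
from datetime import date
from typing import List


@dataclass
class DatabaseEntry:
    """One record with the five minimal data elements (hypothetical field names)."""
    accession: str          # unique identifier, or accession code
    depositor: str          # name of depositor
    references: List[str]   # literature references
    deposition_date: date   # deposition date
    data: str               # the real data, e.g. a sequence or coordinates


def missing_elements(entry: DatabaseEntry) -> List[str]:
    """Return the names of the elements that are empty in this entry."""
    return [f.name for f in fields(entry) if not getattr(entry, f.name)]


# Invented example values, not taken from any real database.
complete = DatabaseEntry(
    "EX0001", "A. Depositor", ["Example et al. (2000)"], date(2000, 1, 1), "MKTAYIAK"
)
empty_data = DatabaseEntry(
    "EX0002", "A. Depositor", ["Example et al. (2000)"], date(2000, 1, 1), ""
)

print(missing_elements(complete))    # []
print(missing_elements(empty_data))  # ['data']
```

A check of this kind only tells you that an element is filled in, not that its content is correct.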

EU name: BIODQ1

Question 1:
1) Which three databases are being used in this course and what kind of data do they contain?
2) Find out for each of the three databases if the 5 'essential' data elements are really present.
3) Look at the GPCRDB and SwissProt. Both provide biological data to the user. What do these systems have in common and what are the major differences?
(Hint 1: Perhaps you can first answer the question which of the two systems is called an information system and which is called a database? Hint 2: Think about the data-types they provide, and think about the completeness of the stored data per data-type. Hint 3: Think of the types of questions that the systems might help answer.)

Answer

Question 2: Later we will teach you how to use MRS to find the so-called flat-file version of the EMBL entry for human lysozyme with accession code X14008. For now, use this local copy, which was stored in August 2012 at X14008.docx.
At the left-hand side you see many two-letter codes. These are the so-called keys. Try to find out what all the important keys mean (so complete at least the table below):

Two-letter key       Write here in a few words which information this key points at
ID
AC
DT
DE
KW
R*
DR
FT
SQ

The question of what the XX record is good for is both very simple and very complicated at the same time. If there is time left, give it a shot...
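
If you prefer to explore the keys with a small program, the sketch below groups the lines of one flat-file entry by the two-letter key at the left-hand side. It rests on assumptions about the layout: every record line starts with its key, indented lines continue the previous record, and a line starting with `//` ends the entry. The function name and the sample entry are invented for illustration; the sample is not the X14008 entry, and the code deliberately does not say what any key means.

```python
from collections import defaultdict
from typing import Dict, List, Optional


def group_by_key(flat_file_text: str) -> Dict[str, List[str]]:
    """Group the lines of one flat-file entry by their two-letter key."""
    records: Dict[str, List[str]] = defaultdict(list)
    current_key: Optional[str] = None
    for line in flat_file_text.splitlines():
        if line.startswith("//"):       # assumed end-of-entry marker
            break
        if not line.strip():            # skip completely empty lines
            continue
        if line[:2].strip() == "":      # indented line: belongs to the previous key
            if current_key is not None:
                records[current_key].append(line.strip())
            continue
        current_key = line[:2]
        records[current_key].append(line[2:].strip())
    return dict(records)


# Invented sample entry with placeholder content.
SAMPLE = """\
ID   EXAMPLE1; placeholder
XX
AC   EX000001;
XX
DE   Invented description, line one
DE   and line two
XX
SQ   Sequence 8 BP;
     acgtacgt         8
//
"""

result = group_by_key(SAMPLE)
for key in sorted(result):
    print(key, len(result[key]))
```

Running this on your local copy of the entry gives a quick overview of which keys occur and how many lines each one has, which is a useful starting point for filling in the table.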

Answer