NewProt: A European protein engineering project

WP 2: Databases and query system


To collect all databases relevant for protein engineering, to correct and reformat these databases for optimal use, and to extract from them the information relevant for protein engineering purposes.


Very few publicly available databases are specifically designed for protein engineering purposes. A scan through the list of databases in the Nucleic Acids Research special volume on databases revealed only one database that explicitly mentions protein engineering as its purpose. However, this same volume lists many databases that potentially hold great value for protein engineering. Many mutation databases, for example, are mentioned, and although none of these were designed for protein engineering purposes, most of them contain information that can be put to use in protein engineering. The variation information in Ensembl, for example, has great value for protein engineering, even though it is hidden in an enormous volume of 'other' information. Mutations in kinases, for example, that are known to cause a disease in man or mouse can very well be attempted in an in vitro mutation study; for one thing, we already know that such a mutation will be accepted by the protein, otherwise it would never have been detected in man or mouse in the first place.

Task 1: The following two (types of) primary databases will be incorporated and 'processed':

The PDB holds all publicly known protein structures. It is one of the two most important databases for protein engineering. Unfortunately, for a series of reasons PDB files are not optimal for use in a protein engineering environment. It has been shown that the PDB holds millions of small and thousands of large errors, many of which can be corrected. We will provide a curated copy of the PDB from which as many errors as possible have been removed (by re-refining both the X-ray and the NMR files in other, already funded, international projects). Missing side chains are not an error when they cannot be seen in the X-ray density; still, it is better for many applications if the missing side chains are modelled in (e.g. electrostatic calculations will work considerably better when a charged group is somewhat misplaced than when that charged group is totally missing). We will therefore provide a completed copy of each PDB file. The PDB operations will be based on EBI's PISA system, which provides natural multimers rather than crystallographic multimers. From each NMR ensemble one model will be selected that is technically correct and complete, and that is most representative of the ensemble.

The SwissProt/UniProt protein sequence databases are the other of the two most important databases for protein engineering. These databases are already available at the CMBI, including easy-to-use interfaces, and these facilities will be fully integrated in the SSP.
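The selection of a representative NMR model described for the PDB above can be illustrated with a small sketch. The project text does not say how "most representative" is measured; the code below assumes one common choice, the medoid, i.e. the model with the lowest mean RMSD to all other models. It also assumes that the models have already been superposed and filtered for correctness and completeness. The function names `rmsd` and `most_representative` are hypothetical and not part of any NewProt software.

```python
import math


def rmsd(a, b):
    """Root-mean-square deviation between two equally long lists of (x, y, z)."""
    if len(a) != len(b) or not a:
        raise ValueError("models must have the same, non-zero number of atoms")
    total = sum(
        (x1 - x2) ** 2 + (y1 - y2) ** 2 + (z1 - z2) ** 2
        for (x1, y1, z1), (x2, y2, z2) in zip(a, b)
    )
    return math.sqrt(total / len(a))


def most_representative(models):
    """Return the index of the medoid model (assumed criterion, see lead-in).

    `models` is a list of pre-superposed coordinate lists, one per NMR model.
    """
    if not models:
        raise ValueError("empty ensemble")

    def mean_rmsd(i):
        # Mean distance from model i to every other model in the ensemble.
        others = [rmsd(models[i], m) for j, m in enumerate(models) if j != i]
        return sum(others) / len(others) if others else 0.0

    return min(range(len(models)), key=mean_rmsd)
```

For a toy ensemble of three one-atom models at x = 0, 1 and 5, the middle model is returned, because it has the lowest mean distance to the other two.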

Task 2: Derived databases

Sequence-oriented databases like SwissProt, UniProt, or Ensembl, but also ProTherm and other specialized systems, all hold large amounts of variation data. Natural variants, for example, are useful information to have available because if a natural variant exists, no computations are needed to know for sure that the mutation will be accepted by the protein. Partner CMBI will use its HOPE software to extract the variability data from these sources and to make it available to the SSP users.

The HSSP database, which holds for every PDB file a multiple sequence alignment against UniProt, will be newly produced using new and improved software that will be validated by partner BIOP. This database will be the responsibility of WP4.
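The use of natural variants described under Task 2 amounts to a lookup: if a proposed mutation already occurs as a natural variant, it is known to be accepted by the protein. The sketch below shows that idea only; it is not the HOPE software or its interface. The mutation notation (`A123T`), the function names, and the example variant set are all assumptions made for illustration, and the variants listed are invented, not real data.

```python
import re

# One-letter wild-type residue, position, one-letter new residue, e.g. "A123T".
MUTATION = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def parse_mutation(text):
    """Parse a point mutation such as 'A123T' into (wild_type, position, new)."""
    match = MUTATION.match(text.strip().upper())
    if match is None:
        raise ValueError(f"not a point mutation: {text!r}")
    wild_type, position, new = match.groups()
    return wild_type, int(position), new


def is_known_variant(mutation, known_variants):
    """True if the proposed mutation is in the set of known natural variants."""
    return parse_mutation(mutation) in known_variants


# Hypothetical variant set for one protein; invented values, not real data.
known = {parse_mutation(v) for v in ("A123T", "G45D")}
```

With this invented set, `is_known_variant("G45D", known)` is true, while `is_known_variant("G45E", known)` is false and the mutation would still need to be assessed by other means.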

Task 3: Other database systems

The MRS software suite will be fully integrated in the SSP (using its SOAP interface) so that it can be used for queries in the primary databases (PDB, PDB_REDO and other PDB variants, SwissProt, UniProt), in the secondary databases (DSSP, HSSP, PDBFINDER, etc.), and in the derived databases such as the lists of mutations and variants that will be collected from many sources. The CMBI PDB query system (still based on SQL but to be converted to RDF) will be made available for advanced structure queries.

All databases will be made accessible through SOAP-based Web services that will allow users to perform queries and retrieval operations. The EDAM ontology will be used for the semantic and syntactic annotation of these Web services.

  The NewProt project is funded by the European Commission within its FP7 Programme, under the thematic area KBBE-2011-5 with contract number 289350.