The following workpackages (with lead participant indicated) describe all work that will be done in the NewProt project.

WP #   Workpackage title                        Lead
1      Protein Engineering Portal               FLUID
2      Databases and query systems              CMBI
3      Software collection and integration      CMBI
4      Sequence analysis and visualisation      BIOP
5      Structure analysis and visualisation     YAS
6      Experimental validation                  EMAUG
7      Project management                       CMBI

WP 1: Design and implementation of interactive protein engineering website, SSP


The main objective of WP1 is to create a portal, called SSP (Self-Service protein engineering Portal), that will support all of NewProt's software, databases, interactive operations, dissemination, and management activities. The SSP will combine aspects of a classical portal with those of an interactive workbench that enables simple access to all NewProt resources through a single, homogeneous interface. The SSP will additionally hold a series of tabs that will facilitate direct access to the databases, software, query systems, protocols, experimental results, course material, and management tools.

Description of work


WP1 can be logically divided into two main tasks: 1) Design and implementation of the SSP; and 2) Validation of the technical aspects of the SSP. These two tasks will have their focal points in year 1 and year 2 of the NewProt project, respectively. Population and usage of the portal will be the tasks of WP2-5 and WP6, respectively.

Task 1. Design and implementation of the SSP framework.

The SSP will be designed based on the open source Information Workbench from fluidOps. The SSP will provide a working environment to easily interact with all NewProt resources produced. To this end, the SSP will hold tabs for database access (for WP2), software access (for WP3), system facilities (svn access, ftp access, virtual machine, etc), dissemination (e.g. course material, software documentation), help facilities, and project management facilities. Users will be able to interactively run NewProt software within the SSP workbench. Software and databases will be fully interoperable, i.e., users will not need to store intermediate results and will not need to worry about file formats etc. This interoperability requires that all data types are syntactically and semantically described (WP2) and that all software that can operate on those data 'knows' about the syntactic and semantic annotations (WP3-5). The use of common standards (RDF, Linked Data, as well as relevant domain ontologies such as EDAM, OBO and the Gene Ontology (GO)) will ensure the semantic interoperability of data and software. Much of the data and software the partners intend to incorporate in the SSP already use the SOAP protocol for interoperability and adhere to these common standards and ontologies.
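The interoperability requirement described above can be illustrated with a minimal sketch. It assumes that every data item and every tool carries a (data type, format) annotation. The labels used here are hypothetical placeholders rather than actual ontology identifiers, and the class and function names are invented for illustration only.

```python
from dataclasses import dataclass

# Hypothetical annotation: real annotations would use ontology terms (e.g. from EDAM).
@dataclass(frozen=True)
class Annotation:
    data_type: str    # semantic description, e.g. "protein_structure"
    data_format: str  # syntactic description, e.g. "pdb"

@dataclass(frozen=True)
class Tool:
    name: str
    accepts: frozenset    # annotations the tool can read
    produces: Annotation  # annotation of the tool's output

def can_chain(producer: Tool, consumer: Tool) -> bool:
    """True if the consumer can read the producer's output without user conversion."""
    return producer.produces in consumer.accepts

structure = Annotation("protein_structure", "pdb")
alignment = Annotation("sequence_alignment", "fasta")

# Two illustrative tools (names are placeholders, not NewProt software).
modeller = Tool("modeller", frozenset({alignment}), structure)
validator = Tool("validator", frozenset({structure}),
                 Annotation("validation_report", "xml"))
```

With such annotations in place, a workbench could decide automatically which tools may be offered as the next step for a given result, so that users never have to handle file formats themselves.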

The SSP will be provided as a hosted portal (hosted at the CMBI), but it will be designed such that all partners can instantiate private copies (e.g. for in-house use) if desired. Two mechanisms will be used for this purpose. First, all resources (except third party databases) will be kept in the svn version control system, and second, the whole database and software suite will be made available as a Virtual Machine image that can be instantiated on private infrastructures (e.g. using VMware) or, if needed, on public clouds (e.g. Amazon EC2). Partner FLUID will design the SSP keeping in mind that they will be able to include their template based provisioning tools when, in due time, they deliver technological support to SSP-using industries that request such additions. The SSP images will be equipped with an update mechanism that will allow the partners to frequently obtain the current versions of the entire suite while retaining the state of the user data. Partners FLUID and CMBI will work together in the first weeks of the project to make a functional design for the SSP. Partners YAS, BIOP, ENTIS, and SAFAN, and especially SAC member T. Schwede (who has long-standing experience in running portals), will be consulted on this task.

Task 2. Validation of the interoperability of the SSP with other software.

One of the goals of the portal is that third parties (i.e. bioinformatics SMEs) can easily use the central portal to enhance the quality, and thus the value, of their web-based products. The partners CMBI, BIOP, YAS, and ENTIS will each validate a different aspect of the interoperability of the SSP with their in-house software. These partners have software products that can be of interest to academic and industrial researchers in protein engineering and drug design. Together, they are paradigmatic for anything a bioinformatics software specialist in academia or industry might want to do with the NewProt products.

Partner CMBI will add a large number of software and database facilities. This process will mainly take place between months 7 and 24 (after which more time will be spent on improvements steered by the validation experiments). However, CMBI will keep adding new products throughout the whole period, and will thus continuously validate the ease of upgrading the SSP. Partner BIOP produces molecular class specific information systems that hold much information about a class of molecules. This information is collected, validated, annotated, and computationally enriched. The curated systems, which are called 3DM systems, are presented to the user as a classical portal system. BIOP does not distribute any data or software; instead, BIOP's customers obtain access to their 3DMs on the BIOP computer systems. BIOP will collaborate with FLUID to achieve a full integration of their molecular class specific 3DMs with the fully generic SSP. BIOP will obtain an in-house copy of the SSP as a virtual machine and will, based on advice from partner FLUID, make the SSP and their in-house system fully interoperable. BIOP will write a detailed report about this integration process. This report will be detailed enough to function as a recipe for other, non-partner SMEs to produce similar, bi-directional interactivity with the SSP. BIOP will validate that the SSP can be downloaded and used as a virtual machine and that it can easily be made fully interoperable with their in-house molecular class specific information systems. This will require full two-way communication between the systems.

Whereas partner BIOP aims for fully integrated two-way interoperability with the SSP, YAS will validate that the results from the CMBI-hosted SSP can easily be transferred to, and used in, its YASARA View software (which implicitly also means that they can be used in its commercial YASARA software). Details of the YASARA View - SSP interoperability are discussed more extensively in WP5.

ENTIS' HotSpot Wizard is a software tool that automatically identifies the functional residues for engineering catalytic properties of enzymes and for estimating their mutability. For this purpose, HotSpot Wizard integrates several bioinformatics databases (RCSB PDB, UniProt, PDBSWS, Catalytic Site Atlas and nr NCBI) and computational tools (CASTp, CAVER, BLAST, CD-HIT, MUSCLE and Rate4Site). Structural analyses are conducted to identify the residues that potentially come into contact with the substrates or products. The mutability of individual amino acid residues is derived from their conservation level. (HotSpot Wizard: a Web Server for Identification of Hot Spots in Protein Engineering. Pavelka A, Chovancova E, Damborsky J. 2009 Nucl. Acids Res. 37 W376-W383). Partner ENTIS will validate that the SSP can be used to enhance their in-house bioinformatics products, but without maintaining a full in-house SSP copy.

So, in summary, partner BIOP will validate full two-way, in-house integration of its (SME) portal with the SSP; partner ENTIS will validate that it can obtain information from the hosted SSP to enrich its (SME) products; partner YAS will validate that SSP users can directly use its (SME) products; and partner CMBI will validate that it is easy to add its (academic) products to the SSP.

WP 2: Databases and query system


To collect all databases relevant for protein engineering, correct and reformat these databases for optimal use, and to extract from those databases the information relevant for protein engineering purposes.

Description of work


Very few databases are publicly available that are specifically designed for protein engineering purposes. A scan through the list of databases in the Nucleic Acids Research journal special volume on databases revealed only one database that explicitly mentions its protein engineering purposes. However, this same volume lists many databases that potentially hold great value for protein engineering purposes. Many mutation databases, for example, are mentioned, and although none of these were designed for protein engineering purposes, most of them contain information that can be put to use in protein engineering. The variation information in Ensembl, for example, has great value for protein engineering, despite being hidden in an enormous volume of 'other' information. Mutations in kinases, for example, that are known to cause a disease in man or mouse can of course very well be attempted in an in vitro mutation study; for one thing, we already know that such a mutation will be accepted, otherwise it would never have been detected in man or mouse in the first place.

Task 1: The following two (types of) primary databases will be incorporated and 'processed':

The PDB holds all publicly known protein structures. It is one of the two most important databases for protein engineering. Unfortunately, for a series of reasons PDB files are not optimal for use in a protein engineering environment. It has been shown that the PDB holds millions of small and thousands of large errors, many of which can be corrected. We will provide a curated copy of the PDB that has as many errors removed as possible (by re-refining both the X-ray and the NMR files in other, already funded, international projects). Missing side chains are not an error when they cannot be seen in the X-ray density, but for many applications it is still better if the missing side chains are modelled in (e.g. electrostatic calculations will work considerably better when a charged group is somewhat misplaced than when that charged group is totally missing). We will therefore provide a completed copy of each PDB file. The PDB operations will be based on EBI's PISA system, which provides natural multimers rather than crystallographic multimers. From each NMR ensemble one model will be selected that is technically correct and complete, and that is most representative of the ensemble.

The SwissProt/UniProt protein sequence databases are the other important databases for protein engineering. These databases are already available at the CMBI including easy to use interfaces, and these facilities will be fully integrated in the SSP.
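The selection of a representative model from an NMR ensemble, mentioned above for the PDB, can be sketched as follows. The project text does not specify the criterion, so this example assumes one common choice: the model with the lowest mean RMSD to all other models (the medoid). It further assumes that the models are already superposed and that each model is a list of (x, y, z) tuples for the same atoms in the same order. The check for technical correctness and completeness is not covered.

```python
import math

def rmsd(a, b):
    """Root-mean-square deviation between two equally long coordinate lists.

    Assumes the two models are already superposed on each other.
    """
    total = sum((p - q) ** 2 for u, v in zip(a, b) for p, q in zip(u, v))
    return math.sqrt(total / len(a))

def most_representative(models):
    """Index of the model with the lowest mean RMSD to all other models.

    Requires at least two models.
    """
    def mean_rmsd(i):
        others = [rmsd(models[i], m) for j, m in enumerate(models) if j != i]
        return sum(others) / len(others)
    return min(range(len(models)), key=mean_rmsd)

# Toy ensemble of three single-atom models placed along the x axis.
ensemble = [[(0.0, 0.0, 0.0)], [(1.0, 0.0, 0.0)], [(5.0, 0.0, 0.0)]]
best = most_representative(ensemble)
```

In the toy ensemble the middle model is selected because it lies closest, on average, to the other two.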

Task 2: Derived databases

Sequence oriented databases like SwissProt, UniProt, or Ensembl, but also ProTherm and other specialized systems, all hold large amounts of variation data. Natural variants, for example, are useful information to have available because if a natural variant exists, no computations are needed to know for sure that the mutation will be accepted by the protein. Partner CMBI will use its HOPE software to extract the variability data from these sources and to make it available to the SSP users.

The HSSP database, which holds for every PDB file a multiple sequence alignment against UniProt, will be newly produced using new and improved software that will be validated by partner BIOP. This database will be the responsibility of WP4.

Task 3: Other database systems

The MRS software suite will be fully integrated in the SSP (using its SOAP interface) so that it can be used for queries in the primary databases: PDB (and PDB_REDO and other PDB variants), SwissProt, UniProt; in the secondary databases (DSSP, HSSP, PDBFINDER, etc), and in the derived databases such as the lists of mutations and variants that will be collected from many sources. The CMBI PDB query system (still based on SQL but to be converted to RDF) will be made available for advanced structure queries.

All databases will be made accessible through SOAP based Web services that will allow users to perform Web service based queries and retrieval operations. The EDAM ontology will be used for the semantic and syntactic annotation of these Web services.

WP 3: Software collection and integration


Software relevant for protein engineering will be installed and made interoperable. This will mainly involve CMBI products, several open source packages, and YASARA View.

Description of work


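The software efforts will consist of five main tasks and a series of small activities. The main tasks will be: 1) making WHAT IF options interactively available; 2) putting the HOPE molecule-specific data collection software on the SSP; 3) making a connection to the PMP homology modelling portal; 4) making YASARA View available and usable for SSP users; and 5) integrating MRS in the SSP.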

In principle the responsible researchers can work on these five tasks in parallel, albeit that the actual integration in the portal can start experimentally around month 6 and in production mode in month 13. All tasks include ensuring that the software can communicate through the SOAP protocol using XML that complies with the commonly agreed-on ontologies and standards. Partner CMBI is involved in the SeqAhead COST action that brings together a large consortium of European bioinformaticians that will coordinate these activities for sequence related software and databases. The SeqAhead recommendations will be followed in the NewProt project. The EDAM ontology will be adopted for protein structure data and for computational methods.

Task 1: Make WHAT IF options interactively available

The WHAT IF software has for many years been the de facto standard in rational protein engineering research, either directly, or indirectly as an integral part of, or supporting tool for, other software such as FoldX. WHAT IF has been kept up-to-date for the past decades, and recently it has been fully incorporated in the YASARA modelling and visualisation software. WHAT IF was designed with a state-of-the-art 1987 user interface that today's scientists consider hard to learn. The WHAT IF options relevant for protein engineering will be made more accessible, for example, by collecting them in a YASARA protein engineering menu, by building Web servers around them, or by making them available as Web services. Many WHAT IF options will be made callable from the HOPE software (see below). A large series of WHAT IF options will be made available through the main interactive workbench of the SSP. These will for example include a series of WHAT_CHECK structure validation options, crystal packing options, and mutability prediction options. Many of these scientifically complicated options are the result of previous large, collaborative projects; WHAT_CHECK, for example, was the result of a Fifth Framework EU project. The past investments in these scientific options add up to tens of person years of work. A pilot project performed in the framework of the Sixth Framework EMBRACE NoE showed that making WHAT IF options fully interoperable with other software will be doable in reasonable time.

Task 2: Put the HOPE molecule-specific data collection software on the SSP

The HOPE software is a system designed to be used by medical researchers to get a molecular explanation for the observed phenotypic effects of a mutation in the human genome. The HOPE software was designed to explain one single mutation at a time, and is too simple for protein engineering purposes. HOPE's underlying software that collects all kinds of data for each residue in a protein, however, can be recycled to make for each protein of interest a simple spreadsheet with massive amounts of elementary data for each residue; e.g. HSSP variability, accessibility, rotameric freedom, crystal contacts, DNA/RNA contacts, ion contacts, active site location, secondary structure, known variants, known variants in homologs, underlying codons, codon conservation, location relative to splice sites, etcetera. All this data is obtained using Web service calls to WHAT IF and SwissProt/UniProt, and using DAS servers that were produced in the Sixth Framework BioSapiens project. The HOPE database will be fully integrated in the interactive workbench of the SSP.

The HOPE software contains a decision tree module that employs a simple form of artificial intelligence to analyse the possible phenotypic effect of point mutations that have been found related to genetic disorders. An attempt will be made to convert this decision tree module in HOPE to allow it to function as a supervisor system that analyses the mutations that the SSP user, after using all other SSP facilities, finally decides to make. This approach, obviously, will have many limitations in terms of experiments for which it will be applicable, but it will certainly be useful for predicted point mutations. Further applicability needs to be studied.

Task 3: Make a connection to the PMP homology modelling portal

Homology modelling is a key process in most protein engineering projects. We will collaborate with the Protein Model Portal (PMP) group at the Biozentrum in Basel that is headed by T. Schwede. T. Schwede will also be a member of the NewProt advisory board to optimize this portal-portal collaboration (and to make NewProt benefit from his extensive experience in protein modelling, portals, and user interactions). The PMP will be used to obtain homology models for the NewProt users. Obviously, NewProt users can go directly to the PMP, but by taking the route through the NewProt portal a) the modeller is anonymized; b) all administrative matters, such as storage of the model at the NewProt portal, will be dealt with automatically; and c) the user does not need to worry about any of this.

The homology modelling procedure will include the possibility to run energy minimisations and (very short) molecular dynamics simulations with the GROMACS software. All scripts and files necessary to continue simulations for longer CPU times on in-house computers will be made available to the users. The GROMACS interface will be based on the WHAG software that is the result of a long-standing collaboration between the CMBI and the Biozentrum (joint article in preparation).

Task 4: Make YASARA View available and usable for SSP users (with WP5)

YASARA View scenes will be produced that allow users to map most data on the structure for visual inspection. YASARA View can be obtained freely, and the SSP will hold a link to the download site. YASARA View will be used as the visualisation engine for nearly all SSP results that are, or map on, 3D structures. The WHAT IF options whose results are suitable for 3D visualisation and cannot be obtained with YASARA View itself will need to produce output that YASARA View can read, understand, and convert into visual effects. The operations needed to achieve this interaction must remain totally hidden from the users.

Task 5: Integrate MRS in the SSP (with WP2)

The MRS data collection and database search engine will be used for all keyword driven database searches. This database query system is simple to use, easy to integrate in other software, very fast and flexible. The CMBI MRS search engine serves thousands of queries per day in almost 30 databases, a load that can easily be handled by a simple PC. MRS was designed with applications such as integration in systems like an SSP in mind, and all required interoperability facilities are already in place in the MRS software. MRS includes its own internal (re-engineered) version of the well-known BLAST database query software. A facility will be added to MRS' BLAST to flexibly limit the search to proteins from, for example, thermophilic species.
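The idea of limiting search results to a group of species, as described above, can be sketched in a few lines. This is not the MRS interface: the hit records, their field names, and the short species list are assumptions made for illustration, and an actual implementation might restrict the searched sequence set itself rather than filter the hits afterwards.

```python
# Small illustrative set of thermophilic organisms (not an exhaustive list).
THERMOPHILES = {"Thermus thermophilus", "Thermotoga maritima", "Pyrococcus furiosus"}

def filter_hits(hits, allowed_species):
    """Keep only hits whose source organism is in allowed_species, preserving order."""
    return [hit for hit in hits if hit["organism"] in allowed_species]

# Hypothetical hit records; identifiers and scores are made up for the example.
hits = [
    {"id": "hit1", "organism": "Escherichia coli", "score": 310},
    {"id": "hit2", "organism": "Thermus thermophilus", "score": 295},
    {"id": "hit3", "organism": "Pyrococcus furiosus", "score": 120},
]
thermophilic_hits = filter_hits(hits, THERMOPHILES)
```

The same function works for any other species selection by passing a different set.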

Validation and documentation

All tasks will be followed by validation (both software wise in WP1 to validate the technical aspects and by mutation experiments in WP6 to validate the scientific aspects of the products), and extensive documentation (explanation, help facility, course material).

Additional software activities

The SSP will also hold a series of other software packages. These are packages that occasionally might be of use to protein engineers, but will not be needed routinely. These will not be fully integrated in the SSP (unless there will be popular demand for such integration), but will be made interactively executable, and downloadable. The CMBI will be responsible for these installations. Examples are:

The WHAT_CHECK (very extensive) structure validation suite;

The BioMeta database and search engine that, in due time, will provide the most likely metabolite docked 'by homology' in PDB files that hold the structure of a substrate or product analog. BioMeta also provides sub-structure search facilities for the ligands found in the PDB; sub-structures can be sketched using the JME software.

WP 4: Sequence analysis and visualisation


To provide pre-calculated, annotated multiple sequence alignments (MSAs) for all proteins for which the 3D structure is available in the protein structure database, PDB. To provide, on-the-fly, high-quality multiple sequence alignments for all other sequences and structures not (yet) in the PDB. Additionally, software will be provided to visualize the MSAs and data derived from them in multiple, different ways.

Description of work

The HSSP database contains for each PDB file a multiple sequence alignment against the UniProt protein sequence database. HSSP has been updated weekly for the past 15 years, and despite the growth of both the PDB and UniProt, the weekly update regime will be maintainable for the foreseeable future, and will be continued throughout the duration of the NewProt project.

HSSP files hold valuable information because they hold sequence conservation and variability data. HSSP files also tend to produce better alignments than the often used combination of BLAST and CLUSTAL. The HSSP software will be rewritten to incorporate a series of novel insights obtained over the past decade (this will be done in collaboration with Chris Sander and Reinhard Schneider, the original HSSP authors). The quality (improvement) will be validated by comparing the HSSP alignments with a series of structure based MSAs produced by partner BIOP. (It is unfortunately still far too time consuming to produce structure based MSAs for all PDB files).

The HSSP derived variability information will be mapped on the structure coordinates and made available as a YASARA View scene (such scenes can be executed directly by the YASARA View version on the user's in-house computer). If only a few residue types make up the majority at a certain residue position (e.g. residue 17 is 72% Asp, 16% Glu and 2% something else), the predicted optimal rotamers for the majority residue types will be made visible, and correlations between such positions will be detected and made visible. Additionally, the MSAs will be made available in a series of commonly used formats to ease the in-house usage with other software packages by the SSP users.
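The detection of positions at which a few residue types make up the majority can be sketched as follows. The example assumes an MSA given as a list of equally long, gapped strings; the 85% threshold and the function names are arbitrary choices for illustration, not values or interfaces taken from the HSSP software.

```python
from collections import Counter

def column_composition(msa, column, gap="-"):
    """Fraction of each residue type at a 0-based alignment column, ignoring gaps."""
    residues = [seq[column] for seq in msa if seq[column] != gap]
    if not residues:
        return {}  # column consists of gaps only
    counts = Counter(residues)
    return {res: n / len(residues) for res, n in counts.items()}

def majority_residues(composition, threshold=0.85):
    """Most frequent residue types, added in order of decreasing frequency
    until their summed fraction reaches the threshold."""
    chosen, total = [], 0.0
    for res, frac in sorted(composition.items(), key=lambda kv: -kv[1]):
        chosen.append(res)
        total += frac
        if total >= threshold:
            break
    return chosen

# Toy one-column alignment: 7 x Asp, 2 x Glu, 1 x Lys, and one gap.
toy_msa = ["D"] * 7 + ["E"] * 2 + ["K"] + ["-"]
composition = column_composition(toy_msa, 0)
majority = majority_residues(composition)
```

For the toy column, Asp and Glu together exceed the threshold, so rotamers for these two residue types would be the ones to visualise at that position.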

WP 5: Structure analysis and visualisation


To provide an interface for the visualisation of protein structures, predicted and observed mutations, protein-protein and protein-ligand interactions, and other computational results produced by SSP software.

Description of work

This WP logically separates into two tasks. First, the incorporation of YASARA View functionality in the SSP, and second, the access to SSP functionality from within YASARA View.

Task 1

YASARA View is a freely available molecular modelling and visualization program that has been distributed by YASARA Biosciences GmbH since 2003. There are currently more than 20000 registered users, many of whom decided to support YASARA development by switching to one of the higher (commercial) stages with additional features (like molecular simulations). For the purpose of the SSP, the free YASARA View provides all the functions needed, and YASARA Biosciences GmbH has granted permission to include the software in any virtual machine distributed by the NewProt project.

The SSP will often need to visualize residues coloured by characteristics such as mutability, conservation, or involvement in protein-protein and protein-ligand contacts. While this could in principle be achieved with any molecular graphics software, the SSP will also require advanced visualisations that extend far beyond simple colouring and styling. These will include side-chain rotamers, cavities, surfaces, electrostatic potentials etcetera. In Task 1, we will develop open source YASARA scripts that create these visualizations on the fly, and output them either as ray-traced images (for display on the SSP website) or as annotated YASARA scene files. When the user clicks on such a hyperlinked YASARA scene, it will be opened in his/her local YASARA View to provide a fully immersive, photorealistic environment for detailed analysis.

Some of the results obtained from the various SSP software packages may have to be processed or converted before they can be visualized. This will hold mainly for WHAT IF results. Partner CMBI will, with advice from partner YAS, perform this task. The order in which WHAT IF calculations will be made YASARA scene friendly will be determined by the (experimental) partners who will use the WHAT IF facilities at the SSP.

Task 2

As mentioned in the introduction, the SSP will not only provide interactive access via a web browser, but also support automatic access via a SOAP interface. To demonstrate the use of this interface, partner YAS will develop an open source Python plug-in for YASARA View that integrates SSP functionality directly into the user interface. This permits users to query the SSP from within YASARA View while analyzing their protein structure of interest, without having to open a browser window in parallel. This principle has already been successfully applied to include a large number of external web services in YASARA View: structural alignments using CE, MultiProt, and SHEBA, mutant analysis with FoldX, the PDBFINDER-II database, visualization of conserved surface residues with ConSurf, and molecular class-specific information systems (MCSIS).
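As an illustration of the automatic access described above, the following sketch builds a SOAP 1.1 request using only the Python standard library. The SSP interface is not yet defined, so the service namespace, the operation name `GetResidueData`, and its parameters are hypothetical; only the SOAP envelope namespace is standard. The sketch stops at constructing the message and does not send it.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
SSP_NS = "http://example.org/ssp"  # hypothetical service namespace

def build_request(operation, params):
    """Return a SOAP 1.1 envelope (as bytes) that calls `operation`
    with simple parameters, each serialised as a child element."""
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    call = ET.SubElement(body, f"{{{SSP_NS}}}{operation}")
    for name, value in params.items():
        ET.SubElement(call, f"{{{SSP_NS}}}{name}").text = str(value)
    return ET.tostring(envelope, encoding="utf-8")

# Hypothetical operation and parameters, for illustration only.
request = build_request("GetResidueData", {"pdbId": "1crn", "residue": 17})
```

A plug-in would post such a message to the service endpoint and parse the response in the same way, which keeps the exchange independent of the client's own file formats.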

WP 6: Experimental validation


WP6 will be responsible for the experimental validation of the predictions made by the SSP software. Additionally, WP6 will reflect on the user-friendliness of the SSP.

Description of work


WP6 will experimentally validate mutation predictions made with the SSP software and with BIOP's 3DM systems. The primary aim of these experiments is validation of the in silico predictions to the extent that the software and protocols can be improved using the validation results. WP6 has as secondary aim the actual production of improved enzymes.

The partners in WP6 will aim at different aspects of experimental validation. Partner EMAUG will provide its extensive knowledge in protein engineering in general as support to all three experimental SMEs. ENZYM, INGEN, and LEAD will use the SSP to tailor-design the proteins that are of key interest to them. Hence, partner ENZYM will concentrate on transaminases and Baeyer-Villiger monooxygenases (BVMO); partner INGEN will focus on aminotransferases, amine oxidases, and carboxylic acid reductases; while partner LEAD will base its in-house ongoing drug-design related protein engineering efforts on SSP results and will provide feedback on the outcome of those experiments.

Validation planning

EMAUG will use the bioinformatics tools available from the SSP and from BIOP to further accelerate the development of enzymes for biocatalysis. This will be performed for enzymes from the α/β hydrolase fold super-family, where they aim for the inter-conversion of enzyme activities, e.g. converting an epoxide hydrolase into synthetically useful dehalogenases. Novel 3DM systems will be developed for different enzyme families to further expand the research in this area and to enable the identification of further useful transaminases, transferases, BVMO, oxidases, and reductases with respect to substrate range, enantioselectivity, and stability.

Thus, in close cooperation with partners 1, 3, and 5, targets for biocatalysis will be identified for the different enzyme classes. Bioinformatics information will be used by partners 2, 4, and 6 to design, create, and analyze mutant libraries. The best hits thus identified experimentally in the laboratory will be biochemically characterized to confirm the predictions and their applicability in biocatalysis. This will be performed in close collaboration between the academic partner 2 and the experimental SMEs to take advantage of the high-throughput screening facilities and the protein engineering knowledge at EMAUG.

Task 1

Enzymicals will use the protein engineering SSP to tailor-design its biocatalysts to address emerging issues. The primary focus will be on activity, selectivity and/or stability of transaminases and Baeyer-Villiger monooxygenases. Secondary targets are adjustments of the substrate spectra of representative catalysts from these two classes. Predicted mutations will be compared with results of other available tools, and selected positions will be transferred to rational or evolutionary protein design. The obtained experimental results will be used to validate and refine the prediction tools and to help understand the enzyme mechanisms. Improved variants will be used for the company's catalogue business and incorporated in the in-house biocatalytic toolbox for the production of fine-chemicals.

Task 2

Ingenza will use the software tools available from the SSP and from BIOP for two particular classes of experiments. First, the focus will be on the activity, selectivity, and thermostability of the aminotransferase and carboxylic acid reductase enzyme super-families. These experiments will mainly use the multiple sequence alignment based tools. BIOP 3DM systems will be used to generate the experimental designs. Partners 1, 3, and 6 will carefully analyse the quality of these predictions, and partner 3 will use this as input to a round of improvement of the software (and perhaps the science) they use for the production of 3DM systems. It takes partner BIOP days to weeks to produce one 3DM system. This is too big an effort in terms of CPU usage to be made freely available, so EMBL's HSSP system (nowadays maintained by partner CMBI) will be used on the SSP for the same multiple sequence alignment based purposes. HSSP alignments can be produced much faster than 3DM systems but will be less accurate. Partners 1 and 3 will carefully analyse the quality of HSSP based predictions and will try to improve the HSSP alignment system further, if possible, and provided that the improvements do not unreasonably increase the CPU time required. The expected limitations of the predictions will be clearly documented (in collaboration with partner 7).

Second, Ingenza will (need to) stabilize a few of its target enzymes. For this purpose they will use both the multiple sequence alignment based stability prediction techniques that are part of BIOP's 3DM system and the 'classical' protein structure and energy calculation based WHAT IF methods that will be incorporated in the SSP. The experimental results will be carefully analysed and these analyses will be made available to the users of the SSP. The experimental results will obviously also be used to generate ideas for better algorithms, protocols, or parameterisations for the prediction of stabilizing mutations.

Task 3

LeadPharma performs protein engineering experiments in many of its drug-design related projects. LeadPharma will not perform specially designed experiments for the purpose of SSP validation, but will rather use the SSP as a normal tool in its in-house experimental design, and will, in close consultation with partner 7 (SAFAN), report all its experiences back to the NewProt team. LeadPharma will be responsible for the validation of the SSP in a pharmaceutical (drug design) context.


Prediction and validation need to iterate continuously. WP6 will therefore continuously communicate with WP2-5. It is envisaged that most exchange of NewProt staff between the partners' labs will be related to this iteration process.