Problems in PDB files

PDB Errors

EU name: ECCODE

(Date: Aug 24 2016 ECCODE )

EC-codes

The PDB staff itself is probably also to blame for a series of errors. While looking for EC-codes we ran into the following problems (I won't give full references, as these are most likely not errors made by the depositors):

HEADER    TRANSFERASE                             05-OCT-05   2B7O
TITLE     THE STRUCTURE OF 3-DEOXY-D-ARABINO-HEPTULOSONATE 7-
TITLE    2 PHOSPHATE SYNTHASE FROM MYCOBACTERIUM TUBERCULOSIS
COMPND    MOL_ID: 1;
COMPND   2 MOLECULE: 3-DEOXY-D-ARABINO-HEPTULOSONATE 7-PHOSPHATE
COMPND   3 SYNTHASE AROG;
COMPND   4 CHAIN: A, B;
COMPND   5 SYNONYM: DAH7PS, DAHP SYNTHETASE, PHENYLALANINE-
COMPND   6 REPRESSIBLE;
COMPND   7 EC: EC 2.5.1.54;
COMPND   8 ENGINEERED: YES

The second EC in the COMPND 7 record (EC: EC 2.5.1.54;) is extra. If the parser is coded to say "the first word after EC: is the EC-code", you get two EC-codes: one being EC and one being 2.5.1.54;. As quite a few PDB files have chains with more than one EC-code, it seems likely that people will write disambiguation code, and then try to disambiguate EC from 2.5.1.54;...
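A minimal sketch of the naive "first word after EC:" approach described above, showing how the 2B7O record yields two tokens. The function name naive_ec_tokens is made up for this illustration; it is not taken from any existing parser.

```python
def naive_ec_tokens(value):
    """Split the text that follows 'EC:' on whitespace, as a naive parser might."""
    return value.split()


line = "COMPND   7 EC: EC 2.5.1.54;"
# Everything after the first 'EC:' tag is treated as the EC field.
value = line.split("EC:", 1)[1]
print(naive_ec_tokens(value))  # ['EC', '2.5.1.54;']
```

Both tokens look like candidate EC-codes to such a parser, and the real code still carries its trailing semicolon.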

So we write something that says: if the first word after EC: doesn't start with a digit, skip that word. And then we get:

HEADER    TOXIN                                   27-JUN-03   1PVJ
TITLE     CRYSTAL STRUCTURE OF THE STREPTOCOCCAL PYROGENIC EXOTOXIN B
TITLE    2 (SPEB)- INHIBITOR COMPLEX
COMPND    MOL_ID: 1;
COMPND   2 MOLECULE: PYROGENIC EXOTOXIN B;
COMPND   3 CHAIN: A, B, C, D;
COMPND   4 FRAGMENT: STREPTOCOCCAL PYROGENIC EXOTOXIN B (SPEB);
COMPND   5 EC: E.C.3.44.22.10

This requires that we eat the line character by character until we hit a digit, and assume that this is the beginning of the EC-code. There are still a few hundred permutations I can think of, but let's hope that the number of errors is so small that we can fix those manually.
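The character-by-character scan can be sketched as follows. This is only an illustration under the assumption that the EC field ends at the first semicolon or at the end of the line; the function name ec_from_first_digit is hypothetical.

```python
def ec_from_first_digit(value):
    """Return the text from the first digit up to the first ';' (or the line end).

    Returns None when the field contains no digit at all.
    """
    for i, ch in enumerate(value):
        if ch.isdigit():
            # Assume the EC-code starts here and stops at the first semicolon.
            return value[i:].split(";", 1)[0].strip()
    return None


print(ec_from_first_digit(" EC 2.5.1.54;"))    # 2.5.1.54
print(ec_from_first_digit(" E.C.3.44.22.10"))  # 3.44.22.10
```

The scan copes with both the extra EC word and the E.C. prefix, but it accepts whatever follows the first digit, so it does not check that the result is a well-formed EC-code.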

Supplemental material

While looking for EC-codes we found REMARK 900. This PDB-file remark contains pointers to similar PDB files, and we found a case where the EC-code for a similar protein was given. Hopeful, we wrote some code, only to find out that this information is provided in only very few cases. See the supplemental material below.

Supplemental material

By the way, in the same block of PDB annotation records we found several more syntactic inconsistencies. For example,

HEADER    TRANSFERASE/HYDROLASE                   01-JUN-01   1JB1
TITLE     LACTOBACILLUS CASEI HPRK/P BOUND TO PHOSPHATE
COMPND    MOL_ID: 1;
COMPND   2 MOLECULE: HPRK PROTEIN;
COMPND   3 CHAIN: A;
COMPND   4 FRAGMENT: C-TERMINUS (RESIDUES 128-319);
COMPND   5 EC: 2.7.1.-/3.1.3.-;

This entry has a slash between the EC-codes. That can seriously trip up computer programs...

Most EC-codes end with a semicolon (;), also when one chain has two EC-codes. But:

HEADER    HYDROLASE                               03-APR-07   2YS0
TITLE     SOLUTION STRUCTURE OF THE SOMATOMEDIN B DOMAIN OF HUMAN
TITLE    2 ECTONUCLEOTIDE PYROPHOSPHATASE/PHOSPHODIESTERASE FAMILY
TITLE    3 MEMBER
COMPND    MOL_ID: 1;
COMPND   2 MOLECULE: ECTONUCLEOTIDE PYROPHOSPHATASE/PHOSPHODIESTERASE
COMPND   3 FAMILY MEMBER 1;
COMPND   4 CHAIN: A;
COMPND   5 FRAGMENT: SOMATOMEDIN_B;
COMPND   6 SYNONYM: 'ECTONUCLEOTIDE PYROPHOSPHATASE/PHOSPHODIESTERASE
COMPND   7 1, E-NPP 1, PHOSPHODIESTERASE I/NUCLEOTIDE PYROPHOSPHATASE
COMPND   8 1, PLASMA-CELL MEMBRANE GLYCOPROTEIN PC-1;
COMPND   9 EC: 3.1.4.1, 3.6.1.9;
                      ^

uses a comma (,).
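One way to tolerate the separators seen so far (semicolon, comma, slash and blanks) is to split on all of them at once. The sketch below assumes these are the only separators in use, which the examples above do not guarantee; the name split_ec_field is hypothetical.

```python
import re


def split_ec_field(value):
    """Split an EC field on ';', ',', '/' and whitespace, dropping empty pieces."""
    return [part for part in re.split(r"[;,/\s]+", value) if part]


print(split_ec_field(" 2.7.1.-/3.1.3.-;"))   # ['2.7.1.-', '3.1.3.-']
print(split_ec_field(" 3.1.4.1, 3.6.1.9;"))  # ['3.1.4.1', '3.6.1.9']
```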

In 2R2D:

JRNL        AUTH   D.LIU,P.W.THOMAS,J.MOMB,Q.HOANG,G.A.PETSKO,D.RINGE,
JRNL        AUTH 2 W.FAST
JRNL        TITL   STRUCTURE AND SPECIFICITY OF A QUORUM-QUENCHING
JRNL        TITL 2 LACTONASE (AIIB) FROM AGROBACTERIUM TUMEFACIENS
JRNL        REF    BIOCHEMISTRY                               2007

we find as EC-code:

COMPND    MOL_ID: 1;
COMPND   2 MOLECULE: ZN-DEPENDENT HYDROLASES;
COMPND   3 CHAIN: A, B, C, D, E, F;
COMPND   4 SYNONYM: AGR_PTI_140P;
COMPND   5 EC: 3.1.1.B1;
COMPND   6 ENGINEERED: YES

And 3.1.1.B1, with a letter in its last field, is also not software-friendly syntax.
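A parser can at least flag such cases for manual inspection. The check below is a sketch that assumes a well-formed EC-code has four dot-separated fields, the first numeric and the remaining three either numeric or a dash (as in 2.7.1.-); the names EC_PATTERN and is_well_formed_ec are hypothetical.

```python
import re

# Assumed shape: number, then three fields that are each a number or '-'.
EC_PATTERN = re.compile(r"^\d+(\.(\d+|-)){3}$")


def is_well_formed_ec(code):
    """Return True if the string matches the assumed four-field EC syntax."""
    return EC_PATTERN.match(code) is not None


for code in ["2.5.1.54", "2.7.1.-", "3.1.1.B1", "EC"]:
    print(code, is_well_formed_ec(code))
```

This is a purely syntactic test: a code such as 3.44.22.10 passes it even if it does not correspond to an existing enzyme class.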