The PDB staff itself is probably also to blame for a series of errors. While looking for EC-codes we ran into the following problems (I won't give full references, as these errors were most likely not made by the depositors):
HEADER TRANSFERASE 05-OCT-05 2B7O
TITLE THE STRUCTURE OF 3-DEOXY-D-ARABINO-HEPTULOSONATE 7-
TITLE 2 PHOSPHATE SYNTHASE FROM MYCOBACTERIUM TUBERCULOSIS
COMPND MOL_ID: 1;
COMPND 2 MOLECULE: 3-DEOXY-D-ARABINO-HEPTULOSONATE 7-PHOSPHATE
COMPND 3 SYNTHASE AROG;
COMPND 4 CHAIN: A, B;
COMPND 5 SYNONYM: DAH7PS, DAHP SYNTHETASE, PHENYLALANINE-
COMPND 6 REPRESSIBLE;
COMPND 7 EC: EC 2.5.1.54;
COMPND 8 ENGINEERED: YES
That second EC, right after EC:, is superfluous. If the parser is coded to say "the first word after EC: is the EC-code", you get two EC codes: one being EC and one being 2.5.1.54;. As quite a few PDB files have chains with more than one EC code, it seems likely that people will write disambiguation code and try to disambiguate EC from 2.5.1.54;...
So we write something that says: if the first word after EC: doesn't start with a digit, skip that word. And then we get:
HEADER TOXIN 27-JUN-03 1PVJ
TITLE CRYSTAL STRUCTURE OF THE STREPTOCOCCAL PYROGENIC EXOTOXIN B
TITLE 2 (SPEB)- INHIBITOR COMPLEX
COMPND MOL_ID: 1;
COMPND 2 MOLECULE: PYROGENIC EXOTOXIN B;
COMPND 3 CHAIN: A, B, C, D;
COMPND 4 FRAGMENT: STREPTOCOCCAL PYROGENIC EXOTOXIN B (SPEB);
COMPND 5 EC: E.C.3.44.22.10
This requires that we eat the line character by character until we hit a digit, and assume that this digit is the beginning of the EC-code. There are still a few hundred permutations I can think of, but let's hope that the number of errors is so small that we can fix those manually.
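The "eat characters until the first digit" rule can be sketched in a few lines of Python. This is only an illustration, not the code we actually used: the function name `ec_text_after_label` is made up here, and it assumes that the literal token `EC:` appears at most once on a COMPND line.

```python
from typing import Optional

ASCII_DIGITS = "0123456789"

def ec_text_after_label(compnd_line: str) -> Optional[str]:
    """Return the EC text of a COMPND line, starting at the first digit.

    Hypothetical helper: everything between 'EC:' and the first digit
    (such as a stray 'EC' or 'E.C.') is skipped. Returns None if the
    line has no 'EC:' token or no digit after it.
    """
    marker = "EC:"
    pos = compnd_line.find(marker)
    if pos < 0:
        return None
    rest = compnd_line[pos + len(marker):]
    # Eat the line character by character until we hit a digit.
    for i, ch in enumerate(rest):
        if ch in ASCII_DIGITS:
            # Drop surrounding blanks and the trailing semicolon, if any.
            return rest[i:].strip().rstrip(";").strip()
    return None
```

Applied to the two lines above, `COMPND 7 EC: EC 2.5.1.54;` yields `2.5.1.54` and `COMPND 5 EC: E.C.3.44.22.10` yields `3.44.22.10`. The sketch does not check whether the result is a sensible EC-code.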
While looking for EC-codes we found REMARK 900. This PDB-file remark contains pointers to similar PDB files, and we found a case where the EC-code for a similar protein was given. Hopeful, we wrote some code, only to find out that this information is provided in only very few cases. See the supplemental material below.
By the way, in the same block of PDB annotation records we found several more syntactic inconsistencies. For example,
HEADER TRANSFERASE/HYDROLASE 01-JUN-01 1JB1
TITLE LACTOBACILLUS CASEI HPRK/P BOUND TO PHOSPHATE
COMPND MOL_ID: 1;
COMPND 2 MOLECULE: HPRK PROTEIN;
COMPND 3 CHAIN: A;
COMPND 4 FRAGMENT: C-TERMINUS (RESIDUES 128-319);
COMPND 5 EC: 2.7.1.-/3.1.3.-;
Has a slash between the EC codes. That can seriously screw up computer programs...
Most EC-codes end with a semicolon (;), also when one chain has two EC-codes. But:
HEADER HYDROLASE 03-APR-07 2YS0
TITLE SOLUTION STRUCTURE OF THE SOMATOMEDIN B DOMAIN OF HUMAN
TITLE 2 ECTONUCLEOTIDE PYROPHOSPHATASE/PHOSPHODIESTERASE FAMILY
TITLE 3 MEMBER
COMPND MOL_ID: 1;
COMPND 2 MOLECULE: ECTONUCLEOTIDE PYROPHOSPHATASE/PHOSPHODIESTERASE
COMPND 3 FAMILY MEMBER 1;
COMPND 4 CHAIN: A;
COMPND 5 FRAGMENT: SOMATOMEDIN_B;
COMPND 6 SYNONYM: 'ECTONUCLEOTIDE PYROPHOSPHATASE/PHOSPHODIESTERASE
COMPND 7 1, E-NPP 1, PHOSPHODIESTERASE I/NUCLEOTIDE PYROPHOSPHATASE
COMPND 8 1, PLASMA-CELL MEMBRANE GLYCOPROTEIN PC-1;
COMPND 9 EC: 3.1.4.1, 3.6.1.9;
uses a comma (,) between the two EC-codes.
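One way to cope with the slash and comma variants shown above is to treat every separator we have seen so far as equivalent. The following is a minimal sketch under that assumption. The function name `split_ec_codes` is hypothetical, and the separator set (semicolon, comma, slash, whitespace) only covers the cases listed in this section.

```python
import re
from typing import List

# Separators observed in the examples above: ';' ',' '/' and blanks.
_SEPARATORS = re.compile(r"[;,/\s]+")

def split_ec_codes(field: str) -> List[str]:
    """Split the text after 'EC:' into individual EC-code strings."""
    return [part for part in _SEPARATORS.split(field) if part]
```

With this, `2.7.1.-/3.1.3.-;` and `3.1.4.1, 3.6.1.9;` both come out as a list of two codes, and `2.5.1.54;` as a list of one. Any separator we have not yet encountered would of course still slip through.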
In 2R2D:
JRNL AUTH D.LIU,P.W.THOMAS,J.MOMB,Q.HOANG,G.A.PETSKO,D.RINGE,
JRNL AUTH 2 W.FAST
JRNL TITL STRUCTURE AND SPECIFICITY OF A QUORUM-QUENCHING
JRNL TITL 2 LACTONASE (AIIB) FROM AGROBACTERIUM TUMEFACIENS
JRNL REF BIOCHEMISTRY 2007
we find as EC-code:
COMPND MOL_ID: 1;
COMPND 2 MOLECULE: ZN-DEPENDENT HYDROLASES;
COMPND 3 CHAIN: A, B, C, D, E, F;
COMPND 4 SYNONYM: AGR_PTI_140P;
COMPND 5 EC: 3.1.1.B1;
COMPND 6 ENGINEERED: YES
And that is also a syntax error that is not software-friendly.
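Cases like this one can at least be flagged automatically for manual inspection. The sketch below assumes that a well-formed EC-code has exactly four dot-separated fields, that each field is a number or a dash, and that only dashes may follow a dash. The function names are made up for this illustration, and the check is purely syntactic.

```python
ASCII_DIGITS = "0123456789"

def _is_number(field: str) -> bool:
    """True if the field is a non-empty run of ASCII digits."""
    return field != "" and all(ch in ASCII_DIGITS for ch in field)

def is_wellformed_ec(code: str) -> bool:
    """Syntactic check only: four fields, numbers first, dashes last."""
    fields = code.split(".")
    if len(fields) != 4 or not _is_number(fields[0]):
        return False
    seen_dash = False
    for field in fields[1:]:
        if field == "-":
            seen_dash = True
        elif seen_dash or not _is_number(field):
            # A letter such as 'B1', or a number after a dash.
            return False
    return True
```

Under these assumptions 2.5.1.54 and 2.7.1.- pass, while 3.1.1.B1 is rejected. Note that 3.44.22.10 also passes, because the check says nothing about whether the enzyme class actually exists.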