Introduction
Peptide bonds can have two conformations. The torsion angle ω
(Cαi-1-Ci-1-Ni-Cαi)
can be around 0°, cis, or around 180°, trans.
The peptides in the protein structures in the Protein
Data Bank (PDB)
almost exclusively
contain the trans conformation.
However, some of those conformations are incorrect and should actually
be cis. Other trans peptide planes should stay trans but need to be
rotated ~180° about the Cα-Cα axis. This website
details the methodology used to predict cis↔trans flips and
peptide plane flips in the backbone of protein structures, and contains
information supplementary to the manuscript 'Detection of
trans–cis flips and peptide plane flips in protein
structures' by Wouter G. Touw, Robbie P. Joosten and Gert Vriend.
The method predicts ~70K peptide plane flips and ~5K trans →
cis flips.
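As a small illustration of the ω definition given above, the following
Python sketch computes a torsion angle from four atom positions
(Cα(i-1), C(i-1), N(i), Cα(i)) and labels the peptide bond cis or trans.
The function names and the 90° cut-off between cis and trans are
assumptions made for this example only; they are not taken from WHAT IF
or WHAT_CHECK.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def torsion(p1, p2, p3, p4):
    """Torsion angle in degrees defined by four 3D points."""
    b1, b2, b3 = _sub(p2, p1), _sub(p3, p2), _sub(p4, p3)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)
    y = _dot(b2, _cross(n1, n2)) / math.sqrt(_dot(b2, b2))
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


def peptide_conformation(omega, cutoff=90.0):
    """Label omega as 'cis' (near 0) or 'trans' (near 180).

    The 90 degree cut-off is an illustrative assumption.
    """
    return "cis" if abs(omega) < cutoff else "trans"
```

For example, peptide_conformation(torsion(ca_prev, c_prev, n, ca)) gives
the label for one peptide bond, where the four arguments are (x, y, z)
coordinate tuples.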
Predict flips
There are several options to predict flips:
WHAT_CHECK
The FLPCHK validation routine of WHAT IF's CHECK menu.
Web server
The flip checks are on the WHAT IF web servers
cis ↔ trans and
peptide plane flips
We have adopted a systematic naming system for flips and non-flips.
The flip type is indicated by three characters. The first character
indicates the starting omega conformation (either t for trans or
c for cis). The second character indicates the correct omega
conformation (again t or c). The third character indicates whether the
carbonyl 'flips' (+) or not (-). For cis-trans flips the third
character implies the reverse for the N-H, i.e. tc- involves a NH-flip.
Both an NH-flip and CO-flip occur when peptide planes are flipped (tt+).
Theoretically there are six possible flip types; in addition, tt- and
cc- designate correct trans and cis peptides:
Flip type | Explanation |
tt- | The peptide conformation is correct and should stay trans |
tt+ | A peptide plane flip (~180° crankshaft motion about the Cα-Cα axis); both CO-flip and NH-flip |
tc- | trans to cis with NH-flip |
tc+ | trans to cis with CO-flip |
cc- | The peptide conformation is correct and should stay cis |
cc+ | The entire Cα-C-N-Cα unit theoretically would rotate ~180° |
ct- | cis to trans with NH-flip |
ct+ | cis to trans with CO-flip |
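The three-character naming scheme can be decoded mechanically. The
sketch below is only an illustration of the rules stated above; the
names FlipType and parse_flip_type are hypothetical, and treating cc+
as both a CO-flip and an NH-flip (by analogy with tt+) is an assumption.

```python
from typing import NamedTuple


class FlipType(NamedTuple):
    start: str     # omega conformation in the deposited model
    correct: str   # omega conformation it should have
    co_flip: bool  # does the carbonyl flip?
    nh_flip: bool  # does the N-H flip?


def parse_flip_type(code: str) -> FlipType:
    """Decode a flip type such as 'tc-' or 'tt+'."""
    if (len(code) != 3 or code[0] not in "tc"
            or code[1] not in "tc" or code[2] not in "+-"):
        raise ValueError(f"not a flip type: {code!r}")
    names = {"t": "trans", "c": "cis"}
    co_flip = code[2] == "+"
    if code[0] != code[1]:
        # cis-trans flips: the third character implies the reverse for N-H.
        nh_flip = not co_flip
    else:
        # omega unchanged: '+' flips the whole plane, '-' flips nothing.
        nh_flip = co_flip
    return FlipType(names[code[0]], names[code[1]], co_flip, nh_flip)
```

For instance, parse_flip_type("tc-") describes a trans peptide that
should be cis and is corrected by an NH-flip.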
We found all flip types except cc+ in the PDB. The different flip
classes are best illustrated by examples found in structures deposited
to the PDB. The residues mentioned are the ones directly after the
peptide bond that needs to be flipped.
[Figures: example structures for the tt+, tc-, tc+, ct- and ct+ flip types]
Prediction
Training data
Pairs of X-ray structures solved at 3.5 Å or better, containing
at least 25 amino acids in at least one chain, for which a DSSP file
(Kabsch & Sander, 1983) exists, and that contained at least one
trans-cis or peptide plane flip between PDB and PDB_REDO were obtained
from the releases of 20-10-2014. From these PDB files stretches of four
canonical residues were selected that had all atoms present with
non-zero B-factor and full occupancy; no covalently bound atoms were
allowed other than the continuation of the chain; all torsion angles
and the DSSP secondary structure had to be determinable; the four amino
acids were neither N- nor C-terminal, nor adjacent to a chain break. A
training set was obtained by comparing peptide conformations in the
pairs of PDB and PDB_REDO structures. The procedure calculates three
values: 1) ΔC=O, the angle between the PDB carbonyl and the PDB_REDO
carbonyl after optimal superposition; 2) ΔN-H, the angle between the
PDB and PDB_REDO N-H bonds; 3) Δω, the ω torsion angle difference.
If Δω is big, a cis-trans NH-flip (tc- or ct-) is assigned when ΔN-H
is big and ΔC=O is small, and a CO-flip (tc+ or ct+) is assigned when
ΔC=O is big and ΔN-H is small. If Δω is small and both ΔN-H and ΔC=O
are big, a peptide plane flip is assigned. The best assignments were
obtained when ‘big’ was defined as greater than 120° and ‘small’ as
less than 60°. Irregular cases were excluded from the training
examples. A case is irregular when a) ΔN-H, ΔC=O or Δω is big but the
other criteria are not met, or b) either one of the Cα atoms flanking
the peptide plane is displaced by more than 1 Å after superposition.
The pseudo-code for determining the flip types is given below.
- Pseudocode
# The angle between the carbonyls
oang = calcAngle(CO_PDB, CO_REDO)
# The angle between the amides
hang = calcAngle(NH_PDB, NH_REDO)
# Omega difference, wrapped to [0, 180] so that e.g. 179 and -179
# differ by 2 (the wrap assumes torsions are reported in (-180, 180])
opdb = calcTorsion(CACNCA_PDB)
oredo = calcTorsion(CACNCA_REDO)
odif = abs(opdb - oredo)
if odif > 180:
    odif = 360 - odif
# C-alpha displacement after superposition (residues i-1 and i)
cadif_m1 = distance(CA_m1_PDB, CA_m1_REDO)
cadif_i = distance(CA_i_PDB, CA_i_REDO)
cadif = max(cadif_m1, cadif_i)
if oang > 120 or hang > 120 or odif > 120:
    if cadif > 1.0:
        appendFlipType('displaced_')
    if odif > 120:
        # cis-trans flip; the model with the larger |omega| is trans
        if abs(oredo) > abs(opdb):
            appendFlipType('ct')
        else:
            appendFlipType('tc')
        # CO-flip, NH-flip, or hard to determine automatically?
        if oang > 120 and hang < 60:
            appendFlipType('+')
        elif hang > 120 and oang < 60:
            appendFlipType('-')
        else:
            appendFlipType('_irregular')
    elif odif < 60:
        # peptide plane flip
        if oang > 120 and hang > 120:
            appendFlipType('tt+')
        else:
            appendFlipType('tt+_irregular')
    else:
        appendFlipType('omega_irregular')
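For readers who want to experiment with the assignment rules, the
following is a runnable Python sketch of the same decision logic,
taking precomputed angles as input. It is not the code used to build
the training set. The function names, the "no_flip" label and the
wrap-around handling of the ω difference are assumptions made for this
example; the 120°, 60° and 1 Å cut-offs are those quoted in the text.

```python
def omega_difference(omega_a, omega_b):
    """Absolute torsion difference in degrees, wrapped to [0, 180]."""
    d = abs(omega_a - omega_b) % 360.0
    return 360.0 - d if d > 180.0 else d


def assign_flip_type(oang, hang, opdb, oredo, cadif,
                     big=120.0, small=60.0, max_ca_shift=1.0):
    """Return a flip label from PDB vs. PDB_REDO comparison values.

    oang/hang: angles between the two carbonyls / the two N-H bonds,
    opdb/oredo: omega in PDB and PDB_REDO, cadif: largest C-alpha shift.
    """
    odif = omega_difference(opdb, oredo)
    if not (oang > big or hang > big or odif > big):
        return "no_flip"  # hypothetical label for unchanged peptides
    label = "displaced_" if cadif > max_ca_shift else ""
    if odif > big:
        # The model with the larger |omega| is the trans one.
        label += "ct" if abs(oredo) > abs(opdb) else "tc"
        if oang > big and hang < small:
            label += "+"
        elif hang > big and oang < small:
            label += "-"
        else:
            label += "_irregular"
    elif odif < small:
        label += "tt+" if (oang > big and hang > big) else "tt+_irregular"
    else:
        label += "omega_irregular"
    return label
```

A call such as assign_flip_type(170, 15, 178, 4, 0.3) corresponds to a
trans peptide in the PDB model that became cis in PDB_REDO through a
CO-flip.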
The different flip classes and the correct classes (tt- and cc-) are
not equally distributed, since the overwhelming majority of the PDB
peptides have the correct conformation. Highly skewed data are
notoriously hard for machine learning algorithms to deal with, for
example because always predicting the majority class still gives
almost perfect prediction accuracy. This is also known as the imbalance
problem. Popular strategies include under-sampling of the majority
class, over-sampling of the minority class, or a combination of both.
We found that randomly downsampling the majority class to the size of
the minority class to obtain a balanced training set worked well for
training Random Forests (see below).
When repeated with different random seeds, very similar results were
obtained. Training with unbalanced data using estimated class priors did
not work well in our hands.
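Random downsampling of the majority class can be sketched as follows.
This is a generic illustration in Python, not the procedure used to
build the published training sets; the function name and the default
seed are assumptions.

```python
import random
from collections import defaultdict


def downsample_to_minority(examples, labels, seed=0):
    """Return (example, label) pairs with every class reduced to the
    size of the smallest class by random sampling without replacement."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for example, label in zip(examples, labels):
        by_class[label].append(example)
    n_min = min(len(members) for members in by_class.values())
    balanced = []
    for label, members in by_class.items():
        balanced.extend((m, label) for m in rng.sample(members, n_min))
    rng.shuffle(balanced)
    return balanced
```

Repeating the call with different seed values gives different balanced
subsets, which is one way to check that results do not depend on a
particular random draw.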
Download training data
Test data
The test cases were manually validated and re-refined when
necessary.
Note that some of the validated peptides do not conform to all of
the training set criteria. For example, some residues in the
tetrapeptides may be incomplete or bound to something, etc. These cases
have not been included when determining the performance in the
validation process.
Download test data
An incorrectly modeled peptide usually causes problems for the local
backbone. The peptide has to be accommodated, even if little space is
available, causing a fight between the X-ray data and refinement
restraints. This causes strain that shows up in several features
describing the local backbone conformation. If the peptide conformation
is corrected by a cis↔trans flip or a peptide plane flip and the
structure is re-refined, the strain will be relieved and the backbone
parameters will adopt their normal values. The figures below show the
change in continuous features (described in the Features section) upon flipping and
re-refinement by
PDB_REDO
for x-X-Xnpg-x (X: any residue; Xnpg: any residue except Pro and Gly)
tt-/tt+/tc-/tc+, x-X-Pro-x tt-/tc-/tc+, and x-X-Gly-x tt-/tt+ peptides
in the test data. The lines are Gaussian kernel
density estimates for the number of cases indicated in the legend
for each flip class. The feature distributions in PDB structures are
indicated with solid lines. The PDB_REDO distributions are indicated
with dotted lines. For example, in the ψi (psi) plot for
X-Xnpg the PDB_REDO distributions have been shifted for the 56 tt+, 49
tc- and 10 tc+ cases with respect to the PDB ψi
distribution. The 'difference' plots show the feature distributions for
PDB_REDO - PDB (i.e. 0 means no difference before and after
correction). Click on the bar to show/hide the figures.
- Test data before
and after correction
Features
The features that can capture the difference between an incorrect
and correct peptide bond conformation belong to several feature
groups and have been calculated for four amino acids surrounding the
peptide bond. The feature groups are angles, torsion angles,
distances, chiral volumes, B-factors, secondary structure, and a few
other groups explained in the table. Rather than B-factors
from PDB files, we used B-factors from BDB files. The BDB is a databank that
contains PDB files with consistent B-factors
(Touw & Vriend, 2014).
Most features were calculated by WHAT IF.
Click on the bar to show/hide the comprehensive list of all features
and their explanation. Note that the definition of ω for
residue i in WHAT IF is equal to the standard definition of ω
for residue i+1.
- Feature
explanation
Features
WH indicates whether the feature is part of
the Weiss & Hilgenfeld (1999)
algorithm. The angle subscript indicates which residue
contributes most atoms.
Feature | Code | Type | Explanation | WH |
∠φi-2 | phi_m2 | backbone torsion | | |
∠ψi-2 | psi_m2 | backbone torsion | | |
∠ωi-2 | omega_m2 | backbone torsion | | |
∠φi-1 | phi_m1 | backbone torsion | | y |
∠ψi-1 | psi_m1 | backbone torsion | | y |
∠ωi-1 | omega_m1 | backbone torsion | | y |
∠φi | phi | backbone torsion | | y |
∠ψi | psi | backbone torsion | | y |
∠ωi | omega | backbone torsion | | |
∠φi+1 | phi_p1 | backbone torsion | | |
∠ψi+1 | psi_p1 | backbone torsion | | |
∠ωi+1 | omega_p1 | backbone torsion | | |
∠N-Cα-Ci-2 | ncac_m2 | bond angle | | |
∠Cα-C-Ni-2 | cacn_m2 | bond angle | | |
∠Cα-C-Oi-2 | caco_m2 | bond angle | | |
∠O-C-Ni-2 | ocn_m2 | bond angle | | |
∠N-Cα-Cβi-2 | ncacb_m2 | bond angle | a pseudo-Cβ is calculated for Gly | |
∠C-Cα-Cβi-2 | ccacb_m2 | bond angle | a pseudo-Cβ is calculated for Gly | |
∠C-N-Cαi-2 | cnca_m3 | bond angle | | |
∠N-Cα-Ci-1 | ncac_m1 | bond angle | | y |
∠Cα-C-Ni-1 | cacn_m1 | bond angle | | y |
∠Cα-C-Oi-1 | caco_m1 | bond angle | | y |
∠O-C-Ni-1 | ocn_m1 | bond angle | | y |
∠N-Cα-Cβi-1 | ncacb_m1 | bond angle | a pseudo-Cβ is calculated for Gly | |
∠C-Cα-Cβi-1 | ccacb_m1 | bond angle | a pseudo-Cβ is calculated for Gly | |
∠C-N-Cαi-1 | cnca_m2 | bond angle | | |
∠N-Cα-Ci | ncac | bond angle | | y |
∠Cα-C-Ni | cacn | bond angle | | |
∠Cα-C-Oi | caco | bond angle | | |
∠O-C-Ni | ocn | bond angle | | |
∠N-Cα-Cβi | ncacb | bond angle | a pseudo-Cβ is calculated for Gly | |
∠C-Cα-Cβi | ccacb | bond angle | a pseudo-Cβ is calculated for Gly | |
∠C-N-Cαi | cnca_m1 | bond angle | | y |
∠N-Cα-Ci+1 | ncac_p1 | bond angle | | |
∠Cα-C-Ni+1 | cacn_p1 | bond angle | | |
∠Cα-C-Oi+1 | caco_p1 | bond angle | | |
∠O-C-Ni+1 | ocn_p1 | bond angle | | |
∠N-Cα-Cβi+1 | ncacb_p1 | bond angle | a pseudo-Cβ is calculated for Gly | |
∠C-Cα-Cβi+1 | ccacb_p1 | bond angle | a pseudo-Cβ is calculated for Gly | |
∠C-N-Cαi+1 | cnca | bond angle | | |
N-Cαi-1 | nca_m1 | bond length | | y |
Cα-Ci-1 | cac_m1 | bond length | | y |
C-Oi-1 | co_m1 | bond length | | y |
C-Ni | cn | bond length | | y |
N-Cαi | nca | bond length | | y |
Cα-Ci | cac | bond length | | y |
Features
Feature | Code | Type | Explanation | WH |
Cαi-2-Cαi-1 | cam2_cam1 | backbone distance | | |
Cαi-1-Cαi | cam1_ca | backbone distance | | y |
Cαi-Cαi+1 | ca_cap1 | backbone distance | | |
Cβi-2-Cβi-1 | cbm2_cbm1 | distance | a pseudo-Cβ is calculated for Gly | |
Cβi-1-Cβi | cbm1_cb | distance | a pseudo-Cβ is calculated for Gly | |
Cβi-Cβi+1 | cb_cbp1 | distance | a pseudo-Cβ is calculated for Gly | |
Oi-2-Oi-1 | om2_om1 | distance | | |
Oi-1-Oi | om1_o | distance | | |
Oi-Oi+1 | o_op1 | distance | | |
Cαi-2 cv | imp_m2 | improper dihedral | chiral volume | |
Cαi-1 cv | imp_m1 | improper dihedral | chiral volume | |
Cαi cv | imp | improper dihedral | chiral volume | |
Cαi+1 cv | imp_p1 | improper dihedral | chiral volume | |
∠COi-2-COi-1 | ooang1 | angle | | |
∠COi-1-COi | ooang2 | angle | | |
∠COi-COi+1 | ooang3 | angle | | |
∠COi-2-COi+1 | foang1 | angle | | |
B Ni-2 | bn_m2 | B-factor | | |
B Cαi-2 | bca_m2 | B-factor | | |
B Ci-2 | bc_m2 | B-factor | | |
B Oi-2 | bo_m2 | B-factor | | |
B Ni-1 | bn_m1 | B-factor | | |
B Cαi-1 | bca_m1 | B-factor | | |
B Ci-1 | bc_m1 | B-factor | | |
B Oi-1 | bo_m1 | B-factor | | |
B Ni | bn | B-factor | | |
B Cαi | bca | B-factor | | |
B Ci | bc | B-factor | | |
B Oi | bo | B-factor | | |
B Ni+1 | bn_p1 | B-factor | | |
B Cαi+1 | bca_p1 | B-factor | | |
B Ci+1 | bc_p1 | B-factor | | |
B Oi+1 | bo_p1 | B-factor | | |
δBi-1 | bdif1 | B-factor gradient | backbone to side-chain (0.0 for Gly) | |
δBi | bdif2 | B-factor gradient | backbone to side-chain (0.0 for Gly) | |
DSSPi-2 | dssp_m2 | secondary structure | | |
DSSPi-1 | dssp_m1 | secondary structure | | |
DSSPi | dssp | secondary structure | | |
DSSPi+1 | dssp_p1 | secondary structure | | |
Dφ+ | phip | WH | 0.0 if φi<0° else sin(φi) | y |
Dtotal | dtot | WH | final WH score | y |
Oi-1 bump score | om1_bump | bump | WHAT IF bump score | |
Oi-1 aligned | helali | other | whether the carbonyl of the peptide plane is close to a helix and aligned with the hydrogen bonding helix carbonyls | |
Training
Four different training data sets were constructed: X-Xnpg tc-, X-Xnpg
tt+, X-Pro tc+ and X-Gly tt+. For all these training sets the negative
class is tt-. For each of the four training sets a Random Forest (RF;
Breiman, 2001) classifier was constructed. The flip type-specific
classifiers were later combined into one classifier per residue class.
This strategy made optimal use of the available training examples
(e.g. many more tt- and tt+ cases could be used for X-Xnpg flips than
could have been used with a multi-class training with a balanced data
set). A description of RF is available here.
In short, an RF is an ensemble of decision trees. Each classification
tree is constructed using a random subset of training examples and a
random subset of features. The individual trees are so-called weak
learners; the correlation between them is relatively small. Every tree
predicts the class of a training example. The ensemble of individual
trees, the RF, classifies new cases by collecting votes from each tree.
A threshold determines the fraction of votes that is needed to predict
the class. The combination of enough sufficiently uncorrelated weak
learners increases the classification strength of the ensemble and
usually leads to a robust classifier. The RF models were trained using
the randomForest and caret R packages. The RF parameter mtry was tuned
using 5-fold cross-validation that was repeated 20 times with different
random seeds.
The classification performance on the training examples was
measured by and/or optimized against the cutoff-independent area under
the ROC curve, the area under the precision-recall curve, and/or the
Matthews correlation coefficient and Euclidean distance to perfect
specificity and sensitivity at the optimal threshold determined by
tuning across 40 different threshold values between 0.5 and 1 (code
examples here and here).
The final classifiers were constructed using all training data with
parameters from the best-performing cross-validation models. The
resampling performance for the final classifiers is shown in the table
below. The area under the ROC curve has been calculated by the ROCR
package. The rocplus package was used to obtain the area under the
precision-recall curve.
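The threshold-dependent measures mentioned above can be illustrated
with a short Python sketch. The models themselves were trained in R;
the code below is not the original tuning code. It assumes that a case
is called a flip when its vote fraction is at least the threshold and
that the 40 thresholds are evenly spaced from 0.5 to 1 inclusive. All
function names are hypothetical.

```python
import math


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient (0.0 if undefined)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def topleft_distance(tp, fp, tn, fn):
    """Euclidean distance to perfect sensitivity and specificity."""
    sens = tp / (tp + fn) if tp + fn else 0.0
    spec = tn / (tn + fp) if tn + fp else 0.0
    return math.hypot(1.0 - sens, 1.0 - spec)


def confusion(votes, is_flip, threshold):
    """Count TP, FP, TN, FN when calling a flip at votes >= threshold."""
    tp = fp = tn = fn = 0
    for vote, actual in zip(votes, is_flip):
        predicted = vote >= threshold
        if predicted and actual:
            tp += 1
        elif predicted:
            fp += 1
        elif actual:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def tune_threshold(votes, is_flip, n_steps=40, lo=0.5, hi=1.0):
    """Return (threshold, MCC) with the highest MCC on an even grid."""
    best = None
    for i in range(n_steps):
        threshold = lo + (hi - lo) * i / (n_steps - 1)
        score = mcc(*confusion(votes, is_flip, threshold))
        if best is None or score > best[1]:
            best = (threshold, score)
    return best
```

Replacing the MCC by the negative top-left distance in tune_threshold
would optimize against the distance criterion instead.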
- Training
performance
Training performance
Metric / classifier | minimum | 1st quartile | median | mean | 3rd quartile | maximum |
ROC AUC |
X-Xnpg tc- | 0.906 | 0.950 | 0.964 | 0.962 | 0.975 | 0.997 |
X-Xnpg tt+ | 0.976 | 0.982 | 0.984 | 0.984 | 0.985 | 0.989 |
X-Pro tc+ | 0.858 | 0.933 | 0.954 | 0.953 | 0.975 | 1 |
X-Gly tt+ | 0.943 | 0.959 | 0.965 | 0.964 | 0.971 | 0.983 |
Precision-recall AUC |
X-Xnpg tc- | 0.906 | 0.950 | 0.963 | 0.962 | 0.976 | 0.997 |
X-Xnpg tt+ | 0.976 | 0.982 | 0.984 | 0.984 | 0.985 | 0.989 |
X-Pro tc+ | 0.858 | 0.932 | 0.954 | 0.953 | 0.975 | 1 |
X-Gly tt+ | 0.943 | 0.959 | 0.965 | 0.964 | 0.971 | 0.983 |
Matthews correlation coefficient |
X-Xnpg tc- | 0.626 | 0.753 | 0.789 | 0.794 | 0.839 | 0.921 |
X-Xnpg tt+ | 0.843 | 0.869 | 0.878 | 0.877 | 0.886 | 0.899 |
X-Pro tc+ | 0.52 | 0.835 | 0.869 | 0.864 | 0.897 | 1 |
X-Gly tt+ | 0.74 | 0.788 | 0.811 | 0.810 | 0.831 | 0.873 |
Topleft distance |
X-Xnpg tc- | 0.0608 | 0.139 | 0.169 | 0.174 | 0.218 | 0.390 |
X-Xnpg tt+ | 0.0758 | 0.0887 | 0.0950 | 0.0953 | 0.102 | 0.120 |
X-Pro tc+ | 0 | 0.0786 | 0.115 | 0.126 | 0.167 | 0.404 |
X-Gly tt+ | 0.0981 | 0.127 | 0.142 | 0.141 | 0.155 | 0.192 |
Accuracy |
X-Xnpg tc- | 0.792 | 0.875 | 0.890 | 0.893 | 0.917 | 0.959 |
X-Xnpg tt+ | 0.921 | 0.934 | 0.938 | 0.938 | 0.942 | 0.949 |
X-Pro tc+ | 0.75 | 0.917 | 0.932 | 0.929 | 0.946 | 1 |
X-Gly tt+ | 0.869 | 0.893 | 0.905 | 0.904 | 0.914 | 0.936 |
False positive rate (fallout) |
X-Xnpg tc- | 0.0000 | 0.1110 | 0.167 | 0.155 | 0.194 | 0.389 |
X-Xnpg tt+ | 0.0230 | 0.0326 | 0.0363 | 0.0365 | 0.0400 | 0.0496 |
X-Pro tc+ | 0 | 0.0556 | 0.111 | 0.111 | 0.167 | 0.278 |
X-Gly tt+ | 0.0333 | 0.0571 | 0.0667 | 0.0698 | 0.0810 | 0.129 |
False discovery rate |
X-Xnpg tc- | 0.0000 | 0.1070 | 0.145 | 0.138 | 0.182 | 0.286 |
X-Xnpg tt+ | 0.0245 | 0.0344 | 0.0382 | 0.0385 | 0.0423 | 0.0518 |
X-Pro tc+ | 0 | 0.0556 | 0.100 | 0.0996 | 0.143 | 0.217 |
X-Gly tt+ | 0.0355 | 0.0611 | 0.0730 | 0.0733 | 0.0843 | 0.127 |
True positive rate (sensitivity/recall) |
X-Xnpg tc- | 0.806 | 0.917 | 0.946 | 0.942 | 0.972 | 1 |
X-Xnpg tt+ | 0.890 | 0.906 | 0.912 | 0.912 | 0.919 | 0.933 |
X-Pro tc+ | 0.611 | 0.944 | 1.000 | 0.969 | 1 | 1 |
X-Gly tt+ | 0.824 | 0.862 | 0.876 | 0.879 | 0.895 | 0.933 |
False negative rate (miss rate) |
X-Xnpg tc- | 0 | 0.0278 | 0.0541 | 0.0577 | 0.0833 | 0.194 |
X-Xnpg tt+ | 0.0665 | 0.0810 | 0.0883 | 0.0878 | 0.0943 | 0.110 |
X-Pro tc+ | 0 | 0 | 0.0000 | 0.0308 | 0.0556 | 0.389 |
X-Gly tt+ | 0.0667 | 0.105 | 0.124 | 0.121 | 0.138 | 0.176 |
True negative rate (specificity) |
X-Xnpg tc- | 0.611 | 0.806 | 0.833 | 0.845 | 0.889 | 1.000 |
X-Xnpg tt+ | 0.950 | 0.96 | 0.964 | 0.963 | 0.967 | 0.977 |
X-Pro tc+ | 0.722 | 0.833 | 0.889 | 0.889 | 0.944 | 1 |
X-Gly tt+ | 0.871 | 0.919 | 0.933 | 0.930 | 0.943 | 0.967 |
Positive predictive value (precision) |
X-Xnpg tc- | 0.714 | 0.818 | 0.855 | 0.862 | 0.893 | 1.000 |
X-Xnpg tt+ | 0.948 | 0.958 | 0.962 | 0.962 | 0.966 | 0.976 |
X-Pro tc+ | 0.783 | 0.857 | 0.900 | 0.900 | 0.944 | 1 |
X-Gly tt+ | 0.873 | 0.916 | 0.927 | 0.927 | 0.939 | 0.964 |
Negative predictive value |
X-Xnpg tc- | 0.811 | 0.914 | 0.942 | 0.939 | 0.968 | 1 |
X-Xnpg tt+ | 0.896 | 0.911 | 0.916 | 0.917 | 0.922 | 0.935 |
X-Pro tc+ | 0.696 | 0.944 | 1.000 | 0.970 | 1 | 1 |
X-Gly tt+ | 0.842 | 0.874 | 0.884 | 0.885 | 0.899 | 0.932 |
F-score |
X-Xnpg tc- | 0.817 | 0.880 | 0.897 | 0.899 | 0.921 | 0.961 |
X-Xnpg tt+ | 0.918 | 0.932 | 0.937 | 0.936 | 0.941 | 0.948 |
X-Pro tc+ | 0.71 | 0.919 | 0.934 | 0.932 | 0.948 | 1 |
X-Gly tt+ | 0.864 | 0.891 | 0.902 | 0.902 | 0.912 | 0.934 |
Features important for correct classification of training
examples are shown below. These figures show either the overall mean
and standard deviation of the permutation importance or the scaled
class-specific importance.
The classifiers have been converted (automatically) to FORTRAN
IF/ELSE statements for inclusion in WHAT_CHECK:
X-Pro prediction
All X-Pro tc- cases in the test set derived from the PDB-PDB_REDO
comparison had a positive φi angle and could be separated perfectly
from tt- and tc+ cases, for which the average φ is always around -60°.
The rule φ > 0° misclassifies two tc+ instances. When
this rule was applied to the entire PDB we also found examples of X-Pro
with positive φ angles other than tc- cases. Incorrect chirality
of the nitrogen atoms resulted, for example, from strain in the local
backbone because residues i+1 or i-1 needed to be flipped. The class of
trans X-Pro residues with N-chirality problems was called nch. The
average φ is 96° for tc- and 12° for nch. The nch cases
could be separated from tc- cases by a simple rule: if the angle
N-Cα-C is large (> 112.47°) and the bump score of the oxygen
in the peptide plane is large (> 0.26 WHAT IF bump score units), then
the X-Pro with a positive φ is not a peptide in need of a tc-
flip but a trans-Pro with N-chirality problems. This rule was found by
visual inspection of the data using RFScout, a tool that allows the
creation of Simple Decision Models. The WHAT_CHECK code for this
classifier has been hand-written and not generated automatically.
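The hand-written X-Pro rule can be summarized in a few lines of Python.
This is a sketch of the rule as described in the text, not the FORTRAN
code in WHAT_CHECK. The function and parameter names are hypothetical,
and the text does not specify here which residue's N-Cα-C angle is
meant, so the caller has to supply the appropriate value.

```python
def classify_xpro_positive_phi(phi, ncac_angle, o_bump,
                               ncac_cutoff=112.47, bump_cutoff=0.26):
    """Classify an X-Pro peptide with a positive phi angle.

    phi and ncac_angle are in degrees; o_bump is the WHAT IF bump score
    of the oxygen in the peptide plane. Returns 'nch' (trans-Pro with
    an N-chirality problem), 'tc-' (trans that should be cis), or None
    when phi is not positive and the rule does not apply.
    """
    if phi <= 0.0:
        return None
    if ncac_angle > ncac_cutoff and o_bump > bump_cutoff:
        return "nch"
    return "tc-"
```

Both conditions must hold for an nch assignment; a large N-Cα-C angle
without a large bump score still results in tc-.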
Cis → trans
We detected only 44 cis → trans flips in the entire PDB. This
means not enough data was available to create accurate classifiers.
Our method therefore does not include cis → trans flip
prediction. Nevertheless we observed that the Cαi-1-Cαi distance and
the Ci-1-Ni-Cαi angle tend to be larger for ct+ and ct- cases than for
normal cc- tetrapeptides. This can be expected because the data work
against the cis restraints. For X-Pro we also observed that the angle
τ (Ni-Cαi-Ci) may help to separate ct+, ct-, and cc- cases.
Validation
The five remaining classifiers were tested against an independent
test set. The performance of both the individual and the combined
prediction models is shown in the tables below. These tables also show
the performance on the subset of test cases that does not include
NCS-related flip examples (only the first example in a structure is
retained). We deliberately included a few NCS-related examples to
observe how sensitive the RFs are to small inter-chain variation.
Test performance for all flip-types.
The performance of dual-class (vs. tt-) RF classifiers is shown for
all test cases and for a subset of non-redundant cases without
NCS-related flips (separated by a forward slash). The
threshold-dependent metrics are at the highest MCC. Note that
the metrics are sensitive to class imbalance, except for the
AUC and MCC values.
WH: Weiss & Hilgenfeld (1999) method with original threshold for Dtot (143.10).
WH’: WH with cut-off re-determined in this study (82.256).
Metric | X-Xnpg tt+ | X-Xnpg tc- | X-Xnpg WH’ | X-Xnpg WH | X-Pro tc+ | X-Gly tt+ |
ROC AUC | 0.995/0.994 | 0.983/0.976 | 0.966/0.979 | 0.966 | 0.941/0.938 | 0.982/0.977 |
Precision-recall AUC | 0.995/0.994 | 0.983/0.976 | 0.966/0.979 | 0.966 | 0.941/0.938 | 0.982/0.977 |
Matthews correlation coefficient | 0.911/0.897 | 0.892/0.829 | 0.822/0.779 | 0.315 | 0.852/0.844 | 0.928/0.907 |
Accuracy | 0.979/0.978 | 0.963/0.968 | 0.935/0.954 | 0.803 | 0.924/0.926 | 0.971/0.965 |
True positive rate (sensitivity/recall) | 0.952/0.938 | 0.884/0.837 | 0.909/0.884 | 0.124 | 0.824/0.795 | 0.964/0.952 |
False positive rate (fallout) | 0.017/0.016 | 0.014/0.016 | 0.057/0.038 | 0.0 | 0.0/0.0 | 0.026/0.031 |
True negative rate (specificity) | 0.983/0.984 | 0.986/0.984 | 0.943/0.962 | 1.0 | 1.0/1.0 | 0.974/0.969 |
False negative rate (miss rate) | 0.048/0.062 | 0.116/0.163 | 0.091/0.116 | 0.876 | 0.176/0.205 | 0.036/0.048 |
Positive predictive value (precision) | 0.894/0.882 | 0.947/0.857 | 0.821/0.731 | 1.0 | 1.0/1.0 | 0.931/0.909 |
Negative predictive value | 0.993/0.992 | 0.967/0.981 | 0.973/0.986 | 0.798 | 0.881/0.895 | 0.987/0.984 |
X-Xnpg test set confusion table.
Combination of tt+ and tc-
classifiers. Inclusion of a tc+ classifier would result in
overfitting.
The classification accuracy is 93.2% including tc+ cases
and 95.0% without tc+ cases. Note that all tc+ cases found in
the PDB have been included in the test set.
Predicted class (rows) / Actual class (columns) | tt- | tt+ | tc- | tc+ |
tt- | 405 | 3 | 12 | 6 |
tt+ | 7 | 59 | 2 | 5 |
tc- | 6 | 0 | 107 | 1 |
tc+ | 0 | 0 | 0 | 0 |
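The accuracies quoted for the X-Xnpg confusion table can be recomputed
from the counts. The Python sketch below assumes that rows are
predicted classes and columns are actual classes, which is consistent
with the all-zero tc+ row (no tc+ classifier is used). The function
name is hypothetical.

```python
CLASSES = ["tt-", "tt+", "tc-", "tc+"]

# Rows: predicted class, columns: actual class (assumed orientation).
XNPG_TABLE = [
    [405, 3, 12, 6],   # predicted tt-
    [7, 59, 2, 5],     # predicted tt+
    [6, 0, 107, 1],    # predicted tc-
    [0, 0, 0, 0],      # predicted tc+ (never predicted)
]


def accuracy(table, classes, exclude_actual=()):
    """Fraction of correctly classified cases, optionally leaving out
    all cases whose actual class is listed in exclude_actual."""
    keep = [j for j, name in enumerate(classes) if name not in exclude_actual]
    total = sum(row[j] for row in table for j in keep)
    correct = sum(table[j][j] for j in keep)
    return correct / total


acc_all = accuracy(XNPG_TABLE, CLASSES)
acc_without_tcp = accuracy(XNPG_TABLE, CLASSES, exclude_actual=("tc+",))
```

With these counts, 571 of 613 cases are classified correctly when tc+
cases are included and 571 of 601 when they are left out, which gives
95.0% in the latter case.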
X-Pro test set confusion table.
Combination of tc-/nch and tc+
classifiers. The classification accuracy is 93.3%. nch means
an incorrect N chirality of trans-Pro.
Note that all 40 tc- cases that could be corrected by PDB_REDO
have been included. The classification accuracy is 93.1% when nch cases are
excluded.
Predicted class (rows) / Actual class (columns) | tt- | tt+ | tc- | tc+ | nch |
tt- | 89 | 0 | 0 | 12 | 0 |
tt+ | 0 | 0 | 0 | 0 | 0 |
tc- | 0 | 0 | 58 | 0 | 1 |
tc+ | 0 | 0 | 0 | 54 | 0 |
nch | 0 | 1 | 0 | 2 | 22 |
Table VII in the Weiss & Hilgenfeld paper lists the 20 peptides with
the highest Dtot score in their 25% database. Our method agrees with
half of those tc- predictions (5 cases could not be predicted because
they did not pass our data selection and quality criteria). Even though
some of the cases we predicted to be tt- were just below the tc-
threshold, a detailed comparison of the predictions made by the two
methods is probably not meaningful because structure factors are not
available for any of the listed entries. The only entry for which
structure factors became available at a later stage was 1aak, which was
superseded by 2aak. The WH method predicted tc- for the Arg 6 - Lys 7
peptide bond in 1aak and our method predicted tt-. In 1aak the backbone
around these residues is very distorted and φi is positive, which is
probably the reason for the high Dtot score. In 2aak the peptide bond
has a 'normal' trans conformation and the electron density suggests
that trans is the correct conformation. Furthermore, the residues are
located inside an α-helix and have the expected hydrogen-bonding
pattern in the trans conformation.
Usage
Flips can be predicted with the FLPCHK validation routines in
WHAT_CHECK, the ShowPepFlips WHAT IF web services option, or the
WHAT IF web server.
The response of our RF classifiers was translated to qualitative
measures of severity based on the validation set. All available methods
therefore also report if it is ‘highly unlikely’, ‘unlikely’, ‘somewhat
likely’, ‘likely’, or ‘highly likely’ that a peptide needs to be
flipped or requires the attention of a structural biologist. The
'highly likely' category has no false positives (FP) in the test set,
the 'likely' category has between 0 and 2% FP, the 'somewhat likely'
category has up to 10% FP, the 'unlikely' category has between 0 and 2%
false negatives (FN), and the 'highly unlikely' category has no FN in
the test set.
Simulation
In an effort to classify the small classes (and possibly address the
imbalance problem further), X-Xnpg tc+, ct-, and ct+ errors were
simulated. The errors were simulated for well-defined peptides selected
from medium-sized single chain structures solved at a resolution better
than 2.0 Å.
WHAT
IF (Vriend, 1990) flipped the peptides and relaxed the strain in
a 20-residue window spanning the local backbone of the peptides.
Subsequently, the flipped peptides were refined. After refinement only
a small fraction of the peptide conformations still had a wrong
conformation. Classifiers were constructed with 114 fabricated tc+
cases, but a two-class classifier could only correctly classify a third
of the true tc+ cases, and none of the true tc+ cases could be
distinguished by the four-class classifier. Furthermore, the
classifiers did not pick up any previously unrecognized tc+ in the PDB.
Finally, the simulated tc+ cases had a broad distribution and showed
overlap with all flip classes when they were projected in the Principal
Component Analysis space of validated tt-, tt+, tc-, and tc+ cases.
Although the number of true tc+ cases is very low, the simulated tc+ cases
seemed to be more similar to tc- and tt- cases than to tc+ cases in
most dimensions. In summary, the simulated flips were not
representative of actual flips and could therefore not be used to train
classifiers.
Re-building and re-refinement:
changes in reciprocal- and real-space coefficients
Single peptide and cis-trans flips will probably have a small
effect on the R-factors. The local changes in the protein backbone are
expected to lead to an increase in local real-space correlation
coefficient. We here present the flip correction and re-refinement of
1hi8,
1i6n,
2z81
and
1pe9 as examples.
1hi8
The PDB structure 1hi8 of the RNA-dependent RNA polymerase from dsRNA
bacteriophage φ6 has been solved at 2.50 Å. The reported R-/R-free
factors are 0.280/0.316.
Two cis-peptides are reported in the PDB file:
CISPEP 1 ILE A 96 PRO A 97 0 0.03
CISPEP 2 ILE B 96 PRO B 97 0 0.04
Both Ile-Pro peptide planes fit the electron density well.
However, 7 tt+ flips and 1 Pro tc+ flip are necessary in each of chains
A and B (a total of 16 flips):
GLY 92 - A tt+
GLY 92 - B tt+
ASP 137 - A tt+
ASP 137 - B tt+
PRO 154 - A tc+
PRO 154 - B tc+
ALA 210 - A tt+
ALA 210 - B tt+
LEU 391 - A tt+
LEU 391 - B tt+
MSE 406 - A tt+
MSE 406 - B tt+
ARG 584 - A tt+
ARG 584 - B tt+
LYS 627 - A tt+
LYS 627 - B tt+
These screenshots show the PDB conformation for either the A or the B chain.
The WHAT_CHECK validation report
flags Asp 137 and Pro 154 for having unusual C-N-Cα bond angles
and Leu 391 for having unusual torsion angles. Furthermore, Gly 92,
Ala 210 and Leu 391 have unusual φ/ψ combinations. The buried
hydrogen bond donor Gly 92 N (see figure above) is picked up as well.
10 cycles of TLS refinement and 50 cycles of restrained refinement in
REFMAC with these parameters result
in the following re-refinement statistics:
Re-refinement statistics
CC: reciprocal-space correlation coefficient
Structure | R-work reported | R-free reported | R-work initial | R-free initial | R-work final | R-free final | Work CC initial | Free CC initial | Work CC final | Free CC final | Work CC Z-score | Free CC Z-score |
1hi8 | 0.280 | 0.316 | 0.2760 | 0.3082 | 0.2579 | 0.2868 | 0.8893 | 0.8636 | 0.9025 | 0.8836 | 11.84 | 3.46 |
1hi8 rebuilt | N/A | N/A | 0.2749 | 0.3044 | 0.2556 | 0.2825 | 0.8903 | 0.8660 | 0.9043 | 0.8864 | 12.72 | 0.07 |
Even though REFMAC automatically performs the Ala 210 tt+ flip in 1hi8
in both chain A and B even without rebuilding, all global
reciprocal-space metrics improve when the incorrect peptides are
corrected. The following image shows the improvement in real-space
correlation coefficient (RSCC) of rebuilding/re-refinement over
re-refinement only, in the 5 residues before and 5 residues after the
flipped peptide bonds in chain A. The RSCC is calculated by the Perl
script edstats.pl, a wrapper around the CCP4 program edstats
(Tickle, 2012).
The RSCC values are weighted by the number of grid points that are
covered by the groups of main-chain (solid lines) and main-chain +
side-chain (dashed lines) atoms. The grey dashed line indicates zero
change in RSCC between final re-built and re-refined 1hi8 and 1hi8
deposited in the PDB.
The corrections lead to a positive RSCC difference around most flipped
peptide bonds, which indicates that the local fit is improved.
The improvement is dominated by the increase in RSCC of the backbone
atoms. As expected, the fit around Ala 210 is not better in the
rebuilt/re-refined structure than in the re-refined structure because
the conformation is correct in both structures.
1i6n
The crystal structure of Bacillus subtilis IolI (1i6n)
has been solved at 1.80 Å. The reported R-/R-free
factors are 0.201/0.238. CISPEP records are absent in the PDB file.
The following trans → cis NH-flip will correct many local problems:
1i6n ALA 67 - A tc-
The local backbone around Ala 67 is also wrong in 1i60 (see above), the
model that was used to solve 1i6n by molecular replacement. The
difference density shows many peaks around Asn 66, Ala 67 and Leu 68
from the PDB model (first picture).
The WHAT_CHECK validation report flags the unusual backbone
conformation and bond angles. Re-refinement in REFMAC of the original
PDB structure with these parameters improves the backbone significantly
(pictures 2 and 3). However, ω is 72° after re-refinement, and some
difference density is still present, probably because REFMAC does not
yet treat the peptide as a cis-peptide. Re-refinement of the flipped
structure removes all difference density around the flipped peptide
bond (picture 4). Before re-refinement, the better fit of the rebuilt
structure is visible in the global reciprocal-space metrics. After
re-refinement, the global metrics are virtually identical, while the
local real-space fit is only optimal after re-refinement of the rebuilt
structure (compare pictures 2 and 4).
Clearly, a single flip does not make a summer.
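An ω of 72° lies between the two regular conformations, which is why the peptide is still not a proper cis-peptide after re-refinement. The sketch below computes a torsion angle from four atom positions (Cα, C, N, Cα) and classifies it. The 30° tolerance and the function names are assumptions chosen for illustration; they are not the thresholds used by WHAT_CHECK or REFMAC.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def dihedral(p0, p1, p2, p3):
    """Signed torsion angle in degrees defined by four points."""
    b0 = _sub(p1, p0)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    n1 = _cross(b0, b1)
    n2 = _cross(b1, b2)
    x = _dot(n1, n2)
    y = math.sqrt(_dot(b1, b1)) * _dot(b0, n2)
    return math.degrees(math.atan2(y, x))


def classify_omega(omega, tolerance=30.0):
    """Label omega as cis, trans or distorted (tolerance is an assumed cutoff)."""
    magnitude = abs(omega)
    if magnitude <= tolerance:
        return "cis"
    if magnitude >= 180.0 - tolerance:
        return "trans"
    return "distorted"


print(classify_omega(72.0))   # omega of 1i6n Ala 67 after re-refinement only
print(classify_omega(180.0))
```

With this assumed cutoff, 72° is reported as distorted: neither the deposited trans conformation nor a regular cis-peptide.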
Re-refinement statistics
CC: reciprocal-space correlation coefficient
Structure | R-work reported | R-free reported | R-work initial | R-free initial | R-work final | R-free final | Work CC initial | Free CC initial | Work CC final | Free CC final | Work CC Z-score | Free CC Z-score |
1i6n | 0.201 | 0.238 | 0.1946 | 0.2280 | 0.1762 | 0.2089 | 0.9294 | 0.9039 | 0.9394 | 0.9170 | 10.19 | 2.26 |
1i6n rebuilt | N/A | N/A | 0.1929 | 0.2257 | 0.1761 | 0.2089 | 0.9305 | 0.9051 | 0.9393 | 0.9169 | 9.03 | 2.05 |
The better local fit of the rebuilt structure is also shown by the
extra increase in real-space correlation coefficient:
2z81
The crystal structure of the Toll-like receptor 2 (2z81) has been
solved at 1.80 Å and refined to reported R-/R-free factors of
0.213/0.232. CISPEP records are absent in the PDB file.
One Pro tc+ flip and five tt+ flips are necessary in 2z81:
2z81 SER 42 - A tt+
2z81 ARG 87 - A tt+
2z81 GLY 384 - A tt+
2z81 GLN 396 - A tt+
2z81 PRO 540 - A tc+
2z81 ARG 541 - A tt+
These screenshots show their PDB conformation.
The WHAT_CHECK validation report lists τ (N-Cα-C) angle problems
around these bonds as well as poor φ/ψ combinations. Arg 87, Gly 384
and Gln 396 are listed as buried unsatisfied hydrogen bond donors.
The re-refinement statistics after 10 cycles of TLS refinement and
50 cycles of restrained refinement in REFMAC with these parameters are
slightly better for the rebuilt/re-refined structure than for the
re-refined-only structure:
Re-refinement statistics
CC: reciprocal-space correlation coefficient
Structure | R-work reported | R-free reported | R-work initial | R-free initial | R-work final | R-free final | Work CC initial | Free CC initial | Work CC final | Free CC final | Work CC Z-score | Free CC Z-score |
2z81 | 0.213 | 0.232 | 0.2088 | 0.2224 | 0.1747 | 0.2135 | 0.9418 | 0.9381 | 0.9594 | 0.9435 | 31.30 | 1.84 |
2z81 rebuilt | N/A | N/A | 0.2056 | 0.2181 | 0.1733 | 0.2094 | 0.9437 | 0.9403 | 0.9602 | 0.9457 | 30.12 | 1.91 |
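To make "slightly better" concrete, the final statistics of the two 2z81 rows above can be subtracted directly. This is a small illustrative sketch; the dictionary keys and the `improvement` helper are hypothetical names, and the numbers are copied from the table.

```python
# Final re-refinement statistics copied from the 2z81 table above.
stats = {
    "2z81": {"r_work": 0.1747, "r_free": 0.2135, "work_cc": 0.9594, "free_cc": 0.9435},
    "2z81 rebuilt": {"r_work": 0.1733, "r_free": 0.2094, "work_cc": 0.9602, "free_cc": 0.9457},
}


def improvement(original, rebuilt):
    """Return rebuilt-minus-original differences, rounded to 4 decimals."""
    return {key: round(rebuilt[key] - original[key], 4) for key in original}


delta = improvement(stats["2z81"], stats["2z81 rebuilt"])
for key, value in delta.items():
    print(f"{key:8s} {value:+.4f}")
```

Negative differences for the R-factors and positive differences for the correlation coefficients both favour the rebuilt structure; here R-free drops by 0.0041 and the free CC rises by 0.0022.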
The Arg 541 tt+ flip is automatically performed by REFMAC during
refinement of the original 2z81 structure. Therefore, the real-space
correlation coefficient is not higher for the rebuilt/re-refined
Arg 541. The other RSCC values are, however, much higher if the
necessary peptides are flipped in the 2z81 structure prior to
re-refining it. The side-chain atoms of Pro 540 are also much better
modeled in the rebuilt and re-refined structure:
1pe9
The crystal structure of pectate lyase A (1pe9) has been solved at
1.60 Å and refined to reported R-/R-free factors of 0.198/0.213.
One CISPEP record is present in the PDB file:
CISPEP 1 ALA B 242 PRO B 243 0 -0.11
However, there are two molecules of pectate lyase A in the asymmetric
unit, and the same Ala-Pro dipeptide needs to be flipped in chain A:
1pe9 PRO 243 - A tc+
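Whether a predicted flip is already annotated can be checked by comparing it with the CISPEP records. The sketch below assumes the whitespace-collapsed form of the record quoted above and no insertion codes; real PDB files use fixed columns, so a production parser should slice by column instead. The function names are hypothetical. In this example the flip line names the second residue of the cis peptide (Pro 243), so the comparison uses that residue.

```python
def parse_cispep(record: str):
    """Parse a whitespace-collapsed CISPEP record into its two residues.

    Assumption: fields are separated by whitespace and there are no
    insertion codes, as in the record quoted in the text.
    """
    fields = record.split()
    if not fields or fields[0] != "CISPEP":
        raise ValueError("not a CISPEP record")
    _, _serial, name1, chain1, num1, name2, chain2, num2, _model, omega = fields
    return {
        "first": (name1, chain1, int(num1)),
        "second": (name2, chain2, int(num2)),
        "omega": float(omega),
    }


def is_annotated_cis(flip_residue, cispep_records):
    """True if (name, chain, number) is the second residue of an annotated cis bond."""
    return any(rec["second"] == flip_residue for rec in cispep_records)


records = [parse_cispep("CISPEP 1 ALA B 242 PRO B 243 0 -0.11")]
print(is_annotated_cis(("PRO", "B", 243), records))  # chain B: annotated
print(is_annotated_cis(("PRO", "A", 243), records))  # chain A: missing
```

Only chain B is covered by the deposited record, which matches the tc+ prediction for Pro 243 in chain A.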
These screenshots show both chains. There are still some difference
density peaks around Pro 243 B, but they vanish after re-refinement.
The WHAT_CHECK validation report lists an unusually small Cδ-N-Cα
angle for Pro 243 A and unusually small τ (N-Cα-C) angles for Ala 242
and Arg 244.
Furthermore, the improper dihedral of the Pro 243 nitrogen deviates
almost 10 σ from normal values, indicating distorted chirality.
The puckering amplitude is also very high and the puckering phase is
unusual. The torsion angles around Pro 243 are also unusual and
many bumps are present.
The following re-refinement statistics were obtained after 10 cycles of
TLS refinement and 50 cycles of restrained refinement in REFMAC with
these parameters:
Re-refinement statistics
CC: reciprocal-space correlation coefficient
Structure | R-work reported | R-free reported | R-work initial | R-free initial | R-work final | R-free final | Work CC initial | Free CC initial | Work CC final | Free CC final | Work CC Z-score | Free CC Z-score |
1pe9 | 0.198 | 0.213 | 0.1906 | 0.2078 | 0.1619 | 0.1815 | 0.9482 | 0.9432 | 0.9598 | 0.9541 | 28.20 | 5.48 |
1pe9 rebuilt | N/A | N/A | 0.1894 | 0.2065 | 0.1612 | 0.1816 | 0.9489 | 0.9439 | 0.9601 | 0.9544 | 27.51 | 5.32 |
The re-refinement statistics are virtually the same for the flipped and
original 1pe9 structures. After re-refinement, the original structure is
almost fixed (picture 1), but ω is still 77°. Only
re-refinement with the flipped Ala-Pro peptide resolves the difference
density problems (picture 3):
There is no extra improvement in chain B, but in chain A an extra
increase in RSCC is visible at Ala 242 - Pro 243.