Introduction

The PDB can hold hundreds of copies of nearly the same molecule. That is not a bad thing: the hundreds of lysozyme mutant structures, for example, teach us a lot about protein structure and stability.

Bioinformaticians, however, often want to work with a representative dataset. It would be unwise to train a computational method on a dataset in which more than ten percent of the structures are lysozyme, because such a method would learn what lysozyme looks like rather than learn about proteins in general.

The PDB structures stored in the WHAT IF relational database are a representative set of sequence-unique structures generated from the X-ray protein PDB files available at a certain moment.

The procedure used to generate this database is similar to the method designed in the early 1990s by Hobohm and Sander (Protein Science, Volume 3, Issue 3, pages 522-524, 1994), but rather than maximizing the size of the subset, our algorithm focuses on selecting representative structures of the highest available quality. For the selection, an empirical quality value is defined. This is a composite score that depends on the resolution and the R-factor, and it was published (funnily enough) in:

Verification of protein structures: Side-chain planarity.
R.W.W. Hooft, C. Sander and G. Vriend, J. Appl. Cryst. (1996) 29, 714-716.

Here, however, we use a sequence identity cutoff of 30%, and the resolution and R-factor criteria are as indicated with each dataset.

Each structure is identified by the 4-letter PDB identifier, plus a one-letter chain identifier. Structures are ordered by decreasing 'quality value' as described in the article cited above.
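
As an illustration only, the quality-first selection described above can be sketched as a greedy loop: visit the chains in order of decreasing quality value and keep a chain only if its sequence identity to every chain already kept is below the cutoff. This is not the WHAT IF code. The function name, the chain identifiers, the quality values, and the identity percentages are all made up for the example, and treating "exactly at the cutoff" as redundant is an assumption.

```python
from typing import Callable, Dict, List


def select_representatives(
    quality: Dict[str, float],
    identity: Callable[[str, str], float],
    cutoff: float = 30.0,
) -> List[str]:
    """Greedy, quality-first culling (illustrative sketch, hypothetical helper)."""
    # Highest quality first; ties are broken by identifier to keep the result stable.
    ordered = sorted(quality, key=lambda chain: (-quality[chain], chain))
    selected: List[str] = []
    for chain in ordered:
        # Keep the chain only if it is below the identity cutoff (in percent)
        # with respect to every chain that has already been kept.
        if all(identity(chain, kept) < cutoff for kept in selected):
            selected.append(chain)
    return selected


# Placeholder data: 4-character PDB identifier plus one-letter chain identifier.
toy_quality = {"1abcA": 0.9, "2defA": 0.8, "3ghiB": 0.7}
toy_identity = {
    frozenset(("1abcA", "2defA")): 95.0,  # near-identical pair
    frozenset(("1abcA", "3ghiB")): 12.0,
    frozenset(("2defA", "3ghiB")): 15.0,
}


def lookup_identity(a: str, b: str) -> float:
    """Look up the precomputed percent identity for an unordered pair of chains."""
    return toy_identity[frozenset((a, b))]


print(select_representatives(toy_quality, lookup_identity))
# ['1abcA', '3ghiB']
```

In this toy run the second chain is dropped because it is 95% identical to a better-scoring chain that was already kept. The returned list is ordered by decreasing quality value, like the lists provided here.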

In 2014 we started providing lists of non-culled PDB files solved by X-ray. These are listed under 'Without culling'.

And then we realized that we no longer had any users, so the updating stopped. In 2017 we received a request to update the files, so we did. From now on, however, I am no longer doing all the manual cutting and pasting needed to put the results in the big list of lists; instead, the newer results are available grouped per date.
So the directory 2018_aug_8 contains files calculated around that date (the calculations take more than a day). The file names contain the date, resolution, and R-factor.