The data set needed is a large set of proteins with known active sites.
You then have to determine how often each residue type occurs at the surface across
the whole set. Suppose you find 7.2% Ala, 1.3% Cys, 5.5% Asp, etc.; this is your
null model. If you now take any protein, you expect any local patch of its surface to have roughly this
same distribution.
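As a minimal sketch of that null model (assuming you already have, per protein, its surface residues as a one-letter string from whatever solvent-accessibility cutoff you prefer; the function name and input format are just illustrative):

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def background_frequencies(surface_residue_strings):
    """Null model: how often each residue type occurs on protein surfaces.

    `surface_residue_strings` is assumed to be a list with one string of
    one-letter codes per protein, e.g. ["ADKL...", "GGSA...", ...],
    already restricted to surface residues.
    """
    counts = Counter()
    for residues in surface_residue_strings:
        counts.update(residues)
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}
```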
Now you count the residue types in the active site pockets. Of course, it requires a bit of thought as to where
the active site stops and the 'rest' begins, but that is a matter of definition. The number of active
site residues is going to be much lower than the number of surface residues, so these two
rows of 20 numbers must be scaled to each other, i.e. converted to frequencies. After that, you can divide the two rows of
20 numbers by each other and take the logarithm; those logarithms are your scores.
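In code, the scaling and division amount to comparing two frequency dictionaries; this is a sketch, and the pseudocount is my own addition to avoid taking the log of zero for residue types never seen in a pocket:

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def log_odds_scores(active_site_freqs, background_freqs, pseudocount=1e-4):
    """Per-residue score: log of (active-site frequency / surface frequency).

    Both inputs map one-letter codes to frequencies summing to 1; normalizing
    both rows of 20 numbers is what makes them directly comparable.
    """
    return {
        aa: math.log((active_site_freqs.get(aa, 0.0) + pseudocount)
                     / (background_freqs.get(aa, 0.0) + pseudocount))
        for aa in AMINO_ACIDS
    }
```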
Now, if you have a protein with a nice dent in its surface, you can count the residue types
present at the surface of this dent, and perhaps also count the homologous positions in a
couple dozen homologs to get better statistics. But however you count, summing the scores of the residues
you find gives a number related to the likelihood that your protein's dent is not just a dent, but an active site.
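To make that last step concrete, scoring a candidate dent is just summing the per-residue scores (a sketch that reuses the hypothetical score table from above):

```python
def pocket_score(pocket_residues, scores):
    """Sum the log-odds scores over the residues lining a candidate dent.

    A clearly positive total means the dent's composition looks more like the
    active-site distribution than like an average surface patch.
    """
    return sum(scores[aa] for aa in pocket_residues if aa in scores)

# e.g. pocket_score("HDSCG", log_odds_scores(active_freqs, background_freqs))
```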