Answer:


The data set you need is a large set of proteins with known active sites.
You then have to determine how often each residue type occurs at the surface across the whole set. Suppose you find 7.2% Ala, 1.3% Cys, 5.5% Asp, etc.; this is your null model. If you now look at a new protein, you expect any local area of its surface to have roughly this same distribution.
Now you count the residue types in the active site pockets. Of course, it requires a bit of thinking about where the active site stops and the 'rest' begins, but that is a matter of definition. The number of active site residues is going to be much lower than the number of surface residues, so both rows of 20 counts must be converted to frequencies (each row summing to 1) before they can be compared. After that, you divide the active-site frequency by the surface frequency for each residue type and take the logarithm; those 20 logarithms are your scores. A positive score means the residue type is enriched in active sites, a negative score means it is depleted.
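The counting and scoring steps can be sketched in a few lines of Python. This is only a minimal illustration: the function name `log_odds_scores`, the one-letter residue strings used as input, and the pseudocount (added so that a residue type never seen in an active site does not give log(0)) are all my own assumptions, not part of any particular tool.

```python
import math
from collections import Counter

# The 20 standard amino acids, one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def log_odds_scores(surface_residues, site_residues, pseudocount=1.0):
    """Return a dict {residue: log(f_site / f_surface)}.

    surface_residues: all surface residues in the data set (null model).
    site_residues:    all active-site residues in the data set.
    pseudocount:      assumption; avoids log(0) for unobserved residue types.
    """
    surface_counts = Counter(surface_residues)
    site_counts = Counter(site_residues)

    # Totals over the 20 standard types, including the pseudocounts,
    # so that each row of frequencies sums to 1.
    n_types = len(AMINO_ACIDS)
    surface_total = sum(surface_counts[aa] for aa in AMINO_ACIDS) + pseudocount * n_types
    site_total = sum(site_counts[aa] for aa in AMINO_ACIDS) + pseudocount * n_types

    scores = {}
    for aa in AMINO_ACIDS:
        f_surface = (surface_counts[aa] + pseudocount) / surface_total
        f_site = (site_counts[aa] + pseudocount) / site_total
        scores[aa] = math.log(f_site / f_surface)
    return scores
```

Converting both rows to frequencies before dividing is what takes care of the very different sizes of the two sets.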
Now, if you have a protein with a nice dent in its surface, you can count the residue types present at the surface of this dent, and perhaps also count the homologous positions in a couple dozen homologs to get better statistics. But however you count, the resulting score is related to the likelihood that your protein's dent is not just a dent, but an active site.
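One possible way to turn the per-residue scores into a score for a dent is sketched below. Summing the log scores over the residues lining the dent treats the positions as independent, and averaging over homologs is just one way to pool them; both are assumptions on my part, as are the names `pocket_score` and `mean_pocket_score` and the `scores` dictionary of per-residue log scores.

```python
def pocket_score(pocket_residues, scores):
    """Sum the per-residue log scores over the residues lining a dent.

    pocket_residues: one-letter codes of the residues at the dent surface.
    scores:          dict {residue: log score}; unknown letters are skipped.
    """
    return sum(scores[aa] for aa in pocket_residues if aa in scores)

def mean_pocket_score(homolog_pockets, scores):
    """Average the dent score over the equivalent dents of several homologs."""
    totals = [pocket_score(pocket, scores) for pocket in homolog_pockets]
    return sum(totals) / len(totals)
```

A clearly positive total suggests that the dent looks more like an active site than like average surface; where to put a cutoff would have to be calibrated on proteins with known active sites.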