Answer:


This is the difficult bit. Perhaps the best model would be to take only TM proteins and compare the residues that go through the membrane with the ones (in the same protein) that don't. You could also justify comparing the TM parts with the full protein (in some implementations these two methods will even give the same result except for a scale factor).
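As a rough sketch of the TM versus non-TM split, assuming you already have the membrane-spanning segments annotated for each protein: the function name `split_tm_counts`, the toy sequence and the 0-based, end-exclusive segment convention are all made up for illustration, not taken from any particular database or tool.

```python
from collections import Counter


def split_tm_counts(sequence, tm_segments):
    """Count residues inside and outside the annotated TM segments.

    tm_segments is a list of (start, end) pairs, 0-based with end exclusive
    (an assumed convention; real annotations are often 1-based, inclusive).
    """
    in_tm = [False] * len(sequence)
    for start, end in tm_segments:
        for i in range(start, end):
            in_tm[i] = True

    tm = Counter(aa for aa, flag in zip(sequence, in_tm) if flag)
    non_tm = Counter(aa for aa, flag in zip(sequence, in_tm) if not flag)
    return tm, non_tm


# Toy sequence with one hypothetical TM segment covering positions 3..9.
seq = "MKTLLVAGAILLDE"
tm_counts, non_tm_counts = split_tm_counts(seq, [(3, 10)])
print(tm_counts)
print(non_tm_counts)
```

Adding `tm_counts` and `non_tm_counts` together gives the counts for the full protein, so the same two counters cover both comparisons mentioned above.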

The null model will now be that all residue types (AA) are distributed randomly over the TM and non-TM bits, so each residue type turns up in either bit with its overall background probability: P(ala)=0.073, P(asp)=0.052, etc. All probabilities P added up should give you 1.0. The occurrence of amino acids in nature is given in this table.
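To make the null model concrete, here is a minimal sketch of how expected counts and a log-odds score could be computed from background probabilities. The helper names are hypothetical, and the three-letter background and observed counts are toy numbers chosen so the probabilities sum to 1.0; they are not the real amino acid frequencies, which you would take from the table instead.

```python
import math


def null_model_expected(background, n_residues):
    """Expected count per residue type if residues are placed at random."""
    total = sum(background.values())
    if not math.isclose(total, 1.0, abs_tol=1e-6):
        raise ValueError(f"background probabilities sum to {total}, not 1.0")
    return {aa: p * n_residues for aa, p in background.items()}


def log_odds(observed, expected):
    """Natural log of observed/expected; residue types never observed are skipped."""
    return {
        aa: math.log(observed[aa] / expected[aa])
        for aa in expected
        if observed.get(aa, 0) > 0
    }


# Toy background over a three-letter alphabet (NOT real frequencies).
background = {"A": 0.5, "D": 0.3, "K": 0.2}
# Toy observed counts in the TM bits, 100 residues in total.
observed_tm = {"A": 70, "D": 10, "K": 20}

expected_tm = null_model_expected(background, sum(observed_tm.values()))
scores = log_odds(observed_tm, expected_tm)
print(expected_tm)
print(scores)
```

A positive score means the residue type occurs in the TM bits more often than the null model predicts, a negative score means less often, and a score near zero means it matches the random expectation.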