Answer to a question in a bioinformatics course

Answer:

You need to know the probability of finding an alanine in a helix, in a strand, and in any of the other structure classes (turn, loop, etc.). Let's call these P(Ala,H), P(Ala,S), and P(Ala,R), in which H, S, and R stand for Helix, Strand, and Rest. Because every residue belongs to exactly one of the three classes, P(X,H)+P(X,S)+P(X,R)=P(X) for each of the twenty amino acids X, and all 60 probabilities together sum to 1.0.
So, how big is P(Ala,H)? Under the null model, in which the amino acid type and the structure class are independent of each other, P(Ala,H)=P(Ala)*P(H). Both of those probabilities can be obtained by counting, in the whole data set, all residues, all Ala, and all H. Typical numbers could be: data set size = 407128, number of Ala = 28777, number of H = 122991. This gives us P(Ala) = 28777/407128 = 7.1% and P(H) = 122991/407128 = 30.2%, so that P(Ala,H) = 2.1% (0.021353 before rounding). The expected number of Ala in a helix is then F_pred(Ala,H) = 0.021353 * 407128 = 8693; use the unrounded probability here, because the rounded 0.021 would give about 8550 (check that 8693 is 30.2% of 28777, and check that you understand why that should be). Repeat this for each of the three structure classes and for all 20 amino acids. So you want P(Ala,H), P(Ala,S), P(Ala,R), P(Cys,H), P(Cys,S), etcetera, 60 numbers in total. These probabilities can be converted into F_pred values. And that is your null model.
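The calculation above can be sketched in a few lines of Python. This is only an illustration: the function names expected_count and null_model are made up for this example, and the only counts used are the three given in the text (data set size, number of Ala, number of H). The sketch assumes that every residue is assigned to exactly one structure class.

```python
def expected_count(n_residue, n_class, n_total):
    """Expected count under independence: N * P(residue) * P(class)."""
    p_residue = n_residue / n_total  # e.g. P(Ala)
    p_class = n_class / n_total      # e.g. P(H)
    return n_total * p_residue * p_class


def null_model(residue_counts, class_counts, n_total):
    """Return {(residue, cls): F_pred} for every residue/class combination.

    residue_counts and class_counts are dicts of observed totals,
    for example {"Ala": 28777} and {"H": 122991}.
    """
    return {
        (aa, ss): expected_count(n_aa, n_ss, n_total)
        for aa, n_aa in residue_counts.items()
        for ss, n_ss in class_counts.items()
    }


# Counts quoted in the text above.
N_TOTAL = 407128
N_ALA = 28777
N_HELIX = 122991

f_pred_ala_h = expected_count(N_ALA, N_HELIX, N_TOTAL)
print(round(f_pred_ala_h))  # 8693
```

With the full counts for all 20 amino acids and all three classes, null_model would return the 60 F_pred values in one call.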