For FF1: get a set of proteins that have at least one TM helix and a set of
proteins that do not have any TM helices. These proteins must NOT have been used in the design
of FF1! Predict TM helices and count the correctly predicted TM helices, the
missed TM helices, and the over-predicted TM helices. Find a statistical
method that makes your method look better than anybody else's, and publish.
FF2 sounds rather similar, except that you now need only one test set.
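
The counting step for FF1 could be sketched roughly as below. This is only an illustration under assumptions not stated above: helices are given as (start, end) residue ranges, a predicted helix counts as "correct" if it overlaps an observed helix by at least min_overlap residues, and each observed helix can be matched only once. The names overlap, score_helices and min_overlap are made up for this example, and the threshold of 5 residues is arbitrary.

```python
def overlap(a, b):
    """Number of residues shared by two inclusive (start, end) segments."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


def score_helices(observed, predicted, min_overlap=5):
    """Return (correct, missed, over_predicted) for one protein.

    observed, predicted: lists of inclusive (start, end) helix segments.
    min_overlap is an assumed criterion, not something FF1 defines.
    """
    predicted = sorted(predicted)
    used = set()  # indices of predicted helices already matched
    correct = 0
    for obs in sorted(observed):
        for i, pred in enumerate(predicted):
            if i not in used and overlap(obs, pred) >= min_overlap:
                used.add(i)
                correct += 1
                break
    missed = len(observed) - correct        # observed helices with no match
    over_predicted = len(predicted) - correct  # predictions with no match
    return correct, missed, over_predicted


# A protein with three observed TM helices, two of which are found.
tm_result = score_helices([(10, 30), (45, 65), (80, 100)],
                          [(12, 31), (50, 70), (120, 140)])

# A protein without TM helices: every predicted helix is an over-prediction.
soluble_result = score_helices([], [(5, 25)])
```

Summing the three counts over both test sets gives the raw numbers; for the proteins without TM helices only the over-predicted count can be non-zero.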