Thursday, 21 May 2015

How much is enough?

How much is enough? This question seems to crop up very frequently when analysing data. For example:
  • "How much data do I need to label in order to train a machine learning algorithm to recognise place names that locate newspaper articles?"
  • "Is my metagenome assembly good enough or do we need longer/fewer contigs?"
  • "What BLAST/RAPSearch threshold is close enough?"
  • "Are the k-mers long enough or short enough? (for taxon identification, for sequence assembly)"
Sadly there's no absolute answer to any of these. It depends. It depends on what your data looks like and what you want from the result. It also depends on how much time you have. It depends on what question you really wanted to answer. What's the final goal of the work?

Sometimes there are numbers to report, measures that give us an idea of whether the process was good enough, after we've done the expensive computation. We can report various statistics about how good the result is, such as the N50 and its friends for sequence assembly, or the predictive accuracy for a newspaper article place name labeller. Which statistics to report is itself highly questionable. Does a single figure such as the N50 really tell us anything useful about the assembled sequence? It can't tell us which parts were good and which parts were messy. Do we really need lots of long contigs if we're assembling a metagenome? Perhaps the assembly is just an input to many further pipeline stages, and actually, choppy short contigs will do just fine for the next stage.
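To see why the N50 is such a blunt instrument, it helps to see how little goes into computing it. Here's a minimal sketch (the contig lengths are made-up illustrative data, not from any real assembly):

```python
def n50(contig_lengths):
    """N50: the largest length L such that contigs of length >= L
    together cover at least half of the total assembly length."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length

# Toy assembly: total length 32, so we need to cover at least 16
print(n50([2, 2, 2, 3, 3, 4, 8, 8]))  # -> 8
```

One number, summarising an entire distribution of contig lengths; two assemblies with very different quality profiles can share the same N50.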

PAC learning theory was an attempt in 1984 by Leslie Valiant to address the questions about what was theoretically possible with data and machine learning. For what kinds of problem can we learn good hypotheses in a reasonable amount of time (hypotheses that are Probably Approximately Correct)? This led on to the question of how much data is enough to do a good job of machine learning. Some nice blog posts describing PAC learning theory and how much data is needed to ensure low error do a far better job than I could of explaining the theory. However, the basic theory assumes nice clean noise-free data and assumes that the problem is actually learnable (it also tends to overestimate the amount of data we'd actually need). In the real world the data is far from clean, and the problem might never be learnable in the format that we've described or in the language we're using to create hypotheses. We're looking for a hypothesis in a space of hypotheses, but we don't know if the space is sensible. We could be like the drunk looking for his keys under the lamppost because the light is better there.

Perhaps there will be more theoretical advances in the future that tell us what kinds of genomic analysis are theoretically possible, and how much data they'd need, and what parameters to provide before we start. It's likely that this theory, like PAC theory, will only be able to tell us part of the story.

So if theory can't tell us how much is enough, then we have to empirically test and measure. But if we're still not sure how much is enough, then we're probably just not asking the right question.