Saturday, 3 March 2012

De Bruijn graphs

De Bruijn graphs are currently a cornerstone of several genome sequence assembly algorithms. Nicolaas de Bruijn was a Dutch mathematician who died this year in February 2012. In 1946, he created the idea of the graph that is now named after him. The picture of him here is Copyright:MFO, and others can be seen at the Oberwolfach photo collection. I like to think he might be drawing graphs while resting on this bench. It looks like there's a football beside him, though he's hardly dressed for a game of football, in a tweed suit and tie.

The nodes of de Bruijn graphs are short sequences. There is an edge between two nodes if you could convert one sequence into another by shifting it along by one character and adding a new character at the end. So the nodes "AACAA" and "ACAAB" are allowed to be joined by an edge. In Haskell, you might say that tail node1 == init node2. In Python you might say node1[1:] == node2[:-1]. A very readable article in Nature Biotechnology describes how these graphs are used in sequence assembly. Basically, a subsequence size is chosen (for example, 32 characters), and all subsequences of this size that are found in your read fragment data then become nodes in a de Bruijn graph. We search for a good path through this graph after cleaning up some of the noise and errors and deciding what to do about loops. The path through the graph will tell you the final sequence.

At the moment, sequence assembly is still a big problem. A new genome is sequenced as a collection of short overlapping sequence pieces (short is usually anywhere between 32 and 500 bases at the moment), which then have to be painstakingly pieced together. There are errors in the sequencing, and repetitive regions, and this complicates the problem. The size of the data is also a problem. Plant genomes can have around 20 billion bases, so working with files and having enough memory to store data structures adds to the problem.

So much must have changed in de Bruijn's lifetime. Back in 1946 the structure of DNA was not clear. Watson and Crick's paper in 1953 was yet to come. Computers were yet to come. He wouldn't have known that more than 60 years later we'd be using his graphs to assemble whole genomes that were sequenced using the detection of fluoresence. What will we be doing 60 years from now that links maths, computer science, physics, chemistry and biology?