Monday 28 March 2016

Goldilocks: census your genomes

Goldilocks is a new tool written by Sam Nicholls for counting interesting properties of genomes. It's very easy to install ("pip install goldilocks") and has a detailed user manual.

So, let's have a look at GC content across each of the chromosomes of Sorghum. Sorghum is a plant that is a reasonably close relative of Miscanthus, which is extensively studied here in Aberystwyth. I downloaded the chromosome assembly of Sorghum from RefSeq. Here's the plot, showing GC content on the y-axis and position along the chromosome on the x-axis. The 10 Sorghum chromosomes are all shown stacked up in one plot panel.
The Python code to produce this plot with Goldilocks is as simple as:

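from goldilocks import Goldilocks
from goldilocks.strategies import GCRatioStrategy

# Map a sample name to its FASTA index (.fai) file
sequence_data = {
    "sorghum": {"file": "./sorghum.fna.fai"},
}
# Census GC ratio in 500 Kbp windows, starting a new window every 1 Mbp
g = Goldilocks(GCRatioStrategy(), sequence_data, length="500K",
               stride="1000K", is_faidx=True)
g.plot("sorghum", title="GC content of sorghum chromosomes")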

The dip in GC for the centromere of each chromosome is obvious, except for chromosomes 2 and 6.
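
To make the length and stride arguments a bit more concrete, here is a rough pure-Python sketch of windowed GC counting. It is not Goldilocks's own code: parse_size and gc_ratio_windows are names made up for illustration, and the sketch assumes that stride is the distance between the starts of successive windows and that the ratio is taken over the full window length (Ns included). Goldilocks may well handle these details differently.

```python
def parse_size(size):
    """Turn a size such as "500K" or "1M" into a number of bases (hypothetical helper)."""
    units = {"K": 1_000, "M": 1_000_000}
    size = str(size).upper()
    if size[-1] in units:
        return int(float(size[:-1]) * units[size[-1]])
    return int(size)


def gc_ratio_windows(seq, length, stride):
    """Yield (start, gc_ratio) for every full window of `length` bases,
    moving `stride` bases between window starts (assumed semantics)."""
    length, stride = parse_size(length), parse_size(stride)
    seq = seq.upper()
    for start in range(0, len(seq) - length + 1, stride):
        window = seq[start:start + length]
        gc = window.count("G") + window.count("C")
        # Assumption: divide by the window length, so Ns lower the ratio
        yield start, gc / length


# Toy sequence: a GC-rich window, an AT-only window, then a mixed one
toy = "GGCC" + "ATAT" + "GCAT"
print(list(gc_ratio_windows(toy, 4, 4)))
```

Under this reading, a length of "500K" with a stride of "1000K" gives non-overlapping windows that sample every other 500 Kbp stretch of each chromosome.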

A similar but inverted pattern can be seen if we look at the number of Ns along the genome:

So, what's different about the centromeres in chromosomes 2 and 6? Why are they not so visible? Another way to spot them would be to look for a motif known to be in the centromeres. Centromeres have many repeats, and a repeat region known to be found in sorghum centromeres is CEN38. Let's choose a short motif from the sequence for CEN38, say "CCTAATG", and census that.
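
The idea of a motif census is easy to sketch in plain Python: slide a window along the sequence and count how often the motif turns up in each one. This is only an illustration of the idea, not how Goldilocks does it; count_motif and motif_census are hypothetical names, only the forward strand is searched, and motifs that straddle a window boundary are missed.

```python
def count_motif(seq, motif):
    """Count occurrences of `motif` in `seq`, including overlapping ones."""
    count = 0
    pos = seq.find(motif)
    while pos != -1:
        count += 1
        # Restart one base further on so overlapping hits are counted too
        pos = seq.find(motif, pos + 1)
    return count


def motif_census(seq, motif, length, stride):
    """Return [(window_start, count), ...] for every full window.

    Assumption: `length` and `stride` are plain integers (bases), and
    `stride` is the distance between successive window starts.
    """
    seq, motif = seq.upper(), motif.upper()
    return [
        (start, count_motif(seq[start:start + length], motif))
        for start in range(0, len(seq) - length + 1, stride)
    ]


# Toy sequence with one copy of the motif in the middle window
toy = "A" * 10 + "CCTAATG" + "TTT" + "A" * 10
print(motif_census(toy, "CCTAATG", length=10, stride=10))
```

A short motif like this one will also occur by chance outside the repeat, so it is the windows with unusually high counts that are worth a closer look.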

There's clearly plenty of this motif in chromosomes 2 and 6, located where we might expect a centromere to be (there's also lots of it in the centromeres of chr 3 and chr 5). But it's not found in all chromosomes. Could it be that CEN38 varies its sequence in the other chromosomes, and so doesn't contain precisely that motif? Or that too many Ns in the other chromosomes stop CEN38 being characterised?

This is just a simple demonstration of how Goldilocks can be used to explore questions. And questions lead to more questions, and then many a happy hour can be spent browsing your genomes. Goldilocks can also be used to export details about which regions are the most interesting (hence the name: it finds regions that are "just right", for whatever your "just right" criterion might be).
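
To give a flavour of what picking out "just right" regions means, here is a small sketch that ranks already-censused windows either by highest value or by closeness to a target value. It does not use Goldilocks's own querying or export features; rank_windows and the example numbers are invented purely for illustration.

```python
def rank_windows(windows, target=None, top=3):
    """Rank (chrom, start, value) windows.

    With no target, the highest values come first; with a target, the
    windows whose value is closest to it come first.
    """
    if target is None:
        def key(window):
            return -window[2]
    else:
        def key(window):
            return abs(window[2] - target)
    return sorted(windows, key=key)[:top]


# Made-up census values, purely for illustration
windows = [
    ("chr1", 0, 0.42),
    ("chr1", 500, 0.31),
    ("chr2", 0, 0.55),
    ("chr2", 500, 0.40),
]
print(rank_windows(windows, top=2))               # highest values first
print(rank_windows(windows, target=0.40, top=1))  # closest to 0.40
```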

Enjoy browsing your genomes! Goldilocks paper, Goldilocks docs, Goldilocks source code.