
Monday, 28 March 2016

Goldilocks: census your genomes

Goldilocks is a new tool written by Sam Nicholls for counting interesting properties of genomes. It's very easy to install ("pip install goldilocks") and has a detailed user manual.

So, let's have a look at GC content across each of the chromosomes of Sorghum. Sorghum is a reasonably close relative of Miscanthus, a grass extensively studied here in Aberystwyth. I downloaded the chromosome assembly of sorghum from RefSeq. Here's the plot, showing GC content on the y-axis and position along the chromosome on the x-axis. The 10 Sorghum chromosomes are shown stacked in one plot panel.
The Python code using Goldilocks to produce this plot is as simple as:

from goldilocks.goldilocks import Goldilocks
from goldilocks.strategies import GCRatioStrategy

sequence_data = {
    "sorghum": {"file": "./sorghum.fna.fai"},
}
g = Goldilocks(GCRatioStrategy(), sequence_data, length="500K",
               stride="1000K", is_faidx=True)
g.plot("sorghum", title="GC content of sorghum chromosomes")

The dip in GC for the centromere of each chromosome is obvious, except for chromosomes 2 and 6.

A similar but inverted pattern can be seen if we look at the number of Ns along the genome:

So, what's different about the centromeres in chromosomes 2 and 6? Why are they not so visible? Another way to spot them would be to look for a motif known to be in the centromeres. Centromeres have many repeats, and a repeat region known to be found in sorghum centromeres is CEN38. Let's choose a short motif from the sequence for CEN38, say "CCTAATG", and census that.
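Goldilocks does the windowed counting for us, but the underlying census is easy to sketch in plain Python. Here's a minimal, illustrative version (not Goldilocks's actual implementation) that counts overlapping occurrences of a motif in fixed windows along a sequence:

```python
# A windowed motif census in plain Python (illustrative only; Goldilocks
# does this, with its strategy classes, over faidx-indexed FASTA files).
def census_motif(sequence, motif, window=500_000, stride=1_000_000):
    """Count occurrences of motif in fixed windows along sequence."""
    counts = []
    for start in range(0, len(sequence), stride):
        window_seq = sequence[start:start + window]
        # Count overlapping matches by sliding one base at a time.
        n = sum(1 for i in range(len(window_seq) - len(motif) + 1)
                if window_seq[i:i + len(motif)] == motif)
        counts.append((start, n))
    return counts

print(census_motif("CCTAATGCCTAATG", "CCTAATG", window=14, stride=14))  # [(0, 2)]
```

Plotting the per-window counts against the window start position gives exactly the kind of track shown above.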

There's clearly plenty of this motif found in chromosomes 2 and 6, and found where we might expect a centromere to be (also lots of this motif in the centromeres of chr 3 and chr 5 too). But it's not found in all chromosomes. Could it be that CEN38 varies its sequence in the other chromosomes, and so doesn't have precisely that motif? Or that too many Ns in the other chromosomes stop CEN38 being characterised?

This is just a simple demonstration of how Goldilocks can be used to explore questions. And questions lead to more questions, and then many a happy hour can be spent browsing your genomes. Goldilocks can also be used to export details about which regions are the most interesting (hence the name: it finds regions that are "just right", for whatever your "just right" criterion might be).

Enjoy browsing your genomes! Goldilocks paper, Goldilocks docs, Goldilocks source code.

Tuesday, 10 March 2015

Python for Scientists

This year, 2014/2015, we started a new MSc course: Statistics for Computational Biology. We can see that there's a huge demand for bioinformaticians, for statisticians who can read biology, and for programmers who know about statistics and can apply stats to biological problems. So this new MSc encompasses programming, statistics and loads of the current hot topics in biology. It's the kind of MSc I would have loved to have done when I was younger.

As part of this degree, I'm teaching a brand new module called Programming for Scientists, which uses the Python programming language. This is aimed at students who have no prior programming knowledge, but have some science background. And in one semester we teach them the following:
  • The basics of programming: variables, loops, conditionals, functions
  • File handling (including CSV)
  • Plotting graphs using matplotlib
  • Exceptions
  • Version control using Git/Github
  • SQL databases (basic design, queries, and using SQLite from Python)
  • XML processing
  • Accessing data from online APIs 
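To give a flavour of the exercises (an illustrative sketch, not actual course material), a typical task combines several of those topics: read a CSV file, handle malformed rows with exceptions, and compute a summary:

```python
import csv
import io

# Illustrative exercise in the spirit of the module (not actual course
# material): summarise a CSV column, using exceptions to skip bad rows.
def mean_column(csv_text, column):
    """Mean of a numeric column, ignoring rows that fail to parse."""
    total, count = 0.0, 0
    for row in csv.DictReader(io.StringIO(csv_text)):
        try:
            total += float(row[column])
            count += 1
        except (KeyError, ValueError):
            continue  # malformed row: skip rather than crash
    return total / count if count else None

data = "gene,expression\nabc1,2.5\nxyz2,oops\ndef3,3.5\n"
print(mean_column(data, "expression"))  # 3.0
```

The same skeleton then feeds naturally into the matplotlib and SQLite topics later in the semester.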
We taught it as a hands-on module: lectures held in a room full of computers, programming along as we went through the slides, with exercises interspersed and demonstrators on hand to help.


We had students sign up for this module from a surprisingly diverse set of backgrounds, from biology, from maths, from geography and even from international politics. We also had a large number of staff and PhD students from our Biology department (IBERS) who wanted to sit in on the module. This was a wonderful group of students to teach. They're people who wanted to learn, and mostly just seemed to absorb ideas that first year undergraduates struggle with. They raised their game to the challenge. 

Python's a great language for getting things done, so it makes a good hands-on language. However, it did highlight many of Python's limitations as a first teaching language. The objects/functions issue: I chose not to introduce the idea of objects at all. It's hard enough fitting this much material comfortably into the time we had, and objects, classes and subclasses were something I chose to leave out. So we have two ways to call functions: len(somelist) and somelist.reverse(). That's unfortunate. Variable scoping caught me out on occasion, and I'll have to fix that for next year. The Python 2 vs Python 3 issue was also annoying to work around. Hopefully next year we can just move to Python 3.
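The inconsistency is easy to show in a few lines: without the idea of objects, there's no good way to explain why some operations are free functions and others are methods, or why some mutate in place and return None:

```python
# The two call styles a beginner meets on day one: free functions
# that return values, and methods that mutate in place and return None.
nums = [3, 1, 2]
print(len(nums))      # free function: 3
nums.reverse()        # method: reverses in place, returns None
print(nums)           # [2, 1, 3]
print(sorted(nums))   # free function returning a NEW list: [1, 2, 3]
nums.sort()           # method sorting in place
print(nums)           # [1, 2, 3]
```

Without classes on the syllabus, `somelist.reverse()` has to be taught as "just how lists work", which is exactly the awkwardness described above.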


What impressed me most was the quality of the final assignment work. We asked the students to analyse a large amount of data about house sales, taken from http://data.gov.uk/ and population counts for counties in England and Wales taken from the Guardian/ONS. They had to access the data as XML over a REST-ful API, and it would take them approximately 4 days to download all the data they'd need. We didn't tell them in advance how large the data was and how slow it would be to pull it from an API. Undergrads would have complained. These postgrads just got on with it and recognised that the real world will be like this. If your data is large and slow to acquire then you'll need to test on a small subset, check and log any errors and start the assignment early. The students produced some clean, structured and well commented code and many creative summary graphs showing off their data processing and data visualisation skills.
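The robustness pattern the assignment rewarded (test on a subset, log errors, retry on failure) can be sketched with the standard library alone. The URL scheme and element names here are hypothetical placeholders, not the actual data.gov.uk API:

```python
import logging
import time
import urllib.request
import xml.etree.ElementTree as ET

logging.basicConfig(level=logging.INFO)

# A sketch of the pattern, not the actual assignment code: fetch XML
# with retries and logging, then parse whatever arrived.
def fetch_xml(url, retries=3, delay=5):
    """Fetch a URL, retrying on failure and logging each error."""
    for attempt in range(1, retries + 1):
        try:
            with urllib.request.urlopen(url, timeout=30) as resp:
                return ET.fromstring(resp.read())
        except Exception as exc:
            logging.warning("attempt %d/%d failed for %s: %s",
                            attempt, retries, url, exc)
            time.sleep(delay)
    return None  # caller decides how to handle a dead endpoint

def sale_prices(root):
    """Pull prices out of a parsed document (hypothetical schema)."""
    return [int(e.text) for e in root.iter("price")]
```

Running the parsing half against a saved sample file first, before pointing the fetcher at the full API, is exactly the "test on a small subset" discipline the students worked out for themselves.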

I hope they're having just as much fun on their other modules for this course. I'm really looking forward to running this one again next year.

Wednesday, 25 January 2012

We need more than one programming language

I teach Haskell as a programming language to our undergraduates. I'm sure it's a continual subject of debate in computer science department coffee rooms up and down the country: "Which programming languages should we teach in our CS degrees?" The module that I teach is called "Concepts in Programming", and the idea is that there are indeed concepts in programming that are fundamental to many languages, that you can articulate the differences between programming languages and that these differences give each language different characteristics. Differences such as the type system, the order of evaluation, the options for abstraction, the separation of data and functions.

"We need more than one" is the title of a paper by Kathleen Fisher, a Professor in Computer Science at Tufts University. Her short, eloquent paper describes why we will never have just one programming language ("because a single language cannot be well-suited to all programming tasks"). She has had a career spanning industry and academia, has been the chair of the top programming language conferences (OOPSLA, ICFP), and has been the chair of the ACM Special Interest Group on Programming Languages. Her paper On the Relationship between Classes, Objects and Data Abstraction tells you everything you ever needed to know about objects. She knows about programming.

Her recent work has been looking at that problem of how to read in data files when the data is in some ad-hoc non-standard representation (i.e. not XML with a schema, or CSV or anything obvious). So we all have to write parsers/data structures to fit these data files. We write a program in a language such as Perl. She says "The program itself is often unreadable by anyone other than the original authors (and usually not even them in a month or two)". I've been there and done that.
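A tiny example of the kind of throwaway parser in question, for a made-up record format ("name|start..end|score", one record per line), is a minimal sketch of what we all keep rewriting:

```python
# The kind of ad-hoc parser the paper is talking about, for a
# hypothetical pipe-delimited format: "name|start..end|score".
def parse_records(text):
    """Parse one record per line into a list of dicts."""
    records = []
    for line in text.strip().splitlines():
        name, span, score = line.split("|")
        start, end = span.split("..")
        records.append({"name": name,
                        "start": int(start),
                        "end": int(end),
                        "score": float(score)})
    return records

sample = "geneA|100..250|0.91\ngeneB|300..480|0.47\n"
print(parse_records(sample)[0]["end"])  # 250
```

Every such format needs its own variant of this code, with its own undocumented quirks; a month later even the author struggles to remember which quirks mattered, which is precisely the problem the data description language work addresses.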

And when we've written another new parser like this for the umpteenth time (especially in the field of bioinformatics) we begin to wonder "How many programs like this will I have to write?" and "Are they all the same in the end?" and Kathleen's paper The Next 700 Data Description Languages looks at just that question. What is the family of languages for processing data and what properties do they have? I love the title of this paper because it instantly intrigues by its homage to the classic 1966 paper The Next 700 Programming Languages by Peter Landin. (To see why Peter Landin's work on programming languages was so important, the in memoriam speech for Peter Landin given at ICFP 2009 is well worth a listen, and the last three minutes of Rod Burstall's speech discuss this paper in particular.)

So perhaps, in computer science, we're doomed to keep inventing new programming languages, which wax and wane in popularity over time: Lisp, Fortran, C, C++, Java, Ruby, Haskell, F#, Javascript and so on. But as computer scientists we should be able to understand why this happens. They're all necessary, all useful and all part of a bigger theoretical picture. We need more than one.


This is my entry for the BCSWomen Blog Carnival.