Friday, 8 September 2017

Some highlights of Genome 10k and Genome Science 2017

Everyone comes away with different highlights from a conference, as we each see it from the perspective of our own research, but here are some of my highlights from Genome 10k and Genome Science 2017. The conference had multiple tracks, so there were many talks I missed. It was also live-tweeted by many under the hashtag #g10kgs2017.
  1. New technology: Nanopore sequencing was mentioned by many speakers, but often from people who were just about to use it for the next step in their research. Long reads were mentioned frequently, and PacBio was still a contender for long read sequencing in talks and posters. Optalysys was advertising a hardware approach to sequence alignment, using light detected after passing through two images representing the sequences ("comparison as fast as the speed of light", except for the time it takes to refresh the images). 
  2. Assembly of long reads and assembly analysis: Those who were using long reads were often using them to produce a whole genome, and were therefore attempting assembly, though many fragments remain even with long reads. Canu was mentioned regularly during the talks, as was FALCON, and miniasm came up in informal chat. John Davey's talk describing the detective work needed to understand the genome of the red algal extremophile Galdieria sulphuraria stood out. After assembly he counted chromosomes by looking for telomeres and for read alignments at chromosome ends (a sketch of the telomere-counting idea follows this list), and still found puzzling questions: Could this 14Mbp organism have 72 chromosomes? Do some of them share regions? 
  3. Haplotyping: Sam Nicholls gave an excellent talk about the Metahaplome and resolving haplotypes in a metagenome. PacBio users now have FALCON-Unzip to phase diploid genomes assembled with FALCON.
  4. Comparative genomics, genome alignments and lineage tracing: After the genomes are assembled, the eukaryote researchers are busy comparing their species with other species (often using Cactus). This seemed to be a common topic. Comparing genomes across species is compute-intensive and expensive. Within-species comparison and alignment was discussed by Bernardo Clavijo, who had many wheat genomes to merge and used skip-mers for the alignment seeds (see the skip-mer sketch after this list). Graph genomes were mentioned briefly but are not yet used. Alternative RNA splicing was a topic of interest for several speakers.
  5. GC content and GC biases: these seem to be responsible for everything, including undersampling in short-read sequencing, missing genes in the fat sand rat, photosynthetic efficiency and the efficacy of natural selection. Steve Kelly's talk was fascinating. Plants need different amounts of nitrogen for photosynthesis, and this corresponded to the GC content and codon usage of their genomes, so photosynthetic efficiency can be predicted from GC content. He went on to describe how increasing atmospheric CO2 will lead to increased mutation and speciation rates in plants.
  6. Animals with superpowers: Researchers studying animals have all the best stories. There were bats that don't get cancer and seem to live forever (well, at least 43 years), mice that can regenerate fur or even limbs after losing them, Tasmanian devils that transmit cancer by biting, and passenger pigeons, once the most abundant bird in the US, migrating in flocks so dense that the sky was darkened, but now extinct.
  7. Sketches by Alex Cagan: He drew each of the talks in the main room at lightning speed, uploading the sketches to Twitter immediately after each talk. The main room concentrated on the eukaryote genomes, so if you're interested in his summaries, see them all at https://twitter.com/search?q=%23g10kgs2017%20atjcagan
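Here is a toy sketch of the telomere-counting idea from item 2: scan both ends of each assembled contig for runs of a telomeric repeat, and count the ends that look telomeric. The motif (TTAGGG), window size and copy-number threshold below are placeholders rather than the values used in the talk, and the real motif for Galdieria would need checking.

```python
import re

TELOMERE_MOTIF = "TTAGGG"   # placeholder motif; telomeric repeats are species-dependent
WINDOW = 1000               # how far into each contig end to look
MIN_COPIES = 5              # how many motif copies count as a telomeric end

def end_is_telomeric(seq_end: str, motif: str = TELOMERE_MOTIF) -> bool:
    """True if this contig end carries enough copies of the motif (either strand)."""
    rc = motif[::-1].translate(str.maketrans("ACGT", "TGCA"))  # reverse complement
    hits = len(re.findall(motif, seq_end)) + len(re.findall(rc, seq_end))
    return hits >= MIN_COPIES

def count_telomeric_ends(contigs: dict) -> int:
    """Count left and right contig ends that look telomeric."""
    total = 0
    for seq in contigs.values():
        total += end_is_telomeric(seq[:WINDOW])
        total += end_is_telomeric(seq[-WINDOW:])
    return total

# A complete chromosome contributes two telomeric ends, so
# count_telomeric_ends(contigs) / 2 gives a rough lower bound on chromosome number.
```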
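And a minimal sketch of skip-mer seed extraction, as mentioned in item 4. My reading of the published definition is that a SkipMer(m, n, k) uses the first m bases of every cycle of n bases, until k bases have been collected (so the 2-of-every-3, k=21 skip-mer spans 31 bp); treat this as an illustration of that idea rather than the reference implementation.

```python
def skipmer_offsets(m: int, n: int, k: int) -> list:
    """Relative positions sampled by one SkipMer(m, n, k)."""
    pos, i = [], 0
    while len(pos) < k:
        if i % n < m:        # use the first m bases of each cycle of n
            pos.append(i)
        i += 1
    return pos

def skipmers(seq: str, m: int = 2, n: int = 3, k: int = 21):
    """Yield the skip-mer starting at every position of seq."""
    offsets = skipmer_offsets(m, n, k)
    span = offsets[-1] + 1   # genomic footprint of one skip-mer
    for start in range(len(seq) - span + 1):
        yield "".join(seq[start + o] for o in offsets)

# SkipMer(2, 3, 6) samples offsets 0,1, 3,4, 6,7 of an 8 bp window:
print(next(skipmers("ACGTACGTACGT", m=2, n=3, k=6)))  # -> "ACTAGT"
```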
Other observations:
  • The sponsors and stall holders were mostly selling lab automation of one kind or another. Clearly, automation is what genomics researchers are likely to buy.
  • "We are a chemostat for microbes" - Lindsay Hall
  • "Strain resolution is not clustering but deconvolution" - Chris Quince
  • PerkinElmer have put a lot of work into understanding the biases of 16S kits for different hypervariable regions. 
  • There were several presentations that confessed "Most of this work was done by my student" (then let your student give the presentation!). There were also several people recommending a useful paper by X, where X was the last author rather than the first author. Give credit to the first author!

Friday, 4 August 2017

Joy plots

Today is the birthday of John Venn (of Venn diagrams fame). He was born on this day, 4th August 1834. So I was thinking about diagrams today and finally got around to looking up code for the trendiest-plots-of-2017, 'Joy plots'. As of July 2017 we now have ggjoy plotting code in R (https://github.com/clauswilke/ggjoy) and they're also beautiful in Python's Seaborn (https://github.com/mwaskom/seaborn/pull/1238). 
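Here's a minimal ridgeline ('joy plot') sketch in Python with seaborn, on made-up data; the parameter names follow recent seaborn versions and may differ in older releases, and ggjoy offers the equivalent in R.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Made-up data: five groups of values with shifted means.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "value": np.concatenate([rng.normal(loc=m, size=200) for m in range(5)]),
    "group": np.repeat([f"g{m}" for m in range(5)], 200),
})

# One facet row per group, each holding a kernel density estimate of "value".
g = sns.FacetGrid(df, row="group", hue="group", aspect=8, height=0.9)
g.map(sns.kdeplot, "value", fill=True, alpha=0.8)
g.set_titles("")      # drop the per-facet titles
g.set(yticks=[])      # ridgelines don't need y tick labels
g.despine(left=True)
plt.show()
```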

They're named after the Joy Division album cover. But finding out what 'Joy Division' actually means (https://en.wikipedia.org/wiki/House_of_Dolls) is always going to make me less joyful about the joy in using them.

Thursday, 25 May 2017

Releasing research software

How should researchers be releasing their software?

The BBSRC together with the Software Sustainability Institute and Elixir-UK recently held a workshop to find out. This was the "Developing Software Licensing Guidance for BBSRC Workshop" (April 2017). The BBSRC wanted to ask the community what guidance it should be providing for grant applicants, grant reviewers and the developers and users of research software.

Here are some of the points that were raised during the day: 

Ownership

Who owns the software? University academics often have contracts that say that everything we do belongs to the uni, even if it's done in the evenings or at weekends. Students, on the other hand, tend to own their own IP. On collaborative projects, especially those collaborating with businesses, the ownership of software becomes more complex again. If the software is truly open source, then hopefully multiple people will contribute ('random strangers on the internet'). Who owns it then?

Licenses

Three main options: very permissive, copyleft or commercial. If you need to choose a software license, some universities are well supported by technology transfer staff who understand all the implications, and some are not. There are websites to help people choose, but the issues are complex and so far they haven't helped me choose. The consensus on permissive licenses seemed to be that while MIT is good (simple, easy to read and understand), the Apache 2.0 license also deals with accepting contributions from other developers in future ("Unless You explicitly state otherwise, any Contribution intentionally submitted for inclusion in the Work by You to the Licensor shall be under the terms and conditions of this License, without any additional terms or conditions."). If you don't use such a license then you may need a contributor licensing agreement, and that can scare off the casual patch submitter. The "for academic use only" kind of license was widely seen as awful: it restricts future collaboration, restricts project expansion and is unclear about who is and isn't an academic.

Software is produced at different scales

There are quick scripts, solid code, and whole platforms. The BBSRC will be funding work that produces all three types of software. It wants all grant-holders to think about how they will release their software and support the creation of software. We know that documenting, testing and packaging is time consuming. How do we ensure that projects allow time for this? How can software be cited properly (see Software citation principles)? Is there any such thing as a 'throwaway script'? Will the project management team include someone who knows about software?

Existing advice

There is a collection of existing advice from various places about how to choose licenses and release software.
The release of software as an output of research is clearly an issue being raised by multiple organisations now, and it's great to see software being taken seriously. 2017 will see the second Research Software Engineers Conference (RSE2017), and a June 2016 Dagstuhl workshop produced an Engineering Academic Software manifesto, containing pledges such as
  • I will make explicit how to cite my software.
  • I will cite the software I used to produce my research results.
  • When reviewing, I will encourage others to cite the software they have used.
and more. The BBSRC now has the task of pulling together all the discussion from the workshop and other places and creating a guidance document to help grant applicants, reviewers and panellists, and also the software developers. This will then become part of the grant proposal process along with other docs such as the data management plan, pathways to impact, justification of resources, etc.

Thursday, 27 April 2017

Ada Lovelace and her contribution to computer science

Ada Lovelace (1815-1852) is known as the first computer programmer. What did she really contribute to computing? So many blog posts, articles and books get bogged down describing her childhood, and her famous father and overbearing mother. But what did she actually do?

A while ago I got stuck in and read the paper she actually wrote: "Sketch of the Analytical Engine invented by Charles Babbage... with notes by the translator. Translated by Ada Lovelace".

In fact, she translated into English a paper by an Italian engineer, Luigi Menabrea, and clearly felt that his description did not go far enough, because she added her own notes at the end, which contain more detail than the translation itself.

The original paper by Menabrea was supposed to sell the idea of the Analytical Engine, a machine designed by Charles Babbage but not yet constructed. Money needed to be raised for this endeavour. Menabrea explains what the machine can do. To some extent he does also try to sell the idea, but his writing can be quite dry: 'To give an idea of this rapidity, we need only mention that Mr. Babbage believes he can, by his engine, form the product of two numbers, each containing twenty figures, in three minutes.'

Lovelace wanted to add more, and her writing glows with excitement about the machine. She added 7 detailed discussion notes, labelled A to G, and also 20 numbered footnotes. In the numbered footnotes she comments on issues with the translation. She also sometimes comments that she feels Menabrea hasn't completely understood the novelty and potential (for example 'M. Menabrea's opportunities were by no means such as could be adequate to afford him information on a point like this'). In the notes labelled A to G she explains many of the concepts that are the fundamentals of computing today.

Note A

This note spells out that the Analytical Engine is a general purpose computer. It doesn't just calculate fixed results, it analyses. It can be programmed. It's flexible. It links 'the operations of matter and the abstract mental processes of the most abstract branch of mathematical science'. This is not just a Difference Engine, but instead offers far more. She is very clear that 'the Analytical Engine does not occupy common ground with mere "calculating machines". It holds a position wholly its own'. It's also not just for calculations on numbers, but could operate on other values too. It might for instance compose elaborate pieces of music. 'The Analytical Engine is an embodying of the science of operations'. Lovelace has a beautiful turn of phrase that evokes her delight in imagining where this could go: it 'weaves algebraic patterns just as the Jacquard loom weaves flowers and leaves'.

Note B

This note is about memory, the 'storehouse' of the Engine. It is about variables, and retaining values in memory. She describes how the machine can retain separately and permanently any intermediate results, and then supply these intermediate results as inputs to further calculations. Any particular function that is to be calculated is described by its variables and its operators, and this gives the machine its generality.

Note C

This note is about reuse: the Engine uses input cards, as used in the Jacquard looms. Lovelace had toured the factories of the north with her mother and seen the Jacquard looms in action. They would have been the cutting edge of technology at the time. Intricate patterns were woven by machines, with thousands of fine threads, all controlled by punched cards. The cards needed to be created by card punching machines, from detailed tables of numbers that would be translated into the patterns of holes. The cards allowed a design to be repeated, and they were bound together in a sequence. She explains that these input cards can be reused 'any number of times successively in the solution of one problem', and furthermore that this applies to just one card, or a whole set of cards. The train of cards can be rewound until they are in the correct position to be used again.
Jacquard loom, Nottingham Industrial Museum

Punched cards used in Jacquard loom, Nottingham Industrial Museum
 

Note D

This note is about the order and efficiency of operations. She explains that there are input variables, intermediate variables, and final results. Every operation must have two variables 'brought into action' as inputs and one variable to use as the result. We can then trace back and inspect a calculation to see how a result was obtained. We can see how often a value was used. We can think about how to batch up operations to make them more efficient. She begins to imagine not just how the machine will operate, but how programmers will have to think carefully about their programs, how they will debug them and how they will optimise the algorithms they implement.
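As a toy rendering of the operation format Note D describes (two input variables 'brought into action', one result variable), here is a short sketch that replays such a program and prints a trace of each step; the tuple layout and variable names are mine, not Lovelace's notation.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}

def run(program, variables):
    """Execute (op, in1, in2, out) steps, printing a trace of each result."""
    for op, a, b, out in program:
        variables[out] = OPS[op](variables[a], variables[b])
        print(f"{out} = {a} {op} {b} -> {variables[out]}")
    return variables

# a*b + c computed as two traced operations, each with two inputs and one result
run([("*", "a", "b", "t1"), ("+", "t1", "c", "t2")], {"a": 3, "b": 4, "c": 5})
```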

Note E

In this note Lovelace explains how loops or 'cycles' can be used to solve series, for example to sum trigonometrical series. With a worked example (used in astronomical calculations), she abstracts out the operations needed to produce terms in the series and works on a notation for expressing cycles that include cycles. Note 18 expands further to explain that one loop can follow another, and there may be infinitely many loops.
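A toy illustration of 'cycles' and cycles within cycles, in the spirit of Note E: summing a few terms of a trigonometric series A_0 + A_1*cos(theta) + A_2*cos(2*theta) + ..., where an inner loop builds each multiple of the angle by repeated addition. The particular series and coefficients here are made up for illustration, not Lovelace's worked example.

```python
import math

def sum_cosine_series(coeffs, theta):
    """Sum A_0 + A_1*cos(theta) + A_2*cos(2*theta) + ... with explicit cycles."""
    total = 0.0
    for i, a in enumerate(coeffs):   # outer cycle: one pass per term of the series
        angle = 0.0
        for _ in range(i):           # inner cycle: accumulate i * theta by addition
            angle += theta
        total += a * math.cos(angle)
    return total

print(sum_cosine_series([1.0, 0.5, 0.25], math.pi / 3))  # 1 + 0.5*cos(60°) + 0.25*cos(120°)
```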

Note F

This note begins with the statement 'There is in existence a beautiful woven portrait of Jacquard, in the fabrication of which 24,000 cards were required.' Babbage was so taken with this woven portrait that he bought one of the few copies produced. The amount of work that machines could do, unfailingly, without tiring, was changing the face of industry. The industrial revolution was sweeping through the country. Lovelace explains how the Engine will be able to carry out long and intensive calculations with a minimum of cards (they can be rewound, and used in cycles). The machine can calculate a long series of results without making mistakes, solving problems 'which human brains find it difficult or impossible to work out unerringly'. It might even be set to work to solve as yet unsolved and arbitrary problems. 'We might even invent laws for series or formulae in an arbitrary manner, and set the engine to work upon them and thus deduce numerical results which we might not otherwise have thought of obtaining; but this would hardly perhaps in any instance be productive of any great practical utility, or calculated to rank higher than as a philosophical amusement.'
 

Note G 

She concludes with the scope and limitations of the Analytical Engine. She is keen to stress the machine's limitations, and to temper expectations about using it for discovery: it doesn't create anything. It has 'no pretensions to originate anything'. Lovelace then enumerates what it can do, but warns against overhyping it. Alan Turing references her concerns in his 1950 paper Computing Machinery and Intelligence (which is also well worth a read). In that paper he discusses 9 objections that people may raise against the possibility of AI; one of them is Lady Lovelace's Objection. He writes:
Our most detailed information of Babbage's Analytical Engine comes from a memoir by Lady Lovelace (1842). In it she states, "The Analytical Engine has no pretensions to originate anything. It can do whatever we know how to order it to perform" (her italics).
In Note G we also get the detailed trace of a substantial program to calculate the Bernoulli numbers, showing the order of operations combining variables to make results. The trace shows looping and the storage of intermediate results, and the correspondence with the mathematical formulae being calculated at each step. She inspects the program for efficiency, working out the number of punched cards required, the number of variables needed and the number of execution steps.
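For comparison with her trace, here is a short sketch that computes the same numbers using the standard modern recurrence (B_0 = 1, and B_m = -1/(m+1) * sum over j < m of C(m+1, j) * B_j), with exact rational arithmetic standing in for the engine's working variables. Lovelace's own table lays the computation out quite differently and numbers the Bernoulli numbers by another convention.

```python
from fractions import Fraction
from math import comb

def bernoulli_numbers(count: int) -> list:
    """Return B_0 .. B_{count-1} as exact fractions via the standard recurrence."""
    B = []
    for m in range(count):
        if m == 0:
            B.append(Fraction(1))
            continue
        acc = sum(comb(m + 1, j) * B[j] for j in range(m))
        B.append(-acc / (m + 1))     # B_m = -1/(m+1) * sum_{j<m} C(m+1, j) B_j
    return B

print([str(b) for b in bernoulli_numbers(9)])
# ['1', '-1/2', '1/6', '0', '-1/30', '0', '1/42', '0', '-1/30']
```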

Lovelace wonders, in the third paragraph of this Note, whether our minds are able to follow the analysis of the execution of the machine. Will we really be able to understand computers and their abilities? 'No reply, entirely satisfactory to all minds, can be given to this query, excepting the actual existence of the engine, and actual experience of its practical results.'

Ada Lovelace is buried in Hucknall Church, Nottingham. She died aged 36, and never got to see the Analytical Engine constructed.

Monday, 10 April 2017

Visiting JG Mainz University

I've just returned from a two week visit to Johannes Gutenberg University Mainz, where I was hosted by Andreas Karwath in the Data Mining group of the Dept of Informatik. I was there to pick up new research ideas, to get away from admin/teaching duties for a while and to share what we've been working on lately. 

The University at Mainz is a large campus-based university on the outskirts of the town, on the beautiful river Rhine. It's named after the inventor of the printing press in the West, Johannes Gutenberg. The Gutenberg museum is excellent, tracing books from wax tablets through handwritten parchments, to moveable-type print, then the industrial revolution, typewriters and high-volume printing. This is a museum about the value of information and its transmission, and the technology invented to do this. There was even a special exhibition about the Futura font, as an added bonus. This is the geometric circles-and-lines font used in posters for "2001: A Space Odyssey", on the Apollo 11 moon plaque ("We came in peace for all mankind"), and in so much future-looking advertising and propaganda of the Art Deco-inspired 1930s era, both in Germany and beyond.

It was interesting to be embedded in a data mining group rather than a bioinformatics group for a change. They have a broad range of application areas, but also happily switch technology (neural nets, relational learning, topic models, matrix decomposition, graphs, rs-trees and more) as the application area requires. It was also very interesting to see the ways in which a different country's research culture differs. It's not REF-dominated like the UK, so they're more free to focus on quick-turnaround peer-reviewed compsci conference publications, and the system is perhaps more hierarchically structured, as only the few professors have permanent positions. And yet it's still the same. University departments are international places and share much in common, whichever country you're in: same grant applications, student supervisions, seminar talks, dept silos, etc.

It was great fun to be there, and they were excellent hosts. At the same time, it was strange and sad to be a British person on exchange in Germany during the week that the UK triggered Article 50. International collaborations are so important to research that leaving the EU is bound to be hugely detrimental to us UK academics. We need more exchange, not less.

Tuesday, 4 April 2017

The metahaplome

A sample of water, soil or gut contents will contain a whole community of microbes, cooperating and competing. We now have the sequencing technology to begin to explore these communities and to find out what variation they possess. This can be useful in the search for new anti-microbials, or in the search for better enzymes for biofuels. However, the sequencing technology is not quite there yet: the very short length of the reads, together with the errors introduced, makes the problem of reassembling the underlying genomes much more complex.

We introduce the concept of the 'metahaplome': the exact sequence of DNA bases (or "haplotype") that constitutes the genes and genomes of every individual present. We also present a data structure and algorithm that will recover the haplotypes in the metahaplome, and rank them according to likelihood.
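As a deliberately tiny illustration of the haplotype-recovery idea (this is not the data structure or algorithm from the preprint, just a toy): reduce each read to the bases it carries at known variant sites ('-' where a read does not cover a site), then build one haplotype greedily by picking, at each site, the base that most often co-occurs with the base chosen at the previous site.

```python
from collections import Counter

def greedy_haplotype(reads):
    """Greedily extend one haplotype across variant sites, guided by read co-occurrence."""
    n_sites = len(reads[0])
    # start from the most common base at the first variant site
    hap = [Counter(r[0] for r in reads if r[0] != '-').most_common(1)[0][0]]
    for site in range(1, n_sites):
        support = Counter(
            r[site] for r in reads
            if r[site] != '-' and r[site - 1] == hap[-1]
        )
        if not support:  # no read links this site to the previous choice
            support = Counter(r[site] for r in reads if r[site] != '-')
        hap.append(support.most_common(1)[0][0])
    return "".join(hap)

# Three variant sites, reads drawn from a mixture of two haplotypes, ACG and TTG:
reads = ["AC-", "ACG", "-CG", "TT-", "TTG", "ACG"]
print(greedy_haplotype(reads))  # -> "ACG", the better-supported haplotype
```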

Our preprint about the metahaplome is now available at bioRxiv: Probabilistic Recovery Of Cryptic Haplotypes From Metagenomic Data.

Monday, 20 March 2017

Is academic impact useful as a proxy measure for real world impact?

In academia we have various measures of "success". Some of the more common measures are "how many papers did we publish in reputable journals" and "how many other academics went on to use my work or refer to my work". We count citations, check out our h-index, and perhaps even check the number of other academics talking about our work on social media. We'd all like our work to be useful to others. Of course, any measure of success can be gamed. Academics may cite themselves to boost citation counts, use clickbait paper titles to attract attention, and select journals for prestige rather than availability to the community.

However, to be useful to other academics is not the same as to be useful to the rest of the world. Are we also having impact outside of academia? Some blue sky basic research is unlikely to do this (but perhaps likely to be cited by academics). Some applied research can have immediate impact on medical outcomes, law and policy, societal attitudes, civil rights, environmental strategies and business practices.

The UK tries to measure this kind of research impact by asking universities to submit REF Impact case studies. These are summary documents, written by academics, describing what impact they've had. They're not easy to write, or to evidence, and yet they are used as part of the REF exercise measuring research quality, whereby Higher Education funding bodies distribute research money to the universities.

How can we make it easier for academics to find and present their impact? Before we answer that, is the whole exercise worth doing anyway? We need to know if we could just cheaply count citations and use this as a proxy for real world impact. After all, if a paper is popular with academics, surely it's also going to be useful to the rest of the world too?

We've done the analysis in Measuring scientific impact beyond academia: An assessment of existing impact metrics and proposed improvement. TL;DR: the answer is 'no'. Academic impact is not correlated with more comprehensive impact in the real world. We're not surprised, but we needed to prove it. But don't take our word for it: we've made all the data available. Try it yourself.
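If you do grab the data, a minimal sketch of the kind of check described here might look like the following, assuming a CSV with one row per paper and two hypothetical columns, "citations" and "realworld_impact" (the real column names in the released data may differ): a Spearman rank correlation asks whether highly cited papers also tend to rank highly on real-world impact.

```python
import pandas as pd
from scipy.stats import spearmanr

df = pd.read_csv("impact_scores.csv")   # hypothetical file name and columns
rho, p = spearmanr(df["citations"], df["realworld_impact"])
print(f"Spearman rho = {rho:.2f} (p = {p:.3g})")
# A rho near zero supports the conclusion that citation counts are a poor
# proxy for impact beyond academia.
```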

On with the next steps now... James will now be data mining the real world impact of scientific research, from the collection of unstructured documents out there in news archives, parliamentary proceedings, etc. We expect NLP, data mining and machine learning challenges ahead as we trace the movement of academic ideas and results out from academia and into the rest of the world.