In November I went to a workshop to discuss the remit of the Alan Turing Institute, the UK national institute for data science, with regard to the theme of "Data Science Challenges in High Throughput Biology and Precision Medicine". This workshop was held in Edinburgh, in the Informatics Forum, and hosted by Guido Sanguinetti.
The Alan Turing Institute is a new national institute, funded partly by the government, and partly by five universities (Edinburgh, UCL, Oxford, Cambridge, Warwick). The amount of funding is relatively small compared with that of other institutes (e.g. the Crick) and seems to be enough to fund a new building next door to the Crick in London, together with a cohort of research fellows and PhD students to be based in the new building. What should be the scope of the research areas that it addresses and how should it work as an institute? There are currently various scoping workshops taking place to discuss these questions.
Data science is clearly important to society, whether it's used in the analysis of genomes, intelligence gathering for the security services, data analytics for a supermarket chain, or financial predictions for the City. Statistics, machine learning, mathematical modelling, databases, compression, data ethics, data sharing and standards, and novel algorithms are all part of data science. The ATI is already partnered with Lloyds, GCHQ and Intel. Anecdotal reports from the workshop attendees suggest that data science PhD students are snapped up by industry, ranging from Tesco to JP Morgan, and that some companies would like to recruit hundreds of data scientists if only they were available.
The feeling at the workshop seemed to be a concern that the ATI will aim to highlight the UK's research in the theory of machine learning and computational statistics, but risks missing out on the applications. The researchers who work on new and cutting-edge machine learning and computational statistics don't tend to be the same people as the bioinformaticians. The people who go to NIPS don't go to ISMB/ECCB. And KDD/ICML/ECML/PKDD is another set of people again. These groups used to be closer, and used to overlap more, but now they rarely attend each other's conferences. Our workshop discussed the division between the theoreticians, who create the new methods but prefer their data to be abstracted from the problem at hand, and the applied bioinformaticians, who have to deal with complex and noisy data, and often apply tried and tested data science instead of the latest theoretical ideas. Publishing work in bioinformatics generally requires us to release code and data, and to have shown results on a real biological problem. Publishing in theoretical machine learning or computational statistics carries no particular requirement to implement the idea or to demonstrate its effectiveness on a real problem. There is also a contrast between the average size of research groups in the two areas. Bioinformatics needs larger groups to produce the data (people in the lab to run the experiments, bioinformaticians to manage and analyse the data, and these groups are often part of larger consortia), whereas theoretical research is often cottage-industry style, with just a PI and a PhD student. How should these styles of working come together?
Health informatics people worry about access to data: how to share it, get it, and ensure trust and privacy. Pharmaceutical companies worry about dealing with data complexity, such as how to analyse phenotype from cell images in high throughput screening, how to get interpretable models rather than non-linear neural networks, and how to keep up with all the new sources of information, such as function annotations via ENCODE. GSK now has a Chief Data Officer. Everyone is concerned about how to accumulate data from new bio-technologies (microarrays then RNA-seq, fluorescence then imaging, new techniques for measuring biomarkers of a population under longitudinal study). Trying to keep up with the changes can lead to bad experimental design and bad choices for data management.
There was much discussion about needing to make more biomedical data open-access (with consent), including genomic, phenotypic and medical data. There seemed to be some puzzlement about why people are happy to entrust banks with their financial data, and supermarkets with purchase data, but not researchers with biomedical data. (I don't share their puzzlement: your genetic data is not your choice, it's what you're born with, and it belongs to your family as much as it belongs to you, so the implications of sharing it are much wider.)
All these issues surrounding the advancement of data science are far more complex and varied than the creation of novel and better algorithms. How much will the ATI be able to tackle in the next five years? It's certainly a challenge.