2015 is the year that genome editing really became big news. A new technique, "CRISPR/CAS", was named as Science magazine's breakthrough of the year as voted by the public from a shortlist chosen by staff.
However, people were manipulating DNA through many useful methods long before CRISPR/CAS made headlines. Gene deletion is an important tool when trying to understand the function of genes: take out a gene and see what effect it causes. Genes can be disrupted (by removing a portion of the DNA or inserting some extra DNA), or can be interfered with, for example via their RNA production, or they can be entirely deleted. It's common practice when removing a gene to insert a marker, so that we can easily select for the cells where this procedure has been successful. For example, we can insert an antibiotic resistance gene as a marker and then grow the cells on a plate containing the antibiotic. Only those cells that have lost our gene of interest and gained antibiotic resistance will grow. The trouble is that many gene deletions have no visible effect by themselves, so we often want to delete several genes in the same cell. If we want to delete a second gene and a third, then we need more markers, or we need to be able to remove and reuse the marker we inserted. We also don't want the process to leave any scars behind that could destabilise the genome. We've just published a paper to help solve this problem.
This process of 'swap a gene of interest for a marker gene' can be achieved in many organisms by homologous recombination. This is a process used by many cells to repair broken strands of DNA. If we provide a piece of DNA that has a good region of similarity to the region just downstream of our gene of interest, and also a good region of similarity to the region just upstream of the gene of interest, but instead of the gene of interest, has the marker gene between these regions, then the normal cellular processes of homologous recombination will exchange the two. Some organisms perform homologous recombination very readily (S. cerevisiae for example). Others may need a little more encouragement, such as creating a double stranded break.
Our new paper, "A tool for Multiple Targeted Genome Deletions that Is Precise, Scar-Free and Suitable for Automation", with Wayne Aubrey as first author, uses a 3-stage PCR process to synthesise a stretch of DNA (a 'cassette') that will do everything. It will have good regions of similarity to the regions upstream and downstream of the gene of interest. It will contain a marker gene. And (here's the good bit) it will contain a specially designed region ('R') before the marker gene that is identical to the region that occurs just after the gene of interest. In this way, after homologous recombination has done its thing and inserted the DNA cassette in place of the gene of interest, there will be two identical R regions, one before the marker gene and one after it. Sometimes the DNA will loop round on itself, the two R regions will match up, and homologous recombination will snip out the loop, including the marker gene.
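To make the two recombination steps concrete, here is a toy string model of the idea in Python. It is only a sketch: the sequences are invented, the function names (build_cassette, integrate, loop_out) are my own for illustration, and it says nothing about the PCR stages or primer design in the paper.

```python
def build_cassette(upstream, downstream, marker, r_len):
    """Assemble upstream homology + R + marker + downstream homology.

    R is copied from the start of the downstream region, i.e. the bases
    that sit immediately after the gene of interest.
    """
    r_region = downstream[:r_len]
    return upstream + r_region + marker + downstream


def integrate(genome, gene, cassette, upstream, downstream):
    """Model the first recombination: the cassette replaces the gene."""
    target = upstream + gene + downstream
    if genome.count(target) != 1:
        raise ValueError("target locus must occur exactly once")
    return genome.replace(target, cassette)


def loop_out(genome, marker, r_len):
    """Model the second recombination between the two R repeats.

    One R copy and the marker are excised, leaving a single R behind,
    which is exactly the sequence that was there before.
    """
    marker_start = genome.index(marker)
    marker_end = marker_start + len(marker)
    first_r = genome[marker_start - r_len:marker_start]
    second_r = genome[marker_end:marker_end + r_len]
    if first_r != second_r:
        raise ValueError("no direct repeat flanking the marker")
    return genome[:marker_start - r_len] + genome[marker_end:]


# Toy sequences, invented for illustration only.
upstream, gene, downstream = "AAACCC", "GGGGTTTT", "TGCATGCA"
marker = "ATGCGTACGTAA"
genome = "TTT" + upstream + gene + downstream + "CCC"

cassette = build_cassette(upstream, downstream, marker, r_len=4)
with_marker = integrate(genome, gene, cassette, upstream, downstream)
clean = loop_out(with_marker, marker, r_len=4)
print(with_marker)
print(clean)
```

In this toy model the final sequence is just the upstream region joined directly to the downstream region, with no marker and no extra bases, which is the "scar-free" property described above.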
We can encourage this to happen, and select for the cells in which it has happened, if our marker is also 'counter-selectable'. That is, we'd like a marker for which we can add something to the growth medium so that only cells without the marker will grow. In other words, we'd like a marker or marker combination for which we can first select for its presence and then counter-select for its absence. With this we can select for cells that have had the marker replace the gene, and then counter-select for cells that have since lost the marker too. The result is a clean gene deletion.
Of course we're always standing on the shoulders of giants when we do science. Our method is an improvement on a method by Akada 2006, so that no extra bases are lost or gained and the method requires no gel purification steps. Just throw in your primers and products and away you go. It's not fussy about quantity. No purification steps means that it could be automated on lab robots. And it could be used to delete any genetic component, not just genes. Give it a try!
Wednesday 23 December 2015
Thursday 17 December 2015
Data science and a scoping workshop for the Turing Institute
In November I went to a workshop to discuss the remit of the Alan Turing Institute, the UK national institute for data science, with regard to the theme of "Data Science Challenges in High Throughput Biology and Precision Medicine". This workshop was held in Edinburgh, in the Informatics Forum, and hosted by Guido Sanguinetti.
The Alan Turing Institute is a new national institute, funded partly by the government, and partly by five universities (Edinburgh, UCL, Oxford, Cambridge, Warwick). The amount of funding is relatively small compared with that of other institutes (e.g. the Crick) and seems to be enough to fund a new building next door to the Crick in London, together with a cohort of research fellows and PhD students to be based in the new building. What should be the scope of the research areas that it addresses and how should it work as an institute? There are currently various scoping workshops taking place to discuss these questions.
Data science is clearly important to society, whether it's used in the analysis of genomes, intelligence gathering for the security services, data analytics for a supermarket chain, or financial predictions for the city. Statistics, machine learning, mathematical modelling, databases, compression, data ethics, data sharing and standards and novel algorithms are all part of data science. The ATI is already partnered with Lloyds, GCHQ and Intel. Anecdotal reports from the workshop attendees suggest that data science PhD students are snapped up by industry, ranging from Tesco to JP Morgan, and that some companies would like to recruit hundreds of data scientists if only they were available.
The feeling at the workshop seemed to be a concern that the ATI will aim to highlight the UK's research in the theory of machine learning and computational statistics, but risks missing out on the applications. The researchers who work on new and cutting edge machine learning and computational statistics don't tend to be the same people as the bioinformaticians. The people who go to NIPS don't go to ISMB/ECCB. And KDD/ICML/ECML/PKDD is another set of people again. These groups used to be closer, and used to overlap more, but now they rarely attend each other's conferences. Our workshop discussed the division between the theoreticians, who create the new methods but prefer their data to be abstracted from the problem at hand, and the applied bioinformaticians, who have to deal with complex and noisy data, and often apply tried and tested data science instead of the latest theoretical ideas. To publish work in bioinformatics generally requires us to release code and data, and to have shown results on a real biological problem. To publish in theoretical machine learning or computational statistics, there is no particular requirement for an implementation of the idea, or to demonstrate its effectiveness on a real problem. There is also a contrast between the average size of research groups in the two areas. Larger groups are needed to produce the data (people in the lab to run the experiments, bioinformaticians to manage and analyse the data, and these groups are often part of larger consortia), whereas theoretical research is often cottage-industry style, with just a PI and a PhD student. How should these styles of working come together?
Health informatics people worry about access to data: how to share it, get it, and ensure trust and privacy. Pharmaceuticals worry about dealing with data complexity, such as how to analyse phenotype from cell images in high throughput screening, having interpretable models rather than non-linear neural networks, and how to keep up with all the new sources of information, such as function annotations via ENCODE. GSK now has a Chief Data Officer. Everyone is concerned about how to accumulate data from new bio-technologies (microarrays then RNA-seq, fluorescence then imaging, new techniques for measuring biomarkers of a population under longitudinal study). Trying to keep up with the changes can lead to bad experiment design, and bad choices for data management.
There was much discussion about needing to make more biomedical data open-access (with consent), including genomic, phenotypic and medical data. There seemed to be some puzzlement about why people are happy to entrust banks with their financial data, and supermarkets with purchase data, but not researchers with biomedical data. (I don't share their puzzlement: your genetic data is not your choice, it's what you're born with, and it belongs to your family as much as it belongs to you, so the implications of sharing it are much wider).
All these issues surrounding the advancement of Data Science are far more complex and varied than the creation of novel and better algorithms. How much will the ATI be able to tackle in the next five years? It's certainly a challenge.
Tuesday 29 September 2015
An executable language for change in biological sequences
A discussion on Twitter about whether there was a language for representing sequence edits prompted me to post my draft proposal for such a language. http://figshare.com/articles/Draft_proposal/1559009
Comments, criticism, collaboration and competition welcome. Hopefully I'll submit it shortly.
Saturday 5 September 2015
Notes from workshop on Computational Statistics and Machine Learning
I've just attended "Autonomous Citizens: Algorithms for Tomorrow's Society", a workshop as part of the Network on Computational Statistics and Machine Learning (NCSML). That's an ambitious title for a workshop! Autonomous Citizens are not going to hit the streets any time soon. The futuristic goals of Artificial Intelligence are still some way off. Robots are still clumsy, expensive and inflexible. But AI has changed dramatically since I was a student. Back in the days when computational power was more limited, AI was mostly about hand-coding knowledge into expert systems, grammars, state machines and rule bases. Now almost any form of intelligent behaviour from Google translation to Facebook face recognition makes heavy use of computational statistics to infer knowledge.
Posters: there were some really good posters and poster presenters who did a great job of explaining their work to me. In particular I'd like to read more about:
- A Probabilistic Context-Sensitive Model of Subsequences (Jaroslav Fowkes, Charles Sutton): a method for finding frequent interesting subsequences. Other methods based on association mining give lots of frequent but uninteresting subsequences. Instead, define a generative model, then go on to use data and EM to infer the parameters of the model.
- Canonical Correlation Forests (Tom Rainforth, Frank Wood): a replacement for random forests that projects (a bootstrap sample of) the data into a different coordinate space using Canonical Correlation Analysis before making the decision nodes.
- Algorithmic Design for Big Data (Murray Pollock et al): Retrospective Monte Carlo. Monte Carlo algorithms with reordered steps. There are stochastic steps and deterministic steps. The order can have a huge effect on efficiency. His analogy went as follows: imagine you've set a quiz with a right answer and a wrong answer. People submit responses and you need to choose a winner. You could first sort them all into two piles (correct, wrong) and then pick a winner from the correct pile (deterministic first, then stochastic). Or you could just randomly sample from all results until you get a winner (stochastic first). The second will be quicker.
- MAP for Dirichlet Process Mixtures (Alexis Boukouvalas et al): a method for creating a Dirichlet Process Mixture model. This is useful as a k-means replacement where you don't know in advance what k should be, and where your clusters are not necessarily spherical.
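The quiz analogy from the Retrospective Monte Carlo poster can be sketched in a few lines of Python. This is only my reading of the analogy, not the authors' algorithm: the function names are made up, and it assumes at least one response is correct (otherwise the sampling version would never stop).

```python
import random


def winner_sort_first(responses, rng):
    """Deterministic step first: check every response, then draw one winner."""
    checks = 0
    correct = []
    for name, is_correct in responses:
        checks += 1
        if is_correct:
            correct.append(name)
    return rng.choice(correct), checks


def winner_sample_first(responses, rng):
    """Stochastic step first: draw responses at random until one is correct.

    Assumes at least one correct response exists.
    """
    checks = 0
    while True:
        name, is_correct = rng.choice(responses)
        checks += 1
        if is_correct:
            return name, checks


# Hypothetical quiz: 10,000 entrants, half of them correct.
responses = [(f"entrant{i}", i % 2 == 0) for i in range(10000)]
rng = random.Random(0)
print(winner_sort_first(responses, rng))
print(winner_sample_first(responses, rng))
```

Both versions pick uniformly among the correct responses. The first always checks every response, while the second needs on average 1/p checks if a fraction p of the responses are correct, so the saving depends on correct answers not being too rare.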
The first talk presented the idea of an Automated Statistician (Zoubin Ghahramani). Throw your time series data at the automated statistician and it'll give you back a report in natural language (English) explaining the trends and extending a prediction for the future. The idea is really nice. He has defined a language for representing a family of statistical models, a search procedure to find the best combination of models to fit your data, an evaluation method so that it knows when to stop searching, and a procedure to interpret/translate the models and explain the results. His language of models is based on Gaussian processes with a variety of interesting kernels, together with addition and multiplication as operators on models, and also allowing change points, so we can shift from one model combination to another at a given timepoint.
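To give a flavour of what "addition and multiplication as operators on models" means for Gaussian process kernels, here is a small pure-Python sketch. The kernel formulas are the standard textbook ones; the helper names and the sigmoid-weighted change point are my own assumptions for illustration, not the Automated Statistician's actual code.

```python
import math


def rbf(lengthscale):
    """Squared-exponential kernel: smooth local variation."""
    return lambda x, y: math.exp(-((x - y) ** 2) / (2 * lengthscale ** 2))


def periodic(period, lengthscale):
    """Periodic kernel: repeating structure with the given period."""
    return lambda x, y: math.exp(
        -2 * math.sin(math.pi * abs(x - y) / period) ** 2 / lengthscale ** 2
    )


def linear():
    """Linear kernel: straight-line trends."""
    return lambda x, y: x * y


def add(k1, k2):
    """Sum of kernels: the data is one structure plus another."""
    return lambda x, y: k1(x, y) + k2(x, y)


def multiply(k1, k2):
    """Product of kernels: one structure modulated by another."""
    return lambda x, y: k1(x, y) * k2(x, y)


def change_point(k_before, k_after, location, steepness=1.0):
    """Switch smoothly from k_before to k_after around `location`.

    Sigmoid weighting is assumed here as one way to model a change point.
    """
    def sigma(x):
        return 1.0 / (1.0 + math.exp(-(x - location) / steepness))

    return lambda x, y: (
        (1 - sigma(x)) * k_before(x, y) * (1 - sigma(y))
        + sigma(x) * k_after(x, y) * sigma(y)
    )


def gram(kernel, xs):
    """Covariance matrix of the kernel over the input points xs."""
    return [[kernel(a, b) for b in xs] for a in xs]


# A periodic pattern whose amplitude grows linearly, plus smooth noise,
# switching to a purely smooth model after time 5.
growing_season = multiply(linear(), periodic(period=1.0, lengthscale=1.0))
model = change_point(add(growing_season, rbf(2.0)), rbf(0.5), location=5.0)
covariance = gram(model, [0.0, 2.5, 5.0, 7.5])
```

A search procedure like the one described in the talk would then explore expressions of this kind and score each against the data; that part is not shown here.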
The next two talks were about robots, which are perhaps the ultimate autonomous citizens. Marc Deisenroth spoke about using reinforcement learning and Bayesian optimisation as two methods for speeding up learning in robots (presented with fun videos showing learning of pendulum swinging, valve control and walking motion). He works on minimising the expected cost of the policy function in reinforcement learning. His themes of using Gaussian processes, and of using knowledge of uncertainty to help determine which new points to sample, were also reflected in the next talk by Jeremy Wyatt about robots that reason with uncertain and incomplete information. He uses epistemic predicates (know, assumption), and has probabilities associated with his robot's rule base so that it can represent uncertainty. If incoming data from sensors may be faulty, then that probability should be part of the decision making process.
Next was Steve Roberts, who described working with crowd sourced data (from sites such as zooniverse): real citizens rather than automated ones. He deals with unreliable worker responses and large datasets. People vary in their reliability, and he needs to increase accuracy of results and also use their time effectively. The data to be labelled has a prior probability distribution. Each person also has a confusion matrix, describing how they label objects. These confusion matrices can be inspected, and in fact form clusters representing characteristics of the people (optimist, pessimist, sensible, etc). There are many potential uses for understanding how people label the data. Along the way, he mentioned that Gibbs sampling is a good method but is too slow for his large data, so he uses Variational Bayes, because the approximations work for this scenario.
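As a rough illustration of how a prior and per-person confusion matrices combine, here is a minimal Python sketch using plain Bayes' rule. It assumes the confusion matrices are already known and that people respond independently given the true label; the hard part from the talk (inferring those matrices, for example with Variational Bayes) is not attempted, and the worker names and numbers are invented.

```python
def posterior_over_labels(prior, confusion, responses):
    """Combine crowd responses for one object into a posterior over its label.

    prior[c]                -- P(true label = c)
    confusion[w][c][l]      -- P(worker w answers l | true label = c)
    responses               -- list of (worker, answered_label) pairs
    """
    unnormalised = []
    for true_label, probability in enumerate(prior):
        for worker, answer in responses:
            probability *= confusion[worker][true_label][answer]
        unnormalised.append(probability)
    total = sum(unnormalised)
    return [p / total for p in unnormalised]


# Invented example with two labels (0 and 1).
prior = [0.5, 0.5]
confusion = {
    # Usually right, whichever the true label is.
    "sensible": [[0.9, 0.1], [0.1, 0.9]],
    # Says 1 half the time even when the truth is 0.
    "optimist": [[0.5, 0.5], [0.05, 0.95]],
}
responses = [("sensible", 0), ("optimist", 1)]
posterior = posterior_over_labels(prior, confusion, responses)
print(posterior)
```

Here the two people disagree, but label 0 stays the more probable answer because the optimist's "1" carries little information.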
Finally, we heard from Howard Covington, who introduced the new Alan Turing Institute which aims to be the UK's national institute for Data Science. This is brand new, and currently only has 4 employees. There will eventually be a new building for this institute, in London, opposite the Crick Institute. It's good to see that computer science, maths and stats now have a discipline-specific institute and will have more visibility from this. However, it's an institute belonging to 5 universities: Oxford, Cambridge, UCL, Edinburgh and Warwick, each of which has contributed £5 million. How the rest of us get to join in with the national institute is not yet clear (Howard Covington was vague: "later"). For now, we can join the scoping workshops that discuss the areas of research that are relevant to the institute. The website, which has only been up for 4 weeks so far, has a list of these, but no joining information. Presumably, email the coordinator of a workshop if you're interested. The Institute aims to have 200 staff in London (from Profs to PhDs, including administrators). They're looking for research fellows now (Autumn 2015), and PhDs soon. Faculty from the 5 unis will be seconded there for periods of time, paid for by the institute. There will be a formal launch party in November.
Next year, the NCSML workshop will be in Edinburgh.
Thursday 30 July 2015
ISMB/ECCB 2015
ISMB/ECCB 2015 (and HitSEQ 2015) was held in Dublin, just across the Irish Sea from us here in Aberystwyth. So off we went, to find out the latest research in bioinformatics. There were many parallel tracks, but the recurring themes of the talks I attended were:
- lots of work on human genomics, particularly disease, particularly cancer
- single cell analysis, finding variation (SNVs) from clonal populations, haplotype resolution
- sequencing technologies: RNA-seq, sequencing of methylation, Hi-C sequencing, ultra deep sequencing and lots of promise for long reads
- reference sequences: most people were working with a reference rather than de-novo
- training bioinformaticians, maintaining software, keeping a core of bioinformaticians
- the Burrows Wheeler transform - does it solve every large-data problem?
- graphs, and ways of cleaning up graphs, adding weights to graphs, finding minimal/maximal components of graphs
- text mining
- metagenomics
- multi-omics
Aberystwyth PhD students with their posters: Stefani Dritsa, Sam Nicholls, Tom Hitch, Francesco Rubino
The keynote talks tended to be of the kind that long-established group PIs do well. They're the "Here's a summary of all the work my group has been doing for the past 5-10 years to answer this particular biological question" talk. While I admire their determination and group size, I feel that they're speaking only to a subgroup of the audience with this kind of talk, and that a keynote should somehow also aim to more generally inspire the audience to go out and do great work, have new ideas, think in new directions, and not just to have learned a little more about that specific subject area. The far more off-the-wall non-keynote talk by David Searls about a bioinformatic analysis of James Joyce's book Ulysses fascinated the audience, and provided exactly that. He received a huge round of applause.
The jobs notice boards were full (below are just 2 of the notice boards). More bioinformaticians are clearly needed!
Many of the conference talks are now online, but you need to be a member of ISCB to see them http://www.iscb.org/ismb-mm/media-ismbeccb2015. The papers are also collected in a special ISMB/ECCB issue of Bioinformatics.
Monday 8 June 2015
Aber Bioinformatics Workshop
Last week we had the 2nd Aber Bioinformatics Workshop. It's an internal workshop for work-in-progress talks, posters and networking and the aim is for us all to keep up with what's going on in Aberystwyth in bioinformatics across departments and institutes. We had a wide range of talks on genomics and sequence analysis, metabolomics, optimising proteins, population and community modelling, data infrastructure and other topics. Here's the programme for the day.
It was great to see that we now have so many people interested and working in bioinformatics, despite the difficulties in trying to understand all sides of the story (the biology, the computing, the statistics, etc). We talked about the range of modules and courses that were available to help people get up to speed with this, and how we should do more to let new PhD students know what is available. Also, now that we've had the workshop, hopefully we're more aware of the expertise and facilities available here in Aber, so we now know who to approach with questions and ideas.
At the end of the day we moved down to the pub, and continued to discuss more random topics: beetles, plant senescence, hens, temperature sensing wires for computer clusters, and concordance in Shakespeare texts. I'm sure this all helps in the long run.
Photo of all the attendees, taken by Sandy Spence
Monday 1 June 2015
Burglary at the railway refreshment rooms
The NLW have a fantastic collection of digitised historic newspaper articles available. They're going to release a new interface shortly with access to 15 million articles, and I was testing the search facility on the new interface today. I came across this gem of a story from the Aberystwyth Observer in 1907, which also happens to be available in their currently live beta collection.
BURGLARY AT THE RAILWAY REFRESHMENT ROOMS. The police are investigating two cases of robbery from the refreshment rooms at Borth and Dovey Junction. At these places thieves broke into the premises and got away with a quantity of wine and money. This is the second or third time that Dovey Junction has been visited during the last few years.
http://welshnewspapers.llgc.org.uk/en/page/view/3050081/ART41 (may need to zoom out and zoom in again before moving the page to see the article, which is at the edge of the page)
The thought of wine and money being held at remote Dovey Junction station is delightful, as is the fact that the reporter can't remember if it's the second or third time this has happened. Perhaps the reporter knows something about where the wine went.
Thursday 21 May 2015
How much is enough?
How much is enough? This question seems to crop up very frequently when analysing data. For example:
- "How much data do I need to label in order to train a machine learning algorithm to recognise place names that locate newspaper articles?"
- "Is my metagenome assembly good enough or do we need longer/fewer contigs?"
- "What BLAST/RAPSearch threshold is close enough?"
- "Are the k-mers long enough or short enough? (for taxon identification, for sequence assembly)"
Sometimes there are numbers to report, measures that give us an idea of whether the process was good enough, after we've done the expensive computation. We can report various statistics about how good the result is, such as the N50 and its friends for sequence assembly, or the predictive accuracy for a newspaper article place name labeller. Which statistics to report is highly questionable. Does a single figure such as the N50 really tell us anything useful about the assembled sequence? It can't tell us which parts were good and which parts were messy. Do we really need lots of long contigs if we're assembling a metagenome? Perhaps the assembly is just an input to many further pipeline stages, and actually, choppy short contigs will do just fine for the next stage.
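To make the N50 point concrete, here's a minimal sketch of how the figure is usually computed: sort the contigs from longest to shortest and report the length at which the running total reaches half of the assembled bases. The function name `n50` and the two lists of contig lengths are made up for illustration; they aren't taken from any real assembly.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths.

    The N50 is the length L such that contigs of length >= L
    together cover at least half of the total assembled bases.
    """
    if not contig_lengths:
        raise ValueError("need at least one contig length")
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Two hypothetical assemblies, each with 100 bases in total.
even = [25, 25, 25, 25]
lopsided = [25, 25, 10, 10, 10, 10, 5, 5]

print(n50(even), n50(lopsided))
```

Both lists give the same N50, even though the second assembly has a tail of short fragments, which is exactly the kind of detail a single summary figure hides.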
PAC learning theory was an attempt in 1984 by Leslie Valiant to address the questions about what was theoretically possible with data and machine learning. For what kinds of problem can we learn good hypotheses in a reasonable amount of time (hypotheses that are Probably Approximately Correct)? This led on to the question: how much data is enough to do a good job of machine learning? Some nice blog posts describing PAC learning theory and how much data is needed to ensure low error do a far better job than I could of explaining the theory. However, the basic theory assumes nice clean noise-free data and assumes that the problem is actually learnable (it also tends to overestimate the amount of data we'd actually need). In the real world the data is far from clean, and the problem might never be learnable in the format that we've described or in the language we're using to create hypotheses. We're looking for a hypothesis in a space of hypotheses, but we don't know if the space is sensible. We could be like the drunk looking for his keys under the lamppost because the light is better there.
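For a feel of the numbers involved, here's a small sketch of the standard textbook bound for the simplest PAC setting: a finite hypothesis space H, noise-free data, and a learner that returns a hypothesis consistent with all the examples. In that setting, m >= (ln|H| + ln(1/delta)) / epsilon examples are enough to have error at most epsilon with probability at least 1 - delta. The function name and the example hypothesis space (conjunctions over 10 boolean features, so |H| = 3^10) are my own choices for illustration.

```python
import math


def pac_sample_bound(hypothesis_count, epsilon, delta):
    """Examples sufficient for a consistent learner over a finite
    hypothesis space, assuming noise-free, learnable data.

    epsilon: largest acceptable error rate
    delta:   largest acceptable probability of exceeding that error
    """
    if hypothesis_count < 1:
        raise ValueError("hypothesis_count must be at least 1")
    if not (0 < epsilon < 1 and 0 < delta < 1):
        raise ValueError("epsilon and delta must lie strictly between 0 and 1")
    needed = (math.log(hypothesis_count) + math.log(1 / delta)) / epsilon
    return math.ceil(needed)


# Conjunctions over 10 boolean features: each feature is required
# true, required false, or ignored, giving 3**10 hypotheses.
print(pac_sample_bound(3 ** 10, epsilon=0.1, delta=0.05))
```

The bound grows only logarithmically with the size of the hypothesis space but linearly with 1/epsilon, and it says nothing at all once the data is noisy or the target isn't in the space.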
Perhaps there will be more theoretical advances in the future that tell us what kinds of genomic analysis are theoretically possible, and how much data they'd need, and what parameters to provide before we start. It's likely that this theory, like PAC theory, will only be able to tell us part of the story.
So if theory can't tell us how much is enough, then we have to empirically test and measure. But if we're still not sure how much is enough, then we're probably just not asking the right question.
Wednesday 22 April 2015
Computer Science and Lindy Hop
It would seem that Lindy Hop is the dance of computing people, physicists and engineers. If you go to any swing dance camp, an unreasonable proportion of the people in the room will be somehow involved in IT. We have Lindy hoppers who have used Androids with sensors and Fourier transforms to look at the pulse of the dance, others who have used Lindy to illustrate quantum computing, and there is even a specific Lindy dance class for engineers. Sam Carroll described how digital media savvy the community was and is, in her Step Stealing work.
Okay, so people need money to go to dance camps, and computing professions generally pay well. And it gets us away from our desks and having fun with other people and music. However, these can't be the only reasons.
I think that I enjoy Lindy for lots of the same reasons that I enjoy computing. They're both about creating complex structures that are somehow beautiful. By complex structures I mean structures that are complicated enough that they make me feel pleased when I finally successfully make them work. By beautiful I mean code/dance/ideas that become elegant because of their appropriateness in that particular situation. And in both computing and Lindy I enjoy the reusable patterns. Reusable patterns in rhythm are like reusable patterns in computing: once you've understood them, they stay with you and can often tell you something more abstract about what you're trying to do.
So I think that computing and Lindy have more in common than just having fun. They also share reusable beautiful complexity.
Added note: If you want to try it out, come and join our Vintage Swinging in the Rain party on Friday 24th April, 8pm, Marine Hotel, Aberystwyth. There's a short dance class for beginners at about 8:30, and live music from The Paper Moon Band.
Wednesday 15 April 2015
Employers at BCSWomen Lovelace Colloquium 2015
I think this year's Lovelace Colloquium was notable for the strong employer presence, both in sponsorship and in having employer stalls. This seems to be a year when computer science students are generally in demand. A conference of computing undergraduates presenting their work consists of ambitious students, who are ideal targets for recruiters. And the fact that it's a room full of bright women undergraduates is a very good thing for companies looking to increase their diversity.
Some of the companies who sponsored this year's Lovelace have been strong supporters for many years, including Google, FDM, EMC, UTC Aerospace, Interface3 and VMWare. They understood this a long time ago. But newer to this event were a whole variety of other companies, some small, some large, including Twitter, Slack, GCHQ, Scott Logic, JP Morgan, Bloomberg and Kotikan. We hope they enjoyed it too, and return in future years.
Kate Ho provided the keynote speech. She started her own software (games) company right after her PhD, and has gone from strength to strength, running a variety of startup companies since then. Her three tips: have side projects, be distinctive, keep a diary, were all good advice, both for technical work and for career development.
The friendliness of the Lovelace Colloquium never ceases to surprise and motivate me. Part of this is driven by Hannah's organisation style, pre-conference and during-conference, where nothing is too much trouble and everyone is made to feel at home. But I think it's also genuinely a room full of people having fun, getting to know others, making connections, and finding inspiration for their future careers.
Tuesday 31 March 2015
Final year computer science projects 2015
Computer Science is a diverse subject and this is reflected in the final year projects that our undergraduates undertake. This year, the final year project students that I supervise have chosen the following:
- A simulation of Babbage's Analytical Engine to be used as an online educational tool. As the first design for a general purpose programmable computer, it's of huge importance, but there are few good resources to help explain it to the general public. This site will include some history about Babbage and Lovelace, and an interactive game where you get to program the engine. Technologies involved include client side web programming, expression parsing, and 3D graphics. (Rhian Watkins)
- Geotagging of the digitised newspaper articles in the collection of the National Library of Wales. A mention of a placename in an article does not necessarily mean that the article is about that place (for example an article about "the Duchess of York"). This project uses a gazetteer from OpenStreetMap data, NLP to extract features, and then various machine learning algorithms to see if we can tell which placenames are relevant and which are not. (Sean Sapstead)
- A version control system for DNA. Software version control systems are not so useful for storing details of whole genomes and the modifications made to them. We want to explore Darcs-like patches, and use of sequence alignment tools to help record and inspect DNA modifications, and to be able to apply multiple modifications in a different order. (Thomas Hull)
- A tool for demonstrating the differences between two DNA sequences as audio/music and as animations. How can we show the public the differences between two strains of the Ebola virus, or the mutations in BRCA1 that can cause cancer? Sequence alignment tools and different translations and representations of the DNA strings are the key to this problem. (Andrew Poll, his project blog)
- The Happy Cow Game: an online collaborative game that represents the process of feeding a cow with the correct balance of foodstuffs to optimise its health, meat and milk. This was initially developed as a board game by veterinary lecturer Gabriel de la Fuente Oliver, and he's now helping us to turn it into an online game. Technologies involved include the Ruby on Rails framework, client side web tools and libraries, and a detailed understanding of how to make a complex game playable. (Simeon Smith)
- A tournament seeding tool for online gamers as part of Aber Community of Gamers. This one's going to collect data about previous games played, using APIs for the various online gaming platforms and then use interesting seeding algorithms to make sure that tournaments are fair and balanced. It's also producing the web site to support the community, using the Laravel framework. (Nathan Hand)
Andrew shows his prototype code for turning DNA differences into sound and animations to children in Science Week
Wednesday 18 March 2015
Laboratory automation in a functional programming language
Whenever I write code in Haskell instead of other programming languages, it feels cleaner. Not just more elegant, but also more obviously correct. And that's not just about the lack of side effects and mutable variables. Haskell has stronger typing, which gives the programmer many guarantees and allows you to express more information about the code. It also has tools such as QuickCheck, in which you can state and test further properties that you believe to be true.
We wanted to bring these ideas to the area of laboratory automation. We've had some fairly large and complex lab automation systems in our lab over the years, with multiple robot arms, and dozens of devices to be serviced. These robot arms pass plastic plates containing yeast around incubators, washers, liquid handlers, centrifuge devices and so on. If the plates get deadlocked or left out of the incubator for too long because scheduling operations went wrong, then the experiment is ruined. However, this can happen if the scheduler needs to be able to make decisions on the fly during the experiments. It may need to decide what to do next based on the current instrument readings and current system capacity. So either you make a scheduler that's so simple that you know exactly what it will do in advance (but it can't do the workflow you really need), or you make a scheduler that's complex and flexible, but it's very difficult to analyse its properties. Hmm, I think Tony Hoare already suggested that choice.
So we've written a paper to demonstrate the benefits of programming a lab automation scheduler in Haskell, and in particular to demonstrate the kinds of properties that can be expressed and checked. We illustrate the paper with a fairly simple system and a fairly simple scheduler, but it's immediately obvious that more complex systems and schedulers can be explored by tweaking the code.
This paper was written by the three of us, coming together with three very different perspectives. Colin is a functional programming researcher at the Uni of York who enjoys opportunities to demonstrate the benefits of FP in real world problems. Rob works for PAA, an excellent lab automation company, who build complex bespoke systems (and software) for their clients. They built one of our lab automation systems. I'm both a user of such lab automation systems, and also a user of Haskell, without ever actually being an FP researcher.
The code is available as a literate Haskell file. The entire code is in this file, along with a complete description of what's going on and how it all works, and this file can easily be turned into a readable PDF document (which also includes all the code). https://github.com/amandaclare/lab-auto-in-fp
If you've been inspired by the ideas in this work, do please cite the paper:
C. Runciman, A. Clare and R. Harkness. Laboratory automation in a functional programming language. Journal of Laboratory Automation 2014 Dec; 19(6):569-76. doi: 10.1177/2211068214543373.
http://jla.sagepub.com/content/19/6/569.abstract
Abstract:
After some years of use in academic and research settings, functional languages are starting to enter the mainstream as an alternative to more conventional programming languages. This article explores one way to use Haskell, a functional programming language, in the development of control programs for laboratory automation systems. We give code for an example system, discuss some programming concepts that we need for this example, and demonstrate how the use of functional programming allows us to express and verify properties of the resulting code.
Tuesday 10 March 2015
Python for Scientists
This year, 2014/2015, we started a new MSc course: Statistics for Computational Biology. We can see that there's a huge demand for bioinformaticians, for statisticians who can read biology, and for programmers who know about statistics and can apply stats to biological problems. So this new MSc encompasses programming, statistics and loads of the current hot topics in biology. It's the kind of MSc I would have loved to have done when I was younger.
As part of this degree, I'm teaching a brand new module called Programming for Scientists, which uses the Python programming language. This is aimed at students who have no prior programming knowledge, but have some science background. And in one semester we teach them the following:
- The basics of programming: variables, loops, conditionals, functions
- File handling (including CSV)
- Plotting graphs using matplotlib
- Exceptions
- Version control using Git/Github
- SQL databases (basic design, queries, and using SQLite from Python)
- XML processing
- Accessing data from online APIs
We had students sign up for this module from a surprisingly diverse set of backgrounds, from biology, from maths, from geography and even from international politics. We also had a large number of staff and PhD students from our Biology department (IBERS) who wanted to sit in on the module. This was a wonderful group of students to teach. They're people who wanted to learn, and mostly just seemed to absorb ideas that first year undergraduates struggle with. They raised their game to the challenge.
Python's a great language for getting things done. So it makes a good hands-on language. However, it did highlight many of Python's limitations as a first teaching language. The objects/functions issue: I chose not to introduce the idea of objects at all. It's hard enough getting this much material comfortably into the time we had, and objects, classes and subclasses were something that I chose to leave out. So we have two ways to call functions: len(somelist) and somelist.reverse(). That's unfortunate. Variable scoping caught me out on occasion, and I'll have to fix that for next year. The Python 2 vs Python 3 issue was also annoying to work around. Hopefully next year we can just move to Python 3.
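Here's the kind of small example that shows the two calling styles side by side. The list contents and variable names are just for illustration.

```python
somelist = [3, 1, 2]

# Function style: the list is passed in as an argument.
count = len(somelist)

# Method style: the call hangs off the list with a dot.
# reverse() changes the list in place and returns None.
result = somelist.reverse()

# reversed() is a function-style alternative that leaves the
# original list alone and gives back a new sequence.
backwards = list(reversed(somelist))

print(count, somelist, result, backwards)
```

Without the idea of objects, a student has no rule for predicting which style a given operation uses, or why reverse() hands back None rather than the reversed list.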
What impressed me most was the quality of the final assignment work. We asked the students to analyse a large amount of data about house sales, taken from http://data.gov.uk/ and population counts for counties in England and Wales taken from the Guardian/ONS. They had to access the data as XML over a REST-ful API, and it would take them approximately 4 days to download all the data they'd need. We didn't tell them in advance how large the data was and how slow it would be to pull it from an API. Undergrads would have complained. These postgrads just got on with it and recognised that the real world will be like this. If your data is large and slow to acquire then you'll need to test on a small subset, check and log any errors and start the assignment early. The students produced some clean, structured and well commented code and many creative summary graphs showing off their data processing and data visualisation skills.
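The "test on a small subset, check and log any errors" advice can be sketched roughly as follows. This is not the assignment solution: the XML element names (`sale`, `county`, `price`), the function names and the `fake_fetch` stand-in are all hypothetical, and a real version would pass in a function that makes the HTTP request instead.

```python
import logging
import xml.etree.ElementTree as ET

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("house_sales")


def parse_sales(xml_text):
    """Extract (county, price) pairs from one page of XML.

    The <sale>, <county> and <price> element names are hypothetical.
    """
    root = ET.fromstring(xml_text)
    sales = []
    for sale in root.iter("sale"):
        county = sale.findtext("county")
        price = sale.findtext("price")
        if county is None or price is None:
            log.warning("skipping incomplete record")
            continue
        sales.append((county, int(price)))
    return sales


def download_sales(fetch_page, page_numbers):
    """Fetch and parse each page, logging failures instead of stopping."""
    sales, failed = [], []
    for page in page_numbers:
        try:
            sales.extend(parse_sales(fetch_page(page)))
        except (OSError, ET.ParseError, ValueError) as err:
            log.error("page %d failed: %s", page, err)
            failed.append(page)
    return sales, failed


def fake_fetch(page):
    """Stand-in for a real HTTP call, so the sketch runs offline."""
    if page == 2:
        raise OSError("connection timed out")
    return ("<sales><sale><county>Ceredigion</county>"
            "<price>150000</price></sale></sales>")


# Try a small subset of pages first, before committing to days of downloading.
sales, failed = download_sales(fake_fetch, range(1, 4))
print(len(sales), "records,", "failed pages:", failed)
```

Passing the fetching function in as an argument means the parsing and error handling can be tried out on a few pages, or on canned data, long before the full download starts, and the list of failed pages says what to retry.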
I hope they're having just as much fun on their other modules for this course. I'm really looking forward to running this one again next year.
Monday 9 March 2015
International Women's Day pub quiz
On Sunday 8th March 2015, Hannah Dee and I organised a pub quiz for International Women's Day. We wanted to highlight some famous women in science, but we didn't expect people to know much about famous women in science. So how to do a quiz? We themed 5 rounds around the women:
1) The Mary Anning fossil hunting round
A huge word search with many words related to Mary Anning's work and fossils to find (including "ichthyosaur" and "she sells sea shells", "on the sea shore").
2) The Amelia Earhart aviation round
Create paper aeroplanes that will travel from Europe (over here) to America (over there) and land within an area marked by a hula hoop. We should have had planes crossing the Atlantic in the other direction, but oh well, we're in west Wales.
3) The Caroline Herschel stargazing round
Early astronomy was often about spotting small differences in maps of the heavens. Thanks to heavens-above.com we had a copy of the sky map for the evening, and another copy that had been modified with gimp. Spot the difference! Three Gemini twins?
4) The Barbara McClintock genome round
Here we used C. Titus Brown's shotgunator to make a set of short reads from a few sentences about the work of Barbara McClintock. The teams had to assemble the genome to decipher the sentences. It must have seemed as if transposons were at work, because with a few repeated words the sentences they were constructing did get rather jumbled.
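The idea behind that round can be sketched in a few lines. This is not the shotgunator's actual code, just a rough illustration of the technique: take fixed-length substrings ("reads") from random positions in a text, then see why repeated words make reassembly ambiguous. The sentence, function names and parameters are all invented for the example.

```python
import random


def shotgun_reads(text, read_length, coverage, seed=0):
    """Sample fixed-length substrings of text at random start positions.

    coverage is roughly how many times each character is sampled
    on average across all the reads.
    """
    if not 0 < read_length <= len(text):
        raise ValueError("read_length must be between 1 and len(text)")
    rng = random.Random(seed)
    n_reads = max(1, coverage * len(text) // read_length)
    last_start = len(text) - read_length
    reads = []
    for _ in range(n_reads):
        start = rng.randint(0, last_start)
        reads.append(text[start:start + read_length])
    return reads


def occurrences(text, fragment):
    """Count the positions (overlaps included) where fragment occurs."""
    last_start = len(text) - len(fragment)
    return sum(text.startswith(fragment, i) for i in range(last_start + 1))


sentence = "jumping genes move and jumping genes copy"
reads = shotgun_reads(sentence, read_length=8, coverage=3)

# Reads that fit in more than one place are what jumble an assembly.
ambiguous = [r for r in reads if occurrences(sentence, r) > 1]
print(len(reads), "reads, of which", len(ambiguous), "are ambiguous")
```

Any read that falls entirely inside the repeated phrase could have come from either copy, so a team (or an assembler) has no way to tell which neighbours it belongs with.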
5) The Florence Nightingale data visualisation round
Finally the teams got to use a box of stuff (pipe cleaners, stickers, fluorescent paper, googly eyes, coloured pens) to make the most creative version of this year's HESA stats on women employed in higher education.
The scales of employment in HE
No trivia or celebrities in the quiz at all!