Showing posts with label computer science. Show all posts

Thursday, 27 April 2017

Ada Lovelace and her contribution to computer science

Ada Lovelace (1815-1852) is known as the first computer programmer. What did she really contribute to computing? So many blog posts, articles and books get bogged down describing her childhood, and her famous father and overbearing mother. But what did she actually do?

A while ago I got stuck in and read the paper she actually wrote: "Sketch of the Analytical Engine invented by Charles Babbage... with notes by the translator. Translated by Ada Lovelace".

In fact, she translated into English a paper by an Italian engineer, Luigi Menabrea, and clearly felt that his description did not go far enough, because she added her own notes at the end, which amount to more detail than the original translation.

The original paper by Menabrea was supposed to sell the idea of the Analytical Engine, a machine designed by Charles Babbage but not yet constructed. Money needed to be raised for this endeavour. Menabrea explains what the machine can do. To some extent he does try also to sell the idea, but his writing can be quite dry: 'To give an idea of this rapidity, we need only mention that Mr. Babbage believes he can, by his engine, form the product of two numbers, each containing twenty figures, in three minutes.'

Lovelace wanted to add more, and her writing glows with excitement about the machine. She added 7 detailed discussion notes, labelled A to G, and also 20 numbered footnotes. In the numbered footnotes she comments on issues with the translation. She also comments sometimes on how she feels that Menabrea hasn't completely understood the novelty and potential (for example 'M. Menabrea's opportunities were by no means such as could be adequate to afford him information on a point like this'). In the notes labelled A to G she explains many of the concepts that are the fundamentals of computing today.

Note A

This note spells out that the Analytical Engine is a general purpose computer. It doesn't just calculate fixed results, it analyses. It can be programmed. It's flexible. It links 'the operations of matter and the abstract mental processes of the most abstract branch of mathematical science'. This is not just a Difference Engine, but instead offers far more. She is very clear that 'the Analytical Engine does not occupy common ground with mere "calculating machines". It holds a position wholly its own'. It's also not just for calculations on numbers, but could operate on other values too. It might for instance compose elaborate pieces of music. 'The Analytical Engine is an embodying of the science of operations'. Lovelace has a beautiful turn of phrase that evokes her delight in imagining where this could go: it 'weaves algebraic patterns just as the Jacquard loom weaves flowers and leaves'.

Note B

This note is about memory, the 'storehouse' of the Engine. It is about variables, and retaining values in memory. She describes how the machine can retain separately and permanently any intermediate results, and then supply these intermediate results as inputs to further calculations. Any particular function that is to be calculated is described by its variables and its operators, and this gives the machine its generality.

Note C

This note is about reuse: the Engine uses input cards, as used in the Jacquard looms. Lovelace had toured the factories of the north with her mother and seen the Jacquard looms in action. They would have been the cutting edge of technology at the time. Intricate patterns were woven by machines, with thousands of fine threads, all controlled by punched cards. The cards needed to be created by card punching machines, with detailed tables of numbers that would be translated into the patterns of holes. The cards allowed a design to be repeated, and they were bound together in a sequence. She explains that these input cards can be reused 'any number of times successively in the solution of one problem'. And furthermore, that this applies to just one card, or a whole set of cards. The train of cards can be rewound until they are in the correct position to be used again.
Jacquard loom, Nottingham Industrial Museum

Punched cards used in Jacquard loom, Nottingham Industrial Museum
 

Note D

This note is about the order and efficiency of operations. She explains that there are input variables, intermediate variables, and final results. Every operation must have two variables 'brought into action' as inputs and one variable to use as the result. We can then trace back and inspect a calculation to see how a result was obtained. We can see how often a value was used. We can think about how to batch up operations to make them more efficient. She begins to imagine not just how the machine will operate, but how programmers will have to think carefully about their programs, how they will debug them and how they will optimise the algorithms they implement.
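To make Note D's bookkeeping concrete, here's a little modern sketch. The three-address tuple format is my own invention, not Lovelace's notation, but the principle is hers: every operation takes two input variables and writes one result, so a whole calculation can be traced and inspected afterwards.

```python
# Each operation is (op, input1, input2, result) -- two variables
# "brought into action" and one variable for the result, as in Note D.
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}

def run(program, variables):
    """Execute a list of (op, in1, in2, out) steps, recording a full trace."""
    trace = []
    for op, a, b, out in program:
        variables[out] = OPS[op](variables[a], variables[b])
        trace.append((op, a, b, out, variables[out]))
    return variables, trace

# Compute (x + y) * (x - y) as three-address steps.
program = [("+", "x", "y", "t1"), ("-", "x", "y", "t2"), ("*", "t1", "t2", "r")]
state, trace = run(program, {"x": 5, "y": 3})
# state["r"] is 16, and the trace shows where each intermediate came from.
```

The trace records how each intermediate variable was produced and consumed, which is exactly the kind of after-the-fact inspection Lovelace describes.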

Note E

In this note Lovelace explains how loops or 'cycles' can be used to solve series, for example to sum trigonometrical series. With a worked example (used in astronomical calculations), she abstracts out the operations needed to produce terms in the series and works on a notation for expressing cycles that include cycles. Note 18 expands further to explain that one loop can follow another, and there may be infinitely many loops.
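In modern terms, 'cycles that include cycles' are just nested loops. Here's a toy sketch (the series is an illustrative stand-in, not Menabrea's actual worked example): an outer cycle steps through a set of angles, and an inner cycle reuses the same operations to sum the terms of the series for each angle.

```python
import math

def series_value(coeffs, theta):
    """Sum a trigonometric series a_1*sin(theta) + a_2*sin(2*theta) + ..."""
    total = 0.0
    for n, a_n in enumerate(coeffs, start=1):  # inner cycle: one pass per term
        total += a_n * math.sin(n * theta)
    return total

def series_table(coeffs, thetas):
    """Outer cycle: evaluate the whole series once per angle."""
    return [series_value(coeffs, t) for t in thetas]
```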

Note F

This note begins with the statement 'There is in existence a beautiful woven portrait of Jacquard, in the fabrication of which 24,000 cards were required.' Babbage was so taken with this woven portrait that he bought one of the few copies produced. The amount of work that machines could do, unfailingly, without tiring, was changing the face of industry. The industrial revolution was sweeping through the country. Lovelace explains how the Engine will be able to solve long and intensive calculations with a minimum of cards (they can be rewound, and used in cycles). The machine can calculate a long series of results without making mistakes, solving problems 'which human brains find it difficult or impossible to work out unerringly'. It might even be set to work to solve as yet unsolved and arbitrary problems. 'We might even invent laws for series or formulae in an arbitrary manner, and set the engine to work upon them and thus deduce numerical results which we might not otherwise have thought of obtaining; but this would hardly perhaps in any instance be productive of any great practical utility, or calculated to rank higher than as a philosophical amusement.'
 

Note G 

She concludes with the scope and limitations of the Analytical Engine. She is keen to stress the machine's limitations, particularly where discovery is concerned: it doesn't create anything. It has 'no pretensions to originate anything'. Lovelace then enumerates what it can do, but warns against overhyping it. Alan Turing references her concerns in his 1950 paper Computing Machinery and Intelligence (which is also well worth a read). In this paper he discusses 9 objections that people may raise against the possibility of AI. One of these 9 objections is Lady Lovelace's Objection. He writes:
Our most detailed information of Babbage's Analytical Engine comes from a memoir by Lady Lovelace (1842). In it she states, "The Analytical Engine has no pretensions to originate anything. It can do whatever we know how to order it to perform" (her italics).
In Note G we also get the detailed trace of a substantial program to calculate the Bernoulli numbers, showing the order of operations combining variables to make results. The trace shows looping and storage, showing intermediate results, and the correspondence with the mathematical formulae being calculated at each step. She inspects the program for efficiency, working out the number of punched cards required, the number of variables needed and the numbers of execution steps.
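For the curious, the Bernoulli numbers she traced can be produced from a short recurrence. This sketch uses the modern convention and numbering (B1 = -1/2), whereas Lovelace's note numbers the terms differently, so it illustrates the calculation rather than reproducing her table.

```python
from fractions import Fraction
from math import comb

def bernoulli(m):
    """Bernoulli numbers B_0..B_m (modern convention), via the recurrence
    B_n = -1/(n+1) * sum_{k<n} C(n+1, k) * B_k, starting from B_0 = 1."""
    B = [Fraction(1)]
    for n in range(1, m + 1):
        s = sum(comb(n + 1, k) * B[k] for k in range(n))
        B.append(-s / (n + 1))
    return B

# bernoulli(4) gives [1, -1/2, 1/6, 0, -1/30]; odd numbers beyond B_1 are zero.
```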

Lovelace wonders, in the third paragraph of this Note, whether our minds are able to follow the analysis of the execution of the machine. Will we really be able to understand computers and their abilities? 'No reply, entirely satisfactory to all minds, can be given to this query, excepting the actual existence of the engine, and actual experience of its practical results.'

Ada Lovelace is buried in Hucknall Church, Nottingham. She died aged 36, and never got to see the Analytical Engine constructed.

Monday, 10 April 2017

Visiting JG Mainz University

I've just returned from a two week visit to Johannes Gutenberg University Mainz, where I was hosted by Andreas Karwath in the Data Mining group of the Dept of Informatik. I was there to pick up new research ideas, to get away from admin/teaching duties for a while and to share what we've been working on lately. 

The University at Mainz is a large campus based university on the outskirts of the town, on the beautiful river Rhine. It's named after the inventor of the printing press in the west, Johannes Gutenberg. The Gutenberg museum is excellent, tracing books from wax tablets through handwritten parchments, to moveable type print, then the industrial revolution, typewriters and high volume printing. This is a museum about the value of information and its transmission, and the technology invented to do this. There was even a special exhibition about the Futura font, as an added bonus. This is the geometric circles-and-lines font used in posters for "2001 A Space Odyssey", and on the Apollo 11 moon plaque ("We came in peace for all mankind"), and in so much future-looking advertising and propaganda in the Art Deco inspired 1930s era, both in Germany and beyond.

It was interesting to be embedded in a data mining group, rather than bioinformatics, for a change. They have a broad range of application areas, but also happily switch technology (neural nets, relational learning, topic models, matrix decomposition, graphs, rs-trees and more) as the application area needs. It was also very interesting to see how a different country's research culture differs. It's not REF-dominated like the UK, so they're more free to focus on quick-turnaround peer-reviewed compsci conference publications, and perhaps more hierarchically structured, as only the few professors have permanent positions. And yet it's still the same. University departments are international places and share much in common, whichever country you're in: same grant applications, student supervisions, seminar talks, dept silos, etc.

It was great fun to be there, and they were excellent hosts. At the same time, it was strange and sad to be a British person on exchange in Germany during the week that the UK sent article 50 to the EU. International collaborations are so important to research that leaving the EU is bound to be hugely detrimental to us UK academics. We need more exchange, not less.

Tuesday, 5 April 2016

Lovelace Colloquium 2016

This year's Lovelace Colloquium was held at Sheffield Hallam University, last week (March 31st). Sheffield Hallam proved to be a great venue. It's convenient for most people in the UK to get to, with a smart building right by the train station, providing a large poster-exhibiting hall, a modern lecture theatre and a cafeteria, all next to each other.


Every year the Lovelace is an inspiring event. I've now been (and blogged) in 2012, 2013, 2014 and 2015 and it gets bigger and better each year. I'm particularly impressed by the first year undergraduates who are up there presenting posters alongside everyone else, talking to employers and thinking about their future careers.


I didn't get to attend many of the talks this year because I spent more time on the desk and doing organisational jobs (and fixing my poster-numbering error, oops!). But there were some really strong posters covering a wide range of computer science topics, including several with live Arduino demos. The end of the day panel session featured questions and advice on where the field is going in the future, the pros and cons of a career in industry vs academia, the challenges of running your own business and questions about recruitment.

This year I was also impressed to chat to many interesting people during the evening social, for example Claire and Emily from Relish Learning. After graduating from uni they worked for others until deciding one day that they could do it themselves. They set up their own business in Sheffield, and now provide digital e-learning, for a wide range of topics. They described how they've been recently training people in the Army on how to change the wheel on a tank (imagine animations of the components required, and the order in which to remove parts, etc). They are now keen to help others to succeed, to encourage them to believe that they can and to talk to others about how they did it.

If you are not recommending this event to your women undergrads in computer science, then they are missing out. Poster presenters get expenses refunded and may come away with a prize, thanks to all the sponsors. Employers are keen to meet them, so they will also come away with contacts to help them apply for a job or placement. The photos of the event give a great impression of what it's really like if anyone needs any further reassurance.

Other summaries of the day:

Saturday, 30 January 2016

Playful coding: computing activities for schools

In many schools, computing is a topic that needs more encouragement. The Playful Coding project wants to make practical activities that can be run in schools to explore ideas in computer science. I've just been along to one of their meetings and seen it in action. It was extremely inspiring to be in a room full of people who didn't see running computing engagement activities as a chore, but as fun. They had all put a lot of thought into making fun activities and all wanted to run their activities with the groups of children.
 

It's an EU project involving teachers and university researchers from Spain, Romania, Italy, France and us in Aberystwyth, Wales. Each project partner had developed several activities and the purpose of the meeting was to test out many of these activities on children and their teachers, and to start to develop a guide for teachers to explain how to use them. Until that guide is produced, you can still browse the activities and have a go with them. Try out for example:
There are lots more activities to choose from. Some just take an hour, some take a day, and some span a term. Some use robots, some use no particular equipment. They can be embedded into other lessons (languages, maths, science, art), or just standalone. And they are adaptable for different age ranges.

To follow the project see the Playful Coding website, follow #playfulcoding on Twitter or find Playful Coding on Facebook.

Saturday, 5 September 2015

Notes from workshop on Computational Statistics and Machine Learning

I've just attended "Autonomous Citizens: Algorithms for Tomorrow's Society", a workshop as part of the Network on Computational Statistics and Machine Learning (NCSML). That's an ambitious title for a workshop! Autonomous Citizens are not going to hit the streets any time soon. The futuristic goals of Artificial Intelligence are still some way off. Robots are still clumsy, expensive and inflexible. But AI has changed dramatically since I was a student. Back in the days when computational power was more limited, AI was mostly about hand-coding knowledge into expert systems, grammars, state machines and rule bases. Now almost any form of intelligent behaviour from Google translation to Facebook face recognition makes heavy use of computational statistics to infer knowledge.

Posters: there were some really good posters and poster presenters who did a great job of explaining their work to me. In particular I'd like to read more about:
  • A Probabilistic Context-Sensitive Model of Subsequences (Jaroslav Fowkes, Charles Sutton): a method for finding frequent interesting subsequences. Other methods based on association mining give lots of frequent but uninteresting subsequences. Instead, they define a generative model and use EM to infer its parameters from the data. 
  • Canonical Correlation Forests (Tom Rainforth, Frank Wood): a replacement for random forests that projects (a bootstrap sample of) the data into a different coordinate space using Canonical Correlation Analysis before making the decision nodes.
  • Algorithmic Design for Big Data (Murray Pollock et al): Retrospective Monte Carlo. Monte Carlo algorithms with reordered steps. There are stochastic steps and deterministic steps. The order can have a huge effect on efficiency. His analogy went as follows: imagine you've set a quiz with a right answer and a wrong answer. People submit responses and you need to choose a winner. You could first sort them all into two piles (correct, wrong) and then pick a winner from the correct pile (deterministic first, then stochastic). Or you could just randomly sample from all results until you get a winner (stochastic first). The second will be quicker.
  • MAP for Dirichlet Process Mixtures (Alexis Boukouvalas et al): a method for creating a Dirichlet Process Mixture model. This is useful as a k-means replacement where you don't know in advance what k should be, and where your clusters are not necessarily spherical.
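The quiz analogy is easy to play with in code. In this toy sketch (names and setup invented for illustration), the deterministic-first strategy always touches all n responses, while the stochastic-first strategy touches about 1/p responses on average, where p is the fraction of correct answers:

```python
import random

def deterministic_first(responses):
    """Sort every response into correct/wrong piles, then draw a winner."""
    correct = [i for i, ok in enumerate(responses) if ok]  # touches all n responses
    return random.choice(correct), len(responses)          # (winner, work done)

def stochastic_first(responses):
    """Draw at random until we happen upon a correct response."""
    work = 0
    while True:
        work += 1
        i = random.randrange(len(responses))
        if responses[i]:
            return i, work  # expected work is 1/p, independent of n

responses = [True] * 500 + [False] * 500  # half the entries are correct (p = 0.5)
```

With p = 0.5 the stochastic-first route needs only two draws on average, against a thousand inspections for the deterministic-first route.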
Talks: these were mostly full talks (approx one hour), but then we had a short introduction to the Alan Turing Institute with Q&A at the end.

The first talk presented the idea of an Automated Statistician (Zoubin Ghahramani). Throw your time series data at the automated statistician and it'll give you back a report in natural language (English) explaining the trends and extending a prediction for the future. The idea is really nice. He has defined a language for representing a family of statistical models, a search procedure to find the best combination of models to fit your data, an evaluation method so that it knows when to stop searching, and a procedure to interpret/translate the models and explain the results. His language of models is based on Gaussian processes with a variety of interesting kernels, together with addition and multiplication as operators on models, and also allowing change points, so we can shift from one model combination to another at a given timepoint.
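The flavour of such a language of models can be sketched in a few lines (this is my own toy illustration, not the actual system): base kernels are functions of two inputs, and addition and multiplication combine them into structured models.

```python
import math

# Toy compositional kernel language: base kernels plus + and * as operators.
def rbf(ell):
    """Smooth local-similarity kernel with lengthscale ell."""
    return lambda x, y: math.exp(-((x - y) ** 2) / (2 * ell ** 2))

def periodic(period):
    """Kernel that treats points one period apart as similar."""
    return lambda x, y: math.exp(-2 * math.sin(math.pi * abs(x - y) / period) ** 2)

def add(k1, k2):
    return lambda x, y: k1(x, y) + k2(x, y)

def mul(k1, k2):
    return lambda x, y: k1(x, y) * k2(x, y)

# e.g. a smooth trend plus a periodic component:
k = add(rbf(1.0), periodic(0.5))
```

A search over expressions built from these operators, scored against the data, is then what picks out the combination to report back in English.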

The next two talks were about robots, which are perhaps the ultimate autonomous citizens. Marc Deisenroth spoke about using reinforcement learning and Bayesian optimisation as two methods for speeding up learning in robots (presented with fun videos showing learning of pendulum swinging, valve control and walking motion). He works on minimising the expected cost of the policy function in reinforcement learning. His themes of using Gaussian processes, and of using knowledge of uncertainty to help determine which new points to sample, were also reflected in the next talk by Jeremy Wyatt about robots that reason with uncertain and incomplete information. He uses epistemic predicates (know, assumption), and has probabilities associated with his robot's rule base so that it can represent uncertainty. If incoming data from sensors may be faulty, then that probability should be part of the decision making process.

Next was Steve Roberts, who described working with crowd sourced data (from sites such as zooniverse), real citizens rather than automated ones. He deals with unreliable worker responses and large datasets. People vary in their reliability, and he needs to increase accuracy of results and also use their time effectively. The data to be labelled has a prior probability distribution. Each person also has a confusion matrix, describing how they label objects. These confusion matrices can be inspected, and in fact form clusters representing characteristics of the people (optimist, pessimist, sensible, etc). There are many potential uses for understanding how people label the data. Along the way, he mentioned that Gibbs sampling is a good method but is too slow for his large data, so he uses Variational Bayes, because the approximations work for this scenario.
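The way a per-worker confusion matrix enters the calculation can be sketched with a simple Bayesian combination, in the spirit of Dawid and Skene (the matrices below are made up for illustration): the posterior over the true label is the prior times each worker's probability of reporting what they reported.

```python
# confusion[true][observed] = P(worker reports `observed` | true label)

def posterior(prior, confusions, observed):
    """Posterior over the true label given each worker's reported label."""
    scores = []
    for t in range(len(prior)):
        p = prior[t]
        for conf, obs in zip(confusions, observed):
            p *= conf[t][obs]  # each worker's evidence, weighted by reliability
        scores.append(p)
    z = sum(scores)
    return [s / z for s in scores]

# Two workers: a "sensible" labeller and an "optimist" biased towards label 1.
sensible = [[0.9, 0.1], [0.1, 0.9]]
optimist = [[0.5, 0.5], [0.1, 0.9]]
post = posterior([0.5, 0.5], [sensible, optimist], observed=[0, 1])
# The sensible worker's vote for label 0 outweighs the optimist's vote for 1.
```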

Finally, we heard from Howard Covington, who introduced the new Alan Turing Institute which aims to be the UK's national institute for Data Science. This is brand new, and currently only has 4 employees. There will eventually be a new building for this institute, in London, opposite the Crick Institute. It's good to see that computer science, maths and stats now have a discipline-specific institute and will have more visibility from this. However, it's an institute belonging to 5 universities: Oxford, Cambridge, UCL, Edinburgh and Warwick, each of which has contributed £5million. How the rest of us get to join in with the national institute is not yet clear (Howard Covington was vague: "later"). For now, we can join the scoping workshops that discuss the areas of research that are relevant to the institute. The website, which has only been up for 4 weeks so far, has a list of these, but no joining information. Presumably, email the coordinator of a workshop if you're interested. The Institute aims to have 200 staff in London (from Profs to PhDs, including administrators). They're looking for research fellows now (Autumn 2015), and PhDs soon. Faculty from the 5 unis will be seconded there for periods of time, paid for by the institute. There will be a formal launch party in November.

Next year, the NCSML workshop will be in Edinburgh.

Thursday, 21 May 2015

How much is enough?

How much is enough? This question seems to crop up very frequently when analysing data. For example:
  • "How much data do I need to label in order to train a machine learning algorithm to recognise place names that locate newspaper articles?"
  • "Is my metagenome assembly good enough or do we need longer/fewer contigs?"
  • "What BLAST/RAPSearch threshold is close enough?"
  • "Are the k-mers long enough or short enough? (for taxon identification, for sequence assembly)"
Sadly there's no absolute answer to any of these. It depends. It depends on what your data looks like and what you want from the result. It also depends on how much time you have. It depends what question you really wanted to answer. What's the final goal of the work?

Sometimes there are numbers to report, measures that give us an idea of whether the process was good enough, after we've done the expensive computation. We can report various statistics about how good the result is, such as the N50 and its friends for sequence assembly, or the predictive accuracy for a newspaper article place name labeller. Which statistics to report is itself highly questionable. Does a single figure such as the N50 really tell us anything useful about the assembled sequence? It can't tell us which parts were good and which parts were messy. Do we really need lots of long contigs if we're assembling a metagenome? Perhaps the assembly is just an input to many further pipeline stages, and actually, choppy short contigs will do just fine for the next stage.
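For anyone who hasn't met it, the N50 is quickly defined in code: it's the length L such that contigs of length at least L cover at least half the assembly.

```python
def n50(contig_lengths):
    """N50: walk contigs from longest to shortest, and return the length of
    the contig at which the running total reaches half the assembly size."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
```

Note how much it hides: assemblies with very different contig-length distributions can share the same N50, which is the post's point about single-figure summaries.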

PAC learning theory was an attempt in 1984 by Leslie Valiant to address the questions about what was theoretically possible with data and machine learning. For what kinds of problem can we learn good hypotheses in a reasonable amount of time (hypotheses that are Probably Approximately Correct)? This led on to the question of how much data is enough to make a good job of machine learning? Some nice blog posts describing PAC learning theory and how much data is needed to ensure low error do a far better job than I could of explaining the theory. However, the basic theory assumes nice clean noise-free data and assumes that the problem is actually learnable (it also tends to overestimate the amount of data we'd actually need). In the real world the data is far from clean, and the problem might never be learnable in the format that we've described or in the language we're using to create hypotheses. We're looking for a hypothesis in a space of hypotheses, but we don't know if the space is sensible. We could be like the drunk looking for his keys under the lamppost because the light is better there.
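For a finite hypothesis space in the clean, realizable setting, the textbook bound is m >= (1/epsilon)(ln|H| + ln(1/delta)). A quick calculator makes the point (and shows how readily it overestimates):

```python
import math

def pac_sample_bound(hypothesis_space_size, epsilon, delta):
    """Textbook PAC bound for a finite hypothesis class (realizable case):
    with at least this many examples, any hypothesis consistent with the
    data has error < epsilon with probability at least 1 - delta."""
    return math.ceil(
        (math.log(hypothesis_space_size) + math.log(1 / delta)) / epsilon
    )

# e.g. a million hypotheses, 5% error, 95% confidence:
m = pac_sample_bound(10**6, epsilon=0.05, delta=0.05)
```

The bound only grows logarithmically with the size of the hypothesis space, but it says nothing about noisy data or a badly chosen hypothesis language, which is exactly the gap the paragraph above describes.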

Perhaps there will be more theoretical advances in the future that tell us what kinds of genomic analysis are theoretically possible, and how much data they'd need, and what parameters to provide before we start. It's likely that this theory, like PAC theory, will only be able to tell us part of the story.

So if theory can't tell us how much is enough, then we have to empirically test and measure. But if we're still not sure how much is enough, then we're probably just not asking the right question.

Wednesday, 22 April 2015

Computer Science and Lindy Hop

It would seem that Lindy Hop is the dance of computing people, physicists and engineers. If you go to any swing dance camp, an unreasonable proportion of the people in the room will be somehow involved in IT. We have Lindy hoppers who have used Android phones with sensors and Fourier transforms to look at the pulse of the dance, or used Lindy to illustrate quantum computing, and there is even a specific Lindy dance class for engineers. Sam Carroll described how digital media savvy the community was and is, in her Step Stealing work.

Okay, so people need money to go to dance camps, and computing professions generally pay well. And it gets us away from our desks and having fun with other people and music. However, these can't be the only reasons.

I think that I enjoy Lindy for lots of the same reasons that I enjoy computing. They're both about creating complex structures that are somehow beautiful. By complex structures I mean structures that are complicated enough that they make me feel pleased when I finally successfully make them work. By beautiful I mean code/dance/ideas that become elegant because of their appropriateness in that particular situation. And in both computing and Lindy I enjoy the reusable patterns. Reusable patterns in rhythm are like reusable patterns in computing: once you've understood them, they stay with you and can often tell you something more abstract about what you're trying to do.

So I think that computing and Lindy have more in common than just having fun. They also share reusable beautiful complexity.


Added note: If you want to try it out, come and join our Vintage Swinging in the Rain party on Friday 24th April, 8pm, Marine Hotel, Aberystwyth. There's a short dance class for beginners at about 8:30, and live music from The Paper Moon Band.

Tuesday, 31 March 2015

Final year computer science projects 2015

Computer Science is a diverse subject and this is reflected in the final year projects that our undergraduates undertake. This year, the final year project students that I supervise have chosen the following:
  • A simulation of Babbage's Analytical Engine to be used as an online educational tool. As the first design for a general purpose programmable computer, it's of huge importance, but there are few good resources to help explain it to the general public. This site will include some history about Babbage and Lovelace, and an interactive game where you get to program the engine. Technologies involved include client side web programming, expression parsing, and 3D graphics. (Rhian Watkins)
  • Geotagging of the digitised newspaper articles in the collection of the National Library of Wales. A mention of a placename in an article does not necessarily mean that the article is about that place (for example an article about "the Duchess of York"). This project uses a gazetteer from Open Street Map data, NLP to extract features, and then various machine learning algorithms to see if we can tell which placenames are relevant and which are not. (Sean Sapstead)
  • A version control system for DNA. Software version control systems are not so useful for storing details of whole genomes and the modifications made to them. We want to explore Darcs-like patches, and use of sequence alignment tools to help record and inspect DNA modifications, and to be able to apply multiple modifications in a different order. (Thomas Hull)
  • A tool for demonstrating the differences between two DNA sequences as audio/music and as animations. How can we show the public the differences between two strains of the Ebola virus, or the mutations in BRCA1 that can cause cancer? Sequence alignment tools and different translations and representations of the DNA strings are the key to this problem. (Andrew Poll, his project blog)
  • The Happy Cow Game: an online collaborative game that represents the process of feeding a cow with the correct balance of foodstuffs to optimise its health, meat and milk. This was initially developed as a board game by veterinary lecturer Gabriel de la Fuente Oliver, and he's now helping us to turn it into an online game. Technologies involved include the Ruby on Rails framework, client side web tools and libraries, and a detailed understanding of how to make a complex game playable. (Simeon Smith)
  • A tournament seeding tool for online gamers as part of Aber Community of Gamers. This one's going to collect data about previous games played, using APIs for the various online gaming platforms and then use interesting seeding algorithms to make sure that tournaments are fair and balanced. It's also producing the web site to support the community, using the Laravel framework. (Nathan Hand)
Andrew shows his prototype code for turning DNA differences into sound and animations to children in Science Week

Friday, 18 July 2014

Microscope webcam microtitre plate reading using image analysis

An A-level student has just spent two weeks with us for his work experience, and his project has been to investigate the use of a cheap microscope webcam as an alternative to an expensive plate reader for the measurement of the growth of yeast in microtitre plates. The longer term aim would be to mount this webcam on the deck of our Tecan Genesis liquid handler robot, and to have the robot arm move the plate under the webcam.

The webcam is a Veho VMS-004, used at 20x magnification, and it costs just £40. It was recognised automatically by Linux as a webcam and worked really well with the OpenCV library.
Robert Buchan-Terrey did an excellent job in interdisciplinary science in just two weeks, including the following:
  • Preparing media and growing yeast in our lab
  • Pipetting the yeast to make dilutions
  • Using the microscope webcam, taking images of the wells in the plate at intervals throughout the day, and corresponding plate readings with a real plate reader
  • Coding using Python and OpenCV to process the images (find the circular well, work out the average pixel intensity in the well)
  • Data analysis and stats to understand the results
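The averaging step can be sketched without OpenCV at all (in the actual project OpenCV located the wells; this standalone illustration just shows the idea): mask out a circle and take the mean of the pixels inside it.

```python
def well_mean_intensity(image, cx, cy, r):
    """Mean pixel intensity inside a circle of radius r centred at (cx, cy).
    `image` is a list of rows of grayscale values."""
    total, count = 0, 0
    for y, row in enumerate(image):
        for x, value in enumerate(row):
            if (x - cx) ** 2 + (y - cy) ** 2 <= r ** 2:
                total += value
                count += 1
    return total / count
```

Denser yeast scatters more light, so as the culture grows this average tracks the optical density that a plate reader would report.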
He also produced a poster to demonstrate the findings and to take back to his school.

And the answer is: although he's just analysed the data from one time point so far, and we took no care to make sure the lighting conditions were stable when taking the images, or to shake the plates to evenly disperse the yeast, it really does look very plausible that we could use this in future. Averaging over 8 replicate wells gives a remarkable correspondence between image-analysis results and plate reader results. Individual wells are more variable, but still show promise. We've yet to test all the data, and to test the full range of the scale of optical density, but this looks extremely exciting.

Thanks very much to Wayne Aubrey and Hannah Dee for their help and expertise with the yeast biology and the image processing respectively.

Friday, 13 June 2014

Sewable wearable computing

I gave this as a very short talk at the recent BCS Mid-Wales Show and Tell. So I'm describing it here in case it's of use to others.  The presentation was mostly a collection of rather large photos but it can be downloaded at http://figshare.com/articles/Sewable_Wearable_Computing/1056536. I was inspired to give it a go by Charlotte Godley (@charwarz on Twitter) who ran a wearables workshop for Girl Guides using Adafruit Gemmas.  These are small Arduino chips mounted on a base, made by a company called Adafruit. The base has holes so that you can sew it onto items of clothing. Slide 2 shows a Gemma attached by crocodile clips to an LED, a light that can be programmed to flash in any colour. The Gemmas are very cheap, only £6.50, so you can safely play with electronics without spending too much if you break it.



Steel thread can be used to sew your Gemma to its LEDs. This thread conducts electricity, and so replaces the crocodile clips (or soldered connections) when you want to make wearable electronics. It's not that easy to sew: it doesn't bend and tie as easily as ordinary thread, it can come undone, and it can cause short circuits where it crosses other pieces of thread. Also, the longer the thread, the higher its resistance.
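To give a feel for why thread length matters, here's a back-of-the-envelope Ohm's-law calculation. The resistance-per-metre figure is a nominal assumption for illustration, not a measured value for any particular thread:

```python
# Rough illustration of why thread length matters (all figures nominal).
OHMS_PER_METRE = 10.0   # assumed resistance of conductive steel thread
LED_CURRENT_A = 0.02    # roughly 20 mA for a typical LED

for length_m in (0.1, 0.5, 1.0):
    resistance = OHMS_PER_METRE * length_m
    drop_v = resistance * LED_CURRENT_A   # Ohm's law: V = I * R
    print(f"{length_m} m -> {resistance} ohm, {drop_v:.2f} V drop")
```

On a battery-powered wearable running at around 3 V, a few tenths of a volt lost along a long run of thread is enough to noticeably dim an LED, which is why short, direct stitching lines work best.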



After the Gemma, I got a Flora, it's bigger sister. This costs approx £20, and has more available connections. I wanted to attach an accelerometer and lights, and have the lights flash different colours in order to demonstrate the x, y or z direction of movement. The presentation shows how I put it together and how much it cost. You can see how it was stitched, how I used nail varnish to stop the knots in the ends of the steel thread from unravelling, and how it gets programmed using the Arduino environment.
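The axis-to-colour logic is simple enough to sketch. The real program was written in the Arduino environment, so this Python version is just an illustration of the idea, and the colour assignments are my own invented choices:

```python
def axis_colour(x, y, z):
    """Map the dominant acceleration axis to an RGB colour.

    A sketch of the logic only: the actual Flora sketch runs in the
    Arduino environment, and the colours (red = x, green = y,
    blue = z) are illustrative, not the ones used on the day.
    """
    magnitudes = {"x": abs(x), "y": abs(y), "z": abs(z)}
    dominant = max(magnitudes, key=magnitudes.get)
    return {"x": (255, 0, 0), "y": (0, 255, 0), "z": (0, 0, 255)}[dominant]

print(axis_colour(0.1, 2.3, 0.4))  # mostly y-axis movement -> (0, 255, 0)
```

On the device itself, the same comparison would run in the main loop, reading the accelerometer each tick and pushing the chosen colour to the LEDs.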



The Adafruit site has a lot of useful explanations and videos about the Gemma and the Flora. You can buy the equipment from many sites (I used Phenoptix and 4tronix, both were good).


Overall, it was fiddly, but a lot of fun. Computing projects that have an element of real hardware, poor connections, and many parts, each of which could be wrong, are always harder to debug than software. I've never really played with electronics before, and I ended up learning a lot even in this simple project, about power, resistance and circuits. And I now have a thing with flashing lights that I can wear on my leg while dancing the Charleston.

Thursday, 6 February 2014

Final year computer science projects

This term I'm supervising 5 final year student projects. They have one term (just over 3 months) in which to really create something they can be proud of. It's not long, especially when they have to get to grips with new terminology, new equipment, reading and research, specification, design, implementation and experiments, testing, documentation and reflection. They're also asked to keep progress diaries or blogs and use version control to manage their projects.

I'm very excited about the projects this year; they're all diverse and fun:

  • Using information theory and machine learning to recognise quality control issues in next gen sequencing lanelets (together with Joshua Randall, Sanger Institute)
  • Analysis of the content of NLW's digitised newspapers collection
  • Modelling insect choice among plants (together with Lizzy Donkin and John Warren in IBERS)
  • A model of the brain of C. elegans that we can play with as a talking point in AI discussions with the public.
  • Using image processing and microscope webcams to determine the growth of yeast in microtitre plates.

Hopefully, we'll have some good results by May!

Bioinformatics and computational biology: 500 years of exciting problems?


I gave a talk at Warwick University, Department of Computer Science in January 2014. A look at the intertwining of computer science and biology from the days of Turing through the present and on to the future, including some of my research along the way. 
In a 1993 interview, Donald Knuth worried that computer science in the future would be "pretty much working on refinements of well-explored things", whereas "Biology easily has 500 years of exciting problems to work on". I described some of the bioinformatics and computational biology that I've been working on. My talk included a little about where the field has come from, where it's going in the future, and whether it should be considered a branch of computer science at all.
The slides are online at Figshare.

Thursday, 26 September 2013

ECML PKDD 2013

Machine learning, data mining and statistical data analysis are clearly popular areas now, judging by the number of attendees at this year's European conference, ECMLPKDD 2013.

It's been a long time since I last attended (2001, when I presented multi-label classification via a modification to C4.5). I think the field has grown and matured a lot. There are far fewer papers now showing the results of a new algorithm on 10 different UCI datasets. There is far more presence from people in industry. And industry is varied: search engines, internet shopping and finance. Yahoo, Amazon, Zalando, Deloitte and many others sponsored the conference and sent people to speak or attend. There was an "industry track", and that room was full.

Themes that I picked up on were: regression (still popular!), lots of tensors and matrices, numerical analysis methods for large data sets, network mining, sequence mining, and generally using ML/DM to influence people (buying, voting, doing good, giving your system feedback).

The organisers this year have really done a good job: working wifi, lots of food and coffee, sessions running on time, plenty of mingling time, and choosing a venue in a beautiful city, with accommodation in a wide range of hotels within easy walking distance booked as an easy part of the registration process. It is appreciated!


Diversity is something that the ECMLPKDD community have started to work on improving. It has the usual male/female imbalance of a technical conference. Perhaps slightly more women than I expected, or maybe I'm just getting used to this. I'd hazard a guess at about 20% or a bit less. But next year's organising committee are more gender-balanced, and there will also be a Diversity Chair to keep an eye on the issue.

Openness of code and data is something else the community are working on improving. This year for the first time they had an award for "Open Science", and encouraged paper submissions to include a link to code/data. In order to award this, the organisers had to download, compile, run and test lots of submitted code. I don't know which of the organisers did this onerous task, but I'm very pleased they did.

If I had to point out one thing that could still be improved, my number 1 would be that the proceedings are owned by Springer, and are not open. For reasons known only to Springer, I can't make an account with them or reactivate an existing account. Maybe Springer will reply to my email eventually. But if the proceedings were open access (papers deposited at arXiv, for example) then this would really benefit the ML/DM community and others, and more widely promote the work of everyone who presented.

Next year, 2014, the conference moves to Nancy, France, a city with Art Nouveau architecture, and with many fine wines. www.ecmlpkdd2014.org

Friday, 23 August 2013

Venn diagrams

I'd always thought that drawing Venn diagrams was quite trivial, until I needed to create some recently. They are trivial if we only have 2 equal-sized sets. If we have 3 sets, and we want the circle sizes to represent the number of data items in each set, then the layout algorithm is more complex. Luckily there's a great package for Python called matplotlib-venn which does exactly what I wanted.
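For the 3-set case, matplotlib-venn's venn3 can be driven either by the three sets themselves or by the sizes of the seven disjoint regions. As a sketch (the region ordering below follows the package's documentation, so it's worth double-checking against your installed version), plain Python set operations are enough to compute those sizes:

```python
def venn3_region_sizes(a, b, c):
    """Sizes of the 7 disjoint regions of a 3-set Venn diagram, in the
    (Abc, aBc, ABc, abC, AbC, aBC, ABC) order that matplotlib-venn's
    venn3 documents for its subsets tuple."""
    return (
        len(a - b - c),    # only A
        len(b - a - c),    # only B
        len((a & b) - c),  # A and B only
        len(c - a - b),    # only C
        len((a & c) - b),  # A and C only
        len((b & c) - a),  # B and C only
        len(a & b & c),    # all three
    )

a, b, c = {1, 2, 3, 4}, {3, 4, 5}, {4, 5, 6, 7}
print(venn3_region_sizes(a, b, c))  # prints (2, 0, 1, 2, 0, 1, 1)
```

Passing a tuple like this to venn3 is what lets the package scale each region to match the data.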


However, if I have 4 sets, then it gets more complicated (and isn't yet handled by matplotlib-venn). A Venn diagram must show all possible intersections. Venn used overlapping ellipses to show how this could be achieved.

Diagram by RupertMillard from Wikimedia Commons

This is where Venn diagrams differ from Euler diagrams. Euler diagrams don't show empty intersections, so they can look much simpler than Venn diagrams, and can contain fully-nested circles. There are Venn diagrams that can represent all the overlaps of 5 or 6 sets, but we'd end up with some extremely complex diagrams that don't really aid the visualisation of our data.

Update (9 Oct 2013): Here's a Javascript/D3.js interactive version with more thoughts on why it's a difficult problem.

Update (20th March 2014): Here are a couple of completely over the top diagrams: a pine tree and a banana. Venn or Euler?

Monday, 12 August 2013

Reviewing for triple-blind conferences

Computer Science is different to some other fields of research in that it uses refereed conferences as the main quick-turnaround publication venue. There are many excellent computer science conferences with high-quality submissions and high-quality reviewing. Submitting your paper to a CS journal can take years (though it gives you a dialogue, where you get a chance to fix any problems), but a conference will give you a straight yes-or-no decision, and good feedback, within around 3 months. And if accepted, your work gets the publicity of a presentation too.

However, for the reviewers this can mean a heavy load if the organisers don't get enough reviewers together. I have just finished reviewing my allocation of 9 papers for ICDM 2013. Some people would only review 10 papers in an entire year! There were some excellent submissions amongst my 9 and it's going to be strongly competitive this year. Nine is a rather tough load, but one of the reasons I do reviews is because it makes me read more widely in areas tangential to mine, and it keeps me up to date. It's also a payback for all the reviews that others have done for me.

But ICDM has a triple-blind submission procedure. The authors don't include their names, and in fact they have to try to remove any evidence of themselves from the paper ("We extended the work of Smith et al" rather than "We extended our previous work"). They don't get to know who wrote their reviews, and even the programme committee co-chairs don't get to know the identities of the authors or reviewers until after the decisions have all been made.  This is supposed to help us be fairer. So we won't be unduly influenced by big names or previous work. However, in practice, it's almost always possible to work it out, to recognise citations, writing style, working area, etc. People who are proud of their work will possibly try to leave you clues anyway. Very little work in a research group is done in isolation of previous work in that area. Data mining researchers don't make up such a large community that this could be hidden.

In fact I'd argue that the triple-blind process is counter-productive. When other reviewers can see my name and we have to debate any disagreements over our reviews, I feel my reputation is at stake if I give a poor review. So I'll go to lengths to make sure everything I say is fair. If my identity is known to the authors then, likewise, I'll make extra sure that I'm not just seen as complaining, but actually giving them useful advice. We encourage better review quality by making it open. And given that I can guess most of the authors anyway, there's little point in hiding their names. All it does is stop them from making their code and data available for inspection. Some of the papers said "A link to the code will be inserted after the review process" and some papers had no mention of making any data/code available at all. They were discouraged from doing so by the blind review, rather than encouraged.

Instead of closing and anonymising the system (blind, double-blind, now triple-blind), I think it would be more productive to open it. Would it be any worse if we did? There are certainly ways in which it would be better.

Tuesday, 26 June 2012

Day 1 of The Alan Turing Centenary Conference in Manchester

I've been attending the Alan Turing Centenary Conference in Manchester. The conference was held in the huge and beautiful Manchester Town Hall, and it felt very appropriate to hear these talks in the city where Turing worked. The conference ran over a weekend, including Saturday 23rd June, Turing's 100th birthday. There were too many talks to write about all of them, but here are a few comments on some of the talks from Day 1.

Vint Cerf is Chief Internet Evangelist for Google, one of the creators of the internet and of TCP/IP. He described the early days of the internet, how they had to make decisions about the size of the address space they would need in order to address each of the devices connected to the internet. Imagine their problem in the 70s, trying to work out how many computers there would be in the future. Now there are about 2-3 billion internet devices connected at any one time, and a load more devices only intermittently connected. Everything is now becoming networked, including Cerf's wine cellar, which of course has a sensor network measuring temperature, humidity and light conditions, and will send an alert to his mobile phone if his wine becomes too warm. See Vint Cerf's talk.

Yuri Matiyasevich talked about number theory, yet another of Turing's many fields of work. Turing wrote a few papers in number theory, looking at the Riemann hypothesis. While telling us about the Riemann hypothesis, Matiyasevich told us about Skewes's number, "the largest number that served any definite purpose in mathematics" according to Hardy, an enormous upper bound of 10^10^10^34. In 1939, Turing applied for a grant from the Royal Society for just £40 to calculate information about zeros in the Riemann zeta function. This rather small amount of money was to buy parts in order to adapt some machines for predicting tides to the purpose he had in mind. Matiyasevich showed a picture of a tide prediction machine, and it instantly gives an impression of the "technology" that Turing was working with at the time. Apparently, Turing was fine with the fact that his machine gave its output in base 32, stating that he was "sufficiently familiar" with reading numbers in this form that it wasn't a problem. See Yuri Matiyasevich's talk.

David Ferrucci, the main man behind the Watson system that beat humans at the American quiz show 'Jeopardy!', talked about the creation of Watson. Although Jeopardy isn't a Turing test, it is a test of general knowledge, and in this case a machine was competing against humans. His team of about 20 people put together many components (parsers, entity and relation extraction, a knowledge base, search techniques, etc.) and then tested the system as a whole as the primary metric, not any one component. Initially it took 2 hours to answer a question, and some serious parallelisation of tasks was needed to make it fast enough for the game show. He gave lots of examples of questions and answers in his talk, including a question that asked for the common theme between "shirt", "telephone" and "TV remote control", to which the answer was "buttons", though "made with petrochemicals" and "made by humans" were also valid answers suggested by Watson. See David Ferrucci's talk.

Fred Brooks talked about Turing's Pilot ACE, a prototype of the ACE computer that Turing designed and wanted to build. Why did Turing's computer not take off? He suggested 3 main reasons: Turing didn't publish his 1945 proposal; the machine was too late by the time it was finally built; and he didn't make it easy to program (he didn't foresee how important that would be). Again we get the impression of Turing's amazing ability with numbers. Using backwards binary with the lower-order bits on the left was no problem to him, but was not so easy for others. ACE also had other peculiarities: no accumulator, no general multiply, no straightforward conditional branch. Brooks recommended a new book by Simon Lavington, "Alan Turing and his contemporaries: Building the world's first computers", and urged everyone to read von Neumann's EDVAC report on a much more practical machine. You can now see the Pilot ACE in the Science Museum in London. See Fred Brooks' talk.

Even after day 1, the conference was already giving a wonderful picture of how diverse Turing's contributions to computer science and mathematics were, and how far-reaching his ideas would be.

See also my writeup of Day 2 and Day 3 of the conference, and the online video recordings of the talks.

Day 2 of the Alan Turing Centenary Conference in Manchester

On the second day of the Turing Centenary Conference in Manchester there were more great talks. The screen showing the presentations was flanked by a large image of Alan Turing, reclining in a chair, reading a book. It was rather like having him present, listening to the talks.

Tony Hoare asked the question "Can computers understand their own programs?" and proceeded to answer it with an alternative to the Turing test. In the same way that "Can a machine be intelligent?" cannot be answered without a definition of "intelligent", Hoare's question cannot be answered without a definition of "understand". Instead, if a computer can answer questions posed to it about its own program, giving correct and useful answers and explaining its reasoning, then we may conclude that it understands its program. These questions might be, for example: "Will this change make the program run more slowly?". He also finished his talk with a hint at his current work on algebras for programming, by asking whether programming will become part of recognised formal logic in the future. See Tony Hoare's talk.

Manuela Veloso is a robotics researcher whose lab contained 116 robots at the last count in 2011. Her talk gave a tour of her work. This included planning and replanning in domains like robot football; purposeful perception (we only see/perceive what we need for the task in hand: if you arrive at an airport, you won't notice what colour the seats in the lounge are, because you're busy looking for a sign saying "Exit" or "Baggage"); robots that cohabit with humans and ask humans for help; and discussions about whether both robots and humans now use off-board AI. If Veloso and her robots had come to my school when I was young, I'm sure I would have wanted to be a roboticist immediately. In one visit to a school, a 6-year-old asked "Did the robots wonder why you picked them up?"

Donald Knuth gave a pre-dinner speech: actually just a short speech followed by a question-and-answer session. He said that he was proud to be a geek and he believed that Alan Turing was 100% geek. He handled a wide variety of questions, such as "What advice would you give to a young researcher?" (follow your muse), "Are your children and grandchildren programmers?" (my son is a maths teacher, my grandchildren are too young yet to know), "Are objects the same as co-routines?" (that question was from Bertrand Meyer, the creator of the Eiffel programming language, and the answer was an involved characterisation of their similarities and differences), and "What's your next book?" (my life's work is The Art of Computer Programming, I'm working on the new volume; it's slow progress because I want it to be right).

During his short speech, Knuth said that on the Friday before the conference he'd been to see the Alan Turing statue in Sackville Gardens, near Manchester Piccadilly. He'd sat on the bench with Turing's statue, and the two of them sat and thought for about 10 minutes; he found it wonderful, and he recommended it to us all. When I left the conference on Monday lunchtime, I went via Turing's statue and did the same. The statue is life-size and feels strangely human, especially when you sit on the bench next to him. I'll never get to meet Turing, but I think that attending a 3-day conference all about his work, followed by sitting next to his life-size statue on a bench, is probably the next best thing.

See also my writeup of Day 1 and Day 3 of the conference, and the online video recordings of the talks.

Day 3 and thoughts on the Alan Turing Centenary Conference in Manchester

On day 3 of the Alan Turing Centenary Conference in Manchester I had to leave after lunch and so I missed the afternoon sessions. However, the morning session was excellent.

Garry Kasparov: The world chess champion took us through the story of Turing's world-first chess-playing program. Turing didn't have a computer to execute the algorithm, so he carried out the calculations by hand, taking 15 minutes to execute a move. Turing's algorithm was pitted against humans, and one of the matches was transcribed. Kasparov and colleagues collected the rules of Turing's chess-playing program and implemented it. When they tried to reproduce the game, the results deviated from the game transcription 10 times. Presumably, Turing got bored with his hand calculations, assumed what the answer would be, and just moved the piece. Live at the conference, Garry Kasparov played a game of chess against the implementation of Turing's program, and he very quickly won in 16 moves, with a commentary as he did so (pointing out moves that were "not the greatest choice!"). See Garry Kasparov's talk.

Rodney Brooks: The roboticist talked about Turing's unpublished 1948 paper "Intelligent Machinery", which discussed embodied robots. They were impractical in Turing's lifetime. In this paper Turing also says that machines that never make mistakes could not be intelligent, that humans are not machines because they seem to be able to do mathematics despite Gödel's famous theorem, and discusses the role of neurons and brains and their resemblance to a universal machine. Brooks showed how the field of robotics had benefitted from the exponential laws about processor speed and availability of sensors, and now there are many things we can do in robotics that Turing would not have thought possible back in 1948. In fact, many things that look intelligent when humans do them turn out to be the result of relatively simple laws, and his group's research can make robots look human-like just by taking care of how they move their gaze to track an object, or how they use prosody and movement to look as if they have emotions. See Rodney Brooks' talk.

Throughout this conference we were reminded of the huge range of topics that Turing worked on in his lifetime, and that this conference covered:
  • The Turing test (or "Imitation game") was repeatedly mentioned in several contexts by several people, including for its primary use: a discussion of whether AI is achievable and how we will know when we've made it.
  • Michael Rabin reminded us of Turing's contribution to decidability, the halting problem and computability
  • Edmund Clarke reminded us that Turing was the first to realise that software verification would be important.
  • Yuri Matiyasevich reminded us of the work Turing did in number theory and that his work is still used in computing zeros of the Riemann zeta function.
  • Adi Shamir reminded us of Turing's work in encryption.
  • Fred Brooks reminded us of Turing's efforts to create early computers (and his Turing machine) and his thoughts about whether intelligent machines were possible.
  • Leslie Valiant and others pointed out that Turing had categorised 3 types of search: intellectual search (search algorithms taught now in all AI courses), genetical search (genetic algorithms, evolution) and cultural search (gaining knowledge as a community by interaction).
  • Garry Kasparov reminded us that Turing made the first automatic chess playing program.
  • Rodney Brooks told us that Turing wrote about the possibilities and problems for embodied robots long before they could be built.
  • Hans Meinhardt told us about Turing's work in understanding the formation of biological structure.
Happy Birthday Alan Turing! Where will computer science be in another 100 years?



See also my writeup of Day 1 and Day 2 of the conference, and the online video recordings of the talks.

Wednesday, 6 June 2012

Alan Turing, the first bioinformatician

This is the Alan Turing centenary year: Alan Turing would have been 100 years old this month (on 23rd June). As well as inventing computers, theories of decidability, computability, computational cryptography and artificial intelligence, just before his death he also studied the relationship between mathematics and the shapes and structures found in biology. How do patterns in plants, such as the spiral packing of the seeds found in the head of a sunflower, come about? This year, in a big experiment devised by Prof Jonathan Swinton to celebrate the centenary, sunflowers are being grown across the country. The seed heads will be collected and their patterns counted, and hopefully the results will demonstrate the relationship between Fibonacci numbers and biological growth that Turing was investigating. We're growing two sunflowers here in the Dept of Computer Science at Aberystwyth University as part of this experiment. Their names were voted on by the Department, and "one" and "zero" were chosen.
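The connection between the golden angle and those spiral seed patterns can be toyed with in a few lines. This sketch uses Vogel's model, a standard idealisation of sunflower seed packing (not Turing's own formulation): each new seed is rotated by the golden angle and pushed outwards, and the spiral counts that emerge are consecutive Fibonacci numbers, which is exactly what the mass experiment sets out to count.

```python
import math

# Vogel's model: seed k sits at angle k * golden_angle, radius sqrt(k).
GOLDEN_ANGLE = math.pi * (3 - math.sqrt(5))  # about 137.5 degrees

def seed_positions(n):
    """(x, y) positions of the first n seeds in Vogel's sunflower model."""
    return [(math.sqrt(k) * math.cos(k * GOLDEN_ANGLE),
             math.sqrt(k) * math.sin(k * GOLDEN_ANGLE))
            for k in range(n)]

print(round(math.degrees(GOLDEN_ANGLE), 1))  # prints 137.5
print(len(seed_positions(100)))              # prints 100
```

Plotting these points (with matplotlib, say) produces the familiar interlocking clockwise and anticlockwise spirals seen in a real seed head.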



Turing used an early computer at Manchester (the Ferranti Mark 1, the first commercially available general-purpose electronic computer) to model the chemical processes of reaction and diffusion, which could give rise to patterns such as spots and stripes. You can play with a Turing reaction-diffusion applet online, which shows how changes to the diffusion equation parameters produce different patterns. Turing wrote, near the end of his 1952 paper The Chemical Basis of Morphogenesis, that:
"Most of an organism, most of the time, is developing from one pattern into another, rather than from homogeneity into a pattern. One would like to be able to follow this more general process mathematically also. The difficulties are, however, such that one cannot hope to have any embracing theory of such processes, beyond the statement of the equations. It might be possible, however, to treat a few particular cases in detail with the aid of a digital computer."

He then elaborates on how computers had already been extremely useful to him in understanding his models (removing the need for so many simplifying assumptions). The fact that he actually used computers to investigate the models underlying biology makes him the first bioinformatician / computational biologist. The fact that he could see how computers would enable us to model and explore the natural sciences makes him an amazingly visionary scientist.
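To give a flavour of what such reaction-diffusion systems do, here is a tiny 1-D simulation. It uses the Gray-Scott equations, a later activator-inhibitor model in the same family rather than Turing's own 1952 system, and every parameter value is purely illustrative: two chemicals diffuse at different rates and react, and a small perturbation in an otherwise uniform field can grow into structure.

```python
import numpy as np

# 1-D Gray-Scott reaction-diffusion sketch (illustrative parameters).
rng = np.random.default_rng(0)
n, du, dv, f, k, dt = 200, 0.16, 0.08, 0.035, 0.060, 1.0
u = np.ones(n)        # substrate chemical, initially uniform
v = np.zeros(n)       # reacting chemical, initially absent
v[90:110] = 0.25 + 0.05 * rng.random(20)  # small central perturbation

def laplacian(a):
    # Discrete diffusion term with periodic (wrap-around) boundaries.
    return np.roll(a, 1) + np.roll(a, -1) - 2 * a

for _ in range(2000):
    uvv = u * v * v
    u += dt * (du * laplacian(u) - uvv + f * (1 - u))
    v += dt * (dv * laplacian(v) + uvv - (f + k) * v)

print(f"v after 2000 steps: min {v.min():.3f}, max {v.max():.3f}")
```

Plotting u or v against position after the loop shows the kind of spatial structure Turing was computing by far more laborious means on the Ferranti Mark 1.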

Extra reading: