Tuesday 15 October 2013

My Ada Lovelace post: Donna Robinson

My post for this Ada Lovelace Day is about a woman in computing who inspired me, a friend of mine called Donna Robinson. I first met Donna when living in Guildford, when I was 23. We shared a house (with two other people). She took me shopping for a toolbox (metal, not plastic) and tools (hammers, pliers, mole grips, hacksaw, screwdrivers, and lots more, everything I'd need to be self-sufficient). I didn't have much money. "Trust me", she said, "it's worth investing in good tools". I still have those tools and the toolbox and they're still in use, more than 15 years later.

Donna writes open source code. She made the Oxford College Student Database. She made Valkyrie for Valgrind. She made a library of Scottish Country Dances for the SCDS. She's continually making open source stuff for people, but most of it isn't visible on the web (and neither is she). She financed her open source work by doing up run-down houses in the day, and hacking her code all evening. She chooses not to have an ordinary 9-5 stable job, and it takes a lot of time and effort to do that.

What has most inspired me about Donna is not her coding (though she does that), or her handiness at any manual jobs (though she does that), or her dedication to whatever she chooses to do (see for example how she helped her cat get better after an accident) but her attitude: she knows that with enough time and research she can do whatever she wants to achieve, and she never seemed to have that worry that she wouldn't be taken seriously because she's not male.

Thursday 26 September 2013


Machine learning, data mining and statistical data analysis is clearly a popular area now, judging by the number of attendees of this year's European conference, ECMLPKDD 2013.

It's been a long time since I last attended (2001 for multi-label classification by a modification to C4.5). I think the field has grown and matured a lot. There are far fewer papers now showing the results of a new algorithm on 10 different UCI datasets. There is far more presence from people in industry. And industry is varied: search engines, internet shopping and finance. Yahoo, Amazon, Zalando, Deloitte and many others sponsored the conference and sent people to speak or attend. There was an "industry track", and that room was full.

Themes that I picked up on were: regression (still popular!), lots of tensors and matrices, numerical analysis methods for large data sets, network mining, sequence mining, and generally using ML/DM to influence people (buying, voting, doing good, giving your system feedback).

The organisers this year have really done a good job: working wifi, lots of food and coffee, sessions running on time, plenty of mingling time, and choosing a venue in a beautiful city, with accommodation in a wide range of hotels within easy walking distance booked as an easy part of the registration process. It is appreciated!

Diversity is something that the ECMLPKDD community have started to work on improving. It has the usual male/female imbalance of a technical conference. Perhaps slightly more women than I expected, or maybe I'm just getting used to this. I'd hazard a guess at about 20% or a bit less. But next year's organising committee are more gender-balanced, and there will also be a Diversity Chair to keep an eye on the issue.

Openness of code and data is something else the community are working on improving. This year for the first time they had an award for "Open Science", and encouraged paper submissions to include a link to code/data. In order to award this, the organisers had to download, compile, run and test lots of submitted code. I don't know which of the organisers did this onerous task, but I'm very pleased they did.

If I had to point out one thing that could still be improved, my number 1 would be that the proceedings are owned by Springer, and are not open. For reasons known only to Springer, I can't make an account with them or reactivate an existing account. Maybe Springer will reply to my email eventually. But if the proceedings were open access (papers deposited at arXiv for example) then this would really benefit the ML/DM community and others, and more widely promote the work of everyone who presented.

Next year, 2014, the conference moves to Nancy, France, a city with Art Nouveau architecture, and with many fine wines. www.ecmlpkdd2014.org

Friday 23 August 2013

Venn diagrams

I'd always thought that drawing Venn diagrams was quite trivial, until I needed to create some recently. They are trivial, if we only have 2 equal-sized sets. If we have 3 sets, and we want the circle sizes to represent the number of data items in each set, then the layout algorithm is more complex. Luckily there's a great package for Python called matplotlib-venn which does exactly what I wanted.

However, if I have 4 sets, then it gets more complicated (and isn't yet handled by matplotlib-venn). A Venn diagram must show all possible intersections. Venn used overlapping ellipses to show how this could be achieved.

Diagram by RupertMillard from Wikimedia Commons

This is where Venn diagrams differ from Euler diagrams. Euler diagrams don't show empty intersections, so they can look much simpler than Venn diagrams, and can contain fully-nested circles. There are Venn diagrams that can represent all the overlaps of 5 and 6 sets, but we'd end up with some extremely complex diagrams that don't really aid the visualisation of our data.
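To make the Venn/Euler distinction concrete, here is a minimal Python sketch that lists every region a Venn diagram has to draw and counts how many of them actually contain data. The function name `venn_regions` and the example sets are made up for illustration; this is not how matplotlib-venn does its layout.

```python
from itertools import combinations


def venn_regions(sets):
    """Map each non-empty combination of set names to the items that
    belong to exactly those sets and to no others."""
    names = sorted(sets)
    regions = {}
    for k in range(1, len(names) + 1):
        for combo in combinations(names, k):
            # Items shared by every set in this combination...
            inside = set.intersection(*(sets[n] for n in combo))
            # ...minus anything that also appears in one of the other sets.
            others = [sets[n] for n in names if n not in combo]
            outside = set().union(*others)
            regions[combo] = inside - outside
    return regions


# Hypothetical data: four small sets of item ids.
example = {
    "A": {1, 2, 3, 4},
    "B": {3, 4, 5},
    "C": {4, 6},
    "D": {7},
}

regions = venn_regions(example)
venn_count = len(regions)  # a Venn diagram must draw all of these
euler_count = sum(1 for items in regions.values() if items)  # an Euler diagram only needs these

print(venn_count, "regions for Venn,", euler_count, "non-empty regions for Euler")
```

With n sets there are 2^n - 1 regions to draw, so 4 sets already need 15 and 6 sets need 63, which is why the diagrams get complicated so quickly.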

Update (9 Oct 2013): Here's a Javascript/D3.js interactive version with more thoughts on why it's a difficult problem.

Update (20th March 2014): Here are a couple of completely over the top diagrams: a pine tree and a banana. Venn or Euler?

Monday 12 August 2013

Reviewing for triple-blind conferences

Computer Science is different to some other fields of research in that it uses refereed conferences as the main quick-turn-around publication venue. There are many excellent computer science conferences with high quality submissions and high quality reviewing. Submitting your paper to a CS journal can take years (and give you a dialogue, where you get a chance to fix any problems), but a conference will give you a straight yes or no decision, and good feedback, within around 3 months. And then if accepted, your work gets the publicity of a presentation too.

However, for the reviewers this can mean a heavy load if the organisers don't get enough reviewers together. I have just finished reviewing my allocation of 9 papers for ICDM 2013. Some people would only review 10 papers in an entire year! There were some excellent submissions amongst my 9 and it's going to be strongly competitive this year. Nine is a rather tough load, but one of the reasons I do reviews is because it makes me read more widely in areas tangential to mine, and it keeps me up to date. It's also a payback for all the reviews that others have done for me.

But ICDM has a triple-blind submission procedure. The authors don't include their names, and in fact they have to try to remove any evidence of themselves from the paper ("We extended the work of Smith et al" rather than "We extended our previous work"). They don't get to know who wrote their reviews, and even the programme committee co-chairs don't get to know the identities of the authors or reviewers until after the decisions have all been made.  This is supposed to help us be fairer. So we won't be unduly influenced by big names or previous work. However, in practice, it's almost always possible to work it out, to recognise citations, writing style, working area, etc. People who are proud of their work will possibly try to leave you clues anyway. Very little work in a research group is done in isolation of previous work in that area. Data mining researchers don't make up such a large community that this could be hidden.

In fact I'd argue that the triple blind process is counter-productive. When other reviewers can see my name and we have to debate any disagreements over our reviews, then I feel my reputation is at stake if I give a poor review. So I'll go to lengths to make sure everything I say is fair. If my identity is known to the authors then likewise, I'll make extra sure that I'm not just seen as complaining, but actually giving them useful advice. We encourage better review quality by making it open. And seeing as how I can guess most of the authors anyway, there's little point in hiding their names. All it does is stop them making their code and data available for inspection. Some of the papers said "A link to the code will be inserted after the review process" and some papers just had no mention of making any data/code available. They would have been discouraged from doing so by the blind review, rather than encouraged.

Instead of closing and anonymising the system (blind, double-blind, now triple-blind), I think it would be more productive to open it. Would it be any worse if we did? There are certainly ways in which it would be better.

Monday 22 July 2013

The Welsh Crucible

This year I applied for a place on the Welsh Crucible, and my application was accepted. The Welsh Crucible is a yearly scheme for researchers in Wales. It takes 30 researchers from different institutions and different disciplines and puts us all together for 3 workshops ("labs") over the summer, to see what interesting ideas and collaborations can be forged.

So I'm now talking about grant applications to a plant biologist in Cardiff, an environmental chemist in Bangor and a physical geographer in Aberystwyth, and I have a new network of future collaborators spanning a huge range of subjects.

Here are a few other thoughts about the experience, in no particular order:

It's a fantastic networking opportunity, a great way to make contacts for subjects outside your immediate area.

Over the course of the 3 labs I became better at introducing my research in a short and presentable manner. At the start I found this very difficult. At the end it's still hard, but I'm making progress. As a researcher I often feel that I can't really describe my research, full of its day-to-day technical details, in a way that anyone else will understand. One of the exercises they asked us to do was to write 100 words about our research, for a lay person. Another was speed-networking (just 3 minutes per person and then move on). My description of my work still varies, depending on who I'm talking to, but I'm much more happy to attempt it now. Doing these exercises not only helps us communicate with other researchers, but also makes us introspect and see if we're actually doing what we want to be doing.

Speed networking. Photo by Keith Morris.

The best interdisciplinary work is likely to come from a real working friendship where we already trust each other. So build these, and the rest will follow. The Crucible has actually made me take more of an interest in other disciplines and it allows me to feel that it is okay to do so. It's good to be interested in other disciplines, even if the REF assessment says otherwise. In fact it's generally made me think about the longer term, about putting good foundations in place and not worrying about short term measurements, and individual wins and losses. From Crucible people I've learned about gallium teaspoons, arthritis inflammation and bone chomping, the placebo effect, dating rocks by luminescence under big black sheets, Jews in Scotland, and the problems of conducting health studies across populations of people. I'll definitely go to more seminars from other departments now. And maybe I'll wander into some other departments' coffee rooms too...

The wants and offers wall. Photo by Keith Morris.

We were asked to start off by introducing ourselves with a 9-slide pecha kucha presentation. Far more of the male participants than the female participants included a picture of their children in this introduction (about 8/19 male vs about 0/11 female).

Several of the Crucible participants were promoted or took on new positions of responsibility between lab two and lab three. One participant described it as a useful benchmarking experience. We look at others around us and see what can be achieved. 

We have some very talented and enthusiastic researchers across Wales. As a group, we cover a lot of research expertise, and we have the whole range of skills (people management, media engagement, conference organising, book writing, schools outreach, grant writing, presenting, teaching, etc). I look forward to meeting everyone again at the reunion.

All the Crucible 2013 attendees. Photo by Keith Morris.

Friday 7 June 2013

Review: To Kill a Machine

This is the first time I've ever reviewed a play. But it seems appropriate to post this on the anniversary of Turing's death. 

To Kill a Machine by Catrin Fflur Huws and Scriptography Productions

Alan Turing had a fascinating life and many people in the UK will know something about him. Perhaps you'll know about his amazing code-breaking work in the war, perhaps you'll know about the mysterious circumstances surrounding his death, or perhaps you'll know that he is one of the founders of computer science. But Turing was a genius, and we don't yet know enough about all the other great things he achieved. Breaking the Enigma code and creating the computer were just two of his accomplishments, and this play takes us through another aspect of his work and his life: his contributions to philosophy and Artificial Intelligence.

We might all agree that a rock shows no signs of intelligence, and that a human does possess intelligence, but in between there's a large grey area. Is a sheep intelligent? Is a city intelligent? Is a machine intelligent? And what is a machine anyway?

Catrin Fflur Huws has devised a play that weaves together multiple themes in order to explore these ideas. She begins with the difference between a man and a woman, a contrast which was more pronounced in the society of Turing's time, where assumptions about gender roles dictated the jobs you might aspire to, and your role in the war. From this she neatly moves on to ask what's the difference between a man and a machine? Are some people more like a machine than a man? And if understanding the difference in intelligence is difficult, then what about in love? What's the difference between loving a man and loving a woman? What happens when we can no longer distinguish these differences and when our lives might depend on it? As the play moves swiftly on, we feel enlightened, and at the same time perplexed: how can we not have considered the relationships between all of these things before?

This play has been written by a playwright who really understood what Alan Turing actually did, and wants to tell us all how ground-breaking that was, and how much it matters to all of us. It will appeal to scientists, who love to see their heroes portrayed in a way that everyone can understand. This is not just a classic and tragic story of love and betrayal, but contains cleverness and computing to keep the geeks happy too. And the acting was superb. Gwydion Rhys gives us a vulnerable hero that we'd like to protect.

As a computer scientist, I feel that this is a hugely important play. It promotes a British scientist whose work should be more widely celebrated and understood, and it was conceived in the centennial year, 100 years after Alan Turing's birth. Mixing the arts and science to create works like this is something we should all be doing more to encourage. We should make plays about science, and use the arts to tell us about great ideas. The world shouldn't be divided into science and arts (or men and women, or men and machines) and Turing didn't make such divisions. The playground between the two is where the best ideas are born.

Go and see this play, and you'll come away feeling inspired to follow in Alan Turing's footsteps. After seeing this, you'll want to be free to be a bit different, think great thoughts and create new ideas.

NLW digitised newspapers

Some fascinating stories are available in the newly digitised newspapers (from 1844 to 1910) from the National Library of Wales collection of Welsh Newspapers.

"The railway officials shouted to her to lay down, which she did just as the train reached her."

Friday 5 April 2013

Ada Lovelace Colloquium 2013

Dr Julie Greensmith introduces the 2013 Lovelace Colloquium in Nottingham
Yesterday I attended the 2013 Lovelace Colloquium. This is a conference aimed at women undergraduates and MSc students in computing, and judging by the quality of the posters made by the attendees we certainly have a lot of fantastic women computing undergraduates in the UK.

When I was at school I was the only female in all of my A-level classes except Computing, where I was one of 2. On my undergrad degree, there was one other woman in my year group. I ended up deliberately joining an aerobics class in an attempt to meet other women. Throughout my computing career, women have been in a substantial minority. So how refreshing it is to go to Lovelace, and find so many enthusiastic technical women, all in one place. And how amazing it is that they are undergraduates, and will be going on to do great things in the future.

The conference is sponsored by several companies and organisations (Google, FDM, CA, EMC, HEA) who all recognise the opportunity of chatting to these students, and want to suggest their companies as career options. 

There were talks about ontologies, careers, artificial immune systems, computer vision and what it's like to work as a developer at Google. And there were plenty of networking opportunities, and the students clearly made the most of them. The support is good too - for example, when one student described her poor industrial year experience with a company that gave her secretarial work while the male placement students got the technical work, the whole audience was able to reassure her that this wasn't normal or right and she shouldn't let it put her off trying again for a computing career. But the highlight of the day is the poster session, which is where the students get to show off their work and enthusiasm, and win prizes of up to £500.

If you haven't yet been to this conference, as a student, an IT professional, an academic member of staff, a self employed consultant, or someone thinking about changing their career to computing, then I can't recommend it highly enough. The positive feeling you get from being part of it is huge.


Monday 25 March 2013

The genome game

We've made an HTML5/Javascript educational game for teaching children about bioinformatics. You can try out the game on our webserver, or download it, fork it and develop it for your own purposes.

The idea is that we have 4 binary digits controlling 4 aspects of a cute creature's phenotype (eye colour, head colour, body colour and number of legs). The binary digits are big clickable buttons, which toggle the bit value and correspondingly change the creature. We can use this initial button-clicking exercise to talk about combinatorics and binary numbers: how many different creatures can be made by changes to 4 bits?

Then we show the underlying rules. These are the equivalent of "if-then-else" rules, defining how the bit values control the phenotype. The rules can be changed, so the children can choose which bit controls which characteristic, and also choose colours and leg numbers.
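As a rough illustration of the idea (the game itself is written in Javascript, and this is not its code), here is a minimal Python sketch of four bits driving four if-then-else rules. The rule table and the particular colours and leg counts are made up for the example.

```python
from itertools import product

# Hypothetical rule table, one rule per bit position:
# (characteristic, value if the bit is 0, value if the bit is 1)
RULES = [
    ("eye colour", "blue", "green"),
    ("head colour", "red", "yellow"),
    ("body colour", "purple", "orange"),
    ("number of legs", 4, 6),
]


def phenotype(genome, rules=RULES):
    """Apply one if-then-else rule per bit to build the creature."""
    return {
        trait: (if_one if bit else if_zero)
        for bit, (trait, if_zero, if_one) in zip(genome, rules)
    }


# Every possible 4-bit genome, and the creature each one produces.
all_genomes = list(product((0, 1), repeat=4))
all_creatures = [phenotype(g) for g in all_genomes]

print(len(all_genomes), "different creatures")
print(phenotype((0, 1, 0, 1)))
```

Each bit doubles the number of possible creatures, so 4 bits give 2 x 2 x 2 x 2 = 16, which is the answer to the combinatorics question. Changing which bit controls which characteristic just means reordering the rows of the rule table.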

The genome game at National Science Week

Finally we can "Make population". This creates a random population of 5 creatures, all generated from the current set of rules, and it hides the rules. Now the game is to have a friend who didn't see the rules guess what the underlying rules were. They can see the 5 creatures and the genomes of each creature. Sometimes it's easy to work out the rules, and sometimes the rules can't be completely determined. It depends on the random 5 creatures. Sometimes all the creatures with blue eyes also happen to have 6 legs, and then we just can't tell which of the 4 bits is responsible for which characteristic.
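The guessing step can be sketched in Python too. This is only an illustration under some assumptions: the hidden rules, the function names and the hand-picked population are all hypothetical, and a bit counts as a "candidate" for a characteristic simply if its value always predicts that characteristic among the creatures we can see.

```python
import random

TRAITS = ["eye colour", "head colour", "body colour", "number of legs"]
# Hypothetical hidden rules: bit i controls TRAITS[i];
# each pair is (value if the bit is 0, value if the bit is 1).
HIDDEN_VALUES = [("blue", "green"), ("red", "yellow"), ("purple", "orange"), (4, 6)]


def phenotype(genome):
    """Build a creature from a 4-bit genome using the hidden rules."""
    return {trait: values[bit]
            for trait, values, bit in zip(TRAITS, HIDDEN_VALUES, genome)}


def make_population(size=5, seed=None):
    """Generate random (genome, creature) pairs, as 'Make population' does."""
    rng = random.Random(seed)
    genomes = [tuple(rng.randint(0, 1) for _ in TRAITS) for _ in range(size)]
    return [(g, phenotype(g)) for g in genomes]


def candidate_bits(population, trait):
    """Return the bit positions whose value always predicts the trait's
    value in this population (so they could be the controlling bit)."""
    candidates = []
    for position in range(len(TRAITS)):
        seen = {}  # bit value -> trait value observed so far
        consistent = True
        for genome, creature in population:
            expected = seen.setdefault(genome[position], creature[trait])
            if expected != creature[trait]:
                consistent = False
                break
        if consistent:
            candidates.append(position)
    return candidates


# A hand-picked unlucky population: bit 0 and bit 3 always agree, so every
# creature with blue eyes has 4 legs and every one with green eyes has 6.
ambiguous = [(g, phenotype(g)) for g in
             [(0, 0, 1, 0), (1, 1, 0, 1), (0, 1, 1, 0), (1, 0, 0, 1), (1, 1, 1, 1)]]

for trait in TRAITS:
    print(trait, "could be controlled by bit(s)", candidate_bits(ambiguous, trait))
```

In this unlucky population the eye colour and the number of legs each have two candidate bits, so the friend cannot tell them apart, while head colour and body colour are pinned down to a single bit.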

We finish up the discussion by asking how many genes the children think are in baker's yeast (approx 6,000; it's easy to get hold of a bag of yeast, and they can guess what it is and what it does). They guess at "1? 2? 4? millions?" After this we ask them to guess how many genes are in a human (approx 20,000), and describe how about 16 genes are actually responsible for your eye colour, not just one. And finally, we ask how many genes are in wheat (current estimate approx 100,000 or more). The look of astonishment at the complexity of wheat was a common reaction, quickly followed by "Why?". So we tell them that's what scientists are currently trying to find out: what do all our genes or the genes in yeast or wheat actually do? How many representatives of a population would scientists need to determine what the 100,000 genes in wheat do? And we tell them that if they want to work in bioinformatics when they grow up, then they could find out the answers for themselves.

Wheat and yeast: how many genes do they have?

How did this game come about? We have a BBSRC funded project on the application of multi-relational data mining to the problem of finding which parts of an organism's genotype are responsible for its phenotype. This problem is often called GWAS (genome wide association studies) or marker-assisted selective breeding. As part of the application for funding, we said that we'd do some public outreach activities, making a bioinformatics game for children that could be used as part of the Technocamps activities, and that represented the research problem that we were working on. The game is obviously a highly simplified view of the research, but still does give an idea of how hard the problem is!