All Science Requires Janitor Work

I read an interesting article in the New York Times titled “For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights”. For anyone who has done any sort of science, this should not be a surprise. Science is the product of a relentless sort of focussed stupidity combined with intelligent insight, and this sort of basic work is essential to the first part of the process.

Anyone who has done a science experiment, even a simple high school one, will understand that gathering and sorting data is a menial task that cannot simply be automated. As a scientist, you are often playing the role of a highly trained monkey, repeatedly doing the same tasks. Normally this is the sort of activity that could be quickly automated, but there turns out to be so much art in the science that you are often the only trained monkey who can do it.

Whether it’s, as in the article, a company providing information on drug side effects that needs to know that “drowsiness,” “sleepiness” and “somnolence” all mean the same thing, or a researcher analysing EEG patterns to determine whether the activity is real or just an artefact, both situations are remarkably resistant to complete automation. Even if you could automate everything, it’s only when you’ve ploughed through the dataset that you start to have an idea of what’s going on, and that’s ignoring the green jelly bean problem of post hoc analysis.
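
To make the synonym example concrete, here is a minimal Python sketch of the kind of clean-up involved. It is an illustration only, not how the company in the article does it: the `CANONICAL_TERMS` table and the function names are hypothetical, and the table would have to be curated by hand, which is exactly the janitor work in question.

```python
# Hypothetical, hand-curated synonym table: variant wording -> canonical term.
# Building and maintaining this table is the manual part.
CANONICAL_TERMS = {
    "drowsiness": "drowsiness",
    "sleepiness": "drowsiness",
    "somnolence": "drowsiness",
}


def normalise_term(raw):
    """Return (term, needs_review) for one reported side effect."""
    key = raw.strip().lower()
    if key in CANONICAL_TERMS:
        return CANONICAL_TERMS[key], False
    # Unknown wording: keep it unchanged and flag it for a human to check.
    return key, True


def normalise_reports(reports):
    """Normalise a list of reports and collect the ones a human must review."""
    cleaned = []
    review_queue = []
    for raw in reports:
        term, needs_review = normalise_term(raw)
        cleaned.append(term)
        if needs_review:
            review_queue.append(raw)
    return cleaned, review_queue
```

Anything the table does not recognise ends up in the review queue rather than being guessed at, so the code only speeds up the cases a person has already decided.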

For my PhD I manually compared 1.1 million EEG patterns, and that was after the computer had processed and tentatively categorised the data. I’d go to sleep seeing wiggly lines across the back of my eyelids, but that was the price to pay for a level of correctness that no computational analysis could match.

It’s fantastic that there are start-ups out there looking to solve this problem of data wrangling (not everyone has an army of PhD students, postdocs or other slave labour to do it for them), but it’s an exceptionally tough problem that I don’t expect to see solved any time soon.

