Six papers for Big Data fans

Big Data, the processing of data sets that do not fit on a single computer, has come of age. It’s not just the level of interest shown at conferences like Strata but also the types of people participating. Sure, there are loads of companies out there with products in this space, but there are also plenty of end users coming forward, and many of these are from outside technology companies. At the London edition of Strata, one of the speakers was Ben Goldacre, doctor and author of the awesome Bad Science, who discussed the impact of missing data, which is a huge issue for medical studies. Even the White House has weighed in on behalf of Big Data and emphasized its importance to business.

If you are going to use a technology, I’m a big fan of going to the source, and thankfully a lot of the published work in this space is freely available. So I’ve collected some of the papers that are key to this area: five are about Big Data itself and the bonus one is about operational monitoring for massively distributed systems.

Big Data for business: be careful what you ask for

This post on the temptation of data raised an interesting aspect of Big Data: that we run the risk of being overwhelmed by data, and that organisations and investors are looking too hard for the one piece of data that will be the key to success. Drowning in data is a real risk to an enterprise and something that great leaders are aware of (lesson 3 in this awesome leadership slide deck from Colin Powell: “Experts often possess more data than judgement”), but being swamped by the data is not the only problem.

Another problem with casting the net of Big Data this wide is that if you keep looking hard enough you will eventually find some sort of pattern, any sort of pattern (there’s an XKCD for this). The human brain is a fantastic pattern recognition system and is easily fooled (see pareidolia), and it’s far too easy to make mistakes like confusing correlation with causation.
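
To make the "keep looking and you will find a pattern" point concrete, here is a minimal sketch in Python. It is purely illustrative: the function names, the sample size of 20 and the metric counts are my own assumptions, not taken from any of the papers. It generates a random target series, compares it against a growing number of equally random and completely unrelated metrics, and reports the strongest correlation it stumbles across.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def strongest_spurious_correlation(n_metrics, n_samples, seed=0):
    """Largest |correlation| between a random target and n_metrics random,
    unrelated series. Every series is pure noise, so any "pattern" found
    here is spurious by construction (hypothetical helper for illustration).
    """
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_samples)]
    best = 0.0
    for _ in range(n_metrics):
        metric = [rng.gauss(0, 1) for _ in range(n_samples)]
        best = max(best, abs(pearson(target, metric)))
    return best


if __name__ == "__main__":
    # The more unrelated metrics we trawl through, the stronger the best
    # "relationship" we find, even though none of them is real.
    for n_metrics in (1, 10, 100, 1000):
        strongest = strongest_spurious_correlation(n_metrics, n_samples=20)
        print(n_metrics, round(strongest, 2))
```

With only 20 observations per series, trawling through a thousand noise metrics almost always turns up at least one that looks convincingly related to the target, which is exactly the trap described above.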
