Big Data, the processing of data sets that do not fit on a single computer, has come of age. It’s not just the level of interest shown at conferences like Strata but also the types of people participating. Sure, there are loads of companies out there with products in this space, but plenty of end users are coming forward too, and many of them are from outside technology companies. At the London edition of Strata, one of the speakers was Ben Goldacre, doctor and author of the awesome Bad Science, who discussed the impact of missing data, a huge issue for medical studies. Even the White House has weighed in on behalf of Big Data and emphasized its importance to business.
If you are going to use a technology, I’m a big fan of going to the source, and thankfully a lot of the published work in this space is freely available. So I’ve collected some of the papers that are key to this area: five are about Big Data itself, and the bonus one is about operational monitoring for massively distributed systems.
Distributed storage for structured and semi-structured data systems
Scalable and resilient datastores for Big Data
Google’s new globally distributed database
Batch processing and generation at scale
Ad-hoc queries at scale
Operational monitoring at scale
Dapper is not strictly about Big Data; it covers tracing and profiling of massively distributed systems.