A query over a petabyte of data now runs about as easily as a query over a spreadsheet did a generation ago. We have watched that ease arrive, and the arrival was not inevitable. Big data, the problem of storing and processing more information than any single machine could hold, was for a period one of the harder things in computing, and the reason most institutions could not act on the data they already possessed. The infrastructure that tamed it, in hardware and in software, is worth describing, both as a piece of recent history and as a companion to the harder question of what we should do with the capability now that we have it.
When data outgrew the machine
For most of computing’s history, the answer to a larger problem was a larger machine: more memory, a faster processor, more disk. That answer held until data outgrew what any single machine could economically hold or process, which happened in earnest as the web turned every interaction into a record and datasets reached sizes no affordable computer could manage. With that, the bottleneck shifted. The difficulty no longer lay in how fast one processor could run. It lay in how to store and move information across many machines without the whole thing collapsing when one of them failed, as one of them always did. Scaling up had reached its ceiling. Scaling out was the only way forward, and scaling out is a genuinely harder problem.
The infrastructure answer
An answer that reshaped the field came from treating failure as normal rather than exceptional. Google’s work in the early 2000s, the Google File System and the MapReduce programming model (Dean & Ghemawat, 2004), spread both data and computation across large numbers of cheap, unreliable machines and assumed that any of them might fail at any moment. Storage was replicated, so the loss of a machine lost no data. Computation was divided, so the loss of a machine cost only the small piece it had been working on. Hadoop carried the same ideas into open source in the mid-2000s, putting web-scale data processing within reach of any organisation willing to run a cluster. Cloud platforms, with object storage and rentable compute arriving from 2006, removed even the need to own the hardware. The cost of storing and processing a terabyte fell by orders of magnitude over a decade, and a capability that had belonged to a handful of web giants became ordinary.
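The idea of divided computation can be made concrete with a toy word count. What follows is a minimal single-process sketch, not Google's or Hadoop's implementation: the function names, the `failures` set used to simulate a lost worker, and the sample input are all assumptions made for illustration. It shows the map, shuffle, and reduce steps, and how a failure costs only the one chunk that was in flight.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of input."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Shuffle: group every emitted value under its key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single total."""
    return key, sum(values)


def run_with_retry(task, chunk, failures, max_attempts=3):
    """Run one task, retrying if its (simulated) worker is lost.

    `failures` is a hypothetical set of (chunk, attempt) pairs that
    stand in for machines dying mid-task. Only this chunk is redone.
    """
    for attempt in range(max_attempts):
        if (chunk, attempt) in failures:
            continue  # simulated crash: the partial result is discarded
        return task(chunk)
    raise RuntimeError("task failed on every attempt")


def word_count(chunks, failures=frozenset()):
    pairs = []
    for chunk in chunks:
        pairs.extend(run_with_retry(map_phase, chunk, failures))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


chunks = ["the cluster failed", "the job finished", "the data survived"]
# Simulate losing the first attempt at the second chunk.
result = word_count(chunks, failures={("the job finished", 0)})
print(result["the"])
```

In a real cluster each chunk would live on a different machine and the retries would be scheduled elsewhere, but the shape is the same: the result with a simulated failure matches the result without one.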
The software answer
Hardware and distribution were only half of it. Early large-scale data work was bespoke and difficult. A programmer reasoned explicitly about how data was partitioned across the cluster, wrote computation in the rigid shape that MapReduce required, and handled the consequences of machines failing mid-job. That work was powerful and unpleasant in equal measure. What followed was a steady climb in abstraction. Apache Spark, in the early 2010s, replaced the rigid MapReduce shape with a richer and faster model that kept data in memory across steps (Zaharia et al., 2012). SQL was layered back on top of distributed storage, so analysts could query enormous datasets in a language they already knew. Cloud data warehouses then hid the cluster entirely, presenting petabytes behind an interface no more demanding than a database table. Each layer concealed more of the distributed machinery beneath it, until the difficulty that had defined the field a decade earlier was, for most users, simply gone.
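To illustrate what the return of SQL meant for an analyst, the sketch below expresses an aggregation declaratively and leaves the execution plan to the engine. It uses Python's built-in `sqlite3` module purely as a single-machine stand-in; it is not a distributed engine, and the `events` table, its columns, and the sample rows are hypothetical.

```python
import sqlite3

# In-memory database standing in for a far larger distributed store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id TEXT, bytes INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("a", 100), ("b", 250), ("a", 50), ("c", 75), ("b", 25)],
)

# The analyst states what is wanted; partitioning, grouping, and
# recovery from failure are the engine's concern, not the query's.
rows = conn.execute(
    "SELECT user_id, SUM(bytes) AS total "
    "FROM events GROUP BY user_id ORDER BY user_id"
).fetchall()
conn.close()

print(rows)
```

The query itself contains nothing about machines, partitions, or retries, and a query of the same form could be put to a warehouse holding petabytes. That absence is the abstraction the paragraph above describes.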
What we witnessed
What we witnessed in big data is the arc that recurs across every transition we have written about. A problem that was hard, bespoke, and the preserve of specialists became commoditised and easy, and the work moved up the stack as a result. We worked with data at a scale that strained the available tools before those tools matured, and the relief when the infrastructure caught up was considerable. The constraint that had defined the work, whether the data could be processed at all, dissolved, and a different constraint took its place.
That different constraint is the more interesting one, and it is where this piece hands off to its companion. Once storing and processing data at scale became easy, the binding question stopped being technical. It became a question of judgement: what the data should be used for, what it does to the people represented in it, and whether the ease of acting on it has outrun our care in deciding whether to. Infrastructure made the capability universal. The harder discipline is deciding what to do with a capability that no longer pushes back, and that discipline is still being learned. As with the abstractions in any maturing field, the ease is a genuine gain, and it rewards anyone who still understands what sits beneath it.