Actually, the awk solution in the blog post doesn't load the entire dataset into memory, so it isn't limited by RAM. Even if you made the input 100x larger, mawk would still be hundreds of times faster than Hadoop. The important lesson here is streaming: in our field, we often process >100 GB of data in <1 GB of memory this way.
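For illustration, a minimal sketch of that kind of streaming aggregation (not the blog post's actual pipeline; the file name and column index here are made up): awk reads one line at a time and keeps only a running accumulator in memory, so the footprint stays constant no matter how big the input is.

    # Hypothetical: sum the third column of an arbitrarily large TSV file.
    # Only the current line and one accumulator are held in memory.
    mawk -F'\t' '{ sum += $3 } END { print sum }' huge_dataset.tsv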
This. For many analytical use cases the whole dataset doesn't have to fit into memory.
Still, it's of course worth pointing out how oversized a compute-cluster approach is when the whole dataset would actually fit into the memory of a single machine.
> Consulting service: you bring your big data problems to me, I say "your data set fits in RAM", you pay me $10,000 for saving you $500,000.