Actually, the awk solution in the blog post doesn't load the entire dataset into memory, so it isn't limited by RAM. Even if you made the input 100x larger, mawk would still be hundreds of times faster than Hadoop. The important lesson here is streaming: in our field, we often process >100 GB of data in <1 GB of memory this way.
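For illustration, a minimal sketch of that kind of streaming aggregation (not the blog post's actual pipeline; the file name and column index here are made up): awk reads one line at a time and keeps only a running accumulator in memory, so the footprint stays constant no matter how big the input is.

    # Hypothetical: sum the third column of an arbitrarily large TSV file.
    # Only the current line and one accumulator are held in memory.
    mawk -F'\t' '{ sum += $3 } END { print sum }' huge_dataset.tsv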
This. For many analytical use cases the whole dataset doesn't have to fit into memory.
Still, it's of course worth pointing out how oversized a compute-cluster approach is when the whole dataset would actually fit into the memory of a single machine.
> Consulting service: you bring your big data problems to me, I say "your data set fits in RAM", you pay me $10,000 for saving you $500,000.