For anything more complicated you can also get very far with simple Python programs that read one line at a time and output some transformation of it (which might include turning one line into many, to be piped into sort, etc.).
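As a rough sketch of that pattern (the tab delimiter and the filename are just placeholders), a filter that reads stdin, can emit several output lines per input line, and is meant to sit in front of sort:

    import sys

    # Read one line at a time, emit one output line per field,
    # so the result can be piped into `sort | uniq -c` etc.
    for line in sys.stdin:
        for field in line.rstrip("\n").split("\t"):
            sys.stdout.write(field + "\n")

Used as something like: cat dump.tsv | python3 explode.py | sort | uniq -c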
I think that the optimal way to do these kinds of things is:
1) Assuming there is no join/merge requirement, read in chunks and write out GB-sized dumps.
2) If joins/merges are required, use an external merge sort (see the sketch after this list).
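A minimal sketch of (2), assuming newline-terminated records and using heapq.merge for the k-way merge; the chunk size and paths are placeholders you'd tune:

    import heapq, itertools, os, tempfile

    CHUNK_LINES = 1_000_000  # tune to available RAM

    def external_sort(in_path, out_path):
        runs = []
        # Pass 1: sort fixed-size chunks in memory, write each as a temp "run".
        with open(in_path) as f:
            while True:
                chunk = list(itertools.islice(f, CHUNK_LINES))
                if not chunk:
                    break
                chunk.sort()
                tmp = tempfile.NamedTemporaryFile("w+", delete=False)
                tmp.writelines(chunk)
                tmp.seek(0)
                runs.append(tmp)
        # Pass 2: lazy k-way merge of the sorted runs (heap-based under the hood).
        with open(out_path, "w") as out:
            out.writelines(heapq.merge(*runs))
        for r in runs:
            r.close()
            os.unlink(r.name)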
Is this correct? Actually I'm wondering whether I could earn some bread and butter by focusing on big data processing problems (e.g. sorting/filtering terabyte+ dumps, applying a transformation to each line of terabyte+ dumps, that kind of thing) without actually knowing how to implement the math algorithms required for data science.
If so, what kind of tools do I need to master? I'm thinking of the basic *nix tools mentioned above, plus Python and maybe a compiled language for optimization (someone managed to speed up a Python external merge sort on a 500GB file by 50% by reimplementing it in Go), and then maybe some simple algorithms (merge join, heaps, etc.).
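For the merge join part, a rough sketch under the assumption that both files are already sorted on the key in their first column (delimiter and duplicate handling are simplifications):

    def merge_join(left_path, right_path, sep="\t"):
        """Yield (key, left_rest, right_rest) for matching keys.
        Assumes both files are sorted on the first column; duplicate-key
        handling is omitted to keep the sketch short."""
        with open(left_path) as lf, open(right_path) as rf:
            l, r = lf.readline(), rf.readline()
            while l and r:
                lk, _, lrest = l.rstrip("\n").partition(sep)
                rk, _, rrest = r.rstrip("\n").partition(sep)
                if lk == rk:
                    yield lk, lrest, rrest
                    l, r = lf.readline(), rf.readline()
                elif lk < rk:
                    l = lf.readline()
                else:
                    r = rf.readline()

(For simple cases the Unix join command does the same thing on pre-sorted files.)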
Learning the Unix tools is a pretty good place to start. There are a lot of commands that can do a lot of the processing for you. It's been a while since I learned them, but the O'Reilly book "Unix Power Tools" is pretty good. It's old, but honestly these commands haven't changed much.
Python is slower than some of its compiled cousins, but it's quick to write and a great skill to have when bash scripting can't handle the complexity or you need DB access. We use it sometimes to call C programs to do DNA sequence alignments and process the results.
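That "call a C program and process its output" pattern looks roughly like this with subprocess; the binary name and its flags below are made-up placeholders, not a real aligner's interface:

    import subprocess

    # Run an external compiled tool and stream its stdout line by line.
    # "./aligner" and its arguments are hypothetical.
    proc = subprocess.Popen(
        ["./aligner", "--input", "reads.fa"],
        stdout=subprocess.PIPE,
        text=True,
    )
    for line in proc.stdout:
        # process each result line as it arrives instead of buffering it all
        print(line.rstrip("\n"))
    proc.wait()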