
For anything more complicated you can also get very far with simple Python programs that read one line at a time and output some transformation of it (which might include turning one line into many lines to be piped into sort, etc.).
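For example, here's a minimal sketch of that pattern, reading stdin and writing stdout so it can sit in a pipeline with sort; the tab-separated layout and field names are just assumptions for illustration:

    #!/usr/bin/env python3
    # One-line-in, possibly-many-lines-out filter, e.g.
    #   cat data.tsv | python3 explode.py | sort | uniq -c
    # Assumes (purely for illustration) tab-separated lines whose second
    # field is a comma-separated list we want to emit one output line per element of.
    import sys

    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip malformed lines
        key, values = fields[0], fields[1]
        for value in values.split(","):
            print(f"{value}\t{key}")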


I think the optimal way to do this kind of thing is:

1) Assuming there are no join/merge requirements, read in chunks and output GB-sized dumps.

2) If joins/merges are required, use external merge sort.

Is this correct? Actually I'm wondering whether I could earn some bread and butter by focusing on big data processing problems (e.g. sorting/filtering terabyte+ dumps, transforming each line of terabyte+ dumps, that kind of thing) without actually knowing how to implement the math algorithms required for data science.

If so, what kind of tools do I need to master? I'm thinking of the basic *nix tools mentioned above, plus Python and maybe a compiled language for optimization (someone managed to speed up a Python external merge sort on a 500 GB file by 50% by reimplementing it in Go), then maybe some simple algorithms (merge join, heaps, etc.).
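For point 2, this is the kind of external merge sort I have in mind, as a rough Python sketch (the chunk size, temp-file handling, and the assumption of newline-terminated text lines are all simplifications):

    # Sort chunks that fit in memory, spill each sorted chunk to a temp file,
    # then k-way merge the chunks lazily with heapq.merge.
    import heapq
    import itertools
    import tempfile

    def external_sort(in_path, out_path, chunk_lines=1_000_000):
        chunk_files = []
        with open(in_path) as f:
            while True:
                chunk = list(itertools.islice(f, chunk_lines))
                if not chunk:
                    break
                chunk.sort()
                tmp = tempfile.TemporaryFile(mode="w+")
                tmp.writelines(chunk)
                tmp.seek(0)
                chunk_files.append(tmp)
        with open(out_path, "w") as out:
            # heapq.merge streams from the sorted chunks, so memory stays bounded
            out.writelines(heapq.merge(*chunk_files))
        for tmp in chunk_files:
            tmp.close()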


Learning Unix tools is a pretty good place to start. There are a lot of commands that can do a lot of processing. It's been a while since I learned, but the book "Unix Power Tools" from O'Reilly is pretty good. It's old, but honestly these commands haven't changed much.

http://shop.oreilly.com/product/9780596003302.do

Python is slower than some of its compiled cousins, but it's quick to write and a great skill to have when bash scripting can't handle the complexity or you need DB access. We sometimes use it to call C programs that do DNA sequence alignments and process the results.
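For illustration, that glue code usually looks something like the following; the "aligner" binary and its flags are placeholders here, not our actual tool:

    # Python driving an external C program via subprocess and post-processing
    # its stdout line by line. The command name and flags are hypothetical.
    import subprocess

    def run_alignment(query_path, reference_path):
        result = subprocess.run(
            ["aligner", "--query", query_path, "--reference", reference_path],
            capture_output=True, text=True, check=True,
        )
        # parse whatever the tool prints, one tab-separated record per line
        return [line.split("\t") for line in result.stdout.splitlines() if line]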


Thanks a lot! Time to fire up VirtualBox and learn some things.


Congratulations. You've just discovered the basics of MapReduce :)


Minus the enormous overhead that MapReduce brings.



