
For anything more complicated you can also get very far with simple Python programs that read one line at a time and output some transformation of it (which might include turning one line into many lines to be piped into sort, etc.).
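For example, here's a minimal sketch of that pattern, reading stdin and writing stdout so it can sit in a pipeline with sort; the tab-separated layout and field names are just assumptions for illustration:

    #!/usr/bin/env python3
    # One-line-in, possibly-many-lines-out filter, e.g.
    #   cat data.tsv | python3 explode.py | sort | uniq -c
    # Assumes (purely for illustration) tab-separated lines whose second
    # field is a comma-separated list we want to emit one output line per element of.
    import sys

    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip malformed lines
        key, values = fields[0], fields[1]
        for value in values.split(","):
            print(f"{value}\t{key}")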


I think the optimal way to do this kind of thing is:

1) Assuming there are no join/merge requirements, read in chunks and output GB-sized dumps.

2) If joins/merges are required, use external merge sort.

Is this correct? Actually I'm wondering whether I could earn some bread and butter by focusing on big data processing problems (e.g. sorting/filtering terabyte+ dumps, transforming each line of terabyte+ dumps, that kind of thing) without actually knowing how to implement the math algorithms required for data science.

If so, what kind of tools do I need to master? I'm thinking of the basic *nix tools mentioned above, plus Python and maybe a compiled language for optimization (someone managed to speed up a Python external merge sort on a 500 GB file by 50% by reimplementing it in Go), then maybe some simple algorithms (merge join, heaps, etc.).
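For point 2, this is the kind of external merge sort I have in mind, as a rough Python sketch (the chunk size, temp-file handling, and the assumption of newline-terminated text lines are all simplifications):

    # Sort chunks that fit in memory, spill each sorted chunk to a temp file,
    # then k-way merge the chunks lazily with heapq.merge.
    import heapq
    import itertools
    import tempfile

    def external_sort(in_path, out_path, chunk_lines=1_000_000):
        chunk_files = []
        with open(in_path) as f:
            while True:
                chunk = list(itertools.islice(f, chunk_lines))
                if not chunk:
                    break
                chunk.sort()
                tmp = tempfile.TemporaryFile(mode="w+")
                tmp.writelines(chunk)
                tmp.seek(0)
                chunk_files.append(tmp)
        with open(out_path, "w") as out:
            # heapq.merge streams from the sorted chunks, so memory stays bounded
            out.writelines(heapq.merge(*chunk_files))
        for tmp in chunk_files:
            tmp.close()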


Learning Unix tools is a pretty good place to start. There are a lot of commands that can do a lot of processing. It's been a while since I learned, but the book "Unix Power Tools" from O'Reilly is pretty good. It's old, but honestly these commands haven't changed much.

http://shop.oreilly.com/product/9780596003302.do

Python is slower than some of its compiled cousins, but it's quick to write and a great skill to have when bash scripting can't handle the complexity or you need DB access. We sometimes use it to call C programs that do DNA sequence alignments and process the results.
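For illustration, that glue code usually looks something like the following; the "aligner" binary and its flags are placeholders here, not our actual tool:

    # Python driving an external C program via subprocess and post-processing
    # its stdout line by line. The command name and flags are hypothetical.
    import subprocess

    def run_alignment(query_path, reference_path):
        result = subprocess.run(
            ["aligner", "--query", query_path, "--reference", reference_path],
            capture_output=True, text=True, check=True,
        )
        # parse whatever the tool prints, one tab-separated record per line
        return [line.split("\t") for line in result.stdout.splitlines() if line]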


Thanks a lot! Time to fire up VirtualBox and learn some things.


Congratulations. You've just discovered the basics of MapReduce :)


Minus the enormous overhead that MapReduce brings.



