
You can solve many, perhaps most, terascale problems on a standard computer with big enough hard drives using the old memory-efficient tools like sed, awk, tr, od, cut, sort, etc. A9's recommendation engine used to be a pile of shell scripts on log files that ran on someone's desktop...


I was writing a batchwise ETL tool (to break documents in some proprietary format down into rows to feed to Postgres's COPY command), and I achieved a remarkable level of I/O parallelism by relying on Unix tooling to do my map-reducing for me.

1. I wrote a plain SQL mapper program, which spawns a worker-thread pool, where each worker opens its own "part" file for each SQL table, such that a document consumed by worker N gets records written to "tableA/partN.pgcopy".

2. And then, after the mapper is done, to do the reduce step I just spawn a `sort -m -u -k 1,1` invocation to collate the row files of each table into a single `.sql` file. This not only efficiently merge-sorts the (presorted) "part" files into one file (without needing to re-sort them), but also blows away any rows with duplicate primary keys [i.e. duplicate first columns in the row's TSV representation] - meaning I can restart the mapper program in the middle of a job (causing it to create a new set of "parts") and sort(1) will take care of cleaning up the result!
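
Concretely, the reduce step looks roughly like this (a hedged sketch; the table/part file names and the tab delimiter are assumptions, not the actual code):

    # Merge presorted per-worker part files for one table, keeping only the
    # first row seen for each primary key (the first TSV column).
    LC_ALL=C sort -m -u -t $'\t' -k 1,1 tableA/part*.pgcopy > tableA.sql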

I honestly don't think anything could be done to make this system more optimal for its use-case. sort(1) goes crazy fast when it can mmap(2) files on disk.

(Also, I'm pretty sure that even the framework part of the mapper program—the "N persistent workers that each greedily consume a document-at-a-time from a shared pipe, as if they were accept(2)ing connections on a shared socket"—could be created with Unix tooling as well, though I'm not sure how. GNU parallel(1), maybe?)
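
A hedged sketch of that fan-out with GNU parallel, where `map_one_doc.sh` is a hypothetical script that maps a single document to part-file rows. parallel runs one short-lived job per document rather than keeping N persistent workers, but it does keep N jobs busy greedily:

    # Feed document paths to up to $(nproc) concurrent mapper jobs.
    find ./docs -type f -print0 |
      parallel --null --jobs "$(nproc)" ./map_one_doc.sh {}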

Bonus: once you have SQL rowsets in TSV form like this, you can calculate a "differential" rowset (against a rowset you've already inserted) using `comm -23 $new $old`. No need for a staging table in your data warehouse; you can dedup your data-migrations at the source.
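
A minimal sketch of that differential step, assuming both rowsets are TSV files sorted under the same collation (file names here are illustrative):

    export LC_ALL=C            # comm(1) needs both inputs sorted identically
    sort -o old_rows.tsv old_rows.tsv
    sort -o new_rows.tsv new_rows.tsv
    comm -23 new_rows.tsv old_rows.tsv > delta_rows.tsv   # rows only in the new set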


Look into "bash-reduce" - but it'd be great to have something like "bark" (bash-spark) that consumed whole documents at a time... and you're right, it might not even be that difficult.


Could you share the code for these operations? I have some similar occasional use cases.


In the waning months of G+ I fired off a few bulk archives (front page only, not the full community post set) of the top 100k or so Communities (selected by size and recency criteria), using a one-liner of awk, xargs, curl, and the Internet Archive's Save Page Now URL (SPN: https://web.archive.org/save/$URL). That took a bit over an hour.
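
The shape of that one-liner was roughly the following (a hedged sketch; the input file name, column layout, and concurrency are assumptions):

    # Build SPN URLs from a TSV of community URLs and fetch them concurrently.
    awk -F'\t' '{ print "https://web.archive.org/save/" $1 }' community_urls.tsv |
      xargs -n 1 -P 4 curl -s -o /dev/null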

On a mid-1990s iMac running Debian, and a rusty residential DSL connection.

R played a role in other community-related analysis and reporting.


For anything more complicated, you can also get very far with simple Python programs that read one line at a time and output some transformation of it (which might include turning one line into many, to be piped into sort, etc.).
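
A minimal sketch of that pattern, with an illustrative Python filter dropped into an ordinary pipeline (the field-splitting transformation and file names are just examples):

    # Emit one output line per tab-separated field of each input line,
    # then dedup the expanded stream with sort.
    python3 -c 'import sys; [print(f) for line in sys.stdin for f in line.rstrip("\n").split("\t")]' \
      < input.tsv | sort -u > fields.txt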


I think that the optimal way to do these kinds of things is:

1) Assuming there is no join/merge requirement, read the input in chunks and output GB-sized dumps.

2) If joins/merges are required, use an external merge sort.

Is this correct? Actually, I'm wondering whether I could earn some bread and butter by focusing on big data processing problems (e.g. sorting/filtering terabyte-plus dumps, applying a transformation to each line of terabyte-plus dumps, that kind of thing) without actually knowing how to implement the math algorithms required for data science.

If so, what kinds of tools do I need to master? I'm thinking of the basic *nix tools mentioned above, plus Python and maybe a compiled language for optimization (someone managed to speed up a Python external merge sort on a 500 GB file by 50% by reimplementing it in Go), and then maybe some simple algorithms (merge join, heaps, etc.)
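
For what it's worth, the external merge sort in point 2 is basically what GNU sort already does when the input doesn't fit in RAM. A hedged sketch with illustrative file names, delimiter, and sizes:

    # Sort a bigger-than-RAM TSV dump on its first column; sort(1) spills
    # sorted runs to the temp directory and merges them itself.
    LC_ALL=C sort -t $'\t' -k 1,1 -S 8G --parallel=8 -T /scratch \
      big_dump.tsv > big_dump.sorted.tsv

    # With both inputs sorted on the join key, join(1) does a sort-merge join.
    LC_ALL=C join -t $'\t' -1 1 -2 1 big_dump.sorted.tsv other.sorted.tsv > joined.tsv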


Learning the Unix tools is a pretty good place to start; there are a lot of commands that can do a lot of processing. It’s been a while since I learned, but the book “Unix Power Tools” from O’Reilly is pretty good. It’s old, but honestly these commands haven’t changed much.

http://shop.oreilly.com/product/9780596003302.do

Python is slower compared to some of its compiled cousins, but it’s quick to write and a great skill to have when bash scripting can’t handle some of the complexity or you need DB access. We use it sometimes to call C programs to do DNA sequence alignments and process the results.


Thanks a lot! Time to fire up VirtualBox and learn some things.


Congratulations. You've just discovered the basics of MapReduce :)


Minus the enormous overhead that MapReduce brings.


Furthermore, as computers get faster and cheaper in every dimension, the threshold at which “Big Data” tooling and effort makes economic sense keeps rising with them. The limits of single nodes 15 years ago were pretty serious, but most problems businesses have, even in the so-called enterprise, can currently fit easily on a workstation costing maybe $5k and be crunched through in a couple of hours or maybe minutes - a lot easier to deal with than multiple Spark or Hana nodes. Operationalizing the analysis for more than a single group of users or a single problem is where things get more interesting, but I’ve seen very, very few companies with business needs that necessitate all this stuff at scale - most business leaders still seem to treat analytics results as discrete blocks via monthly/weekly reports and seem quite content with reports and findings that take hours to run. Usually when some crunching takes days, it’s not because the processing itself takes a lot of CPU, but because some ancient systems never intended to be used at that scale are the bottleneck, or manual processes are still required - so the critical path isn’t being touched at all by investing in more modern tools.

I can support “misguided” Big Data projects from a political perspective if they help fund fixing the fundamental problems that plague an organization (similar to Agile consultants), but most consultants are not going to do very well by suggesting going back and fixing something unrelated to their core value proposition. For example, if you hire a bunch of machine learning engineers and they all say “we need to spend months or even years cleaning up and tagging your completely unstructured data slop, because nothing we have can work without clean data”, that will probably frustrate the people paying them $1MM+ / year each to get results ASAP. The basics are missing by default, and it’s why the non-tech companies are falling further and further behind despite massive investments in technology - technology is not a silver bullet for crippling organizational and business problems (this is pretty much the TL;DR of 15+ years of “devops”, for me at least).


That is precisely what the projects I'm usually involved in do. A client might want "buzzword technology", but at the heart of it, what they really need are stable, scalable, consolidated data pipelines to e.g. Hadoop or AWS that give "Data Scientists" a baseline to work with (and anyone needing information, really - it was just called "Business Intelligence" a couple of years ago).

In the end it doesn't matter whether you wind up with a multi-TB copy of some large database or a handful of small XML files - it's all in one place, it gets updated, there are usable ACLs in place, and it can be accessed and worked with. That's the point where you start thinking about running a Spark job or the above AWK magic.


> most business leaders still seem to treat analytics results in discrete blocks via monthly / weekly reports and seem quite content with reports and findings that take hours to run.

I would go further and even call long (or at least not instant) report generation a perceived feature. Similar to flight and hotel booking sites that show some kind of loading screen even when they could return instant search results, the duration of the generation itself seems to add trust to the reports.


> The basics are missing by default

Absolutely. I really want to see advanced AI/ML tools developed to address THIS problem. Don’t make me fix the data before I can use ML - give me ML to fix my data!

That’s hard, though, because data chaos is unbounded and computers are still dumb. But I think there’s still tons of room for improvement.


I watched a talk by someone in the intelligence community space nearly 8 years ago about the data dirt that most companies and spy agencies are combing through, and the kind of abstract research that will be necessary to turn it into something consumable by all the stuff the private sector seems to be selling and hyping. So I think the old-guard big data folks collecting yottabytes of crap across the world and trying to make sense of it are well aware, and may actually get to it sometime soon.

My unsubstantiated fear is that we can’t attack the data quality problem at any kind of scale, because we’d need a massive revolution that won’t be funded by any VC, or that nobody will try to tackle because it’s too hard / not sexy - government funding is super bad and brain drain is a serious problem. In academia, who the heck gets a doctorate for advancements in cleaning up arbitrary data to feed into ML models, when pumping out more incremental model and hyperparameter improvements gives you a better chance of getting your papers through or getting employment? I’m sure plenty of companies would rather pay decent money to clean up data with lower-cost labor than have their highly paid ML scientists clean it up, so I’m completely mystified as to why we’re not seeing massive investments here across disciplines and sectors. Is it like the climate change political problem of computing?


> In academia, who the heck gets a doctorate for advancements in cleaning up arbitrary data to feed into ML models

Well - Alex Ratner [stanford], for one: https://ajratner.github.io/

And several of Chris Re's other students have as well: https://cs.stanford.edu/~chrismre/

Trifacta is Joseph Hellerstein's [berkeley] startup for data wrangling: https://www.trifacta.com/

Sanjay Krishnan [berkeley]: http://sanjayk.io/


I was asking somewhat rhetorically, but I’m glad to see that there are some serious efforts going into weak supervision. At the risk of moving the goalposts, I’m curious who besides those at the cutting edge in the Bay Area is working on this pervasive problem? My more substantive point is that, given the massive data quality problem across the ML community, I would expect these researchers to be superhero class - why aren’t they?


... they are?

There are a lot of people tackling bits and pieces of the problem. Tom Mitchell's NELL project was an early one, using the web in all its messy glory: http://rtw.ml.cmu.edu/rtw/

Lots of other folks here (CMU) as well, particularly if you add in active learning. It's a hard, messy problem that crosses databases and ML.


AWK is such an amazing tool


Awk is the most useful yet largely ignored tool in the UNIX tool chest. Any script with simple logic that involves transformations on input data can usually be written more easily in awk and integrated with the shell. After learning awk, your UNIX abilities will increase exponentially.
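
A small illustrative example of the kind of transformation that falls out naturally in awk (the log layout and column numbers are assumptions): summing a numeric column per key over TSV input.

    # Total column 3 per key in column 1, then list the biggest keys first.
    awk -F'\t' '{ total[$1] += $3 } END { for (k in total) print k "\t" total[k] }' \
      access_log.tsv | sort -t $'\t' -k 2,2nr | head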


I found this article to be really great for outlining all the capabilities of awk:

https://developer.ibm.com/tutorials/l-awk1/

From 2001.


Seconded, and the follow-up articles:

https://developer.ibm.com/tutorials/l-awk2/
https://developer.ibm.com/tutorials/l-awk3/

These were mentioned but not linked to in the previous form of the article/blog; I had a quick look at the newer version you linked to, and that may still be the case.

A lot of the Awk info I had found prior to stumbling on these articles was focused on command-line one-liners, so the sections on defining Awk scripts as files and on multiline records were a great help to me.
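
For example, a standalone Awk script over blank-line-separated ("multiline") records, in the spirit of those articles (the file names are illustrative):

    # records.awk: treat each blank-line-separated block as one record,
    # with one field per line.
    BEGIN { RS = ""; FS = "\n" }
    { print "record " NR " has " NF " lines; first line: " $1 }

    # Run it as: awk -f records.awk addressbook.txt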


What is A9?


It's an Amazon subsidiary

https://en.wikipedia.org/wiki/A9.com





