
I shudder to think who needs to process a million lines of csv that fast...


I have. I think it's a pretty easy situation for certain kinds of startups to find themselves in:

- Someone decides on CSV because it's easy to produce and you don't have that much data. Plus it's easier for the <non-software people> to read so they quit asking you to give them Excel sheets. Here <non-software people> is anyone who has a legit need to see your data and knows Excel really well. It can range from business types to lab scientists.

- Your internal processes start to consume CSV because it's what you produce. You build out key pipelines where one or more steps consume CSV.

- Suddenly your data increases by 10x or 100x or more because something started working: you got some customers, your sensor throughput improved, the science part started working, etc.

Then it starts to make sense to optimize ingesting millions or billions of lines of CSV. It buys you time so you can start moving your internal processes (and maybe some other teams' stuff) to a format more suited for this kind of data.


We use parquet extensively at work, and it's really slow to ingest. Slower than a hand rolled binary column oriented format.

Sometimes using something standardized is just worth it though.


It's become a very common interchange format, even internally; it's also easy to deflate. I have had to work on codebases where CSV was being pumped out at basically the line rate of the NIC (its origin was Netflow, and then aggregated and otherwise processed, and the results sent via CSV to a master for further aggregation and analysis).

I really don't get, though, why people can't just use protocol buffers instead. Is protobuf really that hard?


Protobuf is more friction, and actually slower to write and read.

For better or worse, CSV is easy to produce via printf. Easy to read by breaking lines and splitting by the delimiter. Escaping delimiters that are part of the content is not hard, though it's often added as an afterthought.

Protobuf requires installing a library, understanding how it works, writing a schema file, and sharing that schema with others. The API is cumbersome.

Finally, to offer this mutable-struct abstraction via setters and getters, with variable-length-encoded numbers, variable-length strings, etc., the library ends up quite slow.

In my experience protobuf is slow and memory hungry. The generated code is also quite bloated, which is not helping.

See https://capnproto.org/ for details from the primary author of protobuf v2.

Is CSV faster than protobuf? I don't know, and I haven't tested. But I wouldn't be surprised if it is.


> For better or worse, CSV is easy to produce via printf. Easy to read by breaking lines and splitting by the delimiter. Escaping delimiters part of the content is not hard, though often added as an afterthought.

Based on the amount of software I've seen that produces broken CSV or can't parse (more-or-less) valid CSV, I don't think that's true.

It seems easy, because it's just printf("%s,%d,%d\n", ...), but it's full of edge cases most programmers don't think about.
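A quick Python sketch (with made-up field values) of the kind of edge case that bites:

```python
import csv, io

# Naive "printf-style" CSV: breaks as soon as a field contains the delimiter.
row = ["Acme, Inc.", 42, 'said "hi"']
naive = ",".join(str(f) for f in row)
# A reader now sees four fields instead of three.

# A conforming writer quotes and doubles embedded quotes (per RFC 4180).
buf = io.StringIO()
csv.writer(buf).writerow(row)
correct = buf.getvalue().strip()

# Only the properly quoted line round-trips.
assert next(csv.reader([correct])) == ["Acme, Inc.", "42", 'said "hi"']
assert len(next(csv.reader([naive]))) == 4
```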


Not an issue when you control both ends of the pipe. CSV is a great interchange format for tabular data, especially so if it's only/mostly numeric. If you need to pass tabular data from internal service X to internal service Y it's great. And it's really fast.


Hmmm if they're just internal tools, why not just an array of structs? No parsing needed. Can have optionals. Can't go faster than nothing.


Mostly because dependencies are hard, extra so when you need another team using a different language to also support the same format.

I’d love to pass parquet data around, or SQLite dbs, or something else, but that requires dedicated support from other teams upstream/downstream.

Everyone and everything supports CSV, and when they don’t they can hack a simple parser quickly. I know that getting a CSV parser right for all the edge cases is very hard, but they don’t need to. They just need to support the features we use. That’s simple and quick and everyone quickly moves on to the actual work of processing the data.


Yeah there's no format to support that way. Maybe I'm more biased towards numeric data (sensor readings, etc), but I never have to worry about libraries and dependencies to say

read(fd, buf, n); data = (uint32_t *)buf;

Or

data = struct.unpack...

Sounds like you're dealing with more heavily formatted or variably formatted data that benefits from more structure.
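For instance, a minimal Python sketch of the struct.unpack approach, assuming a made-up record layout of a uint32 timestamp plus a float32 reading:

```python
import struct

# Hypothetical sensor log: each fixed-size record is (uint32 timestamp, float32 value).
records = [(1000, 1.5), (1001, 2.5), (1002, -0.25)]
blob = b"".join(struct.pack("<If", t, v) for t, v in records)

# "No parsing needed": fixed-size records unpack directly at known offsets.
out = [struct.unpack_from("<If", blob, i * 8) for i in range(len(blob) // 8)]
assert out == records
```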


Lucky you! We produce/consume several files from other teams, and those teams use Python, Java, Go, and C++ internally. I can try to bend those teams to my will by pushing my own custom serialization library (and would that even be a fair thing to do?) or I can just pass them a CSV.


Thanks to everyone above for some great responses. Cap'n Proto seems to do exactly what you're describing (the in-memory representation is identical to what's on the wire, and then getter/setter methods are generated which look at that).


Yep, we use it a lot for internal stuff and I can't recall the last time we had an issue with parsing or using it. It just works for us as a data interchange file format for tabular data. Of course our character set is basically just ASCII letters and numbers; we don't even need commas or quotation marks.


Precisely. And if things get a bit more complicated slap a ‘|’ as a separator and you are almost guaranteed to never need to quote anything.
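A sketch of that with Python's csv module and a hypothetical row:

```python
import csv, io

# With '|' as the delimiter, a comma inside a field needs no quoting at all.
buf = io.StringIO()
csv.writer(buf, delimiter="|").writerow(["sensor_a", 12.5, "ok, nominal"])
line = buf.getvalue().strip()
# line == 'sensor_a|12.5|ok, nominal'
```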


Extremely hard to tell an HR person, "Right-click on here in your Workday/Zendesk/Salesforce/etc UI and export a protobuf". Most of these folks in the business world LIVE in Excel/Spreadsheet land so a CSV feels very native. We can agree all day long that for actual data TRANSFER, CSV is riddled with edge cases. But it's what the customers are using.


It's extremely unlikely they need to load spreadsheets large enough for 21 GB/s speed to matter


You’d be surprised. Big telcos use CSV and SFTP for CDR data, and there’s a lot of it.


Oh absolutely! I'm just mentioning why CSV is chosen over Protobufs.


Kind of; there isn't a 1:1 mapping of protobuf wire types to schema types, so you need to package the protobuf schema with the data and compile it to parse the data, or decide on the schema beforehand. So now you need to decide on a file format to bundle the schema and the data.


I'm not the biggest fan of Protobuf, mostly around the 'perhaps-too-minimal' typing of the system and the performance differentials present in certain language implementations of the library.

E.g., I know in the .NET space, MessagePack is usually faster than proto; I think similar is true for the JVM. The main disadvantage is there's no good schema-based tooling around it.


I shudder to think of what it means to be storing the _results_ of processing 21 GB/s of CSV. Hopefully some useful kind of aggregation, but if this was powering some kind of search over structured data then it has to be stored somewhere...


Just because you’re processing 21GB/s of CSV doesn’t mean you need all of it.

If your data is coming from a source you don't own, it's likely to include data you don't need. Maybe there are 30 columns and you only need 3 - or 200 columns and you only need 1.

Enterprise ETL is full of such cases.
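A sketch of that pattern in Python, with made-up column names, keeping 3 columns out of 30:

```python
import csv, io

# Hypothetical 30-column feed from a source you don't own.
header = [f"col{i}" for i in range(30)]
rows = [[str(i * 30 + j) for j in range(30)] for i in range(2)]
feed = "\n".join(",".join(r) for r in [header] + rows)

# Keep only the columns you actually need.
want = {"col0", "col7", "col29"}
r = csv.reader(io.StringIO(feed))
idx = [i for i, name in enumerate(next(r)) if name in want]
slim = [[row[i] for i in idx] for row in r]
# slim == [['0', '7', '29'], ['30', '37', '59']]
```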


For all its many weaknesses, I believe CSV is still the most common data interchange format.


Erm, maybe file-based? JSON is king if you count exchanges worldwide per second. Maybe No. 2 is form-data, which is basically email multipart, and of course there's email itself as a format. Very common =)


I meant file-based.


I honestly wonder if JSON is king. I used to think so until I started working in fintech. XML is unfortunately everywhere.


JSON: because XML is too hard.

Developers: hey, let's hack everything XML had back onto JSON except worse and non-standardized. Because it turns out you need those things sometimes!


JSON isn't great for tabular data. And an awful lot of data is tabular.


Yeah, I don’t like parsing XML, but I’d rather do that than deal with the Lovecraftian API design that comes with complex JSON representations.


JSON tabular data only adds a couple of brackets per line and at the start/end of the file vs CSV. In exchange for these bits (that basically disappear when compressed), you get a guaranteed standard formatting. Seems like a decent tradeoff to me.
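A rough Python illustration of the overhead, with a made-up row:

```python
import csv, io, json

row = ["alice", 42, 3.14]

# CSV line for the row.
buf = io.StringIO()
csv.writer(buf).writerow(row)
csv_line = buf.getvalue().strip()   # 'alice,42,3.14'

# Same row as one JSON array per line (JSON Lines style).
json_line = json.dumps(row)         # '["alice", 42, 3.14]'

# A few extra bytes per line, but quoting/escaping is fully standardized.
assert len(json_line) > len(csv_line)
```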


Lots of folks in finance; you can share CSV with any finance company and they can process it. It's text.


Humans generate decisions / text information at rates of ~bytes per second at most. There are barely enough humans around to generate 21 GB/s of information even if all they did was make financial decisions!

So 21 GB/s would be solely algos talking to algos... Given all the investment in the algos, surely they don't need to be exchanging CSV around?


Standards (whether official or de facto) often aren't the best in isolation, but they're the best in reality because they're widely used.

Imagine you want to replace CSV for this purpose. From a purely technical view, this makes total sense. So you investigate, come up with a better standard, make sure it has all the capabilities everyone needs from the existing stuff, write a reference implementation, and go off to get it adopted.

First place you talk to asks you two questions: "Which of my partner institutions accept this?" "What are the practical benefits of switching to this?"

Your answer to the first is going to be "none of them" and the answer to the second is going to be vague hand-wavey stuff around maintainability and making programmers happier, with maybe a little bit of "this properly handles it when your clients' names have accent marks."

Next place asks the same questions, and since the first place wasn't interested, you have the same answers....

Replacing existing standards that are Good Enough is really, really hard.


CSV is a questionable choice for a dataset that size. It's not very efficient in terms of size (real numbers take more bytes to store as text than as binary), it's not the fastest to parse (due to escaping), and a single delimiter or escape out of place corrupts everything afterwards. That's not to mention all the issues around encoding, different delimiters, etc.


It's great for when people need to be in the loop, looking at the data, maybe loading it in Excel, etc. (I use it myself...). But there aren't enough humans around for 21 GB/s.


> (real numbers take more bytes to store as text than as binary)

Depends on the distribution of numbers in the dataset. It's quite common to have small numbers. For these, text is a more efficient representation than binary, especially compared to 64-bit or larger binary encodings.
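A quick Python illustration, using an arbitrary small value:

```python
import struct

# For small values, decimal text beats a fixed 64-bit binary encoding.
n = 7
text = str(n).encode()          # b'7' -> 1 byte (plus a delimiter)
binary = struct.pack("<q", n)   # always 8 bytes, regardless of magnitude

assert len(text) == 1
assert len(binary) == 8
# Text only catches up to a fixed 8-byte int around 8 decimal digits.
```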


The only real example I can think of is the US options market feed. It is up to something like 50 GiB/s now, and is open 6.5 hours per day. Even a small subset of the feed that someone may be working on for data analysis could be huge. I agree CSV shouldn't even be used here but I am sure it is.


OPRA is a half dozen terabytes of data per day compressed.

CSV wouldn't even be considered.


You might have accumulated some decades of data in that format and now want to ingest it into a database.


Yes, but if you have decades of data, what does waiting a minute or ten to convert it matter?


> Humans generate decisions / text information at rates of ~bytes per second at most

Yes, but the consequences of these decisions are worth much more. You attach an ID to the user, and an ID to the transaction. You store the location and time where it was made. Etc.


I think these would add only a small amount of information (and in a DB would be modelled as joins). It only adds a lot of data if done very inefficiently.


Why are you theorizing? I can tell you from out there that it's used massively, and it's not going away; on the contrary. Even rather small banks can end up generating various reports, etc., which can easily become huge.

The speed of human decisions plays basically no role here, just as it doesn't with messaging generally; there is way more to companies than a direct keyboard-to-output link.


You seem to not realize that most humans are not coders.

And non-coders use proprietary software, which usually has an export to CSV or XLS to be compatible with Microsoft Office.


In basically every situation it is inferior to HDF5.

I do not think there is an actual explanation besides ignorance, laziness or "it works".


That Cartesian product file accounting sends you at year end?


Ugh... I do, unfortunately.



