I have. I think it's a pretty easy situation for certain kinds of startups to find themselves in:
- Someone decides on CSV because it's easy to produce and you don't have that much data. Plus it's easier for the <non-software people> to read so they quit asking you to give them Excel sheets. Here <non-software people> is anyone who has a legit need to see your data and knows Excel really well. It can range from business types to lab scientists.
- Your internal processes start to consume CSV because it's what you produce. You build out key pipelines where one or more steps consume CSV.
- Suddenly your data increases by 10x or 100x or more because something started working: you got some customers, your sensor throughput improved, the science part started working, etc.
Then it starts to make sense to optimize ingesting millions or billions of lines of CSV. It buys you time so you can start moving your internal processes (and maybe some other teams' stuff) to a format more suited for this kind of data.
It's become a very common interchange format, even internally; it's also easy to deflate. I have had to work on codebases where CSV was being pumped out at basically the speed of the NIC (its origin was NetFlow data, which was aggregated and otherwise processed, with the results sent via CSV to a master for further aggregation and analysis).
I really don't get, though, why people can't just use protocol buffers instead. Is protobuf really that hard?
protobuf is more friction, and actually slow to write and read.
For better or worse, CSV is easy to produce via printf. Easy to read by breaking lines and splitting on the delimiter. Escaping delimiters that appear in the content is not hard, though it's often added as an afterthought.
Protobuf requires you to install a library, understand how it works, write a schema file, and share that schema with others. The API is cumbersome.
And to offer this mutable-struct-via-setters-and-getters abstraction, with variable-length-encoded numbers, variable-length strings, etc., the library ends up quite slow.
In my experience protobuf is slow and memory hungry. The generated code is also quite bloated, which is not helping.
> For better or worse, CSV is easy to produce via printf. Easy to read by breaking lines and splitting on the delimiter. Escaping delimiters that appear in the content is not hard, though it's often added as an afterthought.
Based on the amount of software I've seen that produces broken CSV or can't parse (more or less) valid CSV, I don't think that's true.
It seems easy, because it's just printf("%s,%d,%d\n", ...), but it's full of edge cases most programmers don't think about.
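For what it's worth, the "edge cases" part is easy to demonstrate. A minimal Python sketch (field values are made up) where the naive join/split approach breaks but a real CSV writer round-trips:

```python
import csv
import io

# A row where one field contains the delimiter and another contains quotes:
row = ['Acme, Inc.', 'said "hi"', '42']

# Naive printf-style output produces a record that no longer splits back:
naive = ",".join(row)
assert naive.split(",") != row  # the embedded comma breaks the first field

# The csv module quotes and escapes, so the row round-trips:
buf = io.StringIO()
csv.writer(buf).writerow(row)
parsed = next(csv.reader(io.StringIO(buf.getvalue())))
assert parsed == row
```

Quoting, embedded newlines, and doubled quotes are exactly the afterthought features that naive producers and parsers get wrong.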
Not an issue when you control both ends of the pipe. CSV is a great interchange format for tabular data, especially so if it's only/mostly numeric. If you need to pass tabular data from internal service X to internal service Y it's great. And it's really fast.
Mostly because dependencies are hard, and extra so when you need another team, using a different language, to also support the same format.
I’d love to pass parquet data around, or SQLite dbs, or something else, but that requires dedicated support from other teams upstream/downstream.
Everyone and everything supports CSV, and when they don’t they can hack a simple parser quickly. I know that getting a CSV parser right for all the edge cases is very hard, but they don’t need to. They just need to support the features we use. That’s simple and quick and everyone quickly moves on to the actual work of processing the data.
Yeah, that way there's no format to support. Maybe I'm more biased towards numeric data (sensor readings, etc.), but I never have to worry about libraries and dependencies to say
data = (uint32_t *)read(f);
Or
data = struct.unpack...
Sounds like you're dealing with more heavily formatted or variably formatted data that benefits from more structure to it
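To make the struct.unpack one-liner above concrete: a short stdlib-only sketch, assuming (hypothetically) the file is just a flat array of little-endian uint32 sensor readings:

```python
import struct

# Stand-in for raw = open(path, "rb").read(); the layout is assumed:
raw = struct.pack("<4I", 10, 20, 30, 40)

# One unpack call, no schema file, no third-party library:
count = len(raw) // 4
data = struct.unpack(f"<{count}I", raw)
assert data == (10, 20, 30, 40)
```

The whole "format" lives in that one format string, which is the appeal when you control both ends.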
Lucky you! We produce/consume several files from other teams, and those teams use Python, Java, Go, and C++ internally. I can try to bend those teams to my will by pushing my own custom serialization library (and would that even be a fair thing to do?) or I can just pass them a CSV.
Thanks to everyone above for some great responses. Cap'n Proto seems to do exactly what you're describing (the in-memory representation is identical to what's on the wire, and then getter/setter methods are generated which look at that).
yep, we use it a lot for internal stuff and I can't recall the last time we had an issue with parsing or using it. It just works for us as a data interchange file format for tabular data. Of course our character set is basically just ASCII letters and numbers; we don't even need commas or quotation marks.
Extremely hard to tell an HR person, "Right-click on here in your Workday/Zendesk/Salesforce/etc UI and export a protobuf". Most of these folks in the business world LIVE in Excel/Spreadsheet land so a CSV feels very native. We can agree all day long that for actual data TRANSFER, CSV is riddled with edge cases. But it's what the customers are using.
Kind of, there isn't a 1:1 mapping of protobuf wire types to schema types, so you need to package the protobuf schema with the data and compile it to parse the data, or decide on the schema before-hand. So now you need to decide on a file format to bundle the schema and the data.
I'm not the biggest fan of Protobuf, mostly around the 'perhaps-too-minimal' typing of the system and the performance differentials present on certain languages in the library.
e.g. I know in the .NET space, MessagePack is usually faster than proto, and I think similar is true on the JVM. The main disadvantage is there's no good schema-based tooling around it.
I shudder to think of what it means to be storing the _results_ of processing 21 GB/s of CSV. Hopefully some useful kind of aggregation, but if this was powering some kind of search over structured data then it has to be stored somewhere...
Just because you’re processing 21GB/s of CSV doesn’t mean you need all of it.
If your data is coming from a source you don’t own, it’s likely to include data you don’t need. Maybe there’s 30 columns and you only need 3 - or 200 columns and you only need 1.
Erm, maybe file-based? JSON is the king if you count exchanges worldwide per second. Maybe no. 2 is form-data, which is basically email multipart, and of course there's email itself as a format. Very common =)
JSON tabular data only adds a couple of brackets per line and at the start/end of the file vs CSV. In exchange for these bits (that basically disappear when compressed), you get a guaranteed standard formatting. Seems like a decent tradeoff to me.
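As a sketch of that tradeoff, here's an array-per-line (JSON Lines style) record next to an equivalent CSV line; the values are invented:

```python
import json

csv_line = "1700000000,a1,3.25"
json_line = json.dumps([1700000000, "a1", 3.25])
# json_line == '[1700000000, "a1", 3.25]' — a few extra brackets and quotes,
# but the parsing rules are fixed by the JSON spec rather than by CSV dialects.
assert json.loads(json_line) == [1700000000, "a1", 3.25]
assert len(json_line) - len(csv_line) < 10
```

The overhead is a handful of bytes per row before compression, and gzip largely eats the repeated punctuation.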
Humans generate decisions / text information at rates of ~bytes per second at most. There are barely enough humans around to generate 21 GB/s of information even if all they did was make financial decisions!
So 21 GB/s would be solely algos talking to algos... Given all the investment in the algos, surely they don't need to be exchanging CSV around?
Standards (whether official or de facto) often aren't the best in isolation, but they're the best in reality because they're widely used.
Imagine you want to replace CSV for this purpose. From a purely technical view, this makes total sense. So you investigate, come up with a better standard, make sure it has all the capabilities everyone needs from the existing stuff, write a reference implementation, and go off to get it adopted.
First place you talk to asks you two questions: "Which of my partner institutions accept this?" "What are the practical benefits of switching to this?"
Your answer to the first is going to be "none of them" and the answer to the second is going to be vague hand-wavey stuff around maintainability and making programmers happier, with maybe a little bit of "this properly handles it when your clients' names have accent marks."
Next place asks the same questions, and since the first place wasn't interested, you have the same answers....
Replacing existing standards that are Good Enough is really, really hard.
CSV is a questionable choice for a dataset that size. It's not very efficient in terms of size (real numbers take more bytes to store as text than as binary), it's not the fastest to parse (due to escaping), and a single delimiter or escape out of place corrupts everything after it. That's not to mention all the issues around encodings, different delimiters, etc.
It's great for when people need to be in the loop, looking at the data, maybe loading it in Excel, etc. (I use it myself...). But there aren't enough humans around for 21 GB/s.
> (real numbers take more bytes to store as text than as binary)
Depends on the distribution of numbers in the dataset. It's quite common to have small numbers. For these, text is a more efficient representation than binary, especially compared to 64-bit or larger binary encodings.
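A quick back-of-the-envelope check of that claim against a fixed-width 64-bit encoding, stdlib only:

```python
import struct

# Small values: one digit of text vs a fixed 8-byte uint64:
assert len(str(7)) == 1
assert len(struct.pack("<Q", 7)) == 8

# Large values flip the comparison: 2**63 needs 19 digits as text:
assert len(str(2**63)) == 19
assert len(struct.pack("<Q", 2**63)) == 8
```

Varint-style encodings (as protobuf uses) narrow the gap for small numbers, but fixed-width binary loses badly when most values are small.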
The only real example I can think of is the US options market feed. It is up to something like 50 GiB/s now, and is open 6.5 hours per day. Even a small subset of the feed that someone may be working on for data analysis could be huge. I agree CSV shouldn't even be used here but I am sure it is.
> Humans generate decisions / text information at rates of ~bytes per second at most
Yes, but the consequences of these decisions are worth much more. You attach an ID to the user, and an ID to the transaction. You store the location and time where it was made. Etc.
Why are you theorizing? I can tell you from out there that it's used massively, and it's not going away; on the contrary. Even rather small banks can end up generating various reports etc. which can easily become huge.
The speed of human decisions plays basically no role here, just as it doesn't with messaging generally; there is way more to companies than a direct keyboard-to-output link.