Yeah, this is pretty much it. The author complains about CSVs being "notoriously inconsistent" as though switching to some other format would magically change that. They're only inconsistent because sometimes lazy programmers do ",".join(mylist) instead of using an RFC4180 compliant CSV writer. Lazy programmers will just use non-compliant methods of creating whatever magic format OP is dreaming about. Case in point: trailing commas in JSON objects, and other ridiculous things that people have come up with such as encoding a date in JSON like this: "\/Date(628318530718)\/" https://docs.microsoft.com/en-us/previous-versions/dotnet/ar...
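For what it's worth, the fix is one line, which is what makes the laziness so frustrating. Rough Python sketch (the row contents are just made-up examples) showing the naive join next to the stdlib writer:

    import csv, io

    row = ["plain", "has,comma", 'has "quotes"', "multi\nline"]

    # Naive join: embedded commas, quotes and newlines silently corrupt the output
    naive = ",".join(row)

    # RFC4180-aware writer quotes and escapes those fields as needed
    buf = io.StringIO()
    csv.writer(buf).writerow(row)

    print(naive)
    print(buf.getvalue())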
CSVs also are great because you can parse them one row at a time. This makes for a very scalable and memory-efficient way of processing very large files containing millions of rows.
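In Python that one-row-at-a-time loop is about as short as it gets; a sketch, with the filename and the per-row work obviously placeholders:

    import csv

    count = 0
    # csv.reader streams: only the current row is held in memory
    with open("huge.csv", newline="") as f:
        for row in csv.reader(f):
            count += 1  # replace with whatever per-row processing you need
    print(count, "rows")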
Let there be no mistake: Everyone reading this today will retire long before CSVs retire. And that's just fine by me.
>CSVs also are great because you can parse them one row at a time. This makes for a very scalable and memory-efficient way of processing very large files containing millions of rows.
Even RFC4180-compliant CSVs can be incredibly memory-inefficient to parse. If you encounter a quoted field, you must scan ahead to the next unescaped quote to discover how large the field is, since every newline you encounter along the way is part of the field contents. Field sizes (and therefore row sizes) are unbounded and much harder to determine than simply looking for newlines; if you naively treated CSV as a "memory-efficient" format to parse, you would create a parser that is easy to blow up with a trivially crafted large file.
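A quick Python sketch of the failure mode (the field size here is arbitrary): one legally quoted field can drag an unbounded amount of data into a single "row". Python's csv module even caps fields at 131072 characters by default for this reason; lift the cap, as people often do for big files, and memory use is whatever the file's author wants it to be:

    import csv, io, sys

    # Lift the default per-field size limit (131072 characters)
    csv.field_size_limit(sys.maxsize)

    # One quoted field containing a million embedded newlines is still one row
    evil = '"' + ("x\n" * 1_000_000) + '",second_field\r\n'

    rows = list(csv.reader(io.StringIO(evil)))
    print(len(rows))        # 1 row
    print(len(rows[0][0]))  # 2,000,000 characters, all resident at once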
Great practical note by Josh Berkus on why Uber left PostgreSQL. Basically: runaway table bloat, because Uber had a use case that Postgres doesn't address as well as InnoDB.
The whole VACUUM paradigm is the biggest thing that bugs me about pgsql. The fact that it can actually freeze things always worries me. Can’t this happen constantly in the background like modern GCs?
> I think just a mere existence of flu or cold was a mistake. We should have eradicated those years ago.
It's not clear to me how eradicating these would have ever been possible in the past, or will be in the foreseeable future. The flu has (probably) been around since at least 6000 BC, and numerous strains can be spread by birds & many other species.
No, a significant part of it is that immunity is temporary. The article describes a particular cold virus where immunity lasts about 40 weeks, hence it resurges every winter.
To be fair, Python isn't the only language whose package management system is all but incoherent to folks who don't use Python every day (and sometimes even to them!). npm is pretty rough to get set up too, and you run into a lot of issues similar to this.
I don’t think Node has trouble from the very first installation. A brew install will set you up with the latest version of both node and npm and they will work.
At most you’ll have trouble running the right version (rare nowadays, unlike in pre-v1 days).
> What are the tradeoffs vs using docker? Just curious.
Probably some combination of memory usage and complexity, depending on your application. If you're already familiar with using docker as a development environment, definitely go for it.
I don't use pipenv, I'm still using plain old virtualenv for development. Mostly it's just a matter of familiarity. If there's not an itch, why scratch?
Certainly not, but considering there should be 100M available by the end of the year, it seems like they are scaling up the production line considerably.
Honestly I bet it's a learning algorithm that they used to identify bots, rather than something a human decided.
Spam detection algorithms usually have a training set. One of the features could have been "uses_feature_x", which, according to the training set, strongly correlates with being a spam bot (because humans rarely use those features).
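Roughly what that ends up looking like, as a toy Python sketch (the weights, the bias, and every feature name other than uses_feature_x are invented): a feature that is rare among humans but common among labeled bots gets a large learned weight, so using it pushes an account over the threshold almost by itself.

    import math

    def spam_probability(features, weights, bias=-3.0):
        # Logistic-regression-style scoring: sum the learned weights, squash to [0, 1]
        score = bias + sum(weights.get(f, 0.0) for f in features)
        return 1 / (1 + math.exp(-score))

    # Invented weights; "uses_feature_x" was rare among humans in the training set
    weights = {"uses_feature_x": 4.5, "new_account": 1.0, "posts_links": 0.8}

    print(spam_probability({"posts_links"}, weights))                    # ~0.10, looks human
    print(spam_probability({"uses_feature_x", "new_account"}, weights))  # ~0.92, flagged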