Hacker News | jkeiser's comments

We have started experimenting with how to do it with simdjson, though :) https://github.com/simdjson/simdjson/pull/947


It's kind of a "continued learning" requirement. Work doesn't always have the problems (or the funding for them) that grow you in the right directions.

Open source is also a sort of portfolio for many people, I think. Resumes can only tell you so much; code speaks volumes. It pays, just not directly or immediately.

Though my first job in Silicon Valley-sized tech was at Netscape, and I got that specifically after writing some big open source patches for Mozilla. So it can sometimes pay more directly.


What I've learned from this (as a simdjson author) is that we need to update the quick start in the README to have -O3. I was so psyched about the fact that we now compiled warning-free without any parameters ... that I didn't stop to think that some people would go "huh I did what they told me and simdjson is slow, wtf." Because we evidently told you to compile it in debug mode in the quick start :)

simdjson relies deeply on inlining to let us write performant code that is also readable.

Sorry to have sent you down a blind alley!

One thing to note: if you want to get good numbers to chew on, we have a bunch of really good real world examples, of ALL sizes (simdjson is about big and small), in the jsonexamples/ directory of simdjson. And if you want to check ojc's validation, there are a number of pass.json and fail.json files in the jsonchecker/ directory.


simdjson doesn't care what is in the padding and won't modify it; it just needs the buffer the string lives in to have 32 extra addressable (allocated) bytes. It doesn't ever use those bytes to make decisions, but it may read them, because certain critical algorithms run faster when you overshoot a tiny bit and correct afterward.

Most real-world applications (reading from sockets, for example) read into a buffer they control, and can easily meet this requirement.
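To make the requirement concrete, here is a minimal sketch of allocating a buffer with the extra addressable bytes. The constant name `kPadding` and the helper `make_padded` are illustrative, not simdjson's actual API (the library exposes its own padded-string type); only the 32-byte figure comes from the comment above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Illustrative constant: the extra addressable space simdjson wants
// after the JSON text (32 bytes, per the discussion above).
constexpr std::size_t kPadding = 32;

// Copy the JSON into a buffer with kPadding extra bytes allocated.
// The *contents* of the padding don't matter; it just has to be
// legal to read those addresses.
std::vector<char> make_padded(const std::string& json) {
    std::vector<char> buf(json.size() + kPadding, '\0');
    std::memcpy(buf.data(), json.data(), json.size());
    return buf;
}
```

A socket-reading application gets this for free by simply allocating its receive buffer 32 bytes larger than the largest message it expects.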

If you are interested to know what it's for, the place where it parses/unescapes a string is a good example. Instead of copying byte by byte, it generally copies 16-32 raw bytes at a time, and just sort of caps it off at the end quote, even though it might have copied a little extra. Here's some pseudocode (note this isn't the full algorithm; I left out some error conditions and escape parsing for clarity):

  // Check the quote
  if (in[i] == '"') {
    i++;
    len = 0;
    while (true) {
      // Use simd to copy 32 bytes from input to output
      chunk = in[i..i+32];
      out[len..len+32] = chunk;

      // Note we already wrote 32 bytes, and NOW check if there was a quote in there
      if (int quote = chunk.find('"')) {
        len += quote;
        break;
      }
      len += 32; // No quote, so keep parsing the string
      i += 32;
    }
  }
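The pseudocode above can be turned into a runnable scalar sketch. This is not simdjson's actual implementation (no SIMD, no escape handling, fixed-size output, and the function name is made up); it just demonstrates the overshoot-and-trim copy, and it relies on the padding guarantee that at least 32 bytes past each chunk start are readable.

```cpp
#include <cassert>
#include <cstring>
#include <string>

// Scalar sketch of the overshoot-and-trim string copy described above.
// Assumes: input starts with '"', value is under 224 bytes, and the
// buffer has >= 32 readable bytes beyond any point we touch (padding).
std::string copy_string_value(const char* in) {
    assert(in[0] == '"');
    in++;                                   // skip the opening quote
    char out[256];                          // fixed-size output for the sketch
    std::size_t len = 0;
    while (true) {
        std::memcpy(out + len, in, 32);     // copy a whole chunk blindly
        const char* q = static_cast<const char*>(std::memchr(in, '"', 32));
        if (q) {                            // closing quote was in the chunk:
            len += q - in;                  // trim off the overshoot
            break;
        }
        len += 32;                          // no quote yet, keep copying
        in += 32;
    }
    return std::string(out, len);
}
```

Note how the copy happens before the quote check: the common case (long string, no quote in this chunk) runs with no data-dependent work at all, and the extra bytes copied past the quote are simply ignored.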


Yes, this is exactly why simdjson wants padding. It certainly doesn't need the string to be embedded in source or any such nonsense.

I wish there were a standardized attribute that C++ knew about that pretty much just said "hey, we're not right next to some memory-managed disaster, and if you read off the end of this buffer, you promise not to use the results".

It is awful practice to read off the end of a buffer and let those bytes affect your behavior, but it is almost always harmless to read extra bytes (and mask them off or ignore them) unless you're next to a page boundary or in some dangerous region of memory that's mapped to some device.

This attribute would also need to be understood by tools like Valgrind (to the extent that valgrind can/can't track whether you're feeding this nonsense into a computation, which it handles pretty well).


Worth pointing out: I thought it was just the SIMD that made it fast when I first got involved. It turns out that while it helps, it's just a tool that helps to achieve the real gain: eliminating branches (if statements) that make the processor stumble and fall in its attempts to sprint ahead (speculative execution). simdjson's first stage pushes this capability really far towards its limit, achieving 4 instructions per cycle by not branching. And yes, 1 cycle is the smallest amount of time a single instruction can take. Turns out a single thread is running multiple instructions in parallel at any given time, as long as you don't trip it up!

Parsing is notoriously serial and branchy, which is what makes simdjson so out of the ordinary. It's using a sort of "microparallel algorithm," running a small parallel parse on a SIMD-sized chunk of JSON (16-32 bytes depending on architecture), and then moving to the next.
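A portable stand-in can show what "a small parallel parse on a SIMD-sized chunk" means in practice. The scalar loop below does what a single SIMD compare-and-movemask does in one or two instructions (e.g. `_mm256_cmpeq_epi8` + `_mm256_movemask_epi8` on AVX2); the function name is illustrative, not simdjson's API, and it assumes 32 readable bytes (the padding guarantee).

```cpp
#include <cassert>
#include <cstdint>

// Scalar stand-in for one SIMD step: compare 32 input bytes against '"'
// "in parallel" and pack the results into a 32-bit mask, where bit i set
// means in[i] == '"'. No branches depend on the data; the loop trip
// count is fixed, so the processor can sprint straight through it.
uint32_t quote_mask(const char* in) {
    uint32_t mask = 0;
    for (int i = 0; i < 32; i++)            // a SIMD unit does all 32 at once
        mask |= uint32_t(in[i] == '"') << i;
    return mask;
}
```

Once positions are bitmasks, the parser can combine whole chunks with bitwise arithmetic (AND, OR, shifts) instead of per-character `if` statements, which is where the branch elimination comes from.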

And yeah, you have to go back over a decade to find CPUs that don't have SIMD. simdjson runs on those too, just obviously doesn't use SIMD instructions :)


This is a good explanation of why it's fast.

An interesting point: the design of simdjson loses its branchlessness in "stage 2". I originally had a bunch of very elaborate plans to try to stay branchless far further into the parsing process. It proved just too hard to make it work. There are some promising things that ultra-modern Intel chips (meaning Ice Lake) and future iterations of ARM (SVE/SVE2) are adding to their SIMD abilities, so it might be worth revisiting this in a few years (there aren't too many Ice Lake boxes out there yet, and SVE barely exists).


Yep. Most of stage 2's branchiness essentially comes from "is the next thing an array? object? string? number? Handle it differently if so."

Making it so you can handle all the brackets at once, all the strings at once, all the numbers at once, would make a big difference, and we're thinking about that. Another thing that could help is making the if statement more predictable using type information from the user. get<int>() could mean "I expect this next thing to be an integer, so parse it that way and just yell if it's not, please."

It's difficult. But it's why I'm still so fascinated! Solving JSON thoroughly and completely will give us a lot of information on how to quickly parse XML, YAML, and other file formats.

We've clearly been collectively doing parsing wrong (including me) if there's this much of a gap. It's exciting to see something innovative and new in this domain, and even to be able to contribute to it :) @lemire deserves a ton of credit for making an actual project out of his work and promoting it; I likely wouldn't have heard of it otherwise.


That's right. There's a problem which I referred to as the toothpaste problem - you can squeeze the needed branchiness around but you can't make it go away entirely (at least, I couldn't).

There used to be 4 stages (stage 1 was the marks, stage 2 was bits-to-indexes, stage 3 was the tape construction and stage 4 was the 'clean everything up and do all the hard stuff'). It's possible - though awkward - to do tape construction branchlessly, but the gyrations required were expensive and weird and it just delayed the reckoning.

I built a prototype of the "gather all the X at once and handle it in one go" approach, and the logic to gather that stuff was more expensive than just handling everything.

In my wacky world of branch-free coding (which I've been doing a while), there are worse things than missing a branch. The idea that you can accumulate an array branchlessly (i.e. always put the thing you have in a location somewhere and bump a pointer when you need to) seems pretty cool, but a branch miss is not the only hazard. This technique of branchlessly writing a log is an anti-pattern I've tried over and over again, and a stream of unpredictable writes is just as big a pain as a bunch of unpredictable branches - it causes the pipeline to come to a screaming halt. If you can get it going somehow (new SIMD tricks? a better algorithm?) I'd be impressed.
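For readers unfamiliar with the pattern being criticized, here is a minimal sketch of the "always store, conditionally bump the index" trick (the function name and the even-number filter are made up for illustration). There is no branch on the data, but the store address depends on the data, which is exactly the unpredictable-write hazard described above.

```cpp
#include <cassert>
#include <cstddef>

// Branchless filter: unconditionally store each element at out[count],
// then advance count only when the element qualifies. Rejected elements
// get overwritten by the next store. No data-dependent branch exists,
// but the write address (out + count) is data-dependent.
std::size_t keep_even(const int* in, std::size_t n, int* out) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < n; i++) {
        out[count] = in[i];             // unconditional store
        count += (in[i] % 2 == 0);      // bump only if we keep it
    }
    return count;
}
```

The caller must size `out` to hold at least `n` elements, since every iteration writes somewhere.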

Get my email from Daniel or hit me up in Twitter DMs if you want to discuss further.


... and yes, agreed that Daniel has done a great job making this into a real thing. This would have stayed a bunch of code sketches if I had been the only one working on it. In terms of quality, the project is now well on the way to being commercial-quality code rather than the research prototype that I did. I understand I have you to thank for that as well!


Streaming and selective parsing are good things, and something we're looking into for the next set of features.

Note that there are real speed gains to be had by not being selective. The CPU is normally sprinting ahead, executing future instructions before you even get to them, and every branch can make it trip and land flat on its face.

We've found we have to be very careful with every branch we introduce. I've tried to introduce branches that skip tons of code and ought to make things a ton faster, but which instead slowed things down.


That's of course true: branching is costly and malloc is costly. But there is a need to filter objects and arrays in the second step after parsing, in the conversion to your language's data structures - with a slower separate function, of course.

Parsing is the negligibly fast part; converting it into your business objects is the slow part. This is usually done in the second, tape step. It involves a lot of mallocs, and you really want to skip the unneeded parts and filter them up front.

I'm still thinking about how to design my new JSON parser around simdjson.


Agreed, we're thinking it through too. Most parsing of JSON could be done almost without any second stage at all, or at least with a direct-to-user-struct/array approach, if you had the right API.


simdjson does not modify the input, fyi. Rather than searching for \0, which isn't really a part of JSON, the SIMD instructions search for whitespace and JSON structural characters like commas, colons, etc.
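A read-only scan like the one described can be sketched in scalar code. This is only an illustration of the idea, not simdjson's implementation (which classifies whole chunks at a time with SIMD table lookups); the function name and the 32-position limit are assumptions for the sketch.

```cpp
#include <cassert>
#include <cstdint>

// Scan up to 32 bytes for JSON structural characters ({ } [ ] : ,)
// without modifying the input, recording positions as a bitmask
// (bit i set means in[i] is structural). The input is untouched;
// no sentinel \0 is written anywhere.
uint32_t structural_mask(const char* in, int n) {
    uint32_t mask = 0;
    for (int i = 0; i < n && i < 32; i++) {
        char c = in[i];
        bool structural = c == '{' || c == '}' || c == '[' ||
                          c == ']' || c == ':' || c == ',';
        mask |= uint32_t(structural) << i;
    }
    return mask;
}
```

Because the output is a bitmask of indices rather than a rewritten buffer, later stages can locate values by position - no copy of the input is required for this step.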


Yeah, sorry, I didn't feel like giving someone who hasn't even read the article a full explanation. However, I saw in a YouTube presentation that these characters (e.g. ") are replaced before applying SIMD instructions. That implies copies and allocations. But those are necessary; it would not be possible to achieve this with indices alone.


simdjson's spinup overhead is minimal; it runs at a pretty steady rate no matter the size of the document: https://github.com/simdjson/simdjson/blob/master/doc/growing...

It's also still faster than other parsers in our tests, even on the smallest documents (https://github.com/simdjson/simdjson/issues/312#issuecomment...) - the advantage is just smaller, like 1.1x and 1.2x instead of 2.5x :) It really starts to pull ahead somewhere between 512 bytes and 1K.


Yep, simdjson is a fully compliant, validating JSON parser, up to and including full UTF-8 validation. That's part of what makes its speed so eerie.


Yep. You can see the raw numbers at https://simdjson.org/about/.

simdjson UTF8-validation, exact numbers: 2.5 GB/s

RapidJSON insitu, UTF8-validation: 0.41 GB/s

RapidJSON insitu, UTF8-validation, exact numbers: 0.39 GB/s

RapidJSON UTF8-validation: 0.29 GB/s

RapidJSON UTF8-validation, exact numbers: 0.28 GB/s

