bluuewhale's comments

bluuewhale · 2025-12-22T12:20:22 1766406022

Author here!

I wanted to share a follow-up to this post. https://bluuewhale.github.io/posts/further-optimizing-my-jav...

This time I went back with a profiler and optimized the actual hot path.

A huge chunk of time was going to Objects.equals() because of profile pollution / missed devirtualization.

After fixing that, the next bottleneck was ARM/NEON “movemask” pain (VectorMask.toLong()), so I tried SWAR… and it ended up faster (even on x86, which I did not expect).

chandlerc1024 · 2025-12-23T08:38:02 1766479082

FYI, we ended up implementing a _really_ nice SWAR version in the Carbon derivative of SwissTable that might be worth looking at for inspiration: https://github.com/carbon-language/carbon-lang/blob/trunk/co...

Can see the rest of that file and the adjacent `raw_hashtable.h` for the rest of the SwissTable-like implementation and `hashing.h` for the hash function.

FWIW, it consistently out-performs SwissTable in some respects, but uses a weaker but faster hash function that is good enough for the hash table, but not good for other use cases.

bluuewhale · 2025-12-16T01:54:26 1765850066

Great write-up.

This kind of "debugging journey" post is gold.

bluuewhale · 2025-12-16T01:50:57 1765849857

Ohhh, I went digging after your comment and I think this is exactly what you were referring to: JDK-7192942 ("Inefficient calculation of power of two in HashMap").

I honestly love hearing about these hidden gem micro-optimizations.

Thanks a lot for sharing!

bluuewhale · 2025-12-16T01:41:02 1765849262

This is a really fun project!

In SwissTable, the main difference is that you only get a probabilistic signal of presence (kind of like a Bloom filter), but the core idea feels very similar.

bluuewhale · 2025-12-16T01:36:50 1765849010

Thanks. This is actually one of the topics I really want to tackle next.

If we just wrap every API call with synchronized, I'd expect heavy contention (some adaptive spinning and then OS-level park/unpark), so it'll likely bottleneck pretty quickly.

Doing something closer to ConcurrentHashMap (locking per bin rather than globally) could mitigate that.

For the open-addressing table itself, I'm also considering adding lightweight locking at the group level (e.g., a small spinlock per group) so reads stay cheap and writes only lock a narrow region along the probe path.

poorman · 2025-12-16T02:04:24 1765850664

I think that's a great idea! I just checked one of my larger projects and it 55% ConcurrentHashMap and 45% HashMap so I'd personally benefit from this plan.

bluuewhale · 2025-12-16T01:22:56 1765848176

Thanks for the heads-up!

I'm still figuring out Hugo, so I'm not sure I can fix it right away, but I'll take a look.

inson · 2025-12-16T21:53:00 1765921980

oh nice, hugo is great!

bluuewhale · 2025-12-16T01:20:29 1765848029

Thanks! Appreciate it

bluuewhale · 2025-12-14T12:10:49 1765714249

Thanks for the feedback.

You've nailed the core idea. I'll tweak the structure a bit to make the concepts clearer up front.

bluuewhale · 2025-12-14T12:07:07 1765714027

Author here!

Thanks for the great point. This is actually the main topic I'm working on for the next post.

It's understandable to expect SIMD to win purely because it's wider, but in practice the end-to-end cost matters more than raw VL.

With the Java Vector API, the equality compare can indeed be compiled down to real SIMD instructions, yet the overall path may still lose if turning a VectorMask into a scalar bitmask is expensive. The "best case" is a vector compare followed by a single instruction that packs the result into a bitmask; if the JIT doesn't hit that lowering, it can fall back to extra work such as materializing the mask and repacking it in scalar code. From what I can tell, they have been working on intrinsic for VectorMask.toLong (https://bugs.openjdk.org/browse/JDK-8273949).

Also, SWAR avoids that entire transition by staying in GPR and producing the bitmask directly with a small, predictable sequence of bit operations. For small fixed-size probes, that simplicity often outweighs SIMD's theoretical advantage, and on some CPUs heavier vector usage can even come with frequency effects that further narrow the gap. So, I guess the more likely explanation isn't that the Vector API never uses SIMD.

I'll take a closer look at how it compiles down to machine code and share what I find.

P.S. Benchmark results can vary a lot depending on the environment (OS, CPU and JDK/JIT version and flags), so it’s also possible the benchmark picture changes on a different setup.