charleslmunger · 2026-01-22T00:02:45 1769040165

What hardware are you running on where the cost of a relaxed 64 bit load and a branch is significant compared to a (possibly contended) cas?

You could always use ldset on arm for this.
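The code being replied to isn't shown here, so the following is only a guess at the shape of the trade-off: setting a bit in a 64-bit word either with a relaxed load and branch in front of a CAS, or with a single unconditional read-modify-write. The function names are made up for illustration. On arm64 with LSE atomics (ARMv8.1 and later), compilers typically lower `fetch_or` to an `ldset` variant, but that depends on the target flags.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical helper: relaxed load plus branch first, CAS only if the bit is clear.
// Returns true if this call was the one that set the bit.
inline bool set_bit_checked(std::atomic<std::uint64_t>& word, std::uint64_t mask) {
  std::uint64_t cur = word.load(std::memory_order_relaxed);
  while ((cur & mask) == 0) {
    // On failure, compare_exchange_weak reloads `cur`, and the loop re-checks the bit.
    if (word.compare_exchange_weak(cur, cur | mask,
                                   std::memory_order_acq_rel,
                                   std::memory_order_relaxed)) {
      return true;
    }
  }
  return false;  // Bit was already set: no read-modify-write was issued.
}

// Hypothetical helper: one unconditional atomic OR, no retry loop.
// Returns true if this call was the one that set the bit.
inline bool set_bit_rmw(std::atomic<std::uint64_t>& word, std::uint64_t mask) {
  return (word.fetch_or(mask, std::memory_order_acq_rel) & mask) == 0;
}
```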

charleslmunger · 2026-01-02T05:10:57 1767330657

Yup this works but there's as of yet no UHBR13.5 or better input so you're not getting full HDMI 2.1 equivalent. But if you don't care about 24 bits per pixel DSC then you can have an otherwise flawless 4K 120 Hz experience.

https://trychen.com/feature/video-bandwidth

charleslmunger · 2025-12-10T00:12:40 1765325560

It's so weird to see the leading heroin story phrased like a hypothetical, when:

1. Heroin itself was marketed as a "non-addictive morphine substitute", and sold to the public. It didn't become a controlled substance until 1914 (according to Wikipedia).

2. The opioid crisis was basically started and perpetuated by Purdue Pharma, again marketing Oxycodone with the label “Delayed absorption as provided by OxyContin tablets, is believed to reduce the abuse liability of a drug.” and other more egregious advertising.

3. Britain went to war with China twice to force the Qing dynasty to allow them to sell opium there.

4. President Teddy Roosevelt's grandfather made a ton of money in the opium trade.

It's supposed to be a sort of shocking hypothetical, except that's basically the history of the actual drug.

charleslmunger · 2025-12-08T02:07:34 1765159654

>Critical section under 100ns, low contention (2-4 threads): Spinlock. You’ll waste less time spinning than you would on a context switch.

If your sections are that short then you can use a hybrid mutex and never actually park. Unless you're wrong about how long things take, in which case you'll save yourself.

>alignas(64) in C++

    std::hardware_destructive_interference_size

Exists so you don't have to guess, although in practice it'll basically always be 64.
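A minimal sketch of using the constant instead of a hard-coded 64, assuming a C++17 standard library. Not every library ships the constant, so this checks the feature-test macro and falls back to 64 (an assumed value). `kCacheLine` and `Counters` are illustrative names.

```cpp
#include <atomic>
#include <cstddef>
#include <new>

#ifdef __cpp_lib_hardware_interference_size
constexpr std::size_t kCacheLine = std::hardware_destructive_interference_size;
#else
constexpr std::size_t kCacheLine = 64;  // Assumed fallback when the library lacks the constant.
#endif

// Each counter gets its own cache-line-sized slot, so a writer of one
// does not invalidate the line holding the other (false sharing).
struct Counters {
  alignas(kCacheLine) std::atomic<long> produced{0};
  alignas(kCacheLine) std::atomic<long> consumed{0};
};

static_assert(alignof(Counters) >= kCacheLine, "struct must be line-aligned");
static_assert(sizeof(Counters) >= 2 * kCacheLine, "members must not share a line");
```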

The code samples also don't obey the basic best practices for spinlocks for x86_64 or arm64. Spinlocks should perform a relaxed read in the loop, and only attempt a compare and set with acquire order if the first check shows the lock is unowned. This avoids hammering the CPU with cache coherency traffic.

Similarly the x86 PAUSE instruction isn't mentioned, even though it exists specifically to signal spin sections to the CPU.
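A rough sketch of the loop shape described above (test-and-test-and-set), assuming GCC or Clang on x86 or arm64. `cpu_relax` and `SpinLock` are illustrative names, not from the article; the arm64 `yield` hint is my assumption for the PAUSE analogue.

```cpp
#include <atomic>

// Spin-wait hint: PAUSE on x86, YIELD on arm64, nothing elsewhere.
inline void cpu_relax() {
#if defined(__x86_64__) || defined(__i386__)
  __builtin_ia32_pause();
#elif defined(__aarch64__)
  asm volatile("yield" ::: "memory");
#endif
}

class SpinLock {
 public:
  void lock() {
    for (;;) {
      // Relaxed read first: while the lock is held, waiters only read the
      // cache line and do not fight over exclusive ownership of it.
      if (!locked_.load(std::memory_order_relaxed)) {
        bool expected = false;
        // Attempt the acquire CAS only when the lock looked free.
        if (locked_.compare_exchange_weak(expected, true,
                                          std::memory_order_acquire,
                                          std::memory_order_relaxed)) {
          return;
        }
      }
      cpu_relax();
    }
  }

  bool try_lock() {
    bool expected = false;
    return !locked_.load(std::memory_order_relaxed) &&
           locked_.compare_exchange_strong(expected, true,
                                           std::memory_order_acquire,
                                           std::memory_order_relaxed);
  }

  void unlock() { locked_.store(false, std::memory_order_release); }

 private:
  std::atomic<bool> locked_{false};
};
```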

Spinlocks outside the kernel are a bad idea in almost all cases, except dedicated nonpreemptable cases; use a hybrid mutex. Spinning for consumer threads can be done in specialty exclusive thread per core cases where you want to minimize wakeup costs, but that's not the same as a spinlock which would cause any contending thread to spin.
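For illustration, a very small spin-then-park sketch layered on `std::mutex`. The spin limit of 100 and the names `HybridMutex` and `spin_hint` are arbitrary choices for the example. Note that `try_lock` performs a read-modify-write on every attempt, so a production hybrid mutex would instead spin on a relaxed load of its own state word and park via a futex or similar.

```cpp
#include <mutex>

// Spin-wait hint: PAUSE on x86, YIELD on arm64, nothing elsewhere (GCC/Clang assumed).
inline void spin_hint() {
#if defined(__x86_64__) || defined(__i386__)
  __builtin_ia32_pause();
#elif defined(__aarch64__)
  asm volatile("yield" ::: "memory");
#endif
}

class HybridMutex {
 public:
  void lock() {
    // Bounded spin phase: pays off when the holder releases within a few iterations.
    for (int i = 0; i < kSpinLimit; ++i) {
      if (mu_.try_lock()) return;
      spin_hint();
    }
    // Fallback: block in the underlying mutex, which parks the thread
    // if the critical section turned out to be longer than expected.
    mu_.lock();
  }

  bool try_lock() { return mu_.try_lock(); }
  void unlock() { mu_.unlock(); }

 private:
  static constexpr int kSpinLimit = 100;  // Arbitrary bound for this sketch.
  std::mutex mu_;
};
```

Because it provides `lock`, `try_lock` and `unlock`, the class can be used with `std::lock_guard<HybridMutex>`.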

raggi · 2025-12-08T02:46:00 1765161960

> Spinlocks outside the kernel are a bad idea in almost all cases, except dedicated nonpreemptable cases; use a hybrid mutex. Spinning for consumer threads can be done in specialty exclusive thread per core cases where you want to minimize wakeup costs, but that's not the same as a spinlock which would cause any contending thread to spin.

Very much this. Spins benchmark well but scale poorly.

magicalhippo · 2025-12-08T02:33:51 1765161231

> Spinlocks outside the kernel are a bad idea in almost all cases, except dedicated nonpreemptable cases; use a hybrid mutex

Yeah, pure spinlocks in user-space programs are a big no-no in my book. With a hybrid mutex, if you're on the happy path then it costs you nothing extra in terms of performance, and if you for some reason slide off the happy path you have a sensible fall-back.

charleshn · 2025-12-08T04:04:16 1765166656

> std::hardware_destructive_interference_size Exists so you don't have to guess, although in practice it'll basically always be 64.

Unfortunately it's not quite true, due to e.g. spatial prefetching [0]. See e.g. Folly's definition [1].

[0] https://community.intel.com/t5/Intel-Moderncode-for-Parallel...

[1] https://github.com/facebook/folly/blob/d2e6fe65dfd6b30a9d504...

menaerus · 2025-12-08T11:10:54 1765192254

Some things from the article are debatable for sure, and some are maybe missing, like the PAUSE instruction you mention, which I also wasn't aware of, but generally speaking I thought it was really good content. Lean system engineering skills applied to real world problems. I especially appreciated the examples of large-scale infra codebases doing it in practice.

surajrmal · 2025-12-08T04:28:30 1765168110

Hybrid locks are also bad for overall system performance, because they maximize local application performance at the system's expense. There is a reason default lock implementations from OSes don't spin even a little bit.

menaerus · 2025-12-08T08:20:16 1765182016

> There is a reason default lock implementations from OS don't spin even a little bit.

glibc pthread mutex uses a user-space spinlock to mitigate the syscall cost for uncontended cases.

charleslmunger · 2025-12-08T09:36:59 1765186619

That depends on your workload. If you're making a game that's expected to use near 100% of system resources, or a real time service pinned to specific cores, your local application is the overall system.

surajrmal · 2025-12-10T15:08:21 1765379301

Totally agree. However, it's important to differentiate those workloads from the average workload, which is to participate in a larger system.

imtringued · 2025-12-08T17:58:34 1765216714

This is nonsense. If the lock hasn't been acquired, you don't spin to begin with and if the lock has been acquired and the lock is being released shortly after, the spinning avoids a context switch. If the maximum number of retries has been reached, the thread was going to sleep anyway and starts scheduling the next thread (which was only delayed by the few attempted spins). This means in the worst case the next spin will only happen once all the other queued up threads have had their turn and that's assuming you're immediately running into another acquired lock.

surajrmal · 2025-12-10T15:21:40 1765380100

It makes the worst case sufficiently bad and unfair that it makes things worse overall. If the lock is contended by a thread with higher priority, then that blocking thread will have its priority increased. Now if the thread that ends up getting the lock is one spinning on it rather than the actual high priority one, then this will repeat, leading to large latency in front of the high priority thread and a lot of misaligned CPU utilization by a lower priority thread.

Spinning on a CAS is far more expensive than spinning on most other instructions as well, as it affects all cores that may try to access that cache line, which may include things other than the lock itself.

Also consider how the system acts under high CPU load. You will end up with threads holding locks while not running, so the majority of the time you miss the lock you spin all 100 times. This just exacerbates the CPU load issues even more. Hybrid locks are only helpful under lower CPU load.

nly · 2025-12-08T09:44:49 1765187089

GNU libc posix mutexes do spin...

surajrmal · 2025-12-10T05:46:19 1765345579

And I think it's a poor choice that causes worse system performance. Android's bionic doesn't spin, nor does Windows or Fuchsia. Avoiding the syscall overhead is generally detrimental to overall system performance, especially when the CPU load is high.

saagarjha · 2025-12-08T04:20:30 1765167630

> std::hardware_destructive_interference_size

Of course, this is just the number the compiler thinks is good. It’s not necessarily the number that is actually good for your target machine.

nly · 2025-12-08T09:41:15 1765186875

The PAUSE instruction isn't actually as good as it used to be. In, iirc, Skylake Intel massively increased the latency to improve utilisation under hyperthreading. The latency of this instruction is now really high.

Most people using spinlocks really care about latency, and many will have hyperthreading disabled to reduce jitter.

SkiFire13 · 2025-12-08T10:51:56 1765191116

If the PAUSE instruction is too fast doesn't that kinda defeat its purpose?

menaerus · 2025-12-08T14:31:07 1765204267

Yeah, I think so too now that I read some documentation about it. It appears that the main issue with the spinlock pattern is that it incurs "a severe performance penalty when exiting the [spinlock] loop because it [CPU] detects a possible memory order violation." [0].

~10 years ago, on Haswell, it took ~9 cycles to retire, and from Skylake onward, with some exceptions, it takes an order of magnitude more: ~140 cycles.

These numbers alone suggest that it really messes with the CPU pipeline, perhaps BP (?) or speculative execution (?) or both (?), such that it will basically force the CPU to flush the whole pipeline. This is at least how I read this. I will remember this instruction as the "damage control" instruction from now on.

[0] https://www.felixcloutier.com/x86/pause

nly · 2025-12-10T10:12:08 1765361528

Not sure if you'll see this now, but the actual reason you want to use it is as a speculation barrier and a hint to various predictors.

Lfence is the better choice these days.

charleslmunger · 2025-12-02T04:36:51 1764650211

>The compiled compressed binary for an APK

This doesn't undermine your argument at all, but we should not be compressing native libs in APKs.

https://developer.android.com/guide/topics/manifest/applicat...

charleslmunger · 2025-11-29T18:53:36 1764442416

>Not at all? Most memory-safety issues will never even show up in the radar

Citation needed? There are all sorts of problems that don't "show up" but are bad. Obvious historical examples would be Heartbleed and Cloudbleed, or this ancient GTA bug [1].

1: https://cookieplmonster.github.io/2025/04/23/gta-san-andreas...

charleslmunger · 2025-11-19T01:38:10 1763516290

Unfortunately the standard library mutex is designed in such a way that condition variables can't use requeue, and so require unnecessary wakeups. I believe parking lot doesn't have this problem.

charleslmunger · 2025-11-02T17:53:43 1762106023

You can influence the choice of conditional moves (usually inserting them) with

__builtin_expect_with_probability(..., 0.5)

https://github.com/protocolbuffers/protobuf/commit/9f29f02a3...
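A small sketch of what that looks like in practice, assuming GCC 9+ or a recent Clang. The `UNPREDICTABLE` macro and `clamp_to_limit` are made-up names, and whether the compiler actually emits a conditional move still depends on the target and optimization level.

```cpp
// Use the builtin where the compiler advertises it; otherwise fall back to
// the plain condition (assumed fallback, it carries no hint at all).
#if defined(__has_builtin)
#if __has_builtin(__builtin_expect_with_probability)
#define UNPREDICTABLE(x) __builtin_expect_with_probability(!!(x), 1, 0.5)
#endif
#endif
#ifndef UNPREDICTABLE
#define UNPREDICTABLE(x) (x)
#endif

// A 0.5 probability tells the optimizer the branch is a coin flip, which
// nudges it toward branchless code (e.g. cmov) instead of a jump.
inline int clamp_to_limit(int value, int limit) {
  if (UNPREDICTABLE(value > limit)) {
    value = limit;
  }
  return value;
}
```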

charleslmunger · 2025-10-13T02:00:13 1760320813

JetBrains IDEs let you configure this - my favorite use is to highlight Kotlin extension functions differently than normal functions.

This kind of highlighting as a secondary information channel for compiler feedback is great. Color, weight, italics, underlines - all help increase information density when reading code.

charleslmunger · 2025-09-22T16:48:14 1758559694

If you're working on something where the cost of bugs is high and they're tricky to detect, LLM generated code may not be a winning strategy if you're already a skilled programmer. However, LLMs are great for code review in these circumstances - there is a class of bugs that are hard to spot if you're the author.

As a simple example, accidentally inverting feature flag logic will not cause tests to fail if the new behavior you're guarding does not actually break existing tests. I and very senior developers I know have occasionally made this mistake and the "thinking" models are very good at catching issues like this, especially when prompted with a list of error categories to look for. Writing an LLM prompt for an issue class is much easier than a compiler plugin or static analysis pass, and in many cases works better because it can infer intent from comments and symbol names. False positives on issues can be annoying but aren't risky, and also can be a useful signal that the code is not written in a clear way.
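To make the inverted-flag case concrete, here is a contrived sketch; the function, the flag and the 10% discount are all invented for illustration. The intent is "apply the discount when the flag is on", but the condition is negated.

```cpp
// Intended behavior: when discount_enabled is true, take 10% off.
// Deliberate bug for the example: the flag check is inverted.
inline int checkout_total_cents(int subtotal_cents, bool discount_enabled) {
  if (!discount_enabled) {  // BUG: should be `if (discount_enabled)`
    return subtotal_cents - subtotal_cents / 10;
  }
  return subtotal_cents;
}

// A typical pre-existing sanity check: the total is positive and never
// exceeds the subtotal. It holds for both flag values, so it cannot
// detect that the flag logic is backwards.
inline bool total_is_sane(int subtotal_cents, bool discount_enabled) {
  int total = checkout_total_cents(subtotal_cents, discount_enabled);
  return total > 0 && total <= subtotal_cents;
}
```

Only a test that pins the expected total for a specific flag value would fail here, which is exactly the kind of test that tends to be missing when the flag is first introduced.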