narrowbyte's comments

narrowbyte · on July 13, 2024

the basic argument reads to me as "taxing wealth at 100% would be bad policy (for various reasons), therefore taxing wealth (at all) is fundamentally bad policy".

taxing income at 100% would also be bad policy - that fact alone doesn't mean that taxing income (at all) is fundamentally bad policy.

I don't find that the post really engages with any more realistic scenario.

narrowbyte · on July 10, 2024

"Doesn't even try" is too strong.

"When compiling from the same source on independent infrastructure yields bit-by-bit identical results, this gives confidence that the build infrastructure was not compromised and the artifact really does correspond to the source." - https://reproducible.nixos.org/

hansvm · on July 11, 2024

That's fair. Attempting to rationalize my choice of language:

- Their homepage defines reproducibility to be something other than bitwise identical results.

- Getting actual bitwise reproducible builds is still hard for most large projects, even with the work Nix has done (note that the quote you pulled doesn't actually say that Nix _does_ provide such builds, and the rest of that linked text just tries to highlight some of the tools you have at your disposal to achieve that).

They do "try" insofar as they're aware of the desire for bitwise identical results, provide them in some cases, and provide tools to diagnose problems. They're also 20 years old and more than happy to call the current results reproducible. At the very least, it doesn't look like one of the top properties for the project.

narrowbyte · on June 11, 2024

quite interesting framing. A couple things have changed since 2011

- SIMD (at least intel's AVX512) does have usable gather/scatter, so "Single instruction, multiple addresses" is no longer a flexibility win for SIMT vs SIMD

- likewise for pervasive masking support and "Single instruction, multiple flow paths"

In general, I think of SIMD as more flexible than SIMT, not less, in line with this other post https://news.ycombinator.com/item?id=40625579. SIMT requires staying more towards the "embarrassingly" parallel end of the spectrum, SIMD can be applied in cases where understanding the opportunity for parallelism is very non-trivial.
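
To make the two bullet points above concrete, here is a minimal scalar sketch of what a masked gather computes. It is a model of the semantics only, not the AVX-512 intrinsics themselves; the function name `masked_gather`, the 8-lane width, and the `fallback` parameter are assumptions made for illustration.

```rust
/// Scalar model of a merge-masked gather over 8 lanes (hypothetical helper).
/// Each lane has its own index ("multiple addresses") and its own mask bit
/// ("multiple flow paths"). Lanes whose mask bit is 0 keep their fallback
/// value and never touch `base`.
pub fn masked_gather(
    base: &[i32],
    indices: &[usize; 8],
    mask: u8,
    fallback: [i32; 8],
) -> [i32; 8] {
    let mut out = fallback;
    for lane in 0..8 {
        // Bit `lane` of the mask decides whether this lane is active.
        if mask & (1 << lane) != 0 {
            out[lane] = base[indices[lane]];
        }
    }
    out
}
```

Calling it with mask `0b0000_1011` loads only lanes 0, 1 and 3; the other five lanes pass their fallback values through unchanged.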

raphlinus · on June 11, 2024

One of the other major things that's changed is that Nvidia now has independent thread scheduling (as of Volta, see [1]). That allows things like individual threads to take locks, which is a pretty big leap. Essentially, it allows you to program each individual thread as if it's running a C++ program, but of course you do have to think about the warp and block structure if you want to optimize performance.

I disagree that SIMT is only for embarrassingly parallel problems. Both CUDA and compute shaders are now used for fairly sophisticated data structures (including trees) and algorithms (including sorting).

[1]: https://developer.nvidia.com/blog/inside-volta/#independent_...

yosefk · on June 11, 2024

It's important that GPU threads support locking and control flow divergence and I don't want to minimize that, but threads within a warp diverging still loses throughput badly, so I don't think the situation is fundamentally different in terms of what the machine is good/bad at. We're just closer to the base architecture's local maximum of capabilities, as one would expect for a more mature architecture; various things it could be made to support it now actually supports because there was time to add this support.

narrowbyte · on June 11, 2024

I intentionally said "more towards embarrassingly parallel" rather than "only embarrassingly parallel". I don't think there's a hard cutoff, but there is a qualitative difference. One example that springs to mind is https://github.com/simdjson/simdjson - afaik there's no similarly mature GPU-based JSON parsing.

raphlinus · on June 11, 2024

I'm not aware of any similarly mature GPU-based JSON parser, but I believe such a thing is possible. My stack monoid work [1] contains a bunch of ideas that may be helpful for building one. I've thought about pursuing that, but have kept focus on 2D graphics as it's clearer how that will actually be useful.

[1]: https://arxiv.org/abs/2205.11659

xoranth · on June 11, 2024

> That allows things like individual threads to take locks, which is a pretty big leap.

Does anyone know how those get translated into SIMD instructions? Like, how do you do a CAS loop for each lane where each lane can individually succeed or fail? What happens if the lanes point to the same location?

raphlinus · on June 11, 2024

There's a bit more information at [1], but I think the details are not public. The hardware is tracking a separate program counter (and call stack) for each thread. So in the CAS example, one thread wins and continues making progress, while the other threads loop.

There seems to be some more detail in a Bachelor's thesis by Phillip Grote [2], with lots of measurements of different synchronization primitives, but it doesn't go too deep into the hardware.

[1]: https://arxiv.org/abs/2205.11659

[2]: https://www.clemenslutz.com/pdfs/bsc_thesis_phillip_grote.pd...
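
The "one thread wins, the others loop" behaviour can be sketched with ordinary CPU threads. This is only an analogy under stated assumptions: it uses Rust's standard atomics rather than GPU code, it says nothing about how the hardware schedules a warp, and the helper name `contended_increments` is made up for illustration.

```rust
use std::sync::atomic::{AtomicBool, AtomicU32, Ordering};
use std::thread;

/// Hypothetical helper: `lanes` threads each take the same CAS spin lock
/// `iters` times and bump a shared counter inside the critical section.
pub fn contended_increments(lanes: u32, iters: u32) -> u32 {
    let lock = AtomicBool::new(false);
    let counter = AtomicU32::new(0);
    thread::scope(|s| {
        for _ in 0..lanes {
            s.spawn(|| {
                for _ in 0..iters {
                    // CAS loop: all contenders target the same location.
                    // Exactly one flips false -> true; the rest fail and retry.
                    while lock
                        .compare_exchange_weak(false, true, Ordering::Acquire, Ordering::Relaxed)
                        .is_err()
                    {
                        std::hint::spin_loop();
                    }
                    // Critical section: a deliberately non-atomic
                    // read-modify-write, safe only because the lock is held.
                    let v = counter.load(Ordering::Relaxed);
                    counter.store(v + 1, Ordering::Relaxed);
                    // Release the lock so one of the looping threads can win.
                    lock.store(false, Ordering::Release);
                }
            });
        }
    });
    counter.load(Ordering::Relaxed)
}
```

If the lock provides mutual exclusion, `contended_increments(4, 1000)` returns 4000, since no increment is lost even though every thread hits the same address.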

xoranth · on June 11, 2024

Thanks!

majke · on June 11, 2024

Last time I looked at Intel scatter/gather I got the impression it only works for a very narrow use case, and getting it to perform wasn’t easy. Did I miss something?

narrowbyte · on June 11, 2024

The post says, about SIMT / GPU programming, "This loss results from the DRAM architecture quite directly, the GPU being unable to do much about it – similarly to any other processor."

I would say that for SIMD the situation is basically the same. gather/scatter don't magically make the memory hierarchy a non-issue, but they're no longer adding any unnecessary pain on top.

yosefk · on June 11, 2024

Barrel threaded machines like GPUs have an easier time hiding the latency of bank conflict resolution when gathering/scattering against local memory/cache than a machine running a single instruction thread. So I'm pretty sure they have a fundamental advantage when it comes to the throughput of scatter/gather operations, one that gets bigger with a larger number of vector lanes.

majke · on June 12, 2024

vpgatherdd - I think that for newer CPUs it is faster than many loads + inserts, but if you are going to fault a lot, then it becomes slow.

> The VGATHER instructions are implemented as micro-coded flow. Latency is ~50 cycles.

https://www.intel.com/content/www/us/en/content-details/8141...

ribit · on June 12, 2024

Modern GPUs are exposing the SIMD behind the SIMT model and heavily investing in SIMD features such as shuffles, votes, and reduces. This leads to an interesting programming model. One interesting challenge is that flow control is done very differently on different hardware. AMD has a separate scalar instruction pipeline which can set the SIMD mask. Apple uses an interesting per-lane stack counter approach where a value of zero means that the lane is active and a non-zero value indicates how many blocks need to be exited for the thread to become active again. Not really sure how Nvidia does it.
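
The per-lane counter idea described above can be modelled as a toy. This is a sketch built only from that one-sentence description, not from any documentation of Apple's hardware; the type `LaneCounters` and its methods are hypothetical names.

```rust
/// Toy model: one counter per lane. Zero means the lane is active; a
/// non-zero value is the number of blocks the lane must exit before it
/// becomes active again.
pub struct LaneCounters {
    counters: Vec<u32>,
}

impl LaneCounters {
    pub fn new(lanes: usize) -> Self {
        Self { counters: vec![0; lanes] }
    }

    /// Which lanes would execute the next instruction.
    pub fn active(&self) -> Vec<bool> {
        self.counters.iter().map(|&c| c == 0).collect()
    }

    /// Enter a conditional block with one condition bit per lane.
    pub fn enter_if(&mut self, cond: &[bool]) {
        for (c, &take) in self.counters.iter_mut().zip(cond) {
            if *c > 0 {
                *c += 1; // already inactive: one more block to exit
            } else if !take {
                *c = 1; // active lane that skips this block
            }
        }
    }

    /// Leave the innermost block; inactive lanes move one step closer to active.
    pub fn exit_block(&mut self) {
        for c in self.counters.iter_mut() {
            if *c > 0 {
                *c -= 1;
            }
        }
    }
}
```

With nested `enter_if` calls, a lane masked off by the outer condition ends up with a counter of 2 and only reactivates after two `exit_block` calls, which is the nesting behaviour the description implies.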

narrowbyte · on Aug 2, 2023

what does this mean? "In LLVM IR, much like in Rust but unlike in C/C++, individual loads and stores are volatile (i.e., have compiler-invisible side-effects)."

steveklabnik · on Aug 2, 2023

In C and C++, volatile is a qualifier for a type. You then use that type like any other.

In Rust, there are no volatile types. There are two functions, read_volatile and write_volatile, on pointers.

Rust’s API is basically identical to the intrinsics, whereas C and C++’s are not. This plays out with stuff like the drama around volatile compound operators being deprecated.
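
A minimal sketch of the per-access style, using the standard `read_volatile` and `write_volatile` pointer methods. The wrapper name `volatile_increment` is hypothetical; real uses typically target memory-mapped I/O rather than a local variable.

```rust
/// Increment a value with explicitly volatile accesses (hypothetical helper).
/// Volatility is a property of each load and store, not of the type `u32`,
/// so a "compound" update is visibly two separate accesses.
pub fn volatile_increment(slot: &mut u32) -> u32 {
    let p: *mut u32 = slot;
    // SAFETY: `p` comes from a live, exclusive reference, so it is valid,
    // aligned, and not aliased for the duration of these accesses.
    unsafe {
        let old = p.read_volatile(); // one volatile load
        p.write_volatile(old.wrapping_add(1)); // one volatile store
        p.read_volatile()
    }
}
```

There is no volatile `+=` to argue about here: the read and the write are spelled out, which is the contrast with a volatile-qualified type in C and C++.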