conorbergin's comments

LLMs are deterministic: the same model under the same conditions will produce the same output, unless some randomness is purposefully injected. Neural networks in general can be thought of as universal function approximators.

Whenever somebody calls LLMs "non-deterministic", assume they meant "chaotic", in the informal sense of being a system where small changes of input can cause large changes to output, and the only way to find out if it will happen is by running the full calculation.

For many applications, this is just as troublesome as true non-determinism.


I don't think LLMs are that chaotic, you can replace words in an input and get a similar answer, and they are very good at dealing with typos.

They are definitely not interpretable, though; I was reading some stuff from mechanistic interpretability researchers saying they've given up trying to build a bottom-up model of how they work.


> I don't think LLMs are that chaotic, you can replace words in an input and get a similar answer, and they are very good at dealing with typos.

Compare "You are a helpful assistant. Your task is to <100 lines of task description> <example problem>"

with

"you are a helpless assistant. Your task is to <100 lines of task description> <example problem>"

I've changed 3 or 4 CHARACTERS ("ful" to "less") out of a (by construction) 1000+ character prompt.

and the outputs are not at all similar.

Just realized I've never tried the "you are a helpless ass" prompt. Again a very minor change in wording, just dropping a few letters. The helpless assistant at least output text apologizing for being so bad at the task.


Sure. What did you expect? You changed the semantics of your prompt to the complete opposite. Of course it will attempt to make sense of it to the best of its ability and deliver what you requested. The input isn't formally specified; that's inherent to the domain, not the model or a human. GP, on the other hand, is talking about semantically negligible differences like typos.

That's not really true. If you turn a few knobs you can make them deterministic: namely, setting temperature to zero and turning off all history. But none of the cloud providers do this, because it's not a product as far as they are concerned. So in practice, not so much.

Can someone explain why this is? Do LLMs somehow contain a true random number generator? Why wouldn't they produce the same outputs given the same inputs (even temperature)?

edit: I'm not talking about an LLM as accessed through a provider. I'm just talking about using a model directly. Why wouldn't that be deterministic?


The model outputs a probability distribution for the next token, given the sequence of all previous tokens in the context window. It’s just a list of floats in the same order as the list of tokens that the tokenizer uses.

After that, a piece of software that is NOT the LLM chooses the next token. This is called the sampler. There are different sampling parameters and strategies available, but if you want repeatable* outputs, just take the token with the highest probability number.

* Perfect determinism in this sense is difficult to achieve because GPU calculations naturally have a minor bit of nondeterminism. But you can get very close.
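
To make that concrete, here's a minimal greedy-decoding sketch (assuming the Hugging Face transformers library, with GPT-2 purely as an illustrative stand-in; the argmax line is the "sampler", separate from the model's forward pass):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    ids = tok("The capital of France is", return_tensors="pt").input_ids
    for _ in range(10):
        logits = model(ids).logits[0, -1]     # scores over the whole vocabulary
        next_id = torch.argmax(logits)        # greedy sampler: take the top token
        ids = torch.cat([ids, next_id.view(1, 1)], dim=1)
    print(tok.decode(ids[0]))                 # repeatable, modulo the GPU caveat above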


I'm not sold that an LLM is an LLM without a sampler, but it's not worth quibbling over. It's part of the statistical model anyway.

The LLM is the trained part; the rest is the handwritten part. The sampler is handwritten, not learned.

Believe it or not, in statistics and machine learning the hard-coded parts of a model that impact the results are considered part of the model. But I understand that nowadays we don't care about these things because AI goes brrr.

There are A LOT of misconceptions about LLMs, and the biggest one is that they are not deterministic. They are 100% deterministic, and temperature has nothing to do with it. You WILL get exactly the same result every single time (at ANY temperature) as long as you use the same sampling parameters and server config parameters.

What causes variance in LLMs is server parameters like batch processing and caching, among a few other things, with batching responsible for most of the issues. Batching is used because large providers serve multiple customers per GPU, and breaking up the VRAM is tricky and causes drift. If you start llama.cpp, for example, with only one person per slot and batching off, you will always get the same results every time, even at temperature 1.2 or whatever other parameters, because you are using one GPU per inference call, so no funny business there.

The reason most people are unaware of this is that most people have experience only with APIs instead of working with the actual inference engine itself, so this goddamned myth keeps spreading. My video for reference, where you can download and try it for yourself: https://www.youtube.com/watch?v=EyE5BrUut2o
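
To illustrate the point that temperature by itself doesn't cause variance: sampling is an ordinary function of the logits, the temperature, and the RNG state, so with a fixed seed it's fully repeatable. A toy sketch in plain PyTorch (not llama.cpp; purely illustrative):

    import torch

    def sample_next(logits, temperature, seed):
        gen = torch.Generator().manual_seed(seed)         # fixed seed => fixed draw
        probs = torch.softmax(logits / temperature, dim=-1)
        return torch.multinomial(probs, 1, generator=gen).item()

    logits = torch.tensor([2.0, 1.0, 0.5, 0.1])
    print(sample_next(logits, temperature=1.2, seed=42))  # same token every run

Any run-to-run drift in a real server therefore has to enter upstream of this step, e.g. batch-dependent kernel scheduling perturbing the floating-point logits.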

Thanks so much for this! I still haven't got around to building my own language model yet, so I'm a bit fuzzy on the details, but if I imagined a thought experiment where I did all the math by hand on paper, I just couldn't see how I would end up with a different output each time given the same inputs. Finding out that the variance other people are seeing comes from the server/hardware stuff clears that up.

This is a surprisingly annoying question to Google. A lot of articles give the reason that softmax returns a probability distribution, as if the presence of the word "probability" means the tokens will be different every time.


An LLM itself -- that is, the weights and the mathematical functions linking them -- does not tell you exactly how to train it from data, nor how to generate an output. Instead, it describes a function providing relative likelihood(output | input).

Deciding how to pick a particular output given that likelihood function is left as an exercise for the user; that process is part of what we call inference.

One obvious choice is to keep picking the highest-likelihood token, feed it into the model, and get another -- on repeat. This is what most algorithms call "temperature=0". But doing this token after token can lead to boring output, or steer you into pathological low-probability sequences like endless repeats.

So, the current SOTA is to intentionally introduce a random factor (temperature>0) to the sampling process -- along with other hacks, like explicit suppression of repeats.
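
For intuition, here's what the temperature knob and one common repeat-suppression heuristic look like in isolation (illustrative PyTorch, simplified):

    import torch

    def apply_temperature(logits, t):
        # t -> 0 approaches greedy/argmax; t > 1 flattens the distribution
        return torch.softmax(logits / t, dim=-1)

    logits = torch.tensor([3.0, 1.0, 0.2])
    print(apply_temperature(logits, 0.1))   # ~[1, 0, 0]: near-deterministic
    print(apply_temperature(logits, 1.0))   # the model's raw distribution
    print(apply_temperature(logits, 2.0))   # flatter: more randomness

    def penalize_repeats(logits, seen_ids, penalty=1.2):
        # crude repetition penalty: make already-generated tokens less likely
        # (one common formulation divides positive logits, multiplies negative ones)
        out = logits.clone()
        out[seen_ids] = torch.where(out[seen_ids] > 0,
                                    out[seen_ids] / penalty,
                                    out[seen_ids] * penalty)
        return out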


Yeah, sure. So temperature is baked into these LLM systems, and when it isn't zero it increases the probability of taking a different path when decoding the tokens, whether it's at a provider or downloaded on your own machine.

Technically, even when the temperature is 0 it's not deterministic, but it's more likely to be: you can have ties in the probabilities for the next token, and floating-point noise is real.

All these models are doing is guesstimating the next token to say.


> Namely setting temperature to zero, and turning off all history

That's not nearly enough, though. Multi-node/GPU inference, and specifically batching (and the ordering within batches), has non-deterministic consequences for current LLM services.


True but for small models it's pretty close. See my comment below about other cases leading to nondeterminism.

Eh, conceptually true, but in practice, it is rather hard to get any decent performance out of a GPU and still produce a deterministic answer.

And in any case, setting the temperature to zero will not produce a useful result, unless you don't mind your LLM constantly running into infinite loops.


Yes, there's a good Thinking Machines Lab blog post about this.

You're being downvoted, but you're right. Determinism is a different concept and doesn't characterise LLMs well. You can have deterministic random number generators, for example.

I doubt a fork would ever happen; Blender, being computer graphics software, has a huge knowledge gap between its developers and its users.

That "syntactic sugar" encompasses the entire value proposition of markdown, there's nothing stopping you using Typst to author blog posts or take notes, they even have HTML export.

I wonder if well-designed "mutable" operating systems like Arch and Alpine are going to beat NixOS etc. in the long run. An install script is strictly more powerful than a declarative config language, and typically less verbose.


Might as well use Guix then. You still have the declarative config language, but also a Turing-complete (and convenient) programming language.

What do you mean by strictly more powerful?

Scripts are typically Turing-complete; config files are typically not.
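
A contrived illustration of the difference, with Python standing in for both sides (package and command names are made up for the example): the config is inert data, while the script can probe the machine and branch:

    import shutil
    import subprocess

    # declarative config: pure data, no control flow
    config = {"packages": ["git", "vim", "firefox"]}

    # install script: Turing-complete, can inspect the system and decide
    packages = list(config["packages"])
    if shutil.which("nvidia-smi"):              # only on boxes with an NVIDIA GPU
        packages.append("nvidia-driver")
    subprocess.run(["apt-get", "install", "-y", *packages], check=True)

Of course, that extra power is exactly what declarative systems like NixOS give up on purpose, in exchange for reproducibility.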

This is a much more promising technique from Applied Science: https://www.youtube.com/watch?v=UIqhpxul_og


I remember that video. Hey, I wonder what happened to that $3000 Micronics SLS printer? Wasn't it a kickstarter? I remember that being a big deal at the time, and I guess it suddenly disappeared?

> We got bought by Formlabs in 2024

> Formlabs sells their own SLS printer for $25000

> Formlabs charges a license fee to be able to print with custom materials like Applied Science did [0]

ah, well that explains it.

0 - https://support.formlabs.com/s/article/Setting-up-Open-Mater...


Not a fan of "texture healing"; it's a very convoluted and unsatisfying way of fixing a minor problem with monospace fonts. I'd be more interested in seeing letterforms redesigned to be more optically balanced within the grid. Another commenter points out that Ubuntu Mono does this somewhat, but I imagine you could make some fairly radical alterations to certain letters and still remain legible.


I fell in love with Intel One Mono for this reason


The Osprey's accident rate is not that bad, and the US Army has ordered a new, smaller tiltrotor, the V-280.


It was recently given the official designation 'MV-75'.


The Wikipedia page says this will replace UH-60s, but I just do not see how that airframe is directly comparable to what's been a workhorse for decades. Maybe it means only in a long-range reconnaissance role? But even then, that mission is primarily owned by UAS platforms now. Confusing.


I imagine the UH-60 and its variants will continue to serve (who knows, maybe with new airframes) alongside the MV-75 for quite a while, in a similar way to how UH-1s continued in use well after UH-60s were deployed in large numbers. This Congressional Research Service summary of the FLRAA/MV-75 program states that the Army plans to continue ordering UH-60s (on the order of 255 between 2027 and 2031) - https://www.congress.gov/crs-product/IF12771

The key requirements that drive the MV-75's downsides (size, complexity, cost) come from the Army wanting to play in the Pacific. The UH-60 is deeply limited there.

For example, the MV-75's range should let it go (one-way) from Guam to the Philippines, or straight from Okinawa to Taiwan (no need to island-hop) - potentially as a two-way mission. Same for the Philippines to Taiwan.

The "comparability" is that the MV-75 and UH-60 can be delivery ~14 troops into an order magnitude similar size clearing.


Thank you! This context really clarifies what the use case is for this. The range difference matters.


What is so unbelievable about that?

Sure, it's going to take decades to actually make the transition, and the UH-60 will remain in service for decades more after that in less demanding roles. I expect that by the time this finishes, the MV-75 will be considered another workhorse, if a slightly fuzzier one, and the UH-60 will be an antiquated platform.

But ultimately they both solve the same problem: moving stuff from A to B in rough terrain, fast. And with the ever-increasing number of reconnaissance assets, A needs to be further behind the frontline, so range and speed need to increase beyond what you can manage with a pure helicopter.


Thank God for Zig


For bringing us back to Modula-2?


Elaborate.


I don’t think PL theory driven design produces good systems languages.


Rust as it exists today is very much "PL theory" driven. It's not necessarily a good language, but it has been consistently ranked the #1 "most loved" language in Stack Overflow's survey for the past few years.


Is WebGPU a good standard at this point? I am learning Vulkan at the moment, and 1.3 is significantly different from the previous versions; apparently WebGPU is closer in behavior to 1.0. I am by no means an authority on the topic, I just see a lack of interest in targeting WebGPU from people in game engines and scientific computing.


For a text editor it's definitely good enough if not extreme overkill.

Other than that, the one big downside of WebGPU is its rigid binding model via baked BindGroup objects. This is both inflexible and slow when any sort of 'dynamism' is needed, because you end up creating and destroying BindGroup objects in the hot path.

Vulkan's binding model will really only be fixed properly with the very new VK_EXT_descriptor_heap extension (https://docs.vulkan.org/features/latest/features/proposals/V...).


The modern Vulkan binding model is relatively fine. Your entire program has a single descriptor set containing an array of images that you reference by index. Buffers are never bound and instead referenced by device address.


Do you think Vulkan will become "nice" to use, could it ever be as ergonomic as Metal is supposed to be?


Apparently "joy to use" is one of the new core goals of Khronos for Vulkan. Whether they succeed remains to be seen, but at least they acknowledge now that a developer hostile API is a serious problem for adoption.

The big advantage of Metal is that you can pick your abstraction level. At the highest level it's convenient like D3D11, at the lowest level it's explicit like D3D12 or Vulkan.


Bevy engine uses wgpu and supports both native and WebGPU browser targets through it.

The WebGPU API gets you to rendering your first triangle quicker and without thinking about vendor-specific APIs and histories of their extensions. It's designed to be fully checkable in browsers, so if you mess up you generally get errors caught before they crash your GPU drivers :)

The downside is that it's the lowest common denominator, so it always lags behind what you can do directly in DX or VK. It was late to get subgroups, and now it's late to get bindless resources. When you target desktops, wgpu can cheat and expose more features that haven't landed in browsers yet, but of course that takes you back to the vendor API fragmentation.


It's a good standard if you want a sort of lowest-common-denominator that is still about a decade newer than GLES 3 / WebGL 2.

The scientific folks don't have all that much reason to upgrade from OpenGL (it still works, after all), and the games folks are often targeting even newer DX/Vulkan/Metal features that aren't supported by WebGPU yet (for example, hardware-accelerated raytracing).


Khronos is trying to entice the scientific folks with ANARI, because there was zero interest in moving away from OpenGL, as you mention.

https://www.khronos.org/anari/


Having no CSD at all is unacceptable on small screens, IMHO; far too much real estate is taken up by a title bar. You can be competitive with SSD by making the bars really thin, but then they are harder to click on and impossible with touch input. At the moment I have Firefox set up with CSD and vertical tabs; only 7% of my vertical real estate is taken up by bars (including GNOME's), which is pretty good for something that supports this many niceties.

