Dear future architect of some neat new technology to be someday posted on HN, please do the following:
If you make anything with a UI, even a GUI stack, always include screenshots. If you make a programming language or programming framework / library, always include code samples!
Really show us nerds the bits we want to see right away: screenshots, code, or even a video (fully optional, unless it's some type of terminal shell or something where a video would illuminate things!) would sell it better. Many of us are working and don't have time to pull everything down to run it locally.
Honestly, even if it's not open source, if you're selling a product, SHOW THE PRODUCT, not just descriptions of the product.
So, given that this is brought up almost every single time a GUI framework/library is posted on HN, and has been for decades at this point, and given that it apparently doesn't make any difference, what's a better approach to educating the ecosystem at large about this problem? Do we need a website? A national outreach program? Do we need an arewescreenshottingyet.com landing page with stats to keep track? How can we finally fix this persistent problem of people not adding screenshots for visual things they've built?!
Two options: continue to spend the effort to teach new cohorts community expectations; or eliminate the production of new cohorts so the existing lessons eventually saturate the population.
There are architectural changes (such as reasoning or mixture of experts) that measurably improve how well models perform. So the improvements are definitely not just from data.
I can speak for my area of expertise: multilingual capabilities. Some SOTA models are making huge strides in their support of various languages, and increasingly they understand and can produce text in languages where GPT-4 era models were absolutely lost. These gains probably come from a combination of a richer training dataset and architectural improvements (more parameters?).
Now that doesn't necessarily mean that models are also getting substantially better at English or other major languages. They likely are to some degree, but we've reached a point with major languages where core linguistic proficiencies are covered, and what's left is the more squishy part: style, tone of voice, ability to use different registers naturally, or what some people would call linguistic taste. But that's much harder to measure and therefore trickier to provide evidence for.
Thanks for your perspective as somebody working in the field. I still wonder, though: what would the results be if we just used a richer dataset + more parameters? Would the results really be that different? (except costs, as MoE def helps with that)
MoE: I assume some people just specialize in working on routing, since by reducing the number of active params and using just a subset, you end up making it less costly. So, are AI researchers only working on optimizations to make this better?
Same question on reasoning: are AI researchers working mostly on optimizations on top of it, like CoT and so on, like mini-optimizations?
So basically, they work on those micro-optimizations, put them together and see a % improvement in a benchmark?
I'm sure this is probably awesome for languages, which, if I'm not mistaken, was the initial use case in "Attention Is All You Need" and the entire LLM revolution.
But this seems to be a very clear path to be "taking the car to the carwash by foot" for a long time, isn't it?
It feels like we'll keep "taking the car to the carwash by foot" until somebody optimizes for that prompt, or some pre-training is done, and then there'll be another prompt showing that the AI has real trouble with very basic real-world reasoning and imagination.
Is that the case, or do you see any kind of research that could take us off that plateau full of micro-optimizations that get us a few cm closer to the peak?
MoE is mostly an optimization of the active parameter count, and therefore lowers compute requirements, but it can provide some performance improvements over dense models in some cases.
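To make the "active parameters" point concrete, here's a minimal sketch of top-k expert routing (purely illustrative PyTorch, not the architecture of any specific model): every token is scored against all experts but only runs through k of them, so per-token compute scales with the active subset, not the full parameter count.

    # Minimal top-k MoE routing sketch (illustrative, not any specific model).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TinyMoE(nn.Module):
        def __init__(self, dim=64, n_experts=8, k=2):
            super().__init__()
            self.k = k
            self.router = nn.Linear(dim, n_experts)  # scores every expert per token
            self.experts = nn.ModuleList(
                nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
                for _ in range(n_experts)
            )

        def forward(self, x):  # x: (n_tokens, dim)
            weights, idx = self.router(x).topk(self.k, dim=-1)
            weights = F.softmax(weights, dim=-1)
            out = torch.zeros_like(x)
            # Each token only runs through k of n_experts, so the active
            # parameters (and FLOPs) are a fraction of the total.
            for slot in range(self.k):
                for e in idx[:, slot].unique().tolist():
                    mask = idx[:, slot] == e
                    out[mask] += weights[mask, slot].unsqueeze(-1) * self.experts[e](x[mask])
            return out

    moe = TinyMoE()
    print(moe(torch.randn(16, 64)).shape)  # torch.Size([16, 64])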
I would not describe reasoning as an optimization: in fact, it's typically the opposite, as models spend way more tokens (and therefore compute) on responding to the same prompt. Some of the smartest models these days use ridiculous amounts of reasoning before they ever respond. Try Deep Research in Gemini or Claude and you'll see what I'm talking about.
>> But this seems to be a very clear path to be "taking the car to the carwash by foot" for a long time, isn't it?
I thought progress was plateauing sometime last year too, but then some new models got released and we saw that the multilingual capability improvements are real. And if you want something more tangible and reported on, consider the Opus 4.5/4.6 coding revolution (the Claude Code explosion) a few months back.
LLMs being stochastic and statistical machines, there will always be funny things people come up with that trick them, be it R's in strawberry or the carwash by foot. At the same time, I can tell you from my experience that a lot of the Misguided Attention ( https://github.com/cpldcpu/MisguidedAttention ) type of stump questions work at a much lower rate with newer models. Progress is being made, it's just not in visible areas.
BTW, you can come up with many trick questions that will stump even humans with PhDs. They will be of a different kind than the ones for LLMs, but this is not a flaw unique to LLMs.
If you're asking whether the progress to AGI isn't taking too long, then I personally think LLMs, at least with their current architecture, are not the foundation of AGI, and will always have inherent limitations. But we're fully in the "that's just like, your opinion, man" territory now :)
LLMs for language feel like definitely the way to go. I feel like just improving them further can definitely reach perfection, or at least get very close.
My concern is mostly all the adjacent fields, like systems thinking, spatial reasoning, "real" human-like reasoning, etc., or as you put it, "AGI".
It doesn't seem this will take us there at all. I don't feel like we're any closer to AGI than we were with the earliest versions of ChatGPT.
Just saw your thinking edit! That's a great question and one I wanted to study in depth, but these days you don't really get access to the raw thinking data. It's usually summarized and you can't even be sure what language the model thought in unless you have access to the logits (so only viable for open-weights models).
I am fairly convinced that there's a certain polyglot snowball effect: once the LLM is fluent in 20 languages, it can pick up on similarities in vocabulary, syntax etc. and learn the 21st language with much less effort (and training data). This might be difficult to actually study in an isolated way, but it's a real effect for humans, and it makes sense that the pattern matchers that LLMs are would find these shortcuts.
Using similar words should land you in similar places in the latent space, even if the actual words or their order are slightly different. Where it gets interesting is how well English words map to their counterparts in other languages, and what practical differences that makes. From various studies, it seems that the gravitational pull of English language/culture training data is substantial, but an LLM can switch cultures and values when prompted in different languages.
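As a toy illustration of that intuition (my own sketch, not from any study; assumes the sentence-transformers package and its public multilingual model), translations of the same sentence should embed close together while an unrelated sentence should not:

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

    sentences = [
        "The cat sits on the mat.",         # English
        "Die Katze sitzt auf der Matte.",   # German translation
        "Le chat est assis sur le tapis.",  # French translation
        "Stocks fell sharply on Monday.",   # unrelated English sentence
    ]
    emb = model.encode(sentences, normalize_embeddings=True)

    # Cosine similarities: expect high scores among the first three
    # sentences and low scores against the last one.
    print(util.cos_sim(emb, emb))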
Disclosure: I work at RWS/TrainAI, we did this study. Recently I alluded to it in a comment and was encouraged to share it, so here it is!
We focus on multilingual proficiency, which tends to be understudied: most benchmarks are English-heavy or even English-only and don't tell you much about how models actually perform across languages.
This is our second iteration of the study. 120 linguists, 8 models, 8 languages, 4 tasks, every output blind-reviewed by 3 native speakers.
Some notable insights:
- GPT-5 is strong at text normalization and translation but regressed on content generation vs GPT-4o. Chinese outputs had spacing/punctuation issues, Polish read like "translationese" even with no source text.
- Gemini 2.5 Pro scored 4.56/5 on Kinyarwanda. In our first study (late 2024), no model could produce coherent text in that language.
- Top LLMs outscored humans working under realistic constraints (time-limited, single pass, no QA). Humans didn't rank 1st in any language. (We're now planning a follow-up to zoom in on that.)
- Tokenizer efficiency matters again: reasoning models burn 5-10x more tokens thinking. Claude Sonnet 4.5 encodes Tamil at 1.19 chars/token vs Gemini's 4.24, a ~3.5x cost difference for the same output (see the sketch below). There has been a lot of talk about the Opus 4.7 tokenizer; this is the same issue, just in a multilingual setting.
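To spell out the arithmetic behind that ~3.5x figure (a back-of-envelope sketch; the passage length is made up, the chars/token numbers are from the list above):

    # Same hypothetical Tamil passage, two tokenizers.
    chars = 10_000                # assumed passage length in characters
    claude_cpt = 1.19             # chars per token, Claude Sonnet 4.5
    gemini_cpt = 4.24             # chars per token, Gemini

    claude_tokens = chars / claude_cpt    # ~8,400 tokens
    gemini_tokens = chars / gemini_cpt    # ~2,360 tokens
    print(claude_tokens / gemini_tokens)  # ~3.56x the tokens for the same text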
One more thing: we're working on a multilingual benchmark that will evaluate core linguistic proficiency in 30 languages. We already have a lot of data internally and I can tell you that:
- Gemini 3 Pro is a multilingual monster.
- GPT-5.4 is a really good translation model, big improvements over previous subversions in the 5 family.
- Opus 4.6 is good but usually third place.
- Somehow, Grok 4.20 is surprisingly good at some long-tail languages? Its performance profile is really odd, unlike all the other models.
Yes, but post-training cannot possibly account for all possible use cases. Sane defaults are fine, and you can't really do much about sampling parameters in chatbots and coding harnesses anyway. And when making an API call, you have to actively change the parameter in your payload. I don't believe there's any real risk.
The risk is that people tweak it, potentially by accident, and then think the model is bad instead of understanding that they are using it wrong. Exposing the control creates potential reputational damage.
There have been quite a few threads about Opus 4.7, but none of them seems to have discussed some breaking changes on the API side, particularly the removal of sampling parameters.
From the migration guide:
>> Sampling parameters removed: Setting temperature, top_p, or top_k to any non-default value on Claude Opus 4.7 returns a 400 error.
Let's set aside that this should probably be a deprecation warning and not a 400. Not having these dials limits utility for cases like synthetic data generation, natural language QA, and many more. Even though temp=0 does not guarantee determinism, getting 99 identical responses out of 100 is reasonably close to determinism for most practical use cases. The default temp gives you wild swings in performance, which temp=0 almost perfectly eliminates. And there are valid use cases for using temp=0 or experimenting with different values.
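For anyone who hasn't hit it yet, here's roughly what the change looks like from the Python SDK (a sketch; the exact model ID is my assumption, the point is that any non-default sampling value is now rejected):

    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    try:
        message = client.messages.create(
            model="claude-opus-4-7",  # hypothetical model ID, for illustration
            max_tokens=1024,
            temperature=0,            # any non-default sampling value now errors
            messages=[{"role": "user", "content": "Classify this ticket: ..."}],
        )
    except anthropic.BadRequestError as err:
        print(err)  # 400 invalid_request_error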
The writing was on the wall, since even earlier Opus versions would override the temperature setting and reset it to the default when thinking was enabled. Now there is no way to control it at all. It is a bit disappointing.
I understand most people around here will be using Opus in Claude Code or in another harness for coding, and in that case you are not really affected. But for those of you building products and using the API in different ways, how are you dealing with this change?
If anyone from Anthropic is reading this, any insights into why it was removed would be great. I am struggling to believe this is because of distillation concerns. Thanks!
Claude's tokenizers have actually been getting less efficient over the years (I think we're at the third iteration at least since Sonnet 3.5). And if you prompt the LLM in a language other than English, or if your users prompt it or generate content in other languages, the costs go up even more. And I mean hundreds of percent more for languages with complex scripts like Tamil or Japanese. If you're interested in the research we did comparing the tokenizers of several SOTA models in multiple languages, just hit me up.
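If you want a quick feel for this yourself, the token counting endpoint makes the comparison easy (a sketch; substitute whichever model ID you're actually on):

    import anthropic

    client = anthropic.Anthropic()

    samples = {
        "English": "The quick brown fox jumps over the lazy dog.",
        "Japanese": "素早い茶色の狐がのろまな犬を飛び越える。",
        "Tamil": "விரைவான நரி சோம்பேறி நாயின் மேல் குதிக்கிறது.",
    }

    for lang, text in samples.items():
        count = client.messages.count_tokens(
            model="claude-sonnet-4-5",  # substitute the model you use
            messages=[{"role": "user", "content": text}],
        )
        print(f"{lang}: {len(text) / count.input_tokens:.2f} chars/token")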
Thanks for sharing! I have been begrudgingly using Darktable since that seems to be the best option on Linux, but the UI/UX never really clicked with me. I wish this were open source, but I will give it a shot (pun intended) for sure.