
Benchmarks from page 4 of the model card:

    | Benchmark             | 3 Pro     | 2.5 Pro | Sonnet 4.5 | GPT-5.1   |
    |-----------------------|-----------|---------|------------|-----------|
    | Humanity's Last Exam  | 37.5%     | 21.6%   | 13.7%      | 26.5%     |
    | ARC-AGI-2             | 31.1%     | 4.9%    | 13.6%      | 17.6%     |
    | GPQA Diamond          | 91.9%     | 86.4%   | 83.4%      | 88.1%     |
    | AIME 2025             |           |         |            |           |
    |   (no tools)          | 95.0%     | 88.0%   | 87.0%      | 94.0%     |
    |   (code execution)    | 100%      | -       | 100%       | -         |
    | MathArena Apex        | 23.4%     | 0.5%    | 1.6%       | 1.0%      |
    | MMMU-Pro              | 81.0%     | 68.0%   | 68.0%      | 80.8%     |
    | ScreenSpot-Pro        | 72.7%     | 11.4%   | 36.2%      | 3.5%      |
    | CharXiv Reasoning     | 81.4%     | 69.6%   | 68.5%      | 69.5%     |
    | OmniDocBench 1.5      | 0.115     | 0.145   | 0.145      | 0.147     |
    | Video-MMMU            | 87.6%     | 83.6%   | 77.8%      | 80.4%     |
    | LiveCodeBench Pro     | 2,439     | 1,775   | 1,418      | 2,243     |
    | Terminal-Bench 2.0    | 54.2%     | 32.6%   | 42.8%      | 47.6%     |
    | SWE-Bench Verified    | 76.2%     | 59.6%   | 77.2%      | 76.3%     |
    | t2-bench              | 85.4%     | 54.9%   | 84.7%      | 80.2%     |
    | Vending-Bench 2       | $5,478.16 | $573.64 | $3,838.74  | $1,473.43 |
    | FACTS Benchmark Suite | 70.5%     | 63.4%   | 50.4%      | 50.8%     |
    | SimpleQA Verified     | 72.1%     | 54.5%   | 29.3%      | 34.9%     |
    | MMLU                  | 91.8%     | 89.5%   | 89.1%      | 91.0%     |
    | Global PIQA           | 93.4%     | 91.5%   | 90.1%      | 90.9%     |
    | MRCR v2 (8-needle)    |           |         |            |           |
    |   (128k avg)          | 77.0%     | 58.0%   | 47.1%      | 61.6%     |
    |   (1M pointwise)      | 26.3%     | 16.4%   | n/s        | n/s       |
n/s = not supported

EDIT: formatting, hopefully a bit more mobile friendly



Wow. They must have had some major breakthrough. Those scores are truly insane. O_O

Models have begun to fairly thoroughly saturate "knowledge" and such, but there are still considerable bumps there

But the _big news_, and the demonstration of their achievement, is the incredible set of scores they've racked up on what's necessary for agentic AI to become widely deployable. t2-bench. Visual comprehension. Computer use. Vending-Bench. The sorts of things that are necessary for AI to move beyond an auto-researching tool, and into the realm where it can actually handle complex tasks in the way that businesses need in order to reap rewards from deploying AI tech.

Will be very interesting to see what papers are published as a result of this, as they have _clearly_ tapped into some new avenues for training models.

And here I was, all wowed, after playing with Grok 4.1 for the past few hours! xD


The problem is that we know the benchmarks in advance. Take Humanity's Last Exam, for example: it's way easier to optimize your model when you have seen the questions before.


From https://lastexam.ai/: "The dataset consists of 2,500 challenging questions across over a hundred subjects. We publicly release these questions, while maintaining a private test set of held out questions to assess model overfitting." [emphasis mine]

While the private questions don't seem to be included in the performance results, HLE will presumably flag any LLM that appears to have gamed its scores based on the differential performance on the private questions. Since they haven't yet, I think the scores are relatively trustworthy.
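
To make the idea concrete, here's a minimal sketch (Python, with invented scores and an invented 5-point threshold) of the kind of differential check HLE could run on the public vs. held-out private questions:

    # Sketch of the differential check described above.
    # All scores and the threshold are made up for illustration.

    def overfitting_gap(public_acc: float, private_acc: float) -> float:
        """Gap between public-set and held-out private-set accuracy, in points."""
        return public_acc - private_acc

    def looks_gamed(public_acc: float, private_acc: float, threshold: float = 5.0) -> bool:
        """Flag a model whose public score beats its private score by more than
        `threshold` points, hinting it may have trained on the released questions."""
        return overfitting_gap(public_acc, private_acc) > threshold

    print(looks_gamed(37.5, 36.2))  # False: small gap, score looks trustworthy
    print(looks_gamed(37.5, 24.0))  # True: big gap, likely overfit to public questions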


The jump in ARC-AGI and MathArena suggests Google has solved the data scarcity problem for reasoning, maybe with synthetic data self-play??

This was the primary bottleneck preventing models from tackling novel scientific problems they haven't seen before.

If Gemini 3 Pro has transcended "reading the internet" (knowledge saturation), and made huge progress in "thinking about the internet" (reasoning scaling), then this is a really big deal.


How do they hold back questions in practice though? These are hosted models. To ask the question is to reveal it to the model team.


They pinky swear not to store and use the prompts and data lol


A legally binding pinky swear LOL


with fine print somewhere on page #67 saying that there are exceptions.


Who needs fine print when there is an SRE with access to the servers who is friends with a research director who gets paid more if the score goes up?


You have to trust that the LLM provider isn't copying the questions when Humanity's Last Exam runs the test.


There are only eleventy trillion dollars shifting around based on the results, so nobody has any reason to lie.


Seems difficult to believe, considering the number of people who prepare this dataset who also work(ed) at, or hold shares in, Google or OpenAI, etc.


So everybody is cheating, in your mind? We can't trust anything? How about taking a more balanced take: there's certainly some progress, and while the benchmark results most likely don't reflect real-world performance exactly, the progress is continuous.


This. A lot of boosters point to benchmarks as justification of their claims, but any gamer who spent time in the benchmark trenches will know full well that vendors game known tests for better scores, and that said scores aren’t necessarily indicative of superior performance. There’s not a doubt in my mind that AI companies are doing the same.


I don't think any of these companies are so reductive and short-sighted as to try to game the system. However, Goodhart's Law comes into play. I am sure they have their own metrics that are much more detailed than these benchmarks, but the fact remains that LLMs will be tuned according to elements that are deterministically measurable.


Shouldn't we expect that all of the companies are doing this optimization, though? So, back to a level playing field.


It's the other way around too: HLE questions were selected adversarially to reduce the scores. I'd guess that even if the questions were never released and new training data was introduced, the scores would improve.


not possible on ARC-AGI, AFAIK


SWE-Bench Verified | 76.2% | 59.6% | 77.2% | 76.3% is actually insane.


Anthropic found their corner and are standing strong there.


These numbers are impressive, to say the least. It looks like Google has produced a beast that will raise the bar even higher. What's even more impressive is how Google came into this game late and went from producing a few flops to being the leader at this (actually, they already achieved the title with 2.5 Pro).

What makes me even more curious is the following

> Model dependencies: This model is not a modification or a fine-tune of a prior model

So did they start from scratch with this one?


Google was never really late. Where people perceived Google to have dropped the ball was in its productization of AI. Google's Bard branding stumble was so (hilariously) bad that it threw a lot of people off the scent.

My hunch is that, aside from "safety" reasons, the Google Books lawsuit left some copyright wounds that Google did not want to reopen.


Google’s productization is still rather poor. If I want to use OpenAI’s models, I go to their website, look up the price and pay it. For Google’s, I need to figure out whether I want AI Studio or Google Cloud Code Assist or AI Ultra, etc, and if this is for commercial use where I need to prevent Google from training on my data, figuring out which options work is extra complicated.

As of a couple weeks ago (the last time I checked) if you are signed in to multiple Google accounts and you cannot accept the non-commercial terms for one of them for AI Studio, the site is horribly broken (the text showing which account they’re asking you to agree to the terms for is blurred, and you can’t switch accounts without agreeing first).

In Google’s very slight defense, Anthropic hasn’t even tried to make a proper sign in system.


Not to mention no macOS app. This is probably unimportant to many in the hn audience, but more broadly it matters for your average knowledge worker.


And a REALLY good macOS app.

Like, kind of unreasonably good. You'd expect some perfunctory Electron app that just barely wraps the website. But no, you get something that feels incredibly polished…more so than a lot of recent apps from Apple…and has powerful integrations into other apps, including text editors and terminals.


Which app are you referring to?


The ChatGPT app for Mac is native and very good.


Anthropic sign-on is surprisingly bad.


Bard was horrible compared to the competition of the time.

Gemini 1.0 was strictly worse than GPT-3.5 and was unusable due to "safety" features.

Google followed that up with 1.5 which was still worse than GPT-3.5 and unbelievably far behind GPT-4. At this same time Google had their "black nazi" scandals.

With Gemini 2.0, Google finally had a model that was at least useful for OCR, and with their fash series a model that, while not up to par in capabilities, was sufficiently inexpensive that it found uses.

Only with Gemini-2.5 did Google catch up with SoTA. It was within "spitting distance" of the leading models.

Google did indeed drop the ball, very, very badly.

I suspect that Sergey coming back helped immensely, somehow. I suspect that he was able to tame some of the more dysfunctional elements of Google, at least for a time.


I feel like 1.5 was still pretty good -- my school blocked ChatGPT at the time but didn't bother with anything else, so I was using it more than anything else for general research help and it was fine. That blocking is probably the biggest reason I use Gemini 90% of the time now, because school can never block Google Search, and AI Mode is in that now. That, and the Android integration.

To be fair, for my use case (apart from GitHub Copilot stuff with Claude 4.5 Sonnet) I've never noticed too big of a difference between the actual models, and am more inclined to judge them by their ancillary services and speed, which Google excels in.


Gemini 1.5 Pro was definitely useful at OCR. I used it for that on the free tier.


> their fash series

Unfortunate typo.


Oh, I remember the times when I compared Gemini with ChatGPT and Claude. Gemini was so far behind, it was barely usable. And now they are pushing the boundaries.


You could argue that chat-tuning of models falls more along the lines of product competence. I don't think there was a doubt about the upper ceiling of what people thought Google could produce.. more "when will they turn on the tap" and "can Pichai be the wartime general to lead them?"


The memory of Microsoft's Tay fiasco was strong around the time the brain team started playing with chatbots.


Google was catastrophically traumatized throughout the org when they had that photos AI mislabel black people as gorillas. They turned the safety and caution knobs up to 12 after that for years, really until OpenAI came along and ate their lunch.


It still haunts them. Even in the brand-new Gemini-based rework of Photos search and image recognition, "gorilla" is a completely blacklisted word.


It should be blocklisted instead. How insensitive of them.


Oh, they were so late that there were internal leaked ('leaked'?) memos about a couple of grad students with a $100 budget outdoing their lab a couple of years ago. They picked themselves up real nice, but it took a serious reorg.


At least at the moment, coming in late seems to matter little.

Anyone with money can trivially catch up to a state of the art model from six months ago.

And as others have said, late is really a function of spigot, guardrails, branding, and ux, as much as it is being a laggard under the hood.


> Anyone with money can trivially catch up to a state of the art model from six months ago.

How come apple is struggling then?


Apple is struggling with _productizing_ LLMs for the mass market, which is a separate task from training a frontier LLM.

To be fair to Apple, so far the only mass-market LLM use case is a simple chatbot, and they don't seem to be interested in that. It remains to be seen if what Apple wants to do ("private" LLMs with access to your personal context acting as intimate personal assistants) is even possible to do reliably. It sounds useful, and I do believe it will eventually be possible, but no one is there yet.

They did botch the launch by announcing the Apple Intelligence features before they were ready, though.


Anyone with enough money and without an entrenched management hierarchy preventing the right people from being hired and enabled to run the project.


It looks more like a strategic decision tbh.

They may want to use a 3rd party, or just wait for AI to become more stable and see how people actually use it, instead of adding slop to the core of their product.


> It looks more like a strategic decision tbh.

Announcing a load of AI features on stage and then failing to deliver them doesn't feel very strategic.


In contrast to Microsoft, who puts Copilot buttons everywhere and succeeds only in annoying their customers.


This is revisionist history. Apple wanted to fully jump in. They even rebranded AI as Apple Intelligence and announced a horde of features which turned out to be vaporware.


But apple intelligence is a thing, and they are struggling to deliver on the promises of apple intelligence.


Sit and wait per usual.

Enter late, enter great.


One possibility here is that Google is dribbling out cutting edge releases to slowly bleed out the pure play competition.


Being known as a company that is always six months behind the competitors isn't something to brag about...


I was referring to a new entrant, not perpetual lag


Apple has entered the chat.


There are no leaders. Every other month a new LLM comes out and outperforms the previous ones by a small margin; the benchmarks always look good (probably because the models are trained on the answers), but in practice the models are basically indistinguishable from the previous ones (take GPT-4 vs 5). We've been in this loop since around the release of ChatGPT 4, when all the main players started this cycle.

The biggest strides in the last 6-8 months have been in generative AIs, specifically for animation.


I hope they keep the pricing similar to 2.5 Pro. Currently I pay per token, and that and GPT-5 are close to the sweet spot for me, but Sonnet 4.5 feels too expensive for larger changes. I've also been moving around 100M tokens per week with Cerebras Code (they moved to GLM 4.6), but the flagship models still feel better when I need help with more advanced debugging or some exemplary refactoring to then feed as an example for a dumber/faster model.


> So did they start from scratch with this one

Their major version number bumps are a new pre-trained model. Minor bumps are changes/improvements to post-training on the same foundation.


And also, critically, being the only profitable company doing this.


It's not like they're making their money from this, though. All AI work is heavily subsidised; for Alphabet it just happens that the funding comes from within the megacorp. If MS had fully absorbed OpenAI back when their board nearly sank the ship, they'd be in the exact same situation today.


They're not making money, but they're in a much better situation than Microsoft/OpenAI because of TPUs. TPUs are much cheaper than Nvidia cards both to purchase and to operate, so Google's AI efforts aren't running at as much of a loss as everyone else. That's why they can do things like offer Gemini 3 Pro for free.


A lot of major providers offer their cutting edge model for free in some form these days, that's merely a market penetration strategy. At the end of the day (if you look at the cloud prices), TPUs are only about 30% cheaper. But NVidia produces orders of magnitude more cards. So Google will certainly need more time to train and globally deploy inference for their frontier models. For example, I doubt they could do with TPUs what xAI did with Nvidia cards.


What does it mean nowadays to start from scratch? At least in the open scene, most of the post-training data is generated by other LLMs.


They had to start with a base model, that part I am certain of


That looks impressive, but some of those numbers are a bit out of date.

On Terminal-Bench 2 for example, the leader is currently "Codex CLI (GPT-5.1-Codex)" at 57.8%, beating this new release.


What's more impressive is that I find gemini2.5 still relevant in day-to-day usage, despite being so low on those benchmarks compared to claude 4.5 and gpt 5.1. There's something that gemini has that makes it a great model in real cases, I'd call it generalisation on its context or something. If you give it the proper context (or it digs through the files in its own agent) it comes up with great solutions. Even if their own coding thing is hit and miss sometimes.

I can't wait to try 3.0; hopefully it continues this trend. Raw numbers in a table don't mean much; you can only get a true feeling once you use it on existing code, in existing projects. Anyway, the top labs keeping each other honest is great for us, the consumers.


I've noticed that too. I suspect it has broader general knowledge than the others, because Google presumably has the broadest training set.


That's a different model not in the chart. They're not going to include hundreds of fine tunes in a chart like this.


It's also worth pointing out that comparing a fine-tune to a base model is not apples-to-apples. For example, I have to imagine that the codex finetune of 5.1 is measurably worse at non-coding tasks than the 5.1 base model.

This chart (comparing base models to base models) probably gives a better idea of the total strength of each model.


It's not just one of many fine tunes; it's the default model used by OpenAI's official tools.


I would love to know how much the token counts increased across these models for the benchmarks. I find the models continue to get better, but as they do, their token usage also grows. In other words: is the model doing better, or just reasoning for longer?
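
One crude way to separate the two (a Python sketch with entirely made-up scores and token counts, since model cards rarely publish per-benchmark token usage) is to normalize the score by the reasoning tokens spent to earn it:

    # All numbers invented for illustration; swap in real data if a lab
    # ever publishes per-benchmark token usage alongside scores.
    runs = {
        "frontier_model": {"score": 76.2, "reasoning_tokens": 42_000_000},
        "previous_model": {"score": 72.1, "reasoning_tokens": 15_000_000},
    }

    for name, r in runs.items():
        points_per_m = r["score"] / (r["reasoning_tokens"] / 1e6)
        print(f"{name}: {r['score']}% with {r['reasoning_tokens']:,} tokens "
              f"-> {points_per_m:.2f} points per million reasoning tokens")

If the points-per-token figure drops while the headline score rises, most of the gain is "reasoning for longer" rather than "doing better".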


I think that is always something that is being worked on in parallel. Recent paradigm seems to be the models understanding when they need to use more tokens dynamically (which seems to be very much in line with how computation should generally work).


Should I assume the GPT-5.1 it is compared against is the pro version?


Which of the LiveCodeBench Pro and SWE-Bench Verified benchmarks comes closer to everyday coding assistant tasks?

Because it seems to lead by a decent margin on the former and trails behind on the latter


I do a lot of testing on SWE-bench Verified as well. In my opinion, this benchmark is now mainly good for catching regressions on the agent side.

However, above 75%, the models are likely about the same. The remaining instances are likely underspecified, despite the effort of the authors who made the benchmark "verified". From what I have seen, these are often cases where the problem statement says to implement X for Y, but the agent simply has to guess whether to implement the same for another case Y', which decides whether the instance is won or lost.


Neither :(

LCB Pro is leetcode-style questions, and SWE-bench Verified is heavily benchmaxxed, very old Python tasks.


But ... what's missing from this comparison: Kimi-K2.

When ChatGPT-3 exploded, OpenAI had at least double the benchmark scores of any other model, open or closed. Gemini 3 Pro (not the model they actually serve) outperforms the best open model ... wait it does not uniformly beat the best open model anymore. Not even close.

Kimi K2 beats Gemini 3 Pro on several benchmarks. On average, Gemini 3 Pro scores just under 10% better than the best open model, currently Kimi K2.

Gemini 3 Pro is in fact only the best in about half the benchmarks tested there. In fact ... this could be another Llama 4 moment. The reason Gemini 3 Pro is the best model is a very high score on a single benchmark ("Humanity's Last Exam"); if you take that benchmark out, GPT-5.1 remains the best model available. The other big improvement is "SciCode", and if you take that out too, the best open model, Kimi K2, beats Gemini 3 Pro.

https://artificialanalysis.ai/models

And then, there's the pricing:

Kimi K2 on OpenRouter: $0.50 / M input tokens, $2.40 / M output tokens

Gemini 3 Pro: for contexts ≤ 200,000 tokens, US$2.00 per 1M input tokens and US$12.00 per 1M output tokens; for contexts > 200,000 tokens (long-context tier), US$4.00 per 1M input tokens and US$18.00 per 1M output tokens.

So Gemini 3 pro is 4 times, 400%, the price of the best open model (and just under 8 times, 800%, with long context), and 70% more expensive than GPT-5.1
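
As a rough illustration of that gap, here's a Python sketch using the prices quoted above; the 800k-input / 20k-output request mix is just an invented example:

    def cost_usd(input_tokens, output_tokens, in_price_per_m, out_price_per_m):
        """Cost of one request given per-million-token prices."""
        return (input_tokens / 1e6) * in_price_per_m + (output_tokens / 1e6) * out_price_per_m

    # Kimi K2 via OpenRouter (flat pricing)
    kimi = cost_usd(800_000, 20_000, in_price_per_m=0.50, out_price_per_m=2.40)

    # Gemini 3 Pro, long-context tier (>200k input tokens)
    gemini = cost_usd(800_000, 20_000, in_price_per_m=4.00, out_price_per_m=18.00)

    print(f"Kimi K2:      ${kimi:.2f}")    # ~$0.45
    print(f"Gemini 3 Pro: ${gemini:.2f}")  # ~$3.56, roughly 8x for this input-heavy mix

The exact multiple depends on the input/output mix, since the input and output price ratios differ.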

The closed models in general, and Google specifically, serve Gemini 3 pro at double to triple the speed (as in tokens-per-second) of openrouter. Although even here it is not the best, that's openrouter with gpt-oss-120b.


This is a big jump in most benchmarks. And if it can match other models in coding while having that Google TPU inference speed and an actually native 1M context window, it's going to be a big hit.

I hope it isn't such a sycophant like the current Gemini 2.5 models; the sycophancy makes me doubt its output, which is maybe a good thing now that I think about it.


> it's over for the other labs.

What's with the hyperbole? It'll tighten the screws, but saying that it's "over for the other labs" might be a tad premature.


I mean over in that I don't see a need to use the other models. Codex models are the best but incredibly slow. Claude models are not as good (IMO) but much faster. If Gemini can beat them while being faster and having better apps with better integrations, I don't see a reason why I would use another provider.


You should probably keep supporting competitors since if there's a monopoly/duopoly expect prices to skyrocket.


> it's over for the other labs.

It's not over, and never will be, for two-decade-old accounting software; it definitely will not be over for other AI labs.


Can you explain what you mean by this? iPhone was the end of Blackberry. It seems reasonable that a smarter, cheaper, faster model would obsolete anything else. ChatGPT has some brand inertia, but not that much given it's barely 2 years old.


Yeah iPhone was the end of Blackberry but Google Pixel was not the end of iPhone.

The new Gemini is not THAT far of a jump to switch your org to a new model if you already invested in e.g. OpenAI.

The difference must be night and day to call it "it's over".

Right, they all are marginally different. Today Google fine-tuned their model to be better; tomorrow it will be a new Kimi, and after that DeepSeek.


Ask yourself why Microsoft Teams won. These are business tools first and foremost.


That's an odd take. Teams doesn't have the leading market share in videoconferencing, Zoom does. I can't judge what it's like because I've never yet had to use Teams - not a single company that we deal with uses it, it's all Zoom and Chime - but I do hear friends who have to use it complain about it all the time. (Zoom is better than it used to be, but for all that is holy please get rid of the floating menu when we're sharing screens)


Looks like the best way to keep improving the models is to come up with really useful benchmarks and make them popular. ARC-AGI-2 is a big jump, I'd be curious to find out how that transfers over to everyday tasks in various fields.


Used an AI to populate some of 5.1 thinking's results.

    | Benchmark             | Gemini 3 Pro | Gemini 2.5 Pro | Claude Sonnet 4.5 | GPT-5.1   | GPT-5.1 Thinking |
    |-----------------------|--------------|----------------|-------------------|-----------|------------------|
    | Humanity's Last Exam  | 37.5%        | 21.6%          | 13.7%             | 26.5%     | 52%              |
    | ARC-AGI-2             | 31.1%        | 4.9%           | 13.6%             | 17.6%     | 28%              |
    | GPQA Diamond          | 91.9%        | 86.4%          | 83.4%             | 88.1%     | 61%              |
    | AIME 2025             | 95.0%        | 88.0%          | 87.0%             | 94.0%     | 48%              |
    | MathArena Apex        | 23.4%        | 0.5%           | 1.6%              | 1.0%      | 82%              |
    | MMMU-Pro              | 81.0%        | 68.0%          | 68.0%             | 80.8%     | 76%              |
    | ScreenSpot-Pro        | 72.7%        | 11.4%          | 36.2%             | 3.5%      | 55%              |
    | CharXiv Reasoning     | 81.4%        | 69.6%          | 68.5%             | 69.5%     | N/A              |
    | OmniDocBench 1.5      | 0.115        | 0.145          | 0.145             | 0.147     | N/A              |
    | Video-MMMU            | 87.6%        | 83.6%          | 77.8%             | 80.4%     | N/A              |
    | LiveCodeBench Pro     | 2,439        | 1,775          | 1,418             | 2,243     | N/A              |
    | Terminal-Bench 2.0    | 54.2%        | 32.6%          | 42.8%             | 47.6%     | N/A              |
    | SWE-Bench Verified    | 76.2%        | 59.6%          | 77.2%             | 76.3%     | N/A              |
    | t2-bench              | 85.4%        | 54.9%          | 84.7%             | 80.2%     | N/A              |
    | Vending-Bench 2       | $5,478.16    | $573.64        | $3,838.74         | $1,473.43 | N/A              |
    | FACTS Benchmark Suite | 70.5%        | 63.4%          | 50.4%             | 50.8%     | N/A              |
    | SimpleQA Verified     | 72.1%        | 54.5%          | 29.3%             | 34.9%     | N/A              |
    | MMLU                  | 91.8%        | 89.5%          | 89.1%             | 91.0%     | N/A              |
    | Global PIQA           | 93.4%        | 91.5%          | 90.1%             | 90.9%     | N/A              |
    | MRCR v2 (8-needle)    | 77.0%        | 58.0%          | 47.1%             | 61.6%     | N/A              |

Argh, it doesn't come out right in HN


Used an AI to populate some of 5.1 thinking's results.

    | Benchmark            | Description          | Gemini 3 Pro | GPT-5.1 (Thinking) | Notes                                      |
    |----------------------|----------------------|--------------|--------------------|--------------------------------------------|
    | Humanity's Last Exam | Academic reasoning   | 37.5%        | 52%                | GPT-5.1 shows 7% gain over GPT-5's 45%     |
    | ARC-AGI-2            | Visual abstraction   | 31.1%        | 28%                | GPT-5.1 multimodal improves grid reasoning |
    | GPQA Diamond         | PhD-tier Q&A         | 91.9%        | 61%                | GPT-5.1 strong in physics (72%)            |
    | AIME 2025            | Olympiad math        | 95.0%        | 48%                | GPT-5.1 solves 7/15 proofs correctly       |
    | MathArena Apex       | Competition math     | 23.4%        | 82%                | GPT-5.1 handles 90% advanced calculus      |
    | MMMU-Pro             | Multimodal reasoning | 81.0%        | 76%                | GPT-5.1 excels visual math (85%)           |
    | ScreenSpot-Pro       | UI understanding     | 72.7%        | 55%                | Element detection 70%, navigation 40%      |
    | CharXiv Reasoning    | Chart analysis       | 81.4%        | 69.5%              | N/A                                        |


This is provably false. All it takes is a simple Google search and looking at the ARC AGI 2 leaderboard: https://arcprize.org/leaderboard

The 17.6% is for 5.1 Thinking High.


What? The 4.5 and 5.1 columns aren't thinking in Google's report?

That's a scandal, IMO.

Given that Gemini-3 seems to do "fine" against the thinking versions why didn't they post those results? I get that PMs like to make a splash but that's shockingly dishonest.


Is that true?

> For Claude Sonnet 4.5, and GPT-5.1 we default to reporting high reasoning results, but when reported results are not available we use best available reasoning results.

https://storage.googleapis.com/deepmind-media/gemini/gemini_...


Every single time


We knew it would be a big jump, and while it certainly is in many areas, it's definitely not "groundbreaking/huge leap" worthy like some were thinking from looking at these numbers.

I feel like many will be pretty disappointed by their self-created expectations for this model when they end up actually using it and it turns out to be fairly similar to other frontier models.

Personally I'm very interested in how they end up pricing it.


Looks like it will be on par with the contenders when it comes to coding. I guess improvements will be incremental from here on out.


> I guess improvements will be incremental from here on out.

What do you mean? These coding leaderboards were at single digits about a year ago and are now in the seventies. These frontier models are arguably already better at the benchmark than any single human - it's unlikely that any particular human dev is knowledgeable enough to tackle the full range of diverse tasks even in the smaller SWE-Bench Verified within a reasonable time frame; to the best of my knowledge, no one has tried that.

Why should we expect this to be the limit? Once the frontier labs figure out how to train these fully with self-play (which shouldn't be that hard in this domain), I don't see any clear limit to the level they can reach.


A new benchmark comes out, it's designed so nothing does well at it, the models max it out, and the cycle repeats. This could either describe massive growth of LLM coding abilities or a disconnect between what the new benchmarks are measuring & why new models are scoring well after enough time. In the former assumption there is no limit to the growth of scores... but there is also not very much actual growth (if any at all). In the latter the growth matches, but the reality of using the tools does not seem to say they've actually gotten >10x better at writing code for me in the last year.

Whether an individual human could do well across all tasks in a benchmark is probably not the right question to be asking a benchmark to measure. It's quite easy to construct benchmark tasks a human can't do well in that you don't even need AI to do better.


Your mileage may vary, but for me, working today with the latest version of Claude Code on a non-trivial python web dev project, I do absolutely feel that I can hand over to the AI coding tasks that are 10 times more complex or time consuming than what I could hand over to copilot or windsurf a year ago. It's still nowhere close to replacing me, but I feel that I can work at a significantly higher level.

What field are you in where you feel that there might not have been any growth in capabilities at all?

EDIT: Typo


Claude 3.5 came out in June of last year, and it is imo marginally worse than the AI models currently available for coding. I do not think models are 10x better than 1 year ago, that seems extremely hyperbolic or you are working in a super niche area where that is true.


Are you using it for agentic tasks of any length? 3.5 and 4.5 are about the same for single file/single snippet tasks, but my observation has been that 4.5 can do longer, more complex tasks that were a waste of time to even try with 3.5 because it would always fail.


Yes, this is important. GPT-5 and o3 were roughly equivalent for a one-shot, one-file task. But 5 and Codex-5 can just work for an hour in a way no model was able to before (the newer Claudes can too).


I use the newer Claudes, and letting them work for an hour leads to horrible code that does not work over 50% of the time. Maybe I am not the target person for agentic tasks; all I use agents for is product searches on the internet when I have specific constraints and I don't want to waste an hour looking for something.


Your knowledge on the topic is at least six months out of date; April 2025 was a huge leap forward in usability, and recent releases in the last 30 days are at least what I would call a full generation newer technology than June of 2024. Summer 2025 was arguably the dawn of true AI assisted coding. Heck reasoning models were still bleeding edge in late December 2024. They might not be 10x better but their ability to competently use (and build their own) tools makes them almost incomparable to last year's technology.


Maybe I am just using them wrong, but I don't know how my knowledge can be out of date considering I use the tools every day and pay for Claude and Gemini. I genuinely think GPT-5 was worse than previous models, for reference. They are for sure marginally better, but I don't even think 2x better, let alone 10x better.


I'm in product management focused on networking. I can use the tools to create great mockups in a fraction of the time, but the actual turnaround of that into production-ready code has not changed much. The team being able to build test cases and pipelines a bit more quickly is probably the main gain in getting code written.


Google has had a lot of time to optimise for those benchmarks, and just barely made SOTA (or not even SOTA) now. How is that not incremental?


If we're being completely honest, a benchmark is like an honest exam: any set of questions can only be used once when it comes out. Otherwise you're only testing how well people can acquire and memorize exact questions.


If it’s on par in code quality, it would be a way better model for coding because of its huge context window.


Sonnet can also work on 1M context. Its extreme speed is the only thing Gemini has on others.


Can it now in Claude Code and Claude Desktop? When I was using it a couple of months ago it seemed only the API had 1M


The vending-bench 2 benchmark is kind of nutty [1].

Not sure 360 days is enough of a sample really but it's an interesting take on AI benchmarks.

Are there any other interesting benchmarks to look at?

[1] https://andonlabs.com/evals/vending-bench-2


Very impressive. I wonder if this sends a different signal to the market regarding using TPUs for training SOTA models versus Nvidia GPUs. From what we've seen, OpenAI is already renting them to diversify... Curious to see what happens next.


Big if true.

I'll wait for the official blog with benchmark results.

I suspect that our ability to benchmark models is waning. Much more investment is required in this area, but how does this play out?


Really great results, although with the results being so high, I tried a simple example of object detection and the performance was kind of poor in agentic frameworks. Need to see how this performs on other tasks.


Why is Grok 4.1 not in the benchmarks?


Nice numbers, but what does this actually mean?

What does this model do that others can't already?



