aliljet's comments

Is there clarity right now around foreign students attempting to obtain H-1Bs in the future?

Yes. F-1 students can get H-1B visas. The issue is the $100K payment, which applies if the H-1B petition is filed with a request to notify a consulate rather than a request to change status. But if it's filed with, and approved as, a change of status, then a subsequent H-1B visa application will not trigger the $100K payment.

This is absolutely fantastic. I wish this included commercial loans like DSCRs...

Thanks! Re: DSCRs, point me to the data!

I wonder how effective this is in medical scenarios? Segmenting organs and tumors in CT scans or MRIs?

The real question I have after seeing the usage rug being pulled is what this costs and how usable this ACTUALLY is with a Claude Max 20x subscription. In practice, Opus is basically unusable by anyone paying enterprise prices. And the modification of "usage" quotas has made the platform fundamentally unstable, and honestly, it left me personally feeling cheated by Anthropic...

This is great, but I was hoping to read a bunch of hilarious poetry. Where is the actual poetry?!


Precisely why Gemini 3 isn't at the front of the pack on SWE Bench is really what I was hoping to understand here. Especially for a blog post targeted at software developers...


It doesn't matter, the real benchmark is taking the community temperature on the model after a few weeks of usage.


>"It doesn't matter, the real benchmark is taking the community temperature on the model after a few weeks of usage."

Indeed. It's almost impossible to truly know a model before spending a few million tokens on a real world task. It will take a step-change level advancement at this point for me to trust anything but Claude right now.


Imho Gemini 2.5 was by far the better model on non-trivial tasks.


To this day, I still don't understand why Claude gets more acclaim for coding. Gemini 2.5 consistently outperformed Claude and ChatGPT mostly because of the much larger context.


I'm not sure about this. I used gemini and claude for about 12 hours a day for a month and a half straight in an unhealthy programmer bender and claude was FAR superior. It was not really that close. Going to be interesting to test gemini 3 though.


Gemini 2.5 is prone to apology loops, and often confuses its own thinking with user input, replying to itself. GPT-5 likes to refuse tasks with "sorry I can't help with that". At least in VSCode's GitHub Copilot Agent mode. Claude hasn't screwed up like that for me.


Different styles of usage? I see Gemini praised for letting you feed it the whole project and ask for changes. Which is cool and all but... I never do that. Claude, for me, is better for specific modifications to specific parts of the app. There's a lot of context behind what's "better".


I can't really explain why I have barely used Gemini.

I think it was just timing with the way models came out. This will be the first time I have a Gemini subscription and nothing else, and the first time I really see what it can do fully.


Gemini 2.5 and now 3 seem to continue their trend of being horrific in agentic tasks, but almost always impress me with the single first shot request.

Claude Sonnet is way better about following up and making continuous improvements during a long running session.

For some reason Gemini will hard freeze-up on the most random queries, and when it is able to successfully continue past the first call, it only keeps a weird summarized version of its previous run available to itself, even though it's in the payload. It's a weird model.

My take is that it's world-class at one-shotting, and if a task benefits from that, absolutely use it.


Gemini 2.5 couldn't apply an edit to a file if its life depended on it.

So unless you love copy/pasting code, Gemini 2.5 was useless for agentic coding.

Great for taking its output and asking Sonnet to apply it though.


I use Gemini CLI, Claude Code and Codex daily. If I present the same bug to all 3, Gemini is often the one missing a part of the solution or drawing the wrong conclusion. I am curious about Gemini 3.


The secret sauce isn't Claude the model, but Claude Code the tool. Harness > model.


The secret sauce is the MCP that lots of people are starting to talk badly about.


Claude doesn't gaslight me or flat out refuse to do something I ask it to because it believes it won't work anyway. Gemini does.

Gemini also randomly just reverts everything because of some small mistake it found, and makes assumptions without checking whether they're true (e.g. this lib absolutely HAS TO HAVE a login() method; if we get a compile error, it's my env setup's fault).

It’s just not a pleasant model to work with


confirmed, but also happens occasionally with Claude


SWEBench-Verified is probably benchmaxxed at this stage. Claude isn't even the top performer; that honor goes to Doubao [1].

Also, the confidence interval for such a small dataset is about 3 percentage points, so these differences could just be down to chance.

[1] https://www.swebench.com/
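
A quick back-of-the-envelope check on that, assuming the ~500-task Verified split and a plain binomial approximation (my numbers, not anything from the benchmark authors):

    // 95% CI half-width for a pass rate of ~0.77 over ~500 tasks
    const n = 500;                                          // assumed size of SWE-bench Verified
    const p = 0.77;                                         // observed pass rate
    const halfWidth = 1.96 * Math.sqrt((p * (1 - p)) / n);
    console.log((halfWidth * 100).toFixed(1) + " pp");      // ~3.7 percentage points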


Claude 4.5 gets 82% on their own highly customized scaffolding (parallel compute with a scoring function). That beats Doubao.


Yeah, they mention a benchmark I'm seeing for the first time (Terminal-Bench 2.0) that they're supposedly leading in, while for some reason SWE Bench is down from Sonnet 4.5.

Curious to see some third-party testing of this model. Currently it seems to primarily improve "general non-coding and visual reasoning", based on the benchmarks.


They are not even leading in Terminal-Bench... GPT 5.1-codex is better than Gemini 3 Pro


Why is this particular benchmark important?


Thus far, this is one of the best objective evaluations of real world software engineering...


I concur with the other commenters, 4.5 is a clear improvement over 4.


Idk, Sonnet 4.5 scores better than Sonnet 4.0 on that benchmark, but is markedly worse in my usage. The utility of the benchmark is fading as it is gamed.


I think I and many others have found Sonnet 4.5 to generally be better than Sonnet 4 for coding.


Maybe if you conform to its expectations for how you use it. 4.5 is absolutely terrible at following directions, thinks it knows better than you, and will gaslight you until specifically called out on its mistake.

I have scripted prompts for long-duration automated coding workflows of the fire-and-forget, issue description -> pull request variety. Sonnet 4 does better than you'd expect: it generates high-quality mergeable code about half the time. Sonnet 4.5 fails literally every time.


I'm very happy with it TBH, it has some things that annoy me a little bit:

- slower compared to other models that will also do the job just fine (but excels at more complex tasks),

- it's very insistent on creating loads of .MD files with overly verbose documentation on what it just did (not really what I ask it to do),

- it actually deleted a file twice and went "oops, I accidentally deleted the file, let me see if I can restore it!". I haven't seen this happen with any other agent. The task wasn't even remotely about removing anything.


The last point is how it usually fails in my testing, fwiw. It usually ends up borking something up, and rather than back out and fix it, it does a 'git restore' on the file - wiping out thousands of lines of unrelated, unstaged code. It then somehow thinks it can recover this code by looking in the git history (??).

And yes, I have hooks to disable 'git reset', 'git checkout', etc., and warn the model not to use these commands and why. So it writes them to a bash script and calls that to circumvent the hook, successfully shooting itself in the foot.

Sonnet 4.5 will not follow directions. Because of this, you can't prevent it, like you could with earlier models, from doing something that destroys the worktree state. For longer-running tasks, the probability of it doing this at some point approaches 100%.
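
For reference, the hook script is roughly this shape (a bun sketch, assuming Claude Code's PreToolUse contract of tool-call JSON on stdin and exit code 2 to block; the banned-command list and file name are just mine):

    // block-destructive-git.ts - PreToolUse hook for the Bash tool
    const input = JSON.parse(await Bun.stdin.text());     // tool-call JSON piped in by Claude Code
    const command: string = input.tool_input?.command ?? "";

    // Commands that can silently wipe unstaged work
    const banned = ["git restore", "git reset", "git checkout --", "git clean"];

    if (banned.some((b) => command.includes(b))) {
      // Exit code 2 blocks the call; stderr gets fed back to the model
      console.error(`Blocked "${command}": it can destroy unstaged changes. Use the checkpoint skill instead.`);
      process.exit(2);
    }
    process.exit(0);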


> The last point is how it usually fails in my testing, fwiw. It usually ends up borking something up, and rather than back out and fix it, it does a 'git restore' on the file - wiping out thousands of lines of unrelated, unstaged code. It then somehow thinks it can recover this code by looking in the git history (??).

Man I've had this exact thing happen recently with Sonnet 4.5 in Claude Code!

With Claude I asked it to try tweaking the font weight of a heading to put the finishing touches on a new page we were iterating on. I looked at it and said, "Never mind, undo that", and it nuked 45 minutes' worth of work by running git restore.

It immediately realized it fucked up and started running all sorts of git commands and reading its own log trying to reverse what it did, and then came back 5 minutes later saying "Welp, I lost everything, do you want me to manually rebuild the entire page from our conversation history?"

In my CLAUDE.md I have instructions to commit unstaged changes frequently, but it often forgets, and sure enough, it forgot this time too. I had it read its log and write a post-mortem of WTF led it to run dangerous git commands to remove one line of CSS, then used that to write more specific rules about using git in the project CLAUDE.md, and blocked it from running "git restore" at all.

We'll see if that did the trick but it was a good reminder that even "SOTA" models in 2025 can still go insane at the drop of a hat.


The problem is that I'm trying to build workflows for generating sequences of good, high-quality, semantically grouped changes for pull requests. This requires having a bunch of unrelated changes existing in the work tree at the same time, doing dependency analysis on the sequence of commits, and then pulling out / staging just certain features at a time and committing those separately. It is sooo much easier to do this by explicitly avoiding the commit-every-2-seconds workaround and keeping things uncommitted in the work tree.

I have a custom checkpointing skill that I've written that it is usually good about using, making it easier to rewind state. But that requires a careful sequence of operations, and I haven't been able to get 4.5 to not go insane when it screws up.

As I said though, watch out for it learning that it can't run git restore, so it immediately jumps to Bash(echo "git restore" >file.sh && chmod +x file.sh && ./file.sh).


I think this is probably just a matter of noise. That's not been my experience with Sonnet 4.5 too often.

Every model from every provider at every version I've used has intermingled brilliant perfect instruction-following and weird mistaken divergence.


What do you mean by noise?

In this case I can't get 4.5 to follow directions. Neither can anyone else, apparently. Search for "Sonnet 4.5 follow instructions" and you'll find plenty of examples. The current top 2:

https://www.reddit.com/r/ClaudeCode/comments/1nu1o17/45_47_5...

https://theagentarchitect.substack.com/p/claude-sonnet-4-pro...


Not my experience at all, 4.5 is leagues ahead of the previous models, albeit not as good as Gemini 2.5.


I find 4.5 a much better model FWIW.


Does anyone trust benchmarks at this point? Genuine question. Isn't the scientific consensus that they are broken and poor evaluation tools?


They overemphasize tasks with a small context and without noise or red herrings in the context.


Honestly, I am inclined to think a lot of the people who are wowed by benchmarks and simple tech demos probably aren't doing very much at their day job, or they're working on simple codebases or ones that don't have very many users (more users == more bugs found). When you throw these models at complex software projects like SOAs, big object-oriented codebases, etc., their output can be totally unusable.


I make my own automated benchmarks


Is there a tool / website that makes this process easy?


I coded it with bun and openrouter.ai. I have an array of benchmarks; each benchmark has a grader (for example, checking if the output equals a certain string, or grading the answer automatically using another LLM). Then I save all results to a file and render the percentage correct to a graph.
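
The core is roughly this (a trimmed sketch; the benchmark entries and model slugs are placeholders, the only real dependency is OpenRouter's OpenAI-compatible chat completions endpoint):

    // bench.ts - run each benchmark against a model via OpenRouter, report % correct
    type Benchmark = { name: string; prompt: string; grade: (answer: string) => Promise<boolean> };

    async function ask(model: string, prompt: string): Promise<string> {
      const res = await fetch("https://openrouter.ai/api/v1/chat/completions", {
        method: "POST",
        headers: {
          Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`,
          "Content-Type": "application/json",
        },
        body: JSON.stringify({ model, messages: [{ role: "user", content: prompt }] }),
      });
      const data = await res.json();
      return data.choices[0].message.content;
    }

    const benchmarks: Benchmark[] = [
      // string-equality grader
      { name: "capital", prompt: "Capital of France? One word.", grade: async (a) => a.trim() === "Paris" },
      // LLM-as-judge grader: another model marks the answer PASS/FAIL
      {
        name: "summary",
        prompt: "Summarize TCP slow start in two sentences.",
        grade: async (a) =>
          (await ask("openai/gpt-4o-mini", `Reply PASS or FAIL: is this an accurate summary of TCP slow start?\n\n${a}`)).includes("PASS"),
      },
    ];

    const model = "google/gemini-3-pro-preview";          // model under test (placeholder slug)
    let correct = 0;
    for (const b of benchmarks) {
      if (await b.grade(await ask(model, b.prompt))) correct++;
    }
    await Bun.write("results.json", JSON.stringify({ model, pct: (100 * correct) / benchmarks.length }, null, 2));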


I mean... it achieved 76.2% vs the leader (Claude Sonnet) at 77.2%.

That's a "loss" I can deal with.


When will this be available in the cli?


Gemini CLI team member here. We'll start rolling out today.


How about for Pro (not Ultra) subscribers?


This is the heroic move everyone is waiting for. Do you know how this will be priced?


I'm already seeing it in https://aistudio.google.com/


What's wild here is that among every single score they've absolutely killed, somehow, Anthropic and Claude Sonnet 4.5 have won a single victory in the fight: SWE Bench Verified and only by a singular point.

I already enjoy Gemini 2.5 Pro for planning, and if Gemini 3 is priced similarly, I'll be incredibly happy to ditch the painfully pricey Claude Max subscription. To be fair, I've already got an extremely sour taste in my mouth from the last Anthropic bait and switch on pricing and usage, so I'm happy to see Google take the crown here.


SWE bench is weird because Claude has always underperformed on it relative to other models despite Claude Code blowing them away. The real test will be if Gemini CLI beats Claude Code, both using the agentic framework and tools they were trained on.


What we really desperately need is more context pruning from these LLMs: the ability to pull irrelevant parts of the context window out as a task is brought into focus.
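
Even something naive would go a long way, e.g. dropping stale tool output that no longer touches the files in focus. A toy sketch of the idea (not any existing API):

    // Keep instructions and the recent window; drop old tool results unrelated to the current focus set
    type Msg = { role: "system" | "user" | "assistant" | "tool"; content: string };

    function prune(history: Msg[], focusFiles: string[], keepRecent = 10): Msg[] {
      return history.filter((m, i) => {
        if (m.role === "system") return true;                  // never drop instructions
        if (i >= history.length - keepRecent) return true;     // keep the recent window intact
        if (m.role !== "tool") return true;                    // only prune bulky tool output
        return focusFiles.some((f) => m.content.includes(f));  // still relevant to the task?
      });
    }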


Working on that. Hopefully I'll release it by week's end. I'll send you a message when ready.


How does one effectively use something like this locally with consumer-grade hardware?


Once the MLX community get their teeth into it you might be able to run it on two 512GB M3 Ultra Mac Studios wired together - those are about $10,000 each though so that would be $20,000 total.

Update: https://huggingface.co/mlx-community/Kimi-K2-Thinking - and here it is running on two M3 Ultras: https://x.com/awnihannun/status/1986601104130646266


Epyc Genoa CPU/mobo + 700GB of DDR5 RAM. The model is a MoE, so you don't need to stuff it all into VRAM; you can use a single 3090/5090 to hold the activated weights and keep the remaining weights in DDR5 RAM. See their deployment guide for reference here: https://github.com/kvcache-ai/ktransformers/blob/main/doc/en...


Consumer-grade hardware? Even at 4 bits per param you would need 500GB of GPU VRAM just to load the weights. You also need VRAM for the KV cache.
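
Back of the envelope, assuming K2's roughly 1T total parameters:

    // 1e12 params * 4 bits / 8 bits-per-byte = 5e11 bytes, i.e. ~500 GB for the weights alone
    const params = 1e12;                        // approximate Kimi K2 total parameter count
    console.log((params * 4) / 8 / 1e9, "GB");  // 500 GB, before any KV cache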


It's MoE-based, so you don't need that much VRAM.

Nice if you can get it, of course.

