Yes. F-1 students can get H-1B visas. The issue is the $100K payment, which applies if the H-1B petition is filed not with a request to change status but with a request to notify a consulate. But if it's filed with and approved as a change of status, then a subsequent H-1B visa application will not trigger the $100K payment.
The real question I have, after seeing the usage rug get pulled, is what this costs and how usable it ACTUALLY is with a Claude Max 20x subscription. In practice, Opus is basically unusable for anyone paying enterprise prices. And the changes to the "usage" quotas have made the platform fundamentally unstable, and honestly left me personally feeling like I was cheated by Anthropic...
Understanding precisely why Gemini 3 isn't at the front of the pack on SWE Bench is really what I was hoping to get out of this. Especially for a blog post targeted at software developers...
>"It doesn't matter, the real benchmark is taking the community temperature on the model after a few weeks of usage."
Indeed. It's almost impossible to truly know a model before spending a few million tokens on a real-world task. At this point it will take a step-change advancement for me to trust anything but Claude.
To this day, I still don't understand why Claude gets more acclaim for coding. Gemini 2.5 consistently outperformed Claude and ChatGPT mostly because of the much larger context.
I'm not sure about this. I used Gemini and Claude for about 12 hours a day for a month and a half straight in an unhealthy programmer bender, and Claude was FAR superior. It was not really that close. Going to be interesting to test Gemini 3 though.
Gemini 2.5 is prone to apology loops, and often confuses its own thinking with user input, replying to itself. GPT-5 likes to refuse tasks with "sorry, I can't help with that", at least in VS Code's GitHub Copilot agent mode. Claude hasn't screwed up like that for me.
Different styles of usage? I see Gemini praised for letting you feed it the whole project and ask for changes. Which is cool and all, but... I never do that. Claude, for me, is better for specific modifications to specific parts of the app. There's a lot of context behind what's "better".
I can't really explain why I have barely used Gemini.
I think it was just timing with the way models came out. This will be the first time I will have a Gemini subscription and nothing else. This will be the first time I really see what it can do fully.
Gemini 2.5, and now 3, seem to continue their trend of being horrific at agentic tasks, but they almost always impress me on the first single-shot request.
Claude Sonnet is way better about following up and making continuous improvements during a long running session.
For some reason Gemini will hard-freeze on the most random queries, and when it does manage to continue past the first call, it only keeps a weird summarized version of its previous run available to itself, even though the full run is in the payload. It's a weird model.
My take is that it's world-class at one-shotting, and if a task benefits from that, absolutely use it.
I use Gemini CLI, Claude Code and Codex daily. If I present the same bug to all three, Gemini is often the one missing part of the solution or drawing the wrong conclusion. I'm curious about Gemini 3.
Claude doesn't gaslight me or flat-out refuse to do something I ask it to because it believes it won't work anyway. Gemini does.
Gemini also randomly just reverts everything because of some small mistake it found, and makes assumptions without checking whether they're true (e.g. this lib absolutely HAS TO HAVE a login() method, and if we get a compile error it's my env setup's fault).
Yeah, they mention a benchmark I'm seeing for the first time (Terminal-Bench 2.0), which they're supposedly leading, while for some reason the SWE Bench score is down from Sonnet 4.5.
Curious to see some third-party testing of this model. Currently it seems to improve primarily on "general non-coding and visual reasoning", based on the benchmarks.
Idk, Sonnet 4.5 scores better than Sonnet 4.0 on that benchmark, but it is markedly worse in my usage. The utility of the benchmark is fading as it gets gamed.
Maybe, if you conform to its expectations for how you use it. 4.5 is absolutely terrible at following directions, thinks it knows better than you, and will gaslight you until specifically called out on its mistake.
I have scripted prompts for long duration automated coding workflows of the fire and forget, issue description -> pull request variety. Sonnet 4 does better than you’d expect: it generates high quality mergable code about half the time. Sonnet 4.5 fails literally every time.
I'm very happy with it TBH, but it has some things that annoy me a little bit:
- slower compared to other models that will also do the job just fine (though it excels at more complex tasks),
- it's very insistent on creating loads of .md files with overly verbose documentation on what it just did (not really what I ask it to do),
- it actually deleted a file twice and went "oops, I accidentally deleted the file, let me see if I can restore it!". I haven't seen this happen with any other agent, and the task wasn't even remotely about removing anything.
The last point is how it usually fails in my testing, fwiw. It usually ends up borking something up, and rather than back out and fix it, it does a 'git restore' on the file - wiping out thousands of lines of unrelated, unstaged code. It then somehow thinks it can recover this code by looking in the git history (??).
And yes, I have hooks to disable 'git reset', 'git checkout', etc., and warn the model not to use these commands and why. So it writes them to a bash script and calls that to circumvent the hook, successfully shooting itself in the foot.
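For anyone curious, the kind of hook I mean is roughly this (a sketch, assuming Claude Code's PreToolUse contract of tool-call JSON on stdin and exit code 2 to deny; verify against the current docs before relying on it):

    // PreToolUse hook: reject destructive git commands before the agent runs them.
    // Assumption: Claude Code pipes the tool call as JSON to stdin and treats
    // exit code 2 as "deny", feeding stderr back to the model.
    const event = JSON.parse(await Bun.stdin.text());
    const command: string = event?.tool_input?.command ?? "";

    const banned = [/\bgit\s+reset\b/, /\bgit\s+checkout\b/, /\bgit\s+restore\b/, /\bgit\s+clean\b/];
    if (banned.some((re) => re.test(command))) {
      console.error(`Blocked "${command}": unstaged work lives in this tree; use the checkpointing skill instead of discarding changes.`);
      process.exit(2); // deny the tool call
    }
    process.exit(0); // allow everything else

Which works right up until it decides to wrap the command in a throwaway script, as above.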
Sonnet 4.5 will not follow directions. Because of this, you can't prevent it from doing something that destroys the worktree state the way you could with earlier models. For longer-running tasks, the probability of it doing this at some point approaches 100%.
> The last point is how it usually fails in my testing, fwiw. It usually ends up borking something up, and rather than back out and fix it, it does a 'git restore' on the file - wiping out thousands of lines of unrelated, unstaged code. It then somehow thinks it can recover this code by looking in the git history (??).
Man I've had this exact thing happen recently with Sonnet 4.5 in Claude Code!
With Claude I asked it to try tweaking the font weight of a heading to put the finishing touches on a new page we were iterating on. Looked at it and said, "Never mind, undo that" and it nuked 45 minutes worth of work by running git restore.
It immediately realized it fucked up and started running all sorts of git commands and reading its own log trying to reverse what it did, and then came back 5 minutes later saying "Welp, I lost everything, do you want me to manually rebuild the entire page from our conversation history?"
In my CLAUDE.md I have instructions to commit unstaged changes frequently but it often forgets and sure enough, it forgot this time too. I had it read its log and write a post-mortem of WTF led it to run dangerous git commands to remove one line of CSS and then used that to write more specific rules about using git in the project CLAUDE.md, and blocked it from running "git restore" at all.
We'll see if that did the trick but it was a good reminder that even "SOTA" models in 2025 can still go insane at the drop of a hat.
The problem is that I'm trying to build workflows for generating sequences of good, high quality semantically grouped changes for pull requests. This requires having a bunch of unrelated changes existing in the work tree at the same time, doing dependency analysis on the sequence of commits, and then pulling out / staging just certain features at a time and committing those separately. It is sooo much easier to do this by explicitly avoiding the commit-every-2-seconds workaround and keeping things uncommitted in the work tree.
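For concreteness, the end state I'm going for is something like this (a sketch using Bun's shell; the groups, file paths, and messages are made up, since in reality they come out of the dependency analysis):

    // Stage and commit one semantic group at a time, leaving the rest of the
    // worktree untouched. Hypothetical groups; the real ones are derived from
    // analyzing the uncommitted changes.
    import { $ } from "bun";

    type ChangeGroup = { message: string; files: string[] };

    const groups: ChangeGroup[] = [
      { message: "refactor: extract auth client", files: ["src/auth/client.ts"] },
      { message: "feat: add retry logic to login", files: ["src/auth/retry.ts"] },
    ];

    for (const group of groups) {
      for (const file of group.files) {
        await $`git add ${file}`; // stage only this group's files
      }
      await $`git commit -m ${group.message}`; // commit them on their own
    }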
I have a custom checkpointing skill I've written, which it is usually good about using and which makes it easier to rewind state. But that requires a careful sequence of operations, and I haven't been able to get 4.5 to not go insane when it screws up.
As I said though, watch out for it learning that it can't run git restore, so it immediately jumps to Bash(echo "git restore" >file.sh && chmod +x file.sh && ./file.sh).
In this case I can't get 4.5 to follow directions. Neither can anyone else, apparently. Search for "Sonnet 4.5 follow instructions" and you'll find plenty of examples. The current top 2:
Honestly, I'm inclined to think a lot of the people who are wowed by benchmarks and simple tech demos probably aren't doing very much at their day job, or they're working on simple codebases or ones that don't have very many users (more users == more bugs found). When you throw these models at complex software projects like SOAs, big object-oriented codebases, etc., their output can be totally unusable.
I coded it in bun with openrouter(dot)ai. I have an array of benchmarks, and each benchmark has a grader (for example, checking if the output equals a certain string, or grading the answer automatically using another LLM). Then I save all the results to a file and render the percentage correct to a graph.
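Stripped down to the skeleton, it looks something like this (a sketch; the model ID, test cases, and file name are just illustrative):

    // Each benchmark pairs a prompt with a grader; answers come from OpenRouter's
    // chat completions endpoint, and the pass rate is written out at the end.
    type Benchmark = {
      prompt: string;
      grade: (answer: string) => boolean | Promise<boolean>;
    };

    const benchmarks: Benchmark[] = [
      { prompt: "What is 17 * 23? Answer with just the number.", grade: (a) => a.trim() === "391" },
      { prompt: "Name the capital of Australia in one word.", grade: (a) => /canberra/i.test(a) },
    ];

    async function ask(model: string, prompt: string): Promise<string> {
      const res = await fetch("https://openrouter.ai/api/v1/chat/completions", {
        method: "POST",
        headers: {
          Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`,
          "Content-Type": "application/json",
        },
        body: JSON.stringify({ model, messages: [{ role: "user", content: prompt }] }),
      });
      const data = await res.json();
      return data.choices[0].message.content;
    }

    const model = "google/gemini-2.5-pro"; // swap in whatever you're comparing
    const results: { prompt: string; answer: string; pass: boolean }[] = [];
    for (const b of benchmarks) {
      const answer = await ask(model, b.prompt);
      results.push({ prompt: b.prompt, answer, pass: await b.grade(answer) });
    }

    const pct = (100 * results.filter((r) => r.pass).length) / results.length;
    await Bun.write("results.json", JSON.stringify({ model, pct, results }, null, 2));
    console.log(`${model}: ${pct.toFixed(1)}% correct`);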
What's wild here is that among all the scores they've absolutely killed, Anthropic and Claude Sonnet 4.5 somehow held onto a single victory in the fight: SWE Bench Verified, and only by a single point.
I already enjoy Gemini 2.5 Pro for planning, and if Gemini 3 is priced similarly, I'll be incredibly happy to ditch the painfully pricey Claude Max subscription. To be fair, I've already got an extremely sour taste in my mouth from the last Anthropic bait-and-switch on pricing and usage, so I'm happy to see Google take the crown here.
SWE bench is weird because Claude has always underperformed on it relative to other models despite Claude Code blowing them away. The real test will be if Gemini CLI beats Claude Code, both using the agentic framework and tools they were trained on.
What we really, desperately need is more context pruning from these LLMs: the ability to pull irrelevant parts out of the context window as a task is brought into focus.
Once the MLX community get their teeth into it you might be able to run it on two 512GB M3 Ultra Mac Studios wired together - those are about $10,000 each though so that would be $20,000 total.
Epyc Genoa CPU/Mobo + 700GB of DDR5 ram.
The model is a MoE, so you don't need to stuff it all into VRAM: you can use a single 3090/5090 to hold the activated weights and keep the remaining weights in DDR5 RAM. See their deployment guide for reference: https://github.com/kvcache-ai/ktransformers/blob/main/doc/en...