Gemini is great when you have gitingested the code of a PyPI package and want to use it as context. This comes in handy for tasks and repos outside the model's training data.
5.1 Codex I use for narrowly defined tasks where I can just fire and forget. For example, Codex will troubleshoot why a websocket isn't working by running its own curl requests within Cursor, or by exec'ing into the Docker container to debug at a level that would take me much longer.
Claude 4.5 Opus is a model that feels trustworthy to me for heavy refactors of codebases or for modularizing sections of code to make them more manageable. Often it seems like the model doesn't leave any details out, and functionality isn't lost or degraded.
I think Opus 4.5 did a bit better overall, but I do think frontier models will eventually converge to a point where the quality is so good it will be hard to tell the winner.
"Pelican on bicycle" is one special case, but the problem (and the interesting point) is that with LLMs, they are always generalising. If a lab focussed specially on pelicans on bicycles, they would as a by-product improve performance on, say, tigers on rollercoasters. This is new and counter-intuitive to most ML/AI people.
The gold standard for cheating on a benchmark is SFT on the benchmark data while ignoring the memorization. That's why the standard for quickly testing for benchmark contamination has always been to switch out specifics of the task.
Like replacing named concepts with nonsense words in reasoning benchmarks.
I have tried combinations of hard-to-draw vehicles and animals (crocodile, frog, pterodactyl; riding a hang glider, a tricycle, skydiving), and it did a rather good job in every case (compared to previous tests). Whatever they have done to improve on that point, they did it in a way that generalises.
It hadn't occurred to me until now that the pelican could overcome the short-legs issue by not sitting on the seat and instead putting its legs inside the frame of the bike. That's probably closer to how a real pelican would ride a bike, even if it wasn't deliberate.
You could! But as others have mentioned, the performance gain would be negligible. If you really wanted to see more of a boost from pretraining, you could try to build a bigger chunk of data to train on, either by generating synthetic data from your material or by finding information adjacent to it. Here's a good paper about that approach:
<https://arxiv.org/abs/2409.07431>
The required login/sign-up, the privacy policy, and the lack of apparent open-sourcing seem antithetical to the average Linux user. You're going after a niche of a niche of a niche with this one, good luck lol.
I'd argue that the average Linux user likely knows how to use vim for the most basic editing but isn't necessarily motivated to learn vim. Intermediate users will be able to name a few modes in vim and navigate somewhat efficiently; that's about it. Only advanced users and those who really want to master vim (in other words, hardcore nerds) will try to make the most out of vim and use as few strokes as possible to navigate/edit, which is what these tools/sites are for. That's a few "niches" there.
I think once you start trying to use the occasional macro and/or make custom keybinds, it pushes you further into the vim golf mindset. When you're saving an action to be repeated 100 times, you really gotta get it right. I learned a lot of advanced movements through macros as well, like } and ) and marks (I only recently learned that apostrophe jumps to the marked line while backtick jumps to the marked character on that line, after years of always using apostrophe).

I recently spent half an hour or so making two keybinds to insert the date/time in my preferred format at the end or start of a line and then return the cursor to where it was before. While about half of the process was the same for both binds, I ran into multiple issues with the start-of-line version. For example, `I` (insert at start of line) in neovim places your cursor after leading whitespace instead of before it, so I had to use 0 instead and insert relative to that. I also found out that marks are based on the number of characters into the line, so if you add new stuff to the start of the line and then return to your mark, you won't be on the same word: 14 characters in before, 14 characters in now. I worked around that by counting how many characters my date text plus spaces added, then appending that number and l (move right) to the end of the keybind to make up for the difference. It was pretty satisfying when it finally worked.
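For the curious, a rough sketch of what binds along those lines could look like in Vimscript (this is not my exact setup; the <leader> mappings, function names, and date format are just placeholders):

```vim
" Append a timestamp at the end of the current line, then restore the cursor.
function! s:StampEnd() abort
  let pos = getpos('.')
  execute "normal! A " . strftime('%Y-%m-%d %H:%M')
  call setpos('.', pos)
endfunction

" Prepend a timestamp at the start of the current line, then restore the cursor.
function! s:StampStart() abort
  let stamp = strftime('%Y-%m-%d %H:%M') . ' '
  let pos = getpos('.')
  " 0 goes to the true first column; I would skip past leading whitespace.
  execute "normal! 0i" . stamp
  " The saved column is a byte offset into the line, so shift it right by
  " the length of the inserted text to land back on the same spot.
  let pos[2] += len(stamp)
  call setpos('.', pos)
endfunction

nnoremap <silent> <leader>te :call <SID>StampEnd()<CR>
nnoremap <silent> <leader>ts :call <SID>StampStart()<CR>
```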
You are truly a vim master. Yes, that's exactly the reason why I used containers to host the vim instances, as using a DOM based vim library wouldn't record each stroke accurately. Thank you for trying my site out.
The fact that it doesn't change the rest of the image the way 4o image gen does is incredible. Often when I try to tweak someone's clothing using 4o, it also tweaks their face. This one seems to apply those recognizable AI artifacts only to the elements that need to be edited.