Oh, and I agree so much. I just shared a quick first observation from a real-world testing scenario (BTW, I re-ran Sonnet 4.5 with the same prompt; not much changed). I keep seeing LLM providers optimize for benchmarks, but then I can't reproduce their results in my own projects.
I will keep trying, because Claude 4 is generally a very strong line of models. Anthropic held the AI coding throne for months before OpenAI, with GPT-5 and Codex CLI (and now GPT-5-Codex), dethroned them.
And sure, I do want them to keep competing and making each other better.
What would be the difference in prompts/info for Claude vs ChatGPT? Is this just based on anecdotal stuff, or is there actually something I can refer to when writing prompts? I mostly use Claude, but don't really pay much attention to the exact wording of the prompts.
1. Different LLMs require different prompts and information.
2. They ignore LLMs' non-determinism; you should run the experiment several times (see the sketch below).
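A minimal sketch of what point 2 means in practice, assuming the `openai` Python client and an API key in the environment (the model name and prompt here are placeholders, not from the original discussion): sample the same prompt several times and look at the spread of answers instead of judging from one run.

```python
# Minimal sketch: repeat the same prompt N times, since a single run
# tells you little about a non-deterministic model.
from collections import Counter

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

N_RUNS = 5
PROMPT = "Refactor this function to be idiomatic Python: ..."  # placeholder prompt

outputs = []
for _ in range(N_RUNS):
    response = client.chat.completions.create(
        model="gpt-5",  # placeholder model name
        messages=[{"role": "user", "content": PROMPT}],
    )
    outputs.append(response.choices[0].message.content)

# Inspect the variation across runs instead of trusting one sample.
for i, text in enumerate(outputs, 1):
    print(f"--- run {i} ---\n{text}\n")
print("distinct outputs:", len(Counter(outputs)))
```

For coding tasks specifically, a stricter version of this would run the generated code against a test suite each time and report a pass rate over the N runs, rather than eyeballing the text.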