I have a fairly large web application with ~200k LoC.
Gave the same prompt to Sonnet 4.5 (Claude Code) and GPT-5-Codex (Codex CLI).
"implement a fuzzy search for conversations and reports either when selecting "Go to Conversation" or "Go to Report" and typing the title or when the user types in the title in the main input field, and none of the standard elements match, a search starts with a 2s delay"
Sonnet 4.5 went really fast at ~3min. But what it built was broken and superficial. It did not even manage to reuse the already existing auth and started re-building auth server-side instead of looking at how other API endpoints do it. Even re-prompting and telling it where it went wrong did not help much. No tests were written (despite the project rules requiring it).
GPT-5-Codex needed MUCH longer at ~20min, but the changes were much more profound: it implemented proper error handling, covered lots of edge cases, and wrote tests without me prompting it to do so (the project rules already require it). API calls ran smoothly. The entire feature worked perfectly.
My conclusion: GPT-5-Codex is the clear winner, and it's not even close.
I will take the 20 minutes every single time, knowing that the result feels like work done by a senior dev.
The 3 minutes surprised me, and I was hoping to see great results in such a short period of time. But of course, a quick & dirty, buggy implementation with no tests is not what I wanted.
I'm not trying to be offensive here; I just feel the need to say that up front.
But that prompt leads me to believe that you're going to get rather 'random' results due to leaving SO much room for interpretation.
Also, in my experience, punctuation is important - particularly for pacing and grouping of logical 'parts' of a task - and your prompt reads like a run-on sentence.
Making a lot of assumptions here - but I bet if I were in your shoes, looking to write a prompt to start a task of a similar type, my prompt would have been 5 to 20x the length of yours (depending on complexity and importance), with far more detail, including overlapping descriptions of various tasks (i.e., potentially describing the same thing more than once in different ways in relation to other things, to establish relation/hierarchy).
I'm glad you got what you needed - but these types of prompts and approaches are why I believe so many people think these models aren't useful.
You get out of them what you put into them. If you give them structured, well-written requirements, as well as a codebase that utilizes consistent patterns, you're going to get back something relative to that. No different than a developer - if you gave a junior coder, or a team of developers, the following as a feature requirement: `implement a fuzzy search for conversations and reports either when selecting "Go to Conversation" or "Go to Report" and typing the title or when the user types in the title in the main input field, and none of the standard elements match, a search starts with a 2s delay` then you can't really be mad when you don't get back exactly what you wanted.
edit: To put it another way - spend a few more minutes on the initial task/prompt/description of your needs and you're likely to get back more of what you're expecting.
I think that is an interesting observation and I generally agree.
Your point about prompting quality is very valid, and for larger features I always use PRDs that are 5-20x the length of this prompt.
The thing is, my "experiment" represents a fairly common use case: this feature is actually pretty small and embeds into a pre-existing UI structure - in a larger codebase.
GPT-5-Codex allows me to write a pretty quick & dirty prompt, yet still get VERY good results. Not only does it work on the first try; Codex is reliably better at understanding the context and doing the things that are common best practice in professional SWE projects.
If I want to get something comparable out of Claude, I would have to spend at least 20 minutes preparing the prompt, if not more.
> The thing is my "experiment" is one that represents a fairly common use case
Valid as well. I guess I'm just nitpicking; how often I see people saying these models aren't useful, combined with seeing this example, triggered my "you're doing it wrong" mode :D
> GPT-5-Codex allows me to write a pretty quick & dirty prompt, yet still get VERY good results.
I have a reputation with family and co-workers of being quite verbose - this might be why I prefer Claude (though I haven't tried Codex in the last month or so). I'm typically setting up context, spending a few minutes writing an initial prompt, and iterating/adjusting on the approach in planning mode so that I _can_ just walk away (or tab out) and let it do its thing, knowing that I've already reviewed its approach and have a reasonable amount of confidence that it's taking an approach that seems logical.
I should start playing with codex again on some new projects I have in mind where I have an initial planning document with my notes on what I want it to do but nothing super specific - just to see what it can "one shot".
Yeah, as someone who has been using Claude Code for about 4 months now, I’ve adopted a “be super specific by default”-workflow. It works very well.
I typically use zen-mcp-server’s planning mode to scope out these tasks, refine and iterate on a plan, clear context, and then trigger the implementation.
There’s no way I would have considered “implement fuzzy search” a small feature request. I’m also paranoid about introducing technical debt / crappy code, which in my experience is the #1 reason that LLMs typically work well for new projects but start to degrade after a while: there’s just a lot of spaghetti and debt built up over time.
I tend to tell Claude to research what is already there, and to think hard, and that gives me much better per-prompt results.
But you are right that codex does that all by default. I just get frustrated when I ask it something simple and it spends half an hour researching code first.
Some do this by using tools like RepoPrompt to read entire files into GPT-5 Pro, and then using GPT-5 Pro to send the relevant context and a work plan to Codex so that it can skip poking around files. If you give it the context, it won't spend that time looking for it. But then you spend that time with Pro instead (which can ingest entire files at once rather than searching through them, and can provide a better plan for Codex).
It worked on the first try, but did it work on the second?
I noticed in conversations with LLMs, much of what they come up with is non-deterministic. You regenerate the message and it disappears.
That appears to be the basic operating principle of the current paradigm. And agentic programming repeats this dice roll dozens or hundreds of times.
I don't know enough about statistics to say if that makes it better (converging on the averages?) or worse (context pollution, hallucinating, focusing on noise?), but it seems worth considering.
I would think that to truly rank such things, you should run a few tests and look for a clear pattern. It's possible that something prompted Claude to take "the easy way" while ChatGPT didn't.
This would explain the LLM implementing the feature in a way you didn't prefer. But this does not explain why Sonnet would deliver a broken implementation that does not work in even the most basic sense.
Also, there is a threshold at which the time it takes to develop a prompt, let the agent run, review its output, and go through iterative loops to correct errors or implementation problems can exceed the time it takes me (a lazy human) to achieve the same end result.
Pair this with the bypassing of the generation effect, reduced prefrontal dopamine, and increased working memory load (in part due to minimal motor-cognitive integration), and AI-generated code in contexts with legal and financial repercussions can be a much worse deal than using your own fingers.
> But this does not explain why Sonnet would deliver a broken implementation that does not work in even the most basic sense.
Depends not just on the prompt but also on the tooling/environment you use. Somebody using the Claude Code CLI may get a totally different experience than somebody using Copilot via VS Code.
What do I mean by that? Look at how Copilot tries to save money by reading content only in small parts: file X lines 1-50, then X lines 51-100, ... and it starts working with that. Only if it finds a hint about something somewhere else will it read in more context.
What I often see is that it misses context because it reads in such limited information, and if there is no hint in your code or code docs, it stops there. It runs a local test on the code, the test passes, done... while it has technically broken your application.
Example: if I tell it to refactor an API, it never checks whether that API is used anywhere else, because it only reads in that API's code. So I need to manually add a reminder to the prompt: "the API is used elsewhere in the system". And then it does its searching/... Found 5 files, Read X line 1...
And plop, good working code... So if you know this limitation, you can go very far with basic $10 Copilot Claude agent usage.
Whereas a $200 Claude Code will give you a better experience out of the box, as it reads in a ton more. The same applies to GPT-5/Codex, which seems to be more willing to read in a larger context of your project, thus resulting in less incomplete code.
This is just anecdotal from my point of view, but as with any LLM, hinting matters a lot. It's less about writing a full prompt with a ton of text and more about including the right "do not forget about function X, module Y, and test Z". And Claude loves its hints on Copilot because of that limited reading.
> I bet if I were in your shoes and looking to write a prompt to start a task of a similar type that my prompt would have been 5 to 20x the length of yours
Why would you need such extensive prompting just to get the model to not re-implement authentication logic, for example? It already has access to all of the existing code, shouldn't it just take advantage of what's already there? A 20x longer prompt doesn't sound like a satisfying solution to whatever issue is happening here.
> shouldn't it just take advantage of what's already there?
It's not a good idea to have any coding agent pull an unnecessary number of lines into the context window just to understand your code base.
Performance of all LLMs drops drastically when the context window is nearly full. The purpose of being more specific with your prompts is that you spend a few more tokens up front to make the task a lot more efficient and more likely to result in success.
At least that's how it is today. We're probably a breakthrough or two away from the type of vibe coding experience non-coders want. Or it may never happen, and the developers who have coding knowledge will be the only ones to fully utilize coding agents and it will only become more powerful over time.
I'm not sure exactly what you mean by the vibe coding experience non-coders want, but if it's one-shotting a buildable codebase off of an unspecific prompt, the major breakthrough would have to be brain-computer interfaces so the agent can literally read the user's mind.
If that same person approached a software development company with the same prompt, without following up with any other details, they wouldn't get good code back, either. You're not saying it, but this idea that in the future you can tell a computer something like "create photoshop" and get what you're expecting is an unrealistic dream that would need mind-reading, or a major breakthrough and paradigm shift in understanding and interpreting language.
> the major breakthrough would have to be brain-computer interfaces so the agent can literally read the user's mind.
And even that would not be enough.
In reality, it would have to put the user to sleep and go through various dream scenarios to have the user's brain really build an internal model that is not there in the first place. No brain interface can help find what is not there.
We usually need interactions with reality to build the internal model of what we actually want step by step, especially for things we have not done before.
Even for info that is there, there's also a limit to fantasy or sci-fi brain scanning. The knowledge is not stored like in a RAM chip, even when it is there. You would have to simulate the brain to actually go through the relevant experiences to extract the information. Predicting the actual dynamic behavior of the brain would require some super-super sub-molecular-level scan and then correctly simulating it, since what the neurons will actually do depends on much more than the basic wiring. Aaaaand you may get a different result depending on the time of day, how well they slept, their mood, when and what the person ate, what news they recently read, etc. :)
That is also not enough. An agent could build an application that functions, but you also need a well-designed underlying architecture if you want the application to be extensible and maintainable - something the original dreamer may not even be capable of - so perhaps a shared extended dream with a Sr. architect is also needed. Oh wait... I guess we're back to square 1 again? lol
Well, I don't have the context myself about what's happening in this example, though I don't see anything about auth myself.
And I left that window at 5-20x because, again, no real context. But unless I was already in the middle of a task and I was giving direction that there was already context for - my prompt is generally almost never _this_ short. (referring to the prompt in the top level comment)
> A 20x longer prompt doesn't sound like a satisfying solution to whatever issue is happening here.
It wouldn't be, given the additional context provided by the author in a sibling comment to yours. But if you had specific expectations for the resulting code/functionality, that 20x longer prompt is likely to save you time and energy on the back-and-forth adjustments you might have to make otherwise.
You're critiquing OP for not playing along with how the models currently work (bad at gathering context on their own). Sure, if you bend over backwards and hop on one foot, you can get them to do what you want.
OP is critiquing the model as a product vs. the marketing promises. The model should be smart enough to gather context about the project to implement features properly on their own, if they are ever going to 'write 90% of all code THIS YEAR' as people like the founder of Anthropic claim.
> but these types of prompts and approaches are why I believe so many people think these models aren't useful.
100% agree. The prompt is a 'yolo prompt'. For that task you need to give it bullet points on what to do so it can deduce its task list, provide files or folders in context with @, tell it how to test the outcome so it knows when it has succeeded (closing the feedback loop), and guide the implementation either via memory or via context, indicating which existing libs or methods it should call on.
For greenfield tasks and projects I even provide architectural structure, interfaces, etc.
After reading Twitter, Reddit and HN complaints about models and coding tools, I've come to the same conclusion as you.
That fact is pretty useless for drawing any useful conclusions from with just one random, not-so-great example. Yes, it's an experiment and we got a result. And now what? If I want reliable work results I would still go with the strategy of being as concrete as possible, because in all my AI activities, anything else makes the results more and more random. For anything non-standard (i.e., not something you could copy & paste directly from a Google or SO result), no matter how simple, I'd better provide the basic step-by-step algorithm myself and leave only the actual implementation to the AI.
> For that task you need to give it points in what to do so it can deduce it's task list, provide files or folders in context with @…
- and my point is that you do not have to give ChatGPT those things. GP did not, and they got the result they were seeking.
That you might get a better result from Claude if you prompt it 'correctly' is a fine detail, but not my point.
(I've no horse in this race. I use Claude Code and I'm not going to switch. But I like to know what's true and what isn't and this seems pretty clear.)
But isn't the end goal to be able to get useful results without so much prompting?
I mean in the movies for example, advanced AI assistants do amazing things with very little prompting. Seems like that's what people want.
To me, the fact that so many people basically say "you are prompting it wrong" is a knock against the tech and the model. If people want to say that these systems are so smart at what they can do, then they should strive to get better at understanding the user without needing tons of prompts.
Do you think his short prompt would be sufficient for a senior developer? If it's good enough for a human, it should be good enough for an LLM IMO.
I don't want to take away the ability to use tons of prompting to get the LLM to do exactly what you want, but I think that the ability for an LLM to do better with less prompting is actually a good thing and useful metric.
> But isn't the end goal to be able to get useful results without so much prompting?
See below about context.
> I mean in the movies for example, advanced AI assistants do amazing things with very little prompting. Seems like that's what people want.
Movies != real life
> To me, the fact that so many people basically say "you are prompting it wrong" is knock against the tech and the model. If people want to say that these systems are so smart at what they can do, then they should strive to get better at understanding the user without needing tons of prompts.
See below about context.
> Do you think his short prompt would be sufficient for a senior developer? If it's good enough for a human it should be good enough for a LLM IMO.
Context is king.
> I don't want to take away the ability to use tons of prompting to get the LLM to do exactly what you want, but I think that the ability for an LLM to do better with less prompting is actually a good thing and useful metric.
What I'm understanding from your comments here is that you should just be able to give it broad statements and it should interpret that into functional results. Sure - that works incredibly well, if you provide the relevant context and the model is able to understand it and properly associate it where needed.
But you're comparing the LLMs to humans (this is a problem, but not likely to stop, so we might as well address it) - but _what_ humans? You ask if that prompt would be sufficient for a senior developer - absolutely, if that developer already has the _context_ of the project/task/features/etc. They can _infer_ what's not specified. But if you give that same prompt to a jr dev who maybe has access to the codebase and has poked around inside the working application once or twice but has no real in-depth experience with it - they're going to _infer_ different things. They might do great, they might fail spectacularly. Flip a coin.
So - with that prompt in the top-level comment - if that LLM is provided excellent context (via AGENTS.md/attached files/etc.) then it'll most likely do great, especially if you aren't looking for specifics in the resulting feature outside of what you mentioned, since it _will_ have to infer some things. But if you're just opening Codex/CC without a good CLAUDE.md/AGENTS.md and feeding it a prompt like that, you have to expect quite a bit of variance in what you get - exactly the same way you would with a _human_ developer.
Your context and prompt are the project spec. You get out what you put in.
These things are being marketed as super intelligent magic answer machines. Judging them using the criteria the marketing teams have provided is completely reasonable.
> Movies != real life
Nobody claimed it was. This is about desires and expectations. The people charging money for these services - and taking stacks of cash that would’ve otherwise been in devs’ paychecks while doing so - haven’t even tried to temper those expectations. They made their beds…
Quick data point: I've been able to get LLMs (recently whichever one Claude gives me) to produce amazingly useful results for the purpose of understanding complex codebases, just by asking them to look at the code and tell me how it does xyz. No complicated long prompt. Basically exactly what I'd say to a human.
I have to agree with this assessment. I am currently going at the rate of 300-400 lines of spec for 1,000 LOC with Claude Code. Specs are AI-assisted also, otherwise you might go crazy. :-) Plus 2,000+ lines of AI-generated tests. Pretty restrictive, but then it works just fine.
When asking for change, there are the choices you know about and the ones you don't. I've gotten in the habit of describing some of the current state as well as my desired state, and using that to focus the LLM on the areas I'd like to have a stronger voice in.
Of course, I definitely appreciate when it makes choices that I don't know I need to make, and it chooses reasonable defaults.
I mean, I could say "make the visualization three columns", but there's a lot of ambiguity in that kind of thing, and the LLM is going to make a lot of choices about my intent.
Instead, "each team/quarter currently has a vertically stacked list of people assigned to that team, with two columns (staying on team, leaving team). change that to being three columns (entering team, staying on team, leaving team)."
As a bonus, it's much, much clearer to somebody reading the history later what the intent was.
tbh, I don't really understand it well enough to give a response here. But here's a real prompt I just used on a project, copy/pasted:
```
Something that seems to have been a consistent gotcha when working with LLMs on this project is that there's no specific `placement` column on the table that holds the 'results' data. Our race_class_section_results table has its rows created in placement order - so placement is inferred via the order relative to other records in the same race_class_section. But this seems to complicate things quite a bit at times when we have a specific record/entry and want to know its placement - we have to query the rest of them and/or include joins and other complications if we want to filter results by placement, etc.

Can you take a look at how this is handled, both in the querying of existing data by views/livewire components/etc and in how we're storing/creating the records via the import processes, and give me a determination on whether you think it should be refactored to include a `placement` column on the database? I think right now we've got 140,000 or so records on that table covering nearly 20 years' worth of race records, so I don't think we need to be too concerned with the performance of the table or added storage or anything. Think very hard, understand that this would be a rather major refactor of the codebase (I assume, since it's likely used/referenced in _many_ places - thankfully most of the complicated queries it would be found in could be easily identified by just searching the codebase for the race_class_section_results table) and determine whether that would be worth it for the ease of use/query simplification moving forward.
```
This comes with a rather developed CLAUDE.md that includes references to other .md documents that outline various important aspects of the application that should be brought into context when working in those areas.
This prompt was made in planning mode - the LLM will then dig into the code/application to understand things and, if needed, ask questions and give options to weigh before returning with a 'plan' for how to approach it. I then iterate on that plan with it before eventually accepting a plan that it will then begin work on.
That's kind of expected for me, but Codex feels more like a vibe-coding tool and Claude Code more like AI-assisted development.
And I actually like Claude more because of that.
Codex will indeed work more independently, but you will have a hard time when the result isn't what you want. It will use a Python script just to do simple edits in files (instead of a simple search-and-replace for unique code snippets in small files), and when it's wrong, good luck convincing it otherwise (it already has some outdated info, e.g. about the latest Docker image releases, and convincing it that the Debian base had changed was challenging).
It uses context more effectively, but it lacks explanations of why it is doing what it is doing; asking it to explain will just cause it to do something else without a word.
And of course there's the lack of proper permissions for running commands. The sandbox is cool, but I do not want it to be able to commit; I want it to just edit files, or at least to have some more control over what it does.
You can run Codex as an MCP server; I prefer adding it to Claude and asking for a cooperative plan. Codex will do a great analysis and plan, and I can comfortably work with Claude on code that matches my style.
Same experience here. In the last week I've successfully managed to build a complete C++20 XPath 1.0 parser with Codex, and am now onto supporting XPath 2.0. Codex has been nailing it time and again - the only caveat is that I have to use their cloud version as local execution is buggy.
Sonnet on the other hand gets tripped up constantly due to the complexity. I'm not seeing the improvement I was hoping for with 4.5, and it's just given up on attempting to implement support for date-time functionality. Codex has nailed the same task, yet Anthropic claim OpenAI have the inferior product?
I'm pretty sad about this as I'm rooting for Anthropic and would have loved to see them at least match Codex with this new release. If OpenAI stays on this course then Claude Code is toast without an important breakthrough. It doesn't help that they're also the more expensive product and have service-quality issues pushing people to quit the platform.
I'm thinking about switching to ChatGPT Pro also. Any idea what maxes it out before I need to pay via the API instead? For context I'm using about 1b tokens a month so likely similar to you by the sounds of things.
On the Pro tier I have not been able to trigger the usage cap.
Pro
Local tasks: Average users can send 300-1,500 messages every 5 hours with a weekly limit.
Cloud tasks: Generous limits for a limited time.
Best for: Developers looking to power their full workday across multiple projects.
Thank you, that's very helpful. I think I could get close to that in some coding sessions where I'm running multiple in parallel but I suspect it's very very rare. Even with token efficient gpt5-codex my OpenAI bill is quite high so I think I will switch to Pro now.
Oh and I agree so much. I just shared a quick first observation in a real-world testing scenario (BTW re-ran Sonnet 4.5 with the same prompt, not much changed). I just keep seeing how LLM providers keep optimizing for benchmarks, but then I cannot reproduce their results in my projects.
I will keep trying, because Claude 4 generally is a very strong line of models. Anthropic had been on the AI coding throne for months before OpenAI - with GPT-5 and Codex CLI (and now GPT-5-Codex) - dethroned them.
And sure I do want to keep them competing to make each other even better.
What would be the difference in prompts/info for Claude vs ChatGPT? Is this just anecdotal, or is there actually something I can refer to when writing prompts? I mostly use Claude, but don't really pay much attention to the exact wording of my prompts.
I must be using Codex wrong. I evaluated it with a task to do a pretty simple, mechanical string replacement across many files (moving from a prop spread in JSX to standard props, but only when the value being spread is a subscript of an object). It simply couldn't do it, and it wasn't even close. It was getting the syntax wrong, trying to fix it by deleting the code, then moving on to other files. Sonnet 4.1 wasn't perfect, but I was able to point out its errors and it fixed them and avoided doing it again.
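For clarity, the kind of change I was asking for looks roughly like this (hypothetical component and prop names, not my actual code):

```tsx
import React from "react";

interface ProfileProps {
  name: string;
  avatarUrl: string;
}

// Hypothetical component; names are invented for illustration.
const ProfileCard = ({ name, avatarUrl }: ProfileProps) => (
  <div>
    <img src={avatarUrl} alt={name} />
    <span>{name}</span>
  </div>
);

const user = { profile: { name: "Ada", avatarUrl: "/ada.png" } };

// Before: spreading a member of an object as props
export const Before = () => <ProfileCard {...user.profile} />;

// After: the same values passed as explicit, standard props
export const After = () => (
  <ProfileCard name={user.profile.name} avatarUrl={user.profile.avatarUrl} />
);
```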
I will say, Claude does seem to need a verbose prompt. Often I'll write my prompts as tasks in Notion and have it pull them via MCP (which is nice, because it tracks and documents its work in the process). But once you've given it a few paragraphs about the why and how, my experience is that it's pretty self-sufficient. Granted, I'm using Cursor and not CC; I don't know if that makes much of a difference.
Codex cannot fail, it contains multitudes beyond your imagining. Nay, it can only be failed. Continue internalizing that the problem is you, not the tool. Perhaps a small infusion of one trillion USD would unlock it and your potential?
My first thought was I bet I could get Sonnet to fix it faster because I got something back in 3 minutes instead of 20 minutes. You can prompt a lot of changes with a faster model. I'm new to Claude Code, so generally speaking I have no idea if I'm making sense or not.
I think Codex working for 20 mins uninterrupted is actually a strength. It’s not “slow” as critics sometimes say - it’s thorough and autonomous. I can actually walk away and get something else done around the house while it does my work for me.
I swear cc in June/July used to spend a lot more time on tasks and felt more thorough like codex does now. Hard to remember much past the last week in this world though.
Interesting, in my experience Claude usually does okay with the first pass, often gets the best visual/ui output, but cannot improve beyond that even with repeated prompts and is terrible at optimising, GPT almost the opposite.
It's also my experience that Claude loves to reimplement the wheel instead of reading code to look for an existing implementation of what it wants to do.
I've been working with Opus 4 on ultrathink quite a bit recently and did some quick tests with Sonnet 4.5, I'm fairly impressed, especially with its speed but I did feel it was a lot less strict with my rules, existing patterns, etc. compared to Opus 4.
Maybe it's better with a better CLAUDE.md structure? I don't use those a lot, just telling Opus to think got 'good enough' results I guess. Not sure.
I hope there's an Opus 4.5 coming out soon too. In the meantime I'll see if I can get it to do better with some extra prompting, or I'll go back to Opus if I don't need the speedier responses.
I've tried Codex with GPT-5 a little bit and I haven't figured out how to get it to not be annoying. Codex just constantly tries to gaslight and argue with me. For example, I was debugging an OpenGL render pipeline that went black, and Codex insisted it must be because I was ssh'd into a headless server. It really makes me appreciate Claude's "You're absolutely right!"s. Anyway, as you can tell, I haven't cracked working with Codex. But at the moment it just messes everything up, and the ways I've learned to work with Claude don't seem to translate.
I ran the test again, took Claude ~4mins this time. There was no error now with the auth, but the functionality was totally broken. It could not even find the most basic stuff that matches perfectly.
I even added a disclaimer "anecdotal evidence". Believe me, I am not the biggest fan of Sam. I just happen to like the best tools available, have used most of the large models and always choose the one that works best - for me.