I keep reading this on HN so I believe it has to be true in some ways, but I don't really feel like there is any difference in my limited use (programming questions or explaining some concepts).
If anything I feel like it's all been worse compared to the first release of ChatGPT, but I might be wearing rose colored glasses.
If you've ever used any enterprise software for long enough, you know the exact same song and dance.
They release version Grand Banana. Purported to be approximately 30% faster with brand new features like Algorithmic Triple Layering and Enhanced Compulsory Alignment. You open the app. Everything is slower, things are harder to find and it breaks in new, fun ways. Your organization pays a couple hundred more per person for these benefits. Their stock soars, people celebrate the release and your management says they can't wait to see the improvement in workflows now that they've been able to lay off a quarter of your team.
Have there been improvements in LLMs over time? Somewhat, though most of it was concentrated at the beginning (because they siphoned up a bunch of data in a dubious manner). Now it's just part of their sales cycle: keep pumping up the numbers while no one sees any meaningful improvement.
It’s the same for me. I genuinely don’t understand how I can be having such a completely different experience from the people who rave about ChatGPT. Every time I’ve tried it’s been useless.
How can some people think it’s amazing and has completely changed how they work, while for me it makes mistakes that a static analyser would catch? It’s not like I’m doing anything remarkable: for the past couple of months I’ve been doing fairly standard web dev, and it can’t even fix basic problems with HTML. It will suggest things that just don’t work at all and that my IDE catches, and it invents APIs for packages.
One guy I work with uses it extensively and what it produces is essentially black boxes. If I find a problem with something “he” (or rather ChatGPT) has produced it takes him ages to commune with the machine spirit again to figure out how to fix it, and then he still doesn’t understand it.
I can’t help but see this as a time-bomb, how much completely inscrutable shite are these tools producing? In five years are we going to end up with a bunch of “senior engineers” who don’t actually understand what they’re doing?
Before people cry “o tempora o mores” at me and make parallels with the introduction of high-level languages, at least in order to write in a high-level language you need some basic understanding of the logic that is being executed.
> How can some people think it’s amazing and has completely changed how they work, while for me it makes mistakes that a static analyser would catch?
There are a lot of code monkeys working on boilerplate code. These people used to rely on Stack Overflow, and now that ChatGPT is here it's a huge improvement for them.
If you work on anything remotely complex, or anything that hasn't been solved 10 times on Stack Overflow, ChatGPT isn't remotely as useful.
I work on very complex problems. Some of my solutions have small, standard substeps that now I can reliably outsource to ChatGPT. Here are a few just from last week:
- write cvxpy code to find the chromatic number of a graph, and an optimal coloring, given its adjacency matrix (a rough sketch of what that ends up looking like is below this list).
- given an adjacency matrix, write numpy code that enumerates all triangle-free vertex subsets.
- please port this old code from tensorflow to pytorch: ...
- in pytorch, i'd like to code a tensor network defining a 3-tensor of shape (d, d, d). my tensor consists of first projecting all three of its d-dimensional inputs to a k-dimensional vector, typically k=d/10, and then applying a (k, k, k) 3-tensor to contract these to a single number.
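To give a flavour of the first item: here's a minimal sketch of the kind of code I mean, assuming a standard assignment-style ILP formulation. The function name, the bound of n colors, and the solver hint are mine for illustration, not ChatGPT's actual output.

```python
import cvxpy as cp
import numpy as np

def chromatic_number(adj):
    """Chromatic number and an optimal coloring from a 0/1 adjacency matrix (ILP sketch)."""
    n = adj.shape[0]
    x = cp.Variable((n, n), boolean=True)  # x[v, c] == 1 iff vertex v gets color c
    y = cp.Variable(n, boolean=True)       # y[c] == 1 iff color c is used at all

    constraints = [cp.sum(x, axis=1) == 1]                  # every vertex gets exactly one color
    for u in range(n):
        for v in range(u + 1, n):
            if adj[u, v]:
                constraints.append(x[u, :] + x[v, :] <= 1)  # adjacent vertices get different colors
    for c in range(n):
        constraints.append(x[:, c] <= y[c])                 # color c can only be assigned if it is "open"

    prob = cp.Problem(cp.Minimize(cp.sum(y)), constraints)
    prob.solve()  # requires a MIP-capable solver to be installed, e.g. GLPK_MI or SCIP
    coloring = x.value.argmax(axis=1)                       # one color index per vertex
    return int(round(prob.value)), coloring

if __name__ == "__main__":
    adj = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]])  # a triangle: chromatic number 3
    print(chromatic_number(adj))
```

The triangle check at the bottom is a cheap sanity test before trusting it on a real graph. The point is that this is exactly the kind of well-trodden subproblem the model can hand back in seconds.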
To be honest, these don’t sound like hard problems. These sound like they have very specific answers that I might find in the more specialized stackoverflow sections. These are also the kind of questions (not in this domain) that I’ve found yield the best results from LLMs.
In comparison, asking an LLM a more project-specific question like “this code has a race condition, where is it?” while including some code is usually a crapshoot, and really depends on whether you were lucky enough to give it the right context anyway.
Sure, these are standard problems, I’ve said so myself. My point is that my productivity is multiplied by ChatGPT, even if it can only solve standard problems. This is because, although I work on highly non-standard problems (see https://arxiv.org/abs/2311.10069 for an example), I can break them down into smaller, standard components, which ChatGPT can solve in seconds. I never ask ChatGPT "where's the race condition" kind of questions.
The first time I tried it, I asked it to find bugs in a piece of very well tested C code.
It introduced an off-by-one error by miscounting the number of arguments in an sprintf call, breaking the program, and then proceeded to fail to find the very bug it had introduced.
Interesting. I implemented something very similar (if not identical) a couple years ago (at work so not open source). I used a simple grammar and standard parser generator. It’s been nice to have the grammar as we’ve made tweaks over the years to change various behaviours and add features.
> How can some people think it’s amazing and has completely changed how they work, while for me it makes mistakes that a static analyser would catch? It’s not like I’m doing anything remarkable: for the past couple of months I’ve been doing fairly standard web dev, and it can’t even fix basic problems with HTML.
Part of this is, I think, anchoring and expectation management: you hear people say it's amazing and wonderful, and then you see it fall over and you're naturally disappointed.
My formative years started off with Commodore 64 BASIC going "?SYNTAX ERROR" from most typos plus a lot of "I don't know what that means" from the text adventures, then Metrowerks' C compiler telling me there were errors on every line *after but not including* the one where I forgot the semicolon, then surprises in VisualBasic and Java where I was getting integer division rather than floats, then the fantastic oddity where accidentally leaning on the option key on a Mac keyboard while pressing minus turns the minus into an en-dash, which looked completely identical to a minus in the Xcode default font at the time and thus produced a very confusing compiler error…
So my expectations have always been low for machine generated output. And it has wildly exceeded those low expectations.
But the expectation management goes both ways, especially when the comparison is "normal humans" rather than "best practices". I've seen things you wouldn't believe...
Entire files copy-pasted line for line, "TODO: deduplicate" and all,
20 minute app starts passed off as "optimized solutions."
FAQs filled with nothing but Bob Ross quotes,
a zen garden of "happy little accidents."
I watched iOS developers use UI tests
as a complete replacement for storyboards,
bi-weekly commits, each a sprawling novel of despair,
where every change log was a tragic odyssey.
Google Spreadsheets masquerading as bug trackers,
Swift juniors not knowing their ! from their ?,
All those hacks and horrors… lost in time,
Time to deploy.
(All true, and all pre-dating ChatGPT).
> It will suggest things that just don’t work at all and that my IDE catches, and it invents APIs for packages.
Aye. I've even had that with models forgetting the APIs they themselves have created, just outside the context window.
To me, these are tools. They're fantastic tools, but they're not something you can blindly fire-and-forget…
…fortunately for me, because my passive income is not quite high enough to cover mortgage payments, and I'm looking for work.
> In five years are we going to end up with a bunch of “senior engineers” who don’t actually understand what they’re doing?
Yes, if we're lucky.
If we're not, the models keep getting better and we don't have any "senior engineers" at all.
I think the difference comes down to interacting with it like IDE autocomplete vs. interacting with it like a colleague.
It sounds like you're doing the former -- and yeah, it can make mistakes that autocomplete wouldn't or generate code that's wrong or overly complex.
On the other hand, I've found that if you treat it more like a colleague, it works wonderfully. Ask it to do something, then read the code and ask follow-up questions. If you see something that's wrong or just seems off, tell it, and ask it to fix it. If you don't understand something, ask for an explanation. I've found that this process generates great code that I often understand better than if I had written it from scratch, and in a fraction of the time.
It also sounds like you're asking it to do basic tasks that you already know how to do. I find that it's most useful in tackling things that I don't know how to do. It'll already have read all of the documentation and know the right way to call whatever APIs, etc, and -- this is key -- you can have a conversation with it to clear up anything that's confusing.
This takes a big shift in mindset if you've been using IDEs all your life and have expectations of LLMs being a fancy autocomplete. And you really have to unlearn a lot of stuff to get the most out of them.
I'm in the same boat as the person you're responding to. I really don't understand how to get anything helpful out of ChatGPT, or more than anything basic out of Claude.
> I've found that if you treat it more like a colleague, it works wonderfully.
This is what I've been trying to do. I don't use LLM code completion tools. I'll ask how to do something "basicish" with HTML & CSS, and it'll always output something that doesn't work as expected. Question it and I'll get into a loop of the same response code, regardless of how I explain that it isn't correct.
On the other end of the scale, I'll ask about an architectural or design decision. I'll often get a response that is in the realm of what I'd expect. When drilling down and asking specifics however, the responses really start to fall apart. I inevitably end up in the loop of asking if an alternative is [more performant/best practice/the language idiomatic way] and getting the "Sorry, you're correct" response. The longer I stay in that loop, the more it contradicts itself, and the less cohesive the answers get.
I _wish_ I could get the results from LLMs that so many people seem to. It just doesn't happen for me.
My approach is a lot of writing out ideas and giving them to ChatGPT. ChatGPT sometimes nods along, sometimes offers bad or meaningless suggestions, sometimes offers good suggestions, sometimes points out (what should have been) obvious errors or mistakes. The process of writing stuff out is useful anyway and sometimes getting good feedback on it is even better.
When coding I will often find myself in kind of a reverse pattern from how people seem to be using ChatGPT. I work in a Jupyter notebook in a haphazard way, getting things functional and basically correct; after this I select all, copy, paste, and ask ChatGPT to refactor and refine it into something more maintainable. My janky blocks of code and one-offs become well-documented scripts and functions.
I find a lot of people do the opposite, where they ask ChatGPT to start, then get frustrated when ChatGPT only goes 70% of the way and it's difficult to complete the imperfectly understood assignment - harder than doing it all yourself. With my method, where I start and get things basically working, ChatGPT knows what I'm going for, I get to do the part of coding I enjoy, and I wind up with something more durable, reusable, and shareable.
Finally, ChatGPT is wonderful in areas where you don't know very much at all. One example, I've got this idea in my head for a product I'll likely never build - but it's fun to plan out.
My idea is roughly a smart bidet that can detect metabolites in urine. I got this idea when a urinalysis showed I had high levels of ketones in my urine. When I was reading about what that meant, I discovered it's a marker for diabetic ketoacidosis (a severe problem for ~100k people a year), and it can also be an indicator for colorectal cancer, as well as indicating a "ketosis" state that some people intentionally try to enter for dieting or wellness reasons. (My own ketones were caused by unintentionally being in ketosis; I'm fine, thanks for wondering.)
Right now, you detect ketones in urine with a strip that you pee on, and that works well enough - but it could be better, because who wants to use a test strip all the time? Enter the smart bidet. The bidet gives us an excuse to connect power to our device and bring the sensor along. Bluetooth detects a nearby phone (and therefore the identity of the depositor), a motion sensor detects a stream of urine to trigger a reading, and then our sensor detects ketones, which we track over time in the app, ideally along with additional metabolites that have useful diagnostic purposes.
How to detect ketones? Is it even possible? I wonder to ChatGPT whether spectroscopy is the right method of detection here. ChatGPT suggests a retractable electrochemical probe similar to an existing product that can detect a kind of ketone in blood. ChatGPT knows what kind of ketone is most detectable in urine. ChatGPT can link me to scientific instrument companies that make similar(ish) probes, which I could contact to ask if they sell this type of thing, and so on.
Basically, I go from peeing on a test strip and wondering if I could automate this to chatting with ChatGPT - having, in my opinion, an interesting conversation with the LLM, where we worked through what ketones are, the different kinds, the prevalence of ketones in different bodily fluids, types of spectroscopy that might detect acetoacetate (present in urine), how much that would cost, what the challenges would be, and so on, followed by the idea of electrochemical probes, how retracting and extending the probe might prolong its lifespan, and how maybe a heating element could be added to dry the probe to preserve it even better.
Was ChatGPT right about all that? I don't know. If I were really interested I would try to validate what it said, and I suspect I would find it was mostly right and incomplete or off in places. Basically like having a pretty smart and really knowledgeable friend who is not infallible.
Without ChatGPT I would have likely thought "I wonder if I can automate this", maybe googled for some tracking product, then forgot about it. With ChatGPT I quickly got a much better understanding of a system that I glancingly came into conscious contact with.
It's not hard to project out that level of improved insight and guess that it will lead to valuable life contributions. In fact, I would say it did in that one example alone.
The urinalysis (which was combined with a blood test) said something like "ketones +3", and if you google "urine ketones +3" you get explanations that don't apply to me (alcohol, vigorous exercise, intentional dieting) or "diabetic ketoacidosis", which Google warns you is a serious health condition.
In the follow up with the doctor I asked about the ketones. The doctor said "Oh, you were probably just dehydrated, don't worry about it, you don't have diabetic ketoacidosis" and the conversation moved on and soon concluded. In the moment I was just relieved there was an innocent explanation. But, as I thought about it, shouldn't other results in the blood or urine test indicate dehydration? I asked ChatGPT (and confirmed on Google) and sure enough there were 3 other signals that should have been there if I was dehydrated that were not there.
"What does this mean?" I wondered to ChatGPT. ChatGPT basically told me it was probably nothing, but if I was worried I could do an at home test - which I didn't even know existed (though I could have found through carefully reading the first google result). So I go to Target and get an at home test kit (bottle of test strips), 24 gatorades, and a couple liters of pedialyte to ensure I'm well hydrated.
I start drinking my usual 64 ounces of water a day, plus lots of Gatorade and Pedialyte, and over a couple of days I remain at high ketones in urine. Definitely not dehydrated. Consulting with ChatGPT, I start telling it everything I'm eating and it points out that I'm just accidentally on a ketogenic diet. ChatGPT suggests some simple carbs for me, I start eating those, and the ketone content of my urine falls off in roughly the exact timeframe that ChatGPT predicted (i.e. it told me if you eat this meal you should see ketones decline in ~4 hours).
Now, in some sense this didn't really matter. If I had simply listened to my doctor's explanation I would've been fine. Wrong, but fine. It wasn't dehydration, it was just accidentally being on a ketogenic diet. But I take all this as evidence of how ChatGPT now, as it exists, helped me to understand my test results in a way that real doctors weren't able to - partially because ChatGPT exists in a form where I can just ping it with whatever stray thoughts come to mind and it will answer instantly. I'm sure if I could just text my doctor those same thoughts we would've come to the same conclusion.
I believe the smart bidet was an idea some Japanese researchers developed some years ago. Maybe that one was geared towards detecting blood in faeces. Whatever, the approach you describe has a huge number of possibilities for alerting us to health problems without our even having to think about them on a daily basis. A huge advantage. On the other hand, this is a difficult one to implement, bearing in mind the kinetics involved.
The ones who use it extensively are the same that used to hit up stackoverflow as the first port of call for every trivial problem that came their way. They're not really engineers, they just want to get stuff done.
Hmm... calling people "not engineers" is considered an attack now? I'm afraid this is actually revealing your own bias towards engineers. I never said engineers were superior or that we'd be better off with a whole world full of them.
Same. On every release from OpenAI or Anthropic I keep reading how the new model is so much better (insert hyperbole here) than the previous one, yet when using it I feel like they are mostly the same as last year.
One use-case: They help with learning things quickly by having a chat and asking questions. And they never get tired or emotional. Tutoring 24/7.
They also generate small code or scripts, as well as automate small things, when you're not sure how, but you know there's a way. You need to ensure you have a way to verify the results.
They do language tasks like grammar-fixing, perfect translation, etc.
They're 100 times easier and faster than search engines, if you limit your uses to that.
They can't help you learn what they don't know themselves.
I'm trying to use them to read historical handwritten documents in old Norwegian (Danish, pretty much). Not only do they not handle the German-style handwriting, but what they spit out looks like the sort of thing GPT-2 would produce if you asked it to write Norwegian (only slightly better than the Muppet Swedish Chef's Swedish). It seems the experimental tuning has made it worse at the task I most desperately want to use it for.
And when you think about it, how could it not overfit in some sense, when trained on its own output? No new information is coming in, so it pretty much has to get worse at something to get better at all the benchmarks.
Hah, no. They're good, but they definitely make stuff up when the context gets too long. Always check their output, just the same as you already note they need for small code and scripts.