You originally made a broad claim about the technology as a class, though. It's not particularly surprising that ChatGPT (from two years ago, no less) isn't state of the art at translation, given that it isn't optimized for that (at least as far as I'm aware). The same approaches used to construct LLMs can be applied to build a model intended specifically for machine translation. That's where the value proposition lies.
I'm not sure whether you're missing the point by a mile, or whether I am.
The transformer architecture, later modified to create GPT models, was originally designed for translation. The modifications to make it do predictive text in a chatbot style make it much, much worse at translating: one of the biggest issues is that generative systems fail silently. Using the tech appropriately gives you things like Project Bergamot: https://browser.mt/.
I get your point about the surprising news that GPT-4 does so poorly at translation. I didn't know that!
However, I think the idea is that LLM technologies have improved considerably since then. Do you still feel that Claude or ChatGPT performs worse than DeepL? It would be really nice to have an objective benchmark for comparison.
Well, there's this[0], but unfortunately I don't see any results and I'm not about to put in the time to run it myself.
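If anyone does want a rough objective check without running a full suite, the usual shortcut is to score each system's output against a shared human reference with sacrebleu. This is just a sketch of that idea: the file names are placeholders, and BLEU only captures part of translation quality.

```python
# Rough BLEU comparison of two systems against a shared human reference.
# File names are placeholders; each file holds one sentence per line,
# aligned across all three files.
import sacrebleu

def read_lines(path):
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f]

refs = read_lines("reference.txt")       # human translations
deepl_out = read_lines("deepl.txt")      # system A output
chatgpt_out = read_lines("chatgpt.txt")  # system B output

for name, hyps in [("DeepL", deepl_out), ("ChatGPT", chatgpt_out)]:
    bleu = sacrebleu.corpus_bleu(hyps, [refs])
    print(f"{name}: BLEU = {bleu.score:.1f}")
```

A BLEU score on a small sample won't settle the argument, but it at least replaces "feels worse" with a number.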
Responding to GP: I won't dispute that LLMs aren't optimized for translation, but I would generally expect them to perform quite well at it, given that it seems to come with the territory of being able to respond to natural-language prompts in multiple languages.
> massive, overfit generative models are awful at translation
> part-of-speech classification + dictionary lookup + grammar mapping – an incredibly simplistic system with performance measurable in chapters per second – does a better job
Those are two distinct claims you made, and I'm not inclined to accept either of them without evidence, given how unexpected both of them would be from my perspective.
Transformer models are good at translation, given appropriate training data (i.e., "this text but in multiple languages") – though you still have to watch out for them translating "English" as "Deutsch", "Français", etc. Asking them to do repeated next token prediction isn't asking them to translate, though, especially not after the RLHF passes that OpenAI does to their ChatGPT models. When you test it, you get exactly the failure modes you'd expect: translations that start off okay, but go off on tangents; translations that "correct" the original text, so the translations aren't faithful; attempts (usually successful) to cover up a gap in "knowledge" that prevents the model from translating correctly.
Given these failure modes, which have come up every single time I've seen ChatGPT used for translation, it's clear that the simplistic system I described would work better. It'll often output gibberish (just as LibreTranslate does when given decontextualised Chinese and asked to translate to English), but that's better than a GPT model, which will just confabulate something in the same circumstance. The goal isn't "maximise the amount of successful translatedness": it's "reduce the language barrier as much as possible", which the benchmarks don't test.
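To make "incredibly simplistic" concrete, here's a toy sketch of that kind of pipeline. The tag table, dictionary entries, and reordering rule are all invented for illustration (it's not any particular system; a real rule-based translator just has vastly larger tables), but it shows the shape: POS classification, dictionary lookup, grammar mapping.

```python
# Toy pipeline: POS classification -> dictionary lookup -> word-order mapping.
# All tables here are made up for illustration. The key property: unknown words
# fail loudly instead of being papered over with fluent-sounding guesses.

# Crude POS classification by lookup; a real tagger would use context.
POS = {"the": "DET", "dog": "NOUN", "cat": "NOUN", "sees": "VERB"}

# Bilingual dictionary keyed on (word, POS); placeholder target-language entries.
LEXICON = {
    ("the", "DET"): "le",
    ("dog", "NOUN"): "chien",
    ("cat", "NOUN"): "chat",
    ("sees", "VERB"): "voit",
}

def translate(sentence: str) -> str:
    tagged = [(tok, POS.get(tok, "UNK")) for tok in sentence.lower().split()]
    # Dictionary lookup: unknown words pass through in angle brackets,
    # so the reader sees exactly where the system gave up.
    words = [LEXICON.get(pair, f"<{pair[0]}>") for pair in tagged]
    tags = [tag for _, tag in tagged]
    # "Grammar mapping": one hard-coded rule standing in for a table of rules.
    # Pretend the target language is verb-final: move the first verb to the end.
    if "VERB" in tags:
        i = tags.index("VERB")
        words = words[:i] + words[i + 1:] + [words[i]]
    return " ".join(words)

print(translate("the dog sees the cat"))   # -> "le chien le chat voit"
print(translate("the dog sees the moon"))  # -> "le chien le <moon> voit"
```

The point isn't that this is good; it's that when it fails, the failure is visible, which is what actually matters for reducing the language barrier.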
It's not surprising news. It's obvious news. When I was making those claims, I hadn't tested it: I was making purely theoretical arguments, based on having glanced at the relevant papers in the GPT-2 days.
That GPT models are bad at translation, and will always be bad at translation (while, perhaps, "improving" where they're overfit on specific benchmarks), is obvious to anyone with even a cursory understanding of how they work.