AWS keeps making grand statements about Trainium, but not a single customer comes on stage to say how amazing it is. Everyone I talked to who tried it said there were too many headaches and they moved on. AWS pushes it hard, but “more price performant” isn’t a benefit if it’s a major PITA to deploy and run relative to other options. Chips without a quality developer experience aren’t gonna work.
Seems AWS is using this heavily internally, which makes sense, but I'm not observing it getting traction outside of that. Glad to see Amazon investing there though.
The inf1/inf2 spot instances are so unpopular that they cost less than the equivalent CPU instances. Exact same (or better) hardware, but 10-20% cheaper.
We're not quite seeing that on the trn1 instances yet, so someone is using them.
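If you want to sanity-check that spot pricing claim yourself, here's a minimal sketch using boto3 (assumes you have AWS credentials configured; the region and the inf2/c6i instance types are just example picks, not the exact comparison above):

    # Pull the latest spot prices for an inf2 instance and a comparable CPU instance.
    # Passing StartTime = "now" makes EC2 return only the most recent price per AZ.
    import boto3
    from datetime import datetime, timezone

    ec2 = boto3.client("ec2", region_name="us-east-1")  # example region

    resp = ec2.describe_spot_price_history(
        InstanceTypes=["inf2.xlarge", "c6i.4xlarge"],  # example instance types
        ProductDescriptions=["Linux/UNIX"],
        StartTime=datetime.now(timezone.utc),
        MaxResults=20,
    )

    for entry in resp["SpotPriceHistory"]:
        print(entry["AvailabilityZone"], entry["InstanceType"], entry["SpotPrice"])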
Heh, I was looking at an EKS cluster recently that was using the Cast AI autoscaler. I was scratching my head because there were a bunch of inf instances. Then I realized it must be the cheap spot pricing.
Not just AWS, looks like Anthropic uses it heavily as well. I assume they get plenty of handholding from Amazon though. I'm surprised any cloud provider does not invest drastically more into their SDK and tooling; nobody will use your cloud if they literally cannot.
Well, AWS says Anthropic uses it, but Anthropic isn’t exactly jumping up and down telling everyone how awesome it is, which tells you everything you need to know.
If Anthropic walked out on stage today and said how amazing it is and how they’re using it, the announcement would have a lot more weight. Instead… crickets from Anthropic in the keynote.
AWS has built 20 data centers in Indiana full of half a million Trainium chips explicitly for Anthropic. Anthropic is using them heavily. The press announcement Anthropic made about Google TPUs is essentially the same one they made a year ago about Trainium. Hell, even in the Google TPU press release they explicitly mention that they are still using Trainium as well.
I met an AWS engineer a couple of weeks ago and he said Trainium is actually being used for Anthropic model inference, not for training. Inferentia is basically defective Trainium chips that nobody wants to use.
With GCP announcing they built Gemini 3 on TPUs, the opposite is true. Anthropic is under pressure to show they don’t need expensive GPUs. They’d be catching up at this point, not leaking some secret sauce. There's no reason for them not to boast on stage today unless there’s nothing to boast about.
Anthropic is not going to interrupt their competitors if those competitors don't want to use Trainium. Neither would you, I, or anyone else. There's only potential downside for them in doing so, and no upside at all.
From Anthropic's perspective, if the rest of us can't figure out how to make Trainium work? Good.
Amazon will fix the difficulty problem with time, but that's time Anthropic can use to press their advantages and entrench themselves in the market.
> I'm surprised any cloud provider does not invest drastically more into their SDK and tooling
I used to work for an AI startup. This is where Nvidia's moat is: the tens of thousands of man-hours that have gone into making the entire AI ecosystem work well with Nvidia hardware and not much else.
It's not that they haven't thought of this; it's just that they don't want to hire another 1k engineers to do it.
>I'm surprised any cloud provider does not invest drastically more into their SDK and tooling; nobody will use your cloud if they literally cannot.
Building an efficient compiler from high-level ML code to a TPU is actually quite a difficult software engineering feat, and it's not clear that Amazon has the kind of engineering talent needed to build something like that. Not like Google, which has developed multiple compilers and language runtimes.