A single maxed-out M3 Ultra can run Kimi 2.6 at Q2, though that's with noticeably degraded perplexity.
Two M3 Ultras with RDMA can run Kimi 2.6 at Q4 near-losslessly, but with CPU only you'd get okayish decode speed and horrible TTFT (over a minute), which wouldn't be a great _interactive_ experience.
Kimi 2.6 is very close to the Opus family in my experience. It also absolutely does not require $700k to run locally in an interactive fashion. We are talking more in the range of $10k for a slow Q2 with degraded perplexity, to ~$35k for an acceptably fast 200k-context Q4 (quasi-lossless perplexity).
I would love for local inference to be viable, but in my experience Kimi 2.6 is the only model that would be worth it, and that's $10k (max-spec'd M3 Ultra, ~30s TTFT, so kind of slow) to $30k (RTX 6000 / 700GB+ DDR5) upfront, noise and power consumption aside.
Maybe you're missing the article's point, which is to use local models appropriately:
> “But Local Models Aren’t As Smart”
> Correct.
> But also so what?
> Most app features don’t need a model that can write Shakespeare, explain quantum mechanics, and pass the bar exam. They need a model that can do one of these reliably: summarize, classify, extract, rewrite, or normalize.
> And for those tasks, local models can be truly excellent.
I have tried quite a bunch of local models, and the reality is that it's not just a matter of "it's a small model that should be easy to host". It's also a matter of what your acceptable prefill TTFT and decode t/s are.
All the local models I've used on a _consumer grade_ server (32GB DDR5, AMD Ryzen) have been mostly unusable interactively (no decent use as a coding agent possible), and even for things like classification, context size is immediately an issue.
I say that with six months of experience running various local models to classify and summarize my RSS feeds. Just summarizing and tagging the HN articles that reach the front page, offline, barely keeps the queue sustainable and not growing continuously.
1) Again, I suspect you're missing the point of the article. The iPhone's on-device LLM is (apparently) ~3 Bn parameters - and runs well/fast enough to be used in the manner described. Of course, the iPhone has its GPU to leverage.
2) It's probably not the time/place to troubleshoot your "consumer grade server" LLM experience, but if you're running on CPU (you don't mention a GPU) then yeah, your inference speed will be slow.
3) Counterpoint: my consumer-grade Macbook Pro (M1 Max, 64GB) runs Qwen3.6-35B-A3B fast enough to be very usable for regular interactive coding support. (And it would fly with smaller models performing simpler tasks.)
One of my hobbyist workflows involved transcribing ETF prospectuses into YAML for an optimizer to optimize over.
Used to take me maybe 10-20 minutes per sheet.
Then I got Codex to whip up a script that sends each sheet to a fairly low-parameter locally running LLM, and now I have the YAML in a couple of seconds.
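For anyone curious what that kind of script looks like, here's a minimal sketch. It assumes a locally running OpenAI-compatible server (llama.cpp's server and Ollama both expose one); the endpoint URL, YAML keys, and prompt wording are made up for illustration, not the commenter's actual script:

```python
# Sketch: send a prospectus sheet's text to a local LLM and get YAML back.
# Assumes an OpenAI-compatible chat endpoint on localhost (e.g. llama.cpp
# server or Ollama). Uses only the standard library.
import json
import urllib.request

LOCAL_LLM_URL = "http://localhost:8080/v1/chat/completions"  # assumed endpoint


def build_prompt(sheet_text: str) -> str:
    """Instruction prompt asking the model for YAML and nothing else."""
    return (
        "Transcribe the following ETF prospectus sheet into YAML with keys "
        "ticker, expense_ratio, and holdings. Output only YAML.\n\n"
        + sheet_text
    )


def strip_fences(reply: str) -> str:
    """Models often wrap output in ``` fences; drop those lines if present."""
    lines = [l for l in reply.strip().splitlines() if not l.startswith("```")]
    return "\n".join(lines)


def sheet_to_yaml(sheet_text: str) -> str:
    """POST one sheet to the local server and return the YAML text."""
    body = json.dumps({
        "model": "local",  # placeholder; many local servers ignore this field
        "messages": [{"role": "user", "content": build_prompt(sheet_text)}],
        "temperature": 0,  # deterministic-ish output for transcription
    }).encode()
    req = urllib.request.Request(
        LOCAL_LLM_URL, data=body,
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        reply = json.load(resp)["choices"][0]["message"]["content"]
    return strip_fences(reply)
```

A small model handles this well precisely because it's the "extract and normalize" task the article describes, not open-ended generation.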
My dream is to bootstrap myself to local productivity with providers… I know I’ll never get there because hedonic treadmill etc, but I do feel there’s lots more juice to squeeze. I just need to invest more time into AI engineering…
These arguments against AWS are boring. 99% of the negative comments are along the lines of "so I have a dead simple product, I don't know anything about AWS, I logged in and it was super complicated and it seemed pricey".
Well guess what, if you have a CRUD website and 100 users you're just not the target. Move on.
A few days ago I wanted to sketch a 3D model of my TV remote. I opened Blender, and what a mess of complicated windows and panes. I closed it immediately. Do I think Blender is an over-complicated mess? No, I just think I'm not the target. And I'm not offended to be too much of a noob to use it.
I agree, this is a common story and your point stands for some significant percentage of the complaints.
It should be made clear, though, that some of us helped spend many millions on obviously wasteful on-prem infra in the nineties, bought into AWS wholeheartedly when it came out, fought through the ignorance, developed the ability to deliver highly scaled applications on the platform over many years, and at least some of us still carry these same beliefs:
- It's more complicated than it needs to be
- It's more expensive than it should be
- Pricing is more opaque than it should be
Meanwhile, the cost of other options (including self-managed, on-prem infra) has fallen massively since those early days of AWS.
Prior to the RAM crunch you could buy 4 or 5 servers for ~$50k that would be more than capable of handling many enterprises' needs. The thing is, the industry has sort of lost the skill set to host and maintain them. The people who can do this still exist, of course, but they are outnumbered by the YAML jockeys 10 to 1.
There are also other things the cloud hides in its price: redundant networking, provisioning, rack space, internet connections, firewalls, UPS backup, power usage.
Still, I think a lot of startups would benefit from hosting their own stuff if they intend to be a long-term business instead of just shooting their shot and hoping to be acquired.
No, you misunderstand: it's not that we lack the knowledge or skills (we don't!), it's that the backbones and pipelines all converge on these hyperscalers, and that's where you get the best throughput and lowest latency.
I clearly remember having a discussion with a very VERY large company I worked for at the time about getting some NVidia hardware for our own enterprise data centers and they flat out refused. Now, they have lost any advantage they could have had.
The issue with AWS is that they started off cheap, easy, and simple, and grew into an enterprise mess complete with opaque pricing. That's an issue. The complexity itself has created a whole new lane of work for SREs who can specialize in AWS and not do anything else. It's grown beyond just a cloud provider, and people who are still expecting a cloud provider are going to be sour about it.
This is borne out by the fact that there are alternatives that are:
- dramatically simpler
- cheaper
- easier to budget
while retaining the scale-on-demand and hide-the-actual-hardware properties that the industry jumped for joy at. What they don't have is the nobody-got-fired-for-rearchitecting-to-aws bit.
There's always someone making this claim when negative comments about AWS come up.
They almost always come from people who don't have experience running substantive infra at scale without AWS, so they can't make an informed comparison. For a lot of infra, the complexity of doing so turns out to be lower than using AWS. You also end up with transferable skills and a deeper understanding of the foundational protocols and systems. And you save a lot of money, both because you don't have to pay to manage that complexity and because the systems themselves are cheaper.
If you want to design TV remotes, you better learn Blender.
If you want to host something complex enough to warrant AWS, you should also understand how to run it yourself.
These arguments for AWS are boring and sound like uninspired regurgitation of their sales pitch. I recall hearing the same about IIS and Windows a few decades back.
Turns out, they both have pretty good marketing departments!
I see a lot of learned helplessness around this stuff. People managed fleets of servers before the cloud you know, it's not impossible.
Cloud has pros and cons for both small and large setups. I've spent about 10 years working with GCP, and as the article says, there's a lot of complexity in these systems as well. And the network costs... yikes.
Nope. We have an incredibly complicated product, a bunch of actual experts, and paid-up high-level enterprise support.
It is about 8x more expensive to run it on AWS than it was on actual hardware, and that's using their reference architecture and designs. And the sprawling nature of AWS services and uptake makes it pretty damn hard to get out. We are slowly and quietly migrating everything to IaaS / Kubernetes so we can get it out again. Just moving to Kubernetes and packing stuff tight on EKS has already shaved 30% off our costs.
We were sold a lie and fell for it hook, line and sinker.
Edit: also fuck things like Lambda. It's literally the most horrible experience that the universe can muster. Moved most of our lambdas to simple boring http services on top of Go and just leave 20 instances running. Just not having to deal with CloudWatch saved us more money than Lambda could have.
> Edit: also fuck things like Lambda. It's literally the most horrible experience that the universe can muster. Moved most of our lambdas to simple boring http services on top of Go and just leave 20 instances running. Just not having to deal with CloudWatch saved us more money than Lambda could have.
Imagine if, instead of being tied into AWS-specific interfaces, Lambda had shown up as something closer to Cloud Run!
Though hopefully not the Knative style that Azure first went with, with its LOOOOONG start times.
It'd still suck compared to a completely boring process you can just run on your desktop by ./'ing the executable and looking at the console output, then chuck into Kubernetes as a ReplicaSet.
But that's not what this article is? The author is clearly a long-time AWS user and former evangelist who has soured on it as it has become increasingly bloated.
It's true the comments get it wrong. But their main point stands; they shouldn't use AWS.
It's also true that most companies AWS does target shouldn't use it either, unless they have a good reason (like needing data centers on every continent, or to quickly scale to 10k+ CPUs).
that's not a great argument: any professional who doesn't know their operating costs is barely a professional
would you be more enamored by roofers who came to your house and couldn't break down your quote because they were too professional to know the cost of asphalt shingles?
is it more sophisticated to you that you go to a fish market and the price of the goods isn't listed and you have to ask the cashier for every catch?
perhaps we should all be artists who walk into supply stores purchasing oil paints, not caring what the tubes cost, because you're not the target if you want to know the cost of your materials
I don't think the problem is parties in themselves. I think it's more that the US system cannot accommodate more than 2 parties.
Plenty of other democracies have parties, including cross government branches.
What makes the US unique, and fragile, is that no party other than the Democrats and Republicans can realistically exist.
It over-emphasizes partisanship above everything else (including honesty and morality), because career politicians in one party just have nowhere to go if they dissent.
You can see that in plain sight currently, with Republicans totally incapable of contradicting their party line on anything, even the most obvious lies.
In most other democracies, dissidents would have just created a new party and moved on; it wouldn't be career-ending for them.
Snapshotting a filesystem is trivial with e.g. btrfs. You can hook snapshot creation into your agent.
That's a single one-liner of btrfs subvolume snapshot, in a single hook configuration file, ready to be valued at $10B as a quantum agentic versioned-sandbox startup.
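As a concrete sketch of that hook (paths and naming are made up; assumes the agent's workspace is already a btrfs subvolume and the hook runs with the needed privileges):

```shell
#!/bin/sh
# Pre-edit hook: take a read-only snapshot of the workspace before the
# agent modifies anything. Assumes /work is a btrfs subvolume and
# /work/.snapshots exists.
btrfs subvolume snapshot -r /work "/work/.snapshots/$(date +%Y%m%d-%H%M%S)"
```

Rolling back is the same command in reverse: snapshot the read-only copy into a fresh writable subvolume and point the agent at it.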
Part of the appeal (subjective, I know) of versioning is stuff like human-in-the-loop approvals. Think of a pull request: a change is requested by an agent, a human approves, changes get merged atomically. Even if other changes were applied since creation.
The local inference space is leaning toward MoE models, and a lot of them have decent tokens/second but horrible TTFT.