A single maxed-out M3 Ultra can run Kimi 2.6 at Q2, though that's with noticeably degraded perplexity.
Two M3 Ultras with RDMA can run Kimi 2.6 at Q4 near-losslessly, but with CPU only you'd get okayish decode speed and horrible TTFT (over a minute), which wouldn't be a great _interactive_ experience.
Kimi 2.6 is very close to the Opus family in my experience. It also absolutely does not require $700k to run locally in an interactive fashion. We are talking more in the range of $10k for a slow Q2 with degraded perplexity, to ~$35k for an acceptably fast 200k-context Q4 (quasi-lossless perplexity).
I would love for local inference to be viable, but in my experience Kimi 2.6 is the only model that would be worth it, and that's $10k (max-spec'd M3 Ultra, ~30s TTFT, so kind of slow) to $30k (RTX 6000 / 700GB+ DDR5) upfront, noise and power consumption aside.
Maybe you're missing the article's point, which is to use local models appropriately:
> “But Local Models Aren’t As Smart”
> Correct.
> But also so what?
> Most app features don’t need a model that can write Shakespeare, explain quantum mechanics, and pass the bar exam. They need a model that can do one of these reliably: summarize, classify, extract, rewrite, or normalize.
> And for those tasks, local models can be truly excellent.
I have tried quite a bunch of local models, and the reality is that it's not just a matter of "it's a small model that should be easy to host". It's also a matter of what your acceptable prefill TTFT and decode t/s are.
All the local models I've used on a _consumer grade_ server (32GB DDR5, AMD Ryzen) have been mostly unusable interactively (no decent use as a coding agent possible), and even for things like classification, context size is immediately an issue.
I say that with six months of experience running various local models to classify and summarize my RSS feeds. Just summarizing and tagging the HN articles that reach the front page, offline, barely keeps the queue sustainable and not growing continuously.
1) Again, I suspect you're missing the point of the article. The iPhone's on-device LLM is (apparently) ~3 Bn parameters - and runs well/fast enough to be used in the manner described. Of course, the iPhone has its GPU to leverage.
2) It's probably not the time/place to troubleshoot your "consumer grade server" LLM experience, but if you're running on CPU (you don't mention a GPU) then yeah, your inference speed will be slow.
3) Counterpoint: my consumer-grade Macbook Pro (M1 Max, 64GB) runs Qwen3.6-35B-A3B fast enough to be very usable for regular interactive coding support. (And it would fly with smaller models performing simpler tasks.)
One of my hobbyist workflows involved transcribing ETF prospectuses into YAML for an optimizer to optimize over.
Used to take me maybe 10-20 minutes per sheet.
Then I got Codex to whip up a script that sends each sheet to a fairly low-parameter locally running LLM, and now I have the YAML in a couple of seconds.
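For anyone curious what that kind of script looks like, here's a minimal sketch. It assumes a locally running OpenAI-compatible server (llama.cpp's server and Ollama both expose one); the endpoint URL, YAML keys, and prompt wording are made up for illustration, not the commenter's actual script:

```python
# Sketch: send a prospectus sheet's text to a local LLM and get YAML back.
# Assumes an OpenAI-compatible chat endpoint on localhost (e.g. llama.cpp
# server or Ollama). Uses only the standard library.
import json
import urllib.request

LOCAL_LLM_URL = "http://localhost:8080/v1/chat/completions"  # assumed endpoint


def build_prompt(sheet_text: str) -> str:
    """Instruction prompt asking the model for YAML and nothing else."""
    return (
        "Transcribe the following ETF prospectus sheet into YAML with keys "
        "ticker, expense_ratio, and holdings. Output only YAML.\n\n"
        + sheet_text
    )


def strip_fences(reply: str) -> str:
    """Models often wrap output in ``` fences; drop those lines if present."""
    lines = [l for l in reply.strip().splitlines() if not l.startswith("```")]
    return "\n".join(lines)


def sheet_to_yaml(sheet_text: str) -> str:
    """POST one sheet to the local server and return the YAML text."""
    body = json.dumps({
        "model": "local",  # placeholder; many local servers ignore this field
        "messages": [{"role": "user", "content": build_prompt(sheet_text)}],
        "temperature": 0,  # deterministic-ish output for transcription
    }).encode()
    req = urllib.request.Request(
        LOCAL_LLM_URL, data=body,
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        reply = json.load(resp)["choices"][0]["message"]["content"]
    return strip_fences(reply)
```

A small model handles this well precisely because it's the "extract and normalize" task the article describes, not open-ended generation.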
My dream is to bootstrap myself to local productivity with providers… I know I’ll never get there because hedonic treadmill etc, but I do feel there’s lots more juice to squeeze. I just need to invest more time into AI engineering…
These arguments against AWS are boring. 99% of the negative comments are along the lines of "so I have a dead simple product, I don't know anything about AWS, I logged in and it was super complicated and it seemed pricey".
Well guess what, if you have a CRUD website and 100 users you're just not the target. Move on.
A few days ago I wanted to sketch a 3D model of my TV remote. I opened Blender, and what a mess of complicated windows and panes. I closed it immediately. Do I think Blender is an over-complicated mess? No, I just think I'm not the target. And I'm not offended to be too much of a noob to use it.
I agree, this is a common story and your point stands for some significant percentage of the complaints.
It should be made clear, though, that some of us helped spend many millions on obviously wasteful on-prem infra in the nineties, bought into AWS wholeheartedly when it came out, fought through the ignorance, developed the ability to deliver highly scaled applications on the platform over many years, and at least some of us still carry these same beliefs:
- It's more complicated than it needs to be
- It's more expensive than it should be
- Pricing is more opaque than it should be
Meanwhile, the cost of other options (including self-managed, on-prem infra) has fallen massively since those early days of AWS.
Prior to the RAM crunch you could buy 4 or 5 servers for ~$50k that would be more than capable of handling many enterprises' needs. The thing is, the industry has sort of lost the skill set to host and maintain them. The people who can do this still exist, of course, but they are outnumbered by the YAML jockeys 10 to 1.
There are also other things the cloud hides in its price: redundant networking, provisioning, rack space, internet connections, firewalls, UPS backup, power usage.
Still, I think a lot of startups would benefit from hosting their own stuff if they intend to be a long-term business instead of just shooting their shot and hoping to be acquired.
No, you misunderstand: it's not that we lack the knowledge or skills (we don't!), it's that the backbones and pipelines all converge on these hyperscalers, and that's where you get the best throughput and lowest latency.
I clearly remember having a discussion with a very VERY large company I worked for at the time about getting some NVidia hardware for our own enterprise data centers and they flat out refused. Now, they have lost any advantage they could have had.
The issue with AWS is that they started off cheap, easy, and simple, and grew into an enterprise mess complete with opaque pricing. That's an issue. The complexity itself has created a whole new lane of work for SREs who can specialize in AWS and not do anything else. It's grown beyond just a cloud provider, and people who are still expecting a cloud provider are going to be sour about it.
This is borne out by the fact that there are alternatives that are:
- dramatically simpler
- cheaper
- easier to budget
while retaining the scale-on-demand and hide-the-actual-hardware properties that the industry jumped for joy at. What they don't have is the nobody-got-fired-for-rearchitecting-to-aws bit.
There's always someone making this claim when negative comments about AWS come up.
They almost always come from people who don't have experience running substantive infra at scale without AWS, so they can't make an informed comparison. For a lot of infra, the complexity of doing so turns out to be lower than using AWS. You also end up with transferable skills and a deeper understanding of the foundational protocols and systems. And you save a lot of money, both because you don't have to pay to manage that complexity and because the systems themselves are cheaper.
If you want to design TV remotes, you better learn Blender.
If you want to host something complex enough to warrant AWS, you should also understand how to run it yourself.
These arguments for AWS are boring and sound like uninspired regurgitation of their sales pitch. I recall hearing the same about IIS and Windows a few decades back.
Turns out, they both have pretty good marketing departments!
I see a lot of learned helplessness around this stuff. People managed fleets of servers before the cloud you know, it's not impossible.
Cloud has pros and cons for both small and large setups. I've spent about 10 years working with GCP, and as the article says, there's a lot of complexity in these systems as well. And the network costs... yikes.
Nope. We have an incredibly complicated product, a bunch of actual experts, and paid-up high-level enterprise support.
It is about 8x more expensive to run it on AWS than it was on actual hardware, and that's using their reference architecture and designs. And the sprawling nature of AWS services and uptake makes it pretty damn hard to get out. We are slowly and quietly migrating everything to IaaS / Kubernetes so we can get it out again. Just moving to Kubernetes and packing stuff tight on EKS has already shaved 30% off our costs.
We were sold a lie and fell for it hook, line and sinker.
Edit: also fuck things like Lambda. It's literally the most horrible experience that the universe can muster. Moved most of our lambdas to simple boring http services on top of Go and just leave 20 instances running. Just not having to deal with CloudWatch saved us more money than Lambda could have.
> Edit: also fuck things like Lambda. It's literally the most horrible experience that the universe can muster. Moved most of our lambdas to simple boring http services on top of Go and just leave 20 instances running. Just not having to deal with CloudWatch saved us more money than Lambda could have.
Imagine if, instead of being tied into AWS-specific interfaces, Lambda had shown up as something closer to Cloud Run!
Though hopefully not the Knative style that Azure first went with, with its LOOOOONG start times.
It'd still suck compared to a completely boring process you can just run on your desktop by ./'ing the executable and looking at the console output, then chuck into Kubernetes as a ReplicaSet.
But that's not what this article is? The author is clearly a long-time AWS user and former evangelist who has soured on it as it has become increasingly bloated.
It's true the comments get it wrong. But their main point stands; they shouldn't use AWS.
It's also true that most companies AWS does target shouldn't use it either, unless they have a good reason (like needing data centers on every continent, or to quickly scale to 10k+ CPUs).
that's not a great argument: any professional who doesn't know their operating costs is barely a professional
would you be more enamored by roofers who came to your house and couldn't break down your quote because they were too professional to know the cost of asphalt shingles?
is it more sophisticated to you that you go to a fish market and the price of the goods isn't listed and you have to ask the cashier for every catch?
perhaps we should all be artists who walk into supply stores purchasing oil paints, not caring what the tubes cost, because you're not the target if you want to know the cost of your materials
I don't think the problem is parties in themselves. I think it's more that the US system cannot accommodate more than 2 parties.
Plenty of other democracies have parties, including cross government branches.
What makes the US unique, and fragile, is that no party other than the Democrats and Republicans can realistically exist.
It over-emphasizes partisanship above everything else (including honesty and morality), because career politicians in one party just have nowhere to go if they dissent.
You can see that in plain sight currently, with Republicans totally incapable of contradicting their party line on anything, even the most obvious lies.
In most other democracies, dissidents would have just created a new party and moved on; it wouldn't be career-ending for them.
Snapshotting a filesystem is trivial with e.g. btrfs. You can hook snapshot creation into your agent.
That's a single one-liner of btrfs subvolume snapshot, in a single hook configuration file, ready to be valued at $10B as a quantum agentic versioned-sandbox startup.
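As a concrete sketch of that hook (paths and naming are made up; assumes the agent's workspace is already a btrfs subvolume and the hook runs with the needed privileges):

```shell
#!/bin/sh
# Pre-edit hook: take a read-only snapshot of the workspace before the
# agent modifies anything. Assumes /work is a btrfs subvolume and
# /work/.snapshots exists.
btrfs subvolume snapshot -r /work "/work/.snapshots/$(date +%Y%m%d-%H%M%S)"
```

Rolling back is the same command in reverse: snapshot the read-only copy into a fresh writable subvolume and point the agent at it.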
Part of the appeal (subjective, I know) of versioning is stuff like human-in-the-loop approvals. Think of a pull request: a change is requested by an agent, a human approves, changes get merged atomically. Even if other changes were applied since creation.
The local inference space is leaning toward MoE models, and a lot of them have decent tokens/second but horrible TTFT.