Hacker News | mickeyp's comments

Impressive work. But the problem is not the 30 tok/s, which is fine for agentic coding and chat.

It's prefill; slow prefill kills agentic workloads dead.

If you have 100,000 tokens at ~150tok/s per the OP, you're looking at:

    You have: 100000 / (150/s)

    You want: hms

     11 min + 6.6666667 sec
Which is quite a wait indeed.
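
For other context sizes or prefill rates, the same arithmetic can be sketched in Python. The function name is made up for illustration; only the 100,000 tokens and ~150 tok/s come from the figures above.

```python
def prefill_time(tokens: int, tok_per_sec: float) -> tuple[int, float]:
    """Return (minutes, seconds) needed to prefill `tokens` at `tok_per_sec`."""
    total = tokens / tok_per_sec          # total wait in seconds
    minutes = int(total // 60)
    return minutes, total - minutes * 60

minutes, seconds = prefill_time(100_000, 150)
print(f"{minutes} min + {seconds:.1f} sec")
```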

Most people won’t be dumping 100K tokens into it at once, but I agree that all of the prefill time that adds up during a session becomes a lot to account for.

This is also a problem for all of the Mac local LLMs. Macs are a great way to get a lot of high bandwidth memory, but their compute is very far behind current gen dedicated GPUs. Some of the expensive Mac Studio setups allow you to run very large models with usable tokens/s, but you can be waiting a long time for it to get to the point of generating those tokens.


The prefix cache is working properly; 100k doesn’t prefill more than once.

When you're using OpenCode it's easy to reach 100,000 tokens after a while.

I wonder if this could be usefully mitigated with a combination of prompt (prefix) caching and an agent that let you control what the prompt prefix consisted of. The goal would be to incur that slow prefill once to build the prompt cache, then have subsequent prompts consist of mostly this fixed prefix plus specific instructions.

For a language like C++ where modules are split into definition (.h) and implementation (.cpp) parts, one choice of prefix would be all the header files for the project (which aren't likely to change much).

More generally the idea would be to have an agent that had cached-prefix reuse as its primary context management goal.

Another possibility, to support caching of files that have since changed, would be for the agent to build the context as a fixed prefix reflecting some or all of the codebase in its start-of-session state, then append any changes to that, with appropriate prompting to only use the latest definition of a function.

e.g.

Say file A initially contains functions X, Y and Z, then the prompt prefix is built to include X Y Z. If the user then modifies Y -> Y', then just add that to the context, so that the cached prefix is unchanged, giving X Y Z Y'.
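
A rough Python sketch of that X Y Z Y' idea, to show why append-only context keeps the cached prefix reusable. Everything here is hypothetical: each string stands in for a span of tokens, `build_prompt` and `shared_prefix_len` are invented names, and a real prefix cache matches at the token or block level rather than on whole segments.

```python
def build_prompt(prefix: list[str], updates: list[str], instruction: str) -> list[str]:
    """Fixed session-start prefix, then appended revisions, then the new instruction."""
    return prefix + updates + [instruction]

def shared_prefix_len(a: list[str], b: list[str]) -> int:
    """Count the leading segments two prompts share (what a prefix cache could reuse)."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

prefix = ["X", "Y", "Z"]                              # file A at session start
first = build_prompt(prefix, [], "explain Y")
second = build_prompt(prefix, ["Y'"], "refactor Y")   # user edited Y -> Y'

# Append-only: X Y Z is unchanged, so all three segments stay reusable.
print(shared_prefix_len(first, second))    # 3

# Editing in place instead (X Y' Z) invalidates everything after X.
in_place = ["X", "Y'", "Z", "refactor Y"]
print(shared_prefix_len(first, in_place))  # 1
```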


Can't you structure things like loading a codebase or priming with reference material to happen overnight or during meal breaks etc.? I guess it's frustrating if you want to switch to a project and have the LLM begin co-working with you immediately, but even the best human collaborator would require a long period to get up to speed before being able to make meaningful contributions.

A quick search says that this is a standard feature: you cache the prefill and load it at PCIe bandwidth, so it should be about 0.2s.
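
A back-of-envelope way to check a figure like that for a given setup, sketched in Python. The model shape (48 layers, 8 KV heads, head dimension 128, fp16) and the 32 GB/s link speed are made-up assumptions for illustration, not numbers from the OP; the only fixed part is that a KV cache stores one key and one value tensor per layer per token.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_elem=2):
    """Keys and values: 2 tensors per layer, each kv_heads * head_dim per token."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens

def transfer_seconds(num_bytes, gb_per_sec):
    """Time to move num_bytes over a link running at gb_per_sec (decimal GB)."""
    return num_bytes / (gb_per_sec * 1e9)

# Hypothetical model shape and link speed; swap in real values for your setup.
size = kv_cache_bytes(tokens=100_000, layers=48, kv_heads=8, head_dim=128)
print(f"{size / 1e9:.1f} GB, {transfer_seconds(size, 32):.2f} s")
```

The result scales linearly with context length and with model shape, so whether it lands near 0.2s depends on the model being served.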

No, the insight here is that you went _back_ and got your PhD with years of experience building professional software.

Expecting a 20-year old undergrad or a 23-year old postgrad to do as well as someone who left and came back to uni to finish their degree(s) is... uncharitable.


I became lead of that MMORTS within 4yrs of starting my career. I've worked with lots of PhD students who came back after more than 5 years working in big tech; not one had the experience and abilities I did. I've also worked with fresh grads in games who were miles ahead of 99% of the software engineers and PhDs I've encountered in my time.

Again, I've also run into equally talented fresh grads in big tech, but they were much more rare.

Take my anecdotes as you will.


This test would be a lot more useful if the author used images the models obviously hadn't seen before. Pulling images from Wikipedia? They'll have seen 'em before, and the metadata, and all the pages they were casually linked to.

The premise that the long prompt only made the model think 'a second longer' may have more to do with the fact that it knows about the images. So why think harder if you know the answer?

At no point does the author contemplate that.


It might be more useful, but as is, it is still dispositive: 5.5 is significantly worse than o3 at geo-guessing. And the “magic” prompt doesn’t matter that much, at least in o3’s case.


They say they threw in some indoor images, presumably from around where they were.


Indeed. Kitbashing is a thing, and it was always a thing. Designers I worked with would spend hours doomscrolling pinterest, google images, etc. looking for their, uh... 'spark' when they were given a briefing.

This is just a really cool way of building.

I'm impressed. I tried Google Stitch but it was slow and useless. Sad, because Gemini has a pretty good creative flair, ironically enough.


Stitch has been very good for me to prototype some designs, and the exporting design feature is great.

But jeez, is it buggy, slow and unintuitive at times.

Complete shift in google's old engineering culture of high quality - they seem to be shipping quickly in favor of stability


That culture died forever ago, google has been launching half-baked shit that they kill in 18 months after no updates for a decade now.


"in favor" is hard to parse; "instead"?


I'll go one further and say that if you're reaching for DISTINCT and you have joins, you may have joined the data the wrong way. It's not a RULE, but it's ALWAYS a 'smell' when I see a query that uses DISTINCT to shove away duplicate matches. I always add a comment for the exceptions.
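
A small illustration of the smell, as a sketch using Python's built-in sqlite3 module. The `customers`/`orders` schema and data are invented for the example.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Bob');
    INSERT INTO orders VALUES (10, 1), (11, 1), (12, 2);
""")

# The join fans out: Ada appears once per order, and DISTINCT papers over it.
smelly = con.execute("""
    SELECT DISTINCT c.name FROM customers c
    JOIN orders o ON o.customer_id = c.id
    ORDER BY c.name
""").fetchall()

# Asking the actual question ("does this customer have any order?") needs no DISTINCT.
direct = con.execute("""
    SELECT c.name FROM customers c
    WHERE EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id)
    ORDER BY c.name
""").fetchall()

print(smelly == direct)  # True
```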


Right. But faceting data is also part of what a good database designer does. That includes views over the data; materialisation, if it is justified; stored procedures and cursors.

I've never had to do 18 joins to extract information in my career. I'm sure these cases do legitimately exist but they are of course rare, even in large enterprises. Most companies are more than capable of distinguishing OLTP from OLAP and real-time from batch and design (or redesign) accordingly.

Databases and their designs shift with the use case.


> I've never had to do 18 joins to extract information in my career.

Really? You're not representing particularly complex entities with your data.

I work on a student information system. 18 joins isn't even weird. If I want a list of the active students, the building they're in, and their current grade level, that's a join of 8 tables right there. If I also want their class list, that's an additional 5 or 6. If you also want the primary teacher, add another 4. If you want secondary staff, that's another 5.

The whole system is only around 500 GB, but it's close to 2,000 tables. Part of the reason is tech debt and archaic design from the vendor, but that's just as likely to reduce the number of tables as it is to increase them. The system uses a monolithic lookup table design, and some of the tables have over 300 columns. If they were to actually properly normalize the entire system to 3NF, I have no doubt that it would be in the hundreds of thousands of tables.


You say that with the wisdom of experience.

But there's still value in people exploring new spaces they find interesting, even if they do not meet your personal definition of pareto-optimal.


Exploring with AI doesn’t lead to the same level of learning. They are doing the equivalent of paying to skip the level up of their character and going to the final boss with level 1 armor


I look at it more like speedrunning a level. You're skipping the parts of the level that take up the most time, sometimes using hacks. Is it universally as much fun as playing the game? No, just like using AI to prototype might get you to the same place, but without the experience of discovery and blockers along the way.


If you're ok with a model provider that goes down all the time and has such a poor inference engine setup that once you get past 50k tokens you're going to get stuck in endless reasoning loops.


The old joke Zawinski made about picking regex "and now you have two problems" applies here.

If you pick Elasticsearch, useful as it is, you now have more than two problems. You have Elastic the company; Elasticsearch the tool; and also the clay-footed colossus, Java, to contend with.


It doesn't help that academia loooves ColBERT and will happily tell you how amazing -- and, look, for how tiny the models are, 20M params and super fast on a CPU, it is -- they are at seemingly everything if only you...

- Chunk properly;

- Elide "obviously useless files" that give mixed signals;

- Re-rank and rechunk the whole files for top scoring matches;

- Throw in a little BM25 but with better stemming;

- Carry around a list of preferred files and ideally also terms to help re-rank;

And so on. Works great when you're an academic benchmaxing your toy Master's project. Try building a scalable vector search that runs on any codebase without knowing anything at all about it and get a decent signal out of it.

Ha.

