It took 25797 input tokens and 1225 output tokens, for a total cost (calculated using https://tools.simonwillison.net/llm-prices ) of $2.11! It took 154 seconds to generate.
It’d be great if someone would do that with the same data and prompt to other models.
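For anyone wanting to reproduce the cost figure above, here is a minimal sketch. The per-million-token rates are my assumption (GPT-4.5 preview list pricing); the linked llm-prices page has current numbers.

```python
# Recompute the quoted cost. The rates below are an assumption
# (GPT-4.5 preview list pricing), not stated in the thread itself.
INPUT_RATE_PER_M = 75.0    # USD per 1M input tokens (assumed)
OUTPUT_RATE_PER_M = 150.0  # USD per 1M output tokens (assumed)

input_tokens = 25_797
output_tokens = 1_225

input_cost = input_tokens * INPUT_RATE_PER_M / 1_000_000     # about $1.93
output_cost = output_tokens * OUTPUT_RATE_PER_M / 1_000_000  # about $0.18
total = input_cost + output_cost

print(f"${input_cost:.2f} + ${output_cost:.2f} = ${total:.2f}")
```

Rounding each component to cents before summing gives the quoted $2.11; summing the exact values first and then rounding gives $2.12.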
I did like the formatting and attributions, but didn’t necessarily want attributions like that for every section. I’m also not sure it fully matches what I’m seeing in the thread, but maybe the data I’m seeing is just newer.
Thanks for sharing. To me, purely on personal preference, the Gemini models did best on this task, which also fits with my experience using Google's models to summarize extensive, highly specialized text. Gemini's 2.0 models do especially well on needle-in-a-haystack type tests in my experience.
Seeing the other models, I actually come away impressed with how well GPT-4.5 is organizing the information and how well it reads. I find it a lot easier to quickly parse. It's more human-like.
I noticed 4o mini didn't follow the directions to quote users. My favourite part of the 4.5 summary was how it quoted Antirez. 4o mini brought out the same quote, but failed to attribute it as instructed.
It's fascinating: while this does mean it strays from the given example, I actually feel the result is a better summary. The 4.5 version is so long you might as well just read the whole thread yourself.
Interesting, thanks for doing this. I'd say that (at a glance) for now it's still worth using more passes with smaller models rather than one pass with 4.5.
Now, if you'd want to generate training data, I could see wanting to have the best answers possible, where even slight nuances would matter. 4.5 seems to adhere to instructions much better than the others. You might get the same result w/ generating n samples and "reflect" on them with a mixture of models, but then again you might not. Going through thousands of generations manually is also costly.
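The best-of-n-with-reflection idea can be sketched generically. `generate` and `judge` here are hypothetical stand-ins for real model calls; the toy versions below exist only to make the selection logic concrete.

```python
from typing import Callable, List

def best_of_n(prompt: str,
              generate: Callable[[str], str],
              judge: Callable[[str, str], float],
              n: int = 4) -> str:
    """Generate n candidate answers, then keep the one the judge scores highest."""
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: judge(prompt, c))

# Toy stand-ins so the sketch runs without any API access:
drafts = iter(["short", "a much longer draft", "medium draft"])
toy_generate = lambda prompt: next(drafts)
toy_judge = lambda prompt, cand: float(len(cand))  # pretend longer == better

best = best_of_n("Summarize the thread", toy_generate, toy_judge, n=3)
print(best)
```

In practice `judge` would be a second model (or a mixture of models) scoring each candidate, which is where the cost of reviewing thousands of generations comes back in.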
Compared to GPT-4.5 I prefer the GPT-4o version because it is less wordy. It summarizes and gives the gist of the conversation rather than reproducing it along with commentary.
Didn't seem to realize that "Still more coherent than the OpenAI lineup" wouldn't make sense out of context. (The actual comment quoted there is responding to someone who says they'd name their models Foo, Bar, Baz.)
"For example, there are now a bunch of vendors that sell 'respond to RFP' AI products... paying 30x for marginally better performance makes perfect sense." — hn_throwaway_99 (an uncommon opinion supporting possible niche high-cost uses).
? You think hn_throwaway_99's comment is sarcastic? It makes perfect sense to me read "straight."
That is, sales orgs save a bunch of money using AI to respond to RFPs; they would still save a bunch of money using a more expensive AI, and any marginal improvement in sales closed would pay for it.
It may have excessively summarized his comment, which confused you -- but this is the kind of mistake human curators of quotes make, too.
I don't know why but something about this section made me chuckle
"""
These perspectives highlight that there remains nuance—even appreciation—of explorative model advancement not solely focused on immediate commercial viability
"""
I disagree with most of the knee-jerk negativity in LLM threads, but in this case it mostly seems warranted. There are no "boundaries being pushed" here, this is just a desperate release from a company that finds itself losing more and more mindshare to other models and companies.
Hey, check this one out. With all the different flavors that exist out there, I think I made something better. https://cofyt.app
As far as I'm aware it holds up; feel free to test it head-to-head. This is better than Getrecall, and you can chat with a transcript for detailed answers to your prompts.
But as I mentioned, my main concern is what will happen in 6 months when you fail to get traction and abandon it, because that's what happened to the previous 5 products I tried, which were all "good enough".
Getrecall seems to have a big enough user base that will actually stick around.
I understand; it's a perfectly reasonable argument to make from your position as a user.
First, let me tell you that I looked at a lot of things out there, including Getrecall, before starting to build this, and felt there was nothing that had a good UX/UI that actually makes it an enjoyable product (nice and clean).
I’m confident in the direction and committed to seeing it through, building something better for me, and maybe for you too, by doing it with more care.
Appreciate your feedback, and while no one can control the future, I’ve added this thread to my calendar to come back here in 6 months.
Hundreds that specifically focus on noticing that a page you’re currently viewing has not only been posted to HN but has undergone significant discussion there, and then providing a summary of those conversations?
What I want is something that can read the thread out loud to me, using a different voice per user, so I can listen to a busy discussion thread like I would listen to a podcast.
The headline of the section "Dystopian and Social Concerns about AI Features" is interesting. It's roughly true... but somehow that broad statement seems to minimize the point discussed.
I'd headline that thread as "Concerns about output tone". There were comments about dystopian implications of tone, marketing implications of tone and implementation issues of tone.
Of course, that I can comment about the fine-points of an AI summary shows it's made progress. But there's a lot riding on how much progress these things can make and what sort. So it's still worth looking at.
Here's the result: https://gist.github.com/simonw/5e9f5e94ac8840f698c280293d399...