Hacker News | tootyskooty's comments

I'm wondering about this too. Would be nice to see an ablation here, or at least see some analysis on the reasoning traces.

It definitely doesn't wipe its internal knowledge of Crystal clean (that's not how LLMs work). My guess is that it slightly encourages the model to explore more and second-guess its likely very strong Crystal game knowledge, but that's about it.


The model probably recognizes the need for a grassroots effort to solve the problem, to "show its work".


See the pretraining section of prerelease_notes.md:

https://github.com/DGoettlich/history-llms/blob/main/ranke-4...


I was curious: they train a 1900 base model, then continue pretraining up to the exact cutoff year:

"To keep training expenses down, we train one checkpoint on data up to 1900, then continuously pretrain further checkpoints on 20B tokens of data 1900-${cutoff}$. "


Since it now includes 4 thinking levels (minimal to high), I'd really appreciate it if we got some benchmarks across the whole sweep (and not just what's presumably high).

Flash is meant to be the model for lower-cost, latency-sensitive tasks. Long thinking times will push TTFT well past 10s (often unacceptable), and they won't really be that cheap either.


Google appears to be changing what Flash is "meant for" with this release: its capability, along with the thinking budgets, makes it superior to previous Pro models in both outcome and speed. The likely-soon-coming Flash-Lite will fit right into where Flash used to be: cheap and fast.


Both have had questionable content for a while; it's a wonder people are still paying for them, especially given that LLMs exist (and YouTube, for that matter).


If I were a professor at a decent school, I'd probably look at the landscape of MOOCs and go "Why am I spending any time on this?" It seemed like something new and potentially exciting at one point. I certainly wouldn't today.


Since no one has mentioned it yet: note that the benchmarks for Large are for the base model, not for the instruct model available in the API.

The most likely reason is that the instruct model underperforms compared to the open competition (even among non-reasoners like Kimi K2).


Shameless plug: if OP is looking to stay on d3, he could also try slotting in my C++/WASM versions[1] of the main d3 many-body forces. Not the best, but I've found >3x speedup using these for periplus.app :)

[1]: https://www.npmjs.com/package/d3-manybody-wasm


Working on a new interface for learning with LLMs that creates courses on any topic.

https://periplus.app

The goal was to make the learning material very malleable, so all content can be viewed through different "lenses" (e.g. made simpler, more thorough, from first principles). A bit like Wikipedia, it also allows for infinite depth/rabbit-holing: each document links to other documents, which link to other documents (...).

I'm also currently in the middle of adding interactive visualizations, which actually work better than expected! Some demos:

https://x.com/mato_gudelj/status/1975547148012777742


I've been getting a lot of vulnerability "spam mail" recently that's clearly AI-generated.

It's a surprise that every public bounty program isn't completely buried in automated reports by now, but it likely won't take long.


Would be nice to finally see multi-turn coding benchmarks. Everything we have so far is single-turn, and that's clearly not a realistic scenario.


I suspect one can go a lot further by adopting some tweaks from the GPT-2 speedrun effort [0]: at minimum Muon, a better init, and careful learning-rate tuning.

[0]: https://github.com/KellerJordan/modded-nanogpt
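For reference, a rough sketch of Muon's core idea: plain momentum SGD whose matrix-shaped updates get orthogonalized via a Newton-Schulz iteration. The coefficients follow what modded-nanogpt uses, but treat everything here as illustrative rather than tuned:

    import torch

    def newton_schulz(G, steps=5):
        # Approximately orthogonalize G with a quintic Newton-Schulz iteration
        # (the real implementation runs this in bfloat16 for speed).
        a, b, c = 3.4445, -4.7750, 2.0315
        X = G / (G.norm() + 1e-7)
        transposed = X.size(0) > X.size(1)
        if transposed:
            X = X.T
        for _ in range(steps):
            A = X @ X.T
            X = a * X + (b * A + c * A @ A) @ X
        return X.T if transposed else X

    class Muon(torch.optim.Optimizer):
        # Sketch: SGD with momentum whose 2-D (matrix) updates are orthogonalized.
        def __init__(self, params, lr=0.02, momentum=0.95):
            super().__init__(params, dict(lr=lr, momentum=momentum))

        @torch.no_grad()
        def step(self):
            for group in self.param_groups:
                for p in group["params"]:
                    if p.grad is None:
                        continue
                    buf = self.state[p].setdefault("momentum_buffer", torch.zeros_like(p))
                    buf.mul_(group["momentum"]).add_(p.grad)
                    update = newton_schulz(buf) if p.ndim == 2 else buf
                    p.add_(update, alpha=-group["lr"])

In modded-nanogpt this only covers the 2-D weight matrices; embeddings, norms and the head stay on AdamW, and the init and LR tweaks are separate changes on top.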


