I am surprised the author thought the project passed quality control. The LLM reviews seem mostly false.
Looking at the comment reviews on the actual website, the LLM seems to have mostly judged whether it agreed with the takes, not whether they came true, and it seems to have an incredibly poor grasp of its actual task of assessing whether the comments were predictive or not.
The LLM's comment reviews are often statements like "correctly characterized [programming language] as [opinion]."
This dynamic means the website mostly grades people on having the most conformist take (the take most likely to dominate the training data, and to be selected for in the LLM's RL tuning process of pleasing the average user).
Examples: tptacek gets an 'A' for his comment on DF, with the LLM claiming that the user
"captured DF's unforgiving nature, where 'can't do x or it crashes is just another feature to learn' which remained true until it was fixed on ..."
So the LLM is praising a comment for describing DF as unforgiving (a characterization of the then-present, not a statement about the future). And worse, it seems like tptacek may in fact be implying the opposite of what the future held (e.g., that x would continue to crash, when it was eventually fixed).
Here is the original comment: "
tptacek on Dec 2, 2015:
If you're not the kind of person who can take flaws like crashes or game-stopping frame-rate issues and work them into your gameplay, DF is not the game for you. It isn't a friendly game. It can take hours just to figure out how to do core game tasks. "Don't do this thing that crashes the game" is just another task to learn."
Note: I am paraphrasing the LLM review, as the website is also poorly designed: you cannot even select the text of the LLM review!
N.b., this choice of comment review is not overly cherry-picked. I just scanned the "best commentators" list; tptacek was number two, with this particular, egregiously unrelated-to-prediction LLM summary given as justification for his #2 rating.
Are you sure? The third section of each review lists the “Most prescient” and “Most wrong” comments. That sounds exactly like what you're looking for. For example, on the "Kickstarter is Debt" article, here is the LLM's analysis of the most prescient comment. The analysis seems accurate and helpful to me.
phire
> “Oculus might end up being the most successful product/company to be kickstarted…
> Product wise, Pebble is the most successful so far… Right now they are up to major version 4 of their product. Long term, I don't think they will be more successful than Oculus.”
With hindsight:
Oculus became the backbone of Meta’s VR push, spawning the Rift/Quest series and a multi‑billion‑dollar strategic bet.
Pebble, despite early success, was shut down and absorbed by Fitbit barely a year after this thread.
That’s an excellent call on the relative trajectories of the two flagship Kickstarter hardware companies.
Until someone publishes a systematic quality assessment, we're grasping at anecdotes.
It is unfortunate that the questions of "how well did the LLM do?" and "how does 'grading' work in this app?" seem to have gone out the window when HN readers see something shiny.
Yes. And the article is a perfect example of the dangerous sort of automation bias that people will increasingly slide into when it comes to LLMs. I realize Karpathy is sort of incentivized toward this bias given his career, but he doesn't spend even a single sentence so much as suggesting that the results would need further inspection, or that they might be inaccurate.
The LLM is consulted like a perfect oracle, flawless in its ability to perform a task, and it's left at that. Its results are presented totally uncritically.
For this project, of course, the stakes are nil. But how long until this unfounded trust in LLMs works its way into high-stakes problems? The reign of deterministic machines over the past few centuries has ingrained in us a trust in the reliability of machines that should be suspended when dealing with an inherently stochastic device like an LLM.
I get what you're saying, but looking at some examples, they look kind of right, yet there are a lot of misleading facts sprinkled in, making the grading wrong. It is useful, but I'd suggest being careful about using this to make decisions.
Some of the issues could be resolved with better prompting (it was biased to always interpret every comment through the lens of predictions) and LLM-as-a-judge techniques, but still. For example, Anthropic's Deep Research prompts sub-agents to pass along original quotes instead of paraphrasing, because paraphrasing can degrade the original message.
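For illustration only, here is a minimal sketch of what such a quote-preserving judging step could look like; the prompt wording and the call_llm helper are hypothetical, not taken from the project or from Anthropic:

    # Hypothetical sketch: make the judge quote the claim verbatim before grading,
    # instead of paraphrasing it. call_llm is any chat-completion wrapper that
    # takes a prompt string and returns the model's reply as a string.
    JUDGE_PROMPT = (
        "You are grading whether a comment made a prediction.\n"
        "1. Copy, verbatim, the exact sentence(s) that make a claim about the future.\n"
        "2. If no such sentence exists, answer NO_PREDICTION and stop.\n"
        "3. Otherwise, state what actually happened and grade only the quoted claim.\n"
    )

    def judge(comment_text, call_llm):
        return call_llm(JUDGE_PROMPT + "\nComment:\n" + comment_text)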
Some examples:
Swift is Open Source (2015)
===========================
sebastiank123 got a C-, and was quoted by the LLM as saying:
> “It could become a serious Javascript competitor due to its elegant syntax, the type safety and speed.”
Now, let's read his full comment:
> Great news! Coding in Swift is fantastic and I would love to see it coming to more platforms, maybe even on servers. It could become a serious Javascript competitor due to its elegant syntax, the type safety and speed.
I don't interpret it as a prediction, but as a desire. The user is praising Swift. If it went the server way, perhaps it could replace JS, in line with the user's wishes. To make it even clearer: if someone asked the commenter right after, "Is that a prediction? Are you saying Swift is going to become a serious Javascript competitor?", I don't think their answer would be 'yes' in this context.
How to be like Steve Ballmer (2015)
===================================
Most wrong
----------
> corford (grade: D) (defending Ballmer’s iPhone prediction):
> Cited an IDC snapshot (Android 79%, iOS 14%) and suggested Ballmer was “kind of right” that the iPhone wouldn’t gain significant share.
> In 2025, iOS is one half of a global duopoly, dominates profits and premium segments, and is often majority share in key markets. Any reasonable definition of “significant” is satisfied, so Ballmer’s original claim—and this defense of it—did not age well.
Full quote:
> And in a funny sort of way he was kind of right :) http://www.forbes.com/sites/dougolenick/2015/05/27/apple-ios...
> Android: 79% versus iOS: 14%
"Any reasonable definition of 'significant' is satisfied"? That's not how I would interpret this. We see it clearly as a duopoly in North America. It's not wrong per se, but I'd say misleading. I know we could take this argument and see other slices of the data (premium phones worldwide, for instance), I'm just saying it's not as clear cut as it made it out to be.
> volandovengo (grade: C+) (ill-equipped to deal with Apple/Google):
>
> Wrote that Ballmer’s fast-follower strategy “worked great” when competitors were weak but left Microsoft ill-equipped for “good ones like Apple and Google.”
> This is half-true: in smartphones, yes. But in cloud, office suites, collaboration, and enterprise SaaS, Microsoft became a primary, often leading competitor to both Apple and Google. The blanket claim underestimates Microsoft’s ability to adapt outside of mobile OS.
That's not what the user was saying:
> Despite his public perception, he's incredibly intelligent. He has an IQ of 150.
>
> His strategy of being a fast follower worked great for Microsoft when it had crappy competitors - it was ill equipped to deal with good ones like Apple and Google.
He was praising Ballmer, and Microsoft did miss opportunities at first. The original commenter did not make predictions about its later days.
[Let's Encrypt] Entering Public Beta (2015)
===========================================
- niutech: F ("endorsed StartSSL and WoSign as free options; both were later distrusted and effectively removed from the trusted ecosystem")
Full quote:
> There are also StartSSL and WoSign, which provide the A+ certificates for free (see example WoSign domain audit: https://www.ssllabs.com/ssltest/analyze.html?d=checkmyping.c...)
>
> pjbrunet: F (dismissed HTTPS-by-default arguments as paranoid, incorrectly asserted ISPs had stopped injection, and underestimated exactly the use cases that later moved to HTTPS)
Full quote:
> "We want to see HTTPS become the default."
>
> Sounds fine for shopping, online banking, user authorizations. But for every website? If I'm a blogger/publisher or have a brochure type of website, I don't see point of the extra overhead.
>
> Update: Thanks to those who answered my question. You pointed out some things I hadn't considered. Blocking the injection of invisible trackers and javascripts and ads, if that's what this is about for websites without user logins, then it would help to explicitly spell that out in marketing communications to promote adoption of this technology. The free speech angle argument is not as compelling to me though, but that's just my opinion.
I thought the debate was useful and so did pjbrunet, per his update.
I mean, we could go on, there are many others like these.
I haven’t looked at the output yet, but came here to say: LLM grading is crap. They miss things, they ignore instructions, bring in their own views, have no calibration, and in general are extremely poorly suited to this task. "Good" LLM-as-a-judge type products (and none are great) use LLMs to make binary decisions ("do these atomic facts match, yes/no" type stuff) and aggregate them to get a score (sketched below).
I understand this is just a fun exercise, so it's basically what LLMs are good at: generating plausible-sounding stuff without regard for correctness. I would not extrapolate this to their utility on real evaluation tasks.
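A minimal sketch of that binary-decision-then-aggregate pattern, purely as an illustration; the prompt text and helper names are hypothetical:

    # Ask one yes/no question per atomic claim, then average the answers into a
    # score. call_llm is any wrapper that sends a prompt and returns the reply.
    import re

    ATOMIC_PROMPT = (
        "Claim: {claim}\n"
        "Evidence: {evidence}\n"
        "Does the evidence support the claim? Answer only YES or NO."
    )

    def score_claims(claims, evidence, call_llm):
        votes = []
        for claim in claims:
            reply = call_llm(ATOMIC_PROMPT.format(claim=claim, evidence=evidence))
            votes.append(1 if re.match(r"\s*YES", reply, re.IGNORECASE) else 0)
        return sum(votes) / len(votes) if votes else 0.0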
Have you read the literature? Do you have a background in machine learning or statistics?
Yes. We know that LLMs can be trained by predicting the next token. This is a fact. You can look up the research papers, and open source training code.
I can't work it out: are you advocating a conspiracy theory that these models are trained with some elusive secret and that the researchers are lying to you?
Being trained by predicting one token at a time is also not a criticism??! It is just a factually correct description...
> Have you read the literature? Do you have a background in machine learning or statistics?
Very much so. Decades.
> Being trained by predicting one token at a time is also not a criticism??! It is just a factually correct description...
Of course that's the case. The objection I've had from the very first post in this thread is that using this trivially obvious fact as evidence that LLMs are boring/uninteresting/not AI/whatever is missing the forest for the trees.
"We understand [the I/Os and components of] LLMs, and what they are is nothing special" is the topic at hand. This is reductionist naivete. There is a gulf of complexity, in the formal mathematical sense and reductionism's arch-enemy, that is being handwaved away.
People responding to that with "but they ARE predicting one token at a time" are either falling into the very mistake I'm talking about, or are talking about something else entirely.
I mean, yeah, statistics works. It's not that surprising that super amazing statistical modelling can approximate a distribution. Of course, thoughts, words, arguments are distributions, and with a powerful enough model you can simulate them.
None of this is surprising? Like, I think you just lack a good statistical intuition. The amazing thing is that we have these extremely capable models, and methods to learn them. That process is an active area of research (as is much of statistics), but it is just all statistics...
Saying we understand the training process of LLMs does not mean that LLMs are not super impressive. They are shining testaments to the power of statistical modelling / machine learning. Arbitrarily reclassifying them as something else is not useful. It is simply untrue.
There is nothing wrong with being impressed by statistics... You seem to feel that to say LLMs are statistics is to dismiss them. I think perhaps you are just implicitly biased against statistics! :p
I think the person you are responding to is using a strange definition of "know."
I think they mean "do we understand how they process information to produce their outputs" (i.e., do we have an analytical description of the function they are trying to approximate).
You and I mean that we understand the training process that produces their behaviour (and this training process is mainly standard statistical modelling / ML).
I agree. The two of us are talking past each other, and I wonder if it's because there's a certain strain of thought around LLMs that believes that epistemological questions and technology that we don't fully understand are somehow unique to computer science problems.
Questions about the nature of knowledge (epistemology and other philosophical/cognitive studies) in humans are still unsolved to this day, and frankly may never be fully understood. I'm not saying this makes LLMs automatically similar to human intelligence, but there are plenty of behaviors, instincts, and knowledge across many kinds of objects that we don't fully understand the origin of. LLMs aren't qualitatively different in this way.
There are many technologies that we used without fully understanding them at the time, even iterating on and improving those designs without a strong theory behind them. Only later did we develop the theoretical frameworks that explain how those things work, much as we're now researching the underpinnings of how LLMs work to develop more robust theories around them.
I'm genuinely trying to engage in a conversation and understand where this person is coming from and what they think is so unique about this moment and this technology. I understand the technological feat and I think it's a huge step forward, but I don't understand the mysticism that has emerged around it.
What do you mean? what do you think statistical modelling is?
I am very confused by your stance.
The aim of the function approximation is to maximize the likelihood of the observed data (this is standard statistical modelling), and using machine learning (e.g., stochastic gradient descent) on a class of universal function approximators is a standard approach to fitting such a model.
How is that a misconception? LLMs are just advanced statistical modelling (unsupervised machine learning) with small tweaks (e.g., some fine-tuning for human preference).
At the core, they are just statistical modelling. The fact that statistical modelling can produce coherent thoughts is impressive (and basically vindicates materialism) but that doesn't change the fact it is all based on statistical modelling. ...? What is your view?
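To make "it is all statistical modelling" concrete, next-token training is just minimizing the negative log-likelihood of the observed text under the model. A toy sketch, where model() is a stand-in for any network that returns a probability distribution over the vocabulary:

    import math

    def next_token_nll(model, token_ids):
        # Average -log p(token_t | tokens_<t) over a sequence; training is
        # stochastic gradient descent on this quantity over a huge corpus.
        loss = 0.0
        for t in range(1, len(token_ids)):
            probs = model(token_ids[:t])        # distribution over the vocabulary
            loss += -math.log(probs[token_ids[t]])
        return loss / (len(token_ids) - 1)

Everything else (instruction tuning, RLHF for human preference) is a comparatively small adjustment on top of this maximum-likelihood objective.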
You realize that they will gladly hallucinate science...
You should check the papers it claims to reference and see if the claims it makes are actually backed up.
In my experience, it can completely mischaracterize scientific literature. For example, I asked it if a codebase was a faithful implementation of an algorithm described in a CS paper, and it said "no" and then proceeded to list a dozen small changes. Every single change was incorrect. The codebase was in fact a completely faithful implementation.
In short, college students nowadays have lower reading comprehension than young children in the 1850s. That is not what I would call progress.
Speaking personally, I believe I would potentially have significantly worse critical reasoning abilities if I had grown up using LLMs. It is very clear to me the temptation of using them as an ersatz for engagement and thought.
I think you are perhaps conflating technological progress (yes technology has improved) with demographic progress. Demographic progress is far from monotonically increasing (reading comprehension is newly plummeting, maths scores are dropping in America, science per scientist is stalling compared to 50 years ago, etc...)