Yeah, they are putting two facts together to heavily imply that they are part of a single story, but there is no evidence presented that they are. "UN leaders are gathering!" "There is a huge SIM farm that could disrupt communications!" Both true, but seemingly unrelated. All those car warranty texts have to come from somewhere - this is probably where.
Exactly. And the whole point of a cellular network architecture is that it's resistant to DoS attacks (what the rubes call "unexpectedly heavy usage"). Sure, you can take a cell out with a hundred fake phones, and all the users in that cell will hop to the next one. Or at worst walk a block over to find another. The attack doesn't scale, at all.
And even if you wanted to deploy custom hardware to do it, it would be far easier to just use a high-power jammer on the band than to muck around with all those SIMs.
These are for making actual use of the telecom facilities at scale, with the anonymity you get from burner SIMs. It's fraud, not terrorism.
Some parts of it are (DoS resistant). And some carriers are more resistant than others. Verizon's CDMA from the 90s / early 2000s was NOTORIOUS for falling over when too many people texted at the same time. But yeah, it's been a while since things were that bad.
I don't understand why we need more data for training. Assuming we've already digitized every book, magazine, research paper, newspaper, and other forms of media, why do we need this "second internet?" Legal issues aside, don't we already have the totality of human knowledge available to us for training?
The goal/theory behind the LLM investment explosion is that we can get to AGI by feeding them all the data. And to be clear, by AGI I don't mean "superhuman singularity", just "intelligent enough to replace most humans" (and, by extension, hoover up all the money we're spending on their salaries today).
But if we've already fed them all the data, and we don't have AGI (which we manifestly don't), then there's no way to get to AGI with LLMs and the tech/VC industry is about to have a massive, massive problem justifying all this investment.
That's only true if model intelligence is the limiting factor. I don't believe it is. Right now the limiting factor for AI impact is mostly the effort required to wire it up to all the systems we want it to automate.
We don't have anything close to the totality of human knowledge digitized, much less in a form that LLMs can easily take advantage of. Even for easily verifiable facts powering modern industry, details like appropriate lube/speeds/etc for machining molybdenum for this or that purpose just don't exist outside of the minds of the few people who actually do it. Moreover, _most_ knowledge is similarly locked up inside a few people rather than being written down.
Even when written down, without the ability to interact with and probe the world like you did growing up, it's not possible to meaningfully tell the difference between 9/11 hoaxers and everyone else save for how frequently the respective texts appear. They don't have the ability to meaningfully challenge their world model, and that makes the current breadth of written content even less useful than it might otherwise appear.
Let's keep in mind that we don't have most of the renaissance through the early modern period (1400-1800) because it was published in Neo-Latin with older typefaces, and only about 10% of it is even digitized.
We probably don’t have most of the Arabic corpus either — and barely any Sanskrit. Classical Chinese is probably also lacking — only about 1% of it is translated to English.
The volume of text in English and digitized from the past few years dwarfs the volume of Latin text from all time. Unless you are wondering about a very niche historical topic there’s more written in English than Latin about basically everything.
Well, if you are looking for diversity of perspective, temporal diversity may be valuable.
Marsilio Ficino was hired by the Medici to translate Plato and other classical Greek works into Latin. He directly taught da Vinci, Raphael, Michelangelo, Toscanelli, etc. I mean to say that his ideas and perspectives helped spark the renaissance.
Insofar as we hope for an AI renaissance and not an AI apocalypse, it might benefit us to have the actual renaissance in the training data.
If you make a cursory search you can also find other translations of his works, various biographies, and a wide range of commentary and criticism by later authors.
Many of Ficino's originals are also in the corpus of scanned and OCRed or recently republished texts. I'm sure there are archives here or there with additional materials which have not been digitized, but it seems questionable whether those would make any significant difference to a process as indiscriminate and automatic as LLM training.
Consiglio contro la pestilenza was apparently written in the Florentine language. You can find a nice scan at https://archive.org/details/ita-bnc-in1-00000486-001/ and the corresponding mediocre OCR at https://archive.org/stream/ita-bnc-in1-00000486-001/ita-bnc-... (using better OCR software could give a version with few errors; I don't think anyone has produced a carefully checked digital text). There's some discussion at https://www.jstor.org/stable/40606241 as well as plenty of other commentary around. If you are trying to figure out specific details about this book, you should just check the book. I wouldn't expect adding a carefully produced copy to an LLM training corpus would make that much difference unless you have niche questions about it.
I interpreted it as a roundabout way of increasing quality. Take any given subreddit. You have posts and comments, and scores, but what if the data quality isn't very good overall? What if, instead of using it as is, you had an AI evaluate and reason about all the posts, and classify them itself based on how useful the posts and comments are, how well they work out in practice (if easily simulated), etc.? Essentially you're using the AI to produce a moderated and carefully curated layer of information about the information that was already present. If you then ingest that, does it increase the quality of the data? Probably(?), since you're spending compute and AI reasoning on the problem ahead of time, filtering out the low-quality data and adding additional high-quality data on top.
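A minimal sketch of that judge-and-filter idea, assuming a hypothetical judge_model callable that takes a prompt and returns the model's text reply (none of these names are a real API):

    def judge_score(judge_model, text):
        """Ask the judge model for a 0-10 usefulness rating.
        `judge_model` is a stand-in: any callable that takes a prompt
        and returns the model's text reply."""
        reply = judge_model(
            "Rate the usefulness of this post from 0 to 10. "
            "Answer with just the number.\n\n" + text
        )
        try:
            return float(reply.strip()) / 10.0
        except ValueError:
            return 0.0  # an unparseable judgment counts as low quality

    def curate(posts, judge_model, threshold=0.7):
        """Filter a noisy corpus (e.g. a subreddit dump) down to the
        posts the judge rates as useful, keeping the score as metadata
        so it can be used for weighting during training."""
        kept = []
        for post in posts:
            score = judge_score(judge_model, post["text"])
            if score >= threshold:
                kept.append({**post, "quality_score": score})
        return kept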
The point is that current methods are unable to get more than the current state-of-the-art models' degree of intelligence out of training on the totality of human knowledge. Previously, the amount of compute needed to process that much data was a limit, but not anymore.
So now, in order to progress further, we either have to improve the methods, or synthetically generate more training data, or both.
What does synthetic training data actually mean? Just saying the same things in different ways? It seems like we're training in a way that's just not sustainable.
One example: when we want to increase performance on a task that can be automatically verified, we can often generate synthetic training data by having the current, imperfect models attempt the task lots of times, then picking out the first attempt that works. For instance, given a programming problem, we might write a program skeleton and unit tests for the expected behavior. GPT-5 might take 100 attempts to produce a working program; the hope is that GPT-6 would train on the working attempt and therefore take far fewer attempts to solve similar problems.
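A minimal sketch of that verify-and-keep loop, with `sample` standing in for whatever wraps the current model; the names and the in-process exec verifier are illustrative, not any real pipeline:

    from dataclasses import dataclass

    @dataclass
    class Problem:
        skeleton: str  # the prompt: function signature, docstring, etc.
        tests: str     # unit tests that define "working"

    def passes(candidate, tests):
        """Verify a candidate by running it against the unit tests.
        Illustrative only: a real pipeline would sandbox this."""
        try:
            scope = {}
            exec(candidate, scope)  # define the candidate's functions
            exec(tests, scope)      # tests raise AssertionError on failure
            return True
        except Exception:
            return False

    def synthesize(problem, sample, max_attempts=100):
        """Rejection sampling: keep the first attempt that verifies;
        the (prompt, completion) pair becomes training data for the
        next model."""
        for _ in range(max_attempts):
            candidate = sample(problem.skeleton)
            if passes(candidate, problem.tests):
                return {"prompt": problem.skeleton, "completion": candidate}
        return None  # unsolved at the current model's ability; drop it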
As you suggest, this costs lots of time and compute. But it's produced breakthroughs in the past (see AlphaGo Zero self-play) and is now supposedly a standard part of model post-training at the big labs.
That is only true if your path to AGI is to take models similar to current models, and feed them with tons of data.
Advances in architecture and training protocols can and will easily dwarf "more data". I think that is quite obvious from the fact that humans learn to be quite intelligent using only a fraction of the data available to current LLMs. Our advantage is a very good pre-baked model, and feedback-based training.
What makes you think that? Especially given the fact that GI (without the 'A') is evidently very much possible with only a tiny fraction of the "totality of human knowledge".
One and the same. It would be like if I tried to call my product Tactical Software as a Service.
It would still only be software as a service, but I would just brand it in a way to make it more appealing to certain buyer personas without any actual investment or commitment on my part.
Ha ha but you're not wrong. The waterfall methodology — to the extent that it ever existed as a real thing rather than a strawman for agile consultants to criticize — was originally defined to produce predictable results for complex defense software projects. It actually sort of worked some of the time.