geetee's comments | Hacker News

Title should be "AI labs are raping the planet"

Now I'm curious... How the hell do you synchronize clocks to such extreme accuracy? Anybody have a good resource before I try to find one myself?

Look up PTP (Precision Time Protocol) and its White Rabbit extension.
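For a rough feel of how PTP-style sync works, here's a minimal sketch of the standard two-way exchange used to estimate clock offset and path delay (the timestamp values are made up; White Rabbit layers synchronous Ethernet and phase measurement on top of this to reach sub-nanosecond accuracy):

    # Minimal sketch of the PTP two-way time-transfer math (illustrative values).
    # t1: master sends Sync, t2: slave receives it,
    # t3: slave sends Delay_Req, t4: master receives it.
    t1, t2, t3, t4 = 1000.000, 1000.140, 1000.500, 1000.610

    # Assumes a symmetric path; path asymmetry is the main residual error source.
    offset = ((t2 - t1) - (t4 - t3)) / 2   # how far the slave clock is ahead of the master
    delay  = ((t2 - t1) + (t4 - t3)) / 2   # one-way path delay estimate
    print(f"offset={offset:+.3f}, delay={delay:.3f}")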

Thank you!

Is the author's point that every business should be an employee-owned cooperative? How do we get there?


Aren't many or even most of the startups essentially "employee owned"? I've heard that the average employee at NVidia is worth $25m.


So roughly $900B total among its ~36,000 employees, against a $4.3 trillion market cap. Or roughly 20% employee owned (if that $25m number is correct).
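Back-of-the-envelope check (treating both the $25m figure and the ~36,000 headcount as rough, unverified inputs):

    # Quick sanity check on the arithmetic above (rough/unverified inputs)
    employees = 36_000
    per_employee = 25e6          # claimed average holding, in dollars
    market_cap = 4.3e12          # approximate market cap, in dollars
    total = employees * per_employee   # ~9.0e11, i.e. ~$900B
    print(total / market_cap)          # ~0.21, i.e. roughly 20%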

This article (warning: obnoxious ads) is the only one I can find that claims to know who Nvidia’s shareholders are, and it puts the number at 4.3%

https://capital.com/en-int/analysis/nvidia-shareholder-who-o...

I would not consider that to be employee owned (although I certainly wouldn’t mind the 25m)


I think the phrase "employee owned" means that employees own some of the company.

Do you feel that outside shareholders invalidate the claim?


Yes, my definition of employee owned would be “employees own a controlling share of the company”


Quit your job and start a co-op I guess.


Unionize


Sounds very Lumon.


Mysterious AND important.


This is presented as if it's part of something like a terror plot, but my money is on it being related to your car warranty expiring.


Yeah, they are putting two facts together to heavily imply that they are part of a single story, but there is no evidence presented that they are. "UN leaders are gathering!" "There is a huge SIM farm that could disrupt communications!" Both true, but seemingly unrelated. All those car warranty texts have to come from somewhere - this is probably where.


It’s not. The Secret Service has already identified nation state actors as being responsible.


That doesn't mean it wasn't money-making scams. North Korea engages in crypto theft all the time.


I'm sure they'll find someone specific eventually


Exactly. And the whole point of a cellular network architecture is that it's resistant to DoS attacks (what the rubes call "unexpectedly heavy usage"). Sure, you can take a cell out with a hundred fake phones, and all the users in that cell will hop to the next one. Or at worst walk a block over to find another. The attack doesn't scale, at all.

And even if you wanted to deploy custom hardware to do it, it would be far easier to just use a high power jammer on the band anyway than mucking around with all those SIMs.

These are for making actual use of the telecom facilities at scale, with the anonymity you get from burner SIMs. It's fraud, not terrorism.


Some parts of it are (DoS resistant). And some carriers are more resistant than others. Verizon's CDMA from the 90s / early 2000s was NOTORIOUS for falling over when too many people texted at the same time. But yeah, it's been a while since things were that bad.


Yes, they were using these to commit crimes, and will miss them.


Wait. What? My car warranty is expiring? If only there were some way to get more information and perhaps extend it ...


I wonder if the author could have replicated the CouchDB database locally to make their life easier.
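For what it's worth, CouchDB has a built-in replication endpoint, so pulling a remote database down to a local instance is basically one request. A rough sketch (the source URL, database names, and credentials here are placeholders, not the author's actual setup):

    # Hypothetical example: pull a remote CouchDB database into a local one.
    # Uses CouchDB's standard /_replicate API.
    import requests

    resp = requests.post(
        "http://localhost:5984/_replicate",
        json={
            "source": "https://example.org/somedb",   # placeholder remote DB
            "target": "somedb-local",
            "create_target": True,
        },
        auth=("admin", "password"),                    # placeholder credentials
    )
    print(resp.status_code, resp.json())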


I don't understand why we need more data for training. Assuming we've already digitized every book, magazine, research paper, newspaper, and other forms of media, why do we need this "second internet?" Legal issues aside, don't we already have the totality of human knowledge available to us for training?


The goal/theory behind the LLM investment explosion is that we can get to AGI by feeding them all the data. And to be clear, by AGI I don't mean "superhuman singularity", just "intelligent enough to replace most humans" (and, by extension, hoover up all the money we're spending on their salaries today).

But if we've already fed them all the data, and we don't have AGI (which we manifestly don't), then there's no way to get to AGI with LLMs and the tech/VC industry is about to have a massive, massive problem justifying all this investment.


That's only true if model intelligence is the limiting factor. I don't believe it is. Right now the limiting factor for AI impact is mostly the effort required to wire it up to all the systems we want it to automate.


We don't have anything close to the totality of human knowledge digitized, much less in a form that LLMs can easily take advantage of. Even for easily verifiable facts powering modern industry, details like appropriate lube/speeds/etc for machining molybdenum for this or that purpose just don't exist outside of the minds of the few people who actually do it. Moreover, _most_ knowledge is similarly locked up inside a few people rather than being written down.

Even when written down, without the ability to interact with and probe the world like you did growing up it's not possible to meaningfully tell the difference between 9/11 hoaxers and everyone else save for how frequent the relative texts appear. They don't have the ability to meaningfully challenge their world model, and that makes the current breadth of written content even less useful than it might otherwise appear.


Let’s keep in mind that we don’t have most of the Renaissance through the early modern period (1400-1800), because it was published in Neo-Latin with older typefaces, and only about 10% of it is even digitized.

We probably don’t have most of the Arabic corpus either — and barely any Sanskrit. Classical Chinese is probably also lacking — only about 1% of it is translated to English.


The volume of text in English and digitized from the past few years dwarfs the volume of Latin text from all time. Unless you are wondering about a very niche historical topic, there’s more written in English than Latin about basically everything.


Well, if you are looking for diversity of perspective, temporal diversity may be valuable.

Marsilio Ficino was hired by the Medici to translate Plato and other classical Greek works into Latin. He directly taught DaVinci, Raphael, Michelangelo, Toscanelli, etc. I mean to say that his ideas and perspectives helped spark the renaissance.

Insofar as we hope for an AI renaissance and not an AI apocalypse, it might benefit us to have the actual renaissance in the training data.


And here you can e.g. find Ficino's correspondence translated into English, with commentary, https://archive.org/details/lettersofmarsili0000fici

If you make a cursory search you can also find other translations of his works, various biographies, and a wide range of commentary and criticism by later authors.

Many of Ficino's originals are also in the corpus of scanned and OCRed or recently republished texts. I'm sure there are archives here or there with additional materials which have not been digitized, but it seems questionable whether those would make any significant difference to a process as indiscriminate and automatic as LLM training.


Yes, but many of his books are not translated or OCR'd. For instance, La pestilenzia or De mysteriis.

And he is one of the most central figures of the Renaissance. Less than 20% of Neo-Latin has been digitized, let alone translated.

It is fine to question whether including Neo-Latin, Arabic, or Sanskrit in AI training will make AI better.

But for me, it is a core part of humanism that would be a shame to neglect.


Consiglio contro la pestilenza was apparently written in the Florentine language. You can find a nice scan at https://archive.org/details/ita-bnc-in1-00000486-001/ and the corresponding mediocre OCR at https://archive.org/stream/ita-bnc-in1-00000486-001/ita-bnc-... (using better OCR software could give a version with few errors; I don't think anyone has produced a carefully checked digital text). There's some discussion at https://www.jstor.org/stable/40606241 as well as plenty of other commentary around. If you are trying to figure out specific details about this book, you should just check the book. I wouldn't expect adding a carefully produced copy to an LLM training corpus would make that much difference unless you have niche questions about it.
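If anyone wants to try the "better OCR software" route on that scan, here's a minimal sketch using Tesseract via pytesseract (you'd first have to pull page images out of the Archive.org scan; "ita" assumes the Italian language pack is installed and is only an approximation for a 15th-century Florentine text and typeface):

    # Rough sketch: re-OCR a downloaded page image with Tesseract's Italian model.
    # Assumes tesseract plus the 'ita' traineddata are installed locally.
    from PIL import Image
    import pytesseract

    page = Image.open("page_001.png")          # placeholder filename
    text = pytesseract.image_to_string(page, lang="ita")
    print(text[:500])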


Don't most models learn from different language sets already?


I interpreted it as a roundabout way of increasing quality. Take any given subreddit. You have posts and comments, and scores, but what if the data quality isn't very good overall? What if, instead of using it as is, you had an AI evaluate and reason about all the posts, and classify them itself based on how useful the posts and comments are, how well they work out in practice (if easily simulated), etc.? Essentially you're using the AI to provide a moderated and carefully curated set of information about the information that was already present. If you then ingest this information, does that increase the quality of the data? Probably(?), since you're spending compute and AI reasoning on the problem ahead of time, adding high-quality data that dilutes the low-quality data.
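A hand-wavy sketch of what that kind of LLM-driven curation could look like (score_post is a stand-in for whatever model call you'd actually make; nothing here reflects a real pipeline):

    # Illustrative only: filter a dump of posts by an LLM-assigned quality score.
    # score_post() is a hypothetical wrapper around whatever model/API you use.
    def score_post(post: dict) -> float:
        """Ask a model to rate usefulness/accuracy on a 0-1 scale (stubbed)."""
        raise NotImplementedError("call your LLM of choice here")

    def curate(posts: list[dict], threshold: float = 0.7) -> list[dict]:
        curated = []
        for post in posts:
            score = score_post(post)
            if score >= threshold:
                # Keep the post, plus the model's judgment as extra training signal.
                curated.append({**post, "quality_score": score})
        return curated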


The point is that current methods can't get more than the current state-of-the-art models' level of intelligence out of training on the totality of human knowledge. Previously, the amount of compute needed to process that much data was a limit, but not anymore.

So now, in order to progress further, we either have to improve the methods, or synthetically generate more training data, or both.


What does synthetic training data actually mean? Just saying the same things in different ways? It seems like we're training in a way that's just not sustainable.


One example: when we want to increase performance on a task which can be automatically verified, we can often generate synthetic training data by having the current, imperfect models attempt the task lots of times, then pick out the first attempt that works. For instance, given a programming problem, we might write a program skeleton and unit tests for the expected behavior. GPT-5 might take 100 attempts to produce a working program; the hope is that GPT-6 would train on the working attempt and therefore take far fewer attempts to solve similar problems.

As you suggest, this costs lots of time and compute. But it's produced breakthroughs in the past (see AlphaGo Zero self-play) and is now supposedly a standard part of model post-training at the big labs.
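A stripped-down sketch of that "generate many, keep what passes the tests" loop (generate_candidate stands in for a call to the current model; the test harness is just Python's exec for illustration, not how the labs actually sandbox this):

    # Illustrative rejection-sampling loop for synthetic code training data.
    # generate_candidate() is a hypothetical wrapper around the current model.
    def generate_candidate(prompt: str) -> str:
        raise NotImplementedError("sample a program from the current model here")

    def make_training_example(prompt: str, test_code: str, max_attempts: int = 100):
        for _ in range(max_attempts):
            program = generate_candidate(prompt)
            namespace = {}
            try:
                exec(program, namespace)     # define the candidate solution
                exec(test_code, namespace)   # run the unit tests; raises on failure
            except Exception:
                continue                     # failed attempt: discard it
            return {"prompt": prompt, "completion": program}  # verified example
        return None                          # nothing passed; skip this problem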


The totality of human knowledge is a rounding error to what’s needed for AGI


That is only true if your path to AGI is to take models similar to current models, and feed them with tons of data.

Advances in architecture and training protocols can and will easily dwarf "more data". I think that is quite obvious from the fact that humans learn to be quite intelligent using only a fraction of the data available to current LLMs. Our advantage is a very good pre-baked model, and feedback-based training.


What makes you think that? Especially given the fact that GI (without the 'A') is evidently very much possible with only a tiny fraction of the "totality of human knowledge".


A lot of newspapers seem to be stuck behind paywalls, even when in the public domain.


Reality distortion, or they're just using military terminology?


One and the same. It would be like if I tried to call my product Tactical Software as a Service

It would still only be software as a service, but I would just brand it in a way to make it more appealing to certain buyer personas without any actual investment or commitment on my part.


What's wrong with that?


Nothing, their argument is that it’s not worth adopting Palantir’s marketing wording.


When applying strange military terminology to something clearly non-military, is that not a distortion of reality?


I mean, let's be realistic... should they just use an Excel spreadsheet?


Back in my day we killed people using the waterfall method and we liked it.


Ha ha but you're not wrong. The waterfall methodology — to the extent that it ever existed as a real thing rather than a strawman for agile consultants to criticize — was originally defined to produce predictable results for complex defense software projects. It actually sort of worked some of the time.


It's more like $0.30+ when you account for delivery, which costs more than supply. And then some fees on top of that.


Thanks. When costs in different regions are compared, I thought it was customary to include only generation, not delivery.

