This doesn't seem like an article that was written with proper research or sincerity.
The claim is that there is a million times more data to feed to LLMs, citing a few articles. Those articles estimate that there are 180-200 zettabytes (the number mentioned in TFA) of data in the world in total, including all cloud services, all personal computers, and so on. The vast majority of that data is not useful for training LLMs at all: it is movies, games, databases. There is a massive amount of duplication in it. Only a tiny fraction will be anything useful.
> Think of today’s AI like a giant blender where, once you put your data in, it gets mixed with everyone else’s, and you lose all control over it. This is why hospitals, banks, and research institutions often refuse to share their valuable data with AI companies, even when that data could advance critical AI capabilities.
This is not the reason; the reason is that this data is private. LLMs do not just learn from data, they can often reproduce it verbatim. You cannot feed in the medical or bank records of real people; that would put them at very real risk.
Not to mention that a lot of it will be well-structured, yes, but completely useless for LLM training. You will not get any improvement in the perceived "intellect" of a model by overfitting it on terabytes of bank transaction tables.
"This is not the reason, the reason is that this data is private. LLMs do not just learn from data, they can often reproduce it verbatim, you cannot give medical records or bank records of real people, that will put them at a very real risk."
(OP) You make great points. I think we're actually more in agreement than might be obvious. Part of the reason you need to "give" data to an LLM is because of the way LLMs are constructed... which creates the privacy risk.
The attribution-based control suggested in this article would break that constraint, enabling each data owner to control which AI predictions their data makes more intelligent (as opposed to only controlling which AI models it helps train).
So to your point... this is a very rigorous privacy protection. Another way to TLDR the article is "if we get really good at privacy... there's a LOT more data out there... so let's start really caring about privacy"
Anyway... I agree with everything in your comment. Just thought I'd drop by and try to lend clarity to how the article agrees with you (sounds like there's room for improvement on how to describe attribution-based control though).
> Written in Zig for optimal performance and memory safety
Ironic to see this, as even a cursory glance over the code shows a lot of memory bugs: dangling pointers, data races, and potential double frees.
It is good that the OP is learning things, but I would caution against relying on LLMs and taking on a bigger project like this before the basics are well understood.
Personally, this is one of the reasons I dislike the LLM hype, people are enabled to produce much more code, code that they aren't qualified to support or even understand.
While the linked project is clearly designated as strictly for "learning purposes", the applications we will get at large will be of no better quality.
The difference is that before LLMs, those who lacked the qualifications wouldn't even approach problems like this; now they can vibecode something that works on a lucky run but is otherwise completely inadequate.
There is no such question when using D2 either. It was only an issue with D1, which was discontinued almost 15 years ago and had been irrelevant for even longer.
This isn't mentioned anywhere on the page, but fork is generally not a great API for these kinds of things. In a multi-threaded application, any code between the fork and exec syscalls should be async-signal-safe. Since the memory is replicated in full at the time of the call, the current state of mutexes is replicated as well, and if some thread was holding one of them at that moment, there is a risk of deadlock. A simple print! or anything that allocates memory can lead to a freeze. There is also the issue of user-space buffers: printing something may write to a user-space buffer that, if not flushed, will be lost after the callback completes.
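To make the deadlock hazard concrete, here is a minimal sketch (my own illustration, not code from the page) of a multi-threaded Rust program calling libc::fork directly, assuming the libc crate is available. Whether it actually hangs depends on the allocator and libc (glibc registers atfork handlers that mitigate the malloc case), but POSIX gives no guarantee that the child can safely allocate or print here:

    use std::thread;
    use std::time::Duration;

    fn main() {
        // Background thread that allocates in a tight loop, so it frequently
        // holds the global allocator's internal lock.
        let _worker = thread::spawn(|| loop {
            let v = vec![0u8; 4096];
            std::hint::black_box(&v);
        });

        // Give the worker time to start hammering the allocator.
        thread::sleep(Duration::from_millis(50));

        // Only the calling thread survives in the child; any lock held by the
        // worker at this instant is inherited in its locked state.
        let pid = unsafe { libc::fork() };
        if pid == 0 {
            // Child: this allocation and println! (which locks stdout) may
            // block forever if the corresponding lock was held at fork time.
            let buf = vec![1u8; 16];
            println!("child allocated {} bytes", buf.len());
            unsafe { libc::_exit(0) };
        } else {
            // Parent: wait for the child, which may never exit on a bad run.
            let mut status = 0;
            unsafe { libc::waitpid(pid, &mut status, 0) };
        }
    }

The safe pattern is to do nothing between fork and exec beyond async-signal-safe calls, or to avoid raw fork entirely in favour of posix_spawn.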