Probably overkill for content moderation, I'd think. You can identify bad words looking only at audio, and you can probably do nearly as good a job of identifying violence and nudity by examining still images. And at YouTube scale, I imagine the main problem with moderation isn't so much being correct as scaling. statista.com (what's up with that site, anyway?) suggests that YouTube adds something like 8 hours of video per second. I didn't run the numbers, but I'm pretty sure that's way too much to cost-effectively throw something like Gemini Pro at.
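Running the numbers on that upload rate is quick (a back-of-envelope sketch in Python; the 8 hours/second figure is the statista number quoted above, everything else is just arithmetic):

    # Back-of-envelope: annual upload volume at ~8 hours of video per second.
    SECONDS_PER_YEAR = 60 * 60 * 24 * 365        # ~31.5 million
    UPLOAD_RATE_HOURS_PER_SEC = 8                # statista figure quoted above

    hours_per_year = UPLOAD_RATE_HOURS_PER_SEC * SECONDS_PER_YEAR
    print(f"{hours_per_year:,} hours of video uploaded per year")  # ~252,288,000

So roughly a quarter of a billion hours of new video every year would need to go through the model.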
Or... Google supplies some kind of local LLM tool that processes your videos before they're uploaded. You pay for the GPU/electricity costs. Obviously this would need to be done in a way that can't be hacked/manipulated. It might need to be tightly integrated with a backend service that manages the analyzed frames from the local machine and verifies hashes/tokens after the video is fully uploaded to YouTube.
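Something like the following handshake, maybe (a rough sketch, not anything that exists in the YouTube API; every name here is hypothetical): the local tool fingerprints the exact frames it analyzed, and the backend re-extracts and re-hashes those frames after the upload completes, trusting the local verdict only if the fingerprints match.

    import hashlib

    def fingerprint_frames(frames: list[bytes]) -> str:
        # Hash the raw frame bytes the local model actually analyzed.
        h = hashlib.sha256()
        for frame in frames:
            h.update(frame)
        return h.hexdigest()

    # Client side (hypothetical): run the local model, then report its verdict
    # together with a fingerprint of the frames it saw.
    def client_report(frames: list[bytes], local_verdict: dict) -> dict:
        return {"fingerprint": fingerprint_frames(frames), "verdict": local_verdict}

    # Server side (hypothetical): after the full upload arrives, re-extract the
    # same frames, re-hash them, and only accept the client's verdict on a match.
    def server_accepts(report: dict, reextracted_frames: list[bytes]) -> bool:
        return report["fingerprint"] == fingerprint_frames(reextracted_frames)

Of course this only proves the hashed frames match the upload; keeping the local model itself from being tampered with on the user's machine is the genuinely hard part.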
I guess it could also be prioritized by views per time period to optimize better. If the video is interesting, people will share it and more views will happen quickly.
People assume that we can scale the capabilities of LLMs indefinitely; I, on the other hand, strongly suspect we are getting close to diminishing-returns territory.
There's only so much you can do by guessing the next probable token in a stream. We will probably need something else to achieve what people think will soon be done with LLMs.
Much like Elon Musk is probably realizing that computer vision alone is not enough for full self-driving, I expect we will soon reach the limits of what can be done with LLMs.
Content moderation is one of the hardest tasks we have at hand: we're burning through human souls who look at god-awful stuff and lose their sanity, because simple filters just won't cut it.
For instance, right now many rules exclude all nudity and the false-positive rate is through the roof, while some of that nudity should actually be allowed; the rule itself is harmful and should ideally be changed.
Even with our current simplistic rules I don't see automatic filters doing their job ("let me talk to a human" is our collective cry for help). With more sensible rules ("nudity is OK when not sexualized, but not of minors, except for babies, if the viewer's country allows for it"), I assume the resources and tuning needed to make that work in an automated system would be of epic scale.
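Even that one "more sensible" rule turns into a surprisingly gnarly decision function once written down (a toy sketch; all fields and the age cutoff are invented for illustration):

    from dataclasses import dataclass

    @dataclass
    class Clip:
        has_nudity: bool
        is_sexualized: bool
        subject_age: int                  # estimated, itself an error-prone signal
        viewer_country_allows_nudity: bool

    def nudity_allowed(clip: Clip) -> bool:
        # Toy encoding of: "nudity is OK when not sexualized, but not of minors,
        # except for babies, if the viewer's country allows for it".
        if not clip.has_nudity:
            return True
        if not clip.viewer_country_allows_nudity:
            return False
        if clip.is_sexualized:
            return False
        if 2 < clip.subject_age < 18:     # minors, except babies (cutoff invented)
            return False
        return True

And every one of those booleans is itself a hard perception problem with its own false-positive rate, which is where the epic-scale tuning would go.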
That’s only 8 calls with a full context window per second. If that costs so much it makes Google do a double take, then maybe these AI things are just too expensive.
If it costs $1 per call, then over a year the entire perfect moderation of YouTube would cost roughly $250M. That seems sort of reasonable?
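The arithmetic behind that figure (assuming one $1 call per hour of uploaded video; the $1 is a placeholder, not a published Gemini price):

    CALLS_PER_SECOND = 8                  # one full-context call per hour of uploaded video
    SECONDS_PER_YEAR = 60 * 60 * 24 * 365
    COST_PER_CALL = 1.00                  # assumed, not an actual price

    annual_cost = CALLS_PER_SECOND * SECONDS_PER_YEAR * COST_PER_CALL
    print(f"${annual_cost:,.0f} per year")   # ~$252,288,000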
But it's probably pointless for most videos that are never watched by anyone other than the uploader, so maybe you only run it before anyone else watches the video and cut your costs by 50+%.
They do “moderate” videos never watched by anyone and it can be totally ridiculous. I had a private channel where I had uploaded a few hundred screen recordings (some of them video conferences) over a year or two, all set to private and never shared with anyone. One day the channel was suddenly taken down because it violated their policy on “impersonation”… Of course the dispute I’m allegedly entitled to was never answered.
I have no idea how YouTube currently moderates its content, but there may be some benefit with Gemini. I'm sure Googlers have been considering this option.