1. Predicting the next frame is easy, but predicting multiple frames is not.
2. What works for text doesn't work for video.
Then Sora comes out and shows multiple frames, and someone tweets "gotcha."
He then tweets, without admitting he misspoke, and goes on about how the model doesn't understand physics.
And he says his project, V-JEPA, is the best.
He keeps saying stuff about "sucks as a mental model" but doesn't say why that would not apply to text.
https://twitter.com/ylecun/with_replies
Me: If text doesn't need a mental model, I see no reason video needs it. His argument sucks or is badly worded.