GitHub is overflowing with Tencent, Alibaba, and Ant Group models, typically licensed under Apache 2.0 and replete with pretrained weights and fine-tuning scripts.
Yes, LLaVA-style models can encode images, and you can also encode an image into the 3D VAE's latent space. Without fine-tuning the model, though, you either lose fidelity to the original (if you only use LLaVA's SigLIP encoder) or end up with video that has limited motion (using the 3D-VAE-encoded latent as the first frame and then doing vid2vid).
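To make the second approach concrete, here is a rough sketch of the latent shape arithmetic for a HunyuanVideo-style causal 3D VAE. The compression factors (8x spatial, 4x temporal, 16 latent channels) match what these video VAEs typically use, but treat the exact numbers and the function itself as illustrative assumptions, not the pipeline's actual API:

```python
def latent_shape(frames, height, width,
                 t_stride=4, s_stride=8, latent_ch=16):
    """Latent shape produced by a causal 3D VAE for `frames` RGB frames.

    Because the VAE is causal in time, the first frame is encoded on its
    own: a single still image maps to exactly one latent frame. That is
    what lets an img2vid pipeline seed generation with an encoded image
    as the first-frame latent, then run vid2vid on the rest.
    """
    t = (frames - 1) // t_stride + 1          # causal temporal downsampling
    return (latent_ch, t, height // s_stride, width // s_stride)

# A single 720x1280 image becomes one latent frame:
print(latent_shape(1, 720, 1280))    # (16, 1, 90, 160)
# A 129-frame clip compresses to 33 latent frames:
print(latent_shape(129, 720, 1280))  # (16, 33, 90, 160)
```

The catch mentioned above: the denoiser was never trained to treat that lone first-frame latent as a hard constraint, so without fine-tuning it tends to keep the frame nearly static rather than animate it.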
https://aivideo.hunyuan.tencent.com/