GitHub is overflowing with Tencent, Alibaba, and Ant Group models, typically licensed under Apache 2.0 and replete with pretrained weights and fine-tuning scripts.
Yes, LLaVA-style models can encode images, and you can also encode an image into the 3D VAE's latent space. Without fine-tuning the model, though, you either lose fidelity to the original (if you only use LLaVA's SigLIP encoder) or end up with video that has limited motion (using the 3D-VAE-encoded latent as the first frame and then doing vid2vid).
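To make the second approach concrete, here is a rough sketch of the latent shape arithmetic for a HunyuanVideo-style causal 3D VAE. The compression factors (8x spatial, 4x temporal, 16 latent channels) match what these video VAEs typically use, but treat the exact numbers and the function itself as illustrative assumptions, not the pipeline's actual API:

```python
def latent_shape(frames, height, width,
                 t_stride=4, s_stride=8, latent_ch=16):
    """Latent shape produced by a causal 3D VAE for `frames` RGB frames.

    Because the VAE is causal in time, the first frame is encoded on its
    own: a single still image maps to exactly one latent frame. That is
    what lets an img2vid pipeline seed generation with an encoded image
    as the first-frame latent, then run vid2vid on the rest.
    """
    t = (frames - 1) // t_stride + 1          # causal temporal downsampling
    return (latent_ch, t, height // s_stride, width // s_stride)

# A single 720x1280 image becomes one latent frame:
print(latent_shape(1, 720, 1280))    # (16, 1, 90, 160)
# A 129-frame clip compresses to 33 latent frames:
print(latent_shape(129, 720, 1280))  # (16, 33, 90, 160)
```

The catch mentioned above: the denoiser was never trained to treat that lone first-frame latent as a hard constraint, so without fine-tuning it tends to keep the frame nearly static rather than animate it.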
https://aivideo.hunyuan.tencent.com/