Yes, the Llava model can encode images, and you can also encode an image into the 3D VAE latent space. Without fine-tuning the model, though, neither route works well. If you only use Llava's SigLIP to encode the image, the output will not stay faithful to the original. If you use the 3D VAE-encoded latents as the first frame and then run vid2vid, you end up with a video that has very limited motion.
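
To make the limited-motion point a bit more concrete, here is a toy sketch of how a vid2vid run might be initialized from a single encoded frame. Everything in it is an assumption for illustration: the function names are hypothetical, the latent is a short list of floats rather than a real 3D VAE output, and the linear noise mix is a simplified flow-matching-style blend, not the actual sampler code.

```python
import random


def init_vid2vid_latents(first_frame_latent, num_frames, strength, seed=0):
    """Tile one frame's latent across time and mix in Gaussian noise.

    strength=0.0 keeps every frame identical to the input frame;
    strength=1.0 replaces the input with pure noise.
    (Hypothetical helper, not part of any library.)
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be in [0, 1]")
    rng = random.Random(seed)
    frames = []
    for _ in range(num_frames):
        # Assumed linear blend between the clean latent and noise.
        frame = [(1.0 - strength) * z + strength * rng.gauss(0.0, 1.0)
                 for z in first_frame_latent]
        frames.append(frame)
    return frames


def mean_frame_difference(frames):
    """Average absolute difference between consecutive frames (crude motion proxy)."""
    total, count = 0.0, 0
    for prev, cur in zip(frames, frames[1:]):
        for a, b in zip(prev, cur):
            total += abs(a - b)
            count += 1
    return total / count


if __name__ == "__main__":
    latent = [0.5, -1.0, 2.0]  # stand-in for a 3D VAE latent of one image
    for s in (0.2, 0.8):
        frames = init_vid2vid_latents(latent, num_frames=4, strength=s)
        print(s, mean_frame_difference(frames))
```

With a low strength the starting frames are nearly copies of the input image, so the sampler has little room to introduce motion. Raising the strength gives it more room but moves the result away from the original image, which is the same trade-off described above.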