Tools are still evolving out of the VLM/LLM split [0]. Image-to-image tasks are so variable in quality, and so much worse than text-to-image tasks, because an entirely separate model is trained to transform the input image into tokens in the LLM's vector space.

The naive approach, which gets you results like ChatGPT's, is to produce output tokens based on the prompt and generate a new image from that output. It is really difficult to preserve details from the input image this way.

A more advanced approach is to generate a stream of "edits" to the input image instead. You see this with Gemini, which sometimes maintains original image details to a fault; e.g. it will preserve human faces at all costs, probably as a result of training.
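
To make the difference between the two approaches concrete, here is a toy sketch. It is not how ChatGPT or Gemini actually work: the "image" is just a small grid of integers, and `regenerate`, `apply_edits`, and `changed_pixels` are hypothetical names made up for illustration. Quantization stands in for whatever detail gets lost on the way through tokens and back.

```python
# Toy illustration only: an "image" is a grid of integers, and the functions
# below are hypothetical stand-ins, not any real model's API.
from typing import List, Tuple

Image = List[List[int]]
Edit = Tuple[int, int, int]  # (row, col, new_value)


def regenerate(source: Image, lossy_levels: int = 4) -> Image:
    """Naive path: squeeze the image into a coarse form, then redraw all of it.

    Quantizing each pixel is an assumed stand-in for the detail lost when an
    image goes through a token representation and is generated again.
    """
    step = 256 // lossy_levels
    return [[(px // step) * step for px in row] for row in source]


def apply_edits(source: Image, edits: List[Edit]) -> Image:
    """Edit path: copy the input and change only the pixels the edits name."""
    result = [row[:] for row in source]
    for r, c, value in edits:
        result[r][c] = value
    return result


def changed_pixels(a: Image, b: Image) -> int:
    """Count the positions where two same-sized images differ."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


original = [[10, 200, 37], [90, 255, 128], [5, 64, 190]]
regenerated = regenerate(original)           # every pixel gets redrawn
edited = apply_edits(original, [(1, 1, 0)])  # one requested change

print(changed_pixels(original, regenerated))  # 7
print(changed_pixels(original, edited))       # 1
```

The regenerated grid drifts almost everywhere even though nothing asked for those changes, while the edit path touches only what it was told to. The flip side is the "to a fault" behavior: anything no edit names stays exactly as it was.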

I think the round trip through SVG is extremely hard to train through and essentially forces the LLM to edit the SVG source progressively, which can end up looking like the Gemini approach above.

[0]: https://www.groundlight.ai/blog/how-vlm-works-tokens