Very cool! I don't have this pain point currently, but I can absolutely see the utility. I like the built-in demo tool (although it sadly means you have no need for DemoTime lol).
The demo.camelqa page needs some styling; I would invest a few minutes here. Maybe a loading spinner too if you're expecting 15-second latency.
Technically, is this doing clever things with markup, or literally just feeding the image into a multimodal LLM and getting function calls in response?
Thanks for the feedback! We'll add some styling to the demo page. We're processing the image with an object detection model and a classification model, and also using accessibility element data to get a better understanding of what is interactive on the screen.
GPT-4V is great for reasoning about what is on the screen. However, it struggles with precision. For example, when it decides to tap an icon, it can't specify the coordinates to tap. That's where the object detection and accessibility elements help: they let us precisely locate interactive elements.
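To make the division of labor concrete, here's a rough sketch of the idea (not our actual code; the element fields, the IoU threshold, and the match-by-label step are simplifying assumptions): the detector and accessibility data give us labeled bounding boxes, the LLM only has to name which element to act on, and we resolve that to exact tap coordinates.

```python
# Rough sketch only -- field names, the IoU threshold, and the label-matching
# step are simplifying assumptions, not the production pipeline.
from dataclasses import dataclass

@dataclass
class ScreenElement:
    label: str                               # e.g. "Settings icon"
    box: tuple[float, float, float, float]   # (x1, y1, x2, y2) in screen pixels
    source: str                              # "detector" or "accessibility"

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def merge_elements(detected, accessible, threshold=0.5):
    """Prefer accessibility elements (they carry semantic labels) and keep
    only the detector boxes that don't overlap any accessibility element."""
    merged = list(accessible)
    merged += [d for d in detected
               if all(iou(d.box, a.box) < threshold for a in accessible)]
    return merged

def tap_coordinates(llm_choice, elements):
    """The LLM names *which* element to tap; the element's bounding box
    supplies the precise (x, y) to send to the device."""
    for el in elements:
        if el.label.lower() == llm_choice.lower():
            x1, y1, x2, y2 = el.box
            return int((x1 + x2) / 2), int((y1 + y2) / 2)
    raise ValueError(f"No on-screen element labeled {llm_choice!r}")

# e.g. GPT-4V decides "tap the Settings icon":
#   x, y = tap_coordinates("Settings icon", merge_elements(detected, accessible))
```

The point is that the LLM handles the "what to do" reasoning while the detector and accessibility data handle the "where exactly" precision.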
Thanks! Yes, we experimented with that! I think that because GPT sees images in patches, it has a hard time with absolute positioning, but that's just a guess.