
Look up VLA (vision-language-action) models; they essentially plug the guts of a language model into a transformer that also handles vision and joint motion. They get trained on "episodes", i.e. videos from the point of view of a robot doing a task. After training, you can give the model instructions like "pick up the red ball and put it into the green cup". Really cool stuff.
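
To give a feel for how that works at inference time, here's a rough Python sketch of the observe-instruct-act loop. The VLAPolicy, camera, and robot objects are hypothetical placeholders for illustration, not any particular library's API:

    # Sketch of a VLA control loop; all interfaces below are hypothetical.
    import numpy as np

    class VLAPolicy:
        """Stand-in for a pretrained vision-language-action model."""
        def predict_action(self, image: np.ndarray, instruction: str) -> np.ndarray:
            # A real VLA tokenizes the instruction, encodes the camera frame,
            # and decodes a low-level action (e.g. end-effector deltas plus
            # gripper open/close) from the same transformer.
            raise NotImplementedError

    def run_episode(policy, camera, robot, instruction, max_steps=200):
        """Closed-loop control: observe, ask the model for an action, act."""
        for _ in range(max_steps):
            image = camera.read()                    # robot's point-of-view frame
            action = policy.predict_action(image, instruction)
            robot.apply_action(action)               # e.g. joint/gripper command
            if robot.task_done():
                break

    # run_episode(VLAPolicy(), camera, robot,
    #             "pick up the red ball and put it into the green cup")

During training the same model sees recorded episodes (frames paired with the actions the demonstrator took), so at test time a new instruction just conditions the action decoding.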


