One issue with that is that the model may learn to smuggle data: you as a human assume the plain reading of the words is what's doing the reasoning, but (part of) the processing is actually carried by exact comma placement, synonym choice, and so on.
Data smuggling is a known phenomenon in similar tasks.
I don't think data smuggling is relevant in STaR-style scenarios. You're still validating the final output; if it works on test data, what could even be smuggled?
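To make "validating the final output" concrete, here's a minimal sketch of the kind of STaR-style filtering step being referred to; the function and parameter names are placeholders I'm introducing for illustration, not any particular library's API. Sample a chain of thought, keep it only if the final answer checks out, and fine-tune on the survivors.

```python
# Minimal sketch of a STaR-style rejection-sampling/filtering step.
# `sample` and `check` are hypothetical callables standing in for
# "generate a rationale + answer" and "verify the answer against the label".

from typing import Callable, Iterable, List, Tuple


def star_filter(
    sample: Callable[[str], Tuple[str, str]],   # problem -> (rationale, answer)
    check: Callable[[str, str], bool],          # (answer, gold label) -> correct?
    problems: Iterable[Tuple[str, str]],        # (problem, gold label) pairs
    attempts_per_problem: int = 4,
) -> List[Tuple[str, str, str]]:
    """Keep (problem, rationale, answer) triples whose final answer checks out.

    Only the final answer is checked; the rationale text itself is never
    inspected by this loop.
    """
    kept: List[Tuple[str, str, str]] = []
    for problem, label in problems:
        for _ in range(attempts_per_problem):
            rationale, answer = sample(problem)
            if check(answer, label):
                kept.append((problem, rationale, answer))
                break  # one verified rationale per problem is enough here
    return kept  # the next iteration fine-tunes on these survivors
```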