This already seems to be incorporated into the current LLM generations -- when code execution is enabled, both GPT-5.x and Claude 4.x appear to run Python code on their own to support intermediate reasoning steps.
If you compare the output from a CoT prompt against a control prompt, the current generation of models produces the reasoning steps either way.
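By that comparison I mean something like the sketch below -- it assumes the OpenAI Python client and uses a placeholder model name and a toy question, so treat all of those as illustrative rather than a definitive test harness.

```python
# Rough sketch of the CoT-vs-control comparison, assuming the OpenAI Python client.
from openai import OpenAI

client = OpenAI()

# Toy question chosen for illustration only.
QUESTION = (
    "A bat and a ball cost $1.10 in total. The bat costs $1.00 more "
    "than the ball. How much does the ball cost?"
)

def ask(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-5",  # placeholder model name, not a specific release
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# CoT prompt: explicitly asks for step-by-step reasoning.
cot_output = ask(QUESTION + " Let's think step by step.")

# Control prompt: the bare question, no reasoning instruction.
control_output = ask(QUESTION)

# The point of the comparison: with current models, both outputs tend to
# contain the intermediate reasoning, so the CoT instruction adds little.
print("--- CoT prompt ---\n", cot_output)
print("--- Control prompt ---\n", control_output)
```

Eyeballing the two outputs side by side is usually enough to see whether the reasoning steps show up in the control condition as well.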