OpenAI does not provide many details about their models these days, but they do mention that "Advanced voice" within ChatGPT operates directly on audio input:
> Advanced voice uses natively multimodal models, such as GPT-4o, which means that it directly “hears” and generates audio, providing for more natural, real-time conversations that pick up on non-verbal cues, such as the speed you’re talking, and can respond with emotion.
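The Advanced Voice experience itself isn't exposed to developers, but the closest analogue is the audio-capable GPT-4o variant in the Chat Completions API, which accepts audio input and can return spoken audio plus a transcript. Here's a minimal sketch of that round trip, assuming the `gpt-4o-audio-preview` model and a local `question.wav` file (both are my choices for illustration, not something from the quote above):

```python
import base64
from openai import OpenAI

client = OpenAI()

# Read a short audio clip and base64-encode it for the API
with open("question.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode("utf-8")

completion = client.chat.completions.create(
    model="gpt-4o-audio-preview",        # audio-capable GPT-4o variant (assumed here)
    modalities=["text", "audio"],        # ask for both a transcript and spoken audio back
    audio={"voice": "alloy", "format": "wav"},
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Please answer the question in this recording."},
                {
                    "type": "input_audio",
                    "input_audio": {"data": audio_b64, "format": "wav"},
                },
            ],
        }
    ],
)

# The response carries both a text transcript and base64-encoded audio
print(completion.choices[0].message.audio.transcript)
with open("answer.wav", "wb") as f:
    f.write(base64.b64decode(completion.choices[0].message.audio.data))
```

Because the model consumes the audio directly rather than a transcript of it, prosody in the input (pace, tone, hesitation) is in principle available to it, which is what the quote is getting at.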