Definitely possible to run high-quality speech-to-text in realtime in the browser: https://whisper.ggerganov.com/

Made by the same guy who created the popular llama.cpp LLM library. The model uses log-mel spectrograms as input.
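For anyone unfamiliar with the term, here is a rough sketch of how one log-mel frame can be computed: window the audio, take the power spectrum, pool it through triangular filters spaced on the mel scale, then take the log. This is not the whisper.cpp code. The function names, the HTK-style mel formula, and the parameters (16 kHz, 256-sample frame, 20 bands) are my own assumptions, chosen to keep the example small, and a plain DFT is used for brevity.

```python
import cmath
import math

def hz_to_mel(f):
    # HTK-style mel formula (an assumption; other variants exist)
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular filters spaced evenly on the mel scale, one row per band."""
    n_bins = n_fft // 2 + 1
    top = hz_to_mel(sample_rate / 2)
    # n_mels + 2 edge frequencies: each filter spans three consecutive edges
    edges = [mel_to_hz(top * i / (n_mels + 1)) for i in range(n_mels + 2)]
    bin_hz = [k * sample_rate / n_fft for k in range(n_bins)]
    bank = []
    for m in range(n_mels):
        lo, mid, hi = edges[m], edges[m + 1], edges[m + 2]
        bank.append([max(0.0, min((f - lo) / (mid - lo), (hi - f) / (hi - mid)))
                     for f in bin_hz])
    return bank

def log_mel_frame(frame, bank):
    """Log10 mel-band energies for one frame of audio samples."""
    n = len(frame)
    # Hann window to reduce spectral leakage
    windowed = [s * 0.5 * (1.0 - math.cos(2.0 * math.pi * i / n))
                for i, s in enumerate(frame)]
    # Power spectrum via a plain DFT (a real implementation would use an FFT)
    power = [abs(sum(windowed[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                     for t in range(n))) ** 2
             for k in range(len(bank[0]))]
    # Floor the energy so the log stays finite for silent bands
    return [math.log10(max(1e-10, sum(w * p for w, p in zip(row, power))))
            for row in bank]

# Illustrative parameters only: a 1 kHz tone sampled at 16 kHz
SAMPLE_RATE, N_FFT, N_MELS = 16000, 256, 20
bank = mel_filterbank(N_MELS, N_FFT, SAMPLE_RATE)
tone = [math.sin(2.0 * math.pi * 1000.0 * t / SAMPLE_RATE) for t in range(N_FFT)]
features = log_mel_frame(tone, bank)
peak_band = max(range(N_MELS), key=lambda m: features[m])
```

A full spectrogram is just this repeated over overlapping frames, giving one column of band energies per time step.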

Using modern algorithms, the FFT is actually really fast to compute. Its cost is definitely dwarfed by the evaluation of the model itself, even when the model runs with many threads and Wasm SIMD.
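To illustrate why the FFT is cheap, here is a minimal sketch of the textbook radix-2 Cooley-Tukey algorithm, which needs O(n log n) operations where a direct DFT needs O(n^2). It is only an illustration, not what whisper.cpp actually runs; the function names are mine, and the input length is assumed to be a power of two.

```python
import cmath
import math

def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return [complex(x[0])]
    # Split into even- and odd-indexed samples and transform each half
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        # Twiddle factor combines the two half-size transforms
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def naive_dft(x):
    """Direct O(n^2) DFT, used here only as a reference to check the FFT."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

# A sine wave with exactly 4 cycles over 64 samples
signal = [math.sin(2.0 * math.pi * 4 * t / 64) for t in range(64)]
spectrum = fft(signal)
reference = naive_dft(signal)
max_error = max(abs(a - b) for a, b in zip(spectrum, reference))
peak_bin = max(range(32), key=lambda k: abs(spectrum[k]))
```

For a frame of a few hundred samples that works out to a few thousand multiply-adds, which is tiny next to a neural network forward pass.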