Having to wait until the end of the 30 seconds before it starts writing the text is a fundamental limitation of the current transformer algorithm: the first and last bits of speech influence each other, so the whole chunk has to be available before anything can be written.
Yeah, I get that. I tried another open-source voice input that was real-time, and the quality was horrible. But this is something that can be worked around for Whisper. One option that comes to mind is to append and reprocess the audio every few centiseconds (this needs a fairly powerful device, though) and update the text output as needed. This could also open the door to an edit-by-voice feature.
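For what it's worth, here is a rough sketch of that append-and-reprocess idea, not how any existing app does it. The names `streaming_transcribe`, `common_prefix_len`, and the `transcribe` callable are all made up for illustration; `transcribe` stands in for whatever actually runs Whisper on a buffer of samples. The only things taken as given are Whisper's 16 kHz input and 30-second window.

```python
from typing import Callable, Iterable, Iterator, List

SAMPLE_RATE = 16_000    # Whisper works on 16 kHz audio
WINDOW_SECONDS = 30     # Whisper's fixed context window


def streaming_transcribe(
    chunks: Iterable[List[float]],
    transcribe: Callable[[List[float]], str],   # hypothetical: buffer -> text
    window_samples: int = SAMPLE_RATE * WINDOW_SECONDS,
) -> Iterator[str]:
    """Append each new chunk, re-run the model on the whole buffer,
    and yield the revised text for the current window."""
    buffer: List[float] = []
    for chunk in chunks:
        buffer.extend(chunk)
        # Keep only the most recent window; older audio falls off the front.
        if len(buffer) > window_samples:
            buffer = buffer[-window_samples:]
        yield transcribe(buffer)


def common_prefix_len(old: str, new: str) -> int:
    """Number of leading characters the two hypotheses share.
    Everything after that point is what the UI would have to rewrite."""
    n = 0
    for a, b in zip(old, new):
        if a != b:
            break
        n += 1
    return n
```

The UI would keep the shared prefix on screen and only redraw the tail that changed between two passes. Every pass reprocesses up to 30 seconds of audio, which is where the "fairly powerful device" requirement comes from. This sketch also simply forgets text once its audio leaves the window, so a real version would need to commit older text somewhere before trimming.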