stiffler01's comments

stiffler01 · on Jan 30, 2024

That is something we are looking forward to as well. Stay tuned for updates on Jetson support.

We tested it on a 3090 and a 4090, and it works as expected.

stiffler01 · on Jan 29, 2024

WhisperLive supports both TensorRT and faster-whisper. We didn’t reduce the chunk size; rather, we use padding based on the chunk size received from the client. Reducing the segment size should be a more optimised solution in the live scenario.

For streaming we continuously stream audio bytes of fixed size to the server and send the completed segments back to the client while incrementing the timestamp_offset.
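A minimal sketch of the loop described above, not WhisperLive's actual code. It assumes 16 kHz audio, a 30-second model window, and a hypothetical `transcribe` callable that returns `(start, end, text, completed)` tuples with times relative to the start of the buffer. The class and function names are made up for illustration.

```python
SAMPLE_RATE = 16000    # assumed sample rate (samples per second)
WINDOW_SECONDS = 30    # assumed model input window


def pad_to_window(audio, window_samples):
    """Zero-pad (or truncate) a list of samples to exactly window_samples."""
    if len(audio) >= window_samples:
        return audio[:window_samples]
    return audio + [0.0] * (window_samples - len(audio))


class StreamingSession:
    """Buffers fixed-size chunks and emits completed segments (hypothetical)."""

    def __init__(self, transcribe, sample_rate=SAMPLE_RATE):
        # transcribe: hypothetical callable, samples -> [(start, end, text, completed)]
        self.transcribe = transcribe
        self.sample_rate = sample_rate
        self.buffer = []              # audio not yet covered by a completed segment
        self.timestamp_offset = 0.0   # seconds of audio already finalized

    def add_chunk(self, chunk):
        """Append one chunk, run the model, return newly completed segments."""
        self.buffer.extend(chunk)
        window = pad_to_window(self.buffer, WINDOW_SECONDS * self.sample_rate)
        completed = []
        consumed = 0.0
        for start, end, text, done in self.transcribe(window):
            if not done:
                break  # the trailing segment may still change; keep its audio
            # Shift buffer-relative times to absolute stream times.
            completed.append(
                (self.timestamp_offset + start, self.timestamp_offset + end, text)
            )
            consumed = end
        if consumed > 0:
            # Drop the finalized audio and advance the offset by the same amount.
            del self.buffer[: int(round(consumed * self.sample_rate))]
            self.timestamp_offset += consumed
        return completed
```

The client would call `add_chunk` once per received chunk and forward the returned segments. Only audio behind the last completed segment is discarded, so an unfinished segment is re-transcribed with more context on the next call.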

albertzeyer · on Jan 29, 2024

Ah, but that sounds like a very inefficient approach, which probably still has quite high latency, and probably also performs badly in terms of word-error-rate (WER).

But I'm happy to be proven wrong. That's why I would like to see some actual numbers. Maybe it's still okish enough, maybe it's actually really bad. I'm curious. But I don't just want to see a demo or a sloppy statement like "it's working ok".

Note that this is a highly non-trivial problem, to make a streamable speech recognition system with low latency and still good performance. There is a big research community working on just this problem.

I actually have worked on this problem myself. E.g. see our work "Chunked Attention-based Encoder-Decoder Model for Streaming Speech Recognition" (https://arxiv.org/abs/2309.08436), which will be presented at ICASSP 2024. E.g. for a median latency of 1.11 sec, we get a WER of 7.5% on TEDLIUM-v2 dev, which is almost as good as the offline model with 7.4% WER. This is a very good result (only very minor WER degradation). Or with a latency of 0.78 sec, we get 7.7% WER. Our model currently does not work too well when we go to even lower latencies (or the computational overhead becomes impractical).

Or see Emformer (https://arxiv.org/abs/2010.10759) as another popular model.
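For readers unfamiliar with the metric, here is a rough sketch of how WER is usually computed: word-level edit distance divided by the number of reference words. This is a generic illustration, not the scoring tool used for the numbers above, and the function names are made up. The second helper shows what "very minor degradation" means for 7.5% vs. 7.4%.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Standard Levenshtein distance over words, keeping one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # match or substitution
        prev = cur
    return prev[-1] / len(ref)


def relative_degradation(streaming_wer, offline_wer):
    """Relative WER increase of the streaming model over the offline one."""
    return (streaming_wer - offline_wer) / offline_wer


# One deleted word out of six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
# 7.5% vs. 7.4% WER is a relative degradation of roughly 1.4%.
print(relative_degradation(7.5, 7.4))
```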

huac · on Jan 30, 2024

whisper is simply not designed for this, in many ways, and it's impressive engineering to try and overcome its limitations, but I can't help but feel that it is easier to just use an architecture that is designed for the problem.

I was impressed by Kaldi's models for streaming ASR: https://k2-fsa.github.io/sherpa/onnx/pretrained_models/index... ; I suspect that the Nvidia/Suno Parakeet models will also be pretty good for streaming https://huggingface.co/nvidia/parakeet-ctc-0.6b

Oranguru · on Jan 30, 2024

Very interesting. Thanks for the references. Have you released the code or pre-trained models yet or do you plan to do so at some point?

albertzeyer · on Jan 30, 2024

The code is all released already. You can find it here: https://github.com/rwth-i6/returnn-experiments/tree/master/2...

This is TensorFlow-based. But I also have another PyTorch-based implementation already, also public (inside our other repo, i6_experiments). It's not so easy currently to set this up, but I'm working on a simpler pipeline in PyTorch.

We don't have the models online yet, but we can upload them later. But I'm not sure how useful they are outside of research, as they are specifically for those research tasks (Librispeech, Tedlium), and probably don't perform too well on other data.

stiffler01 · on Jan 29, 2024

Indeed a great point. Waiting for a specific cue, before responding, is an interesting idea. It would make the interaction more natural, especially in situations where the user is thinking aloud or formulating their thoughts before seeking the AI's input.

Interruption is something that is already in the pipeline and we are working on it. You should see an update soon.

localhost · on Jan 29, 2024

Thanks! Really looking forward to interruptions.

I think about the cue as kind of being like "Hey Siri/Alexa/Cortana" but in reverse.

stiffler01 · on Jan 29, 2024

We thought about doing this in Whisper itself, since it's already working in the audio space.

stiffler01 · on Jan 29, 2024

Yes, this is something we want to look into in more detail. We really appreciate you sharing the research.

stiffler01 · on Jan 25, 2024

Tried this on 4090 and the responsiveness and real-time communication it offers are truly impressive. It has significantly improved my overall experience, especially in scenarios where minimal delay is crucial.

Compared to WhisperFusion, Rabbit R1 feels like it's stuck in the past; they could maybe use the open-source WhisperFusion.