I can make a blog post later, but at a high level:
A Rust TTS server hosts two models: a mel inference model and a mel inversion model (a vocoder). The ones I'm using are Glow-TTS and MelGAN; they fit together back to back in a pipeline.
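To make the shape of that pipeline concrete, here's a minimal sketch, assuming both models have been exported to TorchScript and are loaded through the tch crate. The file names, the single-tensor interface, and the phoneme IDs are placeholders; the real Glow-TTS and MelGAN entry points take more inputs (lengths, noise scale, etc.) than this.

```rust
// Minimal sketch of the two-stage pipeline: phoneme IDs -> mel -> audio.
// Assumes hypothetical TorchScript exports ("glow_tts.pt", "melgan.pt") and the
// tch crate; the actual model signatures take additional arguments.
use tch::{CModule, Tensor};

fn main() -> Result<(), tch::TchError> {
    let glow_tts = CModule::load("glow_tts.pt")?; // mel inference model
    let melgan = CModule::load("melgan.pt")?;     // mel inversion model (vocoder)

    // Hypothetical phoneme IDs for one utterance, shaped [batch=1, seq_len].
    let phoneme_ids = Tensor::of_slice(&[12i64, 37, 5, 48, 9]).unsqueeze(0);

    // Stage 1: phonemes -> mel spectrogram. Stage 2: mel -> waveform.
    let mel = glow_tts.forward_ts(&[phoneme_ids])?;
    let audio = melgan.forward_ts(&[mel])?;

    println!("generated audio tensor of shape {:?}", audio.size());
    Ok(())
}
```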
I chose these models not for their fidelity but for their performance: they're 10x faster at inference than Tacotron 2. If you want something that sounds amazing, you're better off with a denser stack like Tacotron 2 + WaveGlow; those are the better choice when you can render offline, e.g. for multimedia work.
Instead of using graphemes, I'm using ARPABET phonemes, which I get from Carnegie Mellon's CMUdict pronunciation dictionary. In the future I'll supplement this with a grapheme-to-phoneme model that predicts pronunciations for words missing from the dictionary.
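For illustration, the lookup itself is just a hash map built from the dictionary file. A minimal sketch follows; the file path and the out-of-vocabulary fallback message are placeholders, not the project's actual code.

```rust
// Sketch: parse CMUdict lines like "FLOWER  F L AW1 ER0" into a word -> phonemes map.
use std::collections::HashMap;
use std::fs;

fn load_cmudict(path: &str) -> HashMap<String, Vec<String>> {
    // The dictionary file isn't strictly UTF-8, so decode it lossily.
    let bytes = fs::read(path).expect("failed to read dictionary file");
    let text = String::from_utf8_lossy(&bytes);

    let mut dict = HashMap::new();
    for line in text.lines() {
        // Skip comment lines, which start with ";;;".
        if line.starts_with(";;;") {
            continue;
        }
        let mut parts = line.split_whitespace();
        if let Some(word) = parts.next() {
            let phones: Vec<String> = parts.map(str::to_string).collect();
            dict.insert(word.to_uppercase(), phones);
        }
    }
    dict
}

fn main() {
    let dict = load_cmudict("cmudict-0.7b.txt"); // hypothetical local copy
    match dict.get("FLOWER") {
        Some(phones) => println!("{}", phones.join(" ")), // e.g. "F L AW1 ER0"
        None => println!("OOV word: fall back to a grapheme-to-phoneme model"),
    }
}
```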
Each TTS server hosts only one or two voices due to memory constraints (the models are huge), and the fleet is scaled horizontally. A proxy server sits in front, decodes each request, and directs it to the appropriate backend based on a ConfigMap that associates each service with the underlying model. Kubernetes wires all of this up.
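As a rough sketch of the routing step (the mount path, file format, and service names are assumptions, not the project's actual setup): the ConfigMap can be mounted into the proxy as a plain key=value file mapping each voice to the backend Service that hosts its model, and the proxy resolves the requested voice against that map.

```rust
// Sketch of the proxy's routing decision, assuming the ConfigMap is mounted
// at /etc/tts/routes as "voice=service-host" lines, e.g.:
//   attenborough=tts-attenborough.default.svc.cluster.local
use std::collections::HashMap;
use std::fs;

fn load_routes(path: &str) -> HashMap<String, String> {
    let mut routes = HashMap::new();
    for line in fs::read_to_string(path).unwrap_or_default().lines() {
        if let Some((voice, service)) = line.split_once('=') {
            routes.insert(voice.trim().to_string(), service.trim().to_string());
        }
    }
    routes
}

fn backend_for<'a>(routes: &'a HashMap<String, String>, voice: &str) -> Option<&'a str> {
    routes.get(voice).map(String::as_str)
}

fn main() {
    let routes = load_routes("/etc/tts/routes");
    match backend_for(&routes, "attenborough") {
        Some(host) => println!("forward request to http://{}:8000/synthesize", host),
        None => println!("unknown voice: return 404"),
    }
}
```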
This is incredibly cool. Do you mind sharing how big the models are, and what kind of instances you're deploying them on?
I ask because I help maintain an open source ML infra project ( https://github.com/cortexlabs/cortex ) and we've recently done a lot of work around autoscaling multi-model endpoints. Always curious to see how others are approaching this.
(All the voices use the same MelGAN, or derivations of it.)
I'll edit my post later with my deployment and cluster architecture. In short, the model servers are sharded and fronted by a thin proxy microservice at the top of the stack. I'll probably introduce a job queue soon.
I tried "Watch as the cat sniffs the flower, eats it, and then vomits. This is classic feline behavior" with Attenborough. He seems to slip into a bit of a German accent on the second sentence. What's the cause of that?
Thanks for sharing, though. Very interesting project!
I can come back and post a write-up. Please refresh this post later today.
I scaled up for today, but it's pretty cheap to run day to day.
I also have some architectural optimizations to make that will greatly reduce costs. Right now each node is responsible for two speakers, which under-utilizes the nodes since most speakers rarely get used.
Do you have a GitHub repo or technical documentation about how you built this sort of thing to work at scale?