I can make a blog post later, but at a high level:
A Rust TTS server hosts two models: a mel inference model and a mel inversion model (a vocoder). The ones I'm using are Glow-TTS and MelGAN; they fit together back to back in a pipeline.
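To make the shape of that pipeline concrete, here's a minimal sketch, assuming both models have been exported to TorchScript and are loaded through the tch crate. The file names, the single-tensor interface, and the phoneme IDs are placeholders; the real Glow-TTS and MelGAN entry points take more inputs (lengths, noise scale, etc.) than this.

```rust
// Minimal sketch of the two-stage pipeline: phoneme IDs -> mel -> audio.
// Assumes hypothetical TorchScript exports ("glow_tts.pt", "melgan.pt") and the
// tch crate; the actual model signatures take additional arguments.
use tch::{CModule, Tensor};

fn main() -> Result<(), tch::TchError> {
    let glow_tts = CModule::load("glow_tts.pt")?; // mel inference model
    let melgan = CModule::load("melgan.pt")?;     // mel inversion model (vocoder)

    // Hypothetical phoneme IDs for one utterance, shaped [batch=1, seq_len].
    let phoneme_ids = Tensor::of_slice(&[12i64, 37, 5, 48, 9]).unsqueeze(0);

    // Stage 1: phonemes -> mel spectrogram. Stage 2: mel -> waveform.
    let mel = glow_tts.forward_ts(&[phoneme_ids])?;
    let audio = melgan.forward_ts(&[mel])?;

    println!("generated audio tensor of shape {:?}", audio.size());
    Ok(())
}
```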
I chose these models not for their fidelity but for their performance: they're 10x faster at inference than Tacotron 2. If you want something that sounds amazing, you're better off with a denser stack like Tacotron 2 + WaveGlow; those are the better choice when you can render offline, e.g. for multimedia work.
Instead of using graphemes, I'm using ARPABET phonemes, which I get from Carnegie Mellon's CMUdict pronunciation dictionary. In the future I'll supplement this with a grapheme-to-phoneme model that predicts pronunciations for words missing from the dictionary.
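For illustration, the lookup itself is just a hash map built from the dictionary file. A minimal sketch follows; the file path and the out-of-vocabulary fallback message are placeholders, not the project's actual code.

```rust
// Sketch: parse CMUdict lines like "FLOWER  F L AW1 ER0" into a word -> phonemes map.
use std::collections::HashMap;
use std::fs;

fn load_cmudict(path: &str) -> HashMap<String, Vec<String>> {
    // The dictionary file isn't strictly UTF-8, so decode it lossily.
    let bytes = fs::read(path).expect("failed to read dictionary file");
    let text = String::from_utf8_lossy(&bytes);

    let mut dict = HashMap::new();
    for line in text.lines() {
        // Skip comment lines, which start with ";;;".
        if line.starts_with(";;;") {
            continue;
        }
        let mut parts = line.split_whitespace();
        if let Some(word) = parts.next() {
            let phones: Vec<String> = parts.map(str::to_string).collect();
            dict.insert(word.to_uppercase(), phones);
        }
    }
    dict
}

fn main() {
    let dict = load_cmudict("cmudict-0.7b.txt"); // hypothetical local copy
    match dict.get("FLOWER") {
        Some(phones) => println!("{}", phones.join(" ")), // e.g. "F L AW1 ER0"
        None => println!("OOV word: fall back to a grapheme-to-phoneme model"),
    }
}
```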
Each TTS server hosts only one or two voices due to memory constraints (the models are huge), and the fleet is scaled horizontally. A proxy server sits in front, decodes each request, and directs it to the appropriate backend based on a ConfigMap that associates each service with the underlying model. Kubernetes wires all of this up.
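As a rough sketch of the routing step (the mount path, file format, and service names are assumptions, not the project's actual setup): the ConfigMap can be mounted into the proxy as a plain key=value file mapping each voice to the backend Service that hosts its model, and the proxy resolves the requested voice against that map.

```rust
// Sketch of the proxy's routing decision, assuming the ConfigMap is mounted
// at /etc/tts/routes as "voice=service-host" lines, e.g.:
//   attenborough=tts-attenborough.default.svc.cluster.local
use std::collections::HashMap;
use std::fs;

fn load_routes(path: &str) -> HashMap<String, String> {
    let mut routes = HashMap::new();
    for line in fs::read_to_string(path).unwrap_or_default().lines() {
        if let Some((voice, service)) = line.split_once('=') {
            routes.insert(voice.trim().to_string(), service.trim().to_string());
        }
    }
    routes
}

fn backend_for<'a>(routes: &'a HashMap<String, String>, voice: &str) -> Option<&'a str> {
    routes.get(voice).map(String::as_str)
}

fn main() {
    let routes = load_routes("/etc/tts/routes");
    match backend_for(&routes, "attenborough") {
        Some(host) => println!("forward request to http://{}:8000/synthesize", host),
        None => println!("unknown voice: return 404"),
    }
}
```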
This is incredibly cool. Do you mind sharing how big the models are, and what kind of instances you're deploying them on?
I ask because I help maintain an open source ML infra project ( https://github.com/cortexlabs/cortex ) and we've recently done a lot of work around autoscaling multi-model endpoints. Always curious to see how others are approaching this.
(All the voices use the same MelGAN, or derivations of it.)
I'll edit my post later with my deployment and cluster architecture. In short, the model servers are sharded and fronted by a thin proxy microservice at the top of the stack. I'll probably introduce a job queue soon.
I tried "Watch as the cat sniffs the flower, eats it, and then vomits. This is classic feline behavior" with Attenborough. He seems to slip into a bit of a German accent on the second sentence. What's the cause of that?
Thanks for sharing, though. Very interesting project!
I can come back and post a write-up. Please refresh this post later today.
I scaled up for today, but it's pretty cheap to run day to day.
I also have some architectural optimizations to make that will greatly reduce costs. Right now each node is responsible for two speakers, which under-utilizes the nodes since most speakers rarely get used.
Do you have a GitHub repo or technical documentation about how you built this sort of thing to work at scale?