Have you had any success using speaker embeddings to generate voices from fewer speech samples? I ran some cursory experiments, but I couldn't get much further than matching the target speaker's pitch.
My reasoning for this approach: IMO, if the model learns a "universal human voice", it shouldn't need too much additional information to get a target voice.
I did! I tried building a multi-speaker embedding model for a practical reason: saving on memory costs. It didn't fit individual speakers very well, so I'm going to have to add more layers. I wish I'd saved audio results to share. I might be able to publish my findings if I can track down the model files.
I think you're right in that if we can get such a model to work, training new embeddings won't require much data.
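To make the idea concrete, here's a toy sketch of what I mean, not my actual model. It assumes a shared model whose weights are frozen after multi-speaker training, plus one small embedding vector per speaker. All names and sizes (`W_x`, `W_e`, `synthesize`, `fit_new_speaker`, the dimensions) are made up for illustration. Because the toy model is linear, the new embedding is fit with least squares as a stand-in for the gradient-based training a real network would need.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: input features, speaker embedding, acoustic output.
FEAT_DIM, EMB_DIM, OUT_DIM = 8, 4, 6

# Shared weights standing in for a pretrained multi-speaker model.
# They are never updated below; only the per-speaker embedding is fit.
W_x = rng.normal(size=(OUT_DIM, FEAT_DIM))
W_e = rng.normal(size=(OUT_DIM, EMB_DIM))


def synthesize(x, emb):
    """Toy shared model: output = content term + speaker-embedding term."""
    return x @ W_x.T + emb @ W_e.T


def fit_new_speaker(x, y):
    """Fit only a new speaker embedding from a few (x, y) samples."""
    residual = y - x @ W_x.T        # part of the output the embedding must explain
    target = residual.mean(axis=0)  # average over the few samples
    emb, *_ = np.linalg.lstsq(W_e, target, rcond=None)
    return emb


# Five samples from an unseen "target speaker" with a hidden true embedding.
true_emb = rng.normal(size=EMB_DIM)
x_few = rng.normal(size=(5, FEAT_DIM))
y_few = synthesize(x_few, true_emb)

new_emb = fit_new_speaker(x_few, y_few)

# Storage per extra speaker is just the embedding, not a full copy of the model.
shared_params = W_x.size + W_e.size
per_speaker_params = new_emb.size
```

In this toy setup each additional speaker costs `EMB_DIM` numbers instead of another full set of weights, which is the memory saving I was after. A real model is nonlinear and noisy, so the fit won't be this clean, and as I said, mine needed more capacity to capture individual speakers.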