
Speech generation has gotten really good, but there's simply no way to faithfully recreate someone's vocal idiosyncrasies and cadence with just "a few seconds" of real audio. That's where the models tend to fall short.
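For context, this is roughly what "a few seconds" cloning looks like in practice. A minimal sketch, assuming the open-source Coqui TTS library and its XTTS v2 model (my choice of a representative tool; the comment doesn't name one), which conditions synthesis on a short reference clip:

  # Sketch only: assumes Coqui TTS (pip install TTS) and its XTTS v2 model.
  from TTS.api import TTS

  # Load the multilingual XTTS v2 voice-cloning model.
  tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

  # Synthesize new speech conditioned on a short reference recording.
  tts.tts_to_file(
      text="Hi, it's me. Can you call me back when you get this?",
      speaker_wav="reference_clip.wav",  # hypothetical path to the short sample
      language="en",
      file_path="cloned_output.wav",
  )

Whether the output captures the speaker's idiosyncrasies, rather than just their timbre, is exactly the question being debated here.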


This was my thought as well, but someone pointed out to me that regional accent identification captures a large percentage of cadence and inflection differences (specific word choices and turns of phrase obviously would still not be there).


I don't think it's hard to get more than a few seconds of voice from many people.

'Hi, sorry to call you. I'm Cindy from your insurance, and I'm calling regarding your car crash ...'


Few seconds means less than a minute. That’s not nothing. Look at a clock and talk for a minute — it’s longer than you might think.

Do you think you could give a recording of a minute of someone talking to a talented impressionist and they could impersonate that person to some degree? It doesn’t seem that far fetched to me.


"Few" doesn't mean <60 it typically means ~5 or <10.

Getting high-quality audio of an arbitrary private citizen via public means isn't that easy, especially for folks like me who don't post video on public social media, use automated call screening, and never say a word until the caller has been vetted.



