For specific domains, like drone control, you can achieve much better accuracy than you typically see in "dictation"-style speech recognition. You can use a statistical language model that represents the things you're most likely to hear.
For example, Google & Siri kind of need to be able to handle anything I throw at them: "What is Ke$ha's new album?" "What year was the Hardy Boys book 'Hunting For Hidden Gold' written?" They may use a language model that favors grammatical language — that is, "What is Ke$ha" is a more likely speech recognition hypothesis than "What hiss kush ball" — but they still need a big model to represent that.
For drone control, you have much more constrained language, which helps recognition accuracy significantly. The model can tell the recognizer that if it heard "Go <unsure> 100 feet" that the <unsure> word is most likely to be a direction like forward/back/left/right/up/down, and not "neutrino".
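To make that concrete, here's a toy sketch of rescoring recognizer hypotheses against a constrained command grammar. The grammar pattern ("go &lt;direction&gt; &lt;number&gt; feet"), the vocabulary, and the penalty weights are all made up for illustration — a real system would use a proper grammar or statistical LM, not this:

```python
# Hypothetical command grammar: "go <direction> <number> feet".
DIRECTIONS = {"forward", "back", "left", "right", "up", "down"}

def grammar_score(hypothesis):
    """Crude language-model score: how well does this word sequence
    fit the command pattern? (Weights are arbitrary, for illustration.)"""
    words = hypothesis.lower().split()
    if len(words) != 4 or words[0] != "go" or words[3] != "feet":
        return 0.0
    score = 1.0
    if words[1] not in DIRECTIONS:
        score *= 0.01  # heavily penalize out-of-grammar words like "neutrino"
    if not words[2].isdigit():
        score *= 0.01
    return score

def best_hypothesis(hypotheses):
    """Pick the recognizer output the grammar considers most likely."""
    return max(hypotheses, key=grammar_score)

print(best_hypothesis(["go neutrino 100 feet", "go forward 100 feet"]))
# → go forward 100 feet
```

In a real recognizer this rescoring happens inside the decoder (combined with the acoustic score) rather than as a post-processing step, but the idea is the same: the grammar makes in-domain hypotheses cheap and out-of-domain ones expensive.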
It's a lot like the way that Norvig uses n-grams to illustrate writing a spelling corrector: http://norvig.com/ngrams/ch14.pdf Having a model lets you fix errors in the input.
Having constrained language and a good model is often critical to creating a successful speech interface.