Where did the idea to write an article about speech recognition come from? It all started with this scene from «The Big Bang Theory».
This is a good example of the limits of speech recognition. Siri doesn't understand that Barry has a speech impediment. The voice assistant can’t learn – it doesn’t have enough Artificial Intelligence (AI) – to recognise and compensate for these pronunciation errors.
Let’s begin with a simple misunderstanding and some good old toilet humour. Instead of playing the song «Splish Splash (I Was Taking a Bath)», Alexa searches for «Splish Splash I Was Taking a Crap».
Here’s a video of Microsoft CEO Satya Nadella giving a keynote at a conference. As chance would have it, he’s speaking about Artificial Intelligence and presenting Microsoft’s virtual assistant Cortana. And Cortana messes up the demo. Instead of «Show me my most at-risk opportunities», the assistant understands «Show me to buy milk at this opportunity». Someone backstage has to help him out in the end.
This one should teach parents a lesson: activate the child safety lock. Instead of searching for a song from this little boy’s favourite book («Digger, digger»), Alexa suggests pornographic content. It would be helpful if virtual assistants learned to tell from the voice who they’re dealing with, so that this can be avoided in the future (if the child safety lock is activated).
Let’s get to the really interesting fails. Language is complex; some words sound the same but have completely different meanings. In this video, four/for and two/to are examples of what are known as homophones. Homophones baffle virtual assistants, and it takes creative approaches to make sure the right word is understood. This fail is another classic when it comes to showing the limits of speech recognition.
The same applies to words with multiple meanings – homonyms. And there are loads of them. Take the word «date»: it can be a fruit, a romantic meeting or simply a specific day in the calendar. Or «type»: it can mean writing on a keyboard or a category of something. These are just two examples; the list goes on and on.
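To give a rough idea of what one such «creative approach» could look like, here is a toy sketch that picks between homophones such as two/to by looking at the neighbouring words. Everything in it is an assumption made up for illustration: the word groups, the pair counts and the function names are hypothetical, and real assistants rely on far larger statistical language models.

```python
# Toy sketch: the word groups and counts are invented for illustration.
HOMOPHONES = [{"for", "four"}, {"to", "two", "too"}]

# Hypothetical counts of adjacent word pairs (left word, right word).
PAIR_COUNTS = {
    ("is", "two"): 5,
    ("is", "to"): 1,
    ("two", "plus"): 5,
    ("plus", "two"): 5,
    ("listen", "to"): 8,
    ("to", "music"): 4,
}


def candidates(word):
    """Return every word that sounds like `word`, including itself."""
    for group in HOMOPHONES:
        if word in group:
            return sorted(group)
    return [word]


def score(prev_word, word, next_word):
    """Add up how often `word` was seen next to its two neighbours."""
    left = PAIR_COUNTS.get((prev_word, word), 0)
    right = PAIR_COUNTS.get((word, next_word), 0)
    return left + right


def disambiguate(sentence):
    """Replace each heard word with its best-scoring homophone."""
    heard = sentence.lower().split()
    result = []
    for i, word in enumerate(heard):
        prev_word = result[-1] if result else None
        next_word = heard[i + 1] if i + 1 < len(heard) else None
        best, best_score = word, score(prev_word, word, next_word)
        for cand in candidates(word):
            cand_score = score(prev_word, cand, next_word)
            if cand_score > best_score:  # ties keep the word as heard
                best, best_score = cand, cand_score
        result.append(best)
    return " ".join(result)


print(disambiguate("what is to plus to"))
print(disambiguate("listen to music"))
```

With these made-up counts, «what is to plus to» comes out as «what is two plus two», while «listen to music» is left alone. Without any helpful context, the sketch simply keeps the word as it was heard, which is exactly where the fails come from.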
Not only are there 6,000 to 7,000 languages worldwide, there are also countless dialects. With such a range, it’s not surprising that speech recognition focuses on one standard variety of a language. Virtual assistants would need a lot more Artificial Intelligence to be able to understand dialects, too. Bearing in mind that even humans have trouble understanding some dialects, it could take a long time for virtual assistants to learn this.
The following video shows a voice assistant trying to figure out what a Scottish guy is saying. Its success is, let’s say, modest.