Why Home, Alexa, and Cortana suck in 2017
I have been working on SmartHome / VocalAssistant (and Computer Vision) since 2012 on an open-source project named SARAH (video). It uses technologies like Microsoft Speech Recognition, the Kinect microphone, SAPI voices, etc., and has been installed by more than 7,000 French users.
TL;DR
Google Home, Alexa, and Cortana are tremendous devices with excellent speech recognition, but their APIs are missing a lot of the Voice UX features that make people really love their assistant.
Here are some key features Google, Amazon, Microsoft and the others should add to their APIs…
Name and Voice define the soul
People have different cultures, languages, ages, references, etc. They want to choose their personal assistant's name. In the SARAH community, 40% of users renamed SARAH to JARVIS because they want to be Tony Stark.
Some parents told me that SARAH is part of the family: their children chat with her (even if it's only voice commands). The name AND the voice are the key to believing in the assistant.
HotWord or Keyword Spotting is the algorithm used client-side to perform name detection. But it requires many audio samples (more than 20K for « Ok Google »). GAFAM SHOULD implement a feature to set a new hotword; several players already offer it (a small sketch follows the list below):
- Sensory has a chip that learns hotwords for toys
- Snowboy, owned by Kitt.ai (with Amazon behind it), has a keyword spotting feature
- Snips.ai sells this feature in their APIs
- At VISEO we are working on it with TensorFlow
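To give an idea, here is a minimal sketch of client-side keyword spotting with Snowboy's Node.js bindings. The personal model « jarvis.pmdl » and the node-record-lpcm16 microphone module are assumptions of this sketch, not part of any GAFAM API:

```javascript
// Minimal hotword detection sketch with Snowboy (Kitt.ai) on Node.js.
// Assumes a personal model "jarvis.pmdl" trained on https://snowboy.kitt.ai
// and the node-record-lpcm16 module to capture microphone audio.
const { Detector, Models } = require('snowboy');
const record = require('node-record-lpcm16');

const models = new Models();
models.add({
  file: 'resources/jarvis.pmdl',   // hypothetical custom hotword model
  sensitivity: '0.5',
  hotwords: 'jarvis'
});

const detector = new Detector({
  resource: 'resources/common.res',
  models: models,
  audioGain: 2.0
});

detector.on('hotword', (index, hotword) => {
  // The assistant woke up: start streaming audio to the cloud ASR here
  console.log('Hotword detected:', hotword);
});

// The detector is a writable stream: pipe raw microphone PCM into it
record.start({ sampleRate: 16000 }).pipe(detector);
```

This is exactly the kind of client-side hook the official assistants do not expose: the name stays a local, per-family choice.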
The voice is also VERY important (think about the movie HER). Unix is VERY bad with voices (it sounds like WarGames) compared to SAPI voices made, for instance, by Voxygen.
Notification and Proactive Speech
Google and Amazon are racing to see who has the bigger … set of skills/actions. But that's not the point, and their APIs are limited by a Request/Response mechanism:
- The user asks a question
- It triggers a function in the cloud
- The function sends back an answer or a prompt
At the beginning, frameworks like SARAH or Microsoft BotBuilder made the same mistake. The workaround is to close the request and introduce a way to send « answers or prompts » later (see the sketch after this list).
- This leads to notifications and proactive messages!
- This allows multiple answers: « let me think … » => « I found that … » => « What do you like? »
- This fixes request timeouts, slow APIs, timers, …
- This allows pushing the answer to another communication channel
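Here is a minimal sketch of that pattern with Microsoft BotBuilder v3 on Node.js: the bot answers right away, saves the conversation address, and pushes the real answer later. The 5-second timer and the wording are just placeholders for a slow API or an IoT event:

```javascript
// Proactive answer sketch with Microsoft BotBuilder v3 (Node.js).
// The request is closed immediately, then the real answer is pushed later,
// outside of the request/response cycle.
const builder = require('botbuilder');

const connector = new builder.ChatConnector({
  appId: process.env.MICROSOFT_APP_ID,
  appPassword: process.env.MICROSOFT_APP_PASSWORD
});

const bot = new builder.UniversalBot(connector, (session) => {
  const address = session.message.address;   // remember where to reply
  session.send('Let me think …');            // close the request right away

  setTimeout(() => {                         // later: slow API, timer, IoT event …
    const msg = new builder.Message()
      .address(address)
      .text('I found that …');
    bot.send(msg);                           // proactive message / notification
  }, 5000);
});
```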
Wait! I don't want to receive audio spam!
Of course, NUI (Natural User Interface) implies giving power back to users to choose which brands to listen to. It's exactly the same thing in AR/VR.
Vocal UX should power IoT
There is a big misunderstanding: we are talking about two different things:
- a communication device (like « Google Home »)
- a virtual assistant (like « Google Assistant »)
The communication device should remain a communication channel, available through an API. It should not be the center of the UX.
The problem is that Actions (or Skills) are tied to NLP « intents ». There is no « super brain » to manage the pool of actions/skills.
Worse! Google did a great job combining Home with Chromecast, but that Cast integration is not an official API. It's a magical built-in M2M link with no 3rd-party API. Developers can't push content to the TV the way an NeTV can.
That's why Alexa has 15K (almost useless) skills but no UI to make things work together (handling context, databases, personal data, etc.). Current APIs are too simplistic.
In the VISEO Bot Framework (aka SARAH v5), we use IBM's Node-RED to handle that flow of actions/skills (see the function-node sketch below). Google Home and API.ai are single nodes, parts of a bigger ecosystem. Cortana or Alexa are just ways to communicate with the user, alongside Facebook Messenger, a LoRa push button, or a blinking LED.
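As an illustration, a single Node-RED « function » node can play that « super brain » role and fan one intent out to several channels. The intent name « show.camera », the payload fields, and the three outputs are hypothetical, just to show the routing idea:

```javascript
// Node-RED "function" node with 3 outputs:
// 1 = vocal answer (Google Home / Alexa), 2 = Chromecast / TV, 3 = LoRa / LED.
// The intent "show.camera" and the payload fields are hypothetical examples.
const intent = msg.payload.intent;

if (intent === 'show.camera') {
  const speech = { payload: { text: 'Here is the camera, on your TV' } };
  const cast   = { payload: { url: msg.payload.cameraUrl } };
  const led    = { payload: { blink: true } };
  return [speech, cast, led];   // fan out to the three channels
}

return [msg, null, null];       // default: only answer vocally
```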
I really hope these APIs will evolve to allow more flexibility on the server side.
About me
My name is Jean-Philippe Encausse. I have been a software engineer for 17 years, working on IoT, Computer Vision, VR/AR, Blockchain, … I am the maker of SARAH (a vocal assistant for SmartHome) and of a chatbot framework on top of Microsoft Bot Builder and Node-RED.