Microsoft Voice Service Integration

Contents

Microsoft Voice Client Implementation on Loomo

Integrating Amazon's Alexa service is promising because of the rich set of functionality that comes with it. Nevertheless, we also take a look at alternative services for implementing voice assistance on Loomo, such as the Microsoft Voice Service.

Microsoft Voice Client Implementation on Loomo

Project Documentation

Requirements

One of the requirements for this practical course was to build a mobile voice assistant. It was envisaged to use an existing speech recognition service such as Amazon Alexa Voice Service (Alexa) [1] to simplify the process of speech understanding. Furthermore, dialogs should be developed to interact with the robot. Afterwards, a technical concept should be established to control the robot with voice commands. Alexa Skills and the IoT API were to be used to achieve this.

During the research and proof-of-concept phase of the project it was decided to use the Bing Speech API (Speech API) [2] in conjunction with the Azure natural language understanding service (LUIS) [3] instead of Alexa and Alexa Skills. The Speech API is a speech-to-text service from the Azure Cognitive Services; it can convert spoken audio to text and the other way around [2]. LUIS is described by its developers as “a cloud-based service that applies custom machine-learning to a user’s conversational, natural language text to predict overall meaning and pull out relevant, detailed information” [4]. The Speech API has a dedicated Android library, in contrast to Alexa, which offers only a C++ library or a community-based library with low-level APIs.

The combination of the Speech API and LUIS works well for creating voice commands that can easily be interpreted and executed. The library needs only the API keys for the services and takes care of allocating and releasing the microphone resource. It returns a JSON document with the recognized command and its parameters, each with a corresponding probability. There is no need to develop dedicated skills and additional IoT commands as would be required with Alexa.
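Such a recognition result might look like the following. This is an illustrative fragment modeled on the LUIS response format; the exact field names and values depend on the service version and the trained model, and the scores shown here are made up:

```json
{
  "query": "set the volume to fifty percent",
  "topScoringIntent": { "intent": "SetVolume", "score": 0.92 },
  "entities": [
    { "entity": "fifty percent", "type": "Volume", "score": 0.87 }
  ]
}
```

The application only needs to read the top-scoring intent name and its entities to decide which command to run.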

Several conceptual dialogs are implemented to show and evaluate how more sophisticated commands can be developed. For example, you can set the volume of the application by saying “Set the volume”. Loomo will then ask for the value to set the volume to. The whole dialog plays out as: “Okay Loomo. Set the volume.” – “What do you want to set the volume to?” – “50 percent.” – “Okay, the volume is set to 50 percent.”

As a technical concept to control the robot, there are commands to set the brightness of the screen and the volume of the app. Their implementation shows how to access Android system services and variables, and demonstrates how to manipulate the execution environment. There are two methods to start and stop the exploration of the room. Additional useful commands are the date and time methods: the Date command gives you the current date in a natural spoken way, and the Time command reports the current time in a similar fashion. There is currently no way to tell the robot to move or turn in a specified direction. However, it is quite easy to implement new commands: you only need to extend the “IntentsLibrary” class with a method whose name matches the intent name returned by LUIS.
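The "method with the same name as the intent" dispatch can be sketched with plain Java reflection. This is a simplified sketch, not the project's actual code: the intent names (Time, SetVolume, None) come from the intent list in this document, but the helper `callByName` and the string return values are hypothetical, and real commands would call Android system services instead of returning text:

```java
import java.lang.reflect.Method;

// Simplified sketch of an IntentsLibrary dispatching by intent name.
public class IntentsLibrary {

    public String Time() {
        return "It is twelve o'clock.";
    }

    public String SetVolume(String value) {
        return "Okay, the volume is set to " + value + ".";
    }

    // Fallback intent, used when no matching method exists.
    public String None() {
        return "Sorry, I did not understand that.";
    }

    // Look up and invoke the method whose name equals the intent
    // name returned by LUIS; unknown intents fall back to None().
    public String callByName(String intentName, String... entities) {
        try {
            if (entities.length == 0) {
                Method m = getClass().getMethod(intentName);
                return (String) m.invoke(this);
            }
            Method m = getClass().getMethod(intentName, String.class);
            return (String) m.invoke(this, entities[0]);
        } catch (ReflectiveOperationException e) {
            return None();
        }
    }

    public static void main(String[] args) {
        IntentsLibrary lib = new IntentsLibrary();
        System.out.println(lib.callByName("SetVolume", "50 percent"));
        System.out.println(lib.callByName("FlyAway")); // not implemented -> None()
    }
}
```

With this scheme, adding a command really is just adding a public method; no dispatch table has to be maintained by hand.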

Technical Description

To start the voice interaction with Loomo, the Loomo SDK wake-up service is used. It recognizes when one of the wake-up phrases is spoken and then starts the Azure voice recognition. Some of the possible wake-up phrases are [5]:

  • “Okay Loomo.”
  • “Hi Loomo.”
  • “Hello Loomo.”
  • “Loomo Loomo.”

After the wakeup a sound is played to indicate the start of the voice recognition to the user. Now the user can speak to Loomo in a natural way and Loomo generates a corresponding intent, which then invokes the desired functionality.

List of the already implemented intents and example phrases:

  • None
    • default intent, called when an intent is recognized that is not implemented in the IntentsLibrary
  • Confirm
    • “Yes.”
  • Decline
    • “No.”
  • Cancel
    • “Cancel, please.”
  • CloseApplication / PowerOff
    • “Shut up!”
    • “Shut down!”
  • AreYouListening
    • “Can you hear me?”
  • Time
    • “What’s the time?”
  • Date
    • “What day is today?”
  • SetBrightness
    • “Set the brightness to medium / 20 / 40 percent.”
  • SetVolume
    • “Set the volume to medium / 15 / 35 percent.”
  • StartExploration
    • “Look around.”
  • StopExploration
    • “Stop exploring.”

If an intent needs an entity (e.g. the actual value for SetVolume) but the user did not provide one, a dialog is started and Loomo asks for it, e.g. “Okay Loomo. Set the brightness.” – “What do you want to set the brightness to?” – “Medium.” – “Okay, the brightness is set to medium.”
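The missing-entity follow-up can be modeled as a tiny piece of dialog state. The class and method names below (DialogState, handle, slotName) are hypothetical, chosen only for this sketch; the idea is that when an intent arrives without a value, the intent is remembered as pending and the next utterance is treated as the missing entity:

```java
// Minimal sketch of the missing-entity dialog: an intent without a
// value triggers a follow-up question; the next utterance answers it.
public class DialogState {
    private String pendingIntent = null;

    public String handle(String intent, String entity) {
        if (pendingIntent != null) {           // previous turn asked for a value,
            String answer = apply(pendingIntent, intent); // so this turn is the value
            pendingIntent = null;
            return answer;
        }
        if (entity == null) {                  // intent needs a value, none given
            pendingIntent = intent;
            return "What do you want to set the " + slotName(intent) + " to?";
        }
        return apply(intent, entity);          // value was spoken in one go
    }

    private String apply(String intent, String value) {
        return "Okay, the " + slotName(intent) + " is set to " + value + ".";
    }

    private String slotName(String intent) {
        return intent.replace("Set", "").toLowerCase(); // SetVolume -> volume
    }

    public static void main(String[] args) {
        DialogState d = new DialogState();
        System.out.println(d.handle("SetBrightness", null)); // asks back
        System.out.println(d.handle("medium", null));        // fills pending intent
    }
}
```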

In case the Azure voice recognition does not recognize any phrase (NoMatch or InitialSilenceTimeout), Loomo will ask the user to repeat the last phrase (e.g. “Pardon? Can you repeat that, please?”) until a valid phrase is detected.

Every detected intent has a confidence, i.e. how likely it is that this intent was meant by the user. When the confidence of an intent is too low, Loomo will ask the user for confirmation to ensure that the right functionality is executed (e.g. “Okay Loomo. Volume.” – “Do you want to set the volume?” – “Yes.” – “Okay. What do you want to set the volume to?” – “30 percent.” – “Okay, the volume is set to 30 percent.”).
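The confidence gate can be sketched as a single decision point. The threshold value and all names here are assumptions for illustration; the cut-off actually used on Loomo is not stated in this document:

```java
// Sketch of the confidence check: execute confident intents directly,
// ask the user to confirm uncertain ones. THRESHOLD is an assumption.
public class ConfidenceGate {
    static final double THRESHOLD = 0.6;

    public static String decide(String intent, double score) {
        if (score >= THRESHOLD) {
            return "execute:" + intent;
        }
        return "confirm:Do you want to " + describe(intent) + "?";
    }

    // "SetVolume" -> "set volume" (purely illustrative formatting)
    private static String describe(String intent) {
        return intent.replaceAll("([a-z])([A-Z])", "$1 $2").toLowerCase();
    }

    public static void main(String[] args) {
        System.out.println(decide("SetVolume", 0.92)); // confident -> execute
        System.out.println(decide("SetVolume", 0.35)); // unsure -> ask back
    }
}
```

On a "Yes" to the confirmation question, the assistant then proceeds exactly as if the intent had been recognized confidently in the first place.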

Description of the Program Components

The Loomo assistant application is split into two main parts, each with its own package. The first is “navigationrobot”, which contains all elements needed for the exploration and mapping of the room. Its “MainActivity” hooks into the second package, “loomospeech”, to leverage the voice commands.

The “loomospeech” package is basically an application in its own right. The MainActivity class functions as the entry point and hooks into the app lifecycle to register, start, stop and release system services and necessary libraries. The MainActivity in “navigationrobot” is derived from this class so that it does not need to implement the lifecycle hooks itself. The remaining classes are:

  • MessageHandler and WordToNumber: helper classes that handle messages to be displayed to the user and convert recognized number words into actual numbers, which are used to modify settings of the app.
  • LoomoSoundPool: loads and plays notification sounds for confirmations or errors. It can hold additional sounds for advanced user interaction.
  • LoomoTextToSpeech: responsible for the acoustic response to the user. It takes a callback function to continue the dialog after it is done speaking.
  • LoomoWakeUpRecognizer: leverages a Loomo firmware API to listen in the background for a wake-up keyword. If it detects the keyword it starts the speech recognition and intent services; otherwise it remains silent.
  • AzureSpeechRecognition: listens for the voice commands or dialogs and invokes the language understanding. On successful recognition and understanding it calls the command by its intent name from the IntentsLibrary.
  • IntentsLibrary: holds all usable commands and reports when a requested intent is not implemented. This class executes the voice commands that were understood by LUIS. By extending or deriving this class, more commands can easily be added.
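The word-to-number conversion mentioned above can be sketched as follows. The class name WordToNumber comes from the document, but the implementation is a guess at how such a helper might look; the real class presumably handles more cases (teens, "hundred", localized words):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a WordToNumber helper: maps spoken number words to
// integers so that phrases like "fifty percent" can drive settings.
public class WordToNumber {
    private static final Map<String, Integer> UNITS = new HashMap<>();
    private static final Map<String, Integer> TENS = new HashMap<>();
    static {
        String[] u = {"zero","one","two","three","four","five","six",
                      "seven","eight","nine","ten","eleven","twelve"};
        for (int i = 0; i < u.length; i++) UNITS.put(u[i], i);
        String[] t = {"twenty","thirty","forty","fifty","sixty",
                      "seventy","eighty","ninety"};
        for (int i = 0; i < t.length; i++) TENS.put(t[i], (i + 2) * 10);
    }

    // "fifty" -> 50, "twenty five" -> 25, "7" -> 7
    public static int parse(String words) {
        int value = 0;
        for (String w : words.toLowerCase().split("[\\s-]+")) {
            if (UNITS.containsKey(w)) value += UNITS.get(w);
            else if (TENS.containsKey(w)) value += TENS.get(w);
            else value += Integer.parseInt(w); // already a digit string
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(WordToNumber.parse("fifty"));       // 50
        System.out.println(WordToNumber.parse("twenty five")); // 25
    }
}
```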

Problems and Restrictions

The Loomo voice assistant technology works best in a small, quiet environment. That is, the voice commands can easily be drowned out by noise, music and other groups of people talking. Even though multiple microphones are installed in the robot, it gets harder to capture spoken words the farther away the person is. In a normal room with few people there is typically no problem understanding what is said to the robot. In a large hall there is often reverberation, and the distance dampens the loudness of the voice. Furthermore, large rooms often contain many people talking, which makes it hard to distinguish the commands from clutter. Loomo will ask you to repeat the command if it could not understand it correctly; if this happens often in succession it can get frustrating. The firmware wake-up function tends to work a little better than the actual voice command recognition. That might be attributed to the different access rules of the two components: the firmware has direct access to the microphone, whereas the Speech API can only access it via the operating system. If the voice assistant is used in a normal office or apartment it performs quite well.

The LoomoTextToSpeech class internally uses the built-in Android text-to-speech engine. This engine is easy to use and deeply integrated into the system; it frees the developer from selecting the proper language and localization settings and lets the user set the preferences system-wide. There is, however, a problem with this specific text-to-speech engine: in the major Android version used, the engine cannot be shut down without generating an error message. The system resources are freed properly and the service can be used afterwards without problems. Unfortunately, this malfunction was only fixed in the next major version of Android.

Suggested Improvements or Alternatives

To improve the overall usability, the two main shortcomings from the chapter above should be fixed or minimized. The microphone volume for the wake-up and command recognition should be increased if possible. This would make it easier for the Speech API to understand the commands and distinguish them from background noise. For the text-to-speech error it would be best to update to a newer version of Android, but because the Loomo robot is a proprietary system this is not likely to happen soon. To prevent the error from occurring during the app's lifecycle, the service powering the text-to-speech engine should simply not be released. Although this is not advised by the Android system developers, it is a valid short-term solution in the developer community until the system can be updated. If this is not desired, the engine could be swapped for the text-to-speech engine from the Speech API, which is not integrated into the system and must be configured accordingly.

As an alternative to the Speech API and LUIS combination, the AzureSpeechRecognition class could be substituted with Amazon Alexa or a similar service as a drop-in replacement. It only needs to be able to invoke the correct methods in the IntentsLibrary.

Advice for Further Work

To extend or modify the voice commands and how they are recognized, you need a Microsoft Azure account. The Bing Speech API and LUIS services can be used for free and are needed to refine the language intents and their parameters; you only have to set the API subscription keys in the resources. The next step in the development would be to build more skills. There could be a weather app or an app that reminds you of upcoming appointments; this could incorporate Azure Bots to synchronize your calendars or even make Skype calls. It would also be nice if the robot could move via voice commands. The start/stop exploration commands give you a feeling for how this could work, but it would be even better if the robot could follow you through the room, or if you could simply send it away by telling it to.

References

[1] Amazon, "developer.amazon.com," [Online]. Available: https://developer.amazon.com/de/alexa-voice-service. [Accessed 07 06 2018].

[2] Microsoft, "Microsoft Azure Cloud Computing Platform & Services," [Online]. Available: https://azure.microsoft.com/en-us/services/cognitive-services/speech/. [Accessed 07 06 2018].

[3] Microsoft, "LUIS," [Online]. Available: https://www.luis.ai/home. [Accessed 07 06 2018].

[4] Microsoft, "About Language Understanding (LUIS) in Azure — Microsoft Docs," Microsoft Azure, [Online]. Available: https://docs.microsoft.com/en-us/azure/cognitive-services/LUIS/Home. [Accessed 07 06 2018].

[5] Segway Robotics, "Segway Robotics Documentation," [Online]. Available: https://developer.segwayrobotics.com/developer/documents/segway-robot-overview.html. [Accessed 09 06 2018].