Speech recognition is useful for VR not only for simulating conversations with AI agents but also for the user to communicate with any application that requires a great number of options. Typing out a response or command might be too impractical, and overcrowding the application with buttons or other GUI elements could get confusing very fast. But anyone who is capable of speech can easily speak while they are in a VR experience.
Unity Labs’ virtual reality (VR) authoring platform Carte Blanche will have a personal assistant called U, with whom the user will be able to speak in order to easily perform certain actions. We at Labs have been researching speech recognition and analysis tools that could be used to implement these voice commands.
The first section of this article presents concepts and theory behind speech recognition. It serves as a primer introducing related concepts and links for the reader to get more information on this field. The second section presents a Unity Asset Store package and public repository we are making available that provides a wrapper for several speech-to-text solutions and a sample scene that compares text transcriptions from each API.
Speech recognition is the transcription from speech to text by a program. Semantic analysis goes a step further by attempting to determine the intended meaning of this text. Even the best speech recognition and semantic analysis software today is far from perfect. Although we humans solve these tasks very intuitively and without much apparent effort, trying to get a program to perform both poses problems that are much more difficult to solve than one might think.
One vital component of today’s statistically based speech recognition is acoustic modeling. This process involves starting with a waveform and from it determining the probabilities of distinct speech sounds, or phonemes (e.g. “s”, “p”, “ē”, and “CH” in “speech”), occurring at discrete moments in time. The acoustic model most commonly used is a hidden Markov model (HMM), which is a type of probabilistic graphical model known as a dynamic Bayesian network (Fig. 1). HMMs are named so because some of the states in the model are hidden - you only have the outputs of these states to work with, and the goal is to use these to determine what the hidden states might have been. For acoustic modeling, this means looking at the waveform output and trying to figure out the most probable input phonemes - what the speaker intended to say.
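To make the idea of recovering hidden states concrete, here is a minimal sketch of Viterbi decoding, a standard way to find the most probable hidden sequence in an HMM. The three phoneme states, the coarse acoustic labels ("hiss", "burst", "tone"), and every probability are made-up values for illustration only. A real acoustic model works on continuous audio features with trained parameters.

```python
# Toy HMM: hidden states are phonemes, observations are coarse acoustic labels.
# All names and probabilities are illustrative assumptions, not trained values.
states = ["s", "p", "ee"]
start_p = {"s": 0.6, "p": 0.3, "ee": 0.1}
trans_p = {
    "s": {"s": 0.3, "p": 0.6, "ee": 0.1},
    "p": {"s": 0.1, "p": 0.3, "ee": 0.6},
    "ee": {"s": 0.1, "p": 0.1, "ee": 0.8},
}
emit_p = {
    "s": {"hiss": 0.8, "burst": 0.1, "tone": 0.1},
    "p": {"hiss": 0.1, "burst": 0.8, "tone": 0.1},
    "ee": {"hiss": 0.1, "burst": 0.1, "tone": 0.8},
}


def viterbi(observations):
    """Return the most probable sequence of hidden states for the observations."""
    # best[t][state] = (probability of the best path ending in state, previous state)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            # Pick the predecessor that maximizes the path probability into s.
            column[s] = max(
                (best[-1][p][0] * trans_p[p][s] * emit_p[s][obs], p) for p in states
            )
        best.append(column)

    # Start from the most probable final state and follow the back-pointers.
    path = [max(states, key=lambda s: best[-1][s][0])]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return path


print(viterbi(["hiss", "burst", "tone"]))
```

The observations are all the decoder sees. The phoneme sequence it returns is its best guess at the hidden states that produced them.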
Once you have probabilities for granular sounds, you need to string these sounds into words. In order to do this, you need a language model, which will tell you how likely it is that a particular sequence of sounds corresponds to a particular word or sequence of words. For example, the phonemes “f”, “ō”, “n”, “ē”, and “m” in sequence quite clearly correspond to the English word “phoneme”. In some cases more context is needed - for instance, if you are given the phonetic word “T͟Her”, it could correspond to “there”, “their”, or “they’re”, and you must use the context of surrounding words to determine which one is most likely. The problem of language modeling is similar to the problem of acoustic modeling, but at a larger scale, so it is unsurprising that similar probabilistic AI systems such as HMMs and artificial neural networks are used to automate this task. Figuring out words from a person’s speech, even given a sequence of phonemes, is easier said than done because languages can be extremely complex, to say nothing of all the different accents and intonations one might use. Even humans can have difficulty understanding each other, so it’s no wonder this is such a difficult challenge for an AI agent.
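As a rough illustration of how surrounding words can settle the “there” / “their” / “they’re” question, the sketch below scores each candidate with a tiny bigram language model. The corpus is invented and far too small to be useful, and the function names are our own for this example. Production systems use much larger models, but the principle of scoring candidates by context is the same.

```python
from collections import Counter

# Tiny made-up corpus; a real language model is trained on far more text.
corpus = (
    "they're going to the store . "
    "their dog is over there . "
    "they're going home . "
    "the keys are over there . "
    "their house is big ."
).split()

bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)
vocab_size = len(unigrams)


def bigram_prob(prev, word):
    """P(word | prev) with add-one smoothing so unseen pairs are not zero."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)


def pick_homophone(prev_word, candidates, next_word):
    """Choose the candidate that fits best between its two neighbours."""
    def score(candidate):
        return bigram_prob(prev_word, candidate) * bigram_prob(candidate, next_word)
    return max(candidates, key=score)


homophones = ["there", "their", "they're"]
print(pick_homophone(".", homophones, "going"))
print(pick_homophone("over", homophones, "."))
```

Here "." stands for a sentence boundary, so the first call asks which spelling most plausibly starts a sentence and precedes “going”.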
In any case, at this point you have speech that has been transcribed into text, and now your agent needs to determine what that text means. This is where semantic analysis comes in. Humans practice semantic analysis all the time. For example, even before reading this sentence, you were probably pretty confident that you would see an example of how humans practice semantic analysis. (How’s that for meta?) That’s because you were able to use the context clues in previous sentences (e.g. “Humans practice semantic analysis all the time.”) to make a very good guess at what the next few sentences might include. So in order for a VR experience with simulated people to feel real, its AI needs to be skilled at analyzing your words and giving an appropriate response.
Good semantic analysis involves a constant learning process for the AI system. There are tons of different ways to express the same intent, and the programmer cannot possibly account for all of them in the initial set of phrases to watch out for. The AI needs to be good at using complex neural networks to connect words and phrases and determine the user’s intent. And sometimes it doesn’t just need to understand words and their meanings to do this - it needs to understand the user as well. If an application is designed to be used by a single person for an extended period of time, a good AI will pick up on their speech patterns and not only tailor its responses to this person, but also figure out what to expect them to say in specific situations.
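The limitation of a fixed set of phrases is easy to see in a toy example. The sketch below guesses an intent by word overlap (Jaccard similarity) with a handful of example phrasings. The intents, phrases, and threshold are all hypothetical, and word overlap is a deliberately naive stand-in for the learned models a real system would use.

```python
# Hypothetical intents, each with a few example phrasings (illustration only).
INTENT_EXAMPLES = {
    "create_object": ["create a cube", "make a new cube", "add a cube to the scene"],
    "delete_object": ["delete the cube", "remove that cube", "get rid of the cube"],
}


def jaccard(words_a, words_b):
    """Share of distinct words the two phrases have in common."""
    a, b = set(words_a), set(words_b)
    return len(a & b) / len(a | b)


def guess_intent(utterance, threshold=0.4):
    """Return the best-matching intent, or None if nothing is close enough."""
    words = utterance.lower().split()
    best_intent, best_score = None, 0.0
    for intent, examples in INTENT_EXAMPLES.items():
        for example in examples:
            score = jaccard(words, example.split())
            if score > best_score:
                best_intent, best_score = intent, score
    return best_intent if best_score >= threshold else None


print(guess_intent("please make a new cube"))
print(guess_intent("I want this thing gone"))
```

The second utterance clearly means “delete”, yet it shares no words with any example phrase, so this matcher gives up. That gap is what a learning system has to close.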
It’s also worth noting that both semantic analysis and speech recognition can be improved if your AI is used for a very specific purpose. A limited number of concepts to worry about means a limited number of words, phrases, and intents to watch out for. But of course, if you want an AI to resemble a human as much as possible, it will have to naturally respond to anything the user might say to it, even if it does serve a specific purpose.
Labs’ initial research on speech recognition has involved the evaluation of existing speech-to-text solutions. We developed a package for the Asset Store that integrates several of these solutions as Unity C# scripts. The package includes a sample scene that compares the text transcriptions from each API side-by-side and also allows the user to select a sample phrase from a given list of phrases, speak that phrase, and see how quantitatively accurate each result is. The code is also available in a public repository.
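One common way to put a number on transcription accuracy is word error rate (WER): the word-level edit distance between the expected phrase and the transcription, divided by the length of the expected phrase. The sketch below is our own illustration of that metric and is not necessarily the measure used in the sample scene.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference phrase must contain at least one word")

    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate(
    "I'm gonna take a wok from the Chinese restaurant",
    "I'm gonna take a walk from the Chinese restaurant",
))
```

Lower is better, and 0.0 means a perfect match. This version compares words exactly after lowercasing, so punctuation attached to a word counts as a difference unless it is stripped first.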
The speech-to-text package interfaces with Windows dictation recognition, Google Cloud Speech, IBM Watson, and Wit.ai. All of these respond to background speech relatively well, but some of them, such as Windows and Wit.ai, will insert short words at the beginning and end of the recording, probably picking up on some of the beginning and ending background speech that is not obscured by foreground speech. Each solution has its own quirks and patterns and its own methods for dealing with phrases designed to provide challenges for speech recognition.
Windows dictation recognition was recently added to Unity (under UnityEngine.Windows.Speech). Because the asset package is specifically for speech-to-text transcription, it only uses this library’s DictationRecognizer, but the Windows Speech library also has a KeywordRecognizer and a GrammarRecognizer. Windows uses streaming speech-to-text, which means it collects small chunks of audio as they are recorded and returns results in real time. The interim results returned as the user is speaking are temporary - after a pause in speech, the recognizer will come up with a final result based on the entire block of speech.
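On the receiving side, a streaming client typically treats the two kinds of result differently: an interim result overwrites the previous one, while a final result is kept. The sketch below shows that bookkeeping in isolation. The class and method names are hypothetical and are not part of the Unity or Windows APIs.

```python
class TranscriptBuffer:
    """Hypothetical helper that tracks interim and final streaming results."""

    def __init__(self):
        self.final_segments = []  # results the recognizer has committed to
        self.interim = ""         # latest temporary guess, may still change

    def on_interim(self, text):
        # Each interim result replaces the previous one rather than adding to it.
        self.interim = text

    def on_final(self, text):
        # A final result covers the whole block of speech, so drop the interim text.
        self.final_segments.append(text)
        self.interim = ""

    def display_text(self):
        parts = self.final_segments + ([self.interim] if self.interim else [])
        return " ".join(parts)


buffer = TranscriptBuffer()
buffer.on_interim("there")
buffer.on_interim("there gonna wanna")
buffer.on_final("they're gonna wanna tell me")
buffer.on_interim("i can't")
print(buffer.display_text())
```

Note how the early guess “there” never reaches the committed text. Only the final result for that block of speech does.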
We integrated Watson streaming and non-streaming (where the entire recording is sent at once) speech-to-text into the package as well - in fact, IBM has its own Watson SDK for Unity. Like Windows, Watson has built-in keyword recognition. Watson currently supports US English, UK English, Japanese, Spanish, Brazilian Portuguese, Modern Standard Arabic, and Mandarin. An interesting feature we discovered about Watson is that it detects hesitations in speech such as “um” and “uh” and replaces them with %HESITATION. So far we haven’t seen any other types of replacement by Watson.
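If you want to show or compare a Watson transcription without these markers, a small post-processing step is enough. This is a minimal sketch that assumes the marker always appears as a separate whitespace-delimited token. The function name is our own.

```python
def strip_hesitations(transcript, marker="%HESITATION"):
    """Remove hesitation markers and collapse the leftover whitespace."""
    return " ".join(word for word in transcript.split() if word != marker)


print(strip_hesitations("%HESITATION I think %HESITATION we should go"))
```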
Google Cloud Speech also has support for both streaming and non-streaming recognition. From what we have tested, Google appears to have the widest vocabulary of the four options - it even recognizes slang terms such as “cuz”. Google Cloud Speech also supports over 80 languages. It is currently in beta and open to anyone who has a Google Cloud Platform account.
Wit.ai not only includes streaming and non-streaming speech recognition, but it also has an easy-to-use conversational bot creation tool. All you need to do is specify several different ways to express each intent needed by your bot, create stories that describe a potential type of conversation with the bot, and then start feeding it data - the AI can learn from the inputs it receives. Wit.ai even includes a way to validate the text it receives against the entities (traits, keywords, free text) it observes, as well as a way to validate speech-to-text transcriptions. Our Asset Store package only includes non-streaming Wit.ai speech-to-text due to time constraints.
The sample scene in our speech-to-text package includes several test phrases, many of which came from two websites that list good phrases for testing speech recognition (one is a list of Harvard sentences and the other is an article about stress cases for speech recognition). For example, “Dimes showered down from all sides,” (a Harvard sentence) includes a wide range of phonemes, and “They’re gonna wanna tell me I can’t do it,” (a sentence we thought up ourselves) includes contractions and slang terms. The Windows speech-to-text solution seems to be the only one that has a hard time picking up on “they’re” instead of “there” or “their”, even though the context makes it clear which one is needed, and Windows does pick up on “we’re”. Most of the APIs usually preserve the terms “gonna” and “wanna” as they are, but Google interprets them as “going to” and “want to”, which is strange considering it also uses the term “cuz” (Wit.ai can also recognize “cuz”). A funny test phrase found in Will Styler’s article is “I’m gonna take a wok from the Chinese restaurant.” We never once got the word “wok” to appear - every API transcribed it as “walk” every time, which still makes perfect sense even given the context of the sentence. This kind of sentence is a huge stress test - even plenty of humans would need more clarification than just the context of that one sentence itself. For example, if you know that the “I” in the sentence is a thief, that would make “wok” much more likely than “walk”.
The package we developed is meant to be an easy way to compare a few of the biggest speech-to-text options in Unity and integrate them into your projects. And if there are other APIs you would like to try out in Unity, this package should make it relatively easy to create a class that inherits from one of the base speech-to-text services and integrate it into the sample scene and/or widgets. In addition to the individual speech-to-text SDKs, the package includes several helper classes and functions (a recording manager, audio file creation and conversion, etc.) to facilitate the integration and comparison of more APIs.
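The general shape of that extension point can be sketched as follows. This is a language-neutral illustration written in Python rather than the package’s C#, and every name in it (SpeechToTextService, transcribe, CannedService, compare) is hypothetical. The package’s real base classes and method signatures are in the repository.

```python
from abc import ABC, abstractmethod


class SpeechToTextService(ABC):
    """Hypothetical base class; does not mirror the package's actual API."""

    name = "unnamed"

    @abstractmethod
    def transcribe(self, audio_bytes: bytes) -> str:
        """Return the text transcription of a finished recording."""


class CannedService(SpeechToTextService):
    """Stand-in for a real API wrapper; always returns a fixed transcription."""

    def __init__(self, name, canned_text):
        self.name = name
        self.canned_text = canned_text

    def transcribe(self, audio_bytes: bytes) -> str:
        # A real subclass would send audio_bytes to its service here.
        return self.canned_text


def compare(services, audio_bytes):
    """Run one recording through every service and collect results by name."""
    return {service.name: service.transcribe(audio_bytes) for service in services}


services = [
    CannedService("service_a", "take a walk"),
    CannedService("service_b", "take a wok"),
]
print(compare(services, b"\x00\x01"))
```

Because the comparison code only depends on the base class, adding another API means writing one more subclass and adding it to the list.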
What makes speech recognition so difficult to get right is the sheer multitude of variables to look out for. For each language you want to recognize, you need a ton of data about all the words that exist (including slang terms and shortened forms), how those words are used in conjunction with each other, the range of tones and accents that can affect pronunciation, all the redundancies and contradictions inherent to human language, and much more.
Our Asset Store package currently integrates a few speech-to-text solutions - but these are enough to easily compare some of the biggest solutions out there and to see what general strengths and weaknesses exist among today’s speech recognition tools. It is a starting point for Unity developers to see what works for their specific needs and to add further functionality. You can integrate more speech-to-text tools, add a semantic analysis step to the architecture, and add whatever other layers are necessary for your Unity projects. Refer to this article for a review of several semantic analysis tools.
This research was motivated by Carte Blanche’s initial plan to integrate AI agent U to respond to voice commands. Accomplishing this involves speech-to-text transcription and keyword recognition. Another interesting yet difficult challenge would be creating an agent with whom the user can have a conversation. We humans often speak in sentences or sentence fragments and throw in “um”s and “ah”s and words that reflect our feelings. If an AI agent in a VR application can understand not just keywords but every part of a person’s conversational speech, then it will introduce a whole other level of immersion inside the VR environment.
The ability to have a natural conversation with something in VR (keyword “something” - even if we can get it to feel real to the user, in the end it’s still a user interface) is applicable to a wide variety of applications beyond Carte Blanche - for example, virtual therapy such as SimSensei, the virtual therapist developed by the USC Institute for Creative Technologies (ICT). However, as we all know, there are many different ways to express the same intent, even in just one language - so creating natural conversations is no easy task.
You can find the speech-to-text package on the Asset Store. The Bitbucket repository for the package can be found here. Anyone is welcome to create their own forks. Drop us a note at email@example.com if you find it useful. We'd love to hear from you!
Images: Amy DiGiovanni
Amy DiGiovanni & Dioselin Gonzalez work at Unity Labs; Amy is a Software Engineer and Dio is a VR Principal Engineer.