In this post I would like to share some opinions about Text-to-Speech (TTS) technology in the context of gaming. I am not a gaming expert; my expertise is in signal and speech processing. I work in the IBM Research division.
I hope this post will trigger a discussion and an exchange of opinions. My purpose here is to discuss “quality”. I plan to discuss technology trends in follow-on posts.
The potential benefit of “good quality” TTS technology for game developers is clear. Yet TTS is still regarded as delivering “insufficient quality”. What is “quality” anyway, in our context?
1. The basic quality of modern TTS is generally good. State-of-the-art machine learning algorithms predict the prosody (“intonation”, duration, loudness, emphasis and more) well from the text, and the synthesized speech achieves good scores in subjective quality tests. It is no longer the “robotic” sound it used to be; it sounds natural and “clean”. For applications such as announcements or commercial question answering, modern TTS provides a good alternative to recorded speech.
2. Natural speech, however, needs correct emphasis on different words across sentences. Because natural language is ambiguous, algorithms (and humans as well) cannot always determine the “correct” emphasis from the text of isolated sentences or utterances without knowing the entire context. This can limit the quality achievable by modern TTS technology.
3. When we consider gaming applications, additional needs arise. For example, speech generated in a formal style would sound weird in an action scene: where are the emotions? The emotional content of human speech is essential for conveying messages. This is certainly an important aspect of “quality”.
4. Yet another aspect of “quality”, at least in the broader sense, is the variety of voices. Most modern TTS technologies are based on pre-recorded human voices (recordings of voice talents uttering a large collection of sentences). Because recording the speech and processing the recordings are expensive and time consuming, the variety of voices in typical TTS products is limited: often just a few, and less commonly several dozen, different voices per “major” language. Moreover, gaming often requires non-human voices, such as “cartoonish” ones, to best support the different characters. In short, reusing the same voices across games and characters makes for a less-than-optimal experience or, in other words, lower “quality”.
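To make point 2 concrete, here is a minimal sketch of how a developer who knows the context could pass the intended emphasis to a TTS engine explicitly instead of leaving it to be guessed from the text. It assumes the engine accepts SSML (the W3C Speech Synthesis Markup Language) and honors its `<emphasis>` element, which not every engine does. The helper name `ssml_with_emphasis` and the example sentence are hypothetical, chosen only for illustration.

```python
from xml.sax.saxutils import escape


def ssml_with_emphasis(sentence: str, stressed_word: str, level: str = "strong") -> str:
    """Wrap the first occurrence of stressed_word in an SSML <emphasis> tag.

    Hypothetical helper for illustration; it assumes the target TTS engine
    accepts SSML and supports the <emphasis> element.
    """
    pieces = []
    done = False
    for token in sentence.split():
        core = token.strip(".,!?;:")  # the word without surrounding punctuation
        if not done and core and core.lower() == stressed_word.lower():
            start = token.find(core)
            prefix, suffix = token[:start], token[start + len(core):]
            pieces.append(
                f'{escape(prefix)}<emphasis level="{level}">'
                f"{escape(core)}</emphasis>{escape(suffix)}"
            )
            done = True
        else:
            pieces.append(escape(token))
    return "<speak>" + " ".join(pieces) + "</speak>"


# The same text carries a different meaning depending on the stressed word.
line = "I never said she stole the key."
print(ssml_with_emphasis(line, "she"))
print(ssml_with_emphasis(line, "key"))
```

The two printed documents contain identical words but mark a different word for emphasis, which is exactly the information that cannot be recovered from the isolated sentence alone.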
I hope this provides some initial insights from the perspective of a speech technology researcher. I hope to get feedback from the gaming experts. Am I right? What have I missed? What would allow game developers to start benefiting from TTS technology?
All the best, Aharon.