By Rudy Helm, Audio and Quality Assurance Tech, Visual Purple, LLC.
The SSML language
If you’re new to markup languages, the example below may be a useful reference for understanding SSML syntax.
< ?xml version="1.0"? >
< speak version="1.0" xml:lang="en-US" >
< voice name="Dave" >
Hello, world; my name is Dave.
< /voice >
< /speak >
This example shows that the voice named “Dave” should pronounce: “Hello, world; my name is Dave.” (Keep in mind that the spaces adjacent to the angle brackets are only there for display and should be removed in a real-world application.) Like other XML-based markup languages, SSML is composed of elements. The root element is < speak >.
Figure 5 below is a table that shows how the SSML elements are associated with the five points of Text Analysis.
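To see the element structure of the “Dave” example for yourself, you can load it into any XML parser. The snippet below is a minimal sketch using Python’s standard library; the function name `describe` is made up for illustration, and the markup is the same example with the display-only spaces around the angle brackets removed.

```python
import xml.etree.ElementTree as ET

# The "Dave" example, with the spaces next to the angle brackets removed.
SSML = """<?xml version="1.0"?>
<speak version="1.0" xml:lang="en-US">
  <voice name="Dave">
    Hello, world; my name is Dave.
  </voice>
</speak>"""


def describe(ssml):
    """Return (tag, attributes, text) for every element, root element first."""
    root = ET.fromstring(ssml)
    return [
        (el.tag, dict(el.attrib), (el.text or "").strip())
        for el in root.iter()
    ]


if __name__ == "__main__":
    for tag, attrs, text in describe(SSML):
        print(tag, attrs, repr(text))
```

The first entry returned is the root element and the second is the voice element nested inside it, which is the same nesting the markup shows.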
You will use the prosody tag a lot if you intend to create separate voice characters from only one TTS resource. With prosodic control you can manage the tempo and pitch of the voices.
Listen to this XML example of the ‘Grandson’ talk scenario, and see below the markup tag that makes it play at a higher pitch.
< prosody pitch="+4.2st" > I believe Visual Purple’s products have among the best where NPC voice quality is concerned. < /prosody >
As far as TTS engines go, this is a pretty effective example. Rather than emphasizing one or a few individual harmonics, as the wood or metal of a musical instrument does, the vocal tract emphasizes entire bands of harmonics, called formants. Each vowel sound has characteristic bands of higher-intensity harmonics. In short, the character of the original voice clip is largely retained, even when the voice’s pitch has been raised. Beware that not all TTS engines fare so well when driven by markup languages.
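If you are writing many pitch tags by hand, a small helper can build them for you. This is only a sketch: the function name `prosody` and its parameter are my own choices, not part of SSML or of any TTS engine’s API. It formats the semitone offset the same way as the tag above (“+4.2st”) and escapes characters such as “&” that would otherwise break the XML.

```python
from xml.sax.saxutils import escape


def prosody(text, pitch_st=0.0):
    """Wrap text in a prosody tag with a relative pitch shift in semitones.

    Hypothetical helper: the offset is written as a signed number followed
    by the "st" unit, e.g. "+4.2st" or "-3.8st".
    """
    return f'<prosody pitch="{pitch_st:+.1f}st">{escape(text)}</prosody>'


if __name__ == "__main__":
    print(prosody("Hello, world; my name is Dave.", 4.2))
```

Whether a given engine honors the requested shift as cleanly as the sample above is something you still have to judge by ear.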
Now listen to the same ‘Grandson’ sample played at a faster tempo, and see below the markup tag that makes it do so.
< prosody rate="+5%" > I believe Visual Purple’s products have among the best where NPC voice quality is concerned. < /prosody >
Note the glaring sonic artifacts in this example. It plays way too quickly to sound ‘natural’! In my own research I have noticed that many of the available TTS engines do not give the user fine control when entering tempo parameters into markup tags. The results are usually too fast or too slow. And in some of the engines that do respond to fine control, sonic artifacts such as static or scratchiness are introduced.
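One way to find out how finely an engine responds to tempo values is to audition a range of them side by side. The sketch below only generates the markup; `rate_sweep` and its default range are assumptions chosen for illustration, and how each engine interprets the values (including the “+0%” baseline) will vary.

```python
from xml.sax.saxutils import escape


def rate_sweep(text, start=-10, stop=10, step=5):
    """Return one prosody-wrapped snippet per rate offset, in percent.

    Hypothetical helper: with the defaults it yields -10%, -5%, +0%, +5%
    and +10%, so the unmodified tempo is included as a baseline.
    """
    body = escape(text)
    return [
        f'<prosody rate="{pct:+d}%">{body}</prosody>'
        for pct in range(start, stop + 1, step)
    ]


if __name__ == "__main__":
    for snippet in rate_sweep("Hello, world; my name is Dave."):
        print(snippet)
```

Feeding each snippet to the same engine and listening for the point where artifacts appear gives a rough picture of how much usable tempo range that engine has.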
Listen to this XML example of the ‘Grandpa’ talk scenario where we use a markup tag to make it play at a lower pitch.
< prosody pitch="-3.8st" > There are a number of synthetic voice vendors available. It seems though, that many of these vendors are reselling the same voice actors, so try to get your license from the source. < /prosody >
It’s interesting to note that this is the same TTS engine that performed so well in the raised-pitch example, yet it shows some sonic artifacts here, where the frequencies are pitched lower. Listen carefully and observe the slight scratchiness. It’s as though you can hear tiny, rhythmic interruptions all through the sound data.
Listen to an XML example of the ‘Grandpa’ talk scenario and see below the markup tag that makes the above Grandpa sample play at a slower tempo.
< prosody rate="-10%" > There are a number of synthetic voice vendors available. It seems though, that many of these vendors are reselling the same voice actors, so try to get your license from the source. < /prosody >
This one mirrors the tempo defect revealed in the Grandson tempo exercise above, only in the opposite direction. It plays way too slowly to be useful (unless you are going for that classic HAL-the-computer voice near the end of the movie 2001, where the robot meets his slow demise).
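Pulling the two characters together, here is a rough sketch of how several voice characters could be derived from one TTS voice in a single SSML document. The pitch offsets are the ones from the Grandson and Grandpa tags above; the `CHARACTERS` table, the `build_ssml` function, and the use of the voice name “Dave” are assumptions for illustration only. Tempo is left out on purpose, since the rate results above were not usable.

```python
import xml.etree.ElementTree as ET

# Pitch offsets taken from the Grandson and Grandpa examples above.
CHARACTERS = {
    "Grandson": {"pitch": "+4.2st"},
    "Grandpa": {"pitch": "-3.8st"},
}


def build_ssml(lines, voice="Dave", lang="en-US"):
    """Build one SSML document from (character, text) pairs.

    Hypothetical helper: every line is spoken by the same voice, and each
    character is distinguished only by its prosody settings.
    """
    speak = ET.Element("speak", {"version": "1.0", "xml:lang": lang})
    voice_el = ET.SubElement(speak, "voice", {"name": voice})
    for character, text in lines:
        # Copy the preset so the shared table is never modified.
        line_el = ET.SubElement(voice_el, "prosody", dict(CHARACTERS[character]))
        line_el.text = text
    return ET.tostring(speak, encoding="unicode")


if __name__ == "__main__":
    print(build_ssml([
        ("Grandson", "Hello, Grandpa."),
        ("Grandpa", "Hello, my boy."),
    ]))
```

Keeping the settings in one table means a character’s pitch can be adjusted in a single place once you have heard how the engine handles it.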
Synthetic voices can make effective voice actors when techniques are carefully deployed (especially adjusting tempo and pitch to improve your NPC’s realism). At this juncture there appears to be no single go-to solution. For good results, we’ve had to use a combination of 3rd-party software and XML tags, though I have to admit we seem to resort to 3rd-party software more and more.
If markup language deployment were as robust as we wish it were, we would be able to include an XML parser in our commercially available development tools. We have clients who have expressed an interest in building their own virtual world simulations, where all they need do is type in their avatars’ text and the voice-to-animation syncing just happens automatically for them. The bottleneck, though, appears to be too dramatic a hit on frame rate, where the TTS speech and/or animation quality suffers. A great demand is put on a CPU when it has to display high-quality images and process real-time audio manipulations simultaneously. This is why, in the meantime, we pre-render scenarios so that our content looks and sounds glorious.
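To make the pre-rendering idea concrete, here is a minimal sketch of rendering each piece of markup to a file once and reusing it afterwards, so that no synthesis has to happen while frames are being drawn. This is not a description of Visual Purple’s actual pipeline: `prerender` is a hypothetical function, and `synthesize` stands in for whatever call your TTS engine provides to turn SSML into audio bytes.

```python
import hashlib
from pathlib import Path


def prerender(ssml, out_dir, synthesize):
    """Render ssml to an audio file once and return the file's path.

    Hypothetical helper: `synthesize` is any callable that takes an SSML
    string and returns audio bytes. The file name is derived from a hash
    of the markup, so identical markup always maps to the same file.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    key = hashlib.sha256(ssml.encode("utf-8")).hexdigest()[:16]
    path = out_dir / f"{key}.wav"
    if not path.exists():
        # Only pay the synthesis cost the first time this markup is seen.
        path.write_bytes(synthesize(ssml))
    return path
```

The trade-off is the one described above: the audio is fixed ahead of time, so the user cannot simply type new text and hear it immediately.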
Our technologists will figure out a solution to the real-time problem, though. Visual Purple is all about quality – and providing the tools that our customers want!
In a future blog post – Comedic treatment in TTS voices. Can robots be funny? Stay tuned!