Comedic treatment in TTS voices (Can Robots be Funny?, Part 2)
By Rudy Helm, Audio and Quality Assurance Tech, Visual Purple, LLC.
Another element to this task is to lengthen or shorten the TTS words to match the blobs of the human model. Figure 5 depicts the effort to make the TTS utterance of ‘…was a…’ (pronounced as though a contraction, ‘whuzza’) line up on the timeline with John’s clip. Use your DAW’s stretch tool to accomplish this.

Figure 5a- First, make your split points

Figure 5b- Next, use a stretch tool
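If you would rather work out the stretch amount than drag by eye, the arithmetic is just the target length divided by the source length. Here is a minimal sketch of that idea. The function name and the example timings are made up for illustration (they are not measurements from John's clip), and your DAW may express the amount differently, for example as a percentage or as a new clip length.

```python
def stretch_ratio(source_seconds: float, target_seconds: float) -> float:
    """Return the factor by which a clip must change length.

    A result above 1.0 means lengthen the clip; below 1.0 means shorten it.
    """
    if source_seconds <= 0 or target_seconds <= 0:
        raise ValueError("durations must be positive")
    return target_seconds / source_seconds


# Hypothetical timings: the TTS 'whuzza' runs 0.30 s, the human model's runs 0.24 s.
ratio = stretch_ratio(source_seconds=0.30, target_seconds=0.24)
print(f"Stretch the TTS region to {ratio:.0%} of its original length")
```

A ratio below 1.0 means the synthetic region has to be squeezed to fit the human blob; above 1.0 means it has to be drawn out.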
Let’s continue splitting the TTS clip on the timeline so that we can move each corresponding sound blob to match, and stretch the words right down to the syllable (Figure 6 shows what it looks like when all words have been synced). Listen to the whole joke, both voices lined up properly.

Figure 6
Here’s where some of you are thinking: Well, the blobs are lined up very nicely, but what about nuances of stress and pitch? Isn’t the TTS voice failing to express the word ‘lawyer’ the way our human friend, John, does? John’s lawyer blob is larger (i.e., louder) than the TTS blob. And isn’t the synthetic actor also failing to emulate John’s ‘seen’, where the stress comes not from volume but from a pitch that is higher than the rest of the phrase?
Yes, indeed, so let’s try to fix these two issues. We’ll tackle the loudness point first. Figure 7 shows a Volume Envelope (the horizontal blue-ish line running through the center of the TTS clip in the timeline). In most DAWs that have this feature, you can bend the volume envelope to raise or lower the clip’s volume at any point along the timeline.

Figure 7 – Creating break points within the line bends the envelope
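For the curious, a volume envelope can be thought of as a list of break points, each one a (time, gain) pair, with straight lines drawn between them. The sketch below shows that idea under some assumptions: the function names and the break point values are hypothetical, samples are plain floats, and a real DAW may use curved rather than linear segments.

```python
from bisect import bisect_right


def envelope_gain(breakpoints, t):
    """Gain at time t, linearly interpolated between (time, gain) break points.

    breakpoints must be sorted by time. Before the first point and after the
    last point the gain is held flat.
    """
    times = [point[0] for point in breakpoints]
    if t <= times[0]:
        return breakpoints[0][1]
    if t >= times[-1]:
        return breakpoints[-1][1]
    i = bisect_right(times, t)  # index of the first break point later than t
    (t0, g0), (t1, g1) = breakpoints[i - 1], breakpoints[i]
    return g0 + (g1 - g0) * (t - t0) / (t1 - t0)


def apply_envelope(samples, sample_rate, breakpoints):
    """Scale each sample by the envelope gain at that sample's time."""
    return [s * envelope_gain(breakpoints, n / sample_rate)
            for n, s in enumerate(samples)]


# Hypothetical envelope: unity gain, then a bump to 1.5x over one word, then back.
lawyer_bump = [(0.0, 1.0), (1.0, 1.0), (1.2, 1.5), (1.6, 1.5), (1.8, 1.0)]
print(envelope_gain(lawyer_bump, 1.1))  # halfway up the ramp into the bump
```

Each break point you click into the line in Figure 7 corresponds to one pair in that list.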
Now let’s tackle the pitch issue with that word, ‘seen’. Figure 8 shows the clip properties dialog box specific to the split-off region of our seen-blob. The highlighted value indicates that the word’s pitch has been raised four half steps (semitones).

Figure 8
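To get a feel for how big a nudge four half steps is, you can convert half steps to a frequency ratio. This is a small sketch assuming standard equal temperament, where each half step multiplies frequency by the twelfth root of two. The function name and the 120 Hz example pitch are hypothetical, and I am not claiming this is how any particular DAW computes its pitch setting internally.

```python
def semitone_ratio(half_steps: float) -> float:
    """Frequency multiplier for a pitch shift of the given number of half steps
    (equal temperament: 12 half steps doubles the frequency)."""
    return 2.0 ** (half_steps / 12.0)


ratio = semitone_ratio(4)
print(f"Four half steps up multiplies frequency by {ratio:.4f}")

# Hypothetical speaking pitch for the word before the shift.
original_hz = 120.0
print(f"{original_hz:.0f} Hz becomes {original_hz * ratio:.1f} Hz")
```

Four half steps works out to roughly a 26% increase in frequency, which is a major third in musical terms.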
Listen to the resulting TTS clip with the treatments per ‘lawyer’ and ‘seen’.
Window dressing
Earlier I mentioned that this is a voice for a talking fish. This fish is contained within a fish tank in a hotel bar. Listen to our talking fish enveloped in a bubbling sound effect. Figure 9 shows the TTS clip, sans John’s clip, and with the fish tank noise clip added.

Figure 9 – Note that a volume envelope has been applied to the bubbles as well.
So, is that it, then? Maybe, maybe not. If we really did want to add some reality to a talking fish environment, we might consider what we know about how a fish tank affects sound. Occlusion happens. There is a glass barrier between the sound emitter (the talking fish) and the sound receiver (the avatar). So, we could elect to shave off some of the high frequencies from our talking fish. We can accomplish this by choosing the appropriate reverb effect. If you have presets at your disposal, start with a bathroom preset or similar. Try placing the reverb effect before any equalization effects (EQ). We use EQ here to bring out the hi-mid frequencies of the voice to ensure that it is intelligible (you may need to reduce high frequencies as well if you choose a reverb preset that sounds too bright). In this case we are also deploying EQ to remove extreme low-frequency rumble (artifacts that commonly get introduced by accident when using filters in the digital domain). Figure 10 shows this idea. Have a listen to the result.

Figure 10a – Software ‘bathroom’ reverb

Figure 10b – Software EQ module
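If you are wondering what "shaving off highs" and "removing rumble" mean at the sample level, here is a rough sketch using the simplest filter I know of, a one-pole low-pass, plus a high-pass built by subtracting the low-passed signal from the original. To be clear, this is a swapped-in illustration and not the reverb preset or EQ module pictured above. The function names and cutoff values are assumptions, and commercial plug-ins use considerably more refined filters.

```python
import math


def one_pole_lowpass(samples, sample_rate, cutoff_hz):
    """Attenuate content above cutoff_hz with a simple one-pole smoother."""
    # Smoothing coefficient derived from the cutoff frequency.
    alpha = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / sample_rate)
    out, y = [], 0.0
    for x in samples:
        y += alpha * (x - y)  # move part of the way toward the new sample
        out.append(y)
    return out


def one_pole_highpass(samples, sample_rate, cutoff_hz):
    """Attenuate content below cutoff_hz: the input minus its low-passed copy."""
    low = one_pole_lowpass(samples, sample_rate, cutoff_hz)
    return [x - l for x, l in zip(samples, low)]


# Hypothetical settings: dull the voice above about 3 kHz to suggest the glass,
# then strip rumble below about 80 Hz.
sample_rate = 44100
voice = [math.sin(2.0 * math.pi * 440.0 * n / sample_rate) for n in range(1000)]
occluded = one_pole_lowpass(voice, sample_rate, 3000.0)
cleaned = one_pole_highpass(occluded, sample_rate, 80.0)
```

The order here mirrors the chain described above only loosely: dull the sound first, then clean up the low end.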
Conclusion
Can synthetic voice-actors make funny? Humor is a very subjective aspect of human emotion. What’s funny to Samuel isn’t so funny to Mary, and so forth. So maybe the jury is still out on that one. To improve our NPC’s delivery, we’ve had to rely on 3rd-party software to ensure that techniques were carefully deployed. Markup language deployment probably won’t be sufficient for specific tasks like this, where real-time interaction is not a requirement. That’s my best guess, anyway.
You may wonder what to do if you have a project that requires an ensemble of funny voices. Well, as long as you have at least one funny human available to you, that person can be your model for all voices. Then your cast of synthetic actors can be molded to conform to your model’s comedic timing.
How about this scenario: you have a cinematic cut-scene with several actors in the movie (or trailer), but your budget can only afford one human voice-actor. Consider recording your one voice actor doing the roles of the entire cast. Then, using the techniques discussed above, create an ensemble of TTS voices and, in your video editor (NLE), synchronize the synthetic voices to the phrasings and expressions of your one human actor.
In fact, maybe we’ll try to tackle an example of that in my next blog entry. Stay tuned!












