Emulating Human Voice-overs with TTS Voices, Part Three

By Rudy Helm, Audio and Quality Assurance Tech, Visual Purple, LLC.

Continuing our vlog ‘how-to’ series, ‘Emulating Human Voice-overs with TTS Voices’, we now offer this newer presentation, suffixed ‘Part Three’. We recommend that you review Part One and Part Two first, but that is not a requirement. For this exercise we snipped out a small piece from one of our past projects. Unlike Part Two, where you learned to sync a TTS voice to a human voice-actor, this video tutorial focuses on the tactics of humanizing synthetic voice-clips with added detail. Today’s presentation not only reinforces the techniques discussed in Part One and Part Two, but also shows how to set the talk-pace to improve the phrasings and expressions of synthetic voices. The concept of formant manipulation is introduced as well. Disclaimer: these are helpful tips, but generalized. Not all editing tools or TTS engines respond in the same way to the specific techniques that you might try. Mainly, just try to grasp the concepts, then adapt your technique to the idiosyncrasies of your chosen tools.

The How-To-Humanize your TTS Clips exhibit (Exhibit Part 3). A follow-up on VO elements originally presented in Part 2’s vlog. This time, we introduce Formant handling.

As always, check back in for more on this topic and other fun and useful information!

Emulating Human Voice-overs with TTS Voices, Part Two

By Rudy Helm, Audio and Quality Assurance Tech, Visual Purple, LLC.

Since there was great interest in a blog entry last fall called ‘Emulating Human Voice-overs with TTS Voices’, I have elected to present those lessons as a vlog, and so it makes sense to give this newer presentation the same title, suffixed with ‘Part Two’. We recommend that you review Part One’s scenario before you proceed (to do so, click here), but doing so is not a requirement. For this exercise we snipped out a small piece from Part One’s cut-scene. That scene had several actors in the cast, but this clip deals with only one of them. Recall the premise: your project’s budget can afford only one human voice-actor, so you have recorded that one actor doing each role of the entire cast. This video tutorial demonstrates the techniques discussed in Part One. Learn how to sync synthetic voices to the phrasings and expressions of your human model. Disclaimer: these are helpful tips, but generalized. Not all editing tools or TTS engines respond in the same way to the specific techniques that you might try. Mainly, just try to grasp the concepts, then adapt your technique to the idiosyncrasies of your chosen tools.

Sound effects were mentioned in Part One, but that discussion will need to wait for a future vlog. Music was mentioned also, but we cover music in other vlogs, so be sure to look for those as well.

The How-To-Synchronize TTS to Human Model exhibit (Exhibit Part 2). A vlog on how we developed the VO elements originally presented in Part 1.

As always, check back in for more on this topic and other fun and useful information!

When Your Musician is a Robot, Part 4 (Can musical assets be free in a Virtual World?)

By Rudy Helm, Audio and Quality Assurance Tech, Visual Purple, LLC.

This is to fulfill my promise to describe how we were able to develop that functional, copyright free, royalty free, original music that we showcased within the video that we presented in Part 3. Click here for a review of that video, as it is essentially prerequisite viewing to get the most from this article’s tutorial. Yes, this video-blog contains the bona fide instructional ‘how-to’. That is — a fun, informational video show-and-tell regarding the background music-making tool and processes involved with the production in the Part 3 exhibit.

Recall that Exhibit Part 3 embodied a variety of musical styles. In this latest video, Exhibit Part 4, animated avatars will act as both your tour guides and mentors. Also remember that we had promised to discuss the usage of the music tool, with emphasis on reinforcing ‘the 1-4-5 principle’ (we initially introduced that here). But today…well, with this video you can see that principle in action.

We think you will enjoy this. Please don’t hesitate to send us feedback!

http://www.youtube.com/watch?v=-UTjeCRDSb8

The How-To-Create BGM exhibit (Exhibit Part 4). Learn how we developed the musical elements originally presented in Exhibit Part 3.

When Your Musician is a Robot, Part 3 (Can musical assets be free in a Virtual World?)

By Rudy Helm, Audio and Quality Assurance Tech, Visual Purple, LLC.

Well, it’s been a while, so I thought that we should continue with the theme from my previous blog entry. There are many interesting and fun things to learn. You may recall that we were discussing the notion of your project’s background music (BGM) having the desirable attributes of being copyright free, royalty free, and an original composition at that! Those characteristics undoubtedly appeal to virtual-world developers, makers of cutscenes, trailers and Machinima projects. With the tutorials that we present here, there is no reason why even non-musicians can’t generate musically useful results (even for foreground musical elements, but that discussion is for a later time).

This exercise exhibits a variety of musical styles and embodies them into a single animation sequence. The exhibit portrays a conference room where the attendees are gathered to give their ‘boss man’ a report on the TV and film entertainment industry (actually it’s taken from the 2010 Golden Globes). Embedded in this viewable animation we’ll feature synthetic actors with synthetic speech (TTS voice-overs) as foreground elements (click here for a refresher on the technique). With this test-scene we utilize only 2 TTS male and 1 TTS female voice libraries to cover a cast of 11 adult attendees.

The animation sequence was borrowed from one of our past projects. It had been a full-motion video of a dramatized high-level meeting and for this exhibit it has been ‘cartoon-ized’ to mask logos, etc. The original human VO audio discussed issues about how to save the world, but here we have replaced them with a TTS script chattering about the entertainment world. So, obviously the script is intended to be nonsense; the focus of this little project being on ambient background music production, and less on the TTS actors (but don’t worry, we will have some more in-depth tutorials on TTS production in the near future!).

Now, please view and listen to the animation sequence. Imagine that at this meeting there might be a radio playing in the background for this scenario.

[Video: YouTube, Visual Purple, “When Your Musician is a Robot, Part 3 (Can musical assets be free in a Virtual World?)”]
The Conference Room exhibit – listen to the musical elements as they each segue. Note how well BGM serves the animation and VO.

In part 4 of this series I will discuss the usage of the music tool. Remember these numbers: 1-4-5? (if not, click here). Essentially, it’s all you need to know about music theory to engage in these tutorials.

[to be continued]

When Your Musician is a Robot (Can Automated Minstrels Play Nice in a Virtual World?), Part 2

By Rudy Helm, Audio and Quality Assurance Tech, Visual Purple, LLC.

How does one tame a virtual musician? For a discussion on the UI, let’s go step by step with the process I underwent to generate a music bed. What follows is the style palette. (Other competing software tools may not look like these screenshots but will offer similar functions.) In one scenario, I chose one of the very many available country styles, but one that includes pedal steel guitar, as in Figure 1.

Figure 1 – The Country music selection from the musical styles palette window. Note that there are many sub-styles to choose from.

You can set the key (Figure 2). If you don’t care what key, leave this alone and your music will default to the key of ‘C’ (I didn’t, and mine used the default for both styles). You can also set the tempo (Figure 3). This is a trial-and-error kind of thing. Experiment until it feels right for your purposes. If you don’t set the speed, it will default to 120 BPM (beats per minute…think Sousa march).

Figure 2 – Key selection menu.

Figure 3 – Tempo selection dialog box.

Figure 4. The interface is like a spreadsheet. Each cell entry represents which of the (1,4,5) chords will fall in the timeline. The first cell defaults to ‘1’ (in this case ‘C’), so you don’t need to enter any values yet.
Figure 4 – Cell one defaults to ‘C’ chord

Figure 5. But move over to the next cell (‘bar 2’ in musical lingo) and enter the number 4. It automatically knows which proper chord to enter within that key (in this case, the ‘F’ chord).
Figure 5 – Enter ‘4’; the ‘F’ chord appears

Figure 6. Move to the next cell and let’s enter the number we haven’t used yet, ‘5’. Let’s leave cell four empty.
Figure 6 – Enter the digit ‘5’ into cell three

Figure 7. Now, at cell three, you will see that the tool has automatically assumed the ‘G’ chord for you.
Figure 7 – The ‘G’ chord appears

What you have then is one bar of C, one bar of F and two bars of G (the empty fourth cell simply carries the G chord forward). To finish preparing the body of your new music bed, highlight and copy the upper row of cells you instantiated, as in the following Figure 8.
Figure 8 – Copy four bars of music (cells one through four)

The next step is to paste those 4 copied bars into three more rows of cells. Now you end up with a 16 bar loop, as in Figure 9.
Figure 9 – Four bars pasted three times results in sixteen bars

Figure 10. Enter 16 bars (16 cells) to define the start and end of your loop.
Figure 10 – Enter 16 to indicate which cell is the end

Figure 11. Choose how many repeats for your loop. How many times your music bed should loop-play depends on how long you need it to play. If the music engine in your project will repeat the loop as many times as you need, set the loop count to ‘1’. If not, set it to the number of loops that will fill the time required.
Figure 11 – Click the loop button and select repeats from a pull-down menu.
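
To make the arithmetic behind the loop concrete, here is a minimal Python sketch. It is not part of the music tool, and the function names are made up for illustration. It assumes 4/4 time (four beats per bar), which may not hold for every style. It fills the empty fourth cell with the previous chord, pastes the four bars into sixteen, and estimates how many repeats a scene of a given length would need.

```python
import math

BEATS_PER_BAR = 4  # assumption: the chosen style is in 4/4 time


def expand_pattern(cells):
    """Fill empty cells (None) with the previous chord, as the grid does."""
    bars, current = [], None
    for cell in cells:
        if cell is not None:
            current = cell
        bars.append(current)
    return bars


def loop_seconds(num_bars, bpm=120):
    """Length of one pass through the loop, in seconds."""
    return num_bars * BEATS_PER_BAR * 60.0 / bpm


def loops_needed(scene_seconds, num_bars, bpm=120):
    """Smallest whole number of repeats that covers the scene."""
    return math.ceil(scene_seconds / loop_seconds(num_bars, bpm))


four_bars = expand_pattern([1, 4, 5, None])  # [1, 4, 5, 5]
sixteen_bars = four_bars * 4                 # paste the row three more times
print(len(sixteen_bars), loop_seconds(len(sixteen_bars)))  # 16 32.0
print(loops_needed(90, len(sixteen_bars)))                 # 3
```

At the default 120 BPM the sixteen-bar loop lasts 32 seconds under this assumption, so a hypothetical 90-second scene would need a loop count of 3.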

Conclusion
I commonly say that whenever you can afford real musicians for crucial sonic moments such as main themes, hire them. But when budget cries mercy, maybe try some of the things I have offered in these blogs about synthetic music production, especially for BGM.

Let’s review the positive points — copyright free, royalty free, original music…that can be created by anyone on your team (with the help of your synthetic musician, of course). In our next blog we will cover a few more fascinating creations from our virtual composer, so stay tuned! And by the way, if you would like some consultation or some help developing your project please don’t hesitate to contact us.

When Your Musician is a Robot (Can Automated Minstrels Play Nice in a Virtual World?), Part 1

By Rudy Helm, Audio and Quality Assurance Tech, Visual Purple, LLC.

This writing is a follow-up to a promise I made at the end of my previous blog ‘Emulating Human Voice-overs with TTS Voices’. For now, consider this proposition — what if your project’s background music (BGM) had these characteristics:
· copyright free
· royalty free
· original composition
· authored by you!

Does this appeal to you virtual-world developers of cutscenes, trailers or Machinima projects? Moreover, if you consider yourself a non-musician, then this should certainly be happy news for you! It’s true that being musically inclined can be a boon to this process, but there is no reason why a non-musician can’t generate some musically useful results.

I thought that this exercise would be a fun opportunity to exhibit two musical styles and apply them to the same animation sequence (not at the same time, of course!). The first exhibit will portray a rural cafe where the clientele would be people who appreciate country music. The second exhibit, while actually the same animation, let’s pretend is a ‘blue-collar’ cafe where the clientele would appreciate, um…, ‘roadhouse rock’ (whatever that is – let’s use our imagination!). In keeping true to my past themes of NPC VO, our YouTube animations embedded in this blog will star synthetic actors with synthetic speech as foreground elements (and which have been synced to the phrasings of a prerecorded human model; click here for a refresher on the technique). With this test-scene we utilized only 1 TTS male voice to cover a small cast of 2 adult males.

I borrowed the animation sequence from one of our past projects; the original voice-over tracks were recorded by actual professional VO talent. For these exhibits, however, I replaced the VO with TTS voices reading scripts I made up off the top of my head. The Country script and the Roadhouse script are largely nonsense, so please don’t strain yourself too hard trying to make sense of them (although I did try very hard to keep the lip sync matched to the syllables). The original voice track may have expressed some confidential things (it was a training project), so it was prudent to make up nonsense TTS chatter and replace the original speech. Remember, the intent of this study is to focus on ambient BGM production, and not the TTS actors! So let us begin.

First, I’d like you to listen to the musical elements my automated composer has generated for this test (click here for the roadhouse sample). Notice how realistic the electric slide guitar and backing instruments sound. People, we have come a long way in auto-generated music in just the last two years!

Now, please listen to the country sample. Notice how realistic the steel guitar player sounds. And yes, you non-musicians can do this, nearly effortlessly. And it is equally easy to deal with almost any musical style!

Next, view and listen to the animation sequence for both BGM scenarios. After that I will discuss the usage of the music tool and present some screen shots.

[Video: YouTube, Visual Purple, “Can Automated Music Play Nice in Virtual Worlds? #1”]
The Roadhouse Cafe example – note the effective emulation of the synthetic musicians. The slide guitar is very convincing.

[Video: YouTube, Visual Purple, “Can Automated Music Play Nice in Virtual Worlds? #2”]
The Country Cafe example – note that the steel guitar is very realistic. Also notice how well background sound effects and music work together.

Remember these numbers: 1-4-5 (say, “one four five”). These three numbers represent the three principal chords of any given musical key (in Western culture). The number 1 represents the tonality of the key’s foundation. If the key is the key of ‘C’, then ‘1’ informs musicians to play the ‘C’ chord. The numbers 4 and 5 represent two other complementary chords in the key structure. Almost any combination of the 1-4-5 chords sounds ‘right’. There now, you know all you need to know about music theory to proceed. Three chords are all it takes!
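
For the curious, the 1-4-5 lookup can be sketched in a few lines of Python. This is only an illustration of the idea, not how any particular music tool does it. It assumes a major key, counts in half steps, and spells every note with sharps for simplicity (so the 4 chord in the key of ‘F’ comes out as ‘A#’ rather than the usual ‘Bb’). The names below are invented for the example.

```python
# Note names in half-step order (sharps only, for simplicity).
NOTES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

# Half steps above the key note for the 1, 4 and 5 chords of a major key.
DEGREE_OFFSETS = {1: 0, 4: 5, 5: 7}


def chord_for_degree(key, degree):
    """Return the root note of the 1, 4 or 5 chord in the given major key."""
    root = NOTES.index(key)
    return NOTES[(root + DEGREE_OFFSETS[degree]) % 12]


def one_four_five(key):
    """Return the three principal chord roots of a major key."""
    return [chord_for_degree(key, degree) for degree in (1, 4, 5)]


print(one_four_five("C"))  # ['C', 'F', 'G']
print(one_four_five("G"))  # ['G', 'C', 'D']
```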

While there are a number of software packages to choose from that generate automatic or algorithmic music, this tutorial reflects the user interface of a tool from Canada’s own PG Music. This company, I believe, has recently set the bar rather high. Their impressive technology now allows you to generate music where the output is actual human recordings. And at the price of a song (pun intended). While MIDI is still an available technology in this tool (good for Rave/Techno/HipHop, etc), we have the good fortune of not being locked in to MIDI-only renders. [To be continued]

Emulating Human Voice-overs with TTS Voices

By Rudy Helm, Audio and Quality Assurance Tech, Visual Purple, LLC.

This is a follow-up that I promised in the last paragraph of my last blog entry, ‘Can Robots be Funny?’ This blog entry will not only perform the follow-up but will also segue nicely into my new topic, a topic that shall wait until the conclusion of this blog to introduce. Anyway, I had proposed this scenario: say we have a cinematic cut-scene where there are several actors in the scene (or trailer), but let’s say our budget can maybe only afford one human voice-actor. So, you’ve recorded your one human voice actor doing each role of the entire cast. Next, we use the techniques discussed in my aforementioned article to create an ensemble of TTS actors and sync their synthetic voices to the phrasings and expressions of your human model. Not only that, we will deploy only 1 TTS male voice and 1 TTS female voice to cover our entire casting of 7 characters (that’s 3 adult males, 2 adult females and 2 children)! These tasks we’ll accomplish with a suite of video and audio editing software (markup languages not being practical for trailers, etc).

First, listen to this excerpt where you will hear the human model’s VO.

Now listen to the same animation again. This time however, the audio you hear is our TTS VO.

Recall from my earlier blogs that the tasks included lengthening or shortening the TTS vowels and syllables to match those of the human model. See Figure 1.
Figure 1

Figure 2- Next, view the whole enchilada, as it were, this time with the entire cast of male and female TTS voice-actors doing their stuff, including a crowd sound effect for the background restaurant ambience:
[Video: YouTube, Visual Purple]

Figure 3-Window dressing

Sound effects! Well, if you missed my previous blog (remember the talking fish?), you should visit it to read a discussion on sound effects. The background sound effects help place the scene in a more immersive environment, don’t they? And then there’s…

MUSIC! That’s right! Here’s where I depart for now from my series of blogs on TTS voice-actors (‘bout time isn’t it?). Background music can be an additional sweetening, adding greatly to your scene. “But wait!” you may be saying, “I thought the focus of this exercise was our tight budget!?”

You are so correct. How about copyright free, royalty free music? I’ll explore that a bit more later, but now let’s have a look and listen to this scene, but rendered with (royalty free) background music. This is a restaurant scene, and many restaurants provide music for their clientele, right?

Figure 4– Complete scene rendered with TTS voices and background music
[Video: YouTube, Visual Purple]

Now see how effective the music was in providing a more pleasing restaurant ambience? In fact, without the music, the scene was rather sterile, wasn’t it? Even with the ambient crowd chatter in the background it was stale, but the music made the scene, well, …right!

Conclusion

Whenever you can afford real actors, do it! But when budget screams for relief, maybe try some of the things I have offered in these blogs about synthetic VO.
And algorithmic music? Yes. Algorithmic. Why not opt for copyright free, royalty free music, since we have the opportunity, whenever it’s appropriate to our project, not to mention our budget? In fact, that will be the topic of my next blog, “When Your Musician is a Robot (Can Automated Composers Write Good Music?)”. We’ll demonstrate various musical styles and include movie clips for you to view so that we can hear the background music in context. Stay tuned!

Comedic treatment in TTS voices (Can Robots be Funny?, Part 2)

By Rudy Helm, Audio and Quality Assurance Tech, Visual Purple, LLC.

Another element to this task is to lengthen or shorten the TTS words to match the blobs of the human model. Figure 5 depicts the effort to make the TTS utterance of ‘…was a…’ (pronounced as though a contraction, ‘whuzza’) line up on the timeline with John’s clip. Use your DAW’s stretch tool to accomplish this.

Figure 5a- First, make your split points
Figure 5b- Next, use a stretch tool
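
The stretch amount is just a ratio of durations. Here is a minimal Python sketch, assuming you have read the start and end times of the matching blobs off your timeline. The times below are invented for illustration, the function name is hypothetical, and how your DAW expresses stretch (as a ratio or a percentage) will vary.

```python
def stretch_ratio(human_start, human_end, tts_start, tts_end):
    """Factor by which to lengthen (>1) or shorten (<1) the TTS blob."""
    human_len = human_end - human_start
    tts_len = tts_end - tts_start
    if human_len <= 0 or tts_len <= 0:
        raise ValueError("each blob must end after it starts")
    return human_len / tts_len


# Hypothetical timings (seconds) for the 'whuzza' blob in each clip.
ratio = stretch_ratio(1.00, 1.30, 0.90, 1.14)
print(round(ratio, 2))  # 1.25
```

A ratio above 1 means the TTS blob is too short and needs lengthening; below 1, it needs shortening.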

Let’s continue splitting the TTS clip’s timeline so that we can move each corresponding sound blob to match, and stretch the words right down to the syllable (Figure 6 shows what it looks like when all words have been synced). Listen to the whole joke, both voices lined up properly.

Figure 6

Here’s where some of you are thinking: Well, the blobs are lined up very nicely, but what about nuances regarding stress and pitch? Isn’t the word ‘lawyer’ as expressed by our human friend, John, not being expressed similarly? John’s lawyer blob is larger (i.e., louder) than the TTS blob. Also, isn’t the word ‘seen’ as expressed by John (in this case the stress is caused not by volume but by its pitch being higher, relatively, from the rest of the phrase) not being emulated by the synthetic actor?

Yes, indeed, so let’s try to fix these two issues. We’ll tackle the loudness point first. Figure 7 shows a Volume Envelope (the horizontal blue-ish line running through the center of the TTS clip in the timeline). In most DAWs that have this feature, you can bend the volume envelope to raise or lower the audio level at any point in the clip.

Figure 7 – Creating break points within the line bends the envelope
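
Under the hood, a volume envelope amounts to a gain value interpolated between break points. The following Python sketch shows the idea only. It assumes straight-line interpolation and a plain multiplier for gain (many DAWs display decibels instead), and the break-point times are hypothetical.

```python
def envelope_gain(points, t):
    """Gain at time t, interpolated linearly between (time, gain) break points."""
    points = sorted(points)
    if t <= points[0][0]:
        return points[0][1]
    for (t0, g0), (t1, g1) in zip(points, points[1:]):
        if t <= t1:
            return g0 + (g1 - g0) * (t - t0) / (t1 - t0)
    return points[-1][1]  # past the last break point


def apply_envelope(samples, sample_rate, points):
    """Scale each sample by the envelope gain at that sample's time."""
    return [s * envelope_gain(points, i / sample_rate)
            for i, s in enumerate(samples)]


# Hypothetical break points that lift the word 'lawyer' (1.0 s to 1.5 s).
lawyer_boost = [(0.9, 1.0), (1.0, 1.5), (1.5, 1.5), (1.6, 1.0)]
print(envelope_gain(lawyer_boost, 1.25))  # 1.5
```

Outside the break points the gain stays at 1.0, so the rest of the phrase is untouched.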

Now let’s tackle the pitch issue with that word, ‘seen’. Figure 8 shows the clip properties dialog box specific to the split-off region of our seen-blob. The highlighted value indicates that the word pitch has been raised four half steps.

Figure 8
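
If you are wondering how big a change four half steps is, the arithmetic is short. This sketch assumes equal temperament, where each half step multiplies frequency by the twelfth root of two. The 200 Hz starting pitch is made up for the example.

```python
def semitones_to_ratio(half_steps):
    """Frequency ratio for a pitch shift given in half steps (equal temperament)."""
    return 2.0 ** (half_steps / 12.0)


ratio = semitones_to_ratio(4)
print(round(ratio, 4))          # 1.2599
print(round(200.0 * ratio, 1))  # 252.0 (a hypothetical 200 Hz pitch, raised)
```

In other words, raising ‘seen’ by four half steps lifts its pitch by roughly 26 percent.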

Listen to the resulting TTS clip with the treatments per ‘lawyer’ and ‘seen’.

Window dressing

Earlier I mentioned that this is a voice for a talking fish. This fish is contained within a fish tank in a hotel bar. Listen to our talking fish enveloped in a bubbling sound effect. Figure 9 shows the TTS clip, sans John’s clip, and with the fish tank noise clip added.

Figure 9 – Note that a volume envelope has been applied to the bubbles as well.

So, is that it, then? Maybe – maybe not. If we really did want to add some reality to a talking fish environment, we might consider what we know about how a fish tank affects sound. Occlusion happens. There is a glass barrier between the sound emitter (the talking fish) and the sound receiver (the avatar). So, we could elect to shave off some of the high frequencies from our talking fish. We can accomplish this by choosing the appropriate reverb effect. If you have presets at your disposal, start with a bathroom preset or similar. Try placing the reverb effect before any equalization effects (EQ). We use EQ here to bring out the hi-mid frequencies of the voice to ensure that it is intelligible (you may need to reduce high frequencies as well if you choose a reverb preset that sounds too bright). In this case we are also deploying EQ to remove extreme low-frequency rumble (artifacts that commonly get introduced accidentally when using filters in the digital domain). Figure 10 shows this idea. Have a listen to the result.

Figure 10a – Software ‘bathroom’ reverb
Figure 10b – Software EQ module
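
To illustrate what ‘shaving off high frequencies’ means, here is a minimal Python sketch of a one-pole low-pass filter. To be clear, this is a stand-in for the idea and not what the reverb or EQ plug-ins in Figure 10 actually do internally. The 2000 Hz cutoff and the function name are assumptions chosen only for the example.

```python
import math


def one_pole_lowpass(samples, sample_rate, cutoff_hz):
    """Smooth the signal so that content above cutoff_hz is reduced."""
    # Standard one-pole smoothing coefficient for the given cutoff.
    alpha = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / sample_rate)
    out, y = [], 0.0
    for x in samples:
        y += alpha * (x - y)  # move part of the way toward the new sample
        out.append(y)
    return out


steady = [1.0] * 2000                                       # lowest possible frequency
buzz = [1.0 if i % 2 == 0 else -1.0 for i in range(2000)]   # highest possible frequency

# Level that survives the filter, measured over the last 100 samples.
low_kept = max(abs(v) for v in one_pole_lowpass(steady, 44100, 2000)[-100:])
high_kept = max(abs(v) for v in one_pole_lowpass(buzz, 44100, 2000)[-100:])
print(low_kept > high_kept)  # True
```

The steady signal passes almost untouched while the rapid buzz is strongly reduced, which is roughly what a pane of glass does to a voice.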

Conclusion

Can synthetic voice-actors make funny? Humor is a very subjective aspect of human emotion. What’s funny to Samuel isn’t so funny to Mary, and so forth. So maybe the jury is still out on that one. To improve our NPC’s delivery, we’ve had to rely on 3rd-party software to ensure that techniques were carefully deployed. Markup language deployment probably won’t be sufficient for specific tasks like this, where real-time interaction is not a requirement. That’s my best guess, anyway.

You may wonder what to do if you have a project that requires an ensemble of funny voices. Well, as long as you have at least one funny human available to you, that person can be your model for all voices. Then your cast of synthetic actors can be molded to conform to your model’s comedic timing.

How about this scenario: you have a cinematic cut-scene where there are several actors in the movie (or trailer). But your budget can only afford one human voice-actor. Consider recording your one voice actor doing the roles of the entire cast. Then, using the techniques discussed above, create an ensemble of TTS voices and, in your video editor (NLE), synchronize the synthetic voices to the phrasings and expressions of your one human actor.

In fact, maybe we’ll try to tackle an example of that in my next blog entry. Stay tuned!

Comedic Treatment in TTS Voices (Can Robots be Funny?, Part 1)

By Rudy Helm, Audio and Quality Assurance Tech, Visual Purple, LLC.

At the end of my previous discussion on NPC Voice-over production, I promised that I would follow up with a blog about what it might take to try to get a synthetic voice to be funny. Remember. We’re talking about NPCs (Non Player Characters), where otherwise playable characters are typically represented by professional voice-talent. I will provide you with samples as we roll along of course, as in tutorial fashion, but with the disclaimer that this is just one approach to this end, as there are likely other useful techniques that could be considered.

Ok, I sense you are protesting, how can a robot out ‘funny’ a performance by a professional voice-talent? I am not at all suggesting that a synthetic voice-actor can win such a contest. But if you are faced with options, and if this is the option you choose, you really want to come up with workable solutions.

What are the resources?

There are a number of synthetic voice vendors available. One obvious task is to choose one. A simple Internet search can help you solve that problem. For purposes of this discussion we will utilize 3rd-party software control mechanisms to affect voice properties. In this tutorial we’ll use a stand-alone audio editor along with a non-linear editor (NLE), but the same task can almost certainly be handled by a digital audio workstation (DAW) of your choice. The audio editor might be replaced with XML controls if this is your favorite way to affect voice pitch and tempo, etc. However, I think it would be extremely tedious to try to deploy markup languages as a substitute for a DAW. By the end of this writing I bet you will probably agree with me. Please refer to my earlier post, When Your Voice-actor is a Robot, for some detail on resources. And then there is that last very important asset to have. Someone who is funny!

Here at Visual Purple, we are fortunate to have a gentleman who is a very funny guy. And for this experiment it makes for a very lucky day! So, you may be thinking, why are we talking about working with a funny human? Isn’t this topic about having a funny robot? Well, yes is the answer to that — but our funny human (Let’s call him John) will serve as a model for our robot.

Say what?

The short answer is, we will import audio clips of the funny human into our DAW, and then we will import audio clips of the synthetic voice and make it emulate the human’s speech patterns.

Say what?!!
Ok, in this project our goal is to make some humorous fish voices. You see, we have a scene in one of our products where someone at a bar can stand and stare at a fish tank. As the fish swim by, and if the avatar is situated close enough to the fish tank, the fish might begin to make wisecracks to the, uh, fish admirer. This is an ‘Easter egg’ where fun is poked at the avatar, possibly insinuating that he has had a bit too much to drink. And to achieve our goal, we need to mold the synthetic voice clip to emulate the comedic timing expressed in the human model.

Let’s do it!
So let’s start the process by importing into our DAW an audio clip that John, the funny human, recorded for us (Figure 1).

Figure 1

Next, listen to John’s original model for reference. The script: “Last week it was a lawyer’s convention. I never seen so many sharks!” We follow that by importing a corresponding audio clip from the synthetic voice (Figure 2).

Figure 2

Without doing anything further at this stage, we can easily see that the graphical sound ‘blobs’ don’t match. So, before we move on, have a listen to the robot’s recording. Notice that this clip has already been treated with pitch transposition. (For a discussion on ways to do that, please refer to my earlier post, When Your Voice-actor is a Robot.) Our intent was to get cartoon-y voices, so we started with a female TTS voice and then modified her pitch characteristics.

Now, to make the robot emulate John’s comedic speech patterns, we need to edit the clip’s timelines so that the graphical sound ‘blobs’ do match. Figure 3 illustrates an example:

Figure 3

In Figure 3’s example we see only the first two words of the script (“Last week…”). Listen to how the TTS’s utterance of the word ‘week’ occurs earlier in the timeline than does John’s blob of the same word. Close, but the timing is just not right, is it? Note that we need to create a split point (the vertical line represents this) just before the TTS’s ‘week’ blob. Doing this enables us to separate the words and move them as we wish on the timeline (see Figure 4).

Figure 4
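
How far to drag each split-off word is simple subtraction. Here is a small Python sketch, assuming you have noted where each word begins in both clips. The onset times are hypothetical and the function name is invented for illustration.

```python
def word_shifts(human_onsets, tts_onsets):
    """Seconds to move each TTS word so it starts with the human's word."""
    if len(human_onsets) != len(tts_onsets):
        raise ValueError("both clips need the same number of words")
    return [round(h - t, 3) for h, t in zip(human_onsets, tts_onsets)]


# Hypothetical onsets (seconds) for "Last" and "week" in each clip.
human = [0.10, 0.62]
tts = [0.10, 0.48]
print(word_shifts(human, tts))  # [0.0, 0.14]
```

A positive value means the TTS word arrives too early and should be moved later on the timeline, as with ‘week’ here.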

Now, listen to both voices speak those two words in sync. (…to be continued)

When Your Voice-actor is a Robot (Confronting the NPC Speak Challenge for Virtual Worlds, Part 2)

By Rudy Helm, Audio and Quality Assurance Tech, Visual Purple, LLC.

The SSML language

If you’re new to markup languages, take a look at this example, as it may be useful as a reference to understand SSML syntax.

<?xml version="1.0"?>
<speak version="1.0" xml:lang="en-US">
<voice name="Dave">
Hello, world; my name is Dave.
</voice>
</speak>

This example shows that the voice named “Dave” should pronounce: “Hello, world; my name is Dave.” As in other XML-based markup languages, SSML is composed of elements. The root element is <speak> and it contains the text to be spoken. The <speak> element has two required attributes: xml:lang (the language to be spoken), and version (the version of the specification). There are a few optional attributes as well.
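
If you would rather generate such markup than type it, a short script can do it. The sketch below uses Python’s standard xml.etree.ElementTree module. The build_ssml helper is made up for this example, and it mirrors only the elements and attributes shown above. Your TTS engine may expect more, so treat this as a starting point under those assumptions.

```python
import xml.etree.ElementTree as ET

# The 'xml:' prefix is predefined by XML; ElementTree writes this key as xml:lang.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_ssml(text, voice_name, lang="en-US"):
    """Return a <speak> document with one <voice> element wrapping the text."""
    speak = ET.Element("speak", {"version": "1.0", XML_LANG: lang})
    voice = ET.SubElement(speak, "voice", {"name": voice_name})
    voice.text = text  # special characters are escaped on output
    return ET.tostring(speak, encoding="unicode")


ssml = build_ssml("Hello, world; my name is Dave.", "Dave")
print(ssml)
```

Letting the library write the tags also takes care of escaping characters such as ‘&’ in the script.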

Figure 5 below is a table that shows how the SSML elements are associated to the five points of Text Analysis.

Figure 5

You will use the prosody tag a lot if you intend to create separate voice characters from only one TTS resource. With prosodic control you can manage the tempo and pitch of the voices.
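
As a rough illustration of getting several characters out of one TTS voice, the Python sketch below keeps a small table of per-character prosody settings and wraps each line of dialogue in the matching tag. The helper name, the CHARACTERS table, and the sample line are hypothetical; the pitch values are simply the ones used in the samples in this post.

```python
from xml.sax.saxutils import escape, quoteattr


def prosody(text, pitch=None, rate=None):
    """Wrap text in an SSML <prosody> tag with optional pitch and rate."""
    attrs = ""
    if pitch is not None:
        attrs += f" pitch={quoteattr(pitch)}"
    if rate is not None:
        attrs += f" rate={quoteattr(rate)}"
    return f"<prosody{attrs}>{escape(text)}</prosody>"


# Hypothetical cast: one TTS voice, two characters told apart by pitch only.
CHARACTERS = {
    "Grandson": {"pitch": "+4.2st"},
    "Grandpa": {"pitch": "-3.8st"},
}

for name, settings in CHARACTERS.items():
    print(name, "->", prosody("Nice to meet you.", **settings))
```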

Listen to this XML example of the ‘Grandson’ talk scenario. The markup tag below makes it play at a higher pitch.

< prosody pitch="+4.2st" > I believe Visual Purple’s products have among the best where NPC voice quality is concerned. < /prosody >

As far as TTS engines go, this is a pretty effective example. Here, rather than emphasizing one or several individual harmonics, as occurs with the wood or metal in musical instruments, the vocal tract emphasizes entire bands of harmonics, called formants. Each vowel sound has characteristic bands of higher-intensity harmonics. In short, the character of the original voice clip is largely retained, even when the voice’s pitch has been raised. Beware that not all TTS engines do so well when processed with markup languages.
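
To get a feel for how large a shift such as “+4.2st” really is, you can convert semitones into a frequency ratio. The Python sketch below assumes equal-tempered semitones (twelve to the octave, so twelve semitones doubles the frequency). That is the usual reading of “st”, but how a particular TTS engine applies the value internally is not something I can vouch for.

```python
def semitones_to_ratio(semitones):
    """Frequency ratio for a pitch shift, assuming 12 equal semitones per octave."""
    return 2 ** (semitones / 12)


# The two pitch shifts used for the Grandson and Grandpa samples in this post.
for shift in (4.2, -3.8):
    print(f"{shift:+.1f}st -> frequency multiplied by {semitones_to_ratio(shift):.3f}")
```

Under that assumption, the Grandson’s fundamental pitch is raised by a bit more than a quarter, and the Grandpa’s is lowered by about a fifth.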

Listen to an XML example of the ‘Grandson’ talk scenario and see below the markup tag that makes the above paragraph’s sample play at a faster tempo.

< prosody rate="+5%" > I believe Visual Purple’s products have among the best where NPC voice quality is concerned. < /prosody >

Note the glaring sonic artifacts in this example. It plays way too quickly to sound ‘natural’! In my own research I have noticed that many of the available TTS engines do not give the user fine control when entering tempo parameters into markup tags. The results are usually too fast or too slow. And in some of the engines that do respond to fine control, sonic artifacts such as static or scratchiness are introduced.
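
One way to judge whether an engine is honoring a tempo tag is to work out how long the clip ought to be and compare that with what you actually get back. The Python sketch below assumes that a rate of “+5%” means the speech runs 5% faster than the default, so the clip should get proportionally shorter. Engines may interpret the value differently, and the 10-second clip length is a made-up figure.

```python
def expected_duration(original_seconds, rate_change_percent):
    """Estimate clip length after a relative rate change such as +5 or -10."""
    factor = 1 + rate_change_percent / 100
    if factor <= 0:
        raise ValueError("rate change must be greater than -100%")
    return original_seconds / factor


# Hypothetical 10-second clip run through the two tempo settings used in this post.
for change in (5, -10):
    print(f"{change:+d}% -> about {expected_duration(10.0, change):.2f} seconds")
```

If the rendered clip comes out far shorter or longer than this estimate, the engine is probably snapping your value to a coarser setting.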

Listen to this XML example of the ‘Grandpa’ talk scenario where we use a markup tag to make it play at a lower pitch.

< prosody pitch="-3.8st" > There are a number of synthetic voice vendors available. It seems though, that many of these vendors are reselling the same voice actors, so try to get your license from the source. < /prosody >

It’s interesting to note that this is the same TTS engine that performed so well with the raised pitch, yet it shows some sonic artifacts in this example, where the frequencies are pitched lower. Listen carefully and observe the slight scratchiness. It’s as though you can hear tiny, rhythmic interruptions all through the sound data.

Listen to an XML example of the ‘Grandpa’ talk scenario and see below the markup tag that makes the above Grandpa sample play at a slower tempo.

< prosody rate="-10%" > There are a number of synthetic voice vendors available. It seems though, that many of these vendors are reselling the same voice actors, so try to get your license from the source. < /prosody >

This one mirrors the tempo defect revealed in the Grandson tempo exercise above. It plays way too slowly to be useful (unless you are going for that classic HAL-the-computer voice near the end of the movie 2001, where the robot meets his slow demise).

Conclusion
Synthetic speech can make effective voice-actors when techniques are carefully deployed (especially with regard to adjusting tempo and pitch to improve your NPC’s realism). At this juncture there appears to be no one go-to solution. For good results, we’ve had to utilize a combination of 3rd-party software with XML tags, though I have to admit we seem to resort to 3rd-party software more and more.

If markup language deployment were as robust as we wish it were, we would be able to include an XML parser in our commercially available development tools. We have clients that have expressed an interest in building their own virtual world simulations where all they need do is type in their avatars’ text, and the voice syncing-to-animation just happens automatically for them. The bottleneck, though, appears to be too dramatic a hit on frame rate, where the TTS speech and/or animation quality suffers. There is great demand put on a CPU when it has to display high-quality images and process real-time audio manipulations simultaneously. This is why, in the meantime, we pre-render scenarios so that our content looks and sounds glorious.

Our technologists will figure out a solution to the real-time problem, though. Visual Purple is all about quality – and providing the tools that our customers want!

In a future blog post – Comedic treatment in TTS voices. Can robots be funny? Stay tuned!
