Technology

We believe it’s not about how much technology you have; it’s how you use it! By utilizing our proprietary technologies, we reduce the resources required, resulting in a faster, cheaper, and greener production process.

We’ve all run up against frustrating technology issues. Whether it’s setting a new watch or programming a remote, technology can make you want to pull your hair out at times. The Consumer Reports National Research Center recently asked 13,000 subscribers about their biggest technology gripes, and not surprisingly, almost everyone (95%) had a problem with a computer they’d bought within the last five years. In addition, 80% had problems with their cell phones, and more than 50% had issues with their GPS units, TVs, and cameras. Getting a computer error message you don’t understand can be trying; it takes patience and determination to use technology in today’s technologically advanced society. TVs nowadays need two or even three remotes just to turn them on and off and change the channel or volume. These technology challenges add up to more daily annoyances: no matter how simple the devices may seem to operate, they still present challenges for everyone from the novice to the advanced user.

By Rudy Helm, Audio and Quality Assurance Tech, Visual Purple, LLC.

Continuing with our vlog ‘how-to’ series called ‘Emulating Human Voice-overs with TTS Voices’, we now offer this newer presentation, suffixed as ‘Part Three’. We recommend that you review Part One and Part Two first, but that is not strictly a requirement. For this exercise we snipped out a small piece from one of our past projects. Unlike the premise of Part 2, where you learned to sync a TTS voice to a human voice-actor, this video tutorial focuses on the tactics of humanizing synthetic voice-clips with added detail. Today’s presentation not only reinforces the techniques discussed in Part 1 and Part 2, but also shows how to set the talk-pace to improve the phrasings and expressions of synthetic voices. The concept of formant manipulation is introduced as well. Disclaimer: these are helpful tips, but generalized. Not all editing tools or TTS engines respond to a given technique in the same way. Mainly, just try to grasp the concepts, then adapt your technique to the idiosyncrasies of your chosen tools.


The How-To-Humanize Your TTS Clips exhibit (Exhibit Part 3). A follow-up on the VO elements originally presented in Part 2’s vlog. This time, we introduce formant handling.

As always, check back in for more on this topic and other fun and useful information!

By Rudy Helm, Audio and Quality Assurance Tech, Visual Purple, LLC.

Since there was great interest in a blog entry last fall called ‘Emulating Human Voice-overs with TTS Voices’, I have elected to present those lessons as a vlog, and so it makes sense to give this newer presentation the same title, suffixed with ‘Part Two’. We recommend that you review Part One’s scenario before you proceed (to do so, click here), but it isn’t strictly required. For this exercise we snipped out a small piece from Part One’s cut-scene; there were several actors in the cast, but this clip deals with only one of them. Recall that the premise is that your project’s budget can afford only one human voice-actor, so you’ve recorded your one human voice-actor doing each role of the entire cast. This video tutorial will demonstrate the techniques discussed in Part One. Learn how to sync synthetic voices to the phrasings and expressions of your human model. Disclaimer: these are helpful tips, but generalized. Not all editing tools or TTS engines respond to a given technique in the same way. Mainly, just try to grasp the concepts, then adapt your technique to the idiosyncrasies of your chosen tools.

Sound effects were mentioned in Part One, but that discussion will need to wait for a future vlog. Music was mentioned also, but we cover music in other vlogs, so be sure to look for those as well.

The How-To-Synchronize TTS to a Human Model exhibit (Exhibit Part 2). A vlog on how we developed the VO elements originally presented in Part 1.

As always, check back in for more on this topic and other fun and useful information!

By Rudy Helm, Audio and Quality Assurance Tech, Visual Purple, LLC.

This is to fulfill my promise to describe how we developed the functional, copyright-free, royalty-free, original music showcased in the video we presented in Part 3. Click here for a review of that video, as it is essentially prerequisite viewing to get the most from this article’s tutorial. Yes, this video-blog contains the bona fide instructional ‘how-to’: a fun, informational video show-and-tell regarding the background music-making tool and the processes involved in producing the Part 3 exhibit.

Recall that Exhibit Part 3 embodied a variety of musical styles. In this latest video, Exhibit Part 4, animated avatars will act as both your tour guides and mentors. Also remember that we had promised to discuss the usage of the music tool, with emphasis on reinforcing ‘the 1-4-5 principle’ (we initially introduced that here). But today, with this video, you can see that principle in action.

We think you will enjoy this. Please don’t hesitate to send us feedback!

The How-To-Create BGM exhibit (Exhibit Part 4). Learn how we developed the musical elements originally presented in Exhibit Part 3.

Well, unless you have been living under a rock, you have probably come across the news of Apple’s latest invention, the iPad. We at Visual Purple think it’s way cool, BTW! Back in January when Steve Jobs unveiled the 1.5-pound innovation, many Apple junkies were starstruck. Will the 75 million people who have bought into the iPhone and iPod Touch believe in the iPad as well? What could this potentially mean for web developers? Many Apple followers are already saying that they will not buy the iPad simply because it will not support Flash. But for a starting price of $499, what more could you expect? Well, you could start by expecting to pay more than the publicized low $499 price tag.

Yes, I will admit that back in the day when tablets first hit the market, I was a tablet fanatic. While my awe has dissipated a bit, the talk of the new iPad coming onto the market brought back some ancient memories. I will acknowledge that I am not an Apple junkie; however, I am still intrigued with what the iPad could potentially offer (and not offer). I am disappointed, though, by the news that they are passing on Flash capability. Adobe claims that Flash is installed on 99 percent of Internet-enabled computers and plays over 75 percent of the videos viewed online. Could this be a transition to a future Internet where Flash is no longer supported? What this means to me is that Flash-based 3-D virtual worlds, and the future of browser-based virtual worlds, cannot function on the iPad (unless you’re using the Unity 3D plug-in). Just when so many of us virtual world evangelists thought we were close to mainstream adoption, another hurdle pops up. Could this potentially be the writing on the wall for Flash? Flash-based MMO communities are wildly popular, and all three major operating systems currently support Flash, so I just don’t see how it could fall by the wayside.

The fact of the matter is that Flash is cool and all, but is it really all that practical? I will even be one of the first to admit that we were awed by Flash’s capabilities and recreated our main company website around Flash. But that newfangled technology has lost a lot of its glitz and glamour… hey, look, the page flies! So my thought is that Flash will not disappear completely, but rather may not be seen on the all-alluring Apple iPad (even with potential conversion capabilities in place). Could this be the next game changer? Are we really ready to lead a Flash-free existence? What about playing a YouTube video? Can something weighing only 1.5 pounds really cause such a stir? Could this be the tablet we have waited on for so long, or just another step on the ladder to a worthy tablet device in the near future? Will the PC market be able to hold up to this? Do they have anything under the radar being developed to counter Apple taking center stage?

I recently enjoyed testing the intelligence of a variety of “Chat Bots” online. While some tend to hold meaningless conversations, others actually make some sense!

Chatterbots, otherwise known as chat bots, are defined by Wikipedia as follows: a chatterbot (or chatbot) is a type of conversational agent, a computer program designed to simulate an intelligent conversation with one or more human users via auditory or textual methods.

Websites are now employing chat bots to welcome visitors and answer questions; in this role, chat bots serve as virtual assistants. We have also seen some migration of chat bots into virtual world spaces, such as Second Life. So what’s with this rather new form of intelligent technology? It’s not necessarily new, in fact; the oldest chat bot dates back to the 1960s! Today, they seem to be evolving and intent on becoming a practical solution in a variety of business and pleasure applications. Perhaps ALICE (Artificial Linguistic Internet Computer Entity) is the most famous chat bot of all.

What’s the role of a chat bot in an immersive virtual world? Chat bots have been applied in virtual world applications, with some success, and I believe that in the future we will see them evolve further in virtual world spaces. When you enter Second Life, you may immediately have a chat bot befriend you and carry on a meaningful (and intelligent) conversation, rather than entering an empty area of Second Life with no other form of avatar contact. Chat bot technology is especially useful for companies that have set up shop in Second Life, since they don’t have to worry about staffing the Second Life location 24/7; a virtual chat bot can do the job and answer basic questions about the company. Because communication and interaction in virtual worlds are primarily text-based, this type of environment makes a great test-bed for chat bot avatars. Utilized for entertainment and information services, chat bots are always available (24/7) and intelligent enough to answer questions.

Many are scripted; a few are non-scripted and pull from a large database of text. Chat bots are mostly text-bound and utilize Natural Language Processing (NLP). They range from greeters on web pages (which have been shown to increase sales and conversion ratios) to customer service representatives, tour guides, and non-player characters (NPCs). Do they seem lifelike? To some extent, yes, but they by no means take on the persona of a real person.
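
To make the ‘scripted’ idea concrete, here is a minimal sketch of a rule-based greeter bot in Python. The patterns and canned responses are invented for illustration; real products layer NLP and far larger response databases on top of this basic matching loop.

import re

# A minimal, scripted greeter bot: keyword rules plus a fallback response.
# (The patterns and canned replies below are made up purely for illustration.)
RULES = [
    (r"\b(hi|hello|hey)\b",       "Hello! Welcome to our site. How can I help you today?"),
    (r"\b(hours|open|closed)\b",  "We are open Monday through Friday, 9 am to 5 pm."),
    (r"\b(price|cost|pricing)\b", "You can find current pricing on our Products page."),
]
FALLBACK = "I'm not sure I understand. Could you rephrase that?"

def reply(message: str) -> str:
    for pattern, response in RULES:
        if re.search(pattern, message.lower()):
            return response
    return FALLBACK

print(reply("Hi there, what are your hours?"))   # the greeting rule matches first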

By Rudy Helm, Audio and Quality Assurance Tech, Visual Purple, LLC.

Another element of this task is to lengthen or shorten the TTS words to match the blobs of the human model. Figure 5 depicts the effort to make the TTS utterance of ‘…was a…’ (pronounced as though a contraction, ‘whuzza’) line up on the timeline with John’s clip. Use your DAW’s stretch tool to accomplish this.

Figure 5a
Figure 5a- First, make your split points
Figure 5b
Figure 5b- Next, use a stretch tool
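
If you ever need to perform the same stretch outside a DAW, here is a minimal sketch using the open-source librosa library. The file name and the target duration are assumptions; a DAW’s stretch tool performs conceptually the same operation with its own (usually higher-quality) algorithm.

import librosa
import soundfile as sf

# Stretch the split-off '...was a...' region to a target length measured from
# John's clip. The file name and the 0.42 s target are assumptions for this sketch.
y, sr = librosa.load("tts_was_a.wav", sr=None)
target_dur = 0.42                                   # seconds, taken from the human blob
rate = (len(y) / sr) / target_dur                   # >1 shortens, <1 lengthens
stretched = librosa.effects.time_stretch(y, rate=rate)
sf.write("tts_was_a_stretched.wav", stretched, sr)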

Let’s continue splitting the TTS clip’s timeline so that we can move each corresponding sound blob to match, stretching the words right down to the syllable (Figure 6 shows what it looks like when all the words have been synced). Listen to the whole joke with both voices lined up properly.

Figure 6
Figure 6

Here’s where some of you are thinking: well, the blobs are lined up very nicely, but what about nuances of stress and pitch? Isn’t the word ‘lawyer’, as expressed by our human friend John, missing its emphasis in the TTS version? John’s ‘lawyer’ blob is larger (i.e., louder) than the TTS blob. And isn’t the word ‘seen’, which John stresses not with volume but by raising its pitch relative to the rest of the phrase, also going unemulated by the synthetic actor?

Yes, indeed, so let’s try to fix these two issues. We’ll tackle the loudness point first. Figure 7 shows a volume envelope (the horizontal blue-ish line running through the center of the TTS clip in the timeline). In most DAWs that offer this feature, you can bend the volume envelope to raise or lower the level of the audio beneath it.

Figure 7
Figure 7 – Creating break points within the line bends the envelope
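
For readers without a DAW handy, the same envelope move can be roughed in with a few lines of Python. This is a sketch only; the file name, region boundaries, and gain amount are assumptions, and a real envelope would likely use gentler curves.

import numpy as np
import soundfile as sf

# Boost just the 'lawyer' region of the TTS clip, mimicking the bent volume
# envelope of Figure 7. The file name, region boundaries, and gain are assumptions.
audio, sr = sf.read("tts_joke.wav")
if audio.ndim > 1:
    audio = audio.mean(axis=1)                      # fold to mono for simplicity

start, end = int(1.10 * sr), int(1.45 * sr)         # assumed location of the 'lawyer' blob
boost = 1.6                                         # roughly +4 dB
ramp = int(0.02 * sr)                               # 20 ms ramps, like envelope break points

envelope = np.ones(len(audio))
envelope[start:start + ramp] = np.linspace(1.0, boost, ramp)
envelope[start + ramp:end - ramp] = boost
envelope[end - ramp:end] = np.linspace(boost, 1.0, ramp)

sf.write("tts_joke_louder_lawyer.wav", audio * envelope, sr)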

Now let’s tackle the pitch issue with the word ‘seen’. Figure 8 shows the clip properties dialog box for the split-off region of our ‘seen’ blob. The highlighted value indicates that the word’s pitch has been raised four half steps.

Figure 8
Figure 8
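
The same four-half-step transposition can be sketched in code with librosa’s pitch-shift function; the file name here is an assumption, and your DAW’s pitch algorithm may sound a little different.

import librosa
import soundfile as sf

# Raise the split-off 'seen' region by four half steps, the same move the clip
# properties dialog in Figure 8 performs. The file name is an assumption.
y, sr = librosa.load("tts_seen.wav", sr=None)
shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=4)
sf.write("tts_seen_up4.wav", shifted, sr)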

Listen to the resulting TTS clip with the treatments per ‘lawyer’ and ‘seen’.

Window dressing

Earlier I mentioned that this is a voice for a talking fish. This fish is contained within a fish tank in a hotel bar. Listen to our talking fish enveloped in a bubbling sound effect. Figure 9 shows the TTS clip, sans John’s clip, and with the fish tank noise clip added.

Figure 9
Figure 9 – Note that a volume envelope has been applied to the bubbles as well.

So, is that it, then? Maybe, maybe not. If we really did want to add some realism to the talking-fish environment, we might consider what we know about how a fish tank affects sound. Occlusion happens: there is a glass barrier between the sound emitter (the talking fish) and the sound receiver (the avatar). So we could elect to shave off some of the high frequencies from our talking fish. We can accomplish this by choosing an appropriate reverb effect; if you have presets at your disposal, start with a bathroom preset or similar. Try placing the reverb effect before any equalization (EQ). We use EQ here to bring out the hi-mid frequencies of the voice to ensure that it stays intelligible (you may also need to reduce high frequencies if you choose a reverb preset that sounds too bright). In this case we are also deploying EQ to remove extreme low-frequency rumble (artifacts that commonly get introduced accidentally when using filters in the digital domain). Figure 10 shows this idea. Have a listen to the result.

Figure 10-a
Figure 10a – Software ‘bathroom’ reverb
Figure 10-b
Figure 10b – Software EQ module
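
If you wanted to rough in the EQ half of this treatment in code rather than with plug-ins, a pair of simple filters gets the idea across. This is a sketch under assumed file names and cutoff frequencies, not the exact settings shown in Figure 10, and the reverb itself is best left to a dedicated plug-in.

import soundfile as sf
from scipy.signal import butter, sosfilt

# The EQ side of the fish-tank treatment: shave the highs to suggest the glass
# barrier, and remove low-frequency rumble. File name and cutoffs are assumptions.
audio, sr = sf.read("fish_voice_wet.wav")
if audio.ndim > 1:
    audio = audio.mean(axis=1)

lowpass = butter(4, 4000, btype="lowpass", fs=sr, output="sos")    # occlusion
highpass = butter(2, 80, btype="highpass", fs=sr, output="sos")    # rumble removal

treated = sosfilt(highpass, sosfilt(lowpass, audio))
sf.write("fish_voice_occluded.wav", treated, sr)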

Conclusion

Can synthetic voice-actors make funny? Humor is a very subjective aspect of human emotion; what’s funny to Samuel isn’t so funny to Mary, and so forth. So maybe the jury is still out on that one. To improve our NPC’s delivery, we’ve had to rely on 3rd-party software to ensure that the techniques were carefully deployed. Markup language deployment alone probably won’t be sufficient for specialized tasks like this, where real-time interaction is not a requirement. That’s my best guess, anyway.

You may wonder what to do if you have a project that requires an ensemble of funny voices. Well, as long as you have at least one funny human available to you, that person can be your model for all voices. Then your cast of synthetic actors can be molded to conform to your model’s comedic timing.

How about this scenario: you have a cinematic cut-scene with several actors in the movie (or trailer), but your budget can only afford one human voice-actor. Consider recording your one voice actor doing the roles of the entire cast. Then, using the techniques discussed above, create an ensemble of TTS voices and synchronize them in your video editor (NLE) to the phrasings and expressions of your one human actor.

In fact, maybe we’ll try to tackle an example of that in my next blog entry. Stay tuned!

By Rudy Helm, Audio and Quality Assurance Tech, Visual Purple, LLC.

At the end of my previous discussion on NPC voice-over production, I promised that I would follow up with a blog about what it might take to get a synthetic voice to be funny. Remember, we’re talking about NPCs (Non-Player Characters); playable characters, by contrast, are typically represented by professional voice-talent. I will provide you with samples as we roll along, of course, in tutorial fashion, but with the disclaimer that this is just one approach to this end; there are likely other useful techniques that could be considered.

OK, I sense you are protesting: how can a robot out-‘funny’ a performance by a professional voice-talent? I am not at all suggesting that a synthetic voice-actor can win such a contest. But if you are faced with options, and this is the option you choose, you really want to come up with workable solutions.

What are the resources?

There are a number of synthetic voice vendors available, and one obvious task is to choose one; a simple Internet search can help you solve that problem. For purposes of this discussion we will utilize 3rd-party software control mechanisms to affect voice properties. In this tutorial we’ll use a stand-alone audio editor along with a non-linear editor (NLE), but the same tasks can almost certainly be handled by a digital audio workstation (DAW) of your choice. The audio editor might be replaced with XML controls if that is your favorite way to adjust voice pitch, tempo, and so on; however, I think it would be extremely tedious to try to deploy markup languages as a substitute for a DAW, and by the end of this write-up I bet you will agree with me. Please refer to my earlier post, When Your Voice-actor is a Robot, for some detail on resources. And then there is that last, very important asset to have: someone who is funny!

Here at Visual Purple, we are fortunate to have a gentleman who is a very funny guy, and for this experiment that makes for a very lucky day! So, you may be thinking, why are we talking about working with a funny human? Isn’t this topic about having a funny robot? Well, yes is the answer to that, but our funny human (let’s call him John) will serve as the model for our robot.

Say what?

The short answer is that we will import audio clips of the funny human into our DAW, then import audio clips of the synthetic voice and make it emulate the human’s speech patterns.

Say what?!!
Ok, in this project our goal is to make some humorous fish voices. You see, we have a scene in one of our products where someone at a bar can stand and stare at a fish tank. As the fish swim by, and if the avatar is situated close enough to the fish tank, the fish might begin to say wisecracks to the, uh, fish admirer. This is an ‘Easter egg’ where fun is poked at the avatar, possibly insinuating that he has had a bit too much to drink. To achieve our goal, we need to mold the synthetic voice clip to emulate the comedic timing expressed by the human model.

Let’s do it!
So let’s start the process by importing into our DAW an audio clip that John, the funny human, recorded for us (Figure 1).

Figure 1
Figure 1

Next, listen to John’s original model for reference. The script: “Last week it was a lawyer’s convention. I never seen so many sharks!” We follow that by importing a correlating audio clip from the synthetic voice (Figure 2).

Figure 2
Figure 2

Without doing anything further at this stage, we can easily see that the graphical sound ‘blobs’ don’t match. So, before we move on, have a listen to the robot’s recording. Notice that this clip has already been treated with pitch transposition. (For a discussion on ways to do that, please refer to my earlier post, When Your Voice-actor is a Robot.) Our intent was to get cartoon-y voices, so we started with a female TTS voice and then modified her pitch characteristics.

Now, to make the robot emulate John’s comedic speech patterns, we need to edit the clip’s timelines so that the graphical sound ‘blobs’ do match. Figure 3 illustrates an example:

Figure 3
Figure 3

In Figure 3’s example we see only the first two words of the script (“Last week…”). Listen to how the TTS utterance of the word ‘week’ occurs earlier in the timeline than John’s blob of the same word. Close, but the timing is just not right, is it? Note that we need to create a split point (represented by the vertical line) just before the TTS blob. Doing this enables us to separate the words and move them as we wish on the timeline (see Figure 4).

Figure 4
Figure 4
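
For the curious, the same split-and-drag can be expressed as a few lines of Python array surgery; the file name, split point, and delay amount are assumptions made up for this sketch.

import numpy as np
import soundfile as sf

# The split-and-drag of Figure 4: cut just before the 'week' blob and slide
# everything after it later, filling the gap with silence.
audio, sr = sf.read("tts_raw.wav")
if audio.ndim > 1:
    audio = audio.mean(axis=1)

split = int(0.58 * sr)                    # just before the TTS 'week' blob (assumed)
delay = int(0.12 * sr)                    # how much later John says 'week' (assumed)

shifted = np.concatenate([audio[:split], np.zeros(delay), audio[split:]])
sf.write("tts_week_moved.wav", shifted, sr)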

Now, listen to both voices speak those two words in sync. (…to be continued)

By Rudy Helm, Audio and Quality Assurance Tech, Visual Purple, LLC.

The SSML language

If you’re new to markup languages, take a look at this example, as it may be useful as a reference to understand SSML syntax.

<?xml version="1.0"?>
<speak version="1.0" xml:lang="en-US">
  <voice name="Dave">
    Hello, world; my name is Dave.
  </voice>
</speak>

This example shows that the voice named “Dave” should pronounce: “Hello, world; my name is Dave.” As in other XML-based markup languages, SSML is composed of elements. The root element is <speak>, and it contains the text to be spoken. The <speak> element has two required attributes: xml:lang (the language to be spoken) and version (the version of the specification). There are a few optional attributes as well.

Figure 5 below is a table that shows how the SSML elements are associated with the five points of Text Analysis.

Figure-5
Figure 5

You will use the prosody tag a lot if you intend to create separate voice characters from only one TTS resource. With prosodic control you can manage the tempo and pitch of the voices.

Listen to this XML example of the ‘Grandson’ talk scenario, and see below the markup tag used to make it play at a higher pitch.

<prosody pitch="+4.2st">I believe Visual Purple’s products are among the best where NPC voice quality is concerned.</prosody>

As far as TTS engines go, this is a pretty effective example. Here, rather than emphasizing one or several individual harmonics, as occurs with the wood or metal in musical instruments, the vocal tract emphasizes entire bands of harmonics, called formants. Each vowel sound has characteristic bands of higher-intensity harmonics. In short, the character of the original voice clip is largely retained, even when the voice’s pitch has been raised. Beware that not all TTS engines fare so well when processed with markup languages.

Listen to an XML example of the ‘Grandson’ talk scenario and see below the markup tag that makes the above paragraph’s sample play at a faster tempo.

<prosody rate="+5%">I believe Visual Purple’s products are among the best where NPC voice quality is concerned.</prosody>

Note the glaring sonic artifacts in this example; it plays way too quickly to sound ‘natural’! In my own research I have noticed that many of the available TTS engines do not give the user fine control when entering tempo parameters into markup tags; the results are usually too fast or too slow. And in some of the engines that do respond to fine control, sonic artifacts such as static or scratchiness are introduced.

Listen to this XML example of the ‘Grandpa’ talk scenario where we use a markup tag to make it play at a lower pitch.

<prosody pitch="-3.8st">There are a number of synthetic voice vendors available. It seems, though, that many of these vendors are reselling the same voice actors, so try to get your license from the source.</prosody>

It’s interesting to note that this is the same TTS engine that performed so well with the raised pitch, yet it shows some sonic artifacts in this example, where the frequencies are pitched lower. Listen carefully and notice the slight scratchiness; it’s as though you can hear tiny, rhythmic interruptions all through the sound data.

Listen to an XML example of the ‘Grandpa’ talk scenario and see below the markup tag that makes the above Grandpa sample play at a slower tempo.

<prosody rate="-10%">There are a number of synthetic voice vendors available. It seems, though, that many of these vendors are reselling the same voice actors, so try to get your license from the source.</prosody>

This one mirrors the tempo defect revealed in the Grandson tempo exercise above. It plays way too slowly to be useful (unless you are going for that classic HAL-the-computer voice near the end of the movie 2001, where the robot meets his slow demise).

Conclusion
Synthetic speech can make for effective voice-actors when techniques are carefully deployed (especially with regard to adjusting tempo and pitch to improve your NPC’s realism). At this juncture there appears to be no single go-to solution. For good results, we’ve had to utilize a combination of 3rd-party software and XML tags, though I have to admit we seem to resort to 3rd-party software more and more.

If markup language deployment were as robust as we wish it were, we would be able to include an XML parser in our commercially available development tools. We have clients who have expressed an interest in being able to build their own virtual world simulations, where all they need to do is type in their avatars’ text and the voice syncing-to-animation just happens automatically. The bottleneck, though, appears to be too dramatic a hit on frame rate, where the TTS speech and/or animation quality suffers. Great demand is put on a CPU when it has to display high-quality images and process real-time audio manipulations simultaneously. This is why, in the meantime, we pre-render scenarios so that our content looks and sounds glorious.

Our technologists will figure out a solution to the real-time problem, though. Visual Purple is all about quality, and about providing the tools that our customers want!

In a future blog post: comedic treatment in TTS voices. Can robots be funny? Stay tuned!

By Rudy Helm, Audio and Quality Assurance Tech, Visual Purple, LLC.

I’d like to discuss NPC voice-over production, and I will even provide you with downloadable samples as we roll along. In our virtual worlds, Visual Purple sometimes deploys intelligent NPCs (Non-Player Characters), whereas our playable characters are typically represented by professional voice-talent. Much of the challenge lies in making synthetic voice recordings not sound too synthetic, like what you hear when you’re on the phone, trapped inside one of those automated voice applications. To confront this, some of the tactics we deploy involve adjusting tempo and pitch, either as a global trait of an NPC’s dialog or just on specific words or phrases; sometimes a clever combination of both treatments comes into play.

One reason to adjust tempo and pitch is so that you can get extra mileage from a single synthetic voice-actor. A quick for-instance: say you desire three male voice-actors for your project. One is a teenager, another plays the teenager’s father, and the third plays the grandfather. By adjusting the pitch of a single synthetic actor you can achieve this: re-pitch the teenager’s voice a bit higher than ‘dad’s’ (you might just leave dad’s timbre as is), and re-pitch grandpa’s voice a bit lower. Now, in reality, each of us likely speaks at a different pace (tempo) than the individual in the next cubicle, and I submit that we can emulate the same phenomenon in our synthetic actors. We could elect to make the teenager speak with a slightly quicker tempo than dad does (again, we could just leave dad’s pace as is), and slow down grandpa’s tempo somewhat. I’m sure you’re getting the idea. For female timbres, simply consider similar treatments.
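
To make that idea concrete before we get to specific tools, here is a minimal sketch in Python using the open-source librosa library. The file name and the exact semitone and tempo offsets are assumptions you would tune by ear; the on-board and 3rd-party tools discussed below accomplish the same thing interactively.

import librosa
import soundfile as sf

# One TTS render, three characters. The file name and the exact semitone/tempo
# offsets are assumptions; tune them by ear against your own TTS engine.
y, sr = librosa.load("tts_line.wav", sr=None)

characters = {
    "teen":    {"semitones": +2, "rate": 1.05},   # a bit higher, a bit quicker
    "dad":     {"semitones":  0, "rate": 1.00},   # leave as is
    "grandpa": {"semitones": -3, "rate": 0.90},   # a bit lower, a bit slower
}

for name, p in characters.items():
    voice = librosa.effects.pitch_shift(y, sr=sr, n_steps=p["semitones"])
    voice = librosa.effects.time_stretch(voice, rate=p["rate"])
    sf.write("tts_line_" + name + ".wav", voice, sr)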

What and where are the resources?

There are a number of synthetic voice vendors available. It seems, though, that many of these vendors are reselling the same voice actors, so try to get your license from the source. This is a global market, so do not assume that your own language is available only from vendors in the same country as your native tongue. I believe Visual Purple’s products are among the best where NPC voice quality is concerned.

There is a goodly supply of audio tool vendors available as well. Most of the synthetic voice vendors have on-board processing tools; these are there to help you arrive at solutions such as the teenager/dad/grandpa scenario depicted above. One common way to utilize their on-board (software-based) tools is by developing some markup language skills. XML, anyone?

On-board audio-treatment (markup languages)

Control mechanisms for affecting voice properties include SSML, SALT, SAPI 4, SAPI 5, and TTS vendors’ proprietary inventions. Here are a few links if you’d like to study these XML-based technologies: http://www.phon.ucl.ac.uk/home/mark/salt/ssml.html, http://www.w3.org/TR/speech-synthesis/, and http://en.wikipedia.org/wiki/Speech_Application_Programming_Interface.

Out-board audio treatment (3rd-party software)

3rd-party software solutions for affecting voice properties include digital audio workstations (DAWs) and non-linear editors (NLEs) such as Pro Tools, Sonar, Nuendo (http://www.steinberg.net/en/products/audiopostproduction_product.html), Vegas Pro, Melodyne, and yes, even Audacity, among others.

Revisiting the grandpa, dad, and grandson scenario I mentioned earlier, I now want to show you some screen shots and audio examples from results I got when using a 3rd-party tool.
grandpa-1

Now listen to grandpa_pitched_low-xml.

Now, to listen to Dad’s pitch texture (pitched normally) click here.
And to listen to Grandson’s pitch texture (raised somewhat) click here.
son-1

To further differentiate these actors’ speaking styles, we can also adjust their tempo (speed), and we should do this without changing the pitch again. We could give Grandpa a slightly slower tempo, give the Son a quicker tempo, and just leave Dad’s speech pace as is. To listen to Grandpa’s tempo (slowed somewhat), click here.

grandpa-21

To listen to Dad’s tempo (kept as is) click here. And to listen to the Son’s tempo (sped up a little) click here.

son-2