Virtual Speak

Advanced Simulation Technologies & Embedded Training Systems

A recent Advertising Age post entitled “What Social Media Will Look Like in 2012” captured my attention and got me thinking (this could be dangerous, I know). The article speculated that in 2009 most major marketers adopted social media as something that is here to stay and not just a fad. While I agree with this to some extent, the article went on to outline eleven points to watch for in the years leading up to 2012.

Privacy expectations will have to change: Sharing more personal information with people you don’t even know… hmm. Although some may be all for this, I don’t think complete strangers need to know my favorite color or my preferred pizza topping.
Complete decentralization of social networks: Everything is becoming more portable. From the iPad to the iPhone, many people check friends’ status updates and news while on the run.
Our interactions with search engines will be different: I agree that real-time information will be king.
Rise of content aggregators: Who wants to log in to five different social networking services to check on statuses and news? The aggregator can serve as man’s best friend.
Social media augmented reality: It is kind of scary to know what your friend is doing before you even pick up the phone to call. To me this takes away a certain level of personal contact, just as email has reduced the number of snail mail letters that are sent.
Influencer marketing will be redefined: Social influence will have more pull than the old-school search features that Google and other major search engines currently offer.
Ratings everywhere: User ratings could help or hinder, depending on the specific instance. Many purchasers nowadays look at how users rate a product before making the purchase decision.
Social media agents: Oh great! More things going automated and taking the person out of the customer service experience. Please press 0 not to be connected to a live person.
Riding the Google (wave): It could be the next big thing, but only time will tell.
Thinking beyond “nowness”: Could the answer really be the semantic web?
Social media everything and return of digital media: I’m not too sure that social media will come to be referred to as digital media.

Many are hesitant to jump in and take a risk on new technology deployments. The conundrum here is really ROI: will my investment pay off? New technology deployments are always risky. But just as with anything else, without taking the risk one will never know what could have been. Although early adopters tend to jump on the bandwagon, many remain skeptical and want to see some proven success before trying out a new program. Without a tried and proven solution to demonstrate what your specific results may be, it may be a hard sell (especially to those higher on the ladder). However, by taking the risk one may gain the technological advantage in the marketplace.

New technology always takes a while to mature. Once it does, the ROI studies that follow will be tried and true, but by then the technology is no longer new and it may be too late: you are behind in implementing a solution and your competition could have the upper hand. So are you a risk taker, or do you avoid risk at all costs? It’s up to you, but just remember that by waiting you may lose the competitive advantage.

Common Technology Gripes

May 17th, 2010

We’ve all run up against frustrating technology issues. Whether it is setting a new watch or programming a remote, technology can make you want to pull your hair out at times. Consumer Reports National Research Center recently asked 13,000 subscribers for their biggest technology gripes, and not surprisingly almost everyone (95%) had a problem with a computer they’d bought within the last five years. In addition, 80% had problems with their cell phones, and more than 50% had issues with their GPS units, TVs, and cameras. Getting a computer error message that you don’t understand can be trying, and it takes patience and determination to use technology in today’s technologically advanced society. TVs nowadays need two or even three remotes just to turn them on and off and change the channel or volume. These challenges add up to more daily annoyances: no matter how simple a device may seem to operate, it still presents challenges for everyone from the novice to the advanced user.

By Rudy Helm, Audio and Quality Assurance Tech, Visual Purple, LLC.

Continuing with our vlog ‘how-to’ series called ‘Emulating Human Voice-overs with TTS Voices’, we now offer this newer presentation, suffixed ‘Part Three’. We recommend that you review Part One and Part Two first, but that is not a requirement. For this exercise we snipped out a small piece from one of our past projects. Unlike the premise of Part 2, where you learned to sync a TTS voice to a human voice-actor, this video tutorial will focus on the tactics of humanizing synthetic voice-clips with added detail. Today’s presentation not only reinforces the techniques discussed in Part 1 and Part 2, but also shows how to set the talk-pace to improve the phrasings and expressions of synthetic voices. The concept of formant manipulation is introduced as well. Disclaimer: these are helpful tips, but generalized. Not all editing tools or TTS engines respond in the very same way to the specific techniques that you might try. Mainly, just try to grasp the concepts, then adapt your technique to the idiosyncrasies of your chosen tools.

The How-To-Humanize your TTS Clips exhibit (Exhibit Part 3). A follow-up on VO elements originally presented in Part 2’s vlog. This time, we introduce Formant handling.

As always check back in for more on this topic and other fun and useful information!

By Rudy Helm, Audio and Quality Assurance Tech, Visual Purple, LLC.

Since there was great interest in a blog entry last Fall called ‘Emulating Human Voice-overs with TTS Voices’, I have elected to present those lessons as a vlog, and so it makes sense to give this newer presentation the same title, suffixed with ‘Part Two’. We recommend that you review Part One’s scenario before you proceed (to do so, click here), but doing so is not a requirement. For this exercise we snipped out a small piece from Part One’s cut-scene. There were several actors in that cast, but this clip deals with only one. Recall that the premise is that your project’s budget can afford only one human voice-actor. So, you’ve recorded your one human voice actor doing each role of the entire cast. This video tutorial will show the techniques discussed in Part One. Learn how to sync synthetic voices to the phrasings and expressions of your human model. Disclaimer: these are helpful tips, but generalized. Not all editing tools or TTS engines respond in the very same way to the specific techniques that you might try. Mainly, just try to grasp the concepts, then adapt your technique to the idiosyncrasies of your chosen tools.

Sound effects were mentioned in Part One, but that discussion will need to wait for a future vlog. Music was mentioned also, but we cover music in other vlogs, so be sure to look for those as well.

(Ex. Part 2) The How-To-Synchronize TTS to Human Model exhibit. A vlog on how we developed the VO elements originally presented in Part 1.

As always check back in for more on this topic and other fun and useful information!

By Rudy Helm, Audio and Quality Assurance Tech, Visual Purple, LLC.

This is to fulfill my promise to describe how we developed the functional, copyright-free, royalty-free, original music that we showcased in the video we presented in Part 3. Click here for a review of that video, as it is essentially prerequisite viewing to get the most from this article’s tutorial. Yes, this video-blog contains the bona fide instructional ‘how-to’: a fun, informational video show-and-tell about the background music-making tool and processes involved in producing the Part 3 exhibit.

Recall that Exhibit Part 3 embodied a variety of musical styles. In this latest video, Exhibit Part 4, animated avatars will act as both your tour guides and mentors. Also remember that we had promised to discuss the usage of the music tool, with emphasis on reinforcing ‘the 1-4-5 principle’ (we initially introduced that here). But today… well, with this video you can see that principle in action.

We think you will enjoy this. Please don’t hesitate to send us feedback!

The How-To-Create BGM exhibit (Exhibit Part 4). Learn how we developed the musical elements originally presented in Exhibit Part 3.

Is it all about FLASH?

March 1st, 2010

Well, unless you have been living under a rock you have probably come across the news of Apple’s latest invention, the iPad. We at Visual Purple think it’s way cool, BTW!!! Back in January when Steve Jobs unveiled the 1.5 pound innovation, many Apple junkies were star struck. Will the 75 million people that have bought into the iPhone and iPod Touch believe in the iPad as well? What could this potentially mean for web developers? Many Apple followers are already saying that they will not buy the iPad simply because it will not support Flash. But for a starting price of $499, what more could you expect? Well you could start by expecting to pay more than the publicized low $499 price tag.

Yes, I will admit that back in the day, when tablets first hit the market, I was a tablet fanatic. While my awe of them has dissipated a bit, the talk of the new iPad coming onto the market brought back some ancient memories. I will acknowledge that I am not an Apple junkie; however, I am still intrigued by what the iPad could potentially offer (and not offer), and I am disappointed by the news that they are passing on Flash capability. Adobe claims that Flash is installed on 99 percent of Internet-enabled computers and plays over 75 percent of videos that are viewed online. Could this be a transition to a future Internet in which Flash is no longer supported? What this means to me is that Flash-based 3-D virtual worlds, and the future of browser-based virtual worlds, cannot function on the iPad (unless you’re using the Unity 3D plug-in). Just when so many of us virtual world evangelists thought we were close to mainstream adoption, another hurdle pops up. Could this be the writing on the wall for Flash? Flash-based MMO communities are wildly popular, and all three major operating systems currently support Flash, so I just don’t see how Flash could fall by the wayside.

The fact of the matter is that Flash is cool and all, but is it really all that practical? I will even be one of the first to admit that we were awed by Flash’s capabilities and rebuilt our main company website around Flash. But that newfangled technology has lost a lot of its glitz and glamour… hey, look, the page flies! So my thought is that Flash will not disappear completely, but rather may simply not be seen on the all-alluring Apple iPad (even with potential conversion capabilities in place). Could this be the next game changer? Are we really ready to lead a Flash-free existence? What about playing a YouTube video? Can something weighing only 1.5 pounds really cause such a stir? Could this be the tablet that we have waited on for so long, or just another step on the ladder to getting a worthy tablet device in the near future? Will the PC market be able to hold up to this? Do PC makers have anything under the radar in development to counter Apple taking center stage?


I recently enjoyed testing the intelligence of a variety of “Chat Bots” online. While some tend to hold meaningless conversations, others actually make some sense!

Chatterbots, otherwise known as chat bots, are defined by Wikipedia as follows: “A chatterbot (or chatbot) is a type of conversational agent, a computer program designed to simulate an intelligent conversation with one or more human users via auditory or textual methods.”

Websites are now employing chat bots to welcome visitors and answer questions, so that the chat bot serves as a virtual assistant. We have also seen some migration of chat bots into virtual world spaces, such as Second Life. So what’s with this rather new form of intelligent technology? Actually, it’s not necessarily new; the oldest chat bot dates back to the 1960s! Today, they seem to be evolving into a practical solution for a variety of business and pleasure applications. ALICE (Artificial Linguistic Internet Computer Entity) is perhaps the most famous chat bot of all.

What’s the role of a chat bot in an immersive virtual world? Chat bots have already been applied in virtual world applications, with some success. I believe that in the future we will see chat bots evolve further in virtual world spaces. When you enter Second Life, a chat bot may immediately befriend you and carry on a meaningful (and intelligent) conversation, rather than leaving you to enter an empty area of Second Life with no other avatar contact. Chat bot technology is especially useful for companies that have set up shop in Second Life: they don’t have to worry about staffing the Second Life location 24/7, because a virtual chat bot can do the job and answer the basic questions about a company. Because communication and interaction in virtual worlds are primarily text based, this type of environment makes a great test-bed for chat bot avatars. Utilized for entertainment and information services, chat bots are always available (24/7) and intelligent enough to answer questions.

Many are scripted; however, a few are non-scripted and pull from a large database of text. Chat bots are mostly text bound and utilize Natural Language Processing (NLP). Their roles range from greeters on web pages (which have been proven to increase sales/conversion ratios) to customer service representatives, tour guides, and non-player characters (NPCs). Do they seem lifelike? To some extent yes, but they by no means take on the persona of a real person.
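To make the ‘scripted’ idea a bit more concrete, here is a minimal sketch of a greeter-style bot in Python. It is only keyword matching, not real NLP, and the rules, the replies, and the names RULES, FALLBACK and reply are all made up for illustration; no particular chat bot product works exactly this way.

```python
import re

# Hypothetical script: each rule pairs a keyword pattern with a canned reply.
RULES = [
    (re.compile(r"\b(hi|hello|hey)\b", re.I),
     "Hello! Welcome. How can I help you today?"),
    (re.compile(r"\b(hours|open)\b", re.I),
     "This location is staffed by me around the clock."),
    (re.compile(r"\b(products?|services?)\b", re.I),
     "I can point you to an overview of what the company offers."),
]

# Reply used when no rule in the script matches.
FALLBACK = "I'm not sure I follow. Could you rephrase that?"


def reply(message):
    """Return the canned reply of the first rule whose pattern matches."""
    for pattern, response in RULES:
        if pattern.search(message):
            return response
    return FALLBACK


if __name__ == "__main__":
    for line in ["Hello there", "What are your hours?", "Tell me a joke"]:
        print(line, "->", reply(line))
```

A scripted bot like this is only as good as its script: anything outside the rules falls through to the fallback line, which is one reason such bots stop seeming lifelike fairly quickly.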

By Rudy Helm, Audio and Quality Assurance Tech, Visual Purple, LLC.

Another element to this task is to lengthen or shorten the TTS words to match the blobs of the human model. Figure 5 depicts the effort to make the TTS utterance of ‘…was a…’ (pronounced as though a contraction, ‘whuzza’) line up on the timeline with John’s clip. Use your DAW’s stretch tool to accomplish this.

Figure 5a – First, make your split points
Figure 5b – Next, use a stretch tool
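If your stretch tool asks for a number rather than letting you drag the clip edge, the value it needs is just the ratio of the two region lengths. Here is a small Python sketch of that arithmetic. The function name stretch_ratio and the timings are hypothetical; read the real start and end times off your own timeline.

```python
def stretch_ratio(model_start, model_end, tts_start, tts_end):
    """Factor by which the TTS region must be stretched (greater than 1)
    or squeezed (less than 1) to last as long as the human model's region.
    All times are in seconds."""
    model_len = model_end - model_start
    tts_len = tts_end - tts_start
    if model_len <= 0 or tts_len <= 0:
        raise ValueError("regions must have positive length")
    return model_len / tts_len


# Hypothetical timings for the '...was a...' region of each clip.
ratio = stretch_ratio(1.20, 1.50, 1.10, 1.35)
print(f"Stretch the TTS region to {ratio:.0%} of its original length")
```

Some tools express the same thing as a tempo or speed change instead, in which case the value to enter is the inverse of this ratio.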

Let’s continue splitting the TTS clip’s timeline so that we can move each corresponding sound blob to match, and stretch the words right down to the syllable (Figure 6 shows what it looks like when all words have been synced). Listen to the whole joke, both voices lined up properly.

Figure 6

Here’s where some of you are thinking: Well, the blobs are lined up very nicely, but what about nuances of stress and pitch? Isn’t the word ‘lawyer’ being expressed differently by our human friend, John? John’s lawyer blob is larger (i.e., louder) than the TTS blob. And isn’t John’s stress on the word ‘seen’ (caused in this case not by volume but by a pitch that is higher relative to the rest of the phrase) not being emulated by the synthetic actor?

Yes, indeed, so let’s try to fix these two issues. We’ll tackle the loudness point first. Figure 7 shows a Volume Envelope (the horizontal blue-ish line running through the center of the TTS clip in the timeline). In most DAWs that have this feature, you can bend the volume envelope to raise or lower the volume of the audio.

Figure 7 – Creating break points within the line bends the envelope
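For the curious, a volume envelope with break points boils down to a gain value that is interpolated in a straight line between the points and multiplied into the audio. The Python sketch below shows the idea on a plain list of samples. The function names and the break point values are my own illustration, it assumes the break points are sorted by time, and a real DAW may well use curves or decibel scaling instead of this simple linear version.

```python
def envelope_gain(points, t):
    """Linearly interpolate the gain at time t (seconds) from a list of
    (time, gain) break points. Assumes the points are sorted by time."""
    if t <= points[0][0]:
        return points[0][1]
    for (t0, g0), (t1, g1) in zip(points, points[1:]):
        if t0 <= t <= t1:
            return g0 + (g1 - g0) * (t - t0) / (t1 - t0)
    # Past the last break point: hold the final gain.
    return points[-1][1]


def apply_envelope(samples, sample_rate, points):
    """Scale every sample by the envelope gain at that sample's time."""
    return [s * envelope_gain(points, i / sample_rate)
            for i, s in enumerate(samples)]


# Hypothetical envelope: unity gain, a bump up over the word 'lawyer',
# then back to unity gain.
lawyer_boost = [(0.0, 1.0), (0.5, 1.0), (0.6, 1.5), (0.9, 1.5), (1.0, 1.0)]
print(envelope_gain(lawyer_boost, 0.55))  # halfway up the ramp
```

Each (time, gain) pair plays the role of one break point on the blue-ish line; dragging a point in the DAW is the same as editing one of those pairs.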

Now let’s tackle the pitch issue with that word, ‘seen’. Figure 8 shows the clip properties dialog box specific to the split-off region of our seen-blob. The highlighted value indicates that the word pitch has been raised four half steps.

Figure 8
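In case you wonder what ‘four half steps’ means in terms of frequency: assuming the usual equal-tempered scale, where twelve half steps make an octave (a doubling of frequency), each half step multiplies the frequency by the twelfth root of two. A quick Python check, with a function name and a 200 Hz starting pitch that are my own and purely illustrative:

```python
def semitones_to_ratio(half_steps):
    """Frequency ratio for a pitch shift of the given number of half steps,
    assuming twelve equal half steps per octave."""
    return 2 ** (half_steps / 12)


ratio = semitones_to_ratio(4)
print(round(ratio, 4))  # 1.2599

# Hypothetical example: a syllable centered at 200 Hz moves to about 252 Hz.
print(round(200 * ratio))  # 252
```

So raising ‘seen’ by four half steps lifts its pitch by roughly 26 percent, which is plenty to make the word stand out from the rest of the phrase.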

Listen to the resulting TTS clip with the treatments per ‘lawyer’ and ‘seen’.

Window dressing

Earlier I mentioned that this is a voice for a talking fish. This fish is contained within a fish tank in a hotel bar. Listen to our talking fish enveloped in a bubbling sound effect. Figure 9 shows the TTS clip, sans John’s clip, and with the fish tank noise clip added.

Figure 9 – Note that a volume envelope has been applied to the bubbles as well.

So, is that it, then? Maybe – maybe not. If we really did want to add some reality to a talking fish environment, we might consider what we know about how a fish tank affects sound. Occlusion happens. There is a glass barrier between the sound emitter (the talking fish) and the sound receiver (the avatar). So, we could elect to shave off some of the high frequencies from our talking fish. We can accomplish this by choosing the appropriate reverb effect. If you have presets at your disposal, start with a bathroom preset or similar. Try placing the reverb effect before any equalization effects (EQ). We use EQ here to bring out the hi-mid frequencies of the voice to ensure that it is intelligible (you may need to reduce high frequencies as well if you choose a reverb preset that sounds too bright). In this case we are also deploying EQ to remove extreme low frequency rumble (artifacts that commonly get accidentally introduced when using filters in the digital domain). Figure 10 shows this idea. Have a listen to the result.

Figure 10a – Software ‘bathroom’ reverb
Figure 10b – Software EQ module
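If you don’t have a suitable reverb preset handy, the ‘shave off some of the high frequencies’ part of occlusion can be roughed in with a plain low-pass filter. To be clear, this is a stand-in and not the reverb-plus-EQ chain shown in the figures. The Python sketch below is a simple first-order (one-pole) low-pass working on a list of samples; the function name and the 2000 Hz cutoff are just assumptions to experiment with.

```python
import math


def one_pole_lowpass(samples, sample_rate, cutoff_hz):
    """First-order low-pass filter: passes lows, gently rolls off highs."""
    dt = 1.0 / sample_rate
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    alpha = dt / (rc + dt)  # smoothing coefficient, between 0 and 1
    out = []
    y = 0.0
    for x in samples:
        # Move the output a fraction of the way toward each new input sample.
        y += alpha * (x - y)
        out.append(y)
    return out


# Hypothetical usage: muffle a voice clip as though heard through glass.
voice = [0.0, 0.8, -0.6, 0.9, -0.7, 0.5]  # stand-in for real audio samples
muffled = one_pole_lowpass(voice, sample_rate=44100, cutoff_hz=2000)
print(len(muffled))
```

A lower cutoff gives a thicker ‘glass’. You would still want to check intelligibility afterwards, for the same reason EQ is used to bring out the hi-mids in the chain above.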

Conclusion

Can synthetic voice-actors make funny? Humor is a very subjective aspect of human emotion. What’s funny to Samuel isn’t so funny to Mary, and so forth. So maybe the jury is still out on that one. To improve our NPC’s delivery, we’ve had to rely on 3rd-party software to ensure that techniques were carefully deployed. Markup language deployment probably won’t be sufficient for specific tasks like this, where real-time interaction is not a requirement. That’s my best guess, anyway.

You may wonder what to do if you have a project that requires an ensemble of funny voices. Well, as long as you have at least one funny human available to you, that person can be your model for all voices. Then your cast of synthetic actors can be molded to conform to your model’s comedic timing.

How about this scenario: you have a cinematic cut-scene where there are several actors in the movie (or trailer). But your budget can only afford one human voice-actor. Consider recording your one voice actor doing the roles of the entire cast. Then, using the techniques discussed above, create an ensemble of TTS voices and, in your video editor (NLE), synchronize the synthetic voices to the phrasings and expressions of your one human actor.

In fact, maybe we’ll try to tackle an example of that in my next blog entry. Stay tuned!

By Rudy Helm, Audio and Quality Assurance Tech, Visual Purple, LLC.

At the end of my previous discussion on NPC Voice-over production, I promised that I would follow up with a blog about what it might take to try to get a synthetic voice to be funny. Remember, we’re talking about NPCs (Non Player Characters); playable characters, by contrast, are typically voiced by professional voice-talent. I will provide you with samples as we roll along, of course, in tutorial fashion, but with the disclaimer that this is just one approach to this end, as there are likely other useful techniques that could be considered.

Ok, I sense you are protesting, how can a robot out ‘funny’ a performance by a professional voice-talent? I am not at all suggesting that a synthetic voice-actor can win such a contest. But if you are faced with options, and if this is the option you choose, you really want to come up with workable solutions.

What are the resources?

There are a number of synthetic voice vendors available. One obvious task is to choose one. A simple Internet search can help you solve that problem. For purposes of this discussion we will utilize 3rd-party software control mechanisms to affect voice properties. In this tutorial we’ll use a stand-alone audio editor along with a non-linear editor (NLE), but the same task can almost certainly be done in a digital audio workstation (DAW) of your choice. The audio editor might be replaced with XML controls if that is your favorite way to affect voice pitch, tempo, etc. However, I think it would be extremely tedious to try to deploy markup languages as a substitute for a DAW. By the end of this writing I bet you will agree with me. Please refer to my earlier post, When Your Voice-actor is a Robot, for some detail on resources. And then there is that last very important asset to have: someone who is funny!
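To give a feel for what the markup route looks like, here is a small Python sketch that builds one line of markup with a pitch and rate adjustment. It assumes the W3C SSML ‘prosody’ element, which is my choice for illustration since no particular XML dialect is named here, and the helper name prosody_markup and the attribute values are made up. A complete SSML document also needs version and namespace attributes on the ‘speak’ element, and engines differ in which values they honor, so check your TTS vendor’s documentation.

```python
import xml.etree.ElementTree as ET


def prosody_markup(text, pitch="+4st", rate="slow"):
    """Wrap text in a prosody element that asks the TTS engine to shift
    pitch (here in semitones) and change the speaking rate."""
    speak = ET.Element("speak")
    prosody = ET.SubElement(speak, "prosody", {"pitch": pitch, "rate": rate})
    prosody.text = text
    return ET.tostring(speak, encoding="unicode")


print(prosody_markup("I never seen so many sharks!"))
```

Now imagine wrapping every word, or every syllable, of a joke in its own tag with hand-tuned values, and you can see why this approach gets tedious next to dragging blobs around on a timeline.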

Here at Visual Purple, we are fortunate to have a gentleman who is a very funny guy. And for this experiment it makes for a very lucky day! So, you may be thinking, why are we talking about working with a funny human? Isn’t this topic about having a funny robot? Well, yes, but our funny human (let’s call him John) will serve as a model for our robot.

Say what?

The short answer is, we will import audio clips of the funny human into our DAW, and then we will import audio clips of the synthetic voice and make it emulate the human’s speech patterns.

Say what?!!
Ok, in this project our goal is to make some humorous fish voices. You see, we have a scene in one of our products where someone at a bar can stand and stare at a fish tank. As the fish swim by, and if the avatar is situated close enough to the fish tank, the fish might begin to make wisecracks to the, uh, fish admirer. This is an ‘Easter egg’ where fun is poked at the avatar, possibly insinuating that he has had a bit too much to drink. To achieve our goal, we need to mold the synthetic voice clip to emulate the comedic timing expressed by the human model.

Let’s do it!
So let’s start the process by importing into our DAW an audio clip that John, the funny human, recorded for us (Figure 1).

Figure 1

Next, listen to John’s original model for reference. The script: “Last week it was a lawyer’s convention. I never seen so many sharks!” We follow that by importing a corresponding audio clip from the synthetic voice (Figure 2).

Figure 2

Without doing anything further at this stage, we can easily see that the graphical sound ‘blobs’ don’t match. So, before we move on, have a listen to the robot’s recording. Notice that this clip has already been treated with pitch transposition. (For a discussion on ways to do that, please refer to my earlier post, When Your Voice-actor is a Robot.) Our intent was to get cartoon-y voices, so we started with a female TTS voice and then modified her pitch characteristics.

Now, to make the robot emulate John’s comedic speech patterns, we need to edit the clip’s timelines so that the graphical sound ‘blobs’ do match. Figure 3 illustrates an example:

Figure 3

In Figure 3’s example we see only the first two words of the script (“Last week…”). Listen to how the TTS’s utterance of the word ‘week’ occurs earlier in the timeline than does John’s blob of the same word. Close, but the timing is just not right, is it? Note that we need to create a split point (the vertical line represents this) just before the TTS blob. Doing this enables us to separate the words and move them as we wish on the timeline (see Figure 4).

Figure 4
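If you like to work by the numbers rather than by eye, the amount to nudge each split-off TTS word is simply the difference between where John’s word starts and where the TTS word starts. Here is a tiny Python sketch of that bookkeeping. The function name word_shifts and the onset times are hypothetical; you would read the real ones off your timeline.

```python
def word_shifts(model_onsets, tts_onsets):
    """Seconds to move each TTS word so that it starts where the human
    model's word starts. Positive means move it later on the timeline."""
    if len(model_onsets) != len(tts_onsets):
        raise ValueError("need one onset per word in both clips")
    return [m - t for m, t in zip(model_onsets, tts_onsets)]


# Hypothetical onsets (seconds) for 'Last' and 'week'. The TTS says 'week'
# too early, so that word gets a positive shift.
john = [0.0, 0.5]
robot = [0.0, 0.25]
print(word_shifts(john, robot))  # [0.0, 0.25]
```

The split point is what makes the nudge possible: until ‘week’ is its own region, moving it would drag ‘Last’ along with it.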

Now, listen to both voices speak those two words in sync. (…to be continued)