Friday, February 19, 2010

Kindle Text-to-Speech Dissected: Part 1 - Tom Glynn Interview

Here’s an interesting scenario: you’re listening to your child, at 6 years old, read a story to you.  Your child is now 35, so this must be a recording, right?  But the book your child is reading was published only last year, and you are playing it for your 5-year-old grandchild!  Sounds impossible?  Not if your child’s voice was recorded specifically for playback in a text-to-speech (TTS) system.  Although TTS today uses a computerized voice or someone else’s voice, in the near future, TTS recording will enable the capture and playback of voices for everyone.  But how does a TTS system actually work?
In this multi-part series, Kindicted will examine the history, technology, and people behind TTS, which includes everyone from childhood prodigies to internationally famous criminals.  But first, a lighter look at computerized speech, including a recent interview with the default male voice of the Kindle – Tom Glynn.
The ‘human’ computer
In a large city, computerized and computer-controlled speech systems are encountered on a daily basis: subway, transit, GPS, and reservation systems, automated call attendants, cell phones, personal digital assistants, ebook readers, and so on.  For systems with a fixed number of words and phrases, envisioning the system is straightforward: the computer simply plays back the appropriate previously recorded phrase based on input criteria.  TTS systems such as the Kindle’s, which use a human voice rather than a computerized (or synthesized) one, are more involved: 1,400 individual snippets of English speech have to be recorded, labeled, and dynamically arranged for playback in order for the device to convert text to speech.
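To make the "recorded, labeled, and dynamically arranged" idea concrete, here is a minimal Python sketch of snippet concatenation. It is only an illustration under loud assumptions, not how the Kindle's software actually works: the snippet labels, the `DEMO_INVENTORY` table, and the `to_diphones` and `synthesize` functions are all hypothetical, and short lists of numbers stand in for real audio recordings.

```python
from typing import Dict, List, Tuple

Diphone = Tuple[str, str]   # a pair of adjacent sounds, e.g. ("k", "ae")
Snippet = List[int]         # stand-in for recorded audio samples (assumption)

SILENCE = "_"               # hypothetical label for the pause around a word

# Hypothetical labeled inventory covering only the word "cat".
# A real system would hold many more snippets, stored as waveforms.
DEMO_INVENTORY: Dict[Diphone, Snippet] = {
    (SILENCE, "k"): [1],
    ("k", "ae"): [2, 3],
    ("ae", "t"): [4],
    ("t", SILENCE): [5],
}


def to_diphones(sounds: List[str]) -> List[Diphone]:
    """Pad the sound sequence with silence and pair each sound with the next."""
    padded = [SILENCE] + sounds + [SILENCE]
    return list(zip(padded, padded[1:]))


def synthesize(sounds: List[str], inventory: Dict[Diphone, Snippet]) -> Snippet:
    """Look up each labeled snippet and join the snippets in playback order."""
    audio: Snippet = []
    for unit in to_diphones(sounds):
        if unit not in inventory:
            raise KeyError(f"no recorded snippet labeled {unit}")
        audio.extend(inventory[unit])
    return audio


if __name__ == "__main__":
    # Arrange the snippets for the sounds of "cat".
    print(synthesize(["k", "ae", "t"], DEMO_INVENTORY))
```

Run as a script, this prints the five stand-in samples in order, `[1, 2, 3, 4, 5]`. The point of the sketch is the contrast with a fixed-phrase system: instead of one recording per phrase, a small labeled inventory is recombined to cover text that was never recorded as a whole.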
The man behind Kindle’s TTS voice
In the case of Amazon’s Kindle device, Nuance Communications supplied the software and voices to convert text to speech.  You can currently choose between a male and a female voice, although Nuance’s website lists dozens of voices in many languages.  In February of 2009, it was discovered that the male voice behind the default Kindle TTS is an experienced singer/songwriter and broadcaster: Tom Glynn.  A year has passed since Tom’s Kindle ‘discovery’; he has a new album out, and Amazon has sold millions of Kindles.  Kindicted recently had an opportunity to catch up with Tom.
As an added bonus, this interview is available in mobi format here.  Simply download the mobi file, transfer it to your Kindle, and play the interview using the default male voice (Tom’s).  In some sense, Tom will be reading the interview aloud using his own voice!

The interview

Kindicted: You are an accomplished singer and songwriter; when did you realize that you had vocal and musical talent?
Tom: I realized it pretty young. My parents picked up on it and started me on piano lessons at age ten. From there, I taught myself how to play by ear and picked up the guitar around age 14. I was obsessed with playing piano or guitar every night through high school. I always loved music and had a pretty finely tuned ear for details like harmony, chord structure, and rhythm from an early age.
Kindicted: Broadcasting was a part of your career; was that to support your music, or to enhance it?
Tom: It was essentially a way to support myself, but I had a love of broadcasting from an early age. I loved performing impressions growing up, and I paid close attention to the nuances of the way people spoke. But yes, broadcasting and music definitely enhance each other. Inflection, pacing, and other elements of spoken words are certainly helped by being musical, as well as being able to remember the pitch of something I say and duplicate it many times for consistency.

Kindicted: In hindsight, do you feel that being a radio personality was critical to being able to use your voice talents for computerized speech?
Tom: Absolutely. A radio background gives you the experience you need to know how to capture people’s attention and communicate information in a compelling way. It also helps you develop a style and feel that’s your own.

Kindicted: Did you have to seek out work, or did someone hear your voice and decide that it would be perfect for a speech system?
Tom: Like most people in broadcasting, I had to work hard for a number of years to seek out opportunities. It’s a misconception some people have that having a good voice is all it takes to do voice-overs. There’s a lot more to it than that. Part of it is who you are as a person because your personality is reflected in the work you do. It also requires many, many hours of refining things such as pronunciation and inflection, along with listening to your recorded voice constantly to see if there are subtle improvements you can make to convey a better feel or connect better with the listener. I still do that every day.

Kindicted: Is there a high degree of competition in the voice market?
Tom: Yes, voice-over work is a very competitive industry. I say that not in the sense that I feel like I’m competing with someone, but that there are perhaps a limited number of jobs that are in high demand. Ultimately, you’re competing with yourself to be the best you can be, just like any field, and if you develop a sound and style that’s your own, you’ll do well. If you find a niche, it’s great.

Kindicted: From a philosophical point of view, does it bother you that your voice is being used to utter phrases that you personally would not say or approve of?
Tom: Not really. I did some on-camera work earlier in my career, and I found that to be much more invasive and questionable. I think when someone sees your face, it’s more like a personal endorsement. That’s why you hear a lot of major movie stars doing voice-overs for TV commercials these days that they would never appear on camera for. If they were on camera, it would be as if they were personally endorsing something, but that’s not a problem if it’s just their voice – even when people recognize their voice. I honestly don’t spend much time thinking about the way my voice is chopped up and used. I’m much more focused on getting it right when I do the actual recordings, and then I let it go. Also, I think people realize that a computerized TTS voice is just a functional tool more than a real person.

Kindicted: If your voice kept uttering new phrases after your death (a long time from now), do you feel that you have a more modern degree of immortality than actors or musicians, whose body of work is essentially static?
Tom: Hey, you may be right. I never thought of that. Maybe my TTS voice can do my eulogy. 

Kindicted: Have you ever encountered your own voice in an interesting situation? If so, what was that like?
Tom: Oh yes, all the time. I end up having to converse with myself frequently on the phone. It’s also amusing when I’m waiting in line at CVS, and I hear myself say “One pharmacy call” on the loudspeaker. Or the time a group of us were watching a storm bulletin on TV, and it was me giving the emergency forecast as the voice of the National Weather Service. There are many surreal moments.

Kindicted: Do people recognize your voice as the voice of a GPS, Kindle, voice prompt, etc.?
Tom: If someone asks me what I do, and I tell them, then they recognize it. But not just out of the clear blue. Even when I’m at CVS and having a conversation with the clerk, they don’t recognize that’s also me on the loudspeaker – and I certainly don’t tell them. That’s another beauty of voice-overs…my anonymity. I’m a quiet, introverted person for the most part despite my voice being all over the place, so not being recognized is fine by me.

Kindicted: You don't own your voice with regard to the plethora of devices and systems that use it - does that bother you?
Tom: Not at all. That’s part of the gig.

Kindicted: Are you made aware when your voice will be used in a new device, or do you usually find out after the fact?
Tom: Usually I know because most of my daily work is not TTS. I’m usually recording actual phrases for specific clients that I’m tailoring my voice and presentation for. But with TTS, I don’t always know where my voice ends up until after the fact. I had no idea I’d end up as the voice of the Kindle when we recorded those phrases. It was a thrill for me because I had already become addicted to my first generation Kindle before the TTS one came out. I’ve been a Kindle addict for quite some time.

Kindicted: If you lost your voice, would you use a computer to speak with your own voice, or would you choose a different one?
Tom: I’d probably enjoy the silence. I talk so much for my job that I prefer to be quiet much of the time. 

Kindicted: Do you like hearing the sound of your own voice?
Tom: Well, I’ve certainly become used to it over the years between singing and speaking. When I hear my voice, I’m usually paying close attention to the details and nuances of what I’m saying. I’m usually asking myself questions like, “How might the way I said that make somebody feel? Was it friendly enough, was it too friendly, was it delivered at a nice brisk pace or was it too rushed?” That’s an example of my internal dialogue. 

Kindicted: The process of recording diphones (snippets of words) seems (on the surface) to be physically and mentally demanding - how do you prepare for the process?
Tom: Yes, the work takes a great degree of focus for long stretches at a time. I burn out after about 3 hours of continual recording because of the level of concentration and the physical demands of making my mouth pronounce everything just right.  It’s important to be incredibly consistent, so I just get myself in a good frame of mind before I record. I can’t think about anything else other than what I’m recording. It really takes full concentration, but I enjoy that. I’m someone who’d much rather work intensely for several hours than work all day at a job that has a bunch of downtime.

Kindicted: How long does the typical recording session take (in total)?
Tom: A job can take anywhere from a few minutes to all day. But generally I try to limit any one job to three or four hours to make sure the client is getting the very best product possible.

Kindicted: How closely did you have to work with the scientists and engineers to pronounce the diphones just right?
Tom: We had recorded several versions together in the past, so we were lucky enough to have a lot of trial and error with TTS going back a number of years. The way we decided to go was to just be myself as if I was speaking normally and things I was saying were not going to eventually be chopped up. I think that helped us end up with a more natural sound with this version of TTS. Certainly it’s not as natural as hearing a real voice speaking, but it has come a long way. I really hope people find it helpful.

Kindicted: Did you have to have any speech training, or work with a linguist?
Tom: No, my speech training was all on the job over the years during broadcasting jobs, and many hours listening to recordings of myself and being hyper-critical. The most important element in learning to be good at voice-overs is not how well you talk, but how well you listen to yourself and others.

Kindicted: Do you use your voice talents for audiobooks?
Tom: I have never done an audiobook. I’ve done many types of narration over the years, but never an audiobook. I do listen to them quite often though, and there are some remarkable voice talents out there who read them. I love listening to their presentations.

Kindicted: Are you in demand for other roles (TV, radio, Internet etc.) based on your voice work?
Tom: I’ve done numerous radio and TV commercials over the years, along with many projects for the Internet, training videos, cartoon characters, corporate presentations, movie trailers, and literally thousands of other projects. Now people mainly know me as the phone voice they speak to when they call Bank of America, United, Apple, CVS, and many more. And my TTS voice is the voice of OnStar’s GPS, the National Weather Service, the Phoenix Airport, and of course, the Kindle.

Kindicted: The Kindle didn't pronounce ‘Obama’ properly - did you have to record that one?
Tom: I actually read about that on my Kindle when the story came out. No, I didn’t re-record it, so they must have fixed it somehow in the technology. I’m glad they did.

Kindicted: For TTS, are you still asked to record new words, diphones, and phrases, or is your body of work large enough that no additional pronunciation is required?
Tom: I’m sure at some point we’ll record some more phrases, but currently I think we’re all set.

Kindicted: Do you still plan to market your voice, or are you concentrating on other endeavors?
Tom: I’m always open to new projects and ideas. I’m lucky in that I have a lot of clients who rely on me at the present moment, but I’m always up for new challenges. I’m still a musician at heart, and I just released a brand new album called “Blue You’ll Do”, which is available at Amazon and iTunes. I’m really happy with the way it turned out, and the reaction so far has been fabulous. This particular album features a unique baritone acoustic guitar, which I bought last year. It has an unusual custom tuning, so it’s half guitar and half bass. I’ve never heard anything like it on a singer-songwriter record. Right now I’m concentrating on promoting that and hopefully getting it into the ears of as many people as possible.

Kindicted: People can still tell that your voice is computer-driven; how long do you feel it will be before a computer-controlled voice will be indistinguishable from a human one?
Tom: That’s a good question. As someone who speaks for a living, I believe there is a human dimension to speech that can never really be replicated by a machine completely. But who knows?

Kindicted: From a personal point of view, do you feel that the ever-increasing use of electronics and electronic communication enriches people's lives, or does it dehumanize to a degree?
Tom: I love technology. Technology allows me to reach millions of people with my music digitally, and it allows me to do my voice-over work from virtually anywhere. Like anything, it has the potential for good and bad in it depending on what it’s used for. But that’s human nature in a nutshell too. I do know what you mean about dehumanizing with all the devices, but hopefully it’s also opening up channels for people to connect in new and beneficial ways too.

Kindicted: Do you ever see a day when computers will be the norm for writing and performing music - including singing?
Tom: Wow, I hope not. I guess to some degree it already is the norm. Singers are made to sound more ‘computerized’ with the Auto-Tune effect. I hope we always value real musicians, singers, and songwriters because that’s really at the core of who we are as human beings.

Kindicted: Tom, thanks for taking the time out to answer a few questions. Best of luck with your new album.