Friday, March 12, 2010

Kindle Text-to-Speech Dissected: Part 4 - Jo Lernout

In the last TTS installment, the lineage of the Kindle's TTS was traced back to 1999 (the creation of RealSpeak), and then to 1983 (and earlier) for the roots of Nuance - the company that owns RealSpeak. RealSpeak was originally developed by Lernout & Hauspie, a Belgian corporation founded in 1987 with the intention of incorporating computer speech into devices and, later, the Internet. After rapid expansion in the late 1990s, Lernout & Hauspie became caught up in a web of accounting irregularities that eventually toppled the company. Accounting inconsistencies aside, L & H was a company set on changing how people interacted with computers and technology. This is evidenced by the substantial growth and adoption of L & H technologies by ScanSoft (and later Nuance) after it purchased the speech division of L & H.

It has been almost 10 years since the fateful breakup of L & H, and Kindicted wanted to know what happened to the principals - Jo Lernout and Pol Hauspie. Jo Lernout was kind enough to agree to an interview with Kindicted. The topics range from early influences to Jo's vision of the future of computer interaction.

Jo Lernout has been described as a compulsive entrepreneur; someone who is forever thinking about new inventions and companies. This was certainly true in 1987, when Mr. Lernout and a self-made businessman, Pol Hauspie, founded a company that would eventually become a juggernaut for speech technologies - the aptly named Lernout & Hauspie (or L&H). Regrettably, L&H’s fire burned bright, and it became a victim of its own runaway success. The ashes of L&H did survive, however, incorporated into many speech products for computers, phones, GPSes, and even the Kindle.

And now, the interview.

You’ve had a brilliant and varied career from teacher to director, inventor, and consultant. 
What influences did you have early on in life that led you to choose a career in science?

As a child, I was interested in science in general. Growing up, I was maybe 7 or 8 years old or so, I read chemistry handbooks and experimented with growing crystals, etc. – the typical hobby kits and toy-chemistry labs. I read and collected books about birds and animals, and every day, studied the local biotope where I grew up (a little village in Flanders, Belgium). So, to me, it was a logical career path to later become a biology and chemistry teacher.

After a few years of teaching, a former teacher-colleague told me that business life is more challenging and interesting, along with the potential to earn a lot more than as a teacher. So, with his advice, I became a medical representative (at Merck Sharp and Dohme); a good combination of sales and (medical) science. I loved that job, because Merck gave numerous courses, ranging from medical topics, to pharmacology, and sales techniques. I did that for 5 years. I enjoyed this quite a lot, since selling seemed to be a natural fit for me. That was from 1972 to 1977.

Communications technology was then still limited to landline phones, telex, and regular mail. Fax and Xerox machines were just released, and computers were considered to be the domain of a limited number of mainframes and some of the first mini-computers, all contained in secret and sacred cooled rooms, where only a special breed of programmers were allowed.

Who would you say had the greatest influence on your advancement as either a teacher, a businessman, or an inventor?

One of my younger brothers (there were 14 of us!) who was blessed with a very high IQ and a natural ability in math, graduated as a civil engineer in electronics. He was an excellent computer programmer. He told me computers were the future, and programming was not that hard to learn. He convinced me to apply as a sales engineer at what was then Honeywell Bull. They hired me, and I followed a 3-month COBOL programming course, and then started selling mini-computers, mainly for accounting and payroll applications. Again, selling computers was kind of easy for me. In two years I sold many computers to medium and small businesses in Flanders. I still am grateful to my brother for having convinced me to step over to the computer business.

How did you initially become interested in speech technology? 

In 1979, the then-CFO of Wang's local subsidiary, Wang Belgium, convinced me to become its sales manager, and also to enroll in a post-graduate MBA evening course at the University of Ghent (the famous Vlerick Institute).

I am still grateful to that person, since I received formal business management training and was also exposed to some early speech recognition and TTS applications that Wang was experimenting with. At that time, Wang was the absolute world leader in word processing systems. We started using internal email as early as 1982. Wang was also experimenting in those years with a system called "Alliance", which was a keyword-based text search engine, that blew my socks off. Essentially, that was "Google" 15 years before Google was founded! Wang was also delivering word processors to companies that were building automatic machine translation technologies. One of them was a division of Siemens, linked to the University of Leuven in Belgium. That division was later sold to a German company: GMS, and L & H bought GMS in 1996.

From then on, the combination of "speech recognition, text-to-speech, search engines, email and automatic translation" technologies became nearly an obsession for me. And today it still is an obsession.

To me, it was clear that this combination of technologies was bound to become reality in various applications at some point in the future. Wang was clearly on that path. I saw my role as lobbying inside Wang Labs headquarters to ensure that other languages besides English would be developed.

What motivated you to start L&H?

In 1987 Wang was still doing OK, but the first signs of decline were then visible. Since 1984, Wang had a lot of competition from word processing as an application on PCs. Wang's dominance was over, and Wang entered the PC stage too late.  Wang should have licensed its word processing software to Microsoft or IBM – but that is another story.

I was thrilled by the future of speech and language technologies, and reading about them taught me that most large companies conducting research in those domains were concentrating on English or Japanese only. At Wang, there wasn't much drive to develop these technologies in other languages. I contacted several local Belgian university labs, and learned that some researchers there were working on Dutch and French versions (they later became our main R&D team at L & H).

Around that time, I became acquainted with a self-made businessman, Pol Hauspie, who lived in Flanders, and who had taught himself how to program computers. Pol had developed his own accounting applications and built a 40-employee software company, which he had recently sold to a larger software company in Belgium.

Pol Hauspie convinced me that the best way to realize the dream of combined speech and language technologies, in one set and in many languages, was to try to do it by ourselves. Together, we decided to start L & H. We put some of our own money in (yes, I sold my house and said goodbye to the healthy salary enjoyed at Wang as a sales and marketing manager) and some money from friends and family. In 1987, we started L & H with about $400,000 USD.

The rest is, as they say, history.

Millions of devices now use technologies from L&H.  When you founded L&H, did you have any idea that the field of computer speech would become so popular?

I dare to say that we indeed had this vision early on. We knew it was a matter of time for the technologies to become more mature, stable and really useful.  We also knew it was only a matter of time before the necessary hardware was available, and small and cheap enough to deploy such technologies on PCs and even mobile devices. Our vision statement in 1990 was audacious enough to state just that.

Our vision was also that, if we were to develop for many languages, we could license these as components to large companies all over the world. We compared our strategy to "Dolby".

We had early licensing talks with AT&T (who invested, in 1993, $5M USD), Apple, Analog Devices, V-Tech (for devices similar to the Kindle), and even early talks with Microsoft and Samsung (1996).

Later on, Microsoft endorsed our multi-lingual approach and vision by investing $60M USD in 1997, and by signing a co-development and co-marketing deal.

Still later on, in late 1999, we made licensing deals, or were in active talks with Delco, Ford, Samsung, LG, Daimler-Chrysler, TeleAtlas, and many others.

In 2000, we even had serious talks with David Wetherell of CMGI, then the owner of Alta Vista and Lycos. If that merger had materialized, we would have been "Google" today. Together, we saw the future of mobile devices for mobile internet access, where the combination of speech and language technologies (available in the language of the user) would help mobile users find information in the enormous haystack of multilingual online information. Semantic search engines (such as the ones we acquired from Novell and others in the late nineties), rather than pure keyword matches, would have helped mobile users find more precise answers to their queries. The project we had in mind was dubbed SofIA: Society of Intelligent Assistants. Early wireless prototypes were shown in the spring of 2000. If you analyze Eric Schmidt's current vision, he basically says the same as what we stated in 2000 with regard to what is important to Google and the world: billions of users will dig their information from smart internet phones, and speech and language technologies will have a very important role in this.

Do you have any satisfaction in knowing that technologies that you helped to develop are now assisting millions of people?

I certainly do. The success of Nuance and others (Apple, Google Nexus) currently applying speech and language technologies in numerous devices and applications reinforces the vision we had at L & H.

In the late '90s and early 2000s, we at L & H saw these technologies becoming massively available around 2005. That turned out to be a few years "too optimistic". It is interesting to note that Nuance’s revenue from sales was already over $500M USD in 2005/2006.

Between 1998 and 2005, the use of Nuance and other speech and language technologies was already massive - albeit more in vertical domains. Deployment was less widespread than it is on internet smart-phones today, and included applications such as:

  • Dictation applications (especially Dictaphone) for doctors, lawyers, etc.
  • User Interfaces for the visually impaired; this particular segment gives me a lot of satisfaction, as the technologies are of much help to them.
  • In-car dashboard applications, such as voice controlled GPS and in-car mobile phones
  • A wide variety of educational toys and language study applications.

Do you regularly use any of the technologies that you had a hand in developing or popularizing?

I often use Google Translation. It doesn't give a perfect translation, but it is good enough to pull a Chinese or Russian website into English, and I get the gist of it. That is not former L & H technology (Google's MT is based on a statistical engine, and not on rule-based MT), but it demonstrates that there is a multilingual haystack of information on the net, and MT helps to find what you need, in any major world language. That part of language technology was definitely part of L & H's vision and portfolio.

I don't use speech recognition or TTS. I'll become an avid user when it really functions well on a smart-phone.

If I were a radiologist and had to dictate standard radiology reports daily, I would certainly use Dragon NaturallySpeaking or specialized versions of Dictaphone (L & H acquired Dictaphone in 2000). It is more productive than dictating first into an old-technology tape-based Dictaphone device and then passing the tape on to a medical secretary for transcription. It is much cheaper to have good speech recognition do the transcription job in a nearly fully automatic way.

PC-based Dragon NaturallySpeaking (especially the latest versions) is absolutely stunning in terms of accuracy. However, even with such high accuracy, it slows me down when I am at my PC; I have to dictate and observe the displayed text at the same time, and this simply confuses me. This is where we were wrong at L & H in terms of assessing the real market for PC-based dictation engines. In addition, it is kind of weird sitting at your desk and dictating to a PC, whether you are alone in the office or, even more so, when others are around.

So, I resort to the “good old keyboard”.

As for the speech recognition on my Nokia smart phone, or the one in my TomTom GPS, well, I stopped using it. Not good enough – too many false recognitions, and it doesn't recognize continuous speech for asking Google or other search engines a query, or to dictate my SMS or emails on it. So, today, it is just faster to tap and touch and type on the keyboard.

What one speech technology do you personally feel has a strong future?

Without any doubt, the so called "vertical" applications I mentioned before will continue to grow, and the speech and language technologies will continue to improve.

The real breakthrough and unparalleled massive usage will arrive (give or take a couple of years) when:

1. Speech recognition works really well on smart phones.

  • This includes continuous speech dictation for dictating SMS, emails, tweets, etc., and for submitting search queries – especially when one is on the move, or when one is driving a car. At that point, well-functioning speech recognition and voice-driven user interfaces will be the preferred way of interacting with smart-phones.

2. Intelligent search becomes a reality.

  • But there is more. In addition to voice-based interaction, very intelligent search technologies will be the key to deliver mobile internet information. Google keyword matches return too many pages, and when one is mobile, one doesn't want to receive millions of hits; one needs a small, concise, and semantically precise answer.  The user must be "understood" not only in terms of "what words were spoken by the user?", but more in the sense of "what is actually meant by the user?"

As an example, the question "is there a Godiva shop in Chicago?" is, semantically, slightly different from the question, "Where can I find a Godiva shop in Chicago?" and again slightly different from the question of a user (being in Chicago, using a smart phone with GPS) asking, "Show me the nearest Godiva shop around here." A user doesn't want to receive the same zillion pages back from Google, with many hits about "Godiva", "Chicago", etc. – no matter what the user actually meant.
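The distinction among those three questions can be sketched as a toy intent classifier. The patterns and intent labels below are invented purely for illustration; a real semantic engine would use far more than keyword patterns:

```python
import re

# Toy sketch: the same keywords ("Godiva", "Chicago"), three different intents.
# The regex patterns and intent names are made up for this illustration.
def classify(query):
    q = query.lower()
    if re.search(r"\bis there\b", q):
        return "EXISTENCE"        # a yes/no answer is expected
    if re.search(r"\bwhere can i find\b", q):
        return "LOCATION_LIST"    # a list of addresses is expected
    if re.search(r"\bnearest\b.*\bhere\b", q):
        return "NEAREST_GEO"      # one result, ranked by the phone's GPS fix
    return "KEYWORD_FALLBACK"     # degrade to a plain keyword search

print(classify("Is there a Godiva shop in Chicago?"))           # EXISTENCE
print(classify("Show me the nearest Godiva shop around here"))  # NEAREST_GEO
```

The point of the sketch is that a keyword engine collapses all three queries into the same "Godiva Chicago" search, while even crude intent detection can route them to different answer types.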

Speech recognition and retrieval of mobile information must function really well; to the point where the user finds it much more useful than tapping, touching and typing (on a tiny keyboard) and much faster than scrolling on the small display of a mobile smart phone through a vast amount of returned pages.

I used to proclaim in the late nineties, "There will be either a great market for toothpicks – to tap endlessly on these tiny keyboards on the billions of mobile internet phones – or a great market for speech recognition on these devices!"

Before that "real usefulness" is achieved, speech recognition will have to be "enriched" with embedded semantic understanding, thus not just based on good acoustical and statistical language models. In addition, sophisticated natural language processing will be needed, in order to capture the full meaning of what is queried, and in order to convert that to SQL or text mining features. The goal is to return only really relevant information to the mobile user.
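As a rough sketch of that last conversion step, an "understood" query might be turned into a single precise database lookup. The schema, slot names, and query shapes below are invented placeholders, not any real system's design:

```python
# Toy sketch: convert a captured meaning (intent slots) into one SQL query
# that returns a small, precise answer rather than millions of hits.
# The "shops" table and its columns are hypothetical.
def to_sql(intent):
    # intent: {"brand": str, "city": str, "want": "nearest" or "list"}
    sql = ("SELECT name, address FROM shops "
           f"WHERE brand = '{intent['brand']}' AND city = '{intent['city']}'")
    if intent["want"] == "nearest":
        sql += " ORDER BY distance_km LIMIT 1"  # one precise answer for a mobile user
    return sql

print(to_sql({"brand": "Godiva", "city": "Chicago", "want": "nearest"}))
```

The design point here is the `LIMIT 1`: on a small mobile display, the semantic layer's job is to narrow the answer, not to rank pages.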

Then and only then will we see massive usage on mobile phones. That is exactly the domain in which companies such as MIIAtech are now working: enhancing search engines and speech recognition by means of sophisticated NLP (Natural Language Processing).

Your company developed Text-to-Speech, dictation, and other technologies.  For Text-to-Speech, do you feel that recorded speech (using diphones, etc.) is superior to pure speech synthesis, or vice-versa?

It really depends on the application. Diphone-based TTS is usually a better choice for educational purposes (language learning, for instance), or for talking toys and talking avatars: imagine an avatar based on a real-life famous person, with his or her voice as TTS.

Blind users, on the other hand, prefer to stick to purely synthesized voices, as these are easier for them to understand, and their pitch and speed are easier to manipulate. An in-car GPS device may also benefit more from synthesized TTS, as it differentiates better between routing messages and reading street names.

Do you own an Amazon Kindle, or any other device that uses technologies developed at L&H?

I don't, but after reading your blog, I'll hurry up to get one for myself and the kids around here!

I have heard TTS on the Kindle; it is impressive, and is based on RealSpeak, but with substantial enhancements brought to it since it was acquired by ScanSoft at the end of 2001.

One still notices it is a computer talking, but I am sure this TTS will further evolve. Again, also in this area, adding NLP to the algorithms will help to make TTS really sound like a human being. Semantic clues derived from the context the TTS is reading aloud will be a helping factor to generate the correct prosody.

The technology you developed at L&H can essentially recreate a voice from anyone with a large enough sample of speech - including deceased individuals.  Have you personally had your voice recorded and converted to a TTS voice? 

This is really exciting! At L & H, we were, in the late nineties, using this approach to develop different voices, but the idea of sampling anyone's voice was only in the planning stages back then. Nuance did a great job in bringing this to the next stage (realized by former L & H TTS engineers now at Nuance).

I wish I had my voice recorded back then. Maybe one day!

What period in your career do you recall most fondly?

The day Microsoft invested in L & H, and a few years thereafter, when Bill Gates said in a keynote speech at the Etre conference in Southern France to a large audience, filled with executives from the biggest IT and Telecom companies from around the world, "L & H RealSpeak (TTS) is the only voice I want to listen to for longer than a minute for having a device read my emails to me."

Where do you see the computer speech field in 20 years?

By then, we will have passed the point where computers (including the ones we hold in our hands and the ones woven into our clothes) really understand what their human users want: spoken text and commands, multilingual search queries, and instant and perfect translation from one language to any other major world language. Spoken translated text will be reproduced in a TTS version of your own voice.

The user interface will include other "commands" we mean and give, such as pointing or facial expressions, as indicated in MIT researcher Pattie Maes’ "SixthSense" talks, or as already shown in the latest Google apps: let the camera of the smart phone look at something (a menu in German, for instance), and it tells you what it is and translates it into English.

Users will be "deeply understood" by their computers, meaning that the computers will also reason about what is asked. We see such examples in Wolfram Alpha. Computers will solve real-world "puzzles", to the point that humans can ask what is best to do in a particular situation. Think of it as having your professor, lawyer, business consultant, interpreter, coach, friend, and playmates with you and on you all the time, ready to give very precise answers, advice, and solutions, and ready to play along.

The artificial intelligence will look eerily close to real human intelligence, but that doesn't mean computers will be sentient "beings"; they will just act like sentient beings.

People can still tell with a high degree of probability if a speaker is controlled by a computer.  What advancements in the field are required in order to achieve computer speech that is indistinguishable from a human?

Probably a combination of:

  • More and better acoustic modeling and embedded NLP inside speech recognition algorithms,
  • Avatars that speak with excellent TTS (this is just a matter of time and more and cheaper hardware),
  • Avatars that "look" at you when you are speaking (benefit is that the camera can also pass lip-reading clues to the speech recognition algorithm),
  • Sophisticated NLP to make sure that the user is really "understood".

All of that will enhance the "near human" experience of the user interface.

Do you have any regrets in regards to your career as a teacher, businessman, or inventor? 

No regrets in terms of having at least tried very hard, and having achieved some part of L&H’s early vision. Others completed and are completing the picture. MIIAtech may well become an important supplier of the world's best NLP. I am grateful to a number of Belgian investors who contributed to the first start-up phase of MIIAtech.

Looking back, I think we would not have lost L&H if we had been more transparent in the way we conducted business. I am still convinced that the accusations (of planned fraudulent revenue bookings) are wrong. We designed a system of franchises in which investors could invest to pay fees to co-develop language versions with L&H. We only booked these paid-up franchise fees, not the shared revenues from licensing income that were expected later. I still don't see what was wrong with that. Our lawyers and auditors even advised us that it was a good and legal revenue-recognition system. Even today, there is still a lot of controversy about that issue.

But I learned this lesson: when an entrepreneur has good intentions, even if his lawyers and auditors approve the accounts, the entrepreneur still has to be very transparent to the entire world when their company is listed on a public stock market. If the company works on "sensitive" technologies, then this company must also make sure to engage in full transparency to all concerned and possibly affected parties.

To be sure: I haven't lost my passion for these technologies, and at the age of 61, I haven't lost my entrepreneurial drive. But I guess it is better to let younger and sharper folks turn vision and technology into shareholder value.

Saturday, March 6, 2010

Kindle Text-to-Speech Dissected: Part 3 - Corporate History

As mentioned in previous articles, TTS has followed a long and winding road, with as many as 50 companies vying for the ultimate prize: a machine that can speak as well as a human. Over the past 5 years or so, the computer speech industry has consolidated into 4 major companies, which has given a new round of speech-related startups an opportunity to take a shot at the prize. So far, no TTS technology (without pre-set phrases) has been able to fool a human, but advances in technology have a funny way of 'popping' up all of a sudden.

There have been many colorful figures in TTS’ history, and this series of articles will take a closer look at a few, starting with an in-depth interview with one of the key TTS figures of the last 25 years. But first, a timeline of sorts is required to establish a temporal context on which the rest of the TTS historical articles can be based. With that, the following is a list of companies, name changes, and acquisitions that have led to the TTS technology found in the Kindle today.

Bell Labs' VODER is displayed at the 1939 World's Fair
Stanford Research Institute founded
IBM funds speech research
G. Peterson, W. Wang, and E. Sivertsen produce speech using diphones
IBM Text-to-Speech team formed, including Dr. Michael H. O'Malley
John L. Kelly at Bell Labs uses an IBM 704 to 'sing'
Xerox PARC research facility opened
Cecil H. Coker at Bell Labs converts printed text into speech
Kurzweil Computer Products, Inc. is founded by Dr. Ray Kurzweil to develop character recognition software for any font.
Berkeley Speech Technologies (Text to Speech, Speech Recognition) founded by Dr. Michael H. O'Malley
Xerox purchases Kurzweil Computer Products and runs it as Xerox Imaging Systems (1990-1999), and later as ScanSoft (1999+)
Dragon Systems founded by husband and wife team Dr. James and Janet Baker
Speech Technology and Research (STAR) Laboratory founded as a spinoff of the Stanford Research Institute (SRI)
Eloquent Technology founded in Ithaca, NY by Dr. Susan Hertz
Lernout & Hauspie founded in Belgium by Jo Lernout and Pol Hauspie
Visioneer (Scanner hardware and software) founded by Dr. Denis R. Coleman
Nuance Founded as a spinoff of SRI's STAR lab (originally called Corona)
ALTech founded by Mike Phillips
Phonetic Systems founded
Lernout & Hauspie acquires Berkeley Speech Technologies
Lernout & Hauspie acquires an additional 16 speech-related companies
Lernout & Hauspie acquires Kurzweil Applied Intelligence
AT&T Launches their Next-Generation TTS, later renamed AT&T Natural Voices
ALTech renamed to SpeechWorks
Visioneer purchases ScanSoft from Xerox and adopts ScanSoft as a company-wide name
Lernout & Hauspie develops RealSpeak; the TTS system that would eventually make its way into the Kindle
Lernout & Hauspie acquires Dragon Systems
SpeechWorks Inc. acquires Eloquent Technologies
Rhetorical Systems Inc. founded in Edinburgh, Scotland
ScanSoft acquires Lernout & Hauspie's Speech and Language division
ScanSoft acquires Philips Speech Processing division
ScanSoft acquires SpeechWorks Inc.
ScanSoft acquires Rhetorical Systems Ltd.
ScanSoft acquires Phonetic Systems Ltd.
ScanSoft merges with Nuance and changes company-wide name to Nuance
Nuance acquires an additional 20 speech-related companies
Amazon selects Nuance technologies' RealSpeak to provide TTS in Kindles
Amazon releases the Kindle 2 and DX with TTS

A [rough] graphical version of the Timeline is available here.

Look for the next Kindicted article in the TTS series: an interview with...someone named on the above list!  Until then, happy reading!

Friday, February 26, 2010

Kindle Text-to-Speech Dissected: Part 2 - TTS History

Many Kindicts love the text-to-speech (TTS) feature of the Kindle 2 and DX, which reads books aloud in a user-selectable voice. The ability of TTS to read a book's text depends on whether or not the book is TTS 'enabled', which is an interesting subject on its own, but not the focus of this article - the second in a series on TTS technologies. Below, Kindicted presents a very brief history of the 'speech' portion of TTS.

History is littered with 'odd' individuals who were fixated on the technology du jour - distant evolutionary cousins of today's uber-geeks. In that vein, every few hundred years or so, a historical figure became obsessed with building a machine or apparatus that could mimic the human voice. The motives behind such early obsessions are not entirely clear; suffice it to say that since the primary means of human communication is speech, a talking machine would net a profit if put to some practical use. Regardless of the motivation, these individuals made strides in the analysis of human speech, and occasionally the yardsticks of knowledge were moved forward a bit.

Early machines consisted of user-modifiable tubes and bellows to produce vowel sounds. The subsequent addition of a mechanical tongue and lips enabled consonant sounds to be produced (along with a likely side effect: lonely inventors more adept at kissing). The advent of the telephone raised new interest in the study of human speech, and by the 1930s, Homer Dudley, an engineer at Bell Labs, developed an electromechanical (i.e. non-digital) speech synthesizer dubbed VODER (for Voice Operating DEmonstratoR), based on research by fellow Bell scientists led by Harvey Fletcher. The techniques used for speech synthesis in VODER are still used in today's synthesis hardware - albeit with many refinements. Note that this voice synthesis is separate from voice encoding (VOCODER), which was originally invented as a means of coding speech for transmission through phone lines, but was subsequently adopted by musicians as an interesting vocal effect.

Hooked on Phonemes
Post-1950, speech research focused on the phonetic elements of speech.  A phoneme is the smallest unit of sound that can be separately distinguished between sequential utterances (e.g. the 't' sound in 'sat' or 'test').  The production of phonemes during speech produces energy (in the form of sound pressure waves) that can be recorded and analyzed.  If a general model of each phoneme produced through human speech is recorded, an electronic representation of a language can be recorded and labeled; the English language contains 37 to 47 phonetic elements.  Playback of the recorded phonemes in the right sequence, and at the proper speed, produces crude synthesized speech.  Early systems that produced speech in this manner were barely intelligible.  Humans are very sensitive to even minute variations in speech, which makes clear speech synthesis quite difficult.

The human auditory system is the original sound transcoder, transforming sound pressure waves to electrical signals to be processed by the brain.  No less than 5 distinct areas of the brain stem and brain participate in the detection and recognition of sound.  People are particularly sensitive to their own language, and can even detect (at better than chance levels) whether or not an unseen speaker is smiling while they speak.  This extraordinary sensitivity is one of the reasons why people are so adept at detecting when a computer is speaking or controlling speech.  So far, no one has been able to develop a non-scripted speech system that is indistinguishable from a human speaker - but that does not mean that companies haven't tried.

Please answer the Diphone
In order to raise the quality of computerized speech, researchers moved away from phonemes and towards diphones. Simply put, a diphone is the sound produced across the transition between two phonemes; from a point halfway into the first phoneme to a point halfway into the second. Diphones (and sometimes half-syllables and triphones) are important in producing a natural-sounding synthesized voice, since the sound of a phoneme is modified slightly by the sound of the next phoneme. There are roughly 1,400 diphones in the English language, corresponding to the 'allowable' combinations of phonemes. Strictly speaking, the number of diphones would be the square of the number of phonemes, but many phoneme combinations never appear in spoken language. By using the diphone approach, researchers were able to greatly increase the intelligibility of a synthesized voice, since there were many more realistic phoneme combinations to choose from when constructing the sound.
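The arithmetic in that paragraph can be illustrated with a tiny sketch. The six-phoneme set and the phonotactic filter below are made-up stand-ins, far simpler than real English, but they show why the usable diphone count falls short of the theoretical square:

```python
# Tiny stand-in phoneme set; English has roughly 40-47 phonemes.
phonemes = ["t", "s", "k", "ae", "eh", "ng"]

# A made-up phonotactic filter for illustration only: disallow identical
# halves, and disallow "ng" as the first half of a diphone.
def allowed(first, second):
    return first != second and first != "ng"

theoretical = len(phonemes) ** 2
actual = sum(1 for a in phonemes for b in phonemes if allowed(a, b))
print(theoretical, actual)  # 36 theoretical pairs, 25 after filtering
```

With real English numbers, the same filtering shrinks the ~1,600-2,200 theoretical pairs down to the roughly 1,400 diphones a voice talent actually needs to record.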

Live versus Lip Synch
In the 1980s, a battle royale of sorts was unfolding among speech researchers. In one corner were the fundamentalists: researchers who believed that the purest and most flexible sound came from rule-based speech synthesis. This included programs that modeled airflow, tongue position, lips, etc. - sort of like a digital representation of the early tube-based apparatus, coupled with an extensive database of rules describing how different phonemes are paired. In the other corner were the concatenation-based zealots. This group believed that the key to realistic speech was to build words out of a pre-recorded database of diphones (from a human speaker). Since the Kindle's TTS uses a concatenation-based system, this series of articles will not cover rule-based speech synthesis in any great detail; suffice it to say that there are pros and cons to each approach, and neither has clearly 'won'.

In the first installment of this series, Tom Glynn (the TTS voice of the Kindle) indicated that his diphone recording sessions consisted of reading phrases selected to cover the 1,400 diphones.  The recorded phrases were sent through another application that analyzed Tom's speech, automatically split it into diphone segments, and stored them in a database.  These speech segments are dynamically selected and concatenated to create speech - but how does the computer know which segments to concatenate?  The interpretation of text is the subject of a future article in this series.
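The concatenation step itself can be sketched in a few lines.  In this toy example the diphone labels and sample values are invented stand-ins (a real database holds audio samples cut automatically from recordings like Tom's); it simply strings together the segments spanning the word "cat":

```python
# Minimal sketch of concatenative synthesis: look up each diphone
# (half-phoneme to half-phoneme cut) and join the audio end to end.

# Stand-in "recordings" - short lists of samples instead of real audio.
diphone_db = {
    ("sil", "k"): [0.0, 0.1],   # silence into /k/
    ("k", "ae"):  [0.2, 0.3],
    ("ae", "t"):  [0.4, 0.5],
    ("t", "sil"): [0.6, 0.0],   # /t/ into silence
}

def synthesize(phonemes, db):
    """Concatenate the diphone segments spanning a phoneme sequence,
    padding the sequence with silence at both ends."""
    padded = ["sil"] + phonemes + ["sil"]
    samples = []
    for pair in zip(padded, padded[1:]):
        samples.extend(db[pair])
    return samples

audio = synthesize(["k", "ae", "t"], diphone_db)  # the word "cat"
```

The hard part, of course, is not the concatenation but deciding which phoneme sequence (and which candidate recordings) a piece of text calls for.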

By the mid-1980s, the majority of the TTS technology in use today had been developed; efforts have since concentrated on interpreting text, applying proper pronunciation and emphasis (a.k.a. prosody), and supporting non-English languages.  In 1987, a corporation was founded with the vision of bringing computerized speech to the masses.  The corporation met its goal, but at a high cost to the owners.  Next week's installment examines the corporate roots of TTS; from university-funded initiatives in the 1960s, to the market leaders today (and the acquisition of over 30 companies in-between).

Friday, February 19, 2010

Kindle Text-to-Speech Dissected: Part 1 - Tom Glynn Interview

Here’s an interesting scenario: you’re listening to your child read a story to you from when they were 6 years old.  Your child is now 35, so this must be a recording, right?  But the book your child is reading was published only last year, and you are playing it for your 5-year old grandchild!  Sounds impossible?  Not if your child’s voice was recorded specifically for playback in a text-to-speech (TTS) system.  Although TTS uses a computer or someone else’s voice today, in the near future, TTS recording will enable the capture and playback of voices for everyone.  But, how does a TTS system actually work?
In this multi-part series, Kindicted will examine the history, technology, and people behind TTS, which includes everyone from childhood prodigies to internationally famous criminals.  But first, a lighter look at computerized speech, including a recent interview with the default male voice of the Kindle – Tom Glynn.
The ‘human’ computer
In a large city, computerized and computer-controlled speech systems are encountered on a daily basis: subway and transit systems, GPS units, reservation systems, automated call attendants, cell phones, personal digital assistants, ebook readers, and so on.  For systems with a fixed number of words and phrases, the design is straightforward: the computer simply plays back the appropriate pre-recorded phrase based on input criteria.  For TTS systems, such as the Kindle's, that use a human voice rather than a computerized (or synthesized) one, 1,400 individual snippets of English speech have to be recorded, labeled, and dynamically arranged for playback in order for the device to convert text to speech.
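The fixed-phrase case really is as simple as a lookup table.  A minimal sketch, with hypothetical event names and clip paths:

```python
# Fixed-phrase playback: each event maps to one whole pre-recorded clip.
# Event names and file paths are invented for illustration.

ANNOUNCEMENTS = {
    "doors_closing": "clips/doors_closing.wav",
    "next_stop_central": "clips/next_stop_central.wav",
}

def announce(event):
    """Return the pre-recorded clip for an event, or None if no recording
    exists - a fixed-phrase system cannot say anything new."""
    return ANNOUNCEMENTS.get(event)
```

That last limitation is exactly what diphone-based TTS removes: instead of whole phrases, the system stores sub-word building blocks that can be rearranged into any sentence.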
The man behind Kindle’s TTS voice
In the case of Amazon’s Kindle device, Nuance Communications supplied the software and voices to convert text to speech.  You can currently choose from a male or female voice, although Nuance’s website lists dozens of voices in many languages.  In February of 2009, it was discovered that the male voice behind the default Kindle TTS is an experienced singer/songwriter and broadcaster: Tom Glynn.  A year has passed since Tom’s Kindle ‘discovery’; he has a new album out, and Amazon has sold millions of Kindles.  Kindicted recently had an opportunity to catch up with Tom.
As an added bonus, this interview is available in mobi format here.  Simply download the mobi file, transfer it to your Kindle, and play the interview using the default male voice (Tom’s).  In some sense, Tom will be reading the interview aloud using his own voice!

The interview

Kindicted: You are an accomplished singer and songwriter; when did you realize that you had vocal and musical talent?
Tom: I realized it pretty young. My parents picked up on it and started me on piano lessons at age ten. From there, I taught myself how to play by ear and picked up the guitar around age 14. I was obsessed with playing piano or guitar every night through high school. I always loved music and had a pretty finely tuned ear for details like harmony, chord structure, and rhythm from an early age.
Kindicted: Broadcasting was a part of your career; was that to support your music, or to enhance it?
Tom: It was essentially a way to support myself, but I had a love of broadcasting from an early age. I loved performing impressions growing up, and I paid close attention to the nuances of the way people spoke. But yes, broadcasting and music definitely enhance each other. Inflection, pacing, and other elements of spoken words are certainly helped by being musical, as well as being able to remember the pitch of something I say and duplicate it many times for consistency.

Kindicted: In hindsight, do you feel that being a radio personality was critical to being able to use your voice talents for computerized speech?
Tom: Absolutely. A radio background gives you the experience you need to know how to capture people’s attention and communicate information in a compelling way. It also helps you develop a style and feel that’s your own.

Kindicted: Did you have to seek out work, or did someone hear your voice and decide that it would be perfect for a speech system?
Tom: Like most people in broadcasting, I had to work hard for a number of years to seek out opportunities. It’s a misconception some people have that having a good voice is all it takes to do voice-overs. There’s a lot more to it than that. Part of it is who you are as a person because your personality is reflected in the work you do. It also requires many, many hours of refining things such as pronunciation and inflection, along with listening to your recorded voice constantly to see if there are subtle improvements you can make to convey a better feel or connect better with the listener. I still do that everyday.

Kindicted: Is there a high degree of competition in the voice market?
Tom: Yes, voice-over work is a very competitive industry. I say that not in the sense that I feel like I’m competing with someone, but that there are perhaps a limited number of jobs that are in high demand. Ultimately, you’re competing with yourself to be the best you can be, just like any field, and if you develop a sound and style that’s your own, you’ll do well. If you find a niche, it’s great.

Kindicted: From a philosophical point of view, does it bother you that your voice is being used to utter phrases that you personally would not say or approve of?
Tom: Not really. I did some on-camera work earlier in my career, and I found that to be much more invasive and questionable. I think when someone sees your face, it’s more like a personal endorsement. That’s why you hear a lot of major movie stars doing voice-overs for TV commercials these days that they would never appear on camera for. If they were on camera, it would be as if they were personally endorsing something, but that’s not a problem if it’s just their voice -  even when people recognize their voice. I honestly don’t spend much time thinking about the way my voice is chopped up and used. I’m much more focused on getting it right when I do the actual recordings, and then I let it go. Also, I think people realize that a computerized TTS voice is just a functional tool more than a real person. 

Kindicted: If your voice kept uttering new phrases after your death (a long time from now), do you feel that you have a more modern degree of immortality than actors or musicians, whose body of work is essentially static?
Tom: Hey, you may be right. I never thought of that. Maybe my TTS voice can do my eulogy. 

Kindicted: Have you ever encountered your own voice in an interesting situation? If so, what was that like?
Tom: Oh yes, all the time. I end up having to converse with myself frequently on the phone. It’s also amusing when I’m waiting in line at CVS, and I hear myself say “One pharmacy call” on the loudspeaker. Or the time a group of us were watching a storm bulletin on TV, and it was me giving the emergency forecast as the voice of the National Weather Service. There are many surreal moments.

Kindicted: Do people recognize your voice as the voice of a GPS, Kindle, voice prompt, etc.?
Tom: If someone asks me what I do, and I tell them, then they recognize it. But not just out of the clear blue. Even when I’m at CVS and having a conversation with the clerk, they don’t recognize that’s also me on the loudspeaker – and I certainly don’t tell them. That’s another beauty of voice-overs…my anonymity. I’m a quiet, introverted person for the most part despite my voice being all over the place, so not being recognized is fine by me.

Kindicted: You don't own your voice in regards to the plethora of devices and systems that use it - does that bother you?
Tom: Not at all. That’s part of the gig.

Kindicted: Are you made aware when your voice will be used in a new device, or do you usually find out after the fact?
Tom: Usually I know because most of my daily work is not TTS. I’m usually recording actual phrases for specific clients that I’m tailoring my voice and presentation for. But with TTS, I don’t always know where my voice ends up until after the fact. I had no idea I’d end up as the voice of the Kindle when we recorded those phrases. It was a thrill for me because I had already become addicted to my first generation Kindle before the TTS one came out. I’ve been a Kindle addict for quite some time.

Kindicted: If you lost your voice, would you use a computer to speak with your own voice, or would you choose a different one?
Tom: I’d probably enjoy the silence. I talk so much for my job that I prefer to be quiet much of the time. 

Kindicted: Do you like hearing the sound of your own voice?
Tom: Well, I’ve certainly become used to it over the years between singing and speaking. When I hear my voice, I’m usually paying close attention to the details and nuances of what I’m saying. I’m usually asking myself questions like, “How might the way I said that make somebody feel? Was it friendly enough, was it too friendly, was it delivered at a nice brisk pace or was it too rushed?” That’s an example of my internal dialogue. 

Kindicted: The process of recording diphones (snippets of words) seems (on the surface) to be physically and mentally demanding - how do you prepare for the process?
Tom: Yes, the work takes a great degree of focus for long stretches at a time. I burn out after about 3 hours of continual recording because of the level of concentration and the physical demands of making my mouth pronounce everything just right.  It’s important to be incredibly consistent, so I just get myself in a good frame of mind before I record. I can’t think about anything else other than what I’m recording. It really takes full concentration, but I enjoy that. I’m someone who’d much rather work intensely for several hours than work all day at a job that has a bunch of downtime.

Kindicted: How long does the typical recording session take (in total)?
Tom: A job can take anywhere from a few minutes to all day. But generally I try to limit any one job to three or four hours to make sure the client is getting the very best product possible.

Kindicted: How closely did you have to work with the scientists and engineers to pronounce the diphones just right?
Tom: We had recorded several versions together in the past, so we were lucky enough to have a lot of trial and error with TTS going back a number of years. The way we decided to go was to just be myself as if I was speaking normally and things I was saying were not going to eventually be chopped up. I think that helped us end up with a more natural sound with this version of TTS. Certainly it’s not as natural as hearing a real voice speaking, but it has come a long way. I really hope people find it helpful.

Kindicted: Did you have to have any speech training, or work with a linguist?
Tom: No, my speech training was all on the job over the years during broadcasting jobs, and many hours listening to recordings of myself and being hyper-critical. The most important element in learning to be good at voice-overs is not how well you talk, but how well you listen to yourself and others.

Kindicted: Do you use your voice talents for audiobooks?
Tom: I have never done an audiobook. I’ve done many types of narration over the years, but never an audiobook. I do listen to them quite often though, and there are some remarkable voice talents out there who read them. I love listening to their presentations.

Kindicted: Are you in demand for other roles (TV, radio, Internet etc.) based on your voice work?
Tom: I’ve done numerous radio and TV commercials over the years, along with many projects for the Internet, training videos, cartoon characters, corporate presentations, movie trailers, and literally thousands of other projects. Now people mainly know me as the phone voice they speak to when they call Bank of America, United, Apple, CVS, and many more. And my TTS voice is the voice of Onstar’s GPS, the National Weather Service, the Phoenix Airport, and of course, the Kindle. 

Kindicted: The Kindle didn't pronounce ‘Obama’ properly - did you have to record that one?
Tom: I actually read about that on my Kindle when the story came out. No, I didn’t re-record it, so they must have fixed it somehow in the technology. I’m glad they did.

Kindicted: For TTS, are you still asked to record new words, diphones, and phrases, or is your body of work large enough that no additional pronunciation is required?
Tom: I’m sure at some point we’ll record some more phrases, but currently I think we’re all set.

Kindicted: Do you still plan to market your voice, or are you concentrating on other endeavors?
Tom: I’m always open to new projects and ideas. I’m lucky in that I have a lot of clients who rely on me at the present moment, but I’m always up for new challenges. I’m still a musician at heart, and I just released a brand new album called “Blue You’ll Do”, which is available at Amazon and iTunes. I’m really happy with the way it turned out, and the reaction so far has been fabulous. This particular album features a unique baritone acoustic guitar, which I bought last year. It has an unusual custom tuning, so it’s half guitar and half bass. I’ve never heard anything like it on a singer-songwriter record. Right now I’m concentrating on promoting that and hopefully getting it into the ears of as many people as possible.

Kindicted: People can still tell that your voice is computer-driven; how long do you feel it will be before a computer-controlled voice will be indistinguishable from a human one?
Tom: That’s a good question. As someone who speaks for a living, I believe there is a human dimension to speech that can never really be replicated by a machine completely. But who knows?

Kindicted: From a personal point of view, do you feel that the ever-increasing use of electronics and electronic communication enriches people's lives, or does it dehumanize to a degree?
Tom: I love technology. Technology allows me to reach millions of people with my music digitally, and it allows me to do my voice-over work from virtually anywhere. Like anything, it has the potential for good and bad in it depending on what it’s used for. But that’s human nature in a nutshell too. I do know what you mean about dehumanizing with all the devices, but hopefully it’s also opening up channels for people to connect in new and beneficial ways too.

Kindicted: Do you ever see a day when computers will be the norm for writing and performing music - including singing?
Tom: Wow, I hope not. I guess to some degree it already is the norm. Singers are made to sound more ‘computerized’ with the Auto-Tune effect. I hope we always value real musicians, singers, and songwriters because that’s really at the core of who we are as human beings.

Kindicted: Tom, thanks for taking the time out to answer a few questions. Best of luck with your new album.