Friday, February 26, 2010

Kindle Text-to-Speech Dissected: Part 2 - TTS History

Many Kindicts love the text-to-speech (TTS) feature of the Kindle 2 and DX, which reads books aloud in a user-selectable voice.  Whether TTS can read a book's text depends on whether the book is TTS 'enabled' - an interesting subject on its own, but not the focus of this article, the second in a series on TTS technologies.  Below, Kindicted presents a very brief history of the 'speech' portion of TTS.

History is littered with 'odd' individuals who were fixated on the technology du jour - distant evolutionary cousins of today's uber-geeks.  In that vein, every few hundred years or so, a historical figure became obsessed with building a machine or apparatus that could mimic the human voice.  The motive behind such early obsessions is not entirely clear; suffice it to say that since the primary means of human communication is speech, a talking machine could net a profit if put to some practical use.  Regardless of the motivation, these individuals made strides in the analysis of human speech, and occasionally the yardsticks of knowledge were moved forward a bit.

Tube-Tied
Early machines consisted of user-modifiable tubes and bellows to produce vowel sounds.  The subsequent addition of a mechanical tongue and lips enabled consonant sounds to be produced (along with a likely side effect: lonely inventors more adept at kissing).  The advent of the telephone renewed interest in the study of human speech, and by the late 1930s Homer Dudley, an engineer at Bell Labs, had developed an electromechanical (i.e. non-digital) speech synthesizer dubbed the VODER (Voice Operating DEmonstratoR), based on research by fellow Bell scientists led by Harvey Fletcher.  The techniques used for speech synthesis in the VODER are still used in today's synthesis hardware - albeit with many refinements.  Note that this voice synthesis is separate from voice encoding (the VOCODER), which was originally invented as a means of coding speech for transmission over phone lines, but was subsequently adopted by musicians as an interesting vocal effect.

Hooked on Phonemes
Post-1950, speech research focused on the phonetic elements of speech.  A phoneme is the smallest unit of sound that can distinguish one utterance from another (e.g. the 't' sound in 'sat' or 'test').  The production of phonemes during speech produces energy (in the form of sound pressure waves) that can be recorded and analyzed.  By recording a general model of each phoneme of human speech, an electronic representation of a language can be built and labeled; the English language contains 37 to 47 phonetic elements, depending on how they are counted.  Playing back the recorded phonemes in the right sequence, and at the proper speed, produces crude synthesized speech.  Early systems that produced speech in this manner were barely intelligible.  Humans are very sensitive to even minute variations in speech, which makes clear speech synthesis quite difficult.
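
To make the playback idea concrete, here is a minimal sketch (not the Kindle's actual engine) of naive phoneme-level concatenation, assuming each phoneme has already been recorded as a short clip of audio samples.  The phoneme labels, clip lengths, and sample rate are all hypothetical placeholders.

```python
# Naive phoneme-level concatenation: string pre-recorded phoneme clips
# together in order. Real systems do far more smoothing; this is only a
# sketch of the basic idea, with made-up phoneme labels and silent clips.

SAMPLE_RATE = 16000  # samples per second (assumed)

# Hypothetical database: phoneme label -> list of audio samples.
# Silence is used here as a stand-in for real recordings.
phoneme_clips = {
    "S":  [0.0] * 1600,   # ~0.1 s placeholder clip
    "AE": [0.0] * 2400,
    "T":  [0.0] * 800,
}

def synthesize(phoneme_sequence):
    """Concatenate the clip for each phoneme, in order."""
    samples = []
    for label in phoneme_sequence:
        samples.extend(phoneme_clips[label])
    return samples

# 'sat' as a rough phoneme sequence: S-AE-T
audio = synthesize(["S", "AE", "T"])
print(len(audio) / SAMPLE_RATE, "seconds of (placeholder) audio")
```

The abrupt joins between clips are exactly why speech built this way sounded so unnatural - the next step, diphones, was aimed at smoothing those transitions.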

The human auditory system is the original sound transcoder, transforming sound pressure waves into electrical signals to be processed by the brain.  No fewer than five distinct areas of the brain stem and brain participate in the detection and recognition of sound.  People are particularly sensitive to their own language, and can even detect (at better than chance levels) whether or not an unseen speaker is smiling while they speak.  This extraordinary sensitivity is one of the reasons why people are so adept at detecting when a computer is speaking or controlling speech.  So far, no one has developed a non-scripted speech system that is indistinguishable from a human speaker - but that does not mean that companies haven't tried.

Please answer the Diphone
In order to raise the quality of computerized speech, researchers moved away from phonemes and towards diphones.  Simply put, a diphone is the sound spanning two adjacent phonemes, from a point halfway into the first phoneme to a point halfway into the second.  Diphones (and sometimes half-syllables and triphones) are important in producing a natural-sounding synthesized voice, since the sound of a phoneme is modified slightly by the sound of the next phoneme.  There are roughly 1,400 diphones in the English language, corresponding to the 'allowable' combinations of phonemes.  Strictly speaking, the number of diphones could be as high as the square of the number of phonemes, but many phoneme combinations never appear in spoken language.  By using the diphone approach, researchers were able to greatly increase the intelligibility of a synthesized voice, since there were many more realistic phoneme combinations to choose from when constructing the sound.
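
A small sketch of the bookkeeping behind those counts: pairing adjacent phonemes into diphones, and comparing the theoretical maximum (the phoneme count squared) with the much smaller set that actually occurs in speech.  The phoneme labels are illustrative, not the real English inventory.

```python
# Turn a phoneme sequence into the diphone sequence a concatenative
# synthesizer would need, and compare the theoretical upper bound on
# diphone count with the figure quoted in the article.

def to_diphones(phonemes):
    """Adjacent phoneme pairs: mid-first-phoneme to mid-second-phoneme units."""
    return list(zip(phonemes, phonemes[1:]))

# 'sat' needs two diphones: S-AE and AE-T
print(to_diphones(["S", "AE", "T"]))   # [('S', 'AE'), ('AE', 'T')]

# With 47 phonemes, the square is 2,209 possible pairs, yet only
# around 1,400 combinations actually occur in spoken English.
n_phonemes = 47
print(n_phonemes ** 2)                 # 2209 theoretical combinations
```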

Live versus Lip Synch
In the 1980s, a battle royale of sorts was unfolding among speech researchers.  In one corner were the fundamentalists: researchers who believed that the purest and most flexible sound came from rule-based speech synthesis.  This included programs that modeled airflow, tongue position, lips, etc. - a sort of digital representation of the early tube-based apparatus, coupled with an extensive database of rules describing how different phonemes are paired.  In the other corner were the concatenation-based zealots.  This group believed that the key to realistic speech was to build words out of a pre-recorded database of diphones (from a human speaker).  Since the Kindle's TTS uses a concatenation-based system, this series of articles will not cover rule-based speech synthesis in any great detail; suffice it to say that there are pros and cons to each approach, and neither has clearly 'won'.

In the first installment of this series, Tom Glynn (the TTS voice of the Kindle) indicated that his diphone recording sessions consisted of reading phrases.  These phrases were selected to cover the 1,400 diphones; the recorded segments were sent through another application that analyzed Tom's speech and automatically split it into diphone segments, which were stored in a database.  These speech segments are dynamically selected and concatenated to create speech - but how does the computer know which segments to concatenate?  The interpretation of text is the subject of a future article in this series.
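
The selection step can be sketched as a simple lookup: once text has been converted to phonemes (the subject of that future article), the phonemes are paired into diphones and each pair keys into the recorded database.  Everything below - the labels, the database entries, the clip lengths - is a hypothetical stand-in for the real pipeline.

```python
# Sketch of concatenative selection: look up each diphone in a database of
# recorded segments and splice the segments together. The entries below
# are placeholders, not real recordings.

diphone_db = {
    ("S", "AE"): [0.0] * 2000,   # placeholder audio samples
    ("AE", "T"): [0.0] * 1800,
}

def concatenate(phonemes):
    """Build an utterance by chaining recorded diphone segments."""
    audio = []
    for pair in zip(phonemes, phonemes[1:]):
        segment = diphone_db.get(pair)
        if segment is None:
            raise KeyError(f"no recording for diphone {pair}")
        audio.extend(segment)
    return audio

utterance = concatenate(["S", "AE", "T"])
print(len(utterance), "samples spliced from 2 diphone segments")
```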

By the mid-1980s, the majority of the TTS technology in use today had been developed; efforts have since concentrated on interpreting text, applying proper pronunciation and emphasis (a.k.a. prosody), and supporting non-English languages.  In 1987, a corporation was founded with the vision of bringing computerized speech to the masses.  The corporation met its goal, but at a high cost to its owners.  Next week's installment examines the corporate roots of TTS, from university-funded initiatives in the 1960s to the market leaders of today (and the acquisition of over 30 companies in between).
