Friday, March 12, 2010

Kindle Text-to-Speech Dissected: Part 4 - Jo Lernout

In the last TTS installment, the lineage of the Kindle's TTS was traced back to 1999 (the creation of RealSpeak), and then 1983 (and earlier) for the roots of Nuance - the company that owns RealSpeak. RealSpeak was originally developed by Lernout & Hauspie, a Belgium-based corporation founded in 1987 with the intention of incorporating computer speech in devices and, later, the Internet. After rapid expansion in the late 1990s, Lernout & Hauspie became caught up in a web of accounting inconsistencies that eventually toppled the company. Those inconsistencies aside, L & H was a company set on changing how people interacted with computers and technology. This is evidenced by the substantial growth and adoption of L & H technologies by ScanSoft (and later Nuance) after they purchased the speech division of L & H.

It has been almost 10 years since the fateful breakup of L & H, and Kindicted wanted to know what happened to the principals - Jo Lernout and Pol Hauspie. Jo Lernout was kind enough to agree to an interview with Kindicted. The topics range from early influences, to Jo's vision of the future of computer interaction.

Jo Lernout has been described as a compulsive entrepreneur; someone who is forever thinking about new inventions and companies. This was certainly true in 1987 when Mr. Lernout and a self-made businessman, Pol Hauspie, founded a company that would eventually become a juggernaut for speech technologies - the aptly named Lernout & Hauspie (or L&H). Regrettably, L&H’s fire burned too bright, and it became a victim of its own runaway success. The ashes of L&H did survive, however, incorporated into many speech products for computers, phones, GPSes, and even the Kindle.

And now, the interview.

You’ve had a brilliant and varied career from teacher to director, inventor, and consultant. 
What influences did you have early on in life that led you to choose a career in science?

As a child, I was interested in science in general. When I was maybe 7 or 8 years old, I read chemistry handbooks and experimented with growing crystals, etc. – the typical hobby kits and toy-chemistry labs. I read and collected books about birds and animals, and every day, studied the local biotope where I grew up (a little village in Flanders, Belgium). So, to me, it was a logical career path to later become a biology and chemistry teacher.

After a few years of teaching, a former teacher-colleague told me that business life is more challenging and interesting, along with the potential to earn a lot more than as a teacher. So, with his advice, I became a medical representative (at Merck Sharp and Dohme); a good combination of sales and (medical) science. I loved that job, because Merck gave numerous courses, ranging from medical topics, to pharmacology, and sales techniques. I did that for 5 years. I enjoyed this quite a lot, since selling seemed to be a natural fit for me. That was from 1972 to 1977.

Communications technology was then still limited to landline phones, telex, and regular mail. Fax and Xerox machines had just been released, and computers were considered to be the domain of a limited number of mainframes and some of the first mini-computers, all contained in secret and sacred cooled rooms, where only a special breed of programmers was allowed.

Who would you say had the greatest influence on your advancement as either a teacher, a businessman, or an inventor?

One of my younger brothers (there were 14 of us!), who was blessed with a very high IQ and a natural ability in math, graduated as a civil engineer in electronics. He was an excellent computer programmer. He told me computers were the future, and programming was not that hard to learn. He convinced me to apply as a sales engineer at what was then Honeywell Bull. They hired me, and I followed a 3-month COBOL programming course, and then started selling mini-computers, mainly for accounting and payroll applications. Again, selling computers was kind of easy for me. In two years I sold many computers to medium and small businesses in Flanders. I am still grateful to my brother for having convinced me to step over to the computer business.

How did you initially become interested in speech technology? 

In 1979, the then-CFO of Wang's local subsidiary, Wang Belgium, convinced me to become sales manager there, and also to enroll in a post-graduate MBA evening course at the University of Ghent (the famous Vlerick Institute).

I am still grateful to that person, since I received formal business management training and was also exposed to some early speech recognition and TTS applications that Wang was experimenting with. At that time, Wang was the absolute world leader in word processing systems. We started using internal email as early as 1982. Wang was also experimenting in those years with a system called "Alliance", which was a keyword-based text search engine that blew my socks off. Essentially, that was "Google" 15 years before Google was founded! Wang was also delivering word processors to companies that were building automatic machine translation technologies. One of them was a division of Siemens, linked to the University of Leuven in Belgium. That division was later sold to a German company, GMS, and L & H bought GMS in 1996.

From then on, the combination of "speech recognition, text-to-speech, search engines, email and automatic translation" technologies became nearly an obsession for me. And today it still is an obsession.

To me, it was clear that this combination of technologies was bound to become reality in various applications at some point in the future. Wang was clearly on that path. I saw my role as lobbying inside Wang Labs headquarters to ensure that other languages besides English would be developed.

What motivated you to start L&H?

In 1987 Wang was still doing OK, but the first signs of decline were already visible. Since 1984, Wang had faced a lot of competition from word processing as an application on PCs. Wang's dominance was over, and Wang entered the PC stage too late. Wang should have licensed its word processing software to Microsoft or IBM – but that is another story.

I was thrilled by the future of the speech and language technologies, and reading about it taught me that most large companies conducting research in those domains were concentrating on English or Japanese only. At Wang, there wasn't much drive to develop these technologies in other languages. I contacted several local Belgian university labs, and learned that some researchers there were working on Dutch and French versions (they later became our main R&D team at L & H).

Around that time, I became acquainted with a self-made businessman, Pol Hauspie, who lived in Flanders, and who had taught himself how to program computers. Pol had developed his own accounting applications, and built a 40-employee software company, which he had recently sold to a larger software company in Belgium.

Pol Hauspie convinced me that the best way to realize the dream of combined speech and language technologies, in one set and in many languages, was to try to do it by ourselves. Together, we decided to start L & H. We put some of our own money in (yes, I sold my house and said goodbye to the healthy salary enjoyed at Wang as a sales and marketing manager) and some money from friends and family. In 1987, we started L & H with about $400,000 USD.

The rest is, as they say, history.

Millions of devices now use technologies from L&H.  When you founded L&H, did you have any idea that the field of computer speech would become so popular?

I dare to say that we indeed had this vision early on. We knew it was a matter of time for the technologies to become more mature, stable and really useful.  We also knew it was only a matter of time before the necessary hardware was available, and small and cheap enough to deploy such technologies on PCs and even mobile devices. Our vision statement in 1990 was audacious enough to state just that.

Our vision was also that, if we were to develop for many languages, we could license these as components to large companies all over the world. We compared our strategy to "Dolby".

We had early licensing talks with AT&T (who invested, in 1993, $5M USD), Apple, Analog Devices, V-Tech (for devices similar to the Kindle), and even early talks with Microsoft and Samsung (1996).

Later on, Microsoft endorsed our multi-lingual approach and vision by investing $60M USD in 1997, and by signing a co-development and co-marketing deal.

Still later on, in late 1999, we made licensing deals, or were in active talks with Delco, Ford, Samsung, LG, Daimler-Chrysler, TeleAtlas, and many others.

In 2000, we even had serious talks with David Wetherell of CMGI, the (then) owner of Alta Vista and Lycos. If that merger had materialized, we would have been "Google" today. Together, we saw the future of mobile devices for mobile internet access, where the combination of speech and language technologies (available in the language of the user) would help mobile users find information in the enormous haystack of multilingual online information. Semantic search engines (such as the ones we acquired from Novell and others in the late nineties), rather than pure keyword matches, would have helped mobile users find more precise answers to their queries. The project we had in mind was dubbed SofIA: Society of Intelligent Assistants. Early wireless prototypes were shown in the spring of 2000. If you analyze Eric Schmidt's current vision, he basically says the same as what we stated in 2000, with regard to what is important to Google and the world: billions of users will dig their information from smart internet phones, and speech and language technologies will have a very important role in this.

Do you have any satisfaction in knowing that technologies that you helped to develop are now assisting millions of people?

I certainly do. The success of Nuance and others (Apple, Google Nexus) currently applying speech and language technologies in numerous devices and applications reinforces the vision we had at L & H.

In the late '90s and early 2000s, we at L & H saw these technologies becoming massively available around 2005. That turned out to be a few years "too optimistic". It is interesting to note that Nuance’s revenue from sales was already over $500M USD in 2005/2006.

Between 1998 and 2005, the use of Nuance and other speech and language technologies was already massive – albeit more in vertical domains. Deployment was less widespread than on internet smart-phones, and included applications such as:

  • Dictation applications (especially Dictaphone) for doctors, lawyers, etc.
  • User Interfaces for the visually impaired; this particular segment gives me a lot of satisfaction, as the technologies are of much help to them.
  • In-car dashboard applications, such as voice controlled GPS and in-car mobile phones
  • A wide variety of educational toys and language study applications.

Do you regularly use any of the technologies that you had a hand in developing or popularizing?

I often use Google Translation. It doesn't give a perfect translation, but it is good enough to pull a Chinese or Russian website into English, and I get the gist of it. That is not former L & H technology (Google's MT is based on a statistical engine, and not on rule-based MT), but it demonstrates that there is a multilingual haystack of information on the net, and MT helps to find what you need, in any major world language. That part of language technology was definitely part of L & H's vision and portfolio.

I don't use speech recognition or TTS. I'll become an avid user when it really functions well on a smart-phone.

If I were a radiologist and I had to dictate daily standard radiology reports, I would certainly use Dragon NaturallySpeaking or specialized versions of Dictaphone (L & H acquired Dictaphone in 2000). It is more productive than dictating first into an old-technology tape-based Dictaphone device, then passing it on to a medical secretary for transcription. It is much cheaper to have good speech recognition do the transcription job in a nearly fully automatic way.

PC-based Dragon NaturallySpeaking (especially the latest versions) is absolutely stunningly good in terms of accuracy, etc. However, even with such high accuracy, it slows me down when I am at my PC; I have to dictate and observe the displayed text at the same time, and this simply confuses me. This is where we were wrong at L & H in terms of assessment of the real market for PC-based dictation engines. In addition, it is kind of weird sitting at your desk and dictating to a PC, even when you are sitting alone in the office, and even more so when others are around.

So, I resort to the “good old keyboard”.

As for the speech recognition on my Nokia smart phone, or the one in my TomTom GPS, well, I stopped using it. Not good enough – too many false recognitions, and it doesn't recognize continuous speech for asking Google or other search engines a query, or to dictate my SMS or emails on it. So, today, it is just faster to tap and touch and type on the keyboard.

What one speech technology do you personally feel has a strong future?

Without any doubt, the so called "vertical" applications I mentioned before will continue to grow, and the speech and language technologies will continue to improve.

The real breakthrough and unparalleled massive usage will arrive (give or take a couple of years) when:

1. Speech recognition works really well on smart phones.

  • This includes continuous speech dictation for dictating SMS, emails, tweets, etc., and for submitting search queries – especially when one is on the move, or when one is driving a car. At that point, well-functioning speech recognition and voice-driven user interfaces will be the preferred way of interacting with smart-phones.

2. Intelligent search becomes a reality.

  • But there is more. In addition to voice-based interaction, very intelligent search technologies will be the key to deliver mobile internet information. Google keyword matches return too many pages, and when one is mobile, one doesn't want to receive millions of hits; one needs a small, concise, and semantically precise answer.  The user must be "understood" not only in terms of "what words were spoken by the user?", but more in the sense of "what is actually meant by the user?"

As an example, the question "is there a Godiva shop in Chicago?" is, semantically, slightly different from the question, "Where can I find a Godiva shop in Chicago?" and again slightly different from the question of a user (being in Chicago, using a smart phone with GPS) asking, "Show me the nearest Godiva shop around here." A user doesn't want to receive the same zillion pages back from Google, with many hits about "Godiva", "Chicago", etc. – no matter what the user actually meant.

Speech recognition and retrieval of mobile information must function really well; to the point where the user finds it much more useful than tapping, touching and typing (on a tiny keyboard) and much faster than scrolling on the small display of a mobile smart phone through a vast amount of returned pages.

I used to proclaim in the late nineties, "There will be either a great market for toothpicks – to tap endlessly on these tiny keyboards on the billions of mobile internet phones – or a great market for speech recognition on these devices!"

Before that "real usefulness" is achieved, speech recognition will have to be "enriched" with embedded semantic understanding, thus not just based on good acoustical and statistical language models. In addition, sophisticated natural language processing will be needed, in order to capture the full meaning of what is queried, and in order to convert that to SQL or text mining features. The goal is to return only really relevant information to the mobile user.

Then and only then will we see massive usage on mobile phones. That is exactly the domain where companies such as MIIAtech are working now: enhancing search engines and speech recognition by means of sophisticated NLP (Natural Language Processing).

Your company developed Text-to-Speech, dictation, and other technologies.  For Text-to-Speech, do you feel that recorded speech (using diphones, etc.) is superior to pure speech synthesis, or vice-versa?

It really depends on the application. Diphone-based TTS is usually a better choice for educational purposes (language learning, for instance), or talking toys and talking avatars: imagine an avatar based on a real life famous person, with his or her voice as TTS.

Blind users, on the other hand, prefer to stick to real synthesized voices, as they are easier for them to understand, and their pitch and speed are easier to manipulate. An in-car GPS device may also benefit more from synthesized TTS, as it differentiates better for routing messages and reading street names.

Do you own an Amazon Kindle, or any other device that uses technologies developed at L&H?

I don't, but after reading your blog, I'll hurry up to get one for myself and the kids around here!

I have heard TTS on the Kindle; it is impressive, and is based on RealSpeak, but with substantial enhancements brought to it since it was acquired by ScanSoft at the end of 2001.

One still notices it is a computer talking, but I am sure this TTS will further evolve. Again, also in this area, adding NLP to the algorithms will help to make TTS really sound like a human being. Semantic clues derived from the context the TTS is reading aloud will be a helping factor to generate the correct prosody.

The technology you developed at L&H can essentially recreate a voice from anyone with a large enough sample of speech - including deceased individuals.  Have you personally had your voice recorded and converted to a TTS voice? 

This is really exciting! At L & H, we were, in the late nineties, using this approach to develop different voices, but the idea of sampling anyone's voice was only in the planning stages back then. Nuance did a great job in bringing this to the next stage (realized by former L & H TTS engineers now at Nuance).

I wish I had my voice recorded back then. Maybe one day!

What is the period in your career that you recall the fondest?

The day Microsoft invested in L & H, and a few years thereafter, when Bill Gates said in a keynote speech at the Etre conference in Southern France to a large audience, filled with executives from the biggest IT and Telecom companies from around the world, "L & H RealSpeak (TTS) is the only voice I want to listen to for longer than a minute for having a device read my emails to me."

Where do you see the computer speech field in 20 years?

By then, we will have passed the point where computers (including the ones we hold in our hands and the ones woven into our clothes) really understand what their human users want: spoken text and commands, multilingual search queries, and instant and perfect translation from one language to any other major world language. Reproduction of spoken translated text will be with your own voiced TTS.

The user interface will include other "commands" we mean and give, such as pointing, or facial expressions, as indicated in the "6th sense" speeches of MIT's Pattie Maes, or as already shown in the latest Google apps: let the camera of the smart phone look at something (a menu in German, for instance), and it tells you what that is and translates it into English.

Users will be "deeply understood" by their computers, meaning that the computers will also reason about what is asked. We see such examples in Wolfram's Alpha applications. Computers will solve real world "puzzles", to the point that humans can ask what best to do in a particular situation. Think of it as having your professor, lawyer, business consultant, interpreter, coach, friend and playmates with you and on you all the time, ready to give very precise answers, advice, solutions, and ready to play along.

The artificial intelligence will look eerily close to real human intelligence, but that doesn't mean computers will be sentient "beings"; they will just act like sentient beings.

People can still tell with a high degree of probability if a speaker is controlled by a computer.  What advancements in the field are required in order to achieve computer speech that is indistinguishable from a human?

Probably a combination of:

  • More and better acoustic modeling and embedded NLP inside speech recognition algorithms,
  • Avatars that speak with excellent TTS (this is just a matter of time and more and cheaper hardware),
  • Avatars that "look" at you when you are speaking (benefit is that the camera can also pass lip-reading clues to the speech recognition algorithm),
  • Sophisticated NLP to make sure that the user is really "understood".

All of that will enhance the "near human" experience of the user interface.

Do you have any regrets with regard to your career as a teacher, businessman, or inventor?

No regrets in terms of having at least tried very hard, and having achieved some part of L&H’s early vision. Others completed and are completing the picture. MIIAtech may well become an important supplier of the world's best NLP. I am grateful to a number of Belgian investors who contributed to the first start-up phase of MIIAtech.

Looking back, I think we would not have lost L&H if we had been more transparent in the way we conducted business. I am still convinced that the accusations (of planned fraudulent revenue bookings) are wrong. We designed a system of franchises, in which investors could invest to pay fees to co-develop language versions with L&H. We only booked these paid-up franchise fees, not the shared revenues expected later from licensing income. I still don't see what was wrong with that. Our lawyers and auditors even advised us it was a good and legal revenue recognition system. Even today, there is still a lot of controversy about that issue.

But I learned this lesson: when an entrepreneur has good intentions, even if his lawyers and auditors approve the accounts, the entrepreneur still has to be very transparent to the entire world when their company is listed on a public stock market. If the company works on "sensitive" technologies, then this company must also make sure to engage in full transparency to all concerned and possibly affected parties.

To be sure: I haven't lost my passion for these technologies, and at the age of 61, I haven't lost my entrepreneurial drive. But I guess it is better to let younger and sharper folks turn vision and technology into shareholder value.

Saturday, March 6, 2010

Kindle Text-to-Speech Dissected: Part 3 - Corporate History

As mentioned in previous articles, TTS has followed a long and winding road, with as many as 50 companies vying for the ultimate prize: a machine that can speak as well as a human.  Over the past 5 years or so, the computer speech industry has consolidated to 4 major companies, which has given an opportunity for a new round of speech-related startups to take a shot at the prize. So far, no TTS technology (without pre-set phrases) has been able to fool a human, but advances in technology have a funny way of 'popping' up all of a sudden.

There have been many colorful figures in TTS’ history, and this series of articles will take a closer look at a few, starting with an in-depth interview with one of the key TTS figures of the last 25 years. But first, a timeline of sorts is required to establish a temporal context on which the rest of the TTS historical articles can be based. With that, the following is a list of companies, name changes, and acquisitions that have led to the TTS technology found in the Kindle today.

Bell Labs' VODER is displayed at the 1939 World's Fair
Stanford Research Institute founded
IBM funds speech research
G. Peterson, W. Wang, and E. Sivertsen produce speech using diphones
IBM Text-to-Speech team formed, including Dr. Michael H. O'Malley
John L. Kelly at Bell Labs uses an IBM 704 to 'sing'
Xerox PARC research facility opened
Cecil H. Coker at Bell Labs converts printed text into speech
Kurzweil Computer Products, Inc. is founded by Dr. Ray Kurzweil to develop character recognition software for any font.
Berkeley Speech Technologies (Text to Speech, Speech Recognition) founded by Dr. Michael H. O'Malley
Xerox purchases Kurzweil Computer Products and runs it as Xerox Imaging Systems (1990-1999), and later as ScanSoft (1999+)
Dragon Systems founded by husband and wife team Dr. James and Janet Baker
Speech Technology and Research (STAR) Laboratory founded as a spinoff of the Stanford Research Institute (SRI)
Eloquent Technology founded in Ithaca, NY by Dr. Susan Hertz
Lernout & Hauspie founded in Belgium by Jo Lernout and Pol Hauspie
Visioneer (Scanner hardware and software) founded by Dr. Denis R. Coleman
Nuance founded as a spinoff of SRI's STAR lab (originally called Corona)
ALTech founded by Mike Phillips
Phonetic Systems founded
Lernout & Hauspie acquires Berkeley Speech Technologies
Lernout & Hauspie acquires an additional 16 speech-related companies
Lernout & Hauspie acquires Kurzweil Applied Intelligence
AT&T launches its Next-Generation TTS, later renamed AT&T Natural Voices
ALTech renamed to SpeechWorks
Visioneer purchases ScanSoft from Xerox and adopts ScanSoft as a company-wide name
Lernout & Hauspie develops RealSpeak; the TTS system that would eventually make its way into the Kindle
Lernout & Hauspie acquires Dragon Systems
SpeechWorks Inc. acquires Eloquent Technology
Rhetorical Systems Inc. founded in Edinburgh, Scotland
ScanSoft acquires Lernout & Hauspie's Speech and Language division
ScanSoft acquires Philips Speech Processing division
ScanSoft acquires SpeechWorks Inc.
ScanSoft acquires Rhetorical Systems Ltd.
ScanSoft acquires Phonetic Systems Ltd.
ScanSoft merges with Nuance and changes company-wide name to Nuance
Nuance acquires an additional 20 speech-related companies
Amazon selects Nuance technologies' RealSpeak to provide TTS in Kindles
Amazon releases the Kindle 2 and DX with TTS

A [rough] graphical version of the Timeline is available here.

Look for the next Kindicted article in the TTS series: an interview with...someone named on the above list!  Until then, happy reading!