Corporate Research & Development Center

Development of Voice Design Technology for Easy and Intuitive Creation of a Wide Variety of Text-to-Speech Voices -For Quick Creation of Low-Cost Speech Content-

2016/3

Overview

Toshiba has developed a voice design technology that enables easy and rapid creation of various voices in a text-to-speech (TTS) system. This new technology allows users to create a wide variety of TTS voices by controlling elements of voice characteristics such as the age, gender, and brightness of the voice. With a graphical user interface (GUI) incorporating this technology, users can not only create tens of thousands of different voices, but also efficiently and intuitively design the specific voice they have in mind.
With this technology, users can quickly and inexpensively produce speech content that fits target use cases. The details of this technology will be announced at the 2016 Spring Meeting of the Acoustical Society of Japan, which will be held at Toin University in Yokohama on March 9-11, 2016.

Background

TTS is starting to be used to produce speech content for a wide variety of platforms, such as car navigation systems, audiobooks, educational materials, and video games. The future will see an increasing trend toward the Internet of Things, as more and more devices get connected to the Internet. The use of TTS is expected to increase along with this trend, becoming incorporated into applications such as voice advertisements, video content production, communication robots, and online education services. To produce this diverse speech content effectively, users need an easy method to obtain appropriate types of voices that match the types of content they are creating.
To date, the selection of voices for TTS has been limited because users had to choose from pre-existing voice samples. Conversely, when many samples were available, it was difficult to find a voice with the desired features.

Features

With the voice design technology developed by Toshiba, users can design a wide variety of TTS voices easily by controlling voice characteristics, instead of just selecting a single voice from prepared voice samples.
This new technology includes a statistical model called the "perceptual element space model." Using a proprietary model optimization method, this model decomposes differences in voice characteristics among speakers into a number of perceptual elements, including age, gender, brightness, clearness, and tightness, and then sums the weighted elements. The voice characteristics can then be freely controlled by changing the weight assigned to each element to produce any kind of voice.

Figure: Perceptual element space model
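
The weighted sum described above can be illustrated with a minimal sketch. This is not Toshiba's implementation: the source does not disclose the model's internals, so the function name `synthesize_voice_params`, the use of a baseline `mean_voice` vector, and the plain-list representation of acoustic parameters are all assumptions made for illustration.

```python
from typing import Dict, List


def synthesize_voice_params(
    mean_voice: List[float],
    element_vectors: Dict[str, List[float]],
    weights: Dict[str, float],
) -> List[float]:
    """Return mean_voice + sum over elements of weight * element_vector.

    Assumption: each perceptual element (age, gender, brightness, ...) is
    represented by one direction vector in the acoustic parameter space,
    and a voice is the baseline plus the weighted sum of those directions.
    """
    params = list(mean_voice)  # copy so the baseline is not modified
    for name, weight in weights.items():
        vector = element_vectors[name]
        if len(vector) != len(params):
            raise ValueError(f"dimension mismatch for element {name!r}")
        for i, component in enumerate(vector):
            params[i] += weight * component
    return params


# Hypothetical 3-dimensional parameter space with two perceptual elements.
mean_voice = [1.0, 2.0, 3.0]
element_vectors = {
    "age": [1.0, 0.0, 0.0],
    "brightness": [0.0, 0.5, 0.0],
}

# Changing the weights moves the voice along the corresponding elements.
voice = synthesize_voice_params(
    mean_voice, element_vectors, {"age": 2.0, "brightness": -2.0}
)
print(voice)
```

Under these assumptions, each slider in a GUI would correspond to one entry in `weights`, so adjusting a slider changes the synthesized voice along a single perceptual element.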

In addition, on the basis of this model, we have developed a GUI that allows the easy and intuitive creation of TTS voices. We selected a small set of perceptual elements that can be controlled in the GUI based on a statistical analysis of a large-scale subjective evaluation by many participants.
Furthermore, we incorporated into the GUI an "impression-to-perceptual transformation model," which determines the coordinates in the "perceptual element space" that correspond to a given impression, such as cute, intelligent, or polite. This model lets users easily produce a desired type of voice. With the GUI, users can select a voice via facial images or via voice impressions. In addition, users can easily and intuitively adjust the perceptual axes to modify the age, gender, brightness, and other features of the TTS voice so that it matches the voice they envision.
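
The workflow in the preceding paragraph (pick an impression, then fine-tune along perceptual axes) can be sketched as follows. The source does not describe the form of the transformation model, so a simple lookup table stands in for it here. The names `IMPRESSION_COORDINATES`, `impression_to_coordinates`, and `adjust_axis` are hypothetical, and the coordinate values are placeholders invented for illustration, not measured data.

```python
from typing import Dict

# Placeholder coordinates in a hypothetical perceptual element space.
# Assumption: a real system would derive these from a trained model.
IMPRESSION_COORDINATES: Dict[str, Dict[str, float]] = {
    "cute": {"age": -1.0, "gender": 1.0, "brightness": 1.0},
    "intelligent": {"age": 0.5, "gender": 0.0, "brightness": 0.0},
    "polite": {"age": 0.5, "gender": 0.5, "brightness": 0.5},
}


def impression_to_coordinates(impression: str) -> Dict[str, float]:
    """Return a copy of the perceptual coordinates for an impression."""
    if impression not in IMPRESSION_COORDINATES:
        raise KeyError(f"unknown impression: {impression!r}")
    return dict(IMPRESSION_COORDINATES[impression])


def adjust_axis(
    coordinates: Dict[str, float], axis: str, delta: float
) -> Dict[str, float]:
    """Return new coordinates with one perceptual axis shifted by delta."""
    if axis not in coordinates:
        raise KeyError(f"unknown perceptual axis: {axis!r}")
    adjusted = dict(coordinates)  # leave the input untouched
    adjusted[axis] += delta
    return adjusted


# Start from the "cute" impression, then make the voice sound slightly older.
start = impression_to_coordinates("cute")
older = adjust_axis(start, "age", 0.5)
print(older)
```

Returning copies rather than mutating the table keeps each adjustment independent, which mirrors a GUI in which the user can return to the original impression preset at any time.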


Outlook

Toshiba will continue its research and development of this technology, with the aim of making it available through the RECAIUS™ cloud service during the 2016 fiscal year. The RECAIUS™ cloud service understands people's situations and intentions from speech and video data and conveys this understanding to people in a simple form, assisting a great number of people in a wide variety of activities.

*RECAIUS is a trademark of Toshiba Corporation.