Corporate Research & Development Center


Toshiba's Latest Advance in Voice Recognition can Distinguish Multiple Individual Speakers, Without Special Training


Toshiba has taken a major step forward in speech recognition with the development of a technology able to precisely distinguish between and capture the utterances of individual speakers in real time, even when multiple voices are speaking at once.

Recent improvements in speech recognition technologies have pointed the way towards more efficient automated transcription of business meetings and record-keeping of conversations with clients. There is one hurdle to overcome, however: recognition reliability falls off when multiple people speak simultaneously. While technologies have been developed for separating simultaneous speech, they have had to be trained for each recording environment, including the acoustic characteristics of the location and the positioning of the speakers, and that training has required dozens of minutes of recordings to achieve optimal separation.

Toshiba's technology identifies speakers precisely and in real time and captures each voice separately, even when multiple people are speaking at once (Note 1), and delivers high-precision recognition and transcription of each speaker, using a microphone array embedded in a single sound input device.

High-precision transcription alleviates the need for manually keeping minutes in business meetings, and allows an increased focus on analysis of customer opinions and the improvement of staff manuals. Transcriptions of meetings with customers from overseas can also be used for automatic translation systems.

A problem with previous sound-source separation systems is that they require dozens of minutes of pre-recorded speech for system training in order to create a sufficiently precise separation filter for each speech source (person). Toshiba's novel method (Note 2) replaces this time-consuming direct learning of the filters with learning of spatial characteristics that represent each speaker's position, derived from the positioning of the microphones. This achieves high-performance separation, supported by continuous filter updates according to the environment, with approximately double the separation precision of previous techniques (Note 3).

In operation, the new system rapidly determines the relative positions of the speakers by matching the differences in the time at which each speaker's voice arrives at each microphone against an association table that relates those time differences to sound direction. This technique allows the capture and separation of each individual's voice, even when there are simultaneous utterances and without any previous recordings at the location.
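The idea of matching arrival-time differences against a direction table can be illustrated with a minimal sketch. This is not Toshiba's implementation: it assumes just two microphones, a far-field source, and a simple sine model for the delay, and the names `build_delay_table` and `lookup_direction`, the 10 cm spacing, and the 5-degree step are all hypothetical choices made for illustration.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, approximate value in air at room temperature


def build_delay_table(mic_spacing_m, step_deg=5):
    """Map each candidate direction (degrees) to the expected
    arrival-time difference (seconds) between two microphones.

    Assumes a far-field source, so delay = spacing * sin(angle) / c.
    """
    table = {}
    for angle in range(-90, 91, step_deg):
        table[angle] = mic_spacing_m * math.sin(math.radians(angle)) / SPEED_OF_SOUND
    return table


def lookup_direction(table, measured_delay_s):
    """Return the table direction whose expected delay is closest
    to the measured arrival-time difference."""
    return min(table, key=lambda angle: abs(table[angle] - measured_delay_s))


# Hypothetical setup: two microphones 10 cm apart.
table = build_delay_table(mic_spacing_m=0.10)

# Delay that a source at 30 degrees would produce under the same model.
delay = 0.10 * math.sin(math.radians(30)) / SPEED_OF_SOUND
print(lookup_direction(table, delay))  # 30
```

Because the table is computed once from the microphone geometry, the lookup itself needs no recordings made at the location, which is the property the paragraph above describes.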

Voice separation and capture technology

Toshiba continues to research the technology, with the aim of including it in 2017 in RECAIUS™, a Toshiba-developed cloud-based service that supports various human activities for understanding the intent and the conditions of others in audio and visual recordings.

* Part of the presented achievement is based on collaboration with Associate Professor Dr. Nobutaka Ono of the National Institute of Informatics (Japan) from Apr. 2011 to Mar. 2015. (added on Nov. 4, 2016)

(Note 1)
Voice separation is possible for up to as many speakers as there are microphones (e.g. three microphones allow separation of up to three speakers' voices).
(Note 2)
The fundamental algorithm of the method was developed in collaboration with Associate Professor Dr. Nobutaka Ono of the National Institute of Informatics (Japan) from Apr. 2011 to Mar. 2015. (added on Nov. 4, 2016)
(Note 3)
When separating simultaneous speech by two speakers, the amount of suppression of the second person's speech was improved from 3 to 9 dB, an approximate doubling.
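The arithmetic behind Note 3 can be checked with a short sketch. It assumes, as one plausible reading that the source does not state, that the "approximate doubling" refers to the 6 dB gain expressed as an amplitude ratio; the helper names are hypothetical.

```python
def db_to_amplitude_ratio(db):
    """Convert a decibel difference to an amplitude ratio (20 * log10 convention)."""
    return 10 ** (db / 20)


def db_to_power_ratio(db):
    """Convert a decibel difference to a power ratio (10 * log10 convention)."""
    return 10 ** (db / 10)


gain_db = 9 - 3  # improvement in suppression reported in Note 3

print(round(db_to_amplitude_ratio(gain_db), 2))  # 2.0
print(round(db_to_power_ratio(gain_db), 2))      # 3.98
```

Under this reading, a 6 dB improvement corresponds to roughly a factor of two in amplitude, or roughly a factor of four in power.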

  About RECAIUS™

RECAIUS™ is a cloud-based AI service that supports various human activities for understanding the intent and conditions of others in audio and visual recordings. This service combines and systematizes various technologies for media knowledge processing (media intelligence) that Toshiba has developed over many years, including speech recognition, speech synthesis, translation, interpreting, intent understanding, and image recognition of faces and individuals. RECAIUS™ contributes to the creation of new lifestyles and business opportunities.