Toshiba's Speech Recognition AI Technology is the World's First for
High Speed Keyword Detection with Individual Speaker Recognition
-Speaker recognition delivers user-specific operation of home appliances-

20 Feb, 2020
Toshiba Corporation

TOKYO─Toshiba Corporation (TOKYO: 6502, hereinafter "Toshiba") has developed the world's first AI technology that can bring fast recognition of speakers and keywords to all kinds of electronic products, without any need for internet connectivity or cloud resources for processing. Home appliances integrating the technology will be able to register individual speakers with only three utterances, and to adjust their operation to the speaker who gives a voice command.

Toshiba will present the details of this technology at ICPRAM 2020 (International Conference on Pattern Recognition Applications and Methods) to be held on 22-24 February 2020 in Malta.

Speech recognition technology promises a natural interface between people and machines, and is also attracting attention as a means to increase workplace efficiency and overcome labor shortages. A growing number of appliances can already be operated through keyword detection and recognition, pointing the way toward a global market that is expected to reach 2.3 trillion yen by 2024.

Toshiba is advancing the capabilities of voice-operated devices with an innovation that delivers fast keyword and speaker detection, and that also adjusts device operation to the speaker's preferences. For example, an air-conditioner set in operation by a voice command will also adjust its temperature setting to suit the user who made the command (Figure 1).

Figure 1: The user and her preferences are recognized

This is not a simple process. Keyword detection and speaker recognition both require large numbers of calculations, which are typically executed remotely on a cloud platform or on high-end devices such as smartphones. Making such capabilities a native feature of home appliances and other devices requires high-speed AI technology that can be embedded in the devices themselves.

Toshiba has developed just that. Its new AI can simultaneously and quickly execute keyword detection and speaker recognition, all without any need for network connectivity or remote processing power. The technology has two core features.

The first feature is the use of intermediate outputs from keyword detection for effective speaker registration and recognition. The AI must first detect keywords by separating audio information from ambient noise. Its neural network does this by processing spoken input while absorbing the effects of ambient noise. Speaker registration and recognition are performed using the intermediate outputs of that neural network (Figure 2), an approach that suppresses the effects of ambient noise on speaker recognition and also greatly reduces the time required to recognize the speaker. This secures high-speed operation even with constrained resources.

Figure 2: Application of information in keyword detection
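
The idea of reusing intermediate outputs can be illustrated with a minimal sketch. It is not Toshiba's implementation: the two-layer network, its random weights, the layer sizes, the averaging of hidden activations into a speaker profile and the cosine similarity score are all assumptions made for illustration. The point it shows is only that a single forward pass can yield both keyword scores and a vector usable for speaker recognition.

```python
import math
import random

def make_layer(n_out, n_in, rng):
    """Random weights standing in for trained parameters (illustrative only)."""
    return [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]

def matvec(weights, x):
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def forward(frames, w_hidden, w_out):
    """One pass over an utterance: returns (keyword scores, mean hidden activation)."""
    hidden_sum = [0.0] * len(w_hidden)
    score_sum = [0.0] * len(w_out)
    for frame in frames:
        hidden = [math.tanh(v) for v in matvec(w_hidden, frame)]
        scores = matvec(w_out, hidden)
        hidden_sum = [a + b for a, b in zip(hidden_sum, hidden)]
        score_sum = [a + b for a, b in zip(score_sum, scores)]
    n = len(frames)
    return [s / n for s in score_sum], [h / n for h in hidden_sum]

def enroll(utterances, w_hidden, w_out):
    """Average the intermediate outputs of a few utterances into one profile."""
    embeddings = [forward(u, w_hidden, w_out)[1] for u in utterances]
    return [sum(col) / len(col) for col in zip(*embeddings)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 0.0 if na == 0 or nb == 0 else dot / (na * nb)

# Hypothetical sizes: 4 features per frame, 8 hidden units, 2 keywords.
rng = random.Random(0)
w_hidden = make_layer(8, 4, rng)
w_out = make_layer(2, 8, rng)

def fake_utterance():
    return [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(10)]

utterances = [fake_utterance() for _ in range(3)]  # three enrollment utterances
profile = enroll(utterances, w_hidden, w_out)

# The same pass serves keyword detection and speaker recognition.
keyword_scores, speaker_embedding = forward(utterances[0], w_hidden, w_out)
similarity = cosine(profile, speaker_embedding)
```

Because the hidden activations are computed anyway for keyword detection, the speaker vector in this sketch costs only a running sum per frame, which is in line with the speed argument made above.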

The second feature is the use of data expansion methodology in the neural network. Data expansion is a method for learning from small amounts of data, in this case spoken utterances. By randomly assigning zero weight to connections between neural network nodes, simulated voice information can be generated, as if a speaker had spoken in various ways (Figure 3). Successful identification of individuals is based on the AI learning from their speech samples, and this method recognizes particular speakers even when only a small number of utterances are available. Toshiba has reduced the number of required speech samples to a point where the new AI technology can complete user registration with only three utterances.

Figure 3: Application of data expansion methodology in a neural network
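
A minimal sketch of this kind of data expansion follows. It assumes a single tanh layer with random weights, a masking probability of 0.2 and ten simulated variants per utterance; none of these values comes from Toshiba, and the function names are hypothetical. The sketch only illustrates the mechanism described above: each connection weight is independently set to zero at random, so the same utterance yields several slightly different outputs. This kind of random masking resembles what is commonly called dropout.

```python
import math
import random

def mask_weights(weights, drop_prob, rng):
    """Copy the weights, zeroing each connection with probability drop_prob."""
    return [[0.0 if rng.random() < drop_prob else w for w in row]
            for row in weights]

def embed(features, weights):
    """Output of one tanh layer for a single utterance-level feature vector."""
    return [math.tanh(sum(w * x for w, x in zip(row, features)))
            for row in weights]

def expand(utterances, weights, copies, drop_prob, rng):
    """Generate `copies` simulated variants of each utterance's output."""
    expanded = []
    for features in utterances:
        for _ in range(copies):
            masked = mask_weights(weights, drop_prob, rng)
            expanded.append(embed(features, masked))
    return expanded

# Hypothetical sizes: 4 input features, 8 output units.
rng = random.Random(0)
weights = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(8)]

# Three registration utterances, each reduced to one feature vector here.
utterances = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(3)]

# Three real utterances become thirty simulated training samples.
expanded = expand(utterances, weights, copies=10, drop_prob=0.2, rng=rng)
```

With `drop_prob` set to 0 the variants collapse to the unmodified output, so the masking probability controls how much variation is simulated around each real utterance.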

A comparative evaluation based on three utterances per registered speaker found that Toshiba's method achieved an identification accuracy of 89% for 100 people, while i-vector, a commonly used method for speaker recognition, reached only 71%. As devices such as home appliances are expected to have five to 10 registered speakers at most, this level of performance is considered sufficient for practical application. Furthermore, the amount of computation and the processing speed were measured on a server, and it was confirmed that neither would be problematic, even in an embedded system.

As its next step, Toshiba will work toward incorporating the technology in embedded systems and investigating its utility in home appliances and other use cases. The company is also reviewing the opportunity to develop new services, such as application in the communication AI "RECAIUS™" developed by Toshiba Digital Solutions Corporation.

  About Toshiba's RECAIUS™ communication AI

RECAIUS™ is a service that supports decision-making suited to the circumstances of human activity. It does so by efficiently gathering knowledge from the sites where that activity takes place. It invigorates communication between humans and systems by integrating natural language processing technology and knowledge processing technology. RECAIUS™ aims to realize a mechanism that will allow everyone to carry out their work in comfort, achieve an efficient working style and enjoy a pleasant lifestyle.

For more information, see the RECAIUS™ website: https://www.toshiba-sol.co.jp/pro/recaius/ (in Japanese only)

  • RECAIUS is a registered trademark of Toshiba Digital Solutions Corporation in Japan and other countries.
  • Other company and product names used here may be trademarks or registered trademarks of their respective companies.