DSP-based system-on-chip moves speech recognition from the lab to portable devices

By Zhizuo Yang and Jia Liu, Tsinghua University, and Eric Chan, Lim Cheow Guan and Chen Kim Chin, Infineon Technologies AG

Jan 30, 2007

As speech compression/decompression and recognition technologies mature, speech processing is becoming an important form of man-machine interface. The systems found in laboratories today, however, are mostly large and complex, and are usually implemented on computer platforms.

These speech-recognition systems focus primarily on large-vocabulary continuous speech. Their speech encoding/decoding algorithms are so complex that they depend on a PC's computing power. Such hardware requirements are hard to reconcile with the needs of portable, low-power and low-cost embedded systems.

To date, the Tsinghua/Infineon Uni-Speech SoC has shown good performance in accurate speech recognition and low-rate high-quality speech encoding/decoding. However, the economics and power consumption of the current solution limit its potential in portable speech-recognition applications.

This opens up opportunities for Uni-Lite, an SoC that addresses cost and power consumption. The Uni-Lite chip architecture comprises a 16-bit DSP core, on-chip ROM and RAM, an embedded delta-sigma (ΔΣ) ADC/DAC with their respective I/O analog channels, and other common interfaces. Uni-Lite is an SoC embedded with speech-processing firmware that can be developed into an application without the need for secondary on-board devices.

Hardware architecture
The Uni-Lite contains not only a DSP core and a codec, but also I/O analog channels that serve the microphone and speakers. It also carries on-chip RAM and ROM, and communication with external devices is handled through the on-chip UART, GPIO lines and SPI. With the exception of the power supply, microphone and speakers, all the system hardware components are integrated into a single chip (Figure 1, below).

Figure 1: The Uni-Lite contains not only a DSP core and codec, but also I/O analog channels that serve the microphone and speakers. Communication with external devices is handled through the on-chip UART, GPIO lines and SPI.

The OAK DSP core is a high-performance 16-bit (data and program buses) fixed-point DSP core. All the I/O units are connected to it through intermediary digital logic.

The DSP core is designed to operate at up to 104MHz. Because the chip is designed for portability, however, the operating speed of the DSP can be programmed to various rates, from complete hibernation through 1-104MHz. In terms of computational capability, the DSP core provides a throughput of 1 Dhrystone MIPS per MHz of operation.

On-chip memory consists of RAM and ROM. Most of the program code and data used by the algorithms are stored in the ROM to minimize both the silicon area and the cost of the SoC. The application layer resides in RAM, which keeps the whole system flexible at the application level.

The codec consists of a 12-bit on-chip DAC and a 12-bit ADC. The sampling rate of the ADC and DAC can be programmed to either 8kHz or 16kHz, which allows speech processing for different frequency bandwidths. The input analog channel of the ADC has a programmable gain amplifier (PGA) with a gain range of 0 to 42dB, while the output analog channel of the DAC has a PGA with a range of 0 to -18dB and a programmable band-pass filter of 2.5-10kHz.
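As a rough illustration of these figures, the sketch below converts PGA settings in decibels to linear amplitude factors and maps a sample onto a signed 12-bit code. The function names, the 1.0 full-scale reference and the clamping scheme are assumptions made for illustration; the actual codec register interface is not described in this article.

```python
def db_to_amplitude(gain_db: float) -> float:
    """Convert a gain in dB to a linear amplitude (voltage) factor."""
    return 10.0 ** (gain_db / 20.0)


def quantize_12bit(sample: float) -> int:
    """Map a sample in [-1.0, 1.0] to a signed 12-bit code.

    Assumes a two's-complement range of -2048..2047 (an illustrative choice).
    """
    code = round(sample * 2048)
    return max(-2048, min(2047, code))


# End points of the gain ranges quoted in the text.
max_input_gain = db_to_amplitude(42.0)    # roughly 126x
min_output_gain = db_to_amplitude(-18.0)  # roughly 0.126x
```

In other words, the input PGA can boost a weak microphone signal by about two orders of magnitude before it reaches the 12-bit converter.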

A power management unit provides general-purpose power management and clock-rate control capabilities. Because the power consumption of the system is closely related to the rate of the system clock, the unit can control the clock inputs to all major blocks of the design. It can also put the chip into hibernation until the whole system is woken up by one of the defined wake-up sources.

A watchdog timer is also available. It enables the system to recover from unexpected endless loops and hang-ups, which enhances the chip's operational reliability.

An SPI interface enables the use of external memories that use SPI. In the case of Uni-Lite, the SPI interface serves two purposes. First, it is the source for downloading firmware during system boot-up: the system loads the first batch of contents of the SPI EEPROM into the internal program RAM, with the aid of either a hardware bootloader or a software bootloader. This double-bootloader design contributes to stable system operation. Second, the SPI interface can function as a general-purpose interface to external serial-interface memories.

The GPIO provides general-purpose serial communication and control capabilities, and also serves as a wake-up source when the system is in sleep mode. The typical rate of operation is 10MHz.

Power consumption
For an SoC designed for portable systems, size and power consumption are the two most important design factors. The Uni-Speech SoC, by contrast, was developed specifically for use in advanced stationary speech-recognition applications.

Uni-Lite, which was jointly developed by Infineon Technologies and Tsinghua University, focuses on a seamless system that integrates an advanced speech-recognition algorithm with a cost-efficient SoC.

Power consumption is tightly correlated with the operating clock rate in any digital system. The DSP core was designed to operate at varying rates, according to the function module in use. The real-time G.723 coder needs a high operating rate because of the computational complexity of the compression algorithm.

The G.723 decoder, on the other hand, requires a much lower operating rate. Mel Frequency Cepstral Coefficient (MFCC) feature extraction also requires a lower rate. The recognition mechanism needs a higher operating rate to minimize the delay after the end of speech. The operating rate of the chip can be modified according to the delay requirement, the scale of the vocabulary and the complexity of the template (Table 1 below).

Table 1: The operating rate of the chip can be modified according to the delay requirement, scale of vocabulary and complexity of the template.
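The idea of running each function module at its own clock rate can be sketched as follows. The 104MHz ceiling and the 1 MIPS per MHz throughput come from the text; the workload figures are hypothetical placeholders (they are not the values of Table 1), and the 20 percent headroom and the linear power model are simplifying assumptions for a fixed supply voltage.

```python
import math

MAX_CLOCK_MHZ = 104  # upper limit stated for the DSP core
MIPS_PER_MHZ = 1     # throughput figure stated for the DSP core

# Hypothetical workloads in MIPS; these are NOT the values of Table 1.
HYPOTHETICAL_LOAD_MIPS = {
    "g723_encode": 40,
    "g723_decode": 8,
    "mfcc": 12,
    "recognition": 60,
}


def clock_for(module: str, headroom_percent: int = 20) -> int:
    """Return the lowest whole-MHz clock covering the module's load plus headroom."""
    load = HYPOTHETICAL_LOAD_MIPS[module]
    needed_mhz = load * (100 + headroom_percent) / (100 * MIPS_PER_MHZ)
    clock = max(1, math.ceil(needed_mhz))
    if clock > MAX_CLOCK_MHZ:
        raise ValueError(f"{module} does not fit within {MAX_CLOCK_MHZ}MHz")
    return clock


def relative_dynamic_power(clock_mhz: int) -> float:
    """Dynamic power relative to full speed, assuming it scales linearly with clock."""
    return clock_mhz / MAX_CLOCK_MHZ
```

With these placeholder numbers, the decoder would run at a small fraction of the encoder's clock, and application software would switch the rate whenever it moves from one module to another.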

Varying the operating rate of the semi-real-time speech application thus constrains peripheral usage and limits computational requirements. This reduces power in some portions of the chip, which contributes to the overall power savings. With the correct settings in the application software, the entire Uni-Lite SoC can be put into hibernation mode, in which it consumes current only in the microampere range (Table 2 below).

Table 2: With correct settings in the applications software, the entire Uni-Lite SoC can be put into hibernation mode where it only consumes current in the range of microamperes.

Software architecture
A full speech interface is embedded on the SoC, comprising guidance prompts, speech talk-back and speech recognition.

This software set (Figure 2, below) is composed of endpoint detection, MFCC feature extraction, small vocabulary speaker-independent recognition and encoding/decoding objects. Other algorithms such as speaker-dependent recognition and speaker identification on this chip are under development.

Figure 2: The entire software system is partitioned into three levels—application, service and driver.

The entire software system is partitioned into three levels: application, service and driver. The driver level mainly manipulates the hardware and peripherals of the chip, and serves as a soft device to the upper level. With this structure, only minor revisions need to be made for the whole system to function when the external devices are changed.
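The three-level split can be pictured with a small sketch. All class and function names here are hypothetical, and the energy measurement merely stands in for a real speech-functional object; the point is only that the service and application levels see the driver through a narrow interface, so a different external device means a different driver class and nothing else.

```python
from typing import List, Protocol


class AudioInput(Protocol):
    """Interface the driver level exposes to the service level (hypothetical)."""

    def read_frame(self) -> List[int]: ...


class OnChipAdcDriver:
    """Driver level: stands in for the on-chip ADC channel."""

    def __init__(self, samples: List[int], frame_len: int = 4) -> None:
        self._samples = samples
        self._frame_len = frame_len
        self._pos = 0

    def read_frame(self) -> List[int]:
        frame = self._samples[self._pos:self._pos + self._frame_len]
        self._pos += self._frame_len
        return frame


class EnergyMeter:
    """Service level: a functional object that only knows the AudioInput interface."""

    def __init__(self, source: AudioInput) -> None:
        self._source = source

    def next_energy(self) -> float:
        frame = self._source.read_frame()
        return sum(s * s for s in frame) / len(frame)


def application(meter: EnergyMeter, frames: int) -> List[float]:
    """Application level: assembles service objects into a product feature."""
    return [meter.next_energy() for _ in range(frames)]
```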

The service level contains basic speech-functional objects. The performance of a template-based word-recognition system is very sensitive to variations in the detected endpoints. Endpoint detection is even more difficult on a speech-interface chip, since all of the processing must be done in real time on the hardware.

A two-stage endpoint-detection method was developed to cope with this difficulty. In the first stage, endpoint detection is based on energy and zero-crossing rate.

This process is simple enough to be time-synchronous. However, it gives only rough active-voice boundaries. Features are then extracted and recorded only for the speech frames within these boundaries. Bypassing the silence in this way saves storage space and lowers the DSP's average computing burden, which allows a lower operating rate.
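A first stage of this kind can be sketched in a few lines. This is a minimal illustration, not the firmware's implementation: the 160-sample frame (20ms at 8kHz), both thresholds and the rule for combining energy with zero-crossing rate are assumptions chosen for the example.

```python
def frame_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(s * s for s in frame) / len(frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def rough_endpoints(samples, frame_len=160, energy_thresh=0.01, zcr_thresh=0.25):
    """Return (first_frame, last_frame) of active voice, or None if all is silence.

    Thresholds are illustrative assumptions, not values from the article.
    """
    active = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = frame_energy(frame)
        loud = energy > energy_thresh
        # Weak but rapidly alternating frames may be unvoiced consonants.
        hissy = energy > energy_thresh / 10 and zero_crossing_rate(frame) > zcr_thresh
        if loud or hissy:
            active.append(start // frame_len)
    if not active:
        return None
    return active[0], active[-1]
```

Each frame is judged as it arrives, using only two cheap measurements, which is what makes a time-synchronous implementation practical.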

The second-stage endpoint detection uses more of the information generated by feature extraction, such as the energy of different frequency bands. The algorithm can also search both forward and backward from the rough endpoints, and it updates the search threshold based on the whole of the current word.

As a result, the second stage gives a far more accurate endpoint location. The two-stage scheme balances efficiency and accuracy well, and supports high-performance feature extraction.
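The second-stage search can be sketched along the following lines. This is only an illustration under stated assumptions, not the authors' algorithm: the per-frame band energies are simply summed, the threshold is taken as a fixed fraction of the loudest frame in the current word, and the search is limited to a few frames either side of each rough endpoint.

```python
def refine_endpoints(band_energies, rough_start, rough_end, ratio=0.1, max_shift=5):
    """Refine rough endpoints using a threshold adapted to the current word.

    band_energies: one list of band energies per frame.
    ratio and max_shift are illustrative assumptions.
    """
    totals = [sum(bands) for bands in band_energies]
    # Threshold derived from the whole current word, not a fixed constant.
    threshold = ratio * max(totals[rough_start:rough_end + 1])

    start = rough_start
    # Search backward: include earlier frames that still carry speech energy.
    while start > 0 and rough_start - start < max_shift and totals[start - 1] >= threshold:
        start -= 1
    # Search forward: drop leading frames that fall below the threshold.
    while start < rough_end and start - rough_start < max_shift and totals[start] < threshold:
        start += 1

    end = rough_end
    # Search forward: include later frames that still carry speech energy.
    while end < len(totals) - 1 and end - rough_end < max_shift and totals[end + 1] >= threshold:
        end += 1
    # Search backward: drop trailing frames that fall below the threshold.
    while end > start and rough_end - end < max_shift and totals[end] < threshold:
        end -= 1

    return start, end
```

Because this pass runs on features that have already been extracted and stored, it does not have to keep pace with the incoming audio the way the first stage does.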

The continuous-density Hidden Markov Model (HMM) method, based on both word models and subword models, is also implemented in this level to achieve text- and speaker-independent high-accuracy recognition. This means that new vocabulary can easily be entered as text on a computer and downloaded to the chip.

A multipass decoding algorithm is also embedded in the system. A simple template is first used to select the most likely words in a short time; a much more precise template is then used to determine which of them is the final result. This saves both memory and recognition time, thus enhancing the recognition performance of the SoC.
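The two-pass idea can be shown with a toy example. The scoring functions below are simple distance measures standing in for the HMM likelihoods used on the chip, and the vocabulary, templates and shortlist size are all hypothetical.

```python
def two_pass_recognize(features, vocabulary, coarse_score, fine_score, n_best=3):
    """Shortlist words with a cheap score, then pick the winner with a precise one."""
    shortlist = sorted(
        vocabulary, key=lambda word: coarse_score(features, word), reverse=True
    )[:n_best]
    return max(shortlist, key=lambda word: fine_score(features, word))


# Hypothetical templates: one short feature sequence per word.
TEMPLATES = {
    "yes": [1.0, 2.0, 3.0],
    "no": [3.0, 2.0, 1.0],
    "go": [2.0, 2.0, 2.0],
    "stop": [5.0, 5.0, 5.0],
}


def mean(values):
    return sum(values) / len(values)


def coarse(features, word):
    """Cheap pass: compare only the average level (higher is better)."""
    return -abs(mean(features) - mean(TEMPLATES[word]))


def fine(features, word):
    """Precise pass: compare the full sequence (higher is better)."""
    return -sum((f - t) ** 2 for f, t in zip(features, TEMPLATES[word]))
```

For the input `[1.1, 2.0, 2.9]`, the coarse pass cannot separate "yes", "no" and "go" but does rule out "stop", so the precise comparison is run on three candidates rather than four. On a real vocabulary the shortlist would be a much smaller fraction of the whole.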

Also implemented on the SoC is the ITU G.723.1 speech algorithm, which provides a low-rate, good-quality speech coding method that has been successfully applied in very narrow-band videoconferencing.

The high compression ratio also allows longer speech to be stored in a given amount of data-storage space. Most of the code and data in this level are stored in the ROM, which significantly reduces silicon area and, in turn, power and hardware cost.
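The storage benefit is easy to estimate. G.723.1 defines two bit rates, 6.3kbit/s and 5.3kbit/s, with 30ms frames of 24 and 20 bytes respectively. The sketch below compares them with uncompressed 12-bit samples at 8kHz; the 64KB storage size is an arbitrary example, and the article does not say which of the two rates the firmware uses.

```python
FRAME_MS = 30                       # G.723.1 frame duration
FRAME_BYTES = {6300: 24, 5300: 20}  # bytes per frame at each G.723.1 bit rate


def seconds_stored(storage_bytes: int, bit_rate: int) -> float:
    """Seconds of G.723.1 speech that fit in the given storage (whole frames only)."""
    frames = storage_bytes // FRAME_BYTES[bit_rate]
    return frames * FRAME_MS / 1000


def pcm_seconds(storage_bytes: int, sample_rate_hz: int = 8000, bits_per_sample: int = 12) -> float:
    """Seconds of uncompressed, tightly packed PCM that fit in the same storage."""
    return storage_bytes * 8 / (sample_rate_hz * bits_per_sample)


STORAGE = 64 * 1024  # arbitrary example size in bytes
high_rate = seconds_stored(STORAGE, 6300)  # about 82 s
low_rate = seconds_stored(STORAGE, 5300)   # about 98 s
raw = pcm_seconds(STORAGE)                 # about 5.5 s
```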

The application level is the most variable portion of the system. With the implemented software architecture, this can be easily changed according to specific applications. With the support of relevant service-level speech-functional objects, any new application software can be built up rapidly.

Because the software system is flexible and configurable, each service-level speech-functional object can be freely integrated into the system application firmware.

The embedded software can be built up using the provided high-performance functional objects and is designed to enable applications to be easily assembled within a very short time.

The SoC is an optimized solution for embedded applications such as toys, voice-based remote controls and speech recorders. Future developments include real-time recognition, with the ability to recognize much longer speech in a short time on the SoC. Speaker-dependent recognition and speaker identification will also be developed to satisfy more complex applications.

Zhizuo Yang and Jia Liu are with the Department of Electronic Engineering at Tsinghua University, Beijing, China. Eric Chan is Staff Engineer, Lim Cheow Guan is Senior Engineer and Chen Kim Chin is IC Design Manager in the ASIC Design and Security Department at Infineon Technologies AG.




All materials on this site Copyright © 2010 TechInsights, a Division of United Business Media LLC All rights reserved.