With the development and maturity of speech compression/decompression
and recognition, speech processing is becoming an important form of
man-machine interface. Currently, the systems found in laboratories are
mostly large and complex, and their implementations are usually based
on computer platforms.
The primary focus of these speech-recognition systems is
large-vocabulary continuous speech. Their speech encoding/decoding
algorithms are so complex that they depend on a PC's computing
power. Such hardware requirements are hard to reconcile with the
needs of portable, low-power and low-cost embedded systems.
To date, the Tsinghua/Infineon Uni-Speech SoC has shown good
performance in accurate speech recognition and low-rate, high-quality
speech encoding/decoding. However, the economics and power consumption
of the current solution limit its potential in portable
speech-recognition applications.
This opens up opportunities for Uni-Lite, an SoC that addresses cost
and power consumption. The Uni-Lite chip architecture comprises a 16-bit
DSP core, on-chip ROM and RAM, an embedded delta-sigma
ADC/DAC with their respective I/O analog channels, and other
common interfaces. Uni-Lite is an SoC embedded with speech-processing
firmware that can be developed into an application without the need for
secondary on-board devices.
Hardware architecture
The Uni-Lite contains not only a DSP core and a codec, but also I/O analog
channels that serve the microphone and speakers, as well as on-chip RAM
and ROM. Communication with external devices is handled through the
on-chip UART, GPIO lines and SPI. With the exception of the power
supply, microphones and speakers, all the system hardware components
are integrated into a single chip (Figure 1, below).
Figure 1: The Uni-Lite contains not only a DSP core and codec, but also
I/O analog channels that serve the microphone and speakers, on-chip RAM
and ROM, and on-chip UART, GPIO lines and SPI.
The OAK DSP core is a 16-bit (data
and program buses), high-performance fixed-point DSP core. All the I/O
units are connected to it through intermediary digital logic.
The DSP core is designed for operation at up to 104MHz. However, because
the design targets portable use, the operating speed of the
DSP can be programmed to various rates, from complete hibernation
to 1-104MHz. In terms of computational capability, the DSP core
provides a throughput of 1 Dhrystone MIPS per MHz of operation.
On-chip memory consists of RAM and ROM. Most of the program code and
data used by the algorithms are stored in ROM to minimize both silicon
area and SoC cost. The application layer resides in RAM, which keeps
the application of the whole system flexible.
The codec consists of a 12-bit on-chip DAC and a 12-bit ADC. The sampling
rate of the ADC and DAC can be programmed to either 8kHz or 16kHz, which
allows speech processing for different frequency bandwidths. The input
analog channel of the ADC has a programmable gain amplifier (PGA) with a
gain range of 0 to 42dB, while the output analog channel of the DAC has a
PGA with a range of 0 to -18dB and a programmable band-pass filter of
2.5-10kHz.
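To put the codec figures in linear terms, the following sketch converts
a gain in decibels to a voltage ratio and derives the usable bandwidth
from the sampling rate. It assumes the quoted gains are voltage gains
(the article does not say which), and the function names are
illustrative only.

```python
def db_to_voltage_ratio(gain_db):
    """Convert a gain in dB to a linear ratio, assuming voltage gain (20*log10)."""
    return 10.0 ** (gain_db / 20.0)


def nyquist_bandwidth_hz(sample_rate_hz):
    """Highest frequency representable at a given sampling rate (half the rate)."""
    return sample_rate_hz / 2.0


# Endpoints of the ranges quoted in the text.
adc_pga_max = db_to_voltage_ratio(42.0)    # strongest input amplification
dac_pga_min = db_to_voltage_ratio(-18.0)   # strongest output attenuation

# The two programmable sampling rates and the bandwidth each one allows.
bandwidths = {fs: nyquist_bandwidth_hz(fs) for fs in (8000, 16000)}
```

Under this assumption, the 42dB end of the input range corresponds to a
gain of roughly 126 times, and the 8kHz and 16kHz sampling rates allow
speech bandwidths of up to 4kHz and 8kHz respectively.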
A power management unit provides general-purpose power management
and clock-rate control capabilities. It can control the
clock inputs to all major blocks of the design, because the power
consumption of the system is closely related to the rate of the system
clock. The power management unit can also put the chip into a state of
hibernation until the whole system is 'woken up' by the defined
sources.
A watchdog timer is also available. It enables the system to recover
from unexpected endless loops and hang-ups, which enhances the chip's
operational reliability.
An SPI interface enables the use of external memories that use SPI. In
Uni-Lite, the SPI interface serves two purposes. In the first, it is
the source from which firmware is downloaded during system boot-up: the
system loads the first batch of contents of the SPI EEPROM into the
internal program RAM, with the aid of either a hardware bootloader or a
software bootloader. This double-bootloader design contributes to
stable system operation. In the second, the SPI interface functions as
a general-purpose interface to external serial-interface memories.
The GPIO provides general-purpose serial communications and control
capabilities, and also serves as a wake-up source when the system is
put into sleep mode. The typical rate of operation is 10MHz.
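The recovery behaviour of the watchdog timer mentioned above can be
illustrated with a toy model. This is a generic sketch of how a
watchdog works, not a description of the Uni-Lite registers; the class
name, the tick-based timeout and the reset counter are all assumptions
made for illustration.

```python
class Watchdog:
    """Toy watchdog: counts down every tick and must be kicked before reaching zero."""

    def __init__(self, timeout_ticks):
        self.timeout = timeout_ticks
        self.remaining = timeout_ticks
        self.resets = 0

    def kick(self):
        """Called by healthy firmware to restart the countdown."""
        self.remaining = self.timeout

    def tick(self):
        """Advance time by one tick; on expiry, record a reset and restart."""
        self.remaining -= 1
        if self.remaining == 0:
            self.resets += 1              # a real chip would reset the system here
            self.remaining = self.timeout


wd = Watchdog(timeout_ticks=3)
for _ in range(2):        # healthy main loop: work, then kick in time
    wd.tick()
wd.kick()
healthy_resets = wd.resets

for _ in range(3):        # firmware stuck in an endless loop: no more kicks
    wd.tick()
stuck_resets = wd.resets
```

As long as the firmware kicks the timer in time, no reset occurs; once
it stops, the countdown expires and the system is restarted.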
Power consumption
For an SoC designed for portable systems, size and power consumption are
the two most important factors to consider in the design. Uni-Speech,
by contrast, was developed specifically for use in advanced stationary
speech-recognition applications.
Uni-Lite, which was jointly developed by Infineon Technologies and
Tsinghua University, focuses on a seamless system that
integrates an advanced speech-recognition algorithm with a
cost-efficient SoC solution.
Power consumption is tightly correlated with the operating clock rate
in any digital system. The DSP core was designed to operate at varying
rates, according to the function module. The real-time G.723 coder
needs a high operating rate because of the computational complexity of
the compression algorithm.
The G.723 decoder, on the other hand, requires a much lower operating
rate, as does the algorithm for Mel Frequency Cepstral Coefficient
(MFCC) feature extraction. The recognition mechanism needs a higher
operating rate to minimize the delay after the end of speech. The
operating rate of the chip can be modified according to the delay
requirement, the scale of the vocabulary and the complexity of the
template (Table 1 below).
Table 1: The operating rate of the chip can be modified according to the
delay requirement, scale of vocabulary and complexity of the template.
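The idea of matching the clock to each function module can be sketched
as follows. The 1-104MHz range and the figure of 1 MIPS per MHz come
from the text; the per-module loads are hypothetical placeholders and
are not the values of Table 1, and whole-MHz clock steps are an
assumption.

```python
import math

MIN_MHZ, MAX_MHZ = 1, 104    # programmable clock range quoted in the text
MIPS_PER_MHZ = 1.0           # throughput figure quoted in the text


def clock_for_load(required_mips):
    """Lowest whole-MHz clock that covers the load, clamped to the minimum rate."""
    mhz = math.ceil(required_mips / MIPS_PER_MHZ)
    if mhz > MAX_MHZ:
        raise ValueError("load exceeds what the core can deliver")
    return max(MIN_MHZ, mhz)


# HYPOTHETICAL loads in MIPS, for illustration only (not taken from Table 1).
example_loads = {"encoder": 60.0, "decoder": 8.5, "feature_extraction": 12.0}
clocks = {name: clock_for_load(mips) for name, mips in example_loads.items()}
```

Running each module at the lowest sufficient rate, rather than at the
maximum, is what allows the power savings described in this section.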
The ability to vary the operating rate of the semi-real-time speech
application thus constrains peripheral usage and limits
computational requirements. This reduces power in some
portions of the chip, which contributes to the overall power
savings. With the correct settings in the application software, the
entire Uni-Lite SoC can be put into hibernation mode, where it consumes
current only in the microampere range (Table 2 below).
Table 2: With correct settings in the applications software, the entire
Uni-Lite SoC can be put into hibernation mode where it only consumes
current in the range of microamperes.
Software architecture
A full speech interface is embedded on the SoC, with guidance-prompt,
speech talk-back and speech-recognition functions.
This software set (Figure 2, below) is composed of endpoint
detection, MFCC feature extraction, small-vocabulary
speaker-independent recognition and encoding/decoding objects. Other
algorithms, such as speaker-dependent recognition and speaker
identification, are under development for this chip.
Figure 2: The entire software system is partitioned into three
levels: application, service and driver.
The entire software system is partitioned into three levels:
application, service and driver. The driver level mainly manipulates
the hardware and peripherals of the chip, and serves as a soft device
to the upper level. With this structure, only minor revisions need to
be made for the whole system to function when the external devices are
changed.
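The "soft device" idea can be illustrated with a small interface
sketch. All class and function names here are hypothetical and are not
taken from the Uni-Lite firmware; the point is only that service-level
code depends on the interface, so swapping the device behind it
requires no change at the upper level.

```python
from abc import ABC, abstractmethod


class AudioInputDriver(ABC):
    """Hypothetical soft-device interface that the service level codes against."""

    @abstractmethod
    def read_frame(self, n_samples):
        """Return a list of n_samples integer samples."""


class OnChipAdcDriver(AudioInputDriver):
    """Stand-in for a driver backed by the on-chip ADC (here it returns silence)."""

    def read_frame(self, n_samples):
        return [0] * n_samples


class ReplayDriver(AudioInputDriver):
    """Stand-in for a different device: replays stored samples, zero-padded."""

    def __init__(self, samples):
        self._samples = list(samples)
        self._pos = 0

    def read_frame(self, n_samples):
        chunk = self._samples[self._pos:self._pos + n_samples]
        self._pos += n_samples
        return chunk + [0] * (n_samples - len(chunk))


def frame_energy(driver, n_samples=4):
    """Service-level code: uses only the interface, never the concrete device."""
    return sum(s * s for s in driver.read_frame(n_samples))
```

For example, `frame_energy(OnChipAdcDriver())` and
`frame_energy(ReplayDriver([1, 2, 3]))` run the same service-level code
against two different devices.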
The service level contains basic speech functional objects. The
performance of a template-based word-recognition system is very
sensitive to variations in the detected endpoints. Endpoint detection
is even more difficult on a speech-interface chip, since all the
processing must be done in real time on the hardware.
Two-stage endpoint detection was developed to cope
with this difficulty. In the first stage, endpoint detection is based
on energy and zero-crossing rate.
This process is simple enough to be time-synchronous. However, it
gives only rough active-voice boundaries. Speech frames within these
boundaries are then processed, and their features are extracted and
recorded. Bypassing the silence saves storage space and
lowers the DSP's average computing burden, which allows a
lower operating rate.
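A minimal sketch of a first-stage detector based on energy and
zero-crossing rate is given below. It is not the authors'
implementation: the frame length of 80 samples (10ms at 8kHz), the
thresholds and the decision rule are all assumptions chosen for
illustration.

```python
def frame_features(frame):
    """Return (short-time energy, zero-crossing count) for one frame of samples."""
    energy = sum(s * s for s in frame)
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return energy, crossings


def rough_endpoints(samples, frame_len=80, energy_hi=1000, energy_lo=100, zc_thr=20):
    """Return (first, last) active frame index, or None if no frame is active.

    A frame counts as active if it is loud, or if it is weak but has many
    zero crossings (typical of unvoiced consonants). Thresholds are assumed.
    """
    active = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        energy, crossings = frame_features(samples[start:start + frame_len])
        loud = energy > energy_hi
        weak_but_noisy = energy > energy_lo and crossings > zc_thr
        if loud or weak_but_noisy:
            active.append(start // frame_len)
    return (active[0], active[-1]) if active else None


# Silence, a weak high-frequency frame, a loud frame, then silence again.
signal = [0] * 80 + [2, -2] * 40 + [100, -100] * 40 + [0] * 80
boundaries = rough_endpoints(signal)
```

Each frame needs only one pass over its samples, which is why this
stage can keep up with the incoming audio.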
The second-stage endpoint detection uses more information generated
by feature extraction, such as the energy of different frequency
bands. A new feature allows the algorithm to search both forward and
backward from the endpoints, and to update the search threshold based
on the current whole word.
As a result, the second-stage endpoint detection gives a far
more accurate endpoint location. This is considered a well-balanced
mechanism between efficiency and accuracy, and results in
high-performance feature extraction.
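The refinement step can be sketched as follows. This is a simplified
illustration, not the authors' algorithm: it uses a single energy value
per frame rather than band energies, and the rule of setting the
threshold to a fixed fraction of the word's peak energy is an
assumption.

```python
def refine_endpoints(frame_energy, start, end, ratio=0.05):
    """Adjust rough (start, end) frame indices using a word-adaptive threshold."""
    # Threshold derived from the whole word found by the first stage.
    threshold = ratio * max(frame_energy[start:end + 1])

    # Search outward: extend while the neighbouring frame is above threshold.
    while start > 0 and frame_energy[start - 1] > threshold:
        start -= 1
    while end < len(frame_energy) - 1 and frame_energy[end + 1] > threshold:
        end += 1

    # Search inward: trim boundary frames that are not above threshold.
    while start < end and frame_energy[start] <= threshold:
        start += 1
    while end > start and frame_energy[end] <= threshold:
        end -= 1
    return start, end


# The rough boundaries were too tight in the first case, too loose in the second.
widened = refine_endpoints([0, 1, 8, 100, 90, 7, 0, 0], 3, 4)
narrowed = refine_endpoints([0, 2, 100, 90, 1, 0], 1, 4)
```

Because the threshold depends on the word itself, a quiet utterance and
a loud one are treated consistently.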
The continuous-density Hidden Markov Model (HMM) method, based on
both word models and subword models, is also implemented at this level
to achieve text- and speaker-independent high-accuracy recognition.
This means that new vocabulary can easily be entered as text on a
computer and downloaded to the chip.
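To illustrate how a continuous-density HMM scores an utterance, the
sketch below runs the Viterbi algorithm over a left-to-right model. It
is heavily simplified and is not the firmware's implementation: each
state uses a single one-dimensional Gaussian instead of a mixture over
MFCC vectors, and the toy models and their parameters are invented for
the example.

```python
import math


def log_gauss(x, mean, var):
    """Log density of a 1-D Gaussian (stand-in for a multivariate mixture)."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def viterbi_score(obs, means, variances, log_stay, log_next):
    """Best-path log score of obs under a left-to-right HMM.

    Each state may stay (log_stay) or advance by one (log_next); the path
    must start in the first state and end in the last one.
    """
    n = len(means)
    neg_inf = float("-inf")
    score = [neg_inf] * n
    score[0] = log_gauss(obs[0], means[0], variances[0])
    for x in obs[1:]:
        new = [neg_inf] * n
        for j in range(n):
            best = score[j] + log_stay
            if j > 0:
                best = max(best, score[j - 1] + log_next)
            new[j] = best + log_gauss(x, means[j], variances[j])
        score = new
    return score[-1]


def recognize(obs, models):
    """Pick the word whose model gives the highest Viterbi score."""
    return max(models, key=lambda word: viterbi_score(obs, *models[word]))


# Toy two-state models: (means, variances, log_stay, log_next). Invented values.
half = math.log(0.5)
models = {
    "low": ([0.0, 1.0], [1.0, 1.0], half, half),
    "high": ([5.0, 6.0], [1.0, 1.0], half, half),
}
word = recognize([0.1, 0.2, 0.9, 1.1], models)
```

Working in the log domain turns products of probabilities into sums,
which suits a fixed-point DSP.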
A multipass decoding algorithm is also embedded in the system. In the
first pass, a simple template is used to select the most likely words
in a short time. In the next pass, a much more precise template is used
to determine which of them is the final result. This saves both memory
and recognition time, thus enhancing the recognition performance of the
SoC.
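The coarse-then-fine idea can be sketched generically. The scoring
functions, the toy templates and the shortlist size below are
assumptions for illustration; the article does not describe the actual
templates used on the chip.

```python
def two_pass_decode(features, vocabulary, coarse_score, fine_score, shortlist_size=2):
    """Shortlist words with a cheap scorer, then rescore only the shortlist."""
    ranked = sorted(vocabulary, key=lambda w: coarse_score(features, w), reverse=True)
    shortlist = ranked[:shortlist_size]
    return max(shortlist, key=lambda w: fine_score(features, w))


# Toy templates, invented for the example.
templates = {
    "yes": [1, 5, 1, 5],
    "no": [1, 1, 1, 1],
    "stop": [9, 9, 9, 9],
    "go": [1, 5, 1, 4],
}


def coarse(features, word):
    """Cheap score: compares only the average level of the two sequences."""
    t = templates[word]
    return -abs(sum(features) / len(features) - sum(t) / len(t))


def fine(features, word):
    """Precise score: negative squared distance over every element."""
    return -sum((a - b) ** 2 for a, b in zip(features, templates[word]))


result = two_pass_decode([1, 5, 1, 4], list(templates), coarse, fine)
```

Only the shortlisted words pay the cost of the precise comparison, which
is where the time saving comes from.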
Also implemented on the SoC is the ITU-T
G.723.1 speech codec, which provides a low-rate, good-quality speech
coding method that has been successfully applied in very narrowband
videoconferencing.
Its high compression rate also allows longer speech recordings in a
given data-storage space. Most of the code and data in this level are
stored in ROM. This results in a significant reduction of silicon
area, which translates to lower power and lower hardware cost.
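The effect of the compression rate on recording time can be worked out
with a few lines of arithmetic. The frame duration and coded frame
sizes are those of the G.723.1 standard, not figures from this article;
the 64KB storage size is a hypothetical example, and the uncompressed
comparison assumes 12-bit samples at 8kHz.

```python
FRAME_MS = 30                                        # G.723.1 frame duration
FRAME_BYTES = {"6.3 kbit/s": 24, "5.3 kbit/s": 20}   # coded frame size per rate


def coded_seconds(storage_bytes, frame_bytes, frame_ms=FRAME_MS):
    """Seconds of speech held by the whole coded frames that fit in storage."""
    return (storage_bytes // frame_bytes) * frame_ms / 1000.0


def raw_seconds(storage_bytes, sample_rate_hz=8000, bits_per_sample=12):
    """Seconds of uncompressed speech that fit in the same storage."""
    return storage_bytes * 8 / (sample_rate_hz * bits_per_sample)


STORAGE = 64 * 1024   # HYPOTHETICAL storage size in bytes, for illustration only

durations = {rate: coded_seconds(STORAGE, size) for rate, size in FRAME_BYTES.items()}
uncompressed = raw_seconds(STORAGE)
```

With these assumptions, the same memory holds roughly 82 to 98 seconds
of coded speech, against about 5.5 seconds of uncompressed samples.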
The application level is the most variable portion of the system.
With the implemented software architecture, this can be easily changed
according to specific applications. With the support of relevant
service-level speech-functional objects, any new application software
can be built up rapidly.
Because the software system is flexible and configurable, each
service-level speech-functional object can be freely integrated into
the system application firmware.
The embedded software can be built from the provided
high-performance functional objects, and is designed so that
applications can be assembled within a very short time.
The SoC is an optimized solution for embedded applications such as
toys, voice-based remote controls and speech recorders. Future
developments include real-time recognition, with the ability to
recognize much longer speech in a short time on the SoC.
Speaker-dependent recognition and speaker identification will also be
developed to satisfy more complex applications.
Zhizuo Yang and Jia Liu are with
the Department of Electronic Engineering at Tsinghua University,
Beijing, China. Eric Chan is Staff Engineer, Lim Cheow Guan is
Senior Engineer and Chen Kim Chin is IC Design Manager in the ASIC
Design and Security Department at
Infineon Technologies AG.