16 March 2010

QCELP Vocoders in CDMA Systems Design

In wideband spread-spectrum systems, QCELP voice coders (vocoders) are the speech coding standard. When designing a CDMA system, it is important to understand the basic principles and key features of the vocoders, including functional structures and signal-processing flows in each module.

By Derek Q. Wang
Qualcomm code excited linear prediction (QCELP) 8-kbps and 13-kbps voice coders (vocoders) have become the North American speech service option standards for the wideband spread spectrum digital communication systems TIA IS-96A (May 1995) and TIA IS-733 (March 1998), respectively. QCELP vocoders are full-duplex speech encoders and decoders that produce near wireline quality speech at variable compressed data rates. QCELP vocoders are already implemented in several general-purpose digital signal processor (DSP) chips such as AT&T DSP16xx, OakDSP, and TI TMS320C54x. They are also widely used in code division multiple access (CDMA) wireless communication systems.

This article describes the basic principles and key features of QCELP vocoders, including functional structures and signal-processing flow in each major module. The 13-kbps QCELP and 8-kbps QCELP employ the same methods in a majority of their modules. However, the 13-kbps QCELP includes a second stage of encoding data rate determination, called rate reduction (as opposed to the single stage of rate determination in the 8 kbps vocoder). The 13-kbps QCELP also uses vector quantization to quantize line spectral pair (LSP) frequencies, instead of the 8-kbps QCELP’s method, which uses scalar quantization. This article focuses on the 13-kbps vocoder.

QCELP vocoders
The statistical characteristics of a speech signal can be captured by a source-filter model. Modeling speech this way allows it to be compressed significantly, so the communications channel can be used efficiently to serve more users. The source-filter model assumes that speech is the result of exciting linear time-varying filters with a source signal. The excitation source signal is modeled as either a periodic impulse train for voiced speech, such as vowel sounds, or random noise for unvoiced speech, such as consonants. The linear time-varying filters usually include the formant synthesis filter, also called the linear predictive coding (LPC) synthesis filter, and the pitch synthesis filter. Speech compression/decompression is also called speech coding. The two basic speech-coding methods for data rates between 4.8 kbps and 16 kbps are analysis-and-synthesis (AaS) and analysis-by-synthesis (AbS).

In the AaS approach, an analyzer in the transmitter analyzes the original speech, and extracts a set of parameters that represent some kind of source-filter model. These parameters are then transmitted out to the receiver, where a synthesizer reconstructs the speech based on the received parameters. In this system, the distortion throughout the whole coding process is difficult to check and control, because the analyzer and the synthesizer are located separately in the transmitter and receiver.

In the AbS approach, both an analyzer and a local synthesizer reside in the transmitter, so the synthesized speech is available there for analysis. A closed-loop, trial-and-error procedure determines the optimum parameters in the transmitter. In the receiver, these parameters reconstruct the synthetic speech, which matches the original signal with minimum perceptual error. Compared to the AaS approach, the AbS approach can produce higher-quality speech at low data rates, but the encoder in the transmitter is more complex.

The basic CELP algorithm is one of the AbS methods widely used in low-bit-rate speech coding. For example, the US Government Federal Standard uses CELP at 4.8 kbps, and the ITU-T G.728 standard uses low-delay (LD)-CELP at 16 kbps. The North American IS-54 TDMA mobile standard also uses vector sum excited linear predictive (VSELP) coding at 8 kbps. All of these vocoders run the CELP algorithm at a fixed data rate. QCELP is also a CELP algorithm, but it differs from traditional CELP in that it dynamically adjusts the encoded data rate based on speech signal energy, background noise, and other speech characteristics. As a result, the average data rate of the compressed speech is significantly reduced without affecting voice quality. The QCELP vocoder consists of an encoder and a decoder. Implementing a QCELP vocoder on a general-purpose DSP chip takes approximately 20 to 25 MIPS, of which about 90% goes to the encoder and the remaining 10% to the decoder.

A 13-kbps encoder
The functional block diagram of the QCELP 13-kbps encoder is shown in Figure 1. The original speech is partitioned into 20-ms frames, each consisting of 160 samples at a sampling rate of 8 ksps. The encoder takes each frame as input, and produces a transmission packet of the compressed data as output. Each frame may be encoded at one of four different data rates: full rate, 1/2 rate, 1/4 rate, and 1/8 rate. The output packet has 266 bits in full rate, 124 bits in 1/2 rate, 54 bits in 1/4 rate, and 20 bits in 1/8 rate. So, the output data rate will be 13.3 kbps, 6.2 kbps, 2.7 kbps, and 1 kbps, respectively. The encoder consists of four major modules: LPC (formant) analysis, data rate determination, pitch search, and excitation codebook search. There are some other modules, such as LPC/LSP conversion, LSP vector quantization/unquantization, and data packing. As shown in Figure 1, LPC coefficients are not directly used after formant (LPC) analysis. Here, LPC coefficients go through LPC-to-LSP conversion, LSP quantization, LSP unquantization, and LSP-to-LPC conversion. They are then used for pitch and codebook search to match the LPC coefficients in the transmitter with those in the receiver.
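The packet sizes above translate directly into the quoted data rates. The short Python sketch below reproduces that arithmetic; the function names are illustrative assumptions and do not come from IS-733.

```python
# Packet sizes per 20-ms frame, as quoted in the text.
FRAME_MS = 20
FRAMES_PER_SECOND = 1000 // FRAME_MS  # 50 frames per second
PACKET_BITS = {"full": 266, "1/2": 124, "1/4": 54, "1/8": 20}


def rate_bps(rate_name):
    """Bit rate, in bits per second, of a stream encoded entirely at one rate."""
    return PACKET_BITS[rate_name] * FRAMES_PER_SECOND


def average_rate_bps(frame_mix):
    """Average bit rate for a mix of frame rates.

    frame_mix maps a rate name to the fraction of frames encoded at that
    rate; the fractions are assumed to sum to 1.
    """
    return sum(rate_bps(name) * share for name, share in frame_mix.items())
```

For instance, a hypothetical conversation in which 40% of the frames are full rate and 60% are 1/8 rate would average 0.4 x 13,300 + 0.6 x 1,000 = 5,920 bps, which illustrates why variable-rate coding lowers the average channel load.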

The first step in the encoder is LPC analysis, shown in Figure 2. Each frame of input speech passes through a high-pass filter and a Hamming window before LPC analysis is performed. LPC (formant) analysis is, in fact, typical LPC filtering. Its goal is to find the set of filter coefficients that is optimal in the sense of minimizing the mean squared error of the LPC residual R(n).
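The preprocessing described above can be sketched as follows. This is a minimal illustration, not the filter or window defined in the standard: the first-order DC blocker, its coefficient alpha, and the textbook symmetric Hamming window are all assumptions chosen for brevity.

```python
import math


def remove_dc(samples, alpha=0.95):
    """First-order DC-blocking high-pass filter (an assumed structure).

    y[n] = x[n] - x[n-1] + alpha * y[n-1]
    """
    out = []
    prev_x = 0.0
    prev_y = 0.0
    for x in samples:
        y = x - prev_x + alpha * prev_y
        out.append(y)
        prev_x, prev_y = x, y
    return out


def hamming_window(length):
    """Textbook Hamming window: 0.54 - 0.46 * cos(2*pi*n / (length - 1))."""
    if length == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]


def preprocess_frame(frame):
    """High-pass filter a frame, then taper its edges with the window."""
    filtered = remove_dc(frame)
    window = hamming_window(len(filtered))
    return [s * w for s, w in zip(filtered, window)]
```

A constant (pure DC) input decays toward zero at the output of remove_dc, so an offset in the input cannot inflate the frame energy that later drives rate determination.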

The formant is a resonant frequency of the human vocal tract that causes a peak in the short-term spectrum of speech. Input speech is sampled at 8 kHz, and gives a 4-kHz spectrum for analysis. Within this 4 kHz, the maximum number of formants is usually four, thus requiring the filter order to be at least eight. QCELP uses a tenth order LPC filter A(z), so that the formant resonances and general spectral shape are modeled accurately. The LPC filter is also called a short-term predictor (STP), which models the short-term correlation in the speech. Data frame size for LPC analysis is usually chosen to be within the range of 20 ms to 30 ms to meet the stationary requirement. QCELP uses 20 ms as the data frame size.

During implementation, LPC analysis is divided into two steps. First, the autocorrelations of the input speech are calculated; then Durbin’s recursion algorithm is used to compute the 10 optimal LPC coefficients. The LPC coefficients are then converted into LSP frequencies, because representing LPC coefficients in this pseudofrequency domain provides better quantization and interpolation properties. In addition, the natural ordering of LSP frequencies can be used to check filter stability in the decoder. In Figure 2, a high-pass filter (HPF) removes the DC component from the input so that a DC offset does not inflate the signal energy and disrupt the rate determination algorithm. The Hamming window smooths the block-edge effect that results from the frame-based nature of LPC analysis.
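The two-step computation (autocorrelation, then Durbin's recursion) can be sketched in a few lines of Python. The sign convention is an assumption made for this example: the predictor is taken as s(n) ≈ a1·s(n-1) + ... + ap·s(n-p), so that A(z) = 1 - Σ a_k z^-k. The function names are illustrative.

```python
def autocorrelation(frame, max_lag):
    """Autocorrelation values r[0..max_lag] of one (windowed) frame."""
    n = len(frame)
    return [sum(frame[i] * frame[i + k] for i in range(n - k))
            for k in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the LPC normal equations with Durbin's recursion.

    Returns (coefficients a[1..order], final prediction error energy).
    """
    a = [0.0] * (order + 1)  # a[0] is unused
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:
            break  # silent or degenerate frame: keep remaining coefficients at 0
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err  # reflection coefficient for this stage
        updated = a[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = a[j] - k * a[i - j]
        a = updated
        err *= 1.0 - k * k
    return a[1:], err
```

For a QCELP-style analysis, the call would be levinson_durbin(autocorrelation(frame, 10), 10), giving the ten coefficients mentioned above. Quantization and the LPC-to-LSP conversion are outside the scope of this sketch.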

The encoder’s second step is determining the data rate. Data rate is decided once every frame (20 ms). After the data rate is selected, the 20-ms frame of speech is encoded at this rate, and converted to a packet that can be transmitted. The QCELP 13-kbps vocoder uses two stages of data rate determination.

The first stage of rate determination classifies the input speech into two categories, based on signal energy and background noise. One category is “active speech,” encoded at full rate or 1/2 rate. Another is “background noise or pauses,” encoded at 1/8 rate.

The second stage of rate determination, which operates only when rate-reduction mode is enabled, further analyzes the characteristics of the input speech to determine whether the current frame can be encoded at a reduced rate without affecting voice quality. First, the active speech from the output of the first stage is classified into voiced speech, which is encoded at full rate or 1/2 rate, and unvoiced speech, which is encoded at 1/4 rate. Then, the voiced speech is further classified into nonstationary (transitional) frames, encoded at full rate, and stationary frames, encoded at 1/2 rate. Voiced speech is made up of vowel sounds and is characterized by periodic frequency resonance. Unvoiced speech consists mainly of consonants and is random in frequency content. Unvoiced speech can represent 30% of active speech. In a QCELP 13-kbps encoder, zero crossings and the normalized autocorrelation function (NACF) are used to make the voiced/unvoiced decision. Stationary speech is the continuation of a sound that has already begun. Nonstationary, or transitional, speech is a frame in which the speaker is changing from one distinct sound to another. In a QCELP 13-kbps encoder, the target signal-to-noise ratio (SNR), the differential prediction gain, the differential LSP, and the NACF are used to distinguish stationary speech frames from nonstationary ones.
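Two of the features named above, zero crossings and the NACF, are easy to illustrate. The sketch below uses generic textbook definitions; the standard computes its NACF on a specific processed signal with its own thresholds, none of which are reproduced here.

```python
import math


def zero_crossings(frame):
    """Count sign changes between consecutive samples."""
    return sum(1 for prev, cur in zip(frame, frame[1:])
               if (prev >= 0) != (cur >= 0))


def nacf(frame, lag):
    """Normalized autocorrelation of a frame at a single lag (0 <= lag < len).

    Values near 1 indicate strong periodicity at that lag (voiced speech);
    values near 0 indicate noise-like content (unvoiced speech).
    """
    current = frame[lag:]
    delayed = frame[:len(frame) - lag]
    cross = sum(c * d for c, d in zip(current, delayed))
    norm = math.sqrt(sum(c * c for c in current) * sum(d * d for d in delayed))
    return cross / norm if norm > 0.0 else 0.0
```

A frame with many zero crossings and a low NACF peak behaves like noise and is a candidate for 1/4 rate, whereas a strongly periodic frame shows few crossings and an NACF close to 1 at its pitch lag.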

In summary, the full rate is used for transitional frames, frames with reduced periodicity, or poorly modeled frames in which the highest encoding rate is necessary to achieve good voice quality. The 1/2 rate is used for well-modeled, stationary, and periodic frames. The 1/4 rate is used for unvoiced speech, and the 1/8 rate for background noise or pauses.
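The two-stage decision summarized above amounts to a small decision tree. The sketch below encodes only that tree; the boolean inputs are hypothetical flags that stand in for the energy, NACF, SNR, and LSP measurements the encoder actually makes, and the first_stage_rate argument stands in for the first stage's own choice between full and 1/2 rate.

```python
def select_rate(is_active, is_voiced=True, is_stationary=False,
                rate_reduction=True, first_stage_rate="full"):
    """Pick the encoding rate for one 20-ms frame.

    is_active      -- first stage: active speech vs. background noise or pause
    is_voiced      -- second stage: voiced vs. unvoiced
    is_stationary  -- second stage: well-modeled, stationary, periodic frame
    rate_reduction -- whether the second stage is enabled at all
    """
    if not is_active:
        return "1/8"             # background noise or pauses
    if not rate_reduction:
        return first_stage_rate  # single-stage decision only
    if not is_voiced:
        return "1/4"             # unvoiced speech
    return "1/2" if is_stationary else "full"
```

With rate reduction disabled, the sketch simply passes through whatever the first stage selected for active speech.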

After LPC analysis and rate determination, the next step is pitch search. Pitch is the fundamental frequency caused by periodic vibration of human vocal cords. A pitch search is performed in the subframes of an LPC frame, for example, every 5 ms to 10 ms. Usually, there are two types of pitch search models: the open-loop model and the closed-loop model. A QCELP 13-kbps encoder uses the closed-loop model.

In an open-loop model (Figure 3), after the input speech S(n) goes through the LPC filter A(z) to produce the LPC residual signal R(n) (Figure 2), R(n) enters the pitch predictor P(z) to produce the pitch residual E(n). The pitch predictor P(z) has two parameters: pitch gain b and pitch lag L. The goal of pitch search here is to look for the optimal pitch gain and pitch lag, in the sense of the least mean squared error of E(n). The pitch predictor is also called the long-term predictor (LTP), which models long-term correlations between speech samples that are one or multiple pitch periods apart. In other words, after the input speech passes A(z), the short-term correlation is removed. When the output of A(z) passes P(z), the long-term correlation is removed. The resulting E(n) is very much like white noise.
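A minimal open-loop search can be written directly from this description. The sketch assumes a single-tap predictor, E(n) = R(n) - b·R(n - L), and leaves the gain unquantized; the function name, the start argument, and the lag range are illustrative choices rather than values from the standard.

```python
def open_loop_pitch_search(residual, start, min_lag, max_lag):
    """Find the (lag, gain, error) that minimize the energy of E(n).

    The error is summed over n in [start, len(residual)); samples before
    start serve as history, so start must be at least max_lag.
    """
    if start < max_lag:
        raise ValueError("not enough history for the largest lag")
    target = residual[start:]
    target_energy = sum(v * v for v in target)
    best = (min_lag, 0.0, target_energy)  # zero gain: no prediction at all
    for lag in range(min_lag, max_lag + 1):
        delayed = residual[start - lag:len(residual) - lag]
        energy = sum(d * d for d in delayed)
        if energy == 0.0:
            continue
        cross = sum(t * d for t, d in zip(target, delayed))
        gain = cross / energy                 # least-squares gain for this lag
        error = target_energy - gain * cross  # residual energy at that gain
        if error < best[2]:
            best = (lag, gain, error)
    return best
```

For each candidate lag, the optimal gain has a closed form, so the search reduces to one pass over the lag range.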

Now, let’s reverse the open-loop process (shown in Figure 3) to see what happens. If we take additive white Gaussian noise (AWGN) as the excitation source and pass it through a pitch synthesis filter 1/P(z) and then a formant synthesis filter 1/A(z), the final output should be synthesized speech. That is the basic speech reconstruction model used in the QCELP algorithm.
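The reversed chain can be sketched as two recursive filters in cascade. The example assumes P(z) = 1 - b·z^-L and A(z) = 1 - Σ a_k z^-k, so the synthesis filters feed back past outputs with positive signs; the function names and the seed parameter are illustrative.

```python
import random


def pitch_synthesis(excitation, lag, gain):
    """1/P(z): y[n] = x[n] + gain * y[n - lag]."""
    out = []
    for n, x in enumerate(excitation):
        past = out[n - lag] if n >= lag else 0.0
        out.append(x + gain * past)
    return out


def formant_synthesis(signal, lpc):
    """1/A(z): s[n] = y[n] + sum over k of lpc[k-1] * s[n - k]."""
    out = []
    for n, x in enumerate(signal):
        acc = x
        for k, a in enumerate(lpc, start=1):
            if n - k >= 0:
                acc += a * out[n - k]
        out.append(acc)
    return out


def synthesize(num_samples, lag, gain, lpc, seed=0):
    """Drive the two synthesis filters with white Gaussian noise."""
    rng = random.Random(seed)
    excitation = [rng.gauss(0.0, 1.0) for _ in range(num_samples)]
    return formant_synthesis(pitch_synthesis(excitation, lag, gain), lpc)
```

Both filters start from a zero state here; a real codec carries the filter memories over from one frame to the next.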

In the closed-loop model (Figure 4), an AbS method is used to determine the optimal pitch gain b and pitch lag L. The goal is to minimize the weighted error between the input speech and the synthesized speech. The synthesized speech is the output of the formant synthesis filter, which processes the estimated output of the pitch synthesis filter. W(z) = A(z)/A(z/r), with r between 0.8 and 0.9, is a perceptual weighting filter. It de-emphasizes the error in the frequency regions that correspond to the formants, as determined by LPC analysis. Noise in those regions is partly masked by the strong speech energy there, so the search can concentrate on reducing noise in the rest of the spectrum, where it is more perceptibly disturbing. The amount of de-emphasis is controlled by r. The zero-input response of 1/A(z) represents the initial state condition of the filter. It is subtracted from the input speech because the closed-loop search runs the formant synthesis filter 1/A(z) from a zero initial state.
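The weighting filter follows mechanically from the LPC coefficients: replacing z with z/r scales the k-th coefficient by r to the power k. The sketch below assumes A(z) = 1 - Σ a_k z^-k and a zero initial filter state; the function names are illustrative.

```python
def bandwidth_expand(lpc, r):
    """Coefficients of A(z/r): the k-th coefficient is scaled by r**k."""
    return [a * r ** k for k, a in enumerate(lpc, start=1)]


def perceptual_weighting(signal, lpc, r=0.8):
    """Filter a signal with W(z) = A(z) / A(z/r)."""
    expanded = bandwidth_expand(lpc, r)
    out = []
    for n, x in enumerate(signal):
        acc = x
        # Numerator A(z): subtract the short-term prediction of the input.
        for k, a in enumerate(lpc, start=1):
            if n - k >= 0:
                acc -= a * signal[n - k]
        # Denominator 1/A(z/r): feed back past outputs with damped coefficients.
        for k, a in enumerate(expanded, start=1):
            if n - k >= 0:
                acc += a * out[n - k]
        out.append(acc)
    return out
```

With r = 1 the numerator and denominator cancel and the filter passes the signal unchanged; as r decreases, the formant peaks are flattened more strongly.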

After determining the formant synthesis filter 1/A(z), the pitch synthesis filter 1/P(z), and encoding data rate, we can do an excitation codebook search. The codebook search is performed in the subframes of an LPC frame. The subframe length is usually equal to or shorter than the pitch subframe length.

The codebook search model is shown in Figure 5. For full rate and 1/2 rate frames, a form of Gaussian codebook is used. It consists of 128 codebook vectors, with each vector containing 128 samples. The codebook has two parameters: codebook index I and codebook gain G. For 1/4 rate and 1/8 rate, a pseudorandom vector generator is used. From Figure 5, we can see that an AbS method determines the codebook parameters. The goal is to minimize the weighted error between the input speech and the synthesized speech. The synthesized speech is the scaled codebook vector, filtered by the pitch synthesis filter 1/P(z) and the formant synthesis filter 1/A(z). The codebook used here is also called a “circular codebook with single shift.” The vector in any row of the codebook is the result of shifting the vector in the previous row by one sample. This kind of codebook structure reduces the computation of the codebook search, and requires less memory storage. A perceptual weighting filter W(z) is used to reduce the noise.
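The single-shift structure means the whole codebook can be stored as one circular table, with each row read out at a different offset. The sketch below shows that idea together with a brute-force search for the index I and gain G. The optional synth argument is a hypothetical hook standing in for the 1/P(z), 1/A(z), and W(z) filtering; without it, the search matches raw codebook vectors to the target.

```python
def codebook_vector(table, index, length):
    """Row `index` of a circular codebook: the table shifted by `index` samples."""
    size = len(table)
    return [table[(n - index) % size] for n in range(length)]


def codebook_search(target, table, synth=None):
    """Return the (index, gain) that minimize the squared error to the target."""
    best_index, best_gain, best_score = 0, 0.0, -1.0
    for index in range(len(table)):
        vector = codebook_vector(table, index, len(target))
        if synth is not None:
            vector = synth(vector)  # e.g. synthesis plus weighting filters
        energy = sum(v * v for v in vector)
        if energy == 0.0:
            continue
        cross = sum(t * v for t, v in zip(target, vector))
        score = cross * cross / energy  # error reduction at the optimal gain
        if score > best_score:
            best_index, best_gain, best_score = index, cross / energy, score
    return best_index, best_gain
```

Because consecutive rows differ by a single sample, a production encoder can update the filtered vector recursively from one index to the next instead of refiltering each row, which is where the computational saving mentioned above comes from. That optimization is not shown here.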

In summary, from LPC analysis, LPC-to-LSP conversion, and LSP vector quantization, we get the vector quantization codebook indices for the line spectral pair vectors (LSPVs). From the pitch search, we get pitch gain and pitch lag; and from the excitation codebook search, we get codebook gain and either the codebook index or the random seed. The last step the encoder performs is to pack these parameters into the transmission codes and send them out. Remember, the chosen encoding data rate is only used for pitch search, codebook search, and data packing. It is not a transmitted parameter.

A 13-kbps decoder
A QCELP 13-kbps decoder is shown in Figure 6. In the receiver, the channel decoder determines the received traffic channel frame, and provides the packet type and packet data to the voice decoder. The packet type is basically the received frame data rate. The first step the decoder performs is to unpack the received packet and restore the transmitted parameters. Then the speech can be reconstructed based on these parameters.

The output of the excitation codebook, or of the pseudorandom vector generator, enters the pitch synthesis filter 1/P(z) as the source signal. The signal then passes through the formant (LPC) synthesis filter and an adaptive postfilter APF(z) to produce synthesized speech (Figure 6). This is essentially the same speech reconstruction model as the one used in the encoder. Therefore, only the differences are discussed here.

An adaptive postfilter APF(z) = B(z) A(z/p)/A(z/s), with p = 0.625 and s = 0.775, is used to reduce the subjective noise in the synthesized speech. B(z) = 1/(1 + 0.3 z^-1) is an antitilt filter that compensates for the spectral tilt introduced by A(z/p)/A(z/s). The postfiltered speech is characterized by less background noise (quiet-room effect) and increased smoothness. For the lower-rate CELPs, this enhancement to subjective quality is particularly noticeable. The result is that the synthesized speech sounds much cleaner and more pleasant.
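The postfilter can be sketched as a pole-zero section followed by the one-pole antitilt filter. The values of p, s, and the 0.3 tilt coefficient are those quoted above; the sign convention A(z) = 1 - Σ a_k z^-k, the zero initial state, and the function name are assumptions made for this example.

```python
def adaptive_postfilter(signal, lpc, p=0.625, s=0.775, tilt=0.3):
    """Apply APF(z) = B(z) * A(z/p) / A(z/s), with B(z) = 1 / (1 + tilt * z^-1)."""
    num = [a * p ** k for k, a in enumerate(lpc, start=1)]  # A(z/p)
    den = [a * s ** k for k, a in enumerate(lpc, start=1)]  # A(z/s)
    shaped = []  # output of A(z/p) / A(z/s)
    out = []     # output after the antitilt filter B(z)
    for n, x in enumerate(signal):
        acc = x
        for k, c in enumerate(num, start=1):
            if n - k >= 0:
                acc -= c * signal[n - k]
        for k, c in enumerate(den, start=1):
            if n - k >= 0:
                acc += c * shaped[n - k]
        shaped.append(acc)
        previous = out[n - 1] if n >= 1 else 0.0
        out.append(acc - tilt * previous)
    return out
```

The filter is called adaptive because lpc changes with every decoded frame, so the emphasis always follows the current formant structure.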

The pitch prefilter is used to enhance the periodicity. In fact, it is a second pitch filter, with gain limited to 0.5 for stability. The gain control ensures that the power of the output signal is approximately the same as the power of the input signal. The method used here estimates the power of the input and output signals, and then determines an appropriate scaling factor based on the ratio of the two estimated power values. In addition, the received parameters of the LSPVs must be unquantized, interpolated, and converted back to LPC coefficients. This module is identical to the one used in the encoder. The interpolation of LSPs is needed for each pitch/codebook search subframe.
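The gain-control step described above reduces to a power-ratio calculation. The sketch below works on one block of samples at a time and applies a single scale factor; the function name is illustrative, and any smoothing of the scale factor between blocks is omitted.

```python
import math


def gain_control(reference, filtered):
    """Scale `filtered` so that its power matches that of `reference`.

    reference -- signal before postfiltering (sets the power target)
    filtered  -- signal after postfiltering (the one that gets scaled)
    """
    if not reference or not filtered:
        return list(filtered)
    power_in = sum(v * v for v in reference) / len(reference)
    power_out = sum(v * v for v in filtered) / len(filtered)
    if power_out == 0.0:
        return list(filtered)  # nothing to scale
    scale = math.sqrt(power_in / power_out)
    return [scale * v for v in filtered]
```

The square root appears because the ratio is taken between powers, while the correction is applied to amplitudes.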

Major components
QCELP vocoders code speech by using a source-filter model. This model includes three major components: the formant synthesis filter, the pitch synthesis filter, and an excitation of Gaussian vectors. The encoder uses formant (LPC) analysis, pitch search, and excitation codebook search to extract the model parameters from the input speech. The decoder reconstructs the speech from these model parameters. Dynamically adjusting the coding data rate is what distinguishes the QCELP algorithm from the traditional CELP algorithm. It significantly reduces the average encoded data rate without sacrificing voice quality. The rate determination is based on the input speech energy, speech characteristics, and the background noise.

Derek Q. Wang is a member of the CDSP software technical staff at Silicon Wireless, Inc., in Mountain View, CA. He received his MSEE from the University of Florida in Gainesville, FL. He can be reached at derek.wang@siliconwireless.com.

Illustrations
Figure 1
Figure 2
Figure 3
Figure 4
Figure 5
Figure 6

Resources
  1. High Rate Speech Service Option 17 for Wideband Spread Spectrum Communication Systems, TIA/EIA/IS-733, March 1998.
  2. Speech Service Option Standard for Wideband Spread Spectrum Digital Cellular Systems, TIA/EIA/IS-96A, May 1995.
  3. Kondoz, A.M., Digital Speech: Coding for Low Bit Rate Communications Systems, Wiley Publishers, 1994.




