Speech quality continues to play a big role in the development of VoIP architectures. While developers have made improvements, VoIP speech quality still falls short in delivering the toll-quality performance today's end users expect. Fortunately, there are a host of techniques designers can employ to improve speech coding architectures and, in turn, the quality of speech transmissions over VoIP equipment.
This is the second installment in our two-part set on VoIP speech coding challenges. Part 1 of this article discussed the key attributes required in speech coder technology as well as challenges that the packet network creates in a speech coding architecture. Now, in Part 2, we'll examine five techniques designers can employ to improve the quality of their VoIP speech coders.
When working on speech coding architectures, designers must tackle five challenges: error correction, adaptive wideband coding, transcoding across networks, background noise reduction, and low-delay modems. Let's look at how developers can tackle each of these challenges, starting with error correction.
1. Error Correction
There are really three main methods for handling error correction in VoIP systems. These include receiver-based error concealment, media-independent FEC, and content-dependent FEC.
There are several receiver-based concealment techniques designers can employ in VoIP speech coding architectures. These include: repetition-based concealment, model-based concealment, and noise substitution schemes.
A. Repetition-based concealment
Repetition-based concealment replaces the lost portion of speech, or the lost speech parameters, with a gradually attenuated copy of the ones that arrived immediately before the loss. G.729 performs a form of repetition in which the codebook gains are gradually decayed after the first repeated frame. Various heuristics around this general concept have been proposed. For instance, in the architecture highlighted in reference 7 below, two features were added to the basic repetition of the parameters of a CELP coder. The first is a muting algorithm for the excitation signal. The second is pitch jittering during bursty frame erasures, which keeps the reconstructed frames from being excessively periodic and so makes them sound more natural.
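The basic repetition idea can be sketched in a few lines. This is a minimal illustration that works on raw sample frames rather than CELP parameters, and it is not the G.729 procedure: the function name, the `frame_len` argument, and the `decay` factor of 0.5 are assumptions chosen for the example, and the attenuation here starts with the first lost frame.

```python
from typing import List, Optional

Frame = List[float]


def conceal_by_repetition(
    frames: List[Optional[Frame]], frame_len: int, decay: float = 0.5
) -> List[Frame]:
    """Fill each lost frame (None) with an attenuated copy of the last good frame."""
    output: List[Frame] = []
    last_good: Optional[Frame] = None
    gain = 1.0
    for frame in frames:
        if frame is not None:
            last_good = frame
            gain = 1.0  # a good frame resets the attenuation
            output.append(list(frame))
        elif last_good is not None:
            gain *= decay  # each consecutive loss is attenuated further
            output.append([gain * sample for sample in last_good])
        else:
            output.append([0.0] * frame_len)  # nothing to repeat yet: silence
    return output
```

A burst of consecutive losses therefore fades toward silence instead of repeating the same frame at full level, which is what keeps long erasures from sounding like a buzz.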
Under the model-based recovery approach (reference 11), the line spectral frequency (LSF) parameters in a missing frame are recovered based on a Gaussian predictive model. The effectiveness of these schemes is a function of the fidelity (order) of the model used. They are beneficial when one or more LSF subsets are lost as a result of a corrupted frame.
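As a rough illustration of what a Gaussian predictive model does, the sketch below predicts each missing LSF from the previous frame using the conditional mean of a first-order, jointly Gaussian model. This is a deliberate simplification and not the scheme of reference 11: the function name, the per-coefficient `correlation` values, and the long-term `mean_lsf` are all assumptions that would have to be trained on real speech.

```python
from typing import List


def predict_missing_lsf(
    prev_lsf: List[float], mean_lsf: List[float], correlation: List[float]
) -> List[float]:
    """Conditional mean of each LSF given its value in the previous frame.

    For a stationary, jointly Gaussian pair with inter-frame correlation rho,
    E[x_t | x_(t-1)] = mean + rho * (x_(t-1) - mean).
    """
    return [
        mean + rho * (prev - mean)
        for prev, mean, rho in zip(prev_lsf, mean_lsf, correlation)
    ]
```

With a correlation of 1 this reduces to plain repetition of the previous frame, and with a correlation of 0 it falls back to the long-term mean, so the model order and its statistics decide how much better than repetition the recovery can be.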
Noise substitution schemes entail substituting missing speech frames with noise frames. These frames may be either generic white Gaussian noise or a more realistic comfort noise, whose statistics are determined during non-speech periods at the encoder. Often these substitution schemes are combined with other methods. For instance, in the voicing-based recovery method described in reference 9, noise is used to fill in the missing unvoiced speech frames or the unvoiced subband of mixed-excitation frames.
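A minimal sketch of noise substitution follows. It assumes the only statistic carried over from the non-speech periods is an RMS level (a real comfort noise generator would also match the spectral shape); the function names and the `target_rms` parameter are illustrative.

```python
import math
import random
from typing import List


def rms(frame: List[float]) -> float:
    """Root-mean-square level of a frame."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def comfort_noise_frame(
    target_rms: float, frame_len: int, rng: random.Random
) -> List[float]:
    """White Gaussian noise frame scaled to a level measured during silence."""
    noise = [rng.gauss(0.0, 1.0) for _ in range(frame_len)]
    scale = target_rms / rms(noise)
    return [scale * s for s in noise]
```

A decoder would call `comfort_noise_frame` once per missing frame, passing the level it last estimated from non-speech frames.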
B. Media-independent FEC
Media-independent FEC mechanisms add parity bits or packets at the encoding source so that the receiving end can recover lost or erroneously received packets. This kind of FEC is independent of the underlying information content and typically uses block or algebraic codes to produce additional parity packets. Block coding schemes, such as Reed-Solomon or the one shown in Figure 4 (reference 19), may be used.
Figure 4: Parity packet generated by XORing n packets.
Other proposed media-independent FEC mechanisms involve exclusive-OR operations, whereby a redundant parity packet, formed by exclusive-ORing the preceding n data packets, is sent after every n data packets (reference 12). This method allows recovery from a single loss in an n-packet message. FEC mechanisms in general are desirable when lost packets are dispersed throughout the stream of packets. Their advantage is that they are independent of the underlying media and yield an exact replacement of the lost packet. Their computational requirements are relatively small, and they are generally simple to implement. On the other hand, they lead to an increase in bandwidth as well as an added delay at the decoder side.
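The exclusive-OR scheme is simple enough to show in full. The sketch below assumes all packets in a group have the same length (a real implementation would pad them); the function names are illustrative.

```python
from typing import List, Optional


def xor_parity(packets: List[bytes]) -> bytes:
    """Byte-wise XOR of equal-length packets."""
    size = len(packets[0])
    parity = bytearray(size)
    for packet in packets:
        if len(packet) != size:
            raise ValueError("all packets must have the same length")
        for i, byte in enumerate(packet):
            parity[i] ^= byte
    return bytes(parity)


def recover_single_loss(
    received: List[Optional[bytes]], parity: bytes
) -> List[bytes]:
    """Rebuild at most one missing packet (None) from the others and the parity."""
    missing = [i for i, packet in enumerate(received) if packet is None]
    if len(missing) > 1:
        raise ValueError("XOR parity can repair only one loss per group")
    restored = list(received)
    if missing:
        present = [packet for packet in received if packet is not None]
        # XORing the survivors with the parity cancels them out, leaving the lost packet
        restored[missing[0]] = xor_parity(present + [parity])
    return restored
```

The receiver has to wait for the whole group and its parity packet before it can repair a loss, which is the source of the added decoder-side delay, and the parity packet itself is the bandwidth overhead (one extra packet per n).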
C. Content-dependent FEC
In content-dependent FEC approaches, each frame of speech is transmitted in more than one packet, with each copy represented in a different compressed format. The first copy is referred to as the primary encoding and the subsequent copies as the secondary encoding.
For instance, in the method proposed by Bolot (reference 2), a frame is PCM-encoded and sent in packet n, while a secondary encoding of the same frame is produced with a low-bit-rate coder such as LPC (2.4 to 5.6 kbit/s) or GSM (13.2 kbit/s) and sent in packet n+1 (Figure 5).
Figure 5: Coding-dependent FEC using 3 packets.
In Bolot's architecture, the choice of the primary and secondary encoding is a function of the computational cost, the available bandwidth, and the degree of error robustness. Using GSM encoding, for example, is computationally demanding but more robust to the type of errors experienced over the Internet. In addition, the amount of redundancy can be adjusted dynamically as the characteristics of the IP network change. Thus, during high-loss periods, the secondary GSM encoding for packet n may be sent in packets n+1 and n+2, or in packets n+1, n+2, and n+3. The tradeoffs between the rewards of better information recovery and the added bandwidth and complexity are illustrated in reference 2 below for a number of combinations.
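The piggybacking of secondary encodings can be sketched as follows. The encoders are stand-ins (strings tagged `pcm(...)` and `low(...)`), and the `Packet` class, `packetize`, and `reconstruct` are hypothetical names used only for illustration; the `redundancy` argument plays the role of the dynamically adjusted number of later packets that carry a frame's secondary copy.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Packet:
    seq: int
    primary: str  # primary encoding of frame `seq`
    secondary: Dict[int, str] = field(default_factory=dict)  # frame index -> low-rate copy


def packetize(frames: List[str], redundancy: int = 1) -> List[Packet]:
    """Packet n carries frame n plus low-rate copies of the `redundancy` frames before it."""
    packets = []
    for n, frame in enumerate(frames):
        copies = {
            n - k: f"low({frames[n - k]})"
            for k in range(1, redundancy + 1)
            if n - k >= 0
        }
        packets.append(Packet(n, f"pcm({frame})", copies))
    return packets


def reconstruct(received: List[Optional[Packet]]) -> List[Optional[str]]:
    """Use the primary copy when it arrived, otherwise any secondary copy."""
    out: List[Optional[str]] = [None] * len(received)
    for packet in received:
        if packet is not None:
            out[packet.seq] = packet.primary
    for packet in received:
        if packet is None:
            continue
        for index, copy in packet.secondary.items():
            if out[index] is None:
                out[index] = copy
    return out
```

With `redundancy=1` a single lost packet is replaced by its low-rate copy, but two consecutive losses leave a gap; raising `redundancy` to 2 closes that gap at the cost of more bandwidth per packet.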
2. Adaptive Wideband Coding
IP networks provide potentially wide bandwidth, which in turn offers the possibility of sending high-quality speech through wideband coding. However, network congestion and other impairments also result in a high variance in the available bandwidth. As a result, a speech coder should ideally exploit high bandwidth to transmit higher-fidelity speech (7 kHz instead of 4 kHz), yet be able to drop the bit rate (and gradually compromise on fidelity) during congestion or whenever the available bandwidth on the IP network or the access network is no longer guaranteed.
Adaptive multirate wideband coders have been proposed in the context of wideband coding. The coder highlighted in reference 14 below operates in five different modes, with bit rates ranging from 24 down to 9.1 kbit/s.
The goal is to provide, at the higher rates, a speech quality that equals or exceeds that of the G.722 wideband coder (48 kbit/s). To do so, the coder described in reference 14 exploits human auditory perception: the lower band (0 to 6 kHz) is coded with a variable-rate ACELP, and the upper 1 kHz (representing one critical band on the auditory log scale) uses either a bandwidth expansion scheme or ADPCM coding, depending on the available bandwidth and the application. Most of the bits are reserved for the lower band, with the upper band using as few as 6 bits per 20-ms frame (160 samples) or as many as 2 bits per sample when the overall bit budget is sufficiently large (24 kbit/s mode).
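To illustrate how a multirate coder might follow the available bandwidth, here is a small sketch. Only the 24 and 9.1 kbit/s end points come from the text above; the three intermediate rates in `EXAMPLE_MODES_KBPS` are placeholders invented for the example, and the selection rule (highest mode that fits) is an assumption rather than the behavior of the coder in reference 14.

```python
from typing import Sequence

# Only 24.0 and 9.1 kbit/s are taken from the article; the three middle
# values are hypothetical placeholders for the unnamed intermediate modes.
EXAMPLE_MODES_KBPS = (24.0, 19.0, 15.0, 12.0, 9.1)


def select_mode(
    available_kbps: float, modes: Sequence[float] = EXAMPLE_MODES_KBPS
) -> float:
    """Pick the highest-rate mode that fits, or the lowest mode if none fits."""
    fitting = [rate for rate in modes if rate <= available_kbps]
    return max(fitting) if fitting else min(modes)


def band_rate_bps(bits_per_frame: int, frame_ms: float) -> float:
    """Bit rate consumed by a band that spends `bits_per_frame` every frame."""
    return bits_per_frame * 1000.0 / frame_ms
```

For example, `band_rate_bps(6, 20)` shows that the leanest upper-band setting of 6 bits per 20-ms frame costs only 300 bit/s, which is why nearly the entire budget can go to the lower band in the low-rate modes.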
3. Transcoding Across Networks
Smart transcoding refers to the ability to map the various coefficients between two speech coders at the boundaries of a VoIP network in a way that is transparent and preserves quality (reference 17). For instance, the scheme proposed in reference 15 below maps a G.723 coder to an EVRC coder. The two coders have inherently different bit rates (5.3 or 6.3 kbit/s for G.723 and 8 kbit/s for EVRC) as well as different frame sizes (30 ms with a 7.5-ms lookahead for G.723 vs. 20 ms with a 10-ms lookahead for EVRC).
The line spectral pairs (LSP) are converted by translating two sets of G.723 information into three sets of LSP parameters for EVRC, using an interpolation scheme over three frames. After the LSP conversion, the open-loop pitch of EVRC is computed from the closed-loop pitch of G.723, using the perceptually weighted speech.
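The two-to-three mapping works because two 30-ms G.723 frames and three 20-ms EVRC frames both span 60 ms. The source does not spell out the interpolation weights, so the sketch below uses one plausible choice, weighting each source frame by how much of the destination frame it overlaps in time. The function names are illustrative and this should not be read as the algorithm of reference 15.

```python
from typing import List


def overlap_weights(src_ms: int, dst_ms: int, n_src: int) -> List[List[float]]:
    """Weight of each source frame in each destination frame, by time overlap."""
    total = src_ms * n_src
    if total % dst_ms != 0:
        raise ValueError("source span must be a whole number of destination frames")
    weights = []
    for j in range(total // dst_ms):
        d0, d1 = j * dst_ms, (j + 1) * dst_ms
        row = []
        for i in range(n_src):
            s0, s1 = i * src_ms, (i + 1) * src_ms
            overlap = max(0, min(d1, s1) - max(d0, s0))
            row.append(overlap / dst_ms)
        weights.append(row)
    return weights


def convert_lsp(
    src_lsps: List[List[float]], src_ms: int = 30, dst_ms: int = 20
) -> List[List[float]]:
    """Interpolate source-frame LSP vectors onto the destination frame grid."""
    weights = overlap_weights(src_ms, dst_ms, len(src_lsps))
    order = len(src_lsps[0])
    return [
        [sum(w * lsp[k] for w, lsp in zip(row, src_lsps)) for k in range(order)]
        for row in weights
    ]
```

Under this weighting the first and third EVRC frames inherit the LSPs of the first and second G.723 frames unchanged, and the middle EVRC frame gets their average.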
The closed-loop pitch of G.723 is compared with the one from the previous EVRC subframe. If the two values differ by less than 10 samples, the closed-loop pitch of G.723 is taken as the open-loop pitch of EVRC. Otherwise, a pitch smoothing method is applied, whereby a pitch value is searched in a range of +/-3 samples around the closed-loop pitch of G.723 and around that of EVRC. The two maxima are compared, and a decision is made based on the pitch value and the pitch gain in the previous subframe.
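The decision logic can be sketched as follows. The 10-sample threshold and the +/-3 search range come from the description above; the use of normalized autocorrelation of the weighted speech as the search criterion, and the final rule of simply keeping the candidate with the larger maximum, are assumptions standing in for the pitch-value and pitch-gain test that the source does not detail.

```python
import math
from typing import Sequence, Tuple


def normalized_correlation(signal: Sequence[float], lag: int) -> float:
    """Normalized correlation between the signal and itself delayed by `lag`."""
    current = signal[lag:]
    delayed = signal[: len(signal) - lag]
    num = sum(c * d for c, d in zip(current, delayed))
    den = math.sqrt(sum(c * c for c in current) * sum(d * d for d in delayed))
    return num / den if den > 0.0 else 0.0


def best_lag_near(
    signal: Sequence[float], center: int, search: int = 3
) -> Tuple[float, int]:
    """Return (correlation, lag) for the best lag within +/- `search` of `center`."""
    lags = range(max(1, center - search), center + search + 1)
    return max((normalized_correlation(signal, lag), lag) for lag in lags)


def open_loop_pitch(
    g723_pitch: int,
    prev_evrc_pitch: int,
    weighted_speech: Sequence[float],
    threshold: int = 10,
    search: int = 3,
) -> int:
    """Reuse the G.723 pitch when it agrees with EVRC, otherwise smooth."""
    if abs(g723_pitch - prev_evrc_pitch) < threshold:
        return g723_pitch
    corr_a, lag_a = best_lag_near(weighted_speech, g723_pitch, search)
    corr_b, lag_b = best_lag_near(weighted_speech, prev_evrc_pitch, search)
    # Assumed tie-break: keep the candidate with the stronger correlation peak
    return lag_a if corr_a >= corr_b else lag_b
```

The appeal of this approach is that the expensive full-range open-loop pitch search of the target coder is replaced by, at worst, two narrow searches of seven lags each.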
4. Background Noise Reduction
The aim of noise reduction is to minimize the effect of noise on the performance of voice communications systems. This means improving the quality perceived by the human listener as well as providing a more appropriate signal for estimating crucial signal parameters such as spectral content, pitch, and voicing. There are a variety of methods for achieving speech enhancement, which are surveyed in detail in reference 20 below:
- Wiener filtering: enhances speech by spectral subtraction and optimal linear filters. These filters are derived by minimizing the mean squared error (MSE) or other criteria.
- Comb filtering: reinforces the harmonic structure of the speech by combing through the spectrum and enhancing the periodic peaks.
- Maximum likelihood estimation: involves an estimation of the speech envelope or the magnitude spectrum based on a statistical model of the speech and noise.
- Psychoacoustic methods: consist of special filtering that takes into account the peculiarities of perceptually important speech parameters or acoustic criteria of human hearing.
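To make the first item in the list above concrete, the sketch below computes a per-frequency-bin Wiener gain from power spectra. It assumes the noisy-speech and noise power in each bin have already been estimated (the noise typically during non-speech periods), and it estimates the clean speech power by power spectral subtraction; the function name is illustrative.

```python
from typing import List, Sequence


def wiener_gains(
    noisy_power: Sequence[float], noise_power: Sequence[float]
) -> List[float]:
    """Per-bin Wiener gain Pss / (Pss + Pnn), with Pss from spectral subtraction."""
    gains = []
    for pxx, pnn in zip(noisy_power, noise_power):
        clean = max(pxx - pnn, 0.0)  # subtract the noise estimate, clamp at zero
        denom = clean + pnn
        gains.append(clean / denom if denom > 0.0 else 0.0)
    return gains
```

Each gain multiplies the corresponding bin of the noisy spectrum before the inverse transform, so bins dominated by speech pass almost unchanged while bins at or below the noise estimate are suppressed.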
5. Low Delay Modems
An analysis by B. Goodman (reference 10) of the total delay in a typical VoIP call using dial-up modems concluded that the component added by an analog modem significantly exceeds the theoretical lower limit. This limit is determined analytically from the data rate of the modem, the number of speech frames per packet, and the bit rate of the speech coder. In modems such as V.34, the actual measured delay can be up to three times that lower limit.
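The source does not give Goodman's formula, but one plausible reading of such a lower limit is the time the modem needs just to clock out the bits of one packet. The sketch below computes that serialization time; the frame duration and the optional `header_bits` overhead are parameters the text does not specify, and the function names are illustrative.

```python
def packet_payload_bits(
    coder_bps: float, frame_ms: float, frames_per_packet: int
) -> float:
    """Speech payload carried by one packet, in bits."""
    return coder_bps * frame_ms / 1000.0 * frames_per_packet


def serialization_delay_ms(
    coder_bps: float,
    frame_ms: float,
    frames_per_packet: int,
    modem_bps: float,
    header_bits: float = 0.0,
) -> float:
    """Time for the modem to transmit one packet at its nominal data rate."""
    bits = packet_payload_bits(coder_bps, frame_ms, frames_per_packet) + header_bits
    return 1000.0 * bits / modem_bps
```

For instance, one 30-ms frame from a 6.3 kbit/s coder is 189 bits, which a 33.6 kbit/s modem can send in about 5.6 ms when headers are ignored; a measured modem delay of three times such a figure is what the buffering, framing, and retransmission mechanisms discussed in this section account for.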
Further analysis showed that data compression, though resulting in higher effective rates, adds delay due to the buffering process, whose size depends on the compression ratio. Error correction and detection add significant delay due to the extra framing required and the retransmission of errored blocks. This retransmission is effectively useless for VoIP applications, which cannot tolerate the additional wait for a retransmitted packet.
Furthermore, the block size used in error correction is not optimized for speech frames. Since procedures vary across modems with respect to when a partial buffer is transmitted, the delay impact is unpredictable. Other features of typical modems, such as the equalizer filters, the interleaving of data, and the trellis modulation, add more delay to speech frames.
While the problem is somewhat alleviated with high-speed modems, such as DSL or cable, there is still room to optimize the operations of the lower layers in order to keep the overall delay in a VoIP call as small as possible. This is particularly important if higher-data-rate coders, such as wideband coders, are used, thus necessitating larger speech frames. Particular functions, such as bit error correction and channel equalization, need to be revised and adapted to the delay and error tolerance of speech transmission.
Final Thoughts
The merging of telecom carriers with other service providers, such as cable and Internet providers, is becoming the norm as the business arguments for bundling services and reducing operational costs become more and more compelling. Among the remaining challenges in offering competitive VoIP-based telephony are the service quality and the speech quality, which consumers naturally expect to equal or even exceed those of the PSTN. While most of the hurdles inherent to the VoIP context have been tackled to some degree, more robust and optimized solutions remain to be developed.
The speech coders currently used were not originally developed for today's IP telephony applications or for high available bandwidth. As such, they do not fully exploit the available features and do not optimally address the problems of this new IP context. It is clear, however, that as these problems are properly addressed, IP-based telephony will be a natural progression from its current PSTN counterpart and will eventually provide higher voice quality and service quality, at a competitive cost to all parties involved.
Editor's Notes:
- This article is based on a presentation made at the 2002 Communications Design Conference; www.commdesignconference.com
- To view part 1 of this article, click here.
References
1. A. Watson and M. Sasse. "Measuring Perceived Quality of Speech and Video in Multimedia Conferencing Applications". Proceedings of ACM Multimedia, pp. 55-60. Sept. 1998.
2. J. Bolot and A. Vega-Garcia. "Control Mechanism for Packet Audio in the Internet". IEEE INFOCOM '96. Vol. 1, 1996, pp. 232-239.
3. C. Padhye and K. Christensen. "A New Adaptive FEC Loss Control Algorithm for Voice Over IP Applications". IEEE Computing, and Communications Conference, 2000. IPCCC '00. pp. 307-313.
4. R. Cox. "Three New Speech Coders from The ITU Cover a Range of Applications". IEEE Communications Magazine. Sept. 1997, pp. 40-47.
5. G. Schroder and M. Hashem. "The Road to G.729: ITU 8 kbps Speech Coding Algorithm with Wireline Quality". IEEE Communications Magazine. Sept. 1997, pp. 48-54.
6. R. Salami, C. Laflamme, B. Bessette, JP Adoul. "ITU-T G.729 Annex A: Reduced Complexity 8 kbit/s CS-ACELP Codec for Digital Simultaneous Voice and Data". IEEE Communications Magazine. Sept. 1997, pp. 56-63.
7. J. DeMartin, T. Unno and V. Viswanathan. "Improved Frame Erasure Concealment for CELP-Based Coders". IEEE ICASSP '00. Vol. 3, pp. 1483-1486. 2000.
8. F. Poppe, D. DeVleeschauwer and G. Petit. "Guaranteeing QoS to Packetized Voice over the UMTS Air Interface". IEEE IWQOS. 2000, pp. 85-91.
9. J.F. Wang, J.C. Wang, J.F. Yang, and JJ. Wang. "A Voicing-driven Packet Loss Recovery Algorithm for Analysis-by-Synthesis Predictive Speech Coders over Internet". IEEE Transactions on Multimedia. Vol. 3, No. 1, March 2001, pp. 98-107.
10. B. Goodman. "Internet Telephony and Modem Delay". IEEE Network. May/June 1999, pp. 8-16.
11. R. Martin, C. Hoelper, I. Wittke. "Estimation of Missing LSF Parameters Using Gaussian Mixture Models". IEEE Acoustics, Speech, and Signal Processing, 2001. Vol. 2, pp. 729-732.
12. N. Shacham and P. McKenney. "Packet Recovery in High-Speed Networks Using Coding and Buffer Management". IEEE INFOCOM '90. Vol. 1, pp. 124-131.
13. D. Rahika, J. Collura, T. Fuja, D. Sridhara, T. Fazel. "Error Coding Strategies for MELP Vocoder in Wireless and ATM environments". Speech Coding for Algorithms for Radio Channels (Ref. No. 2000/012), IEEE Seminar, 2000. Page(s): 8/1 -833
14. C. Erdmann et al. "A Candidate Proposal for a 3GPP Adaptive Multi-rate Wideband Speech Codec". IEEE ICASSP, 2001. Vol. 2, pp. 757-760.
15. K. Kim. "An Efficient Transcoding Algorithm for G.723.1 and EVRC Speech Coders". IEEE Vehicular Technology Conference, 2001. VTC 2001 Fall. Vol. 3, pp. 1561-1564.
16. E. Morgan. "Voice over Cable". White paper. www.telogy.com.
17. H. Kang, H. Kim, R. Cox. "Improving Transcoding Capability of Speech Coders in Clean and Frame Erasures Channel Environments". IEEE Workshop on Speech Coding, 2000, pp. 78-80.
18. M. Borella and D. Swider. "Internet Packet Loss: Measurement and Implications for End-to-End QoS". Architectural and OS Support for Multimedia Applications/Flexible Communication Systems/Wireless Networks and Mobile Computing, 1998, pp. 3-12.
19. C. Perkins, O. Hodson, V. Hardman. "A Survey of Packet Loss Recovery Techniques for Streaming Audio". IEEE Network. Sept/Oct 1998, pp. 40-48.
20. E. Nemer. "Acoustic Noise Reduction for Mobile Telephony". DSP World Spring Design Conference. April 2000.
21. D. O'Shaughnessy. "Enhancing Speech Degraded by Additive Noise or Interfering Speakers". IEEE Comm. Magazine, Feb. 1989, pp. 46-52.
About the Author
Elias Nemer is a senior member of technical staff at Intel Corp. He holds a B.Eng(EE), M.Eng(EE) and MBA from McGill University (Montreal, Canada) and a Ph.D. (EE) from Carleton University (Ottawa, Canada). Elias can be reached at enemer@ieee