One of the greatest concerns in the move towards packet voice networks has been the issue of quality. Call quality has typically been synonymous with voice quality. However, many factors other than "voice" affect a customer's perception of the quality of a call. While hearing intelligible words is the primary prerequisite of a useful phone call, a customer's experience of call quality includes much more than just the voice heard. The sounds heard when there are no voices on the line are often overlooked, yet they are important contributors to the perceived call quality.
These non-voice sounds are typically called background noise (BGN). Two technologies used by packet voice systems, echo cancellation (EC) and silence suppression (SS), have been employed to artificially replace the BGN of a call with what is commonly called comfort noise (CN).
This paper discusses some of the considerations and challenges in producing comfort noise that meets a customer's expectation of a toll-quality phone call. It also covers the technical issues related to BGN and CN generation in EC and SS applications.
Curbing the Echo
EC is not a new technology in telephony systems. In the traditional phone network, the delay that occurs in long distance phone calls creates an audible and annoying echo. Echo cancellers have been used to remove this echo. In packet voice networks, all calls have a delay imposed by the collection of voice samples to create a voice packet. This packetization delay has the same effect as the delay of traditional long distance phone calls. It is therefore necessary in a packet voice network to use an echo canceller on all calls regardless of the distance they travel.
As illustrated in Figure 1, an EC consists of an adaptive filter and a non-linear processor. In an idealized echo canceller, the adaptive filter would instantaneously create a perfect model of the echo response and subtract the resultant echo from the return signal. In the real world, an adaptive filter is neither instantaneous nor able to always create a perfect model.

Figure 1: Diagram of a typical echo cancellation system.
It is the non-linear processor that removes any residual echo when the adaptive filter is not up to the task. The non-linear processor does this by suppressing both the local signal and the residual echo, which are indiscernibly combined. If it failed to do this, the echo would return to the remote user, causing a very distracting echo and an unacceptable degradation of quality. When such suppression by the non-linear processor occurs, the local user's signal no longer makes it to the remote handset; this is an undesirable but inevitable side-effect of eliminating the residual echo. Despite this, no words are usually lost in the conversation because only one user at a time speaks during normal dialogue.
However, the actual BGN present at the local end no longer reaches the remote user, causing an unpleasant discontinuity. To circumvent this problem, a good non-linear processor must replace any suppressed local BGN by an artificially generated CN, which should be subjectively indistinguishable from the original.
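As a rough illustration of the substitution described above, the following Python sketch shows a non-linear processor that passes the signal untouched when no echo is expected and otherwise replaces it with noise at the modeled background level. The function name, the `far_end_active` flag and the `cn_level` parameter are assumptions made for this example; a real non-linear processor would also examine the residual level and would shape the noise spectrum rather than emit plain white noise.

```python
import random

def nonlinear_processor(residual, far_end_active, cn_level, rng=None):
    """Replace the residual frame with comfort noise while the far end talks.

    residual       -- samples left after the adaptive filter (list of floats)
    far_end_active -- True when the remote user is talking (echo is expected)
    cn_level       -- target standard deviation of the comfort noise
                      (hypothetical parameter supplied by a BGN model)
    """
    if not far_end_active:
        # No echo expected: the local signal and its BGN pass through as-is.
        return list(residual)
    rng = rng or random.Random()
    # Residual echo and local signal cannot be told apart, so both are
    # suppressed and the gap is filled with white Gaussian noise.
    return [rng.gauss(0.0, cn_level) for _ in residual]
```

With `far_end_active` set to `False` the frame is returned unchanged, so the remote user hears the real BGN whenever suppression is not needed.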
Silence Suppression
One of the goals of a packet voice network is to reduce both the required power and bandwidth for voice communication. The most widespread method is to make use of a technique commonly called silence suppression (SS).
SS algorithms cease sending a signal when no voice is present; this is called a silence period (even though there may be background noise present). Since a person typically speaks only half the time, this reduces transmission bandwidth and power by about half. Bandwidth is especially costly in wireless infrastructure; low power consumption is just as important for battery-operated devices such as mobile phones.
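The saving can be illustrated with a very small sender-side sketch. It assumes a simple energy threshold acts as the voice detector; the threshold of -40 dB and the function names are illustrative choices, not values from any standard.

```python
import math

def frame_energy_db(frame):
    """Mean-square energy of a frame in dB (the floor avoids log of zero)."""
    mean_square = sum(x * x for x in frame) / len(frame)
    return 10.0 * math.log10(max(mean_square, 1e-12))

def frames_to_send(frames, vad_threshold_db=-40.0):
    """Indices of the frames a silence-suppressing sender would transmit."""
    return [i for i, frame in enumerate(frames)
            if frame_energy_db(frame) > vad_threshold_db]
```

If speech and silence frames alternate, only every other frame index is returned, which corresponds to the roughly 50 percent reduction mentioned above.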
Making Noise in the Background
Background noises (BGN) come in many different shapes and sizes (figuratively speaking). In layman's terms, BGN is often described as "office ventilation noise", "car noise", "street noise", "cocktail noise", "background music", etc. Although this classification is practical for human understanding, the algorithms that model and produce comfort noise see things in more mathematical terms.
The most basic and intuitive property of BGN is its loudness. This is referred to as the signal's energy level. Another less obvious property is the frequency distribution of the signal. For example, the hum of a running car and that of a vacuum cleaner can have the same energy level, yet they do not sound the same: these two signals have distinctly different spectra.
The third important property of BGN is the variability over time of the first two properties. When a BGN's energy level and spectrum are constant in time, it is said to be stationary. Some environments are prone to contain non-stationary BGN. The best example is street noise, in which cars come and go.
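The three properties can be made concrete with a short sketch. It uses a direct (slow) discrete Fourier transform to keep the example free of dependencies; the band count, the 3 dB tolerance and the function names are assumptions for illustration only, and the steadiness check looks at energy alone for brevity.

```python
import cmath
import math

def band_energies(frame, n_bands=4):
    """Split the power spectrum of a frame into n_bands equal-width bands."""
    n = len(frame)
    half = n // 2
    power = []
    for k in range(half):
        s = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                for t in range(n))
        power.append(abs(s) ** 2 / n)
    width = half // n_bands
    return [sum(power[b * width:(b + 1) * width]) for b in range(n_bands)]

def energy_is_steady(frames, tolerance_db=3.0):
    """True if total frame energy stays within tolerance_db across frames.

    Only the first property (energy) is checked; a fuller stationarity
    test would compare band_energies() between frames as well.
    """
    levels = [10.0 * math.log10(max(sum(x * x for x in f), 1e-12))
              for f in frames]
    return max(levels) - min(levels) <= tolerance_db
```

Two tones of equal amplitude but different frequency have the same total energy, yet `band_energies` places that energy in different bands, which is the car-versus-vacuum-cleaner distinction in miniature.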
Good CN algorithms must cope well with all types of BGN. The regenerated comfort noise must match the original signal as closely as possible. Furthermore, in instances when the CN model poorly matches the original signal, a good algorithm should try to minimize the degradation of subjective quality. Today the trend in CN algorithms is to base them on a technique generically called spectral comfort noise (SCN), which tries to recreate the power and frequency spectrum of the original noise.
Common Ground
EC and SS share many sub-functions (Figures 2 and 3). The first is the BGN modeler. The modeler's function is to distinguish voice from BGN, and to calculate the moving average of the BGN over a given time window when voice is not present. It then passes the BGN model, consisting of the BGN energy level and spectrum, to the CN filter coefficient generator.

Figure 2: Comfort noise circuit in an echo canceller.

Figure 3: Basic silence suppression in a packet network.
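A minimal sketch of the BGN modeler's averaging step follows. It substitutes an exponential average for the windowed moving average described above, because that needs no frame history; the smoothing factor `alpha` and the function name are assumptions for this example.

```python
def update_bgn_model(model, band_levels, voice_present, alpha=0.1):
    """Return the updated per-band BGN model.

    model         -- current per-band averages, or None before the first update
    band_levels   -- per-band energies measured on the current frame
    voice_present -- decision from a voice detector (assumed to exist)
    alpha         -- smoothing factor of the exponential average (assumed)
    """
    if voice_present:
        return model                 # freeze the model during speech
    if model is None:
        return list(band_levels)     # first noise-only frame seeds the model
    return [(1.0 - alpha) * m + alpha * x
            for m, x in zip(model, band_levels)]
```

Freezing the model while voice is present keeps speech energy out of the average, at the cost of never updating if the detector reports voice continuously.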
Based on information received from the modeler, CN filter coefficients are generated periodically in a format required by the CN filter. This CN filter is applied to a continuous stream of Gaussian white noise in order to shape this generic signal into a spectrum- and energy-matching mimic of the original BGN, which is the sought CN.
The first step in the design of a comfort noise system is to choose the CN filter structure. This choice implies a trade-off between processing power and faithful recreation of the original BGN.
Many filter types exist that can fulfill this role. They all have the property of providing increased spectral precision when granted more processing power per time-unit. One must also choose the number of desired frequency bands and their distribution across the spectrum of the telephone signal (ranging from approximately 0 to 4 kHz). Two often-encountered distributions of frequency bands are the linear distribution (i.e. N bands of (4000/N) Hz each in width) and the logarithmic distribution (i.e. like musical notes).
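The two band distributions can be compared with a short sketch. Because a logarithmic scale cannot start at 0 Hz, the example assumes a lower edge of 100 Hz for that case; this value and the function names are illustrative.

```python
def linear_bands(n_bands, f_max=4000.0):
    """N bands of equal width f_max / N, starting at 0 Hz."""
    width = f_max / n_bands
    return [(i * width, (i + 1) * width) for i in range(n_bands)]

def log_bands(n_bands, f_min=100.0, f_max=4000.0):
    """N bands whose upper/lower edge ratio is constant (like musical notes).

    f_min is an assumed lower limit; a log scale has no band starting at 0.
    """
    ratio = (f_max / f_min) ** (1.0 / n_bands)
    edges = [f_min * ratio ** i for i in range(n_bands + 1)]
    return list(zip(edges[:-1], edges[1:]))
```

With eight bands, the linear layout gives 500 Hz per band, while the logarithmic layout packs narrow bands at low frequencies and wide ones near 4 kHz.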
Subjective testing is the best way to verify the adequacy of the chosen filter. A large selection of typical background noises must be indistinguishably reproduced by the filter system. It is trivial to generate filter coefficients from a purely stationary signal, so a very simple BGN modeler can be used for this exercise. Some time-varying BGN sources (such as "street noise" and "cocktail noise") are difficult to model with standard linear filters.
Once the CN filter has been chosen, designers can keep the CN quality at its peak by always feeding it the best CN filter coefficients. The design of the BGN modeler is the key to maintaining optimal CN filter coefficients.
A basic implementation of an SCN algorithm is defined in the International Telecommunication Union (ITU) recommendation G.711, Appendix II (G.711.II). It uses a voice activity detector (VAD) to determine when only BGN is present in the processed signal. It then uses an autocorrelation of the signal to average out both energy level and spectrum of the BGN. This autocorrelation is converted into lattice filter coefficients using the Levinson-Durbin algorithm. These coefficients are then interpreted by a lattice filter to produce the actual comfort noise.
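The chain described above (autocorrelation, Levinson-Durbin recursion, lattice filter driven by white noise) can be sketched in a few dozen lines. This is an illustration of the general technique, not a transcription of the ITU text: the filter order, the function names and the gain calculation are assumptions, and the VAD that selects noise-only samples is taken as given.

```python
import math
import random

def autocorrelation(x, order):
    """Unnormalized autocorrelation r[0..order] of the samples in x."""
    return [sum(x[t] * x[t - lag] for t in range(lag, len(x)))
            for lag in range(order + 1)]

def levinson_durbin(r):
    """Return (reflection coefficients, prediction error) from r[0..p]."""
    a = [1.0]          # predictor polynomial, a[0] is always 1
    err = r[0]
    ks = []
    for m in range(1, len(r)):
        acc = r[m] + sum(a[i] * r[m - i] for i in range(1, m))
        k = -acc / err
        ks.append(k)
        a = [1.0] + [a[i] + k * a[m - i] for i in range(1, m)] + [k]
        err *= 1.0 - k * k
    return ks, err

def lattice_synthesis(excitation, ks):
    """All-pole lattice filter: shapes the excitation with coefficients ks."""
    p = len(ks)
    b = [0.0] * p      # b[m] holds the stage-m backward error of the last sample
    out = []
    for e in excitation:
        f = e
        new_b = [0.0] * p
        for m in range(p, 0, -1):
            f -= ks[m - 1] * b[m - 1]
            if m < p:
                new_b[m] = b[m - 1] + ks[m - 1] * f
        if p:
            new_b[0] = f
        b = new_b
        out.append(f)
    return out

def comfort_noise(bgn_samples, n_out, order=4, rng=None):
    """Generate n_out samples that mimic the energy and spectrum of bgn_samples."""
    r = autocorrelation(bgn_samples, order)
    if r[0] <= 0.0:
        return [0.0] * n_out                  # digital silence: nothing to model
    ks, err = levinson_durbin(r)
    sigma = math.sqrt(max(err, 0.0) / len(bgn_samples))
    rng = rng or random.Random()
    excitation = [rng.gauss(0.0, sigma) for _ in range(n_out)]
    return lattice_synthesis(excitation, ks)
```

The prediction error returned by the recursion sets the level of the white-noise excitation, so the output matches the energy of the modeled noise as well as its spectral envelope.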
The rudimentary BGN modeling algorithm described above is sufficient for trivial BGN cases, but performs poorly when faced with complex BGN. Unfortunately, such complex BGN has become commonplace in today's telephony networks, especially given the ubiquity of mobile phones used just about everywhere. A more robust scheme is presented later in this paper.
Principles of Human Sound Perception
Much research has been done in the characterization and modeling of the human ear. In the context of CN generation, the most important characteristic of human hearing is its acute sensitivity to transitions and discontinuities in otherwise stationary signals. This acute sensitivity poses an important challenge in comfort noise generation because the transition from the original BGN to CN often occurs when only BGN is present in the signal.
When played in a continuous and uninterrupted manner, a CN model that loosely resembles the original BGN will be both natural and indistinguishable from the original. However, during transitions between the CN and the original BGN and the reverse, the same CN model may cause unpleasant and noticeable discontinuities. When these transitions are no longer subjectively noticeable, one has an adequate CN model.
There are moments during normal dialog when transitions in the BGN are less noticeable. For example, when someone is speaking during a conversation, they are less focused on what they are hearing in their handset than they would be when listening. This works to the advantage of echo cancellation applications, in which the majority of BGN suppression occurs only when the remote user is speaking (and therefore echo is returning to him).
In these situations, two transitions will occur. The first one will occur when the talk spurt begins: at that time there will be a switch from the real BGN to the CN signal. The second one will occur when the talk spurt ends: the suppression will cease and the original BGN will be allowed to pass once more.
During the first transition, the remote user is speaking and is unlikely to notice a transition in the BGN he is receiving. However, at the end of the spurt, the remote user is no longer speaking and is becoming more sensitive to the signal he is hearing. The time between the end of speech and the occurrence of the second transition is approximately twice the end-to-end delay of the network. Therefore, the longer the network delay, the more noticeable the transition.
Different properties apply to SS applications. Three types of transitions to or from the CN model can occur in this environment:
- The first transition occurs at the beginning of a talk spurt. It causes a switch from the CN to the signal containing both the voice and the original BGN.
- The second transition occurs when the talk spurt ends and CN is again put on the line.
- The third type of transition occurs when the injected CN is updated because the modeled BGN changed during a prolonged silence period.
In the first transition (i.e. start of received speech), the listener expects a transition in the signal due to the arrival of a strong new signal (the voice), and any small changes in the BGN are not noticeable. In the second case (i.e. end of received speech), if the transition occurs close enough to the end of the voice spurt, the listener's ear has not yet accustomed itself to the stationary environment of the BGN, and this transition is barely noticeable even with a non-optimal CN model. However, as the delay between the end of the voice spurt and the transition to CN increases, the ear becomes more accustomed to the original BGN, and is bothered by the transition to a non-optimal CN signal.
In the third situation (i.e. prolonged silence period with a change in background noise), the ear has been hearing nothing but a stationary CN signal for a long period of time, so it will be very bothered by even the slightest change in energy or spectrum. Because of this, much care should be taken in updating the CN model during a continuous silence period. As a matter of fact, it is sometimes better not to refresh the CN model even if the original BGN has changed. In that case, the model should be updated during the next talk spurt.
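One possible way to express this caution in code is to limit how far the CN model may move per update during silence, while adopting the measured BGN in full during a talk spurt. The step limit is an assumption introduced for this sketch; setting it to zero gives the "do not refresh" policy mentioned above.

```python
def next_cn_model(current_db, measured_db, in_talk_spurt, max_step_db=1.0):
    """Return the CN model (per-band levels in dB) to use next.

    max_step_db is a hypothetical tuning parameter: the largest per-band
    change allowed in one update while the listener hears only CN.
    """
    if in_talk_spurt:
        # Voice masks the change, so the new model can be adopted at once.
        return list(measured_db)
    # During silence, move each band toward the measurement by a bounded step.
    return [c + max(-max_step_db, min(max_step_db, m - c))
            for c, m in zip(current_db, measured_db)]
```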
Another Problem with Hearing
Another often cited property of human hearing is the number of independent frequency bands that the human ear can distinguish. This should theoretically determine the number of bands that need to be distinguishable in the CN system sub-functions.
This number of bands is in fact quite large. However, most BGNs can be mimicked convincingly with far fewer bands than the human ear can distinguish. For example, reproducing the sound of a vacuum cleaner requires much less spectral precision than reproducing Beethoven's 5th Symphony.
Therefore, designers need to carefully choose the minimum number of bands necessary to adequately model the BGN encountered in telephony systems. This number is on the order of a couple of dozen (anywhere between 10 and 50), which is manageable using current signal processing technology.
However, it's important to note that this number of bands cannot reproduce all BGN signals well. Specifically, signals with strong narrow frequency components will not be adequately reproduced. Later, we'll discuss methods to circumvent this problem.
Advancing Comfort Noise Generation
In an echo canceller, anytime the remote user speaks, the non-linear processor must suppress the return signal if it contains residual echo. When this occurs, the local BGN will also be unwillingly suppressed along with the residual echo. Because the remote or far-end user may speak at any time, the non-linear processor must have a valid CN model at all times. Otherwise, discontinuities will arise in the returning signal each time the remote user speaks.
Finding the BGN in the above situation is particularly difficult because the returning signal contains the mix of the local voice, the local BGN, and the echo of both the remote voice and the remote BGN. At most points in a conversation one of the two users is speaking.
In the previously mentioned method of updating the moving average of the BGN (see ITU G.711 Appendix II), the VAD will almost always detect the presence of voice in the signal. This will prevent the average BGN from ever being updated. An alternate method must be developed that is capable of updating the moving average even while voice is present, without allowing the voice to corrupt the CN model. Even in existing wireless networks, some CN systems erroneously update their BGN model during voice periods, thus producing undesired artificial periods of background noise, usually after voice periods. This is an example of the imperfect results achieved using typical methods.
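This paper does not disclose a specific alternate method. As one generic illustration of how a noise estimate can keep updating while voice is present, the sketch below tracks the minimum frame energy over a sliding window, on the assumption that speech pauses briefly within any window and exposes the noise floor. The class name and window length are hypothetical, and a minimum tracker is known to read slightly low compared with the true average.

```python
from collections import deque

class MinimumTracker:
    """Estimate the noise floor as the minimum recent frame energy."""

    def __init__(self, window_frames=50):
        # Only the most recent window_frames energies are kept.
        self.history = deque(maxlen=window_frames)

    def update(self, frame_energy):
        """Add one frame energy and return the current floor estimate."""
        self.history.append(frame_energy)
        return min(self.history)
```

Loud frames never raise the estimate, so no voice detector is needed; the estimate follows a rising noise floor only after the old minimum leaves the window.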
A high-quality BGN modeler is one of the key intellectual property (IP) building blocks necessary to producing commercial-grade echo cancellation, and is a major differentiator between echo cancellation algorithms available today.
Advanced Comfort Noise for Silence Suppression
As mentioned earlier, a CN generation system may encounter some BGN signals that it will have difficulty reproducing. These include non-stationary signals as well as signals that have very narrow frequency components. Luckily, in the SS application, the VAD function has the luxury of deciding to continue transmitting the original BGN, even in the absence of speech, if it determines that the quality of the comfort noise that would be regenerated is inadequate.
To make this decision, a flag is created that attests to the validity of the CN model. A new sub-function, the CN validity checker (Figure 4), generates this flag by comparing the spectral content of the current local signal to the signal that would be generated based on the current CN model.

Figure 4: Advanced silence suppression in a packet network.
The new sub-function can have two thresholds: one that is applied when the VAD is currently in silence/suppress mode and another when the VAD is in voice/send mode. This is because, once the decision has been made to suppress, changes in the BGN will not be forwarded to the remote user and thus not noticed.
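The two-threshold decision can be sketched as follows. The distance measure (largest per-band difference in dB) and both threshold values are assumptions chosen for illustration; a production checker would use a perceptually motivated measure.

```python
def cn_model_is_valid(signal_bands_db, model_bands_db, suppressing,
                      send_threshold_db=2.0, suppress_threshold_db=4.0):
    """Return True if the CN model is close enough to the current signal.

    suppressing -- True when the VAD is already in silence/suppress mode.
    The looser threshold applies in that mode because BGN changes are not
    forwarded to the remote user and thus go unnoticed. Threshold values
    are hypothetical.
    """
    distance = max(abs(s - m)
                   for s, m in zip(signal_bands_db, model_bands_db))
    threshold = suppress_threshold_db if suppressing else send_threshold_db
    return distance <= threshold
```

Using a stricter threshold to enter suppression than to remain in it also gives the decision some hysteresis, which helps avoid rapid toggling between sending and suppressing.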
Using the CN validity checker incurs extra processing cost on the transmitter side, because the CN generation block must be implemented in the transmitter as well as in the receiver end of the call. In addition, the CN validity checker itself can be very expensive: it must model what the human ear can distinguish and must therefore analyze many frequency bands.
In environments containing very complex background noise (such as restaurants or street noise), such advanced SS techniques are necessary to maintain the subjective quality that users expect.
Wrap Up
During our discussion above, we laid out various problems and issues related to comfort noise (CN) generation in echo cancellation (EC) and silence suppression (SS) applications. We also briefly presented the concepts of how advanced digital signal processing techniques can be used to optimize these two functions in modern telephony systems.
Indeed, advanced EC techniques, such as the ones discussed in this paper, are increasingly needed in current and next-generation telecommunications systems. New voice-over-packet (VoP) technology must make use of echo cancellation due to the increased transmission delays incurred in the packet world. Such technology will only be widely accepted by users if it offers the same call quality available today in the PSTN. In the wireless market, customers have agreed to sacrifice some quality for convenience; yet better call quality is still the main factor in customer satisfaction and retention. High-quality CN systems are therefore key to the success of current and future network developments.
About the Author
Frédéric Bourget is a senior product manager at Octasic. He received his B. Eng (Physics) from Laval University in Quebec and is an active member of the ITU-T Study Group 15/WP2 focusing on signal processing network elements. Frédéric can be reached at frederic.bourget@octasic.com.