When the capture buffer fills, or at periodic intervals, its contents are passed to a filter/dispatcher function for further processing. The filter/dispatcher function may perform some pre-processing filtering, and dispatch the frame or packet into the statistics calculator followed by the decoder.
The statistics calculator computes a number of statistics about the packet. These statistics usually include packet counts, byte counts, protocol distributions, packet delay variation estimations, and transmission delay estimations.
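As an illustration of one such statistic, the sketch below estimates packet delay variation with the running interarrival jitter estimator defined in the RTP specification (RFC 3550). The article does not say which estimator the statistics calculator uses, so treat this choice, and the function names `update_jitter` and `jitter_for_stream`, as assumptions made for the example.

```python
def update_jitter(jitter, prev_transit, arrival, rtp_timestamp):
    """Fold one packet into the running jitter estimate.

    All values are in the same units (for example, RTP timestamp ticks).
    Returns (new_jitter, transit) so the caller can keep the state.
    """
    transit = arrival - rtp_timestamp
    if prev_transit is None:
        # First packet of the stream: nothing to compare against yet.
        return jitter, transit
    d = abs(transit - prev_transit)
    # RFC 3550 smoothing: move 1/16 of the way toward the new sample.
    return jitter + (d - jitter) / 16.0, transit


def jitter_for_stream(packets):
    """packets is an iterable of (arrival, rtp_timestamp) pairs."""
    jitter, prev_transit = 0.0, None
    for arrival, rtp_timestamp in packets:
        jitter, prev_transit = update_jitter(
            jitter, prev_transit, arrival, rtp_timestamp
        )
    return jitter
```

The 1/16 gain keeps the estimate cheap to update per packet, which matters when the calculator runs on every frame in the capture buffer.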
The decoder is responsible for the actual packet decodes. The decoder can resolve VoIP call signaling protocols to determine when a call is initiated and terminated, as well as determining what call a particular voice packet might belong to, as several calls may be active simultaneously.
The information generated by the statistics calculator and decoder components is then passed into the agent along with other pertinent data, such as call initiation and termination events.
As voice packets are received, the locally generated receive-timestamp and the contents of specific packet header fields are indicated to the passive call-monitoring agent. The agent first hands the packet to the jitter buffer emulator to determine whether packets have been lost or whether packets would be discarded due to jitter or excessive delay. The agent records both events, since both significantly impact voice quality calculations and perception by the end-user.
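The following is a minimal sketch of the two events the emulator records, assuming a fixed-delay jitter buffer. The class name `JitterBufferEmulator`, the 40-ms nominal delay, and the decision to ignore sequence number wraparound are all assumptions made for illustration; a production emulator would also bound its memory rather than keep every sequence number.

```python
class JitterBufferEmulator:
    """Fixed-delay jitter buffer emulation (illustrative assumptions only)."""

    def __init__(self, nominal_delay_ms=40.0):
        self.nominal_delay_ms = nominal_delay_ms
        self.base_recv_ms = None    # receive timestamp of the first packet
        self.base_media_ms = None   # media timestamp of the first packet
        self.seen = set()           # sequence numbers received so far
        self.discarded = 0          # packets that arrived after their playout time

    def on_packet(self, seq, media_ms, recv_ms):
        """Classify one packet as 'played', 'discarded', or 'duplicate'."""
        if seq in self.seen:
            return "duplicate"
        self.seen.add(seq)
        if self.base_recv_ms is None:
            # The first packet anchors the playout schedule.
            self.base_recv_ms = recv_ms
            self.base_media_ms = media_ms
            return "played"
        playout_ms = (self.base_recv_ms + self.nominal_delay_ms
                      + (media_ms - self.base_media_ms))
        if recv_ms > playout_ms:
            self.discarded += 1
            return "discarded"
        return "played"

    def lost(self):
        """Packets never seen, inferred from gaps in the sequence numbers."""
        if not self.seen:
            return 0
        return (max(self.seen) - min(self.seen) + 1) - len(self.seen)
```

Keeping lost and discarded counts separate mirrors the text: a packet that never arrives and a packet that arrives too late are different network problems, even though both leave a gap in the audio.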
The timestamp required in a class A target environment should deliver 1-ms or finer resolution. Generally, a 1-ms timestamp resolution allows the jitter buffer to handle most if not all popular voice codec frame types. More accurate timestamps, however, can be used.
Tracking in Class B Designs
Class B environments, as shown in Figure 2, typically include firewalls, edge routers, or traffic shapers. These environments are designed around the concepts of packet filtering, packet queuing, and packet forwarding, and may not have the ability to generate a local timestamp for packets received off the physical medium.

Figure 2: Typical class B environment.
When locally generated timestamps are not available for each packet received, the passive call-monitoring agent should instead be driven by a real-time clock interrupt. Using that interrupt, the agent can estimate packet loss due to packet delay variation by running the fully functional jitter buffer emulator.
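One way to picture interrupt-driven emulation is sketched below. It assumes the clock interrupt fires once per voice frame interval and that each tick plays out the next expected sequence number. The class name `TickDrivenEmulator`, its counters, and the start-up delay are hypothetical, and sequence number wraparound is ignored.

```python
class TickDrivenEmulator:
    """Jitter buffer emulation driven by clock ticks instead of timestamps."""

    def __init__(self, start_delay_ticks=2):
        self.start_delay_ticks = start_delay_ticks
        self.buffer = set()          # sequence numbers waiting for playout
        self.next_seq = None         # next sequence number due for playout
        self.ticks_until_start = 0
        self.played = 0
        self.missing = 0             # not in the buffer when its slot came up
        self.late = 0                # arrived after its playout slot had passed

    def on_packet(self, seq):
        """Called for each voice packet; no receive timestamp is needed."""
        if self.next_seq is None:
            self.next_seq = seq
            self.ticks_until_start = self.start_delay_ticks
        if seq < self.next_seq:
            self.late += 1
            return
        self.buffer.add(seq)

    def on_tick(self):
        """Called from the real-time clock interrupt handler."""
        if self.next_seq is None:
            return
        if self.ticks_until_start > 0:
            self.ticks_until_start -= 1
            return
        if self.next_seq in self.buffer:
            self.buffer.remove(self.next_seq)
            self.played += 1
        else:
            self.missing += 1
        self.next_seq += 1
```

In this sketch the agent would stop delivering ticks to a stream once the call terminates, since otherwise every later tick would count as a missing frame.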
As Figure 2 above illustrates, class B environments typically include at least one media access controller (MAC), or a similar such device, that has the ability to promiscuously receive all traffic traversing the physical medium. The MAC driver should handle interrupts from (or poll) the MAC hardware, transferring frames or packets into a frame or packet buffer. Class B devices may often include multiple physical network medium interfaces as indicated in the figure.
As each frame or packet is received, it is passed up the stack to the filter function. The filter is responsible for applying any pre-processing frame or packet filtering.
From the filter, the frame/packet buffer is passed to the decoder. The decoder performs the necessary frame/packet decoding to direct the dispatcher on how to route the frame or packet.
The dispatcher examines the basic decode information and determines how to continue the packet processing. The dispatcher can distinguish VoIP signaling and voice data packets and then pass call start and termination events on to the agent for processing. The dispatcher is also responsible for identification and association of the incoming packets to a particular VoIP call.
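The association step can be sketched as a lookup table that signaling events populate and voice packets consult. The `Dispatcher` class, its method names, and the use of destination address and port as the flow key are assumptions for illustration; a real dispatcher might also key on source address or on an RTP stream identifier.

```python
class Dispatcher:
    """Maps voice packets to calls using flows learned from signaling."""

    def __init__(self):
        self.calls_by_flow = {}   # (dst_ip, dst_port) -> call_id

    def on_call_start(self, call_id, flows):
        """Signaling announced a call; remember its media flows."""
        for flow in flows:
            self.calls_by_flow[flow] = call_id

    def on_call_end(self, call_id):
        """Signaling terminated a call; forget its media flows."""
        self.calls_by_flow = {
            flow: cid for flow, cid in self.calls_by_flow.items()
            if cid != call_id
        }

    def call_for_packet(self, dst_ip, dst_port):
        """Return the call a voice packet belongs to, or None if unknown."""
        return self.calls_by_flow.get((dst_ip, dst_port))
```

Because several calls may be active at once, a constant-time lookup per packet keeps the dispatcher from becoming the bottleneck on a forwarding device.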
The delay estimator is responsible for estimating the transmission delay for packets received. This information is passed into the monitoring agent to ensure accurate call quality measurement.
Transmission Delay Estimation
In both the class A and B environments, the embedded monitoring agent uses transmission delay estimates to calculate impairment factors incurred as a result of excessive delay and ultimately to calculate the quality level, often called an "R factor". There are several methods suitable for delay estimation.
One method is passive estimation of round-trip delay based on RTP Control Protocol (RTCP) sender and receiver reports. When an RTCP receiver report is received, the appropriate fields can be passed into the agent, which can then calculate the round-trip delay using the method specified in IETF RFC 1889.
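The calculation itself is short. In the RTP specification (RFC 3550, which superseded RFC 1889), the round-trip delay is the report's arrival time minus the last sender report timestamp (LSR) minus the delay since that sender report (DLSR), all expressed in units of 1/65536 second. The function name below is an assumption; the field names follow the RTCP report block.

```python
def rtcp_round_trip_seconds(arrival_ntp32, lsr, dlsr):
    """Round-trip delay from one RTCP reception report block.

    arrival_ntp32: arrival time of the report, as the middle 32 bits of
                   an NTP timestamp (16.16 fixed point seconds).
    lsr:           "last SR timestamp" field from the report block.
    dlsr:          "delay since last SR" field, in 1/65536 second units.
    Returns None when the reporter has not yet received a sender report.
    """
    if lsr == 0:
        return None
    # Mask to 32 bits so the subtraction survives timestamp wraparound.
    rtt_units = (arrival_ntp32 - lsr - dlsr) & 0xFFFFFFFF
    return rtt_units / 65536.0
```

With the example values given in the RTP specification (arrival 0xB7108000, LSR 0xB7052000, DLSR 0x00054000), the function returns 6.125 seconds. Because the agent is passive, it can only apply this when the monitoring point sees both the sender reports and the matching reception reports.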
Under another method, round-trip delay is estimated using an Internet control message protocol (ICMP) echo. An ICMP echo request can be sent from the target environment to each call endpoint of interest. When the requests are transmitted, a timer can be started that is used to determine the elapsed time to when the ICMP echo response is received.
Using ICMP echo for active round-trip delay estimation has some drawbacks. For example, if the call passes through a firewall, most firewalls prevent ICMP echo requests from passing through. In these situations, another protocol can be used to request a response from the call endpoints, recording the elapsed time between the request and response.
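Whether the probe is an ICMP echo or some other request, the measurement reduces to timing a request and its response. The sketch below shows only that timing logic. `send_probe` is a hypothetical callable, supplied by the target environment, that transmits the request, waits for the response or a timeout, and returns True only if a response arrived.

```python
import time


def measure_round_trip_ms(send_probe, clock=time.monotonic):
    """Time one request/response exchange.

    send_probe: hypothetical callable that sends the request and blocks
                until the response arrives (True) or it gives up (False).
    clock:      monotonic time source in seconds; injectable for testing.
    Returns the elapsed time in milliseconds, or None if no response came.
    """
    start = clock()
    answered = send_probe()
    elapsed_ms = (clock() - start) * 1000.0
    return elapsed_ms if answered else None
```

A monotonic clock is used so that adjustments to the wall clock during a probe cannot produce a negative or inflated delay.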
If it is not possible to implement either passive or active round-trip delay estimations, don't panic. A call-quality monitoring agent can still generate useful call quality metrics.
Generally, delay does not affect the network "R" factor, a call quality measure that excludes perceptual effects. Moreover, the impact of round-trip delay on the agent's user "R" factor, a call quality metric that includes perceptual effects and thus delay, is relatively small for round-trip delays as high as 300 ms, which is unusually high for IP networks. The impact is small because the agent measures call or speech clarity rather than conversational difficulty, where delay plays a much bigger role and produces significant impairments such as double talk.
In cases of extreme delay--350 ms or higher--the accuracy of the user "R" factor will suffer. However, even in these extreme situations, designers can still use the network "R" factor as a measure of voice quality.
If there is no way to estimate round-trip transmission delay, the agent's quality metrics can still be generated, but will tend to show somewhat higher quality than actually experienced. While this is not a good thing, the data is still useful in trend analysis, i.e. over time, one can still discern if call quality provided by the network is improving or getting worse.
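To put rough numbers on the delay discussion above, the sketch below uses a commonly quoted simplification of the delay impairment term from the ITU-T G.107 E-model. It is not necessarily the calculation the agent performs. It also assumes that one-way delay is half the round-trip delay and ignores codec and jitter buffer delay, so treat the results as indicative only.

```python
def delay_impairment(round_trip_ms):
    """Approximate R-factor points lost to delay (simplified E-model term).

    Assumes one-way delay is half the round trip and that network delay
    is the only contributor; both are simplifications for illustration.
    """
    one_way_ms = round_trip_ms / 2.0
    impairment = 0.024 * one_way_ms
    if one_way_ms > 177.3:
        # Past roughly 177 ms one way, impairment grows much faster.
        impairment += 0.11 * (one_way_ms - 177.3)
    return impairment
```

Under these assumptions a 300-ms round trip costs about 3.6 R-factor points, while a 600-ms round trip costs about 20.7, which is consistent with delay mattering little until it becomes extreme.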
Resource Requirements
Now that we've laid out how the agent will work in the class A and B environments, let's look at the system resources required to integrate these agents. In order for a call quality agent to generate accurate voice quality metrics, the target must provide sufficient host processor resources, enough code space to support the agent, and heap and stack memory.
Let's look at these three requirements in greater detail. Note that the host system resource requirements indicated below are general estimates based on the use of an Intel Pentium III style processor and are included for rough resource estimation purposes. Use of a different processor will affect the required resources. Different compilers will also generate differing agent code sizes based on the compiler optimizations. Memory requirements also vary with compilers due to memory alignment optimizations.
1. Host Processor Resources
Determining a valid measure of processor loading generated by a very lightweight agent implementation is difficult to do. One logical basis for processor load calculations is the number of Intel Pentium machine instructions required to complete a certain agent task. The actual number of machine cycles will vary based on compiler optimizations and instruction pipelining. The number of instructions indicated is an approximation of the worst-case processing path for the indicated event.
Generally, the task set is as follows:
- Create voice call packet stream. As a new call is detected, instruct the agent to construct and initialize its call-tracking data set.
- Per packet tracking. As each packet is received, its timestamp and some of the transport protocol packet header fields are indicated to the agent for the jitter buffer emulation functions.
- Calculate call quality at desired intervals during the call. If quality metrics are desired more frequently than just at the end of the call, instruct the agent to calculate call quality on demand. This is especially useful for generating alarms.
- Calculate call quality at the end of the call. When the call is indicated as terminated, the call quality metrics are automatically calculated and the MIB or data store is updated.
Total instructions to accomplish this set of tasks would be on the order of 1500 to 2000 instructions for a class A target while the class B target would require anywhere from 50 to 100 more instructions.
The additional processing in class B environments is attributed to the fact that packet receive timestamps are not available. Thus a real-time interrupt is used to trigger the jitter buffer "playout" of received packets. In order to handle this type of architecture, the agent must record more information for each packet received, thus the additional instructions per packet.
2. Code Space
Estimated code space requirements for the call quality agent code, compiled for an Intel Pentium processor without optimizations, would be in the range of 50 to 100 KB. With optimizations, the estimated code space requirements could be lowered to 20 to 30 KB.
3. Memory Needs
In order to effectively implement a passive monitoring agent, the target must provide heap memory for each agent instance. This includes the RAM needed to store information about the voice codecs supported by the agent. Generally 700 to 1000 bytes could be allocated for a class A or class B environment.
There should also be on the order of 200 bytes of heap memory allocated for each physical MAC interface supported by the agent. Each voice stream constructed on each physical interface should also be allocated heap memory. Generally, 150 to 300 bytes would suffice in a Class A environment, while the Class B target would require approximately twice that amount, as it must record more data to handle the real-time interrupt.
The actual amount of stack memory needed would also depend upon the number of compile-time build options offered within the agent. These build options control the jitter buffer emulator configuration and statistical information about each voice call stream. Stack memory usage would be on the order of 125 to 175 bytes in either environment.
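The heap figures above can be combined into a quick sizing estimate. The sketch below simply adds the per-agent, per-interface, and per-stream ranges quoted in this section, taking "approximately twice" for Class B as exactly double. The function name and its parameters are assumptions made for the example, and stack usage is not included because it is not additive per stream.

```python
def heap_estimate_bytes(interfaces, streams_per_interface, target_class="A"):
    """Return a (low, high) heap estimate in bytes from the quoted ranges."""
    if target_class not in ("A", "B"):
        raise ValueError("target_class must be 'A' or 'B'")
    agent_low, agent_high = 700, 1000        # per agent instance
    per_interface = 200                      # per physical MAC interface
    stream_low, stream_high = 150, 300       # per voice stream, Class A
    if target_class == "B":
        # Class B records more per stream; assumed here to be exactly double.
        stream_low, stream_high = 2 * stream_low, 2 * stream_high
    total_streams = interfaces * streams_per_interface
    low = agent_low + per_interface * interfaces + stream_low * total_streams
    high = agent_high + per_interface * interfaces + stream_high * total_streams
    return low, high
```

For example, a Class A target with two interfaces and ten concurrent streams on each would need roughly 4,100 to 7,400 bytes of heap; the same load on a Class B target would need roughly 7,100 to 13,400 bytes.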
About the Authors
Bob Massad is vice president of product strategy at Telchemy. Prior to Telchemy, Bob was director of advanced technologies at NetScout Systems. He can be reached at rmassad@telchemy.com.
Shane Holthaus is a principal software engineer at Telchemy. Prior to Telchemy, he was a principal engineer at Virata Corp. Shane can be reached at sholthaus@telchemy.com.