The emerging MPEG-4 multimedia standard combines interactivity, natural and synthetic digital video, audio, and computer graphics, and is targeted at video conferencing, mobile videophones, multimedia, and other applications. The standard consists of a collection of tools that support applications ranging from digital TV and streaming video to mobile multimedia and games. The standard provides tools for shape coding, motion estimation and compensation, texture coding, error resilience, sprite coding, and scaleability.
MPEG-4 promises significant advances in high-quality personal video communications and conferencing, but it also promises design difficulties associated with greater algorithmic complexity and an evolutionary standard. MPEG-4 exhibits a marked increase in algorithmic complexity compared to the older generation MPEG-1 and MPEG-2.
The key to developing innovative, mobile MPEG-4-capable devices may lie in a technique called adaptive processing. Rather than conventional microprocessor and DSP approaches, adaptive processing reconfigures processing resources according to current needs, putting power where and when it is needed.
3G handsets ahoy!
Among the most compelling potential uses for MPEG-4 technology is the third-generation (3G) wireless handset arena. Designers of such devices may assume that a small screen is not computationally demanding, but that is far from an engineering reality. Figure 1 shows MPEG-4 computational complexity for a rectangular video object (VO) or frame of quarter common immediate format (QCIF) [176 x 144 picture elements (pel)] size. The MPEG-4 encoder/decoder pair demands 3.7 billion operations per second (BOPS) to produce high-quality video.
Coupled with such performance requirements are stringent demands for low power consumption by battery powered, real-time visual communication applications. A real-time video design, based on full-search motion estimation, requires a computational load of 9.34 billion integer arithmetic operations per second (with 8- and 16-b data) and a memory bandwidth of 6.22 billion 8-b accesses per second. For the video encoder alone, motion estimation shares about 66 percent of the total computational complexity. Conventional microprocessor, digital signal processor (DSP), and ASIC design approaches are not powerful enough, nor do they exhibit sufficiently low power consumption, to efficiently handle these motion estimation computations.
Aside from motion estimation, while many MPEG-4 tools are not as computationally intensive, they are major contributors to high-quality visual content. Error resilience is a key example.
Error resilience is vital for universal access through error-prone environments such as wireless and mobile communications. Several mechanisms create error resilience with different degrees of robustness and complexity. These mechanisms are made possible by tools that provide the means for resynchronization, error detection, data recovery, and error concealment. MPEG-4 video has four error resilience tools: resynchronization, data partitioning, header extension code, and reversible variable length codes.
Resynchronization is the most frequently-used method for bringing error resilience to a bit stream. It consists of inserting unique markers into the bit stream so that, in the case of an unrecoverable error, the decoder can skip the remaining bits until the next marker and start decoding from that point. MPEG-4 allows for insertion of resynchronization markers after an approximately constant number of coded bits or video packets, as opposed to MPEG-2 and H.263, which allow for resynchronization after a constant number of coded macro-blocks.
Separating bits
Data partitioning is a method that separates the bits of coding for motion information from those for the texture information. When an error occurs, more efficient error concealment may be applied when, for example, the error occurs on the texture bits only, by making use of the decoded motion information.
Header extension codes are binary codes that allow an optional inclusion of redundant header information - vital for correct decoding of video. Header extension codes reduce header information corruption and eliminate the complete skipping of large portions of bit stream.
Reversible variable length codes further reduce the influence of error occurrence on the decoded data. They are code words that can be forward and backward decoded. In case of an error, instead of skipping the bit stream until the next resynchronization marker, these codes make it possible to decode portions of the corrupted bit stream in the reverse order to limit the influence of the error.
Figure 2 provides a glance at the computational requirements for discrete cosine transform (DCT) and motion estimation computations required for implementing MPEG-4 in a wireless handset design. DCT computations are 30 percent each for addition, subtraction, and multiply-accumulate (MAC) functions. On the other hand, motion estimation incurs about 40 to 50 percent additions and another 40 to 50 percent in absolute difference in accumulation computations. For system engineers, DSPs are not well suited to do absolute difference in accumulation computations.
A hybrid concept
MPEG-4's video compression scheme is founded on a block-based hybrid coding concept used within the International Standards Organization/International Electro-Technical Commission (ISO/IEC) MPEG-1, MPEG-2, and the Consultative Committee for International Telegraph and Telephone (CCITT) H.261/ITU-T H.263 video compression standards and recommendations, which are extended within the MPEG-4 standard to support arbitrarily shaped video objects. The arbitrarily shaped video objects are split up into macro-blocks of 16 x 16 pel within a bounding box and are coded on an MB macro-blocks and block (8 x 8 pel) basis, similar to block-based video compression schemes, but with MPEG-4 coding tools.
The shape of a video object is represented by an alpha-plane with pel resolution and is gained by an application-dependent method, which is beyond standardization. Visual objects can be translucent, their size can be changed, they can be of natural video object (VO) type; or they may be synthetic or computer generated objects that can be manipulated by the user. Every object is encoded and decoded by a different encoder and decoder instance and may use different coding options. The video object representation at a specific time instance is called video object plane (VOP), which is the equivalent to a "frame" for block-based video.
Each video object has different quality constraints, which means each could potentially have a different set of motion estimation algorithms. For example, a foreground object can be in high quality mode, while several background objects can be in various lower quality modes. This next step from block-based to object-based video coding requires a new methodology for very large scale integration (VLSI) design, as well as new VLSI architectures with considerably higher flexibility. The main reason for these additional requirement, is that motion estimation and the DCT are the most computationally demanding tools within the MPEG-4 standard.
DCT, a widely used transform in image compression, consumes about 20 to 30 percent of MPEG-4 computations. A one-dimensional (1-D) DCT is used as a building block for two-dimensional (2-D) DCT. Overall performance of the conventional two-dimensional DCT using these one-dimensional DCTs is dependent on the organization of the one-dimensional DCTs. The JPEG still image compression standard, H.261, H.263 video conferencing standards, and MPEG-1, 2, and MPEG-4 digital video standards all use the DCT. In these standards, a 2-D DCT is applied to 8 x 8 blocks of pixels in the image that is compressed.
The 64 coefficients produced by the DCT are then quantized to provide the actual compression. In typical images, most DCT coefficients for an 8 x 8 block of pixels are small and become zero after quantization. This property of the DCT on real-world images is critical to the compression schemes.
Moving objects
Motion estimation is one of the key elements to implementing MPEG-4 in a mobile phone design. Motion estimation is based largely on a search scheme, which tries to find the best matching position of a 16 x 16 macro-block of the current frame with a 16 x 16 block within a predetermined or adaptive search range in the previous frame. But since the motion estimation algorithm is not standardized, several variations exist on the depicted scheme. The matching position relative to the original position is described by a motion vector, which is transmitted in the bit stream to the video decoder.
In video encoding for wireless handsets, motion vectors are used to compensate the motion within the video sequence and only the remaining signal or prediction error has to be encoded and transmitted. Hence, motion vectors must be selected to minimize the prediction error and to minimize the number of bits required to code the prediction error. Figure 3 shows the classification of motion estimation algorithms as time-domain and frequency-domain algorithms.
Designers can also use fast motion estimation algorithms with reduced computational complexity and associated power consumption advantages. However, when using conventional microprocessors and DSPs to run these algorithms, the design trade-offs are poor visual quality and data flow irregularities. But these algorithms can be accelerated when they are implemented in newer technologies, such as adaptive computing.
The evolution and immaturity of these algorithms, including the MPEG-4 specification itself, assure the mobile phone designer that there will definitely be future algorithmic changes, enhancements, bug fixes, interoperability fixes, and performance fixes, among other changes. The changing nature of MPEG-4 algorithms generally precludes an ASIC-based approach. These barriers place additional demands on a traditional ASIC-based approach or a "best practices" approach of a combined processor/DSP with ASIC cores.
More power issues
To accommodate the increasing demands on higher performance while maintaining low power, it is important for communication design engineers to closely analyze today's technologies. One way to do so is to compare computational power efficiency (CPE), which is a metric for comparing the advantages and disadvantages of current design techniques. CPE is defined as the ratio of the number of gates actively working in a clock cycle to solve a problem to the total number of gates in the device.
For example, the CPE of a typical DSP or embedded core is about 10 percent. This means that, typically, only 10 percent of the gates on a DSP core or chip are used to perform real work at any given time (see Figure 4). Figure 4 shows that only a small portion of the DSP is actually performing useful work. The remaining portion is overhead. This overhead is required to keep the small piece of the process that is performing the real work busy. Although the CPE number is low, a DSP has the advantage of being software programmable with all the time-to-market advantages.
The CPE of an ASIC averages about 25 percent, a 2.5 times advantage over the DSP design approach. These gains, which can either increase performance or reduce power dissipation, come at the expense of flexibility. Any changes not envisioned during the design cycle of the ASIC result in a re-spin of the entire ASIC.
Adaptive computing
A third category of processing power, unlike conventional microprocessors and DSPs, is adaptive computing machine (ACM) technology, which can be more efficient for developing MPEG-4-enabled wireless designs. In short, an ACM chip adapts its architecture on the fly to perform a variety of tasks.
As the ACM-based system is running, the silicon changes or adapts itself to become the most efficient "ASIC" to perform the given function. The silicon adapts itself over and over again so that within an instant, the ACM literally changes its architecture hundreds of times. The ACM also allows software algorithms to build and then embed themselves into the most efficient hardware possible for their application.
The optimum hardware that an algorithm requires comes into existence for the minimum time that the algorithm needs to run. This CPE metric is independent of process technology. In other words, an ACM-based processor will always be more efficient than a DSP or an ASIC solution at the same process geometry.
Processing inefficiencies of conventional microprocessors and DSPs also adversely affect the full power of algorithms. In many instances, the mobile phone designer must alter a given algorithm to fit a particular RISC microprocessor or DSP architecture. The design engineer or programmer must take a given algorithm and try to fine-tune the code so that it runs as efficiently as possible on a particular microprocessor or DSP architecture.
Conversely, an ACM architecture allows the system engineer to run a given algorithm in its most efficient form. In other words, the processing work required to run the code efficiently is performed in a few clock cycles, rather than in an inordinate amount of clock cycles.
Too many cycles
Running an MPEG-4 motion estimation or DCT algorithm on a conventional DSP or microprocessor incurs a large number of clock cycles, primarily due to instruction fetches and operand read/writes. Most of the cycles performed by microprocessor and DSP implementations are overhead - they simply set up the actual desired work output. Figure 5 shows a benchmark to point out these inefficiencies when running motion estimation or DCT algorithms. In this example, the benchmark is adding together 27 floating point numbers for a 27-input adder function.
In the microprocessor example, it takes 1,013 clock cycles to perform the addition of 27 floating point numbers. The processor does so by going through innumerable data executions. First, the microprocessor issues an address to memory, and then it fetches an instruction. At this point, roughly 40 different optimization techniques are applied to improve the performance of the instruction execution base.
Various microprocessor types go through four to 25 pipeline stages to pipeline the instruction process. If the microprocessor is a very long instruction word (VLIW) or superscalar RISC processor, for example, it will perform speculative execution, register scoreboarding, a variety of different branch prediction techniques, out-of-order execution of the instructions, and a variety of other optimizations. All these steps occur to issue an address to memory, fetch the first data element and input it into a register. These steps are thus performed 27 times and require 1,013 clock cycles.
A DSP implementation with dual MACs performs the same benchmark in 107 clock cycles. A multiply-accumulate operation is performed in a single unit cycle. However, two multiply-accumulates can be performed simultaneously.
The DSP is a modified Harvard architecture, which incorporates two independent data streams and an independent instruction stream. Compared to the conventional microprocessor, the system engineer can take advantage of the coarse-grain parallelism. This results in the 107 clock cycles, which is an order of magnitude faster than a microprocessor.
An exact approach
On the other hand, adaptive computing has the ability to implement into silicon the exact hardware required at any given time. Thus, an algorithm that implements the 27-input adder can be downloaded into the adaptive computing silicon, and this computation is completed in seven clock cycles, a large improvement over DSP. Using this approach, the adaptive computing chip can run a given algorithm in its most efficient form.
A generic ACM chip integrates a RISC processor core with a large adaptive circuitry function next to it. Tailored interface circuits are used to closely mate the RISC microprocessor core and the fine-grained adaptive computing circuitry. Hardwired peripherals surround the RISC microprocessor core and are used to interface with external circuits.
Several on-chip static RAMs are used to support the RISC microprocessor or the adaptive circuitry. Configurable I/O allows data to be placed in or out of the adaptive circuitry. A circuit can thus come into existence as needed and, once its task is completed, can be replaced by other circuits or functions implemented in the adaptive circuitry function.
The latest DSPs are optimized to perform finite impulse response (FIR) filtering - the hot application three years back. Using a modified Harvard architecture, these DSPs allow the reading, decoding, and executing of two separate instructions in parallel. Two memory banks with added buses to access them allow two simultaneous data accesses per clock cycle.
These memories can be read sequentially via specialized address mode hardware without executing separate instructions. Also, specialized hardware is included so zero-overhead loops can be performed. With these new changes, the DSP is able to process two simultaneous MAC operations every clock cycle in a sustained manner. In short, the FIR algorithm is the basis for
these architectural changes so that a DSP can perform FIR filtering more efficiently.
Addressing complexity
Newly emerging DCT algorithms are considerably more complex than the FIR filter. DSPs optimized for FIR filtering fail to efficiently perform DCT operations. DCT complexity lies mostly in the richness of the interconnection, the high usage of MACs, and the large number of specialized arithmetic logic unit (ALU) operations required. Each 1-D DCT requires about 29 additions and five multipliers.
A DCT operation can be implemented in two different ways. Keep in mind that two instructions can be executed in one DSP clock cycle. For regular address modes, four data words are pulled from the dual data memories. The dual MACs operate on this data and the specialized zero-overhead loop hardware keeps track of the looping.
However, the addressing modes required for DCT are irregular. The DSP hardware, which is optimized for FIR filter addressing, is the wrong type for DCT addressing. This irregularity of addressing forces the programmer to explicitly calculate an address for each MAC operation. The accessing of data for DCT is a irregular address scheme. The eight pixels of data are all addressed multiple times in a non-sequential, mixed manner until the DCT is finished, and the next eight pixels are then processed.
There are two different ways to structure the instructions to process the DCT. First, two instructions are executed on each clock cycle - one to (calculate-address and get-data) + (MAC operation). Secondly, on the first clock cycle, (calculate-address and get-data) + (calculate-address and get-data) is performed. On the second clock cycle, (MAC operation) + (MAC operation) is performed.
In the second implementation, (compute two MACs in parallel with no read), even if both MACs are used in the same clock cycle, they must first be fed with data, which means an extra clock cycle to read pixels from memory. This leads to a minimum of two clock-cycles execution time.
There are two disadvantages in the first implementation (reading a new pixel and computing a MAC operation in parallel). Since the DCT algorithm accesses data in memory, the indirect addressing mode widely used in DSP applications is implemented here. The first issue is that indirect addressing is not compatible with a parallel instruction. In other words, two instructions cannot be executed in parallel when one of them uses indirect addressing. To avoid this design issue, the new address must be computed and stored in a register, which incurs extra clock cycles and additional registers. In both of these implementations, due to the advanced DSP's architecturally inherent limitations, only one MAC can be used in a sustained manner.
Given the inconsistent dataflow issues certain to be associated with any MPEG-4 device, handset designers will need to explore all avenues open to balancing computational demands and power dissipation. Adaptive processing according to dataflow is one possible solution, but engineers working on this challenging discipline will want to maintain currency with all aspects of this still-emerging standard.
Further, until MPEG-4 is finalized, ASIC solutions may be scarce or inadvisable, forcing the prudent designer to explore architectures that combine known good processing capabilities with interim techniques.
About the Author
Paul Master is the vice president of technology at QuickSilver Technology. Paul has over 15 years experience in managing high technology communication and computer programs incorporating reconfigurable computing, digital signal processing, protocols, and systems. He can be reached at paul.master@qstech.com.
Keith Lane is senior systems engineer at QuickSilver Technology. Keith has over 15 years experience in the design and implementation of communications systems. Keith specializes in the design and implementation of wireless protocol designs, wireless multimedia applications, and wireless channel characteristics. He can be reached at keith.lane@qstech.com
Illustrations
References
- Peter, K., Algorithms, Complexity Analysis and VLSI Architectures for MPEG-4 Motion Estimation, Kluwer Academic Publishers, 1999.
Adaptive Computing Machines - Guide to New IC Technology - Chapter 1, http://www.quicksilvertech.com/white papers.htm.
Ebrahimi and Horne, "MPEG 4 Natural Video Coding - An Overview," Image Communication Journal, 1999.