


















11 March 2010
Feature
Hardware Implementations of Multirate Digital Filters
It's important to map interpolation and decimation functions into hardware efficiently. The challenge is in choosing the right hardware types. Here's a look at DSP, PLD, and ASIC implementations for multirate filters.
By Tony San
Many communication systems require multirate filters. A multirate filter is a filter whose output data rate differs from its input data rate. This often occurs near a physical interface such as a digital-to-analog converter (DAC) or an analog-to-digital converter (ADC). When the filter drives a DAC, the end user usually wants an interpolation filter, which generates more data points to create a smoother waveform. When the filter receives data from an ADC, the end user generally wants a decimation filter. The decimation filter allows the data to be oversampled, which facilitates a higher signal-to-noise ratio (SNR), while the rest of the system only needs to operate at the information rate.
Interpolation and decimation
Interpolation is used to increase the output sample rate (see Figure 1). It's necessary to generate new sample points that are located between the original sample values. Because the values of the new sample points are unknown, they are set to zero. This is called upsampling, inserting zeros, or zero stuffing. Since there are more data points at the output, the sample rate has increased, which raises the Nyquist frequency. Inserting zeros into the data in the time domain has an interesting effect in the frequency domain: it creates reflections (images) of the original spectrum. There cannot be more information present than at the original sample rate, and, therefore, all the reflections (which have been artificially put into the system) are noise. Fortunately, the noise can be removed by applying a low-pass filter; an ideal low-pass filter would remove it completely.
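The zero-stuffing and filtering steps can be sketched in a few lines of Python. This is a minimal illustration, not the article's implementation: the function names, the Hamming-windowed sinc design, and the 25-tap length are all assumptions chosen for the example.

```python
import math

def upsample(x, factor):
    """Insert factor-1 zeros after every input sample (zero stuffing)."""
    out = []
    for sample in x:
        out.append(sample)
        out.extend([0.0] * (factor - 1))
    return out

def lowpass_taps(num_taps, cutoff):
    """Hamming-windowed sinc low-pass (an assumed design method).
    cutoff is a fraction of the sample rate, between 0 and 0.5."""
    mid = (num_taps - 1) / 2.0
    taps = []
    for n in range(num_taps):
        t = n - mid
        if t == 0:
            ideal = 2.0 * cutoff
        else:
            ideal = math.sin(2.0 * math.pi * cutoff * t) / (math.pi * t)
        window = 0.54 - 0.46 * math.cos(2.0 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    return taps

def fir(x, taps):
    """Direct-form FIR: y[n] = sum over k of taps[k] * x[n - k], with x[n] = 0 for n < 0."""
    return [sum(taps[k] * x[n - k] for k in range(len(taps)) if n - k >= 0)
            for n in range(len(x))]

def interpolate(x, factor, num_taps=25):
    """Zero-stuff, then low-pass at the original Nyquist frequency to remove the images."""
    # Scaling the taps by the factor restores the amplitude lost to the inserted zeros.
    taps = [factor * c for c in lowpass_taps(num_taps, 0.5 / factor)]
    return fir(upsample(x, factor), taps)

print(upsample([1.0, 2.0, 3.0], 4))
# [1.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0, 3.0, 0.0, 0.0, 0.0]
```

A real filter is never ideal, so the images are attenuated rather than eliminated; the filter length sets how much attenuation is achieved.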
A decimation filter works in a similar manner. In this case, sample points are removed, which decreases the sample rate and reduces the Nyquist frequency. Any frequencies above this reduced Nyquist frequency will be aliased back into the band and will appear as noise. It is necessary to apply a low-pass filter before removing the data (downsampling) to ensure that noise is not introduced into the system (see Figure 2).
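A small Python sketch can make the aliasing concrete. It assumes a tone at exactly half the original sample rate and a crude two-tap moving average as the low-pass filter; both are illustrative choices, not values from the article, and a practical design would use a much longer filter.

```python
def fir(x, taps):
    """Direct-form FIR: y[n] = sum over k of taps[k] * x[n - k], with x[n] = 0 for n < 0."""
    return [sum(taps[k] * x[n - k] for k in range(len(taps)) if n - k >= 0)
            for n in range(len(x))]

def downsample(x, factor):
    """Keep every factor-th sample and discard the rest."""
    return x[::factor]

def decimate(x, taps, factor):
    """Low-pass filter first, then downsample."""
    return downsample(fir(x, taps), factor)

# A tone at half the original sample rate: +1, -1, +1, -1, ...
tone = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]

# Without filtering, the tone aliases to DC after downsampling by 2.
print(downsample(tone, 2))            # [1.0, 1.0, 1.0, 1.0]

# A two-tap moving average has a null at half the sample rate, so the
# tone is removed before it can alias (the first output is start-up transient).
print(decimate(tone, [0.5, 0.5], 2))  # [0.5, 0.0, 0.0, 0.0]
```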
Implementation strategies
A desirable characteristic for low-pass filters (employed in both interpolation and decimation) is linear phase. In practice, linear phase filters are implemented as finite impulse response (FIR) filters. FIR filters are computationally more expensive to implement than infinite impulse response (IIR) filters, and getting better filter performance means increasing the order of the filter. FIR filters range from tenth order to 200th order and beyond, so every output requires anywhere from ten to 200 or more multiplications. Since FIR filters are so computationally expensive, designers often use dedicated hardware to perform this function. The dedicated hardware may come in the form of a dedicated filtering chip, a programmable logic solution, or a semicustom (standard cell implementation) integrated circuit.
A standard cell implementation of an FIR function will generate the highest possible throughput. Programmable logic devices (PLDs) and dedicated filter chips come next in terms of speed, followed by general-purpose DSPs.
When the end user requires a high data throughput, nothing beats custom hardware like ASICs or PLDs. While the ASIC design flow is well understood, it can be a long and challenging process. In the case of PLDs, automated tools that generate FIR filters are available, speeding up the development flow. For interpolation and decimation filters, it is possible to employ certain techniques to decrease area and increase performance.
When designing a multirate filter, there is no single best implementation; there are many ways to evaluate a solution. The cost of a solution is determined by the performance the implementation requires. Here, performance is defined as the total number of multiplications required per second. (Because additions can usually be combined with multiplications, they are not included in the computational expense.) Assuming a single multiplication requires a single clock cycle, the MIPS required to implement a solution can be determined.
There is a straightforward approach for examining the computation rate of an interpolation filter. First, the data is upsampled, then it is filtered. The example shown in Figure 3 requires a 388-tap filter that must operate at a data rate of 12 megasamples per second (MSPS). The required computation rate for this implementation is approximately 4,500 MIPS (388 multiplications per output at 12 million outputs per second is about 4,656 million multiplications per second).
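The arithmetic behind that figure can be checked with a short Python sketch. The function name is an assumption made for illustration; the cost model (one multiplication per tap per output sample, one multiplication per clock cycle) is the one stated above.

```python
def mults_per_second(num_taps, output_rate_hz):
    """Direct-form FIR cost: one multiplication per tap for every output sample."""
    return num_taps * output_rate_hz

# 388 taps running at the 12 MSPS output rate.
rate = mults_per_second(388, 12_000_000)
print(rate / 1e6)  # 4656.0 million multiplications per second
```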
Multistage filtering
Fortunately, there are ways to decrease the required computation rate. It is possible to interpolate by 12 in three stages. At the first stage, the designer can interpolate by a factor of two. The output of the first stage is then further interpolated by a factor of two, and the output of the second stage is interpolated by a factor of three. There are now three filters. Figure 3 illustrates the specifications for the individual filters.
By interpolating in stages, the individual filter requirements have been relaxed, which reduces the order of each filter. Furthermore, the first two filters operate at 2 and 4 MSPS. Only the last filter operates at 12 MSPS. In the original approach, the entire filter operated at 12 MSPS. Similarly, when you decimate in stages, the required computation rate often decreases.
The multistage filtering approach reduces the computation rate to 1,035 MIPS. By redistributing the computation across multiple stages, the required result is obtained with much smaller filters. This method of optimization is a relatively high-level approach.
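The way a cascade's cost adds up can be sketched in Python. The function names are hypothetical, and the 1 MSPS input rate is an assumption inferred from a 12 MSPS output after interpolating by 12. The per-stage tap counts come from Figure 3, which is not reproduced here, so the sketch leaves them as a parameter rather than guessing at them.

```python
def stage_output_rates(input_rate, factors):
    """Output sample rate of each stage in an interpolation cascade."""
    rates = []
    rate = input_rate
    for factor in factors:
        rate *= factor
        rates.append(rate)
    return rates

def cascade_mults_per_second(tap_counts, input_rate, factors):
    """Total cost: each stage performs one multiplication per tap at its own output rate."""
    rates = stage_output_rates(input_rate, factors)
    return sum(taps * rate for taps, rate in zip(tap_counts, rates))

# Interpolating by 2, then 2, then 3 from an assumed 1 MSPS input.
print(stage_output_rates(1_000_000, [2, 2, 3]))  # [2000000, 4000000, 12000000]
```

Because only the final stage runs at the full 12 MSPS, long filters are cheapest when placed in the early, slow stages.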
Polyphase decomposition
Another strategy for decreasing the computation rate involves looking at the details of implementing an interpolation filter. Because it is known that zeros are inserted before filtering is performed, it is possible to break the problem up into several shorter filters, each of which produces the output for a different point in time. This is known as polyphase decomposition, as shown in Figure 4.
A simple case can explain how the polyphase interpolator works. In this example, there is a filter with 24 coefficients, which interpolates by 4. Since the filter interpolates by 4, most of the data input to the filter is actually zero. The coefficients that line up with zero data can be skipped when computing a particular output. For instance, the first output would be determined solely by coefficients C0, C4, C8, ..., C20. The next output would be determined by coefficients C1, C5, C9, ..., C21. In this case, there are only six multiplications required per output instead of 24. We have reduced the computation rate by the interpolation factor. In the case of a 388-tap filter that interpolates by 12, each output could be determined from at most 33 multiplications. A polyphase interpolator would require 388 MIPS to perform the same computation.
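The equivalence between the zero-stuffing approach and the polyphase structure can be checked with a short Python sketch. The function names and the test data are made up for illustration; the coefficient grouping (C0, C4, C8, ... for the first output phase, C1, C5, C9, ... for the next) follows the example above.

```python
def fir(x, taps):
    """Direct-form FIR: y[n] = sum over k of taps[k] * x[n - k], with x[n] = 0 for n < 0."""
    return [sum(taps[k] * x[n - k] for k in range(len(taps)) if n - k >= 0)
            for n in range(len(x))]

def interpolate_direct(x, taps, factor):
    """Zero-stuff, then run the full-length filter at the output rate."""
    stuffed = []
    for sample in x:
        stuffed.append(sample)
        stuffed.extend([0] * (factor - 1))
    return fir(stuffed, taps)

def interpolate_polyphase(x, taps, factor):
    """Run `factor` short sub-filters on the original data, one per output phase."""
    # Phase p holds C[p], C[p + factor], C[p + 2*factor], ...
    phases = [taps[p::factor] for p in range(factor)]
    out = []
    for n in range(len(x)):
        for sub in phases:  # sweep the phases once per input sample
            out.append(sum(c * x[n - j] for j, c in enumerate(sub) if n - j >= 0))
    return out

taps = list(range(1, 25))   # 24 arbitrary integer coefficients C0..C23
x = [3, -1, 4, 1, -5, 9]    # arbitrary input samples

assert interpolate_polyphase(x, taps, 4) == interpolate_direct(x, taps, 4)
print(len(taps[0::4]))      # 6 multiplications per output instead of 24
```

Integer data is used so that the two results can be compared exactly; with floating-point coefficients the outputs would agree only to within rounding error.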
These same techniques can be applied to decimation structures as well. When decimating by a factor of 4, three out of every four data points are thrown away after filtering. It's unnecessary to calculate the data points being thrown away. The polyphase decimator (for a decimation factor of 4) distributes the data across four shorter polyphase filters. Finally, the outputs of the four filters are added together to obtain the final result. Each of the four polyphase filters needs to produce an output only at the decimated data rate, reducing the performance requirements for the decimator.
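A Python sketch along the same lines shows that the polyphase decimator produces the same samples as filtering at the full rate and then discarding. The function names, coefficients, and data are illustrative assumptions.

```python
def fir(x, taps):
    """Direct-form FIR: y[n] = sum over k of taps[k] * x[n - k], with x[n] = 0 for n < 0."""
    return [sum(taps[k] * x[n - k] for k in range(len(taps)) if n - k >= 0)
            for n in range(len(x))]

def decimate_direct(x, taps, factor):
    """Filter at the input rate, then throw away all but every factor-th output."""
    return fir(x, taps)[::factor]

def decimate_polyphase(x, taps, factor):
    """Compute only the outputs that are kept, using `factor` short sub-filters."""
    # Branch p holds taps[p], taps[p + factor], ... and sees inputs x[m*factor - p].
    phases = [taps[p::factor] for p in range(factor)]
    num_outputs = (len(x) + factor - 1) // factor
    out = []
    for n in range(num_outputs):
        acc = 0
        for p, sub in enumerate(phases):
            for j, c in enumerate(sub):
                idx = (n - j) * factor - p
                if 0 <= idx < len(x):
                    acc += c * x[idx]
        out.append(acc)  # the branch outputs are summed into one result
    return out

taps = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]  # arbitrary integer coefficients
x = list(range(1, 14))                  # arbitrary input samples

assert decimate_polyphase(x, taps, 4) == decimate_direct(x, taps, 4)
```

Each kept output still needs one multiplication per coefficient, but those multiplications now happen at the decimated rate rather than at the input rate.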
Naturally, it is possible to combine the approaches to further reduce the computation rate. For instance, multistage filtering could be performed with each of the individual stages implemented in a polyphase structure.
Implementation with DSPs and cores
At an implementation level, it is necessary to come up with an architecture to perform the calculations, one that takes up a minimum amount of device resources, operates at the lowest power level, and so on. The solution will depend on the computation rate. For situations that require a few hundred MIPS, DSPs are ideal. While some DSPs are able to perform at up to 1 GOPS, a typical design uses a DSP for more than just filtering; it is necessary to carefully allocate MIPS among all the various functions that the processor is performing. In many cases, a MIPS budget will be developed and a DSP selected based on the required performance.
There are alternatives if the required performance exceeds the capabilities of a single DSP. These alternatives involve splitting the tasks across several DSPs or using hardware coprocessors to speed up even the most computationally intensive tasks. At this point, ASICs and PLDs enter the picture.
Implementation with dedicated logic
Specialized chips that perform an interpolation function are available from various semiconductor vendors. These chips contain several multipliers that perform the filtering function, obtaining a decisive performance advantage over a DSP. The chips can support a fixed number of coefficients and a particular interpolation or decimation factor.
The ASIC and PLD approach can be lumped into the "build your own dedicated hardware" category. With this approach, it's possible to calculate an entire 127-tap FIR filter in a single clock cycle (nearly two orders of magnitude faster than a DSP). The design challenge is to be aware of everything that is going on across the whole flow: DSP design, HDL simulation, synthesis, verification, testability, and fault coverage.
For a fully parallel interpolation filter, breaking the filter down by polyphase decomposition provides a set of shorter filters. In order to obtain one filter calculation per clock cycle, there should be one multiplier for each coefficient in each polyphase filter. At every input clock, two things happen: (1) the new data sample is stored in every polyphase structure, and (2) each of the N polyphase filters generates an output, for N outputs in total. Finally, the output clock sweeps through all of the individual filter outputs in the same time span as a single input clock.
Clock domains and static timing analysis
Breaking down the design, there are two clock domains: the input clock and the output clock. The output clock rate is an integer multiple of the input clock rate. The output structure (a simple multiplexer) needs to operate at a higher data rate than the input polyphase filters. When designing dedicated hardware (be it ASIC or programmable logic), reducing the number of clock domains is often desirable. With ASICs, extra clock domains need to be tied together when generating scan vectors. There may be false paths that must be removed from the static timing analysis. With programmable logic, there is a fixed number of clock signals allowed, which makes each clock domain a precious item.
It's possible to clock the entire structure with the output clock if clock enables are placed on the flip-flops used in the polyphase structures. By using clock enables, the polyphase structures only need to update at the input clock rate (the slower clock), and the timing on these complex structures is relaxed. This makes the polyphase structures multicycle elements, and it is necessary to perform static timing analysis with a multicycle specification. Timing analyzers used in ASIC- and PLD-oriented design flows support multicycle specifications.
When designing an ASIC, the required number of multiplication units can be placed into silicon and the desired speed can be obtained using minimal space. ASIC implementations tend to be less flexible than DSPs and PLDs. When using a dedicated piece of silicon, changes require a complete respin (which costs time and money).
PLD structures for filtering
The PLD implementation takes a different approach. There are two structures used to perform the filtering operation in a PLD: serial and parallel. Both structures take the coefficients and efficiently map them into look-up tables to perform the multiplication. The fully parallel structure performs the entire filtering operation in a single clock cycle. The serial structure distributes the calculation across several clock cycles (as determined by the input bit width). This results in lower throughput, but serial structures are efficient in terms of silicon utilization (requiring minimum storage and logic).
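One common way to turn coefficient multiplications into look-up tables is distributed arithmetic. The article does not name the technique, so treating the serial structure as distributed arithmetic is an assumption, as are the function names and the example values in this Python sketch. It models a bit-serial dot product over two's-complement inputs, taking one pass (one clock cycle in hardware) per input bit.

```python
def build_lut(coeffs):
    """Entry `addr` holds the sum of the coefficients whose bit is set in addr."""
    return [sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
            for addr in range(1 << len(coeffs))]

def da_dot(samples, lut, bits):
    """Bit-serial dot product of two's-complement samples with the LUT's coefficients."""
    acc = 0
    for b in range(bits):  # one pass per input bit
        addr = 0
        for k, s in enumerate(samples):
            addr |= ((s >> b) & 1) << k  # bit b of every sample forms the LUT address
        term = lut[addr] << b            # weight the partial sum by 2**b
        # The most significant (sign) bit carries negative weight in two's complement.
        acc += -term if b == bits - 1 else term
    return acc

coeffs = [3, -1, 4, 2]       # hypothetical 4-tap filter
samples = [5, -7, 0, -128]   # hypothetical 8-bit signed samples in the delay line

expected = sum(c * s for c, s in zip(coeffs, samples))
assert da_dot(samples, build_lut(coeffs), 8) == expected
```

No multiplier appears anywhere in `da_dot`: only table reads, shifts, and additions, which is why the approach suits look-up-table-based logic. The table grows as 2 to the power of the number of taps, so practical designs typically split long filters into several smaller tables.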
Today, there are tools that automatically generate FIR filters for programmable logic. At a minimum, these tools generate single filters when given a set of coefficients. The more advanced tools generate fixed-point coefficients for the user and can produce polyphase filters based on them, along with area and speed estimates.
Evaluating solutions
There are many ways to implement logic that will perform the interpolation and decimation. The engineer has to evaluate the required throughput, come up with an efficient implementation, and balance time spent optimizing the design against completing the project quickly.
Tony San is an engineer at Altera in San Jose, CA, and has over 10 years of design experience. He received his BE and MSEE from Manhattan College. Tony can be reached at tony_san@altera.com.