Soft Radios and Modems on FPGAs
In this article, the benefits of using FPGAs over DSPs in the design of a 16-QAM RF transmitter data pump are presented.
By Les Mintzer
The field programmable gate array (FPGA) competes with DSP chips for soft radio and modem designs. While the FPGA is more than a match for logic-intensive functions such as convolutional encoders, it has been considered seriously deficient in compute-intensive tasks. The fastest array multiplier configured in an FPGA cannot match the cost and performance provided by a $5 DSP chip. And DSPs are still the chip of choice when dealing with CAD tools. With the application of distributed arithmetic (DA) techniques, however, the scales are once again tipping in favor of the FPGA.
There are additional features such as architectural flexibility that favor the FPGA. Indeed, the functional blocks of radio and modem data paths can readily be mapped into individual, concurrently operating hardware nodes. This approach avoids the difficult programming challenges of scheduling time-critical tasks through a single, time-shared digital signal processing arithmetic processor.
These features will be illustrated by the
design of a 16-QAM RF transmitter data pump. The easy transition from data path functional blocks to the logic circuits of the Xilinx 4000 family of FPGAs will be presented in sufficient detail to obtain a good estimate of the required number of circuits. While the design of a 16-QAM data pump that meets the same system requirements and uses the same FPGA type has been reported in the open literature, the reported number of circuits seemed far greater than necessary. In the headlong rush to market, it is
very possible to trip over CAD tools. The exclusive reliance on CAD tools does not always yield optimal solutions. A large dose of sweat, experience, and ingenuity is often required.
A home for all digital machines
Given an unlimited supply of universal logic gates such as NANDs and NORs, any digital machine can be built. The FPGA possesses
these gates in abundance. In the Xilinx 4000 family, these take the form of a truth table, or more commonly, a look-up table (LUT) of 16 words by 1 bit that can be configured to any arbitrary Boolean function of four input variables (the address lines of the LUT). Since generating the function usually entails the equivalent of several NAND gates, the LUT is considered coarse-grained logic. The Xilinx 4000 configurable logic block (CLB) consists of two 16-word LUTs which may be combined to produce an
arbitrary Boolean function of five input variables. Additionally, the LUTs may be configured as two 16 x 1 RAMs or as a single 32 x 1 RAM.
The CLBs are laid down as two-dimensional square arrays, where both the CLBs and their interconnects are individually configured. The smallest device, the XC4002, contains an 8 x 8 array of CLBs, and the largest, the XC4085XL, contains a 48 x 48 array. Each LUT feeds a flip-flop that can be toggled at 100 MHz.
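The LUT-as-logic idea can be sketched in a few lines of Python. This is only an illustration of a 16-word by 1-bit table addressed by four inputs; the function names `make_lut` and `lut_read` are hypothetical and do not correspond to any Xilinx tool.

```python
def make_lut(func):
    """Build a 16-word x 1-bit table for an arbitrary 4-input Boolean function."""
    table = []
    for addr in range(16):
        # Address bit 0 is input a0, bit 1 is a1, and so on.
        bits = [(addr >> i) & 1 for i in range(4)]
        table.append(int(bool(func(*bits))))
    return table


def lut_read(table, a0, a1, a2, a3):
    """Evaluate the function by using the four inputs as the table address."""
    addr = a0 | (a1 << 1) | (a2 << 2) | (a3 << 3)
    return table[addr]


# Example function (an assumption for illustration): 4-input parity.
parity_lut = make_lut(lambda a, b, c, d: a ^ b ^ c ^ d)
print(parity_lut)
```

Any other four-input function is obtained by changing only the table contents, which is what configuring the LUT amounts to.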
The 16-QAM modulator
The 16-QAM modulator contains the key functional blocks of the RF transmitter data pump (see Figure 1). Serial 20-Mbps data are first grouped into 4-bit symbols and then fed, bit parallel, at a 5 megasymbol/sec rate to a differential encoder and symbol mapper. The mapper produces pairs of 3-bit orthogonal components. Next, these
components are pulse shaped in a pair of root-raised cosine filters, interpolated to 20 megasamples/sec, modulated by a 5-MHz carrier, with the outputs summed prior to conversion from digital to analog form. The crux of the design is the interpolating pulse-shaping pair of filters. In order to establish the efficacy of this design approach, however, it is necessary to include the encoding and mapping functional blocks as well as the 5-MHz modulator to determine the total gate count.
Encoding and symbol mapping
In sizing the encoder and signal mapper we can draw upon earlier standard modem designs. In the V.32, for example, the encoder consists of a differential encoder to provide 180° phase ambiguity protection and a convolutional encoder which adds redundancy to reduce the bit error rate (BER) at the receiver. Both
encoder and mapper are finite state machines with the total number of states realized by five registers. These registers can be implemented with 2.5 CLBs, and the bridging logic consists of eight 2-input EXOR gates (four CLBs) and three 2-input AND gates (1.5 CLBs). In this 16-QAM transmitter, four serial bits at the 20-Mbps rate are captured with a serial-to-parallel register (two CLBs) to form a 4-bit symbol that drives the encoders at a reduced 5 megasymbol/sec, a data flow rate easily handled by the CLB circuits.
Data path controls entail the clocking of specific registers along the data path and require fewer than fifteen CLBs. Next, the 5-bit encoded symbol output addresses the mapper which is simply an LUT with a pair of 3-bit outputs. These outputs serve as orthogonal coordinates (I&Q) for mapping symbol positions in a two-dimensional plane (the constellation). Only sixteen of the sixty-four intersections (stars) represent valid symbol positions. The mapper dimensions are 32 words x 3 bits x 2 = 6 CLBs. The
overall CLB count for these functional blocks is thirty-one.
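The count of thirty-one can be checked by tallying the figures quoted above. The sketch below is bookkeeping only; the dictionary labels are mine, and the control logic is entered at its stated upper bound of fifteen CLBs.

```python
BITS_PER_CLB = 32  # two 16 x 1 LUTs per CLB

# Mapper: 32 words x 3 bits x 2 outputs (I and Q), stored in CLB LUTs.
mapper_clbs = 32 * 3 * 2 / BITS_PER_CLB

encoder_clbs = {
    "state registers (five)": 2.5,
    "EXOR bridging logic (eight gates)": 4.0,
    "AND gates (three)": 1.5,
    "serial-to-parallel register": 2.0,
    "data path controls (upper bound)": 15.0,
    "symbol mapper LUT": mapper_clbs,
}

total_clbs = sum(encoder_clbs.values())
print(total_clbs)
```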
Root-raised cosine filter
The raised cosine filter is an accepted means of limiting the intersymbol interference inherent in the limited bandwidth of the transmission path. The frequency spectrum shaping is split between the transmitter and receiver units; hence the square
root-raised cosine filter. The filter size and its coefficients were developed with the aid of the Momentum Data Systems QEDesign 1000. The responses of the 32-tap finite impulse response (FIR) filter for 12-bit fixed-point computations are shown in
Figure 2
. We shall use a 12-bit filter model and determine its gate count. (With 12-bit quantization the QEDesign program required only twenty-eight symmetrical coefficients. This design exercise will, however, consider
a full 32-tap symmetrical FIR filter.)
Design trickery
The root-raised cosine filter shapes the spectrum of both the I&Q channels. While the I&Q samples are generated at 5 megasamples/sec, the filter produces 20-megasample/sec data for the modulator. Thus, the filter also functions as a 1:4 interpolator. The resulting
computation load (exploiting coefficient symmetry) is two channels x sixteen symmetrical taps x 20 megasamples/sec = 640 megamultiply-accumulate operations/sec. This speed is well beyond that offered by most fixed-point DSP chips. Now, the FPGA becomes a compelling choice. There remains, however, the need to select a filter form that can map into an efficient CLB-based design.
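The arithmetic behind the 640 million figure is short enough to restate as a quick check. The constant names are illustrative only.

```python
CHANNELS = 2                  # I and Q
SYMMETRIC_TAPS = 16           # 32-tap filter folded by coefficient symmetry
OUTPUT_RATE_HZ = 20_000_000   # filter output rate after 1:4 interpolation

macs_per_second = CHANNELS * SYMMETRIC_TAPS * OUTPUT_RATE_HZ
print(macs_per_second // 1_000_000, "million multiply-accumulates per second")
```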
There are several circuit configurations or forms to represent the FIR filter. Principal among them are the direct form (which is a
frequent software model), the transpose form with variations (which has been implemented in dedicated filter chips), and the polyphase filter (which is favored for multi-rate applications). None of these forms can exploit coefficient symmetry to reduce the multiplication load. One trick for designing the multi-rate filter is to plot the signal flow trajectory in a sample-coefficient plane. With samples ordered along the vertical axis and coefficients listed horizontally, the data trajectory to generate a
single filter response takes the form of a V rotated 90° (see Figure 3). With coefficient symmetry, only half the number of filter coefficients must be listed. Interpolation by K is expressed as K - 1 zeros stuffed between input samples. The V trajectories for the 32-tap FIR are plotted in
Figure 3
. While the interval between input data samples is 200 ns, a new trajectory must be initiated
every 50 ns.
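The effect of zero stuffing can be sketched numerically. The example below, with placeholder integer coefficients and symbols (assumptions, not the root-raised cosine set), filters a zero-stuffed sequence with a plain 32-tap FIR and then recomputes every output from only the non-zero samples. With K = 4, each output touches at most 32/4 = 8 stored samples.

```python
K = 4        # interpolation factor
TAPS = 32

# Placeholder coefficients and symbols, chosen only to make the check exact.
h = [k + 1 for k in range(TAPS)]
symbols = [3, -1, 1, -3, 1, 3, -1, -3, 1, 1, -3, 3]


def fir(x, coeffs):
    """Direct-form FIR with zero initial state: y[n] = sum_k coeffs[k] * x[n - k]."""
    return [
        sum(coeffs[k] * x[n - k] for k in range(len(coeffs)) if n - k >= 0)
        for n in range(len(x))
    ]


# Interpolation by K: K - 1 zeros stuffed after each input sample.
stuffed = []
for s in symbols:
    stuffed.extend([s] + [0] * (K - 1))

y_stuffed = fir(stuffed, h)


def output_from_nonzero_samples(n):
    """Compute output n while skipping the stuffed zeros entirely."""
    phase, m = n % K, n // K
    acc = 0
    for j in range(TAPS // K):          # at most eight products per output
        if m - j >= 0:
            acc += h[phase + K * j] * symbols[m - j]
    return acc


y_sparse = [output_from_nonzero_samples(n) for n in range(len(stuffed))]
print(y_sparse == y_stuffed)
```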
There are two computation models that can be derived from this plot. The first is a variation of the transpose form where a non-zero input sample is multiplied by all thirty-two coefficients and the products are accumulated in partial sum registers. After thirty-two products have been accumulated, a full filter response is output and a multiplier-accumulator circuit can be assigned to compute a new trajectory. Here, thirty-two MACs are performed every 200 ns. The second model is a delay
and sum, which is the direct form of the FIR filter. As observed in the filter trajectory, eight stored samples are needed to compute a filter response. Computing five consecutive filter responses we observe a pattern as shown in
Table 1
.
Four consecutive 20-MHz responses are computed with the same input data set of eight samples. Only two sets of filter coefficients are used. The filter coefficients appear in reverse order with respect to the samples in
the third and fourth responses ($y_d$ and $y_e$) of each data set. Can these response equations be mapped into efficient FPGA circuits? Yes, and the key is DA, a computation algorithm that is absent from all major design tools. Before proceeding with the implementation of the response equation set, a simplification can be made.
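The two-coefficient-set observation follows from symmetry and can be checked with a short sketch. The coefficient values are placeholders, and the 0-based indexing (h[0] through h[31]) is an assumption made here for illustration. Splitting a symmetric 32-tap filter into its four interpolation phases shows that the last two phases are the first two in reverse order.

```python
K = 4
TAPS = 32

# Placeholder symmetric coefficients: h[n] == h[TAPS - 1 - n].
half = [10 + n for n in range(TAPS // 2)]
h = half + half[::-1]

# Phase p uses taps p, p + 4, p + 8, ... (eight taps per output sample).
phases = [h[p::K] for p in range(K)]


def folded_indices(p):
    """Indices into the sixteen distinct coefficients used by phase p."""
    return [k if k < TAPS // 2 else TAPS - 1 - k for k in range(p, TAPS, K)]


for p in range(K):
    print(p, folded_indices(p))
```

Under this indexing, one set contains coefficients 0, 3, 4, 7, and so on, and the other starts at coefficient 1, so only two DALUT contents are needed for all four phases.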
The 5-MHz carrier
Carrier modulation is defined by the simple equation:
$$Y_k = y_{Ik} \cos \omega_C t + y_{Qk} \sin \omega_C t$$
where $\omega_C$ is the carrier frequency = $2\pi$(5 MHz), and I&Q denotes the inphase and quadrature symbol components.
This equation is executed every 50 ns. Over one
symbol period (200 ns) only four values of the carrier occur. These can be conveniently defined as:
$\cos \omega_C t = 1, 0, -1, 0$ and $\sin \omega_C t = 0, 1, 0, -1$.
1
The modulated output does not require any explicit multiplications or additions, nor is it necessary to compute the I&Q filter responses every 50 ns. Rather, an I response is computed in 50 ns followed by a Q response the next 50 ns. This is followed by a -I response, and then a -Q response, and the cycle repeats. The modulated filter response is expressed in
Table 2
.
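A minimal sketch of this simplification, assuming a 20-MHz sample rate and the 5-MHz carrier stated above: sampling the carrier every 50 ns yields only the values 1, 0, -1, and 0, so the full modulation equation collapses to selecting and negating filter outputs. The function names and test data are illustrative.

```python
import math

FS_HZ = 20e6   # sample rate (one sample every 50 ns)
FC_HZ = 5e6    # carrier frequency

# Carrier samples over one symbol period (four samples).
cos_seq = [round(math.cos(2 * math.pi * FC_HZ * k / FS_HZ)) for k in range(4)]
sin_seq = [round(math.sin(2 * math.pi * FC_HZ * k / FS_HZ)) for k in range(4)]


def modulate_full(y_i, y_q):
    """Evaluate Y_k = y_I cos(wt) + y_Q sin(wt) with explicit multiplies."""
    return [y_i[k] * cos_seq[k % 4] + y_q[k] * sin_seq[k % 4]
            for k in range(len(y_i))]


def modulate_select(y_i, y_q):
    """Same result with no multiplies: cycle through I, Q, -I, -Q."""
    out = []
    for k in range(len(y_i)):
        phase = k % 4
        if phase == 0:
            out.append(y_i[k])
        elif phase == 1:
            out.append(y_q[k])
        elif phase == 2:
            out.append(-y_i[k])
        else:
            out.append(-y_q[k])
    return out


y_i = [10, 11, 12, 13, 14, 15, 16, 17]   # placeholder filter outputs
y_q = [20, 21, 22, 23, 24, 25, 26, 27]
print(modulate_select(y_i, y_q))
```

Only one of the two filter responses contributes to each output sample, which is why a single filter can alternate between the I and Q channels.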
The DA technique
DA is a computation technique specifically targeted to sum-of-product equations where one of the product term factors is a constant. DA design
choices range from gate-efficient, bit-serial arithmetic to high-performance bit-parallel operations, a classic serial/parallel trade-off. DA techniques can be applied to many important linear, time-invariant digital signal processing algorithms such as filters (FIR and IIR), transforms (fast Fourier transform [FFT]), and matrix-vector products such as the 8 x 8 discrete cosine transform (DCT).
DA techniques have been known for over two decades, but have proved inappropriate for the fixed
instruction set architecture of the programmable DSP. DA is, however, a good choice for FPGA implementation. It is a particularly good choice for LUT logic modules, such as the Xilinx CLB. A DA FIR filter design for the Xilinx XC3000 FPGA was first proposed in 1992.
1
There are no explicit multipliers in DA circuits. Multiplication is achieved with LUTs. DA carries this process a step further by pre-storing the sum of partial products for all the terms within an equation and addressing the table
(now a DALUT) with the bits of all the input variables. The serial DA circuit has a single DALUT that is addressed least significant bit first. The output sum of partial products is stored in an accumulator register, and, in a manner reminiscent of the old shift-and-add multiply subroutine in early computers, successive DALUT outputs are added to the binary-downshifted accumulated sum of partial products. A full double-precision solution is available. A detailed description of the serial DA circuit is provided
in Reference 1.
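The serial DA procedure can be sketched in software as follows. This is an illustrative model, not the circuit of Reference 1: the names `build_dalut` and `da_dot` are hypothetical, the coefficients and samples are placeholders, and the model weights each DALUT output by a power of two rather than downshifting the accumulator as the hardware does. The two are arithmetically equivalent up to a fixed scale factor.

```python
def build_dalut(coeffs):
    """Pre-store the sum of coefficients selected by each address bit pattern."""
    return [
        sum(c for i, c in enumerate(coeffs) if (addr >> i) & 1)
        for addr in range(1 << len(coeffs))
    ]


def da_dot(coeffs, samples, bits):
    """Sum of products of constants and two's complement samples, bit-serially."""
    dalut = build_dalut(coeffs)
    mask = (1 << bits) - 1
    words = [s & mask for s in samples]      # two's complement bit patterns
    acc = 0
    for b in range(bits):                    # least significant bit first
        addr = 0
        for i, w in enumerate(words):
            addr |= ((w >> b) & 1) << i      # bit b of every sample forms the address
        partial = dalut[addr]
        if b == bits - 1:
            acc -= partial << b              # sign bits: subtract
        else:
            acc += partial << b
    return acc


coeffs = [5, -3, 7, 2]        # placeholder constants
samples = [3, -4, 1, -1]      # 3-bit two's complement inputs
direct = sum(c * s for c, s in zip(coeffs, samples))
print(da_dot(coeffs, samples, bits=3), direct)
```

One table lookup and one add per input bit replace all four multiplications, independent of the number of coefficients packed into the table.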
Filter implementation
The data path of the root-raised cosine filter is shown in
Figure 4
. The path is defined by standard functional blocks that are readily converted to CLBs. The 3-bit I&Q signals from the mapper are loaded every 200 ns into a parallel-to-serial shift
register (PSR). The seven prior symbols are stored in a RAM shift register (SR) chain. The first three filter responses, $Y_b$, $Y_c$, and $Y_d$, are computed with data recirculating in the shift registers. A feedback path is required for the PSR, but the RAM SRs effect recirculation by modulo addressing while reading only. The modulo here is six with the first three shifts for $Y_b$, the next three shifts for $Y_c$, and the last three for $Y_d$.
In computing $Y_e$ the data are shifted down the SR chain. The modulo address pattern is repeated with data transferred (written) from the previous stage. All twelve shifts and associated PSR load, and RAMSR address and write controls can be derived from a 60-MHz system clock.
Because the same coefficient set is used for two sample periods (one computed with I data, the other with Q data), a single set of DALUTs may be used with 2/1 multiplexers routing the serial data streams to the appropriate address ports. These ports are annotated in
Figure 4
to indicate the contents of the DALUT. A logic 1 level at the $h_3$ port selects all memory locations where $h_3$ is included in the sum of partial products. Similarly, a logic 1 at the $h_7$ port selects all locations where $h_7$ is included, and logic 1 levels at both $h_3$ and $h_7$ ports select all locations where both $h_3$ and $h_7$ are included. This pattern is continued for the remaining six coefficients. Indeed, eight coefficients would require $2^8$ or 256 words of storage. And with 12-bit coefficients this would consume (256/32 words per CLB) x 12 = 96 CLBs. Another trick is to use two DALUTs with four coefficients each and add their outputs. The CLB count reduces to (2 x $2^4$)/32 x 12 + 13/2 (parallel adder) = 18.5 CLBs.
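Both the CLB arithmetic and the table-splitting trick can be checked with a short sketch. The coefficient values are placeholders and the helper names are mine; the sizing formula simply restates the figures above (32 LUT words per CLB, 12-bit table width, a 13-bit adder at two bits per CLB).

```python
WORDS_PER_CLB = 32   # two 16 x 1 LUTs per CLB


def dalut_clbs(num_coeffs, width_bits):
    """CLBs needed to store a DALUT with 2**num_coeffs words of width_bits."""
    return (2 ** num_coeffs) / WORDS_PER_CLB * width_bits


single_table = dalut_clbs(8, 12)                  # one 256-word DALUT
split_tables = 2 * dalut_clbs(4, 12) + 13 / 2     # two 16-word DALUTs plus adder


def build_dalut(coeffs):
    """Sum of the coefficients selected by each address bit pattern."""
    return [
        sum(c for i, c in enumerate(coeffs) if (addr >> i) & 1)
        for addr in range(1 << len(coeffs))
    ]


coeffs = [3, -5, 7, 11, -13, 17, 19, -23]   # placeholder 8-coefficient set
full = build_dalut(coeffs)
low = build_dalut(coeffs[:4])
high = build_dalut(coeffs[4:])

# The low four address bits index one small table, the high four the other.
split_matches = all(full[a] == low[a & 0xF] + high[a >> 4] for a in range(256))
print(single_table, split_tables, split_matches)
```

The split works because each table entry is a plain sum, so the sum over eight coefficients separates into two independent sums over four.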
The same reduction also applies to the second set of filter coefficients, starting with $h_1$. With a 2/1 multiplexer the parallel adder can be time-shared. The adder expands to 13 bits that are then applied to the scaling accumulator, which performs the shift and add operations described earlier. When the sign bits of the input variables address the DALUT, a
subtraction occurs. This is executed in standard fashion by complementing the DALUT output with the EXOR gates shown in
Figure 4
and applying a carry-in to the first stage of the accumulator. For the negative responses, $Y_d$ and $Y_e$, the data samples are complemented by inverting all the DALUT output data excluding any data provided by the sign bits.
With I&Q data formatted in fractional 2's
complement, the filter coefficients can be scaled to prevent overflow in the final output. The ten most significant output bits can be loaded into the D/A converter driver register.
The total CLB count for the filter data path is 71.5. An itemized breakdown is given in
Figure 4
. The FPGA output ports have flip-flops which can serve as the D/A converter driver register. The grand total, which includes the encoder (thirty-one CLBs) and timing and control functions
(estimated to be less than fifty CLBs), is approximately 159 CLBs, placing it well within the capacity of the second smallest chip of the Xilinx XC4000 family, namely the XC4005 (196 CLBs). With more advanced FPGA devices, such as the Xilinx Virtex, the CLB count is lower and the performance is higher.
Performance with a 60-MHz system clock is assured. The data flow is uniform and unidirectional. Pipeline registers can be inserted (at no increase in CLBs) to shorten combinatorial paths. The longest
combinatorial path is the carry propagation through the fourteen stages of the scaling accumulator. With the built-in look-ahead carry circuits, however, adequate speed margins are provided.
Les Mintzer retired from full-time engineering at Rockwell and part-time teaching at the University of California at Irvine. He now does part-time consulting for Excelsus Technologies and can be reached at lmintzer@excelsus-tech.com.