09 March 2010

Feature

Soft Radios and Modems on FPGAs


In this article, the benefits of using FPGAs over DSPs in the design of a 16-QAM RF transmitter data pump are presented.

By Les Mintzer

The field programmable gate array (FPGA) competes with DSP chips for soft radio and modem designs. While the FPGA is more than a match for logic-intensive functions such as convolutional encoders, it has been considered seriously deficient in compute-intensive tasks. The fastest array multiplier configured in an FPGA cannot match the cost and performance provided by a $5 DSP chip. And DSPs are still the chip of choice when dealing with CAD tools. With the application of distributed arithmetic (DA) techniques, however, the scales are once again tipping in favor of the FPGA.

There are additional features such as architectural flexibility that favor the FPGA. Indeed, the functional blocks of radio and modem data paths can readily be mapped into individual, concurrently operating hardware nodes. This approach avoids the difficult programming challenges of scheduling time-critical tasks through a single, time-shared digital signal processing arithmetic processor.

These features will be illustrated by the design of a 16-QAM RF transmitter data pump. The easy transition from data path functional blocks to the logic circuits of the Xilinx 4000 family of FPGAs will be presented in sufficient detail to obtain a good estimate of the required number of circuits. While the design of a 16-QAM data pump that meets the same system requirements and uses the same FPGA type has been reported in the open literature, the reported number of circuits seemed far greater than necessary. In the headlong rush to market, it is very possible to trip over CAD tools. The exclusive reliance on CAD tools does not always yield optimal solutions. A large dose of sweat, experience, and ingenuity is often required.


A home for all digital machines

Given an unlimited supply of universal logic gates such as NANDs and NORs, any digital machine can be built. The FPGA possesses these gates in abundance. In the Xilinx 4000 family, these take the form of a truth table, or more commonly, a look-up table (LUT) of 16 words by 1 bit that can be configured to any arbitrary Boolean function of four input variables (the address lines of the LUT). Since generating the function usually entails the equivalent of several NAND gates, the LUT is considered coarse-grained logic. The Xilinx 4000 configurable logic block (CLB) consists of two 16-word LUTs, which may be combined to produce an arbitrary Boolean function of five input variables. Additionally, the LUTs may be configured as two 16 x 1 RAMs or as a single 32 x 1 RAM.
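The LUT-as-logic idea can be sketched in a few lines of Python. This is only an illustrative model, not a description of the Xilinx configuration tools: the helper names `make_lut` and `lut_read` are hypothetical, and the two example functions (parity and a 2:1 multiplexer) are chosen here purely for demonstration.

```python
def make_lut(func):
    """Build a 16-word x 1-bit truth table for a Boolean function of four inputs."""
    table = []
    for address in range(16):
        # a0 is taken as the least significant address line (an assumption of this sketch)
        bits = [(address >> i) & 1 for i in range(4)]
        table.append(func(*bits) & 1)
    return table


def lut_read(table, a0, a1, a2, a3):
    """Evaluate the function by addressing the table with the four input bits."""
    address = a0 | (a1 << 1) | (a2 << 2) | (a3 << 3)
    return table[address]


# Two arbitrary example functions stored in the same kind of 16 x 1 table.
parity_lut = make_lut(lambda a, b, c, d: a ^ b ^ c ^ d)
mux_lut = make_lut(lambda a, b, sel, unused: b if sel else a)

print(parity_lut)
print(lut_read(mux_lut, 1, 0, 0, 0), lut_read(mux_lut, 1, 0, 1, 0))
```

Changing the function only changes the 16 stored bits; the lookup mechanism stays the same, which is why a single LUT can stand in for several discrete gates.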

The CLBs are laid down as two-dimensional square arrays, where both the CLBs and their interconnects are individually configured. The smallest device, the XC4002, contains an 8 x 8 array of CLBs, and the largest, the XC4085XL, contains a 48 x 48 array. Each LUT feeds a flip flop that can be toggled at 100 MHz.


The 16-QAM modulator

The 16-QAM modulator contains the key functional blocks of the RF transmitter data pump (see Figure 1). Serial 20-Mbps data are first grouped into 4-bit symbols and then fed, bit parallel, at a 5 megasymbol/sec rate to a differential encoder and symbol mapper. The mapper produces pairs of 3-bit orthogonal components. Next, these components are pulse shaped in a pair of root-raised cosine filters, interpolated to 20 megasamples/sec, modulated by a 5-MHz carrier, with the outputs summed prior to conversion from digital to analog form. The crux of the design is the interpolating pulse-shaping pair of filters. In order to establish the efficacy of this design approach, however, it is necessary to include the encoding and mapping functional blocks as well as the 5-MHz modulator to determine the total gate count.


Encoding and symbol mapping

In sizing the encoder and signal mapper we can draw upon earlier standard modem designs. In the V.32, for example, the encoder consists of a differential encoder to provide 180° phase ambiguity protection and a convolutional encoder that adds redundancy to reduce the bit error rate (BER) at the receiver. Both encoder and mapper are finite state machines, with the total number of states realized by five registers. These registers can be implemented with 2.5 CLBs, and the bridging logic consists of eight 2-input EXOR gates (four CLBs) and three 2-input AND gates (1.5 CLBs). In this 16-QAM transmitter, four serial bits at the 20-Mbps rate are captured with a serial-to-parallel register (two CLBs) to form a 4-bit symbol that drives the encoders at a reduced 5 megasymbol/sec, a data flow rate easily handled by the CLB circuits. Data path controls entail the clocking of specific registers along the data path and require fewer than fifteen CLBs. Next, the 5-bit encoded symbol output addresses the mapper, which is simply an LUT with a pair of 3-bit outputs. These outputs serve as orthogonal coordinates (I&Q) for mapping symbol positions in a two-dimensional plane (the constellation). Only sixteen of the sixty-four intersections (stars) represent valid symbol positions. The mapper dimensions are 32 words x 3 bits x 2 = 6 CLBs. The overall CLB count for these functional blocks is thirty-one.
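The thirty-one CLB figure can be reproduced by adding up the itemized counts quoted above. The short tally below is only a bookkeeping sketch: the dictionary labels are descriptive names invented here, the data path controls are entered at their stated upper bound of fifteen CLBs, and the mapper entry assumes 32 bits of LUT storage per CLB (two 16 x 1 LUTs).

```python
BITS_PER_CLB = 32  # two 16 x 1 LUTs per CLB, usable as one 32 x 1 memory

encoder_clbs = {
    "five state registers": 2.5,
    "eight 2-input EXOR gates": 4.0,
    "three 2-input AND gates": 1.5,
    "serial-to-parallel register": 2.0,
    "data path controls (upper bound)": 15.0,
    # Mapper: 32 words x 3 bits x 2 outputs, stored in LUT memory
    "symbol mapper": 32 * 3 * 2 / BITS_PER_CLB,
}

total_encoder_clbs = sum(encoder_clbs.values())

for name, count in encoder_clbs.items():
    print(f"{name:35s} {count:5.1f}")
print(f"{'total':35s} {total_encoder_clbs:5.1f}")
```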


Root-raised cosine filter

The raised cosine filter is an accepted means of limiting the intersymbol interference inherent in the limited bandwidth of the transmission path. The frequency spectrum shaping is split between the transmitter and receiver units; hence the square root-raised cosine filter. The filter size and its coefficients were developed with the aid of Momentum Data Systems' QEDesign 1000. The responses of the 32-tap finite impulse response (FIR) filter for 12-bit fixed-point computations are shown in Figure 2. We shall use a 12-bit filter model and determine its gate count. (With 12-bit quantization the QEDesign program required only twenty-eight symmetrical coefficients. This design exercise will, however, consider a full 32-tap symmetrical FIR filter.)


Design trickery

The root-raised cosine filter shapes the spectrum of both the I&Q channels. While the I&Q samples are generated at 5 megasamples/sec, the filter produces 20-megasample/sec data for the modulator. Thus, the filter also functions as a 1:4 interpolator. The resulting computation load (exploiting coefficient symmetry) is two channels x sixteen symmetrical taps x 20 megasamples/sec = 640 megamultiply-accumulate operations/sec. This speed is well beyond that offered by most fixed-point DSP chips. Now, the FPGA becomes a compelling choice. There remains, however, the need to select a filter form that can map into an efficient CLB-based design.

There are several circuit configurations or forms to represent the FIR filter. Principal among them are the direct form (which is a frequent software model), the transpose form with variations (which has been implemented in dedicated filter chips), and the polyphase filter (which is favored for multi-rate applications). None of these forms can exploit coefficient symmetry to reduce the multiplication load. One trick for designing the multi-rate filter is to plot the signal flow trajectory in a sample-coefficient plane. With samples ordered along the vertical axis and coefficients listed horizontally, the data trajectory to generate a single filter response takes the form of a “V” rotated 90° (see Figure 3). With coefficient symmetry, only half the number of filter coefficients must be listed. Interpolation by K is expressed as K – 1 zeros stuffed between input samples. The V trajectories for the 32-tap FIR are plotted in Figure 3. While the interval between input data samples is 200 ns, a new trajectory must be initiated every 50 ns.

There are two computation models that can be derived from this plot. The first is a variation of the transpose form where a non-zero input sample is multiplied by all thirty-two coefficients and the products are accumulated in partial sum registers. After thirty-two products have been accumulated, a full filter response is output and a multiplier-accumulator circuit can be assigned to compute a new trajectory. Here, thirty-two MACs are performed every 200 ns. The second model is a delay and sum, which is the direct form of the FIR filter. As observed in the filter trajectory, eight stored samples are needed to compute a filter response. Computing five consecutive filter responses, we observe a pattern as shown in Table 1.

Four consecutive 20-MHz responses are computed with the same input data set of eight samples. Only two sets of filter coefficients are used. The filter coefficients appear in reverse order with respect to the samples in the third and fourth responses ($y_d$ and $y_e$) of each data set. Can these response equations be mapped into efficient FPGA circuits? Yes, and the key is DA, a computation algorithm that is absent from all major design tools. Before proceeding with the implementation of the response equation set, a simplification can be made.
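The pattern described above (eight stored samples per response, and only two distinct coefficient sets that reappear in reverse order) can be checked numerically. The following is a minimal sketch under stated assumptions: the coefficients are placeholder values that are merely symmetric, not the article's root-raised cosine design, and the function names `zero_stuff`, `fir_direct`, and `fir_from_symbols` are hypothetical.

```python
TAPS = 32
K = 4  # 1:4 interpolation

# Placeholder symmetric coefficients (NOT the article's root-raised cosine values).
half = [float(i + 1) for i in range(TAPS // 2)]
h = half + half[::-1]


def zero_stuff(symbols, k=K):
    """Insert k - 1 zeros after every input sample."""
    out = []
    for s in symbols:
        out.append(s)
        out.extend([0] * (k - 1))
    return out


def fir_direct(x, coeffs, n):
    """Direct-form response y[n] = sum_j coeffs[j] * x[n - j] over the zero-stuffed stream."""
    return sum(coeffs[j] * x[n - j] for j in range(len(coeffs)) if 0 <= n - j < len(x))


def fir_from_symbols(symbols, coeffs, n, k=K):
    """Same response computed from the (at most) eight non-zero samples only."""
    phase, newest = n % k, n // k
    sub = coeffs[phase::k]  # the eight coefficients that line up with non-zero samples
    return sum(c * symbols[newest - m] for m, c in enumerate(sub) if 0 <= newest - m < len(symbols))


# The four coefficient subsets used for four consecutive 20-MHz responses.
phases = [h[p::K] for p in range(K)]

symbols = [3, -1, 1, -3, 3, 1, -1, -3, 1, 3]  # arbitrary test data
x = zero_stuff(symbols)
matches = all(fir_direct(x, h, n) == fir_from_symbols(symbols, h, n) for n in range(len(x)))
print(matches, phases[3] == phases[0][::-1], phases[2] == phases[1][::-1])
```

With symmetric coefficients, the third and fourth subsets are simply the second and first read backwards, which is why only two coefficient sets need to be stored.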


The 5-MHz carrier

Carrier modulation is defined by the simple equation: $Y_k = y_{Ik}\cos\omega_C t + y_{Qk}\sin\omega_C t$, where $\omega_C$ is the carrier frequency $= 2\pi(5\ \text{MHz})$, and I&Q denotes the inphase and quadrature symbol components.

This equation is executed every 50 ns. Over one symbol period (200 ns) only four values of the carrier occur. These can be conveniently defined as: $\cos\omega_C t = 1, 0, -1, 0$ and $\sin\omega_C t = 0, 1, 0, -1$ (Reference 1).

The modulated output does not require any explicit multiplications or additions, nor is it necessary to compute the I&Q filter responses every 50 ns. Rather, an I response is computed in 50 ns followed by a Q response the next 50 ns. This is followed by a –I response, and then a –Q response, and the cycle repeats. The modulated filter response is expressed in Table 2.
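Because the 20-MHz sample rate is exactly four times the 5-MHz carrier, the carrier phase advances by a quarter cycle per sample, and the modulation equation collapses to a four-way selection. The sketch below compares the full equation with that selection; the function names and the test values are illustrative assumptions, not part of the original design.

```python
import math


def carrier_direct(y_i, y_q, k):
    """Evaluate Y_k = y_I cos(w_C t) + y_Q sin(w_C t) at t = k * 50 ns."""
    angle = 2 * math.pi * 5e6 * (k * 50e-9)  # equals k * pi / 2
    return y_i * math.cos(angle) + y_q * math.sin(angle)


def carrier_select(y_i, y_q, k):
    """Multiplier-free equivalent: cycle through I, Q, -I, -Q."""
    return (y_i, y_q, -y_i, -y_q)[k % 4]


# Arbitrary filter-output values used only for the comparison.
y_i, y_q = 0.75, -0.25
max_error = max(abs(carrier_direct(y_i, y_q, k) - carrier_select(y_i, y_q, k)) for k in range(8))
print(max_error < 1e-9)
```

In hardware terms, the selection means each 50-ns slot needs only one filter response (I or Q) and at most a sign inversion.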


The DA technique

DA is a computation technique specifically targeted to sum-of-product equations where one of the product term factors is a constant. DA design choices range from gate-efficient, bit-serial arithmetic to high-performance bit-parallel operations, a classic serial/parallel trade-off. DA techniques can be applied to many important linear, time-invariant digital signal processing algorithms such as filters (FIR and IIR), transforms (fast Fourier transform [FFT]), and matrix-vector products such as the 8 x 8 discrete cosine transform (DCT).

DA techniques have been known for over two decades, but have proved inappropriate for the fixed instruction set architecture of the programmable DSP. DA is, however, a good choice for FPGA implementation. It is a particularly good choice for LUT logic modules, such as the Xilinx CLB. A DA FIR filter design for the Xilinx XC3000 FPGA was first proposed in 1992 (Reference 1).

There are no explicit multipliers in DA circuits. Multiplication is achieved with LUTs. DA carries this process a step further by pre-storing the sum of partial products for all the terms within an equation and addressing the table (now a DALUT) with the bits of all the input variables. The serial DA circuit has a single DALUT that is addressed least significant bit first. The output sum of partial products is stored in an accumulator register, and, in a manner reminiscent of the old shift-and-add multiply subroutine in early computers, successive DALUT outputs are added to the binary-downshifted accumulated sum of partial products. A full double precision solution is available. A detailed description of the serial DA circuit is provided in Reference 1.
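The serial DA procedure described above can be modeled in software to confirm that it reproduces an ordinary sum of products. This is a behavioral sketch only, not the circuit of Reference 1: the function names `build_dalut` and `serial_da` are hypothetical, the coefficients and samples are arbitrary integers, and exact fractions stand in for the extra accumulator bits that hardware would carry.

```python
from fractions import Fraction


def build_dalut(coeffs):
    """Pre-store the sum of coefficients selected by each address bit pattern."""
    return [
        sum(c for i, c in enumerate(coeffs) if (addr >> i) & 1)
        for addr in range(1 << len(coeffs))
    ]


def serial_da(samples, dalut, bits):
    """Bit-serial DA: address the DALUT LSB first, halving the accumulator each step."""
    acc = Fraction(0)
    for b in range(bits):
        address = 0
        for i, x in enumerate(samples):
            # Python's >> on negative ints is arithmetic, so this yields two's-complement bits.
            address |= ((x >> b) & 1) << i
        term = dalut[address]
        if b == bits - 1:
            term = -term  # the sign bits carry negative weight, so subtract
        acc = acc / 2 + term  # downshift the accumulated sum, then add the new DALUT output
    return acc * (1 << (bits - 1))  # undo the net scaling to get an integer-valued result


coeffs = [3, -5, 7, 2]    # arbitrary constants
samples = [3, -4, 1, -1]  # 3-bit two's-complement inputs, range -4..3
da_result = serial_da(samples, build_dalut(coeffs), bits=3)
direct_result = sum(c * x for c, x in zip(coeffs, samples))
print(da_result, direct_result)
```

The loop runs once per input bit regardless of the number of terms, which is the serial/parallel trade-off mentioned earlier: more terms enlarge the table, not the cycle count.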


Filter implementation

The data path of the root-raised cosine filter is shown in Figure 4. The path is defined by standard functional blocks that are readily converted to CLBs. The 3-bit I&Q signals from the mapper are loaded every 200 ns into a parallel-to-serial shift register (PSR). The seven prior symbols are stored in a RAM shift register (SR) chain. The first three filter responses, $Y_b$, $Y_c$, and $Y_d$, are computed with data recirculating in the shift registers. A feedback path is required for the PSR, but the RAM SRs effect recirculation by modulo addressing while reading only. The modulus here is six, with the first three shifts for $Y_b$, the next three shifts for $Y_c$, and the last three for $Y_d$. In computing $Y_e$, the data are shifted down the SR chain. The modulo address pattern is repeated with data transferred (written) from the previous stage. All twelve shifts, the associated PSR load, and the RAM SR address and write controls can be derived from a 60-MHz system clock.

Because the same coefficient set is used for two sample periods (one computed with I data, the other with Q data), a single set of DALUTs may be used with 2/1 multiplexers routing the serial data streams to the appropriate address ports. These ports are annotated in Figure 4 to indicate the contents of the DALUT. A logic 1 level at the $h_3$ port selects all memory locations where $h_3$ is included in the sum of partial products. Similarly, a logic 1 at the $h_7$ port selects all locations where $h_7$ is included, and logic 1 levels at both the $h_3$ and $h_7$ ports select all locations where both $h_3$ and $h_7$ are included. This pattern is continued for the remaining six coefficients. Indeed, eight coefficients would require $2^8$ or 256 words of storage. And with 12-bit coefficients this would consume (256/32 words per CLB) x 12 = 96 CLBs. Another trick is to use two DALUTs with four coefficients each and add their outputs. The CLB count reduces to $(2 \times 2^4)/32 \times 12 + 13/2$ (parallel adder) = 18.5 CLBs.
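Both the storage arithmetic and the equivalence of the split tables can be verified with a short script. It is a sketch under stated assumptions: the eight coefficient values are arbitrary placeholders, the helper names `build_dalut` and `clbs_for_table` are hypothetical, and the adder cost of 13/2 CLBs is taken directly from the text.

```python
def build_dalut(coeffs):
    """Sum of the coefficients selected by each address bit pattern."""
    return [
        sum(c for i, c in enumerate(coeffs) if (addr >> i) & 1)
        for addr in range(1 << len(coeffs))
    ]


def clbs_for_table(words, width_bits, words_per_clb=32):
    """CLBs needed to store a table, at 32 x 1 bits of LUT memory per CLB."""
    return words / words_per_clb * width_bits


coeffs = [5, -3, 12, 7, -9, 4, 1, -6]  # arbitrary placeholder values

full = build_dalut(coeffs)      # one 256-word table
low = build_dalut(coeffs[:4])   # two 16-word tables
high = build_dalut(coeffs[4:])

# Adding the two small-table outputs reproduces every entry of the large table.
split_is_equivalent = all(full[a] == low[a & 0xF] + high[a >> 4] for a in range(256))

single_table_clbs = clbs_for_table(2 ** 8, 12)
split_table_clbs = 2 * clbs_for_table(2 ** 4, 12) + 13 / 2  # plus the 13-bit parallel adder

print(split_is_equivalent, single_table_clbs, split_table_clbs)
```

The saving comes from table size growing exponentially with the number of address bits, while the cost of the extra adder grows only linearly with word width.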

The same reduction also applies to the second set of filter coefficients, starting with $h_1$. With a 2/1 multiplexer the parallel adder can be time-shared. The adder expands to 13 bits that are then applied to the scaling accumulator, which performs the shift and add operations described earlier. When the sign bits of the input variables address the DALUT, a subtraction occurs. This is executed in standard fashion by complementing the DALUT output with the EXOR gates shown in Figure 4 and applying a carry-in to the first stage of the accumulator. For the negative responses, $Y_d$ and $Y_e$, the data samples are complemented by inverting all the DALUT output data excluding any data provided by the sign bits.

With I&Q data formatted in fractional 2’s complement, the filter coefficients can be scaled to prevent overflow in the final output. The ten most significant output bits can be loaded into the D/A converter driver register.

The total CLB count for the filter data path is 71.5. An itemized breakdown is given in Figure 4. The FPGA output ports have flip-flops which can serve as the D/A converter driver register. The grand total, which includes the encoder (thirty-one CLBs) and timing and control functions (estimated to be less than fifty CLBs), is approximately 159 CLBs, placing it well within the capacity of the second smallest chip of the Xilinx XC4000 family, namely the XC4005 (196 CLBs). With more advanced FPGA devices, such as the Xilinx Virtex, the CLB count is lower and the performance is higher.

Performance with a 60-MHz system clock is assured. The data flow is uniform and unidirectional. Pipeline registers can be inserted (at no increase in CLBs) to shorten combinatorial paths. The longest combinatorial path is the carry propagation through the fourteen stages of the scaling accumulator. With the built-in look-ahead carry circuits, however, adequate speed margins are provided.


Les Mintzer retired from full-time engineering at Rockwell and part-time teaching at the University of California at Irvine. He now does part-time consulting for Excelsus Technologies and can be reached at lmintzer@excelsus-tech.com.



Illustrations

Figure 1
Figure 2
Figure 3
Figure 4
Tables

Table 1
Table 2
References

  1. Mintzer, L., “FIR filters with the Xilinx FPGA,” FPGA First International ACM/SIGDA Proceedings, 1992.





All material on this site Copyright © 2000 CMP Media Inc. All rights reserved.



