22 November 2009

SOC Design Special Section

Bringing Parallel Processing to FPGA Designs


By Wai-Leng Lim

Standard products combine configurable embedded functions with field-programmable logic, delivering the performance and efficiency of ASICs with the flexibility and short development cycles of PLDs.

Communication design engineers are turning to digital signal processing to enable or enhance a variety of advanced system features. While most engineers use specialized digital signal processors (DSPs) for implementing signal processing functions, they are also increasingly using traditional FPGAs for hardware-based math coprocessors or sometimes as stand-alone processing engines.

Unfortunately for design engineers, general-purpose FPGA logic cells aren't very fast or efficient at the logic-intensive multiply-accumulate (MAC) operations at the heart of most DSP algorithms. That's because they were designed for "random" logic functions, which are usually very different from arithmetic logic. As a result, users must limit the functionality of their designs, accept reduced performance, or pay a lot of money for big, fast devices. Larger devices also bring higher power consumption and take up more board real estate.

To overcome these limitations, some designers have begun using standard-cell ASICs. These devices are fast and efficient for arithmetic functions, but lack the flexibility needed for quick design changes. Additionally, they are often economical only for the highest-volume production applications.

A new type of device called an embedded standard product (ESP) combines configurable embedded functions with field-programmable logic to deliver the performance and efficiency of ASICs with the flexibility and short development cycles of programmable devices. This concept has been used to embed into silicon dedicated blocks of reconfigurable embedded computational units (ECUs) specifically designed to perform high-speed arithmetic logic. With embedded dedicated logic and high-speed memory, the devices can provide up to four times the computational bandwidth of similar-sized, general-purpose logic devices without using any of the available logic cells. Algorithms and coefficients are loaded into memory, and the computational units are then reconfigured at consecutive stages to perform different high-speed arithmetic operations.

Architectural overview


ESP devices contain three basic elements: two rows of dual-port SRAM blocks, a logic cell array, and a row of ECUs (see Figure 1). Additional features such as programmable phase-locked loops (PLLs) enable frequency synthesis, which reduces cost and electromagnetic interference (EMI) generated by high-speed external clock signals.

An ESP can store multiple instructions and data in memory. Embedding a large number of dual-port SRAM memory blocks that can be configured as individual or cascaded RAM, ROM, or FIFO blocks allows the DSP designer to create small or large memory block structures. For instance, once a pattern is determined, the memory can be configured as separate synchronous or asynchronous FIFOs and the device can read and sequentially write instructions and data. RAM functions can also be created for data or variable-coefficient storage, and ROM functions for program memory or fixed-coefficient storage.

The ratio of the number of memory blocks to ECUs in an ESP device is 2:1. Various data paths can be created, such as:

  • Data can be stored in an X,Y memory configuration, with the ECU results routed to the logic-cell array for further data manipulation.
  • Results from the ECUs can be directly routed to the bottom memory row.
  • Data can be concatenated into Memory A, processed through the ECU, and then stored in Memory B.

In contrast to the memory blocks, the logic cell array is used for implementing general-purpose logic. As shown in Figure 2, each logic cell in the array has multiple outputs, so the entire cell can be used for a large function or parts of the cell can be used for smaller functions.

The PLLs, also called frequency synthesizers, are used to create a master clock from a lower-frequency input clock and can also multiply or divide an incoming clock. In addition, a PLL can be configured as an early clock option to further reduce the clock-to-output delay (tCO) of a system or as an output option to drive external devices. A lock detect signal indicates when a PLL is in lock.

Computational units


The ECU is the heart and soul of ESPs. This computational unit consists of four major building blocks: a multiplier, an adder, a register, and a mode-selection circuit. ECUs can be configured for eight arithmetic functions via the instruction set (see Table 1). The modes of the ECU block are dynamically reprogrammable through the instruction set sequencer.
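As a rough behavioral picture of the mode-selection idea, the following Python sketch models one ECU evaluation. It is an illustration under stated assumptions, not the device's actual model: only the two opcodes quoted later in this article (binary 011 for add, binary 110 for MAC) are included, the other six entries of Table 1 are not reproduced, and the function name `ecu`, the 16-bit result width, the unsigned arithmetic, and the way the carry is formed are all hypothetical.

```python
MASK16 = 0xFFFF        # assumed 16-bit result width, for illustration only

MODE_ADD = 0b011       # adder opcode quoted in this article
MODE_MAC = 0b110       # MAC opcode quoted in this article


def ecu(mode, x, y, adder_in=0, cin=0):
    """Return (result, cout) for one evaluation of a hypothetical ECU model."""
    if mode == MODE_ADD:
        total = x + y + cin                 # 16-bit operands assumed
    elif mode == MODE_MAC:
        total = x * y + adder_in + cin      # 8-bit operands assumed
    else:
        # The remaining Table 1 modes are deliberately not guessed at here.
        raise ValueError("mode not modelled in this sketch")
    return total & MASK16, (total >> 16) & 1
```

For example, `ecu(MODE_MAC, 3, 4, adder_in=5)` multiplies 3 by 4 and adds the incoming 5, while `ecu(MODE_ADD, 0xFFFF, 1)` wraps around and raises the carry output.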

ECUs are placed in a row configuration between the memory block and the logic-cell array for maximum flexibility. This ensures fast and efficient memory/instruction fetch and addressing of DSP algorithmic implementations. After processing, the ESP can route data back into memory or directly into the logic cells. Embedding the ECU into the silicon guarantees performance of arithmetic functions such as single-cycle, zero-clock-latency MACs (8-bit) at 144 MHz and adds (16-bit) at 396 MHz.

Within the ECU, a sequencer transitions between instruction-set modes. This sequencer can take several forms, such as a FIFO loaded with various algorithms, an external software-driven algorithm, or an internal state machine. This flexibility is invaluable for algorithm-intensive applications such as adaptive filtering, in which functions change on the next clock cycle.

Larger multipliers can be constructed by using multiple ECUs in parallel. For instance, a zero-latency 16-bit multiplier can be constructed using four ECUs (see Figure 3). The maximum delay is one multiplier and three adders. Fully pipelining the larger multipliers incurs a delay of only one multiplier and one adder. Alternatively, a single ECU multiplier can perform the four multiplications one after another to make a 16 x 16 multiplier, which becomes easier to do with an on-board PLL supplying the faster clock.
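The arithmetic behind building a 16 x 16 multiplier from four 8 x 8 multipliers can be sketched as follows. This is a plain Python illustration of the partial-product decomposition, assuming unsigned operands; the function names are hypothetical and the sketch says nothing about how the ECUs are actually wired in Figure 3.

```python
def mul8(a, b):
    """One 8 x 8 unsigned multiply, standing in for a single ECU multiplier."""
    assert 0 <= a < 256 and 0 <= b < 256
    return a * b


def mul16_from_four_mul8(x, y):
    """Multiply two unsigned 16-bit values using four 8 x 8 products."""
    xh, xl = x >> 8, x & 0xFF
    yh, yl = y >> 8, y & 0xFF
    ll = mul8(xl, yl)          # low  x low
    lh = mul8(xl, yh)          # low  x high
    hl = mul8(xh, yl)          # high x low
    hh = mul8(xh, yh)          # high x high
    # Three additions combine the four shifted partial products.
    return ll + ((lh + hl) << 8) + (hh << 16)
```

The four calls to `mul8` are independent, so they can run in parallel on four units or be spread over four cycles on one unit.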

Adders wider than 16-bit inputs can also be constructed using multiple ECUs. For example, a 32-bit adder can be built using two ECUs, both configured as adders. One ECU implements the addition for the lower 16 bits and the other ECU implements the upper 16 bits. The COUT of the lower ECU connects to the CIN of the upper ECU to provide the carry bit. Much larger adders are built in a similar fashion.
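The carry chaining described above can be modelled in a few lines. The sketch below assumes unsigned operands; `add16` and `add32` are hypothetical names for a single 16-bit adder stage and the two-stage composition.

```python
def add16(a, b, cin=0):
    """One 16-bit adder stage: returns (sum, cout)."""
    total = (a & 0xFFFF) + (b & 0xFFFF) + cin
    return total & 0xFFFF, total >> 16


def add32(a, b):
    """32-bit add built from two 16-bit stages linked by the carry bit."""
    lo, carry = add16(a & 0xFFFF, b & 0xFFFF)       # lower 16 bits
    hi, cout = add16(a >> 16, b >> 16, carry)       # upper 16 bits, CIN = lower COUT
    return (hi << 16) | lo, cout
```

Wider adders follow the same pattern, with each stage's carry output feeding the next stage's carry input.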

Software


As with any DSP system design, software design support is crucial. Filter design requires tools for coefficient generation and analysis. In addition, the design engineer must have access to optimized macros or libraries of DSP functions in order to speed up the design process and meet time-to-market constraints.

Hardware description languages (HDLs) such as Verilog or VHDL are employed to create system-level designs that are synthesized, placed, and routed into the ESP. The ECU is modeled in both HDLs and as a schematic symbol. All that is needed is to instantiate the ECU model in the behavioral code, as shown in Code Listing 1.

Simulation


Verification through system simulation reduces the debug cycle by ensuring that all components perform as specified in the design. Both behavioral and post-layout simulation using a Verilog or VHDL simulator can be used. Trial and error is an ineffective technique and should be avoided, especially considering the large, complex designs now being implemented.

To illustrate this point, let's look at a filtering example. In an eight-tap FIR filter, the input shifts through eight registers (taps). Each output stage of a particular register is multiplied by either a known or variable coefficient. The resulting outputs of the multipliers are then summed to create the filter output.

A high-performance system employs a separate MAC for each tap. Eight ECUs are configured as MACs by toggling the 3-bit instruction set to binary 110. Since the ECUs are cascaded, some of the input ports are tied to the previous ECU outputs as follows:

  • XBus input is tied to the input data.
  • YBus input is tied to a coefficient.
  • AdderBus input is tied to the previous ECU output.
  • CIN input is tied to the previous ECU COUT.
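The cascaded-MAC structure can be checked against a simple software model. The sketch below is a behavioral illustration only: it uses unbounded Python integers rather than fixed bus widths, ignores the carry chain, and the function name `fir8_parallel` is hypothetical.

```python
def fir8_parallel(samples, coeffs):
    """Eight-tap FIR with one MAC per tap, each fed by the previous MAC's result."""
    assert len(coeffs) == 8
    taps = [0] * 8                     # the eight tap registers
    outputs = []
    for s in samples:
        taps = [s] + taps[:-1]         # shift the new sample into the tap chain
        acc = 0                        # adder input of the first MAC is zero
        for x, c in zip(taps, coeffs):
            acc = x * c + acc          # one MAC per tap, cascaded
        outputs.append(acc)
    return outputs
```

Feeding in a unit impulse returns the coefficients in order, which is a quick sanity check for any FIR implementation.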

An example of the Verilog code for the parallel filter is shown in Code Listing 2. For linear phase response FIR filters, the coefficients are symmetric about the center taps. This knowledge allows the designer to "fold" the filter in half, reducing the number of MACs required.
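Folding works because taps that share a coefficient can be added before they are multiplied. The following sketch compares a direct eight-tap sum with its folded form; the function names are hypothetical and the model uses plain integers.

```python
def fir8_direct(taps, coeffs):
    """Direct form: eight multiplies, one per tap."""
    return sum(x * c for x, c in zip(taps, coeffs))


def fir8_folded(taps, coeffs):
    """Folded form for symmetric coefficients: four adds, then four MACs."""
    assert len(taps) == 8 and len(coeffs) == 8
    assert coeffs == coeffs[::-1], "folding requires symmetric coefficients"
    sums = [taps[i] + taps[7 - i] for i in range(4)]          # pair mirrored taps
    return sum(s * coeffs[i] for i, s in enumerate(sums))     # half the multiplies
```

With taps `[3, 1, 4, 1, 5, 9, 2, 6]` and coefficients `[1, 2, 3, 4, 4, 3, 2, 1]`, both functions return the same value while the folded version performs four multiplies instead of eight.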

Multi-clocking operations


One strength of the ECU is its ability to compute various arithmetic functions in a short time period, allowing the ECU to be used in several different configurations within a given clock period. For instance, for the linear FIR filter example, a 70-MHz system would normally require four adders and four MACs.

Yet only four ECUs can accomplish the same functions with the help of the embedded PLLs, which can generate a clock at 2 to 4 times the original frequency. The PLL generates a 140-MHz clock, which allows the same ECU to be used for two clock cycles and still meet the required 70-MHz clock performance. During cycle one, the four ECUs are configured in parallel as adders via the instruction set (set to binary 011) and values are calculated for Sum0, Sum1, Sum2, and Sum3. During cycle two, the same four ECUs are configured as MACs (instruction set to binary 110), and the values from cycle one and the coefficients are used to compute the required output. Logic-cell-based multiplexers are used to select between the different inputs that are fed into the ECUs.
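The two-cycle schedule can be sketched as a small model in which the same four units are used first as adders and then as MACs. This is an illustration under assumptions: the function names are hypothetical, the opcodes are the two quoted in the text, and the clock calculation simply multiplies the sample rate by the number of cycles each unit is reused.

```python
MODE_ADD = 0b011   # adder opcode quoted in the text
MODE_MAC = 0b110   # MAC opcode quoted in the text


def required_fast_clock_mhz(sample_rate_mhz, cycles_per_sample):
    """Clock the PLL must supply when each ECU is reused several times per sample."""
    return sample_rate_mhz * cycles_per_sample


def folded_fir_two_cycle(taps, half_coeffs):
    """One output of a folded eight-tap FIR on four shared units over two cycles."""
    schedule = []
    # Cycle one: all four units act as adders and produce Sum0..Sum3.
    sums = [taps[i] + taps[7 - i] for i in range(4)]
    schedule.append((MODE_ADD, list(sums)))
    # Cycle two: the same four units act as cascaded MACs.
    acc = 0
    for s, c in zip(sums, half_coeffs):
        acc = s * c + acc
    schedule.append((MODE_MAC, acc))
    return acc, schedule
```

Reusing each unit twice per sample at a 70-MHz sample rate gives `required_fast_clock_mhz(70, 2)`, a doubled clock, in exchange for halving the number of units.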

An example of the Verilog code for the linear filter is shown in Code Listing 3. Another use of the ECU in filtering applications can be derived directly from DSP processors. Here, an ECU can be configured as a MAC and multi-clocked with the same input data and with different coefficient values associated with subsequent taps. The resulting values are stored in memory and accessed during the required filter stage.

A DSP uses a single MAC multiple times per cycle. The linear algorithm is looped eight times:

result = 0;                      /* clear the result */
for (i = 0; i < 8; i++)          /* repeat 8 times */
{
    result += c[i] * data[i];    /* multiply-accumulate one tap */
}

Using Verilog and the ECU, the code is as follows:

ECU ECU_looped(.Clock(clk),
.Reset(rst), .XBus(data), .YBus(c),
.AdderBus(), .Cin(cout), .IBus(ibus),
.Sign(sign), .Result(result), .Cout(cout));

A variety of counter-driven state machines can be constructed to sequence the data through the ECU, providing the same functionality as the linear algorithm above. For instance, the RAM can be configured as a FIFO and used to store the data and coefficient values. The state machine pops data and coefficients from the FIFO and sequences the values through the ECU.
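The FIFO-driven variant can be sketched in software with two queues and a counter. This is a behavioral illustration only; `mac_from_fifos` is a hypothetical name, and Python's `collections.deque` stands in for the RAM blocks configured as FIFOs.

```python
from collections import deque


def mac_from_fifos(data_fifo, coeff_fifo, taps=8):
    """Sequence data/coefficient pairs from two FIFOs through a single MAC."""
    result = 0                          # clear the accumulator
    for _ in range(taps):               # counter-driven: one pop per MAC cycle
        x = data_fifo.popleft()         # next data value
        c = coeff_fifo.popleft()        # matching coefficient
        result += c * x                 # single MAC reused for every tap
    return result
```

Calling `mac_from_fifos(deque(range(1, 9)), deque([1] * 8))` sums the eight data values with unit coefficients, mirroring the eight-iteration loop above.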

Adaptive filtering


In adaptive filtering, the coefficients change dynamically with the input data. Due to its fast operational speeds, the ECU works well for this type of application. The data and control inputs to the ECUs can be tied directly to the I/Os of the device, allowing coefficients to be altered on subsequent clock cycles with no performance degradation.

In other programmable-logic architectures, the predetermined multiplied coefficient values are placed into a look-up table (LUT) structure, which requires a large amount of resources. This approach does not work well when the coefficients are not known beforehand, as in the dynamic environment of adaptive filtering. Those architectures then need two LUTs, which is a serious drawback. The first LUT contains the predetermined multiplier values, which are fed to each tap. The second LUT is loaded with the new coefficient values and eventually multiplexed with the first LUT to the filter taps. Once the multiplier outputs are obtained, a complex pipelined adder tree is created to handle the summation operations.

Clearly, the use of two LUTs is an inefficient use of resources. This inefficiency translates into greater power consumption than an ESP implementation.
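A simplified picture of why precomputed product tables suit fixed coefficients but not changing ones is sketched below. It is an illustration only, not a description of any particular vendor's LUT scheme: the function name `build_product_lut` and the 8-bit data width are assumptions.

```python
def build_product_lut(coeff, data_bits=8):
    """Precompute coeff * x for every possible data value x."""
    return [coeff * x for x in range(1 << data_bits)]


# With a fixed coefficient the table is built once and multiplication is a lookup.
active_lut = build_product_lut(5)
product = active_lut[7]                 # same value as 5 * 7

# When the coefficient changes, a complete second table has to be built
# before it can replace the active one.
shadow_lut = build_product_lut(6)
```

Each coefficient update costs a full table of entries in this model, whereas a hardware multiplier simply receives the new coefficient on its input.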

Multi-sample processing


The large number of ECUs can also be used for multiple-sample processing. A high-data-rate stream can be split among the ECUs so that resources are used efficiently for the same or different arithmetic operations.

For example, instead of processing the sample sets one at a time, item i of 18 different sets of series calculations can be processed at the same time on 18 different ECUs contained in one device. Each ECU can be fed the same constants and the same variable locations, with a different offset for each sample path. This minimizes memory transfers for the constants and instructions and enables the code to process 18 sample sets in the time it takes to process one, an 18X speedup.
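The lane-per-sample-set idea can be sketched as follows. The model is sequential Python, so it shows only the data arrangement and not the speedup itself; `process_sets` is a hypothetical name, and the inner loop stands for work that the separate ECUs would perform concurrently.

```python
NUM_ECUS = 18      # number of ECUs in the example above


def process_sets(sample_sets, coeffs):
    """Accumulate item i of every sample set with the shared constant coeffs[i]."""
    assert len(sample_sets) <= NUM_ECUS
    accs = [0] * len(sample_sets)               # one accumulator per ECU lane
    for i, c in enumerate(coeffs):              # the constant is fetched once per step
        for lane, samples in enumerate(sample_sets):
            accs[lane] += c * samples[i]        # lanes run concurrently in hardware
    return accs
```

Because every lane uses the same constant at each step, the constant is read once and shared, and only the sample data differs from lane to lane.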

ESP devices can also be configured for multiple data paths. Individual ECUs can either be linked directly to separate data paths or linked together to run parallel ECU operations per cycle. This is ideal for applications consisting of multiple subtasks that need little or no interdependency. An example is the processing of multiple channels of speech coding: multiple designs in a single ESP, using separate ECUs, execute the same program on different channels of data.

Leaving tradition behind


Design engineers using traditional programmable-logic devices (PLDs) to implement DSP functions in hardware are making functional trade-offs that are unnecessary with an embedded approach. Limiting designs to fixed-coefficient inputs where variable coefficients are required for real-time prototyping or implementation is an example of such a trade-off.

Embedding a reprogrammable computational unit and memory blocks into silicon allows DSP design engineers to efficiently implement complex algorithms and multiple-sample processing across single or multiple data paths without sacrificing performance. Since the logic utilization is efficient even for very complex designs, design engineers can use smaller, less expensive devices with the added advantage of low power consumption.


Illustrations

Wai-Leng Lim is a member of the QuickLogic customer engineering group, focusing on FPGA design. He received a BSEE degree from the University of Arizona, Tucson, and can be contacted at waileng@quicklogic.com.

Figure 1
Figure 2

Tables
Table 1

Code Listings
Code Listing 1
Code Listing 2
Code Listing 3


