Bringing Parallel Processing to FPGA Designs
By Wei-Leng Lim
Embedded standard products combine configurable embedded functions with field-programmable logic, delivering the performance and efficiency of ASICs with the flexibility and short development cycles of PLDs.
Communication design engineers are turning to digital signal processing to enable or enhance a variety of advanced system features. While most engineers use specialized digital signal processors (DSPs) for implementing signal processing functions, they are also increasingly using traditional FPGAs for hardware-based math coprocessors or sometimes as stand-alone processing engines.
Unfortunately for design engineers, general-purpose FPGA logic cells aren't very fast or efficient for the logic-intensive multiply-accumulate (MAC) operations at the heart of most DSP algorithms. That's because they've been designed for "random" logic functions that are usually very different from arithmetic logic. As a result, users must limit the level of functionality in their designs, accept reduced performance, or pay a lot of money for big, fast devices. With the larger device usage comes higher power consumption and increased board real estate.
To overcome these limitations, some designers have begun using standard-cell ASICs. These devices are fast and efficient for arithmetic functions, but lack the flexibility needed for quick design changes. Additionally, they are often economical only for the highest-volume production applications.
A new type of device called an embedded standard product (ESP) combines configurable embedded functions with field-programmable logic to deliver the performance and efficiency of ASICs with the flexibility and short development cycles of programmable devices. This concept has been used to embed into silicon dedicated blocks of reconfigurable embedded computational units (ECUs) specifically designed to perform high-speed arithmetic logic. With embedded dedicated logic and high-speed memory, the devices can provide up to four times the computational bandwidth of similar-sized, general-purpose logic devices without using any of the available logic cells. Algorithms and coefficients are loaded into memory to configure a versatile computational unit at consecutive stages for varying high-speed arithmetic operations.
Architectural overview
ESP devices contain three basic elements: two rows of dual-port SRAM blocks, a logic cell array, and a row of ECUs (see Figure 1). Additional features such as programmable phase-locked loops (PLLs) enable frequency synthesis, which reduces cost and electromagnetic interference (EMI) generated by high-speed external clock signals.
An ESP can store multiple instructions and data in memory. Embedding a large number of dual-port SRAM memory blocks that can be configured as individual or cascaded RAM, ROM, or FIFO blocks allows the DSP designer to create small or large memory block structures. For instance, once a pattern is determined, the memory can be configured as separate synchronous or asynchronous FIFOs and the device can read and sequentially write instructions and data. RAM functions can also be created for data or variable-coefficient storage, and ROM functions for program memory or fixed-coefficient storage.
The ratio of the number of memory blocks to ECUs in an ESP device is 2:1. Various data paths can be created, such as:
Data can be stored in an X,Y memory configuration, with the ECU results routed to the logic-cell array for further data manipulation.
Results from the ECUs can be directly routed to the bottom memory row.
Data can be concatenated into Memory A, processed through the ECU, and then stored in Memory B.
In contrast to the memory blocks, the logic cell array is used for implementing general-purpose logic. As shown in Figure 2, each logic cell in the array has multiple outputs, which allow the entire cell to be used for large functions or parts of the cell for smaller functions.
The PLLs, also called frequency synthesizers, are used to create a master clock from a lower input frequency clock and can also multiply or divide an incoming clock. In addition, a PLL can be configured as an early clock option to further reduce the clock-to-output time (tCO) of a system or as an output option to drive external devices. A lock-detect signal indicates when a PLL is in lock.
Computational units
The ECU is the heart and soul of ESPs. This computation unit consists of four major building blocks: a multiplier, adder, register, and mode-selection circuit. ECUs can be configured for eight arithmetic functions via the instruction set (see Table 1). The modes for the ECU block are dynamically reprogrammable through the instruction set sequencer.
ECUs are placed in a row configuration between the memory block and the logic-cell array for maximum flexibility. This ensures fast and efficient memory/instruction fetch and addressing of DSP algorithmic implementations. After processing, the ESP can route data back into memory or directly into the logic cells. Embedding the ECU into the silicon guarantees performance of arithmetic functions such as single-cycle, zero-clock-latency MACs (8-bit) at 144 MHz and adds (16-bit) at 396 MHz.
Within the ECU, a sequencer switches between instruction-set modes. This sequencer can take several forms, such as a FIFO loaded with various algorithms, an external software-driven algorithm, or an internal state machine. This flexibility is invaluable for algorithm-intensive applications such as adaptive filtering, in which functions change on the next clock cycle.
Larger multipliers can be constructed by using multiple ECUs in parallel. For instance, a zero-latency 16-bit multiplier can be constructed using four ECUs (see Figure 3).
The maximum delay is one multiplier and three adders. Fully pipelining the larger multipliers incurs a delay of only one multiplier and one adder. Alternatively, a single ECU multiplier can perform the four multiplications in successive clock cycles to make a 16 x 16 multiplier, which becomes easier to do with an on-board PLL.
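The four-ECU decomposition can be checked with a small behavioral model. The sketch below is written in Python rather than HDL so it can run anywhere; it assumes each ECU multiplier takes unsigned 8-bit operands, and the names `mul8` and `mul16_from_four_mul8` are illustrative, not part of any vendor library.

```python
MASK8 = 0xFF

def mul8(a, b):
    """Stand-in for one ECU used as an 8 x 8 unsigned multiplier (assumed width)."""
    assert 0 <= a <= MASK8 and 0 <= b <= MASK8
    return a * b

def mul16_from_four_mul8(x, y):
    """Form a 16 x 16 unsigned product from four 8 x 8 partial products."""
    x_hi, x_lo = x >> 8, x & MASK8
    y_hi, y_lo = y >> 8, y & MASK8
    p_ll = mul8(x_lo, y_lo)   # weight 2**0
    p_lh = mul8(x_lo, y_hi)   # weight 2**8
    p_hl = mul8(x_hi, y_lo)   # weight 2**8
    p_hh = mul8(x_hi, y_hi)   # weight 2**16
    # Three additions combine the shifted partial products.
    return p_ll + ((p_lh + p_hl) << 8) + (p_hh << 16)
```

The three additions in the return statement correspond to the three adders in the worst-case delay path; the time-multiplexed variant would compute the same four partial products one per fast-clock cycle.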
Adders wider than 16-bit inputs can also be constructed using multiple ECUs. For example, a 32-bit adder can be built using two ECUs, both configured as adders. One ECU implements the addition for the lower 16 bits and the other ECU implements the upper 16 bits. The COUT of the lower ECU connects to the CIN of the upper ECU to provide the carry bit. Much larger adders are built in a similar fashion.
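The carry chaining described above can be sketched as a behavioral model. This is a Python illustration, not vendor code; `add16` stands in for one ECU configured as a 16-bit adder, and both function names are assumptions made for the example.

```python
MASK16 = 0xFFFF

def add16(a, b, cin=0):
    """Model of one 16-bit adder stage: returns (sum, carry_out)."""
    total = (a & MASK16) + (b & MASK16) + cin
    return total & MASK16, total >> 16

def add32(a, b):
    """32-bit add built from two 16-bit stages linked by COUT -> CIN."""
    lo, carry = add16(a & MASK16, b & MASK16)             # lower ECU
    hi, cout = add16((a >> 16) & MASK16, (b >> 16) & MASK16, carry)  # upper ECU
    return (hi << 16) | lo, cout
```

Wider adders follow the same pattern: each additional stage takes the previous stage's carry out as its carry in.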
Software
As with any DSP system design, software design support is crucial. To facilitate the task of filter design, coefficient generation and analysis are required. In addition, the design engineer must have access to optimized macros or libraries of DSP functions in order to speed the design process and meet time-to-market constraints.
Hardware description languages (HDLs) such as Verilog or VHDL are employed to create system-level designs that are synthesized, placed, and routed into the ESP. The ECU is modeled in both HDLs and as a schematic symbol. All that is needed is to instantiate the ECU model into the behavioral code as shown in Code Listing 1.
Simulation
Verification through system simulation reduces the debug cycle by ensuring that all components perform as specified in the design. Both behavioral and post-layout simulation using a Verilog or VHDL simulator can be used. The trial-and-error technique is ineffective and should be avoided, especially considering the large, complex designs now being implemented.
To illustrate this point, let's look at a filtering example. In an eight-tap FIR filter, the input shifts through eight registers (taps). Each output stage of a particular register is multiplied by either a known or variable coefficient. The resulting outputs of the multipliers are then summed to create the filter output.
A high-performance system employs a separate MAC for each tap. Eight ECUs are configured as MACs by setting the 3-bit instruction to binary 110. Since the ECUs are cascaded, some of the input ports are tied to the previous ECU outputs as follows:
XBus input is tied to the input data.
YBus input is tied to a coefficient.
AdderBus input is tied to the previous ECU output.
CIN input is tied to the previous ECU COUT.
An example of the Verilog code for the parallel filter is shown in Code Listing 2.
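For readers without the listing at hand, the cascade can be approximated with a behavioral model. The following Python sketch is an assumption-laden stand-in, not the article's Verilog: `mac` models one ECU in MAC mode with the previous ECU's output on its adder input, and the carry chain is omitted because Python integers do not overflow.

```python
from collections import deque

def mac(x, y, adder_in):
    """One ECU in MAC mode: multiply XBus by YBus and add the AdderBus input."""
    return x * y + adder_in

def fir8(samples, coeffs):
    """Eight-tap FIR: one MAC per tap, each fed by the previous MAC's result."""
    assert len(coeffs) == 8
    taps = deque([0] * 8, maxlen=8)   # the eight tap registers
    outputs = []
    for s in samples:
        taps.appendleft(s)            # newest sample enters tap 0, oldest drops off
        acc = 0
        for x, c in zip(taps, coeffs):
            acc = mac(x, c, acc)      # cascade: previous output feeds the next adder
        outputs.append(acc)
    return outputs
```

Feeding a unit impulse through the model returns the coefficient list, which is a quick sanity check for any FIR implementation.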
For linear-phase-response FIR filters, the coefficients are symmetric about the center taps. This knowledge allows the designer to "fold" the filter in half, reducing the number of MACs required.
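A short Python sketch shows why folding works: with symmetric coefficients, the two samples that share a coefficient can be added first and multiplied once. The function names and the sample values are illustrative only.

```python
def fir8_direct(window, coeffs):
    """Reference: eight multiplies, one per tap."""
    return sum(x * c for x, c in zip(window, coeffs))

def fir8_folded(window, half_coeffs):
    """Folded form: four pre-adds followed by four MACs."""
    assert len(window) == 8 and len(half_coeffs) == 4
    acc = 0
    for i, c in enumerate(half_coeffs):
        pre_sum = window[i] + window[7 - i]   # mirrored taps share coefficient c
        acc += pre_sum * c
    return acc

window = [3, -1, 4, 1, -5, 9, 2, -6]
half = [1, 2, 3, 4]
full = half + half[::-1]                      # symmetric about the center taps
print(fir8_direct(window, full), fir8_folded(window, half))
```

Both calls produce the same result, while the folded version uses half as many multiplications.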
Multi-clocking operations
One strength of the ECU is its ability to compute various arithmetic functions in a short time period, allowing the ECU to be used in several different configurations within a given clock period. For instance, for the linear-phase FIR filter example, a 70-MHz system would normally require four adders and four MACs. Yet only four ECUs can accomplish the same functions with the help of the embedded PLLs, which can generate 2 to 4 times the original clock frequency. The PLL generates a 140-MHz clock, which allows the same ECU to be used for two clock cycles and still meet the required 70-MHz clock performance. During cycle one, the four ECUs are configured in parallel as adders via the instruction set (set to binary 011) and values are calculated for Sum0, Sum1, Sum2, and Sum3. During cycle two, the same four ECUs are configured as MACs (instruction set to binary 110), and the values from cycle one and the coefficients are used to compute the required output. Logic-cell-based multiplexers select between the different inputs that are fed into the ECUs.
An example of the Verilog code for the linear filter is shown in Code Listing 3.
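The two-phase schedule can also be expressed as a behavioral sketch in Python. Only the two mode codes quoted in the text (011 for add, 110 for MAC) are modeled; the assumption that adder mode sums the X and Y inputs, and the function names, are illustrative rather than taken from the device documentation.

```python
MODE_ADD = 0b011   # adder mode, as quoted in the text
MODE_MAC = 0b110   # MAC mode, as quoted in the text

SYSTEM_CLOCK_MHZ = 70
FAST_CLOCK_MHZ = 2 * SYSTEM_CLOCK_MHZ   # PLL doubles the clock: two ECU cycles per sample

def ecu(mode, x, y, adder_in=0):
    """Behavioral stand-in for one ECU; other modes are not modeled here."""
    if mode == MODE_ADD:
        return x + y                     # assumed operand routing for adder mode
    if mode == MODE_MAC:
        return x * y + adder_in
    raise ValueError("mode not modeled in this sketch")

def folded_fir_two_phase(window, half_coeffs):
    """Reuse the same four ECUs: adders in cycle one, MACs in cycle two."""
    # Fast-clock cycle one: Sum0..Sum3 from the mirrored taps.
    sums = [ecu(MODE_ADD, window[i], window[7 - i]) for i in range(4)]
    # Fast-clock cycle two: the same four ECUs, now cascaded as MACs.
    acc = 0
    for s, c in zip(sums, half_coeffs):
        acc = ecu(MODE_MAC, s, c, acc)
    return acc
```

The list comprehension and the loop represent the two fast-clock cycles; in hardware the four ECUs in each phase operate at the same time.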
Another use of the ECU in filtering applications can be derived directly from DSP processors. Here, an ECU can be configured as a MAC and multi-clocked with the same input data with different coefficient values associated with subsequent taps. The resulting values are stored in memory and accessed during the required filter stage.
A DSP uses a single MAC multiple times per cycle. The linear algorithm is looped eight times:
result = 0;                    /* clears result */
for (i = 0; i < 8; i++)        /* repeats 8 times */
{
    result += c[i] * data[i];
}
Using Verilog and the ECU, the code is as follows:
ECU ECU_looped(.Clock(clk),
    .Reset(rst), .XBus(data), .YBus(c),
    .AdderBus(), .Cin(cout), .IBus(ibus),
    .Sign(sign), .Result(result), .Cout(cout));
A variety of counter-driven state machines can be constructed to sequence the data through the ECU, providing the same functionality as the linear algorithm above. For instance, the RAM can be configured as a FIFO and used to store the data and coefficient values. The state machine pops data and coefficients from the FIFO and sequences the values through the ECU.
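A FIFO-driven sequencer of this kind can be sketched in Python. This is a behavioral illustration under simple assumptions: one FIFO holds data and coefficient pairs, a counter tracks the MAC cycles, and the name `looped_mac` is hypothetical.

```python
from collections import deque

def looped_mac(data, coeffs):
    """Sequence data/coefficient pairs from a FIFO through a single MAC."""
    fifo = deque(zip(data, coeffs))   # RAM configured as a FIFO
    result = 0                        # accumulator cleared before the loop
    cycles = 0                        # counter that drives the state machine
    while fifo:
        x, c = fifo.popleft()         # pop the next data/coefficient pair
        result = x * c + result       # one MAC operation per cycle
        cycles += 1
    return result, cycles
```

For eight taps the model takes eight MAC cycles, matching the eight loop iterations of the processor-style code.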
Adaptive filtering
In adaptive filtering, the coefficients change dynamically with the input data. Due to its fast operational speeds, the ECU works well for this type of application. The data and control inputs to the ECUs can be tied directly to the I/Os of the device, allowing coefficients to be altered on subsequent clock cycles with no performance degradation.
In other programmable-logic architectures, the predetermined multiplied coefficient values are placed into a look-up table (LUT) structure, which requires a large amount of resources. This approach does not work well when the coefficients are not known beforehand, as in the dynamic environment of adaptive filtering. Such architectures must resort to two LUTs, which causes a serious problem. The first LUT contains the predetermined multiplier values, which are fed to each tap. The second LUT is loaded with the new coefficient values and eventually multiplexed with the first LUT to the filter taps. Once the multiplier outputs are obtained, a complex pipelined adder tree must be created to handle the summation operations.
Clearly, the use of two LUTs is an inefficient use of resources. This inefficiency translates to greater power consumption than an ESP implementation.
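The cost of the LUT approach is easy to see in a simplified model. The Python sketch below assumes an 8-bit data input and one table of precomputed products per coefficient; real LUT-based filters are organized differently in detail, so treat this only as an illustration of the scaling, with `build_product_lut` a hypothetical helper.

```python
def build_product_lut(coeff, data_bits=8):
    """Precompute coeff * x for every possible data value x."""
    return [coeff * x for x in range(1 << data_bits)]

# A fixed coefficient needs one table of 2**8 entries per tap.
lut = build_product_lut(5)
product = lut[7]                 # table lookup replaces the multiply

# When the coefficient changes, the whole table must be regenerated
# in a second LUT while the first one keeps serving the filter.
new_lut = build_product_lut(3)
entries_rewritten = len(new_lut)
```

A hardware multiplier, by contrast, simply receives the new coefficient on its input in the next clock cycle.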
Multi-sample processing
The large number of ECUs also makes multiple-sample processing possible. A high-data-rate stream can be split among the ECUs to use resources efficiently for the same or different arithmetic operations.
For example, instead of processing one sample set at a time, process the same sample point for multiple sample sets at once: compute item i for 18 different sets of series calculations on 18 different ECUs contained on one device. Each ECU can be fed the same constants and the same variable locations, with a different offset for each sample path. This minimizes memory transfers for the constants and instructions and enables the code to process 18 sample sets in the time it takes to process one, an 18X speedup.
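The scheduling idea can be sketched in Python. The model below is illustrative only: each "lane" stands for one ECU, the inner loop represents work that the hardware performs concurrently, and the function name is an assumption.

```python
NUM_ECUS = 18

def process_sets_in_parallel(sample_sets, constants):
    """Apply item i of the calculation to all 18 sample sets in the same step."""
    assert len(sample_sets) == NUM_ECUS
    accs = [0] * NUM_ECUS
    steps = 0
    for i, k in enumerate(constants):        # one step per item i
        for lane in range(NUM_ECUS):         # concurrent across ECUs in hardware
            accs[lane] += k * sample_sets[lane][i]   # same constant, per-lane data
        steps += 1
    return accs, steps
```

The step count depends only on the length of the series, not on the number of sample sets, which is where the 18X figure comes from.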
ESP devices can also be configured for multiple data paths. Individual ECUs can either be linked directly to separate data paths or linked together to run parallel ECU operations per cycle. This is ideal for applications consisting of multiple subtasks that need little or no interdependency. An example is processing of multiple channels of speech coding: multiple designs in a single ESP, using separate ECUs, execute the same program on different channels of data.
Leaving tradition behind
Design engineers using traditional programmable-logic devices (PLDs) to implement DSP functions in hardware are making functional trade-offs that are unnecessary with an embedded approach. Limiting designs to fixed-coefficient inputs where variable coefficients are required for real-time prototyping or implementation is an example of such a trade-off.
Embedding a reprogrammable computational unit and memory blocks into silicon allows DSP design engineers to efficiently implement complex algorithms and multiple-sample processing across single or multiple data paths without sacrificing performance. Since the logic utilization is efficient even for very complex designs, design engineers can use smaller, less expensive devices with the added advantage of low power consumption.