Boosting Network System Designs with CAMs
By Mike Ichiriu and Chris O'Reilly
By implementing CAMs, designers can free up processor resources and accelerate table lookups in their high-speed networking designs.
Designing with content-addressable memory (CAM) is a great way for designers to meet the challenges posed by explosive Internet growth, which has strained the traditional architecture of networking hardware to the limit. These devices benefit a range of applications, such as voice over IP (VoIP) and streaming video, which require greater bandwidth and reduced end-to-end delay in data transmission.
Specialized hardware components like CAMs are increasingly needed to provide the table-lookup boost necessary to satisfy performance specifications in high-speed networks. CAMs can perform a lookup every clock cycle, performance that is far superior to, and more scalable than, that of a hashing, tree, or other algorithm-based solution.
At the same time, off-the-shelf CAMs eliminate the cost of developing and testing a proprietary search algorithm. The CPU cycles saved by offloading lookups to a CAM can then be devoted to running more complex applications. Additionally, the complex lookups that can be offloaded to a CAM enable new functions, such as the wire-speed enforcement of service-level agreements (SLAs).
CAM basics
Figure 1 shows the block diagram of a typical CAM device. While the name "content-addressable memory" implies that a CAM is a memory device, it is apparent from the block diagram in Figure 1 that CAMs are more logic than memory. In fact, the memory elements in a CAM typically take up less than half of the total die size. The rest consists of the support logic for the CAM array (the logic that enables the CAM to resolve multiple matches) and the control logic that provides enhanced CAM functionality (the configuration logic).
The CAM architecture also forgoes the more traditional bus structure of memory devices such as SRAMs and DRAMs in favor of an interface that allows CAMs to maintain single-cycle pipelined searches. Single-cycle pipelining is necessary for prolonged periods of operation at peak capacity. As a result, the user interface consists of four independent buses.
Similar to a data bus, the comparand bus (CBUS) is used to load data and access the internal registers. The instruction bus (IBUS) is used to issue instructions to the CAM, which dispatches one instruction every clock cycle. Search results are driven on the results bus (RBUS), and the next free address bus (NFABUS) maintains a pointer to the empty entry with the lowest physical address.
Some ternary CAM applications, such as packet classification, require the entries to be stored in an order that is data-dependent. For these applications, the user must manually select the physical address at which to store each entry based on the data and the number of entries already stored in the CAM, and thus the NFABUS can be left unconnected.
The search instruction causes every entry in the CAM array to compare its contents with the value in the comparand register. Each entry in the CAM array has a validity bit associated with it, which signifies whether that entry contains valid data. This precludes random data in empty entries from matching during a search.
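The search and validity-bit behavior described above can be modeled in a few lines of Python. This is only a software sketch: the class name `SimpleCam` and its methods are hypothetical, and the loop stands in for a comparison that the hardware performs on all entries in parallel.

```python
class SimpleCam:
    """Toy binary CAM model (hypothetical, for illustration only)."""

    def __init__(self, depth):
        self.data = [0] * depth        # stored words; contents of empty entries are irrelevant
        self.valid = [False] * depth   # one validity bit per entry

    def next_free_address(self):
        """Lowest physical address of an empty entry, or None if the CAM is full."""
        for addr, is_valid in enumerate(self.valid):
            if not is_valid:
                return addr
        return None

    def write(self, addr, word):
        """Store a word and mark the entry valid."""
        self.data[addr] = word
        self.valid[addr] = True

    def search(self, comparand):
        """Return the addresses of all valid entries equal to the comparand."""
        return [addr
                for addr, (word, ok) in enumerate(zip(self.data, self.valid))
                if ok and word == comparand]


cam = SimpleCam(depth=8)
cam.write(cam.next_free_address(), 0xABCD)   # lands at address 0
cam.write(cam.next_free_address(), 0x1234)   # lands at address 1
hits = cam.search(0x1234)                    # [1]
no_hits = cam.search(0)                      # [] because empty entries are not valid
```

Note that the search for 0 returns nothing even though the empty entries happen to hold 0, which is the role the validity bit plays in the text.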
Performance requirements
The first step in designing with a CAM is to determine the peak lookup requirement for the CAM. Simply put, the CAM must be able to handle a number of lookups at least equal to the sum of the data rates on each interface, divided by the minimum packet size, multiplied by the number of lookups per packet. For example, a CAM intended to support two OC-192 ATM interfaces with one virtual path identifier/virtual channel identifier (VPI/VCI) translation per cell has a peak lookup requirement of:
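Peak lookup rate = (2 × 9.953 Gbit/s) ÷ (53 bytes/cell × 8 bits/byte) × 1 lookup/cell ≈ 47 million lookups per second   (1)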
Note that the actual requirement is often lower than this calculation shows, because overhead such as SONET framing has not been accounted for; it should be factored in when calculating the actual peak requirement. Since a CAM can execute pipelined lookups with single-cycle throughput, the CAM would need an operating frequency of at least 47 MHz.
The designer must then account for table maintenance and update operations. However, update operations typically occur on the order of thousands of times per second rather than millions for many applications. Adding margin for one million operations per second (MOPS) - more than enough for most applications - brings the minimum operating frequency up to 48 MHz. Since CAMs are fully static CMOS devices, they can operate at any frequency between the maximum rated operating frequency and DC, giving the designer flexibility to use the clocks available on the board.
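The sizing arithmetic in this section can be checked with a short Python sketch. The function names are hypothetical. The OC-192 line rate (9.95328 Gbit/s) and the 53-byte ATM cell are standard figures rather than values stated in the article, and SONET overhead is ignored, as in the text.

```python
import math

OC192_BPS = 9.95328e9      # OC-192 line rate in bits per second (framing overhead ignored)
ATM_CELL_BITS = 53 * 8     # one ATM cell is 53 bytes


def peak_lookups_per_second(interface_rates_bps, min_packet_bits, lookups_per_packet):
    """Sum of interface rates / minimum packet size * lookups per packet."""
    return sum(interface_rates_bps) / min_packet_bits * lookups_per_packet


def min_clock_mhz(lookups_per_second, update_ops_per_second=1_000_000):
    """Minimum clock in MHz, assuming one CAM operation per clock cycle."""
    return math.ceil((lookups_per_second + update_ops_per_second) / 1e6)


peak = peak_lookups_per_second([OC192_BPS, OC192_BPS], ATM_CELL_BITS, 1)
search_only_mhz = min_clock_mhz(peak, update_ops_per_second=0)   # 47
with_updates_mhz = min_clock_mhz(peak)                           # 48
```

Rounding up to a whole megahertz is a choice made for this sketch; in practice the designer would pick the nearest clock already available on the board.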
Simplifying the interface
Designers who do not need sustained single-cycle performance from their CAMs can reduce system complexity and cost by reading the register contents (including the next free address and the search results) out through the CBUS rather than through the dedicated buses. This reduces the maximum search rate, as new searches cannot be performed during the read cycle and additional cycles are required for bus turnaround. Increasing the system frequency can help offset this performance reduction. This technique is very popular with designers with lookup requirements under 15 to 20 million lookups per second, as it can save as many as 30 to 50 pins on a control ASIC or field programmable gate array (FPGA).
Designers can also realize savings of an additional 32 to 36 pins (depending on the CAM selection) by using only half of the CBUS and using the write-enable control pins to load or read data in 32/36-bit chunks. The two halves of the CBUS should be connected together on the printed circuit board. Since all signals, including the write enables, are synchronous registered inputs, there is no danger of contention due to tri-state overlap.
Figure 2 shows a high-level block diagram of a router or switch line card that requires just 56 pins for the CAM interface. This simplified interface provides nearly 19 million lookups per second on 72-bit-wide words at an operating frequency of 66 MHz.
Direct vs. FPGA interface
Although CAMs have a bus structure incompatible with most microprocessors, certain processors, such as the Motorola PowerQUICC-II MPC8260 communication processor, are able to connect directly to CAMs without an FPGA or ASIC to adapt bus-cycle timing. These processors can be programmed to generate arbitrary bus cycles that are compatible with the CAM interface. The processor's address bus is connected to the instruction bus and the control signals (the word-enable pins), and the data bus is connected to the CBUS (see Figure 3).
Once the memory controller has been configured, the CAM can be treated as a memory-mapped peripheral. The processor reads and writes certain addresses to effect searches, reads, writes, and other CAM operations. In such a configuration, results are read back through the CBUS.
One of the drawbacks of a direct microprocessor interface is that the burden of preserving the order of pipelined search operations falls on the software. If a search operation finishes while other search operations are pending, the code must guarantee that the processor will read the results of the first search before the second search is complete, regardless of the state and load of the processor.
As a result, the software must disable interrupts while the searches are pending, which may be undesirable. To avoid the inconvenience of writing time-critical code, designers using a direct microprocessor interface typically do not pipeline searches unless performance dictates it.
An FPGA interface, on the other hand, easily enables high-performance pipelining. It is the only way to interface to a processor that does not have a programmable-memory bus controller, a category that includes many network processors. In addition, the only way to implement more complex features such as support for multiple threads or multiple processors is with an FPGA or ASIC. Several CAM vendors already provide source code for programmable-logic interfaces to different network processors and microprocessors. System-development time can be reduced dramatically even if the code needs to be modified to include customer-specific functions or intellectual property.
Maximizing CAM usage
Today's CAMs have reached a monolithic density point (4.5 Mbit) that is ideal for many networking applications. Designers can satisfy all of their routing-table and speed requirements with several individual CAM devices. For example, a single 4.5-Mbit ternary CAM can support 96k Layer-3 packet forwarding entries, 16k Layer-2 MAC addresses, and 8k packet classification entries simultaneously.
Even though the different types of lookups have different entry widths, designers can still take full advantage of the density of the CAM device by packing the entries into the table efficiently. This intradevice configuration is accomplished through a combination of two techniques: partitioning and tagging.
Partitioning allows multiple table entries to be stored within a single CAM entry. Each CAM entry is partitioned into multiple fields, and the global masks are used to restrict the search to one or more of these fields. Through efficient use of the CAM's width, this yields tables much deeper than the number of physical entries in the CAM.
Tagging is a technique of appending user-defined bits to the data to qualify searches, thus restricting a search to operate on some subset of the available entries. The tag bits can either specify the type of entry (useful for combining multiple tables within a single device) or specify additional characteristics that should optionally be used during a search (such as restricting route lookups to those learned by a particular routing protocol).
Figure 4 shows an example of Layer-2 and Layer-3 routing tables aggregated into a single CAM array. The array has been partitioned, allowing two IP addresses to be stored in every CAM entry. Three tag bits have been used: one specifies the type of the entry (Layer 2 or Layer 3), while the other two are used as additional validity bits. During a search, the user must specify which table to search by setting the type tag bit appropriately. The additional validity bits prevent a search from matching random data in an otherwise empty location.
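A rough Python model of partitioning and tagging follows. The bit layout is an assumption made for this sketch and is not taken from Figure 4: two 32-bit IP fields in bits 0 to 63, one extra validity bit per field, and one type bit. All names are hypothetical, and the global mask here uses 1 to mean "compare this bit", a convention chosen only for readability.

```python
import ipaddress

IP_MASK = (1 << 32) - 1
LO_SHIFT, HI_SHIFT = 0, 32                               # two IP fields per CAM entry
VALID_LO, VALID_HI, TYPE = 1 << 64, 1 << 65, 1 << 66     # three tag bits (assumed layout)
L2, L3 = 0, 1                                            # values of the type tag


def ip(text):
    return int(ipaddress.IPv4Address(text))


def pack_l3(ip_hi=None, ip_lo=None):
    """Build a Layer-3 entry holding up to two IP addresses."""
    word = L3 * TYPE
    if ip_hi is not None:
        word |= VALID_HI | (ip(ip_hi) << HI_SHIFT)
    if ip_lo is not None:
        word |= VALID_LO | (ip(ip_lo) << LO_SHIFT)
    return word


def pack_l2(mac):
    """Build a Layer-2 entry holding a 48-bit MAC address in the low bits."""
    return L2 * TYPE | VALID_LO | mac


def search_l3(entries, address):
    """Search each IP field in turn; returns (entry address, field) pairs."""
    hits = []
    for shift, valid in ((HI_SHIFT, VALID_HI), (LO_SHIFT, VALID_LO)):
        care = TYPE | valid | (IP_MASK << shift)          # global mask for this pass
        key = L3 * TYPE | valid | (ip(address) << shift)
        hits += [(addr, shift // 32)
                 for addr, word in enumerate(entries)
                 if ((word ^ key) & care) == 0]
    return hits


entries = [
    pack_l3("10.0.0.1", "10.0.0.2"),   # both fields in use
    pack_l3(ip_hi="10.0.0.3"),         # low field empty
    pack_l2(0x00AA0A000002),           # MAC whose low 32 bits equal 10.0.0.2
]
found = search_l3(entries, "10.0.0.2")     # [(0, 0)]: the Layer-2 entry is excluded by its type tag
empty = search_l3(entries, "0.0.0.0")      # []: the empty low field is excluded by its validity bit
```

In this model a lookup costs one search per field, so the depth gained by partitioning is traded against search cycles.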
Wildcarding
True ternary CAMs give the user the ability to mask entries on a per-bit basis. This ability provides an efficient way to represent ranges of values in a single entry. When properly prioritized, these ranges can be grouped with other ranges to provide single-cycle router lookups.
IP addresses are the most commonly ranged values in ternary CAMs. Since these values typically fall on binary boundaries, a range of IP addresses representing an entire subnet or an aggregation of routes can be represented in a single CAM entry by using the local masks. For example, Table 1 shows that a CAM word of 193.177.2.99 used in conjunction with a local mask word of 0.0.0.255 (a 1 corresponds to a masked bit) will match any IP address in the subnet 193.177.2/24.
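The subnet example can be expressed as a one-line comparison. The sketch below, with hypothetical helper names, uses the CAM word and local mask from the text and treats a mask bit of 1 as "don't care".

```python
import ipaddress


def ip(text):
    return int(ipaddress.IPv4Address(text))


def ternary_match(key, word, local_mask):
    """True if key equals word on every bit that is not masked (mask bit 1 = don't care)."""
    return ((key ^ word) & ~local_mask) == 0


word = ip("193.177.2.99")
mask = ip("0.0.0.255")

in_subnet = ternary_match(ip("193.177.2.7"), word, mask)     # True
out_subnet = ternary_match(ip("193.177.3.7"), word, mask)    # False
```

Because the low eight bits are masked, the host part of the stored word (99) plays no role in the comparison.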
Other types of lookups may incorporate ranges that do not fall on binary boundaries. For example, a user may want to specify access-control list (ACL) entries that deny or allow access to a particular range of TCP ports. These ranges may not fall on binary boundaries, as port numbers have little to no correspondence with the type of service offered on that port. The wildcarding capability of ternary CAMs allows the user to generate an expression that represents the entire range in a compact form. Table 2 shows how a 38-port range can be expressed with just five table entries.
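One common way to turn an arbitrary range into ternary entries is to split it into aligned power-of-two blocks, each of which needs a single value/mask pair. The sketch below illustrates that technique; it is not necessarily how Table 2 was derived, and the range 6 to 43 is a hypothetical 38-port range chosen for illustration, not the range in Table 2.

```python
def range_to_ternary(lo, hi, width=16):
    """Cover the inclusive range [lo, hi] with (value, mask) pairs; mask bit 1 = don't care."""
    entries = []
    while lo <= hi:
        # Largest power-of-two block that is aligned at lo ...
        size = lo & -lo if lo else 1 << width
        # ... shrunk until it no longer runs past hi.
        while size > hi - lo + 1:
            size >>= 1
        entries.append((lo, size - 1))
        lo += size
    return entries


def pattern(value, mask, width=16):
    """Render an entry as a bit string, with 'x' for masked bits."""
    return "".join("x" if (mask >> b) & 1 else str((value >> b) & 1)
                   for b in reversed(range(width)))


entries = range_to_ternary(6, 43)
# [(6, 1), (8, 7), (16, 15), (32, 7), (40, 3)]: five entries for 38 ports
first = pattern(*entries[0])   # '000000000000011x'
```

The number of entries depends on where the range boundaries fall, so other 38-port ranges may need more or fewer than five.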
Designers can also maintain complex routing tables that offer comparisons of multiple fields in a single clock cycle by taking advantage of the inherent parallelism of a CAM lookup. Because multiple entries may match the search key, the designer must be aware of the order imposed by the multiple-match resolution scheme of the CAM. Whenever multiple entries match, the CAM returns the index of the matching entry whose physical address is closest to 0. Thus, by placing more specific table entries at lower physical addresses than less specific ones, so that the more specific entries take priority, multiple branches of a decision tree may be collapsed into a single lookup.
Table 3 is an example of an ACL lookup table. Table entries at indexes 991, 9012, and 32767 match the incoming key. However, index 991 is the highest-priority match because it has the lowest physical address among the matching entries.
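The multiple-match resolution rule can be sketched as follows. The indexes 991, 9012, and 32767 come from the text, but the entry contents and actions are hypothetical and do not reproduce Table 3. A mask bit of 1 again means "don't care".

```python
import ipaddress


def ip(text):
    return int(ipaddress.IPv4Address(text))


def first_match(table, key):
    """Return (address, action) of the matching entry with the lowest physical address.

    table maps physical address -> (value, mask, action). Returns None if nothing matches.
    """
    for addr in sorted(table):
        value, mask, action = table[addr]
        if ((key ^ value) & ~mask) == 0:
            return addr, action
    return None


# Hypothetical ACL: most specific entry at the lowest address, catch-all at the highest.
acl = {
    991:   (ip("193.177.2.99"), ip("0.0.0.0"),         "deny"),     # single host
    9012:  (ip("193.177.2.0"),  ip("0.0.0.255"),       "permit"),   # whole /24 subnet
    32767: (ip("0.0.0.0"),      ip("255.255.255.255"), "deny"),     # default rule
}

host = first_match(acl, ip("193.177.2.99"))     # (991, 'deny'): all three entries match
subnet = first_match(acl, ip("193.177.2.5"))    # (9012, 'permit')
other = first_match(acl, ip("10.0.0.1"))        # (32767, 'deny')
```

Swapping the first two entries would make the subnet rule shadow the host rule, which is why the ordering of entries matters.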
Future CAMs
CAMs are already required for high-end systems. Because they have a broad range of applications and can be tailored to each one, they will become ubiquitous in networking equipment designs within a few years. The CAMs of the future will add features that extend this flexibility and will also scale to new densities, widths, and frequencies. Driven by end-user requirements for new networking equipment features that require table-based lookups, these new CAMs will enable even deeper single-cycle analysis of packet headers. Because of the added capabilities beyond what is found in standard CAMs, these devices are beginning to be known as silicon search engines. With the application of the aforementioned techniques, a single-CAM chip solution may soon be all that is needed to handle the lookup requirements for many designs.