Architecture and design of function-specific, wire-speed routers for optical internetworking
The requirements of tomorrow's networks will affect system design, particularly partitioning.
PARAMESH GOPI, Entridia Corp.
Internet traffic has grown dramatically over the last 24 months, a trend that is expected to continue over the next 24 months. A key element driving that growth is the Internet service model. While voice and video are seen as emerging real-time services, secure commerce transactions and virtual private networks are already being widely deployed and are transforming the underlying mechanics of business.
As with any disruptive technology, the Internet brings with it a significant technology challenge: scaling the network infrastructure. The Internet is knit together by routers, which are considered the "neurons" of the network; they seamlessly integrate large subnetworks using Internet protocol (IP). A network and system-level view of high-performance routers will expose the reader to design tradeoffs and approaches to solving the problem of delivering wire-speed quality of service (QoS) scalable to wavelength-division multiplexing rates.
A snapshot of the Internet backbone traffic profile is shown in Figure 1. The chart shows that the distribution of packet sizes is almost trimodal, with 30% of the packets falling into the 40-byte category. These small packets are usually TCP acknowledgement messages and are appropriately considered an important corner case in the design of high-performance routers. It is quite conceivable that a traffic flow can consist of back-to-back minimum-sized packets sustained over many seconds and can thus put a considerable amount of "strain" on a router.
Future extrapolation of the traffic model can be achieved by combining the forecasted increase in traffic with the trimodal packet-size distribution. In the simplest case, we can expect an overall increase in the volume of acknowledgement traffic (minimum-sized packets). In the more realistic case, we must overlay best-effort and mission-critical traffic along with streaming services and consequently need to add a third dimension to the mix of traffic: the type of service.
Meanwhile, with the move toward a converged IP-based infrastructure, carriers require intelligent devices at network aggregation. With delay-sensitive streaming content being injected into the network, line-rate performance is a non-negotiable criterion. Additionally, requirements for ultra-high connection densities, as well as highly efficient, low-power, space-conscious equipment, have led to a new metric to qualify carrier gear: routed gigabits per (power unit × rack space unit × $).
To grasp the interplay of these trends with system design, some fundamental concepts must be understood. For example, wire speed (or line rate) refers to the maximum rate at which any physical medium can sustain information transfer. The key variables that determine wire rate are the number of bits per second that the physical medium is capable of transporting and the size of the minimum quantum of information (packet or cell). Thus, a link capable of supporting 2.488 Gbits/sec (OC-48c) that carries 40-byte (320-bit) packets with no interpacket gap or overhead bits sees one packet arrive roughly every 129 nsec.
Wire-speed or line-rate processing requires that operations be performed on a per-packet basis at the maximum packet arrival rate (every 129 nsec for the aforementioned case). Guaranteeing line-rate packet processing and forwarding performance has numerous positive side effects in the QoS domain.
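The per-packet time budget can be reproduced with a few lines of arithmetic. The sketch below assumes the nominal OC-48 rate of 2.48832 Gbits/sec (the figure that yields roughly 129 nsec) and, like the text, ignores framing overhead; the function and constant names are illustrative only.

```python
def packet_interval_ns(link_bps: float, packet_bytes: int) -> float:
    """Time between back-to-back packets on a fully loaded link, in nanoseconds."""
    bits_per_packet = packet_bytes * 8
    return bits_per_packet / link_bps * 1e9


# Assumption: nominal OC-48 line rate, with no interpacket gap or overhead bits.
OC48C_BPS = 2.48832e9

interval = packet_interval_ns(OC48C_BPS, 40)
packets_per_second = 1e9 / interval

print(f"{interval:.1f} ns per 40-byte packet")          # 128.6 ns
print(f"{packets_per_second / 1e6:.2f} million packets/sec")  # 7.78 million
```

Every lookup, classification, and queuing decision for a packet has to fit inside that interval if the device is to keep up with a sustained burst of minimum-sized packets.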
Meanwhile, the concept of routing focuses on using network-layer information to forward packets. The basic network-layer functions (OSI Layer 3 and 4) consist of the following:
- Route processing. Where is the packet destined to arrive?
- Flow processing. Stateful information that categorizes a packet or group of packets that belong to an information session.
The action of determining the destination of a packet based on data embedded within it is termed route processing. IPv4 networks use classless interdomain routing (CIDR), which was instituted by the Internet Engineering Task Force (IETF) in 1993 to optimize the use of available address space. The basic principles of CIDR involve the segmentation of the Internet into a hierarchical, logically addressable group of subnetworks. Consequently, each router is required to keep track of only the paths that are directly accessible via its network interfaces. CIDR's logical addressing scheme requires a "longest network prefix match" operation, in which the prefix length is set by a mask on a 32-bit IPv4 address. CIDR route lookups are not direct table matches and thus become quite complex with large tables.
The complexity of a CIDR route lookup dramatically changes with the total number of routes in a route table. The nested nature of the addressing scheme causes a logarithmic change in lookup time with increased table size. Wire-speed algorithmic CIDR route lookup is nontrivial, since it involves translating an algorithm into hardware (such as an application-specific integrated circuit) and ensuring that it provides deterministic convergence under worst-case traffic conditions. A second challenge is to keep the jitter (i.e., the variation in algorithm convergence timings) bounded so as to limit latency within the network.
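To make the longest-prefix-match operation concrete, here is a minimal software sketch using a plain binary trie, one of several possible techniques and not necessarily what a hardware implementation would use. The class name, the interface labels, and the sample routes are all hypothetical. The walk visits at most 32 nodes for an IPv4 address, which is the kind of worst-case bound a deterministic lookup needs.

```python
import ipaddress


class PrefixTrie:
    """Binary trie keyed on IPv4 prefix bits; one level per bit."""

    def __init__(self):
        # Each node is a dict with optional children "0"/"1" and an optional "hop".
        self.root = {}

    def insert(self, cidr: str, next_hop: str) -> None:
        net = ipaddress.ip_network(cidr)
        bits = format(int(net.network_address), "032b")[: net.prefixlen]
        node = self.root
        for b in bits:
            node = node.setdefault(b, {})
        node["hop"] = next_hop

    def lookup(self, addr: str):
        """Return the next hop of the longest matching prefix, or None."""
        bits = format(int(ipaddress.ip_address(addr)), "032b")
        node = self.root
        best = node.get("hop")          # a default route (/0) lives at the root
        for b in bits:
            node = node.get(b)
            if node is None:
                break
            best = node.get("hop", best)  # remember the deepest match so far
        return best


table = PrefixTrie()
table.insert("0.0.0.0/0", "if-default")
table.insert("10.0.0.0/8", "if-A")
table.insert("10.1.0.0/16", "if-B")

print(table.lookup("10.1.2.3"))    # if-B (the /16 is longer than the /8)
print(table.lookup("10.2.3.4"))    # if-A
print(table.lookup("192.0.2.1"))   # if-default
```

Note that nested prefixes (the /16 inside the /8) are what prevent a simple exact-match table from working.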
Packet classification is the key element of flow processing. Packets may be classified based on a parameterized set of metrics that may involve multifield packet header analysis. The parameters are usually specified by a user in conjunction with resource information that may be derived from routing protocols.
Flows in connectionless networks are determined by grouping packets that have common application-layer or session-layer information. A flow can be based on information transacted between a particular source and destination IP address or a TCP/UDP socket. Flows can also be based on DiffServ code points or type of service bits. Fundamentally, classification of like packets based upon information contained within each of them constitutes a flow.
DiffServ, or "Differentiated Services," deserves further explanation. DiffServ results from IETF initiatives to specify a means of providing end-to-end QoS in a connectionless packet-based network. The IPv4 packet header contains a type-of-service byte, which DiffServ redefines so that its upper 6 bits form a code point field offering 64 possible values for marking packets to denote various levels of service. These DiffServ labels may be generated from source nodes in the network and may be altered by intermediate routers to shape network traffic. DiffServ is meant to provide a granular means of differentiating classes of service at the network edge.
Figure 2. The Internet can be seen as having four components, each carrying its own processing requirements.
As mentioned earlier, a flow can be identified by various parameters, including DiffServ labels and application (TCP/UDP) information. Edge flows with granularity are termed microflows. An example of a microflow would be the classification of all packets of a certain TCP/UDP socket that originate from a particular IP address, or all RTP traffic destined for a certain IP address. Once a packet has been classified at the network edge and has been identified with a particular flow, it is forwarded out of the particular routing device onto the next level of aggregation within the network.
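A microflow classifier of the kind described above can be sketched as a list of multifield rules checked in order. Everything here is illustrative: the `Packet` and `Rule` types, the field names, the rule names, and the addresses are assumptions, and a wire-speed device would implement the match in hardware rather than with a linear scan.

```python
from typing import NamedTuple, Optional


class Packet(NamedTuple):
    src_ip: str
    dst_ip: str
    protocol: str   # "tcp" or "udp"
    src_port: int
    dst_port: int
    dscp: int


class Rule(NamedTuple):
    name: str
    src_ip: Optional[str] = None     # None acts as a wildcard
    dst_ip: Optional[str] = None
    protocol: Optional[str] = None
    dst_port: Optional[int] = None


def matches(rule: Rule, pkt: Packet) -> bool:
    """True when every non-wildcard field of the rule equals the packet's field."""
    for field in ("src_ip", "dst_ip", "protocol", "dst_port"):
        wanted = getattr(rule, field)
        if wanted is not None and wanted != getattr(pkt, field):
            return False
    return True


def classify(rules, pkt, default="best-effort"):
    """Return the name of the first matching rule (first match wins)."""
    for rule in rules:
        if matches(rule, pkt):
            return rule.name
    return default


rules = [
    Rule("web-from-host", src_ip="192.0.2.10", protocol="tcp", dst_port=80),
    Rule("to-media-server", dst_ip="198.51.100.5", protocol="udp"),
]

pkt = Packet("192.0.2.10", "203.0.113.7", "tcp", 40000, 80, 0)
print(classify(rules, pkt))   # web-from-host
```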
Figure 2 shows a multi-edge Internet model that illustrates the various levels of aggregation occurring at different points in the network and the rough route and flow metrics at these points. It is important to recognize the inefficiency of multiple examinations of the same flow of packets at various aggregation points. In fact, as we approach the backbone, data pipes get larger and packet arrival rates increase, making it impractical to perform deep-packet examination within the core. Additionally, the core may be operating on a different link-layer protocol such as Asynchronous Transfer Mode (ATM).
Figure 3. A typical router contains three major pieces of hardware: network interface cards, switching fabric, and protocol processor.
Enter macroflows. A macroflow consists of a logical grouping of similar microflows. For instance, all packets entering a backbone or core device that have similar microflow information (e.g., DiffServ labels) may be grouped into a macroflow and can be metered, policed, and engineered efficiently. The concept of hierarchy in flow management and QoS classification has led to the use of Multiprotocol Label Switching (MPLS) as a means to manage and engineer macroflows.
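As a rough illustration of the grouping step, the sketch below folds a set of microflow records into macroflows keyed by their DiffServ code point. The record layout, the sample values, and the function name are assumptions made for the example; a real device would meter and police each macroflow rather than just count bytes.

```python
from collections import defaultdict


def aggregate_by_dscp(microflows):
    """Group microflow byte counts into macroflows keyed by DiffServ code point."""
    macroflows = defaultdict(lambda: {"flows": 0, "bytes": 0})
    for flow in microflows:
        entry = macroflows[flow["dscp"]]
        entry["flows"] += 1
        entry["bytes"] += flow["bytes"]
    return dict(macroflows)


# Hypothetical microflows already classified at the edge.
edge_flows = [
    {"src": "192.0.2.10", "dst": "198.51.100.5", "dscp": 46, "bytes": 12_000},
    {"src": "192.0.2.11", "dst": "198.51.100.5", "dscp": 46, "bytes": 8_000},
    {"src": "192.0.2.12", "dst": "203.0.113.9", "dscp": 0, "bytes": 150_000},
]

for dscp, totals in sorted(aggregate_by_dscp(edge_flows).items()):
    print(dscp, totals)
```

The core device then handles two macroflows instead of three microflows; at backbone scale the same reduction is what makes per-class metering tractable.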
MPLS was initially conceived to be a mechanism that unified the IP and ATM domains at the Internet core. It has, however, also become a powerful traffic-engineering tool. At the simplest level, MPLS allows core traffic to be engineered at either a circuit level via an ATM switch or at a packet level. The actual physical tag may denote an ATM virtual channel that has prescribed traffic behavior, or it may be used as a way to abstract Layer 3 microflow information and engineer macroflows at the core.
Path discovery involves the use of routing protocols. Routing protocols such as RIP, OSPF, or BGP-n are inter-router information-exchange mechanisms that build and maintain packet-forwarding tables used by the packet-forwarding blocks to physically route traffic and by policy and flow software to maintain and update flow tables. These protocols include algorithms that use value metrics based on a variety of parameters. An example is a network distance-vector metric, i.e., the closest network entity that has a path to the final destination. Other metrics used to build tables include latency and reliability.
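The distance-vector idea can be sketched in a few lines: a router adds its own link cost to each distance a neighbor advertises and keeps the cheaper path. This is a simplified illustration in the spirit of RIP, not an implementation of any specific protocol; it omits timers, split horizon, and route withdrawal, and all names are hypothetical.

```python
def merge_advertisement(table, neighbor, link_cost, advertised):
    """Apply one distance-vector update from a neighbor.

    table:      {destination: (cost, next_hop)}, modified in place
    advertised: {destination: cost} as reported by the neighbor
    Returns True if any route changed.
    """
    changed = False
    for dest, cost in advertised.items():
        candidate = link_cost + cost
        current_cost, current_hop = table.get(dest, (float("inf"), None))
        # Accept a cheaper path, or any cost change reported by the current next hop.
        if candidate < current_cost or (
            current_hop == neighbor and candidate != current_cost
        ):
            table[dest] = (candidate, neighbor)
            changed = True
    return changed


routes = {"net-A": (0, "local")}
merge_advertisement(routes, "R2", 1, {"net-A": 5, "net-B": 2})
merge_advertisement(routes, "R3", 1, {"net-B": 1})
print(routes)   # {'net-A': (0, 'local'), 'net-B': (2, 'R3')}
```

The resulting table is what the control plane would push down to the packet-forwarding blocks.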
In an IP network, the network-layer functions drive the QoS assigned to various types of traffic. QoS is applied via traffic engineering, which involves three distinct mechanisms:
- Admission control. This mechanism acts on incoming traffic that has been categorized by the network layer to ensure that all flows of information meet predetermined profiles (arrival rates), which in turn are determined by service-level agreements.
- Traffic shaping and bandwidth management. In this case, flows and other related parameters are used to determine when and at what rates various types of packets egress the system. Queuing becomes an essential part of the shaping and bandwidth management.
- Congestion control. All network devices are expected to experience congestion. While QoS is generally thought of in terms of prioritizing outgoing traffic, the avoidance of congestion is a key mechanism often sidelined or forgotten. Large, time-varying traffic patterns coupled with service overlays on the infrastructure could potentially cause network outages. Controlling congestion involves statistical coloring of traffic based on network and application layer information. Usually, processes such as random early detection (RED) monitor the state of various queues within the system and start to drop packets based on their capacity. It is important to note that drop processes such as RED can be modulated by weights that are user-supplied.
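Two of these mechanisms lend themselves to a short sketch. The first class below is a token-bucket policer, a common way (assumed here, not named in the text) to check arriving traffic against a contracted rate. The functions after it show a simplified RED drop curve with a weighted queue average; the thresholds, weight, and names are illustrative, and the count-based probability correction of full RED is omitted.

```python
import random


class TokenBucket:
    """Admission-control policer: a packet conforms while tokens are available."""

    def __init__(self, rate_bytes_per_s: float, burst_bytes: float):
        self.rate = rate_bytes_per_s
        self.burst = burst_bytes
        self.tokens = burst_bytes
        self.last = 0.0

    def conforms(self, now: float, size_bytes: int) -> bool:
        # Refill for the elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if size_bytes <= self.tokens:
            self.tokens -= size_bytes
            return True
        return False


def update_average(avg: float, queue_len: float, weight: float) -> float:
    """Exponentially weighted moving average of queue depth (weight is user-supplied)."""
    return (1.0 - weight) * avg + weight * queue_len


def red_drop_probability(avg: float, min_th: float, max_th: float, max_p: float) -> float:
    """No drops below min_th, a linear ramp up to max_p, and drop everything at max_th."""
    if avg < min_th:
        return 0.0
    if avg >= max_th:
        return 1.0
    return max_p * (avg - min_th) / (max_th - min_th)


def red_should_drop(avg, min_th, max_th, max_p, rng=random.random) -> bool:
    """Randomized drop decision for one arriving packet."""
    return rng() < red_drop_probability(avg, min_th, max_th, max_p)


policer = TokenBucket(rate_bytes_per_s=1000, burst_bytes=1500)
print(policer.conforms(0.0, 1500))   # True: the burst allowance covers it
print(policer.conforms(0.0, 40))     # False: bucket is empty
print(policer.conforms(1.0, 40))     # True: one second of refill
```

Dropping probabilistically before the queue is full is what lets RED signal senders early instead of waiting for tail drops.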
Fundamentally, a high-performance router can be subdivided into two pieces: the routing/path-discovery plane and the packet-forwarding plane. These two distinct pieces are subject to various levels of implementation and partitioning depending on the router's position within the network. Figure 3 illustrates the typical architecture of a high-performance router.
Figure 4. In a centralized packet-forwarding architecture, the NICs provide little more than interface capabilities, while the switching fabric performs the lion's share of the QoS/traffic-engineering functions.
It is extremely important to note that the path-discovery process time constant is on the order of 10 to 100 msec, while that of the packet-forwarding process scales with line rates (one packet every 129 nsec at OC-48c). The large time-constant difference between these processes presents a logical opportunity for first-order partitioning: separation of the packet classification and forwarding paths from the routing control path. Subsequent architectural decisions involve further partitioning of the packet classification/forwarding paths.
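The size of that gap is easy to quantify. The short calculation below uses only the figures quoted in the text (10 to 100 msec for path discovery, 129 nsec per packet at OC-48c); the function name is illustrative.

```python
def packets_per_interval(interval_s: float, packet_time_s: float = 129e-9) -> float:
    """How many minimum-sized packets can arrive during one control-plane interval."""
    return interval_s / packet_time_s


for interval_s in (10e-3, 100e-3):
    count = packets_per_interval(interval_s)
    print(f"{interval_s * 1e3:.0f} msec of path discovery spans about {count:,.0f} packet arrivals")
```

With four to six orders of magnitude between the two time scales, the forwarding path cannot wait on the routing software for any per-packet decision.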
There are two broad methods generally followed: centralized packet forwarding and distributed packet forwarding. Key factors that drive the choice of approach include scalability, protocol support, and power.
Figure 5. The distributed packet-forwarding architecture puts more responsibility and more intelligence in the network interface cards. This architecture tends to be more scalable than the centralized architecture.
The basic concept of centralized packet forwarding is illustrated in Figure 4. Network interface cards (NICs) comprise physical-media-dependent (PMD) and link-layer functions (e.g., framing or segmentation and reassembly) along with limited Layer 3 packet processing and, in the simplest case, interface directly to a switch fabric. The switch fabric becomes the crucial element in this architecture, since it handles the bulk of the network-layer-driven QoS/traffic-engineering functions as well as basic packet transport across line cards.
Figure 6. A router aimed at 40-Gbit/sec applications in growing networks should use a distributed packet-forwarding architecture, with several of the processing functions residing in a single application-specific integrated circuit on the network interface card.
The basic concept of distributed packet forwarding is illustrated in Figure 5. NICs essentially assume the role of a full router. They comprise all the hardware, including the PMD and link layer, but are also fully equipped with packet-processing functionality as well as local QoS and traffic engineering. The switch fabric is purely optimized for nonblocking transport of packets across line cards and does not include any sophisticated QoS/traffic-engineering functions.
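The division of labor in the distributed model can be sketched as two small classes: a line card that holds its own forwarding table and decides locally, and a fabric that only moves packets between cards. This is a toy model under stated assumptions; the class names, the dictionary-based packet, and the exact-match table (standing in for a longest-prefix lookup) are all hypothetical.

```python
class SwitchFabric:
    """Pure transport: delivers a packet to the egress card, with no lookup or QoS logic."""

    def __init__(self):
        self.cards = {}

    def attach(self, card):
        self.cards[card.slot] = card

    def transfer(self, egress_slot, packet):
        self.cards[egress_slot].transmit(packet)


class LineCard:
    """Each card keeps a local copy of the forwarding table and decides on its own."""

    def __init__(self, slot, fabric):
        self.slot = slot
        self.fabric = fabric
        self.fib = {}     # destination -> egress slot, pushed down by the control plane
        self.sent = []    # packets handed to this card's outgoing interface
        fabric.attach(self)

    def install_routes(self, routes):
        self.fib = dict(routes)

    def receive(self, packet):
        # Local forwarding decision; exact match keeps the sketch short.
        egress_slot = self.fib[packet["dest"]]
        self.fabric.transfer(egress_slot, packet)

    def transmit(self, packet):
        self.sent.append(packet)


fabric = SwitchFabric()
card_a = LineCard(0, fabric)
card_b = LineCard(1, fabric)

# The control plane distributes the same table to every card.
for card in (card_a, card_b):
    card.install_routes({"net-A": 0, "net-B": 1})

card_a.receive({"dest": "net-B", "payload": "hello"})
print(card_b.sent)   # [{'dest': 'net-B', 'payload': 'hello'}]
```

Because the lookup happens on the ingress card, adding cards adds forwarding capacity without changing the fabric's job.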
A summary of the differences between centralized and distributed packet forwarding, as well as the implications of these differences, appears in Table 1.
Let's look at relevant requirements and explore "first-cut" architectural partitioning for a backbone routing device capable of line-rate performance at 40 Gbits/sec. Specifications for the functional requirements at a system level are contained in Table 2. Thus, most of the discussion within this section explores architectural and component-level implications of these requirements.
As stated previously, scalability, performance, and power requirements drive system partitioning. A distributed packet-forwarding architecture clearly lends itself to a more scalable system. The distributed architecture allows for building-out of a maximum capacity chassis and backplane and de-couples the scaling of the network layer from the switch fabric. It's possible to take a "divide and conquer" approach to solving system performance and scaling issues by:
- Reducing the complexity of the switch fabric and making it an ultra-fast, highly integrated, dedicated data-transport layer.
- Building a chassis with an optical backplane (fiber) that can scale as high as 20 Gbits/sec per link.
- Decoupling the network layer from the switch fabric, allowing maximum flexibility in scaling each independently.
The power and space requirements dictate a minimum number of components on each line card. Power constraints dictate efficient utilization of silicon. Generalized network-processing components may not be able to provide the optimal power/functionality point required for function-specific systems. The solution requires function-specific ASICs that efficiently implement network-layer and QoS/traffic-engineering functions.
It is important to note that the tradeoffs associated with distributed routing include complexity in the control and management planes. The maintenance and updating of distributed packet-forwarding intelligence may necessitate a more powerful processor and will result in more complex protocol structures. That, however, is a small price to pay for scalable redundant line-rate performance.
As mentioned earlier, the clear separation of the software routing-control plane from the packet-forwarding path is key to the realization of line-rate performance. Figure 6 indicates the architecture of a typical line card and switch-fabric card for the proposed system and delineates relevant component-level requirements for optimal realization.
It is clear that Internet growth coupled with service-laden traffic requires a new breed of IP-specific routers. These backbone devices require ultra-high connection densities, low power, and small form factors (SFFs). Additionally, they require user-configurable, sophisticated traffic-engineering mechanisms that are MPLS- and DiffServ-based, along with tightly controlled, deterministic packet-forwarding performance.
Meeting the basic requirements for a scalable Internet backbone device means separating the control plane from the packet-forwarding plane and distributing the packet-forwarding function on a per-line-card basis. This approach allows the switch fabric to scale at the same pace as the optical transport layer, enabling an ultra-high performance architecture that is limited only by the fundamental physical data-transport layer.
Paramesh Gopi is co-founder and vice president of marketing at Entridia Corp. (Irvine, CA).