Resilient IP control plane: Path to network convergence
Service providers have long realized that the real costs of providing services are operational. Yet many providers have not integrated their voice and data services onto a single network because of cost and the lack of proven technologies. So why is now any different?
For starters, many service providers' revenues are shrinking, partly because enterprise customers are seeking inexpensive transport over IP, including voice services. These pressures in the wake of a prolonged economic downturn have created a situation in which service providers can become profitable only if they lower their operating expenditures. As a result, service providers are forced to spend money to get "over the hump" to a converged network.
Secondly, the migration of services to an IP/MPLS backbone is quickly becoming a reality. IP networks are already deployed, global, and capable of supporting the largest capacities offered. Obstacles remain, however. To enable convergence, IP/MPLS networks must support the requirements of the services migrated from circuit-based networks. While most requirements are easily met, control-plane reliability remains an issue.
Traditionally, the IP control plane, consisting of routing and signaling protocols, has been developed to be "good enough" for the traffic it supports. Continuous services and high reliability have historically taken a backseat to providing global reach and connectivity. This approach is sufficient for most Web traffic, email, and less-critical services that can tolerate temporary disruptions, routing loops, and instabilities. Many services now migrated to IP/MPLS, however, cannot tolerate these temporary disruptions.
Circuit-based services require exceptional control-plane reliability. The main reason is that circuit-based services are generally thought of as provisioned services: Once a circuit is established, disruptions in the control plane do not affect the data-forwarding plane. With IP networks, however, proper operating behavior during a disruption in the routing protocol is to change the data-forwarding behavior. This behavior has always been tolerated because current Internet services can adapt to it. Industry analyst RHK refers to some of these disruptions as "microbursts" and estimates that for real-time applications microbursts can cost service providers up to $230,000 per minute in traffic disruption.*
To adequately support circuit-based services, IP control planes need to evolve. Improving the IP control plane so that it is "good enough" for converged services requires addressing the following deficiencies:
- Long routing convergence time. Internet paths through the network are calculated by running a shortest-path algorithm over the network topology, which is learned through the routing protocols. After any failure or change in topology, it takes some time to reconverge onto the next-best path. During this time, service is disrupted. Internet routing convergence is considered excessively slow compared to other technologies.
- Routing stability. Routing-protocol sessions between routers are used to exchange routes and topology information. If routing-protocol sessions are disrupted, which is fairly common during maintenance and failures, data forwarding is disrupted until the sessions can be reestablished. With some Border Gateway Protocol (BGP) sessions taking five to 10 minutes to resynchronize, and the effects of route-flap propagation, it is extremely important to keep the sessions up and synchronized.
- Routing scalability. Routing vendors have traditionally designed their routing capacity to stay ahead of the growing number of global Internet routes. So far, vendors have easily been able to support the current capacity of about 130,000 routes. However, many service providers are looking at Request for Comments (RFC) 2547 Layer 3 virtual private networks, BGP-based discovery of Layer 2 tunnels, and virtual-private-LAN service technologies to provide network convergence. The result is explosive growth in the routing capacity needed by core routers. Some estimates of RFC 2547 routes run as high as 10 million.
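To make the convergence problem concrete, the following is a minimal Python sketch of a shortest-path recalculation after a link failure. The four-node topology, the link costs, and the function names are illustrative assumptions, not taken from any particular router implementation; the algorithm used is plain Dijkstra.

```python
import heapq

def shortest_paths(graph, source):
    """Dijkstra: return {node: (cost, first_hop)} for every reachable node."""
    best = {}
    heap = [(0, source, None)]
    while heap:
        cost, node, first_hop = heapq.heappop(heap)
        if node in best:
            continue  # already settled with a lower or equal cost
        best[node] = (cost, first_hop)
        for neighbor, weight in graph.get(node, {}).items():
            if neighbor not in best:
                # The first hop is fixed when leaving the source.
                hop = neighbor if node == source else first_hop
                heapq.heappush(heap, (cost + weight, neighbor, hop))
    return best

def fail_link(graph, a, b):
    """Return a copy of the topology with the a-b link removed in both directions."""
    return {n: {m: w for m, w in nbrs.items() if {n, m} != {a, b}}
            for n, nbrs in graph.items()}

# Hypothetical topology: node -> {neighbor: link cost}
topology = {
    "A": {"B": 1, "C": 5},
    "B": {"A": 1, "D": 1},
    "C": {"A": 5, "D": 1},
    "D": {"B": 1, "C": 1},
}

before = shortest_paths(topology, "A")
after = shortest_paths(fail_link(topology, "A", "B"), "A")
print("to D before failure:", before["D"])
print("to D after failure: ", after["D"])
```

In this toy network, traffic from A to D shifts from the path through B to the path through C once the A-B link is removed. In a real network, every router must learn of the change and rerun this calculation before forwarding is consistent again, and that interval is the convergence time described above.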
The Internet control plane always adapts to the demands placed on it—and there's no reason to think it is incapable of evolving to support the requirements of converged services. Several emerging technologies address the need for a more reliable, scalable, and resilient IP control plane.
Perhaps the most significant technology development is the advent of "stateful" protected-protocol sessions. With protocol protection, routing-protocol sessions are maintained during routing maintenance and unexpected failures. As a result of persistent routing-protocol sessions, data forwarding is never interrupted, thus providing greater routing stability and eliminating the need for reconvergence of the routing protocols. While the benefits seem obvious, the difficulty of recovering the state of the routing protocols, as well as that of the underlying Transmission Control Protocol (TCP) sessions, has been a major obstacle, and the effort needed to overcome it could not previously be justified. With increased demand for more reliable routing, scalable and deployable implementations of this technology are starting to appear.
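The idea can be illustrated with a deliberately simplified Python sketch. It assumes a router with an active and a standby route processor, where the active side copies its session state to the standby after every update. The class names, the single `tcp_seq` counter standing in for transport state, and the checkpoint-on-every-update policy are all hypothetical simplifications, not a description of any vendor's design.

```python
import copy

class RoutingSession:
    """Hypothetical per-peer protocol state held by one route processor."""
    def __init__(self):
        self.tcp_seq = 0   # stand-in for the underlying transport state
        self.routes = {}   # prefix -> next hop learned from the peer

class ProtectedRouter:
    """Toy router with an active and a standby route processor."""
    def __init__(self):
        self.active = RoutingSession()
        self.standby = RoutingSession()
        self.session_resets = 0  # resets the peer would observe

    def receive_update(self, prefix, next_hop):
        self.active.routes[prefix] = next_hop
        self.active.tcp_seq += 1
        # Checkpoint after every update so the standby mirrors the active state.
        self.standby = copy.deepcopy(self.active)

    def fail_active(self, stateful=True):
        if stateful:
            # Standby takes over with the session state intact.
            self.active = self.standby
        else:
            # Without protection the session restarts from nothing.
            self.active = RoutingSession()
            self.session_resets += 1
        self.standby = copy.deepcopy(self.active)

protected = ProtectedRouter()
unprotected = ProtectedRouter()
for router in (protected, unprotected):
    router.receive_update("10.0.0.0/8", "192.0.2.1")
    router.receive_update("172.16.0.0/12", "192.0.2.1")

protected.fail_active(stateful=True)
unprotected.fail_active(stateful=False)
print(len(protected.active.routes), protected.session_resets)
print(len(unprotected.active.routes), unprotected.session_resets)
```

In the protected case the peer sees no session reset and no routes need to be relearned; in the unprotected case the session restarts empty and the full route exchange must be repeated.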
It is estimated that implementing stateful protocol protection can eliminate about 60% of routing convergence events. If a routing convergence event occurs, however, it's important that it happens quickly and data-forwarding is protected. While significant technological advances in this area remain elusive, vendors have found that optimization of routing convergence can be achieved through protection techniques and implementation improvements such as the following:
- Fast reroute and link bundling with fast recovery. Although they do not improve convergence time, fast protection schemes allow traffic to continue to be passed down an alternate path until the routing protocols converge on a new topology. The effect is continuous operation.
- BGP next-hop indirection. One possible reason for excessive convergence times is the time it takes for BGP to process each individual BGP route. Because BGP routes are recursively resolved through the interior gateway protocol (IGP) OSPF/IS-IS next-hop, great improvements in BGP convergence times can be achieved by grouping BGP routes that share the same next-hop, so that forwarding changes are made once for the group rather than individually for each BGP route.
- Fast link failure detection. One of the more time-consuming aspects of convergence after a failure is the actual time to detect the failure. This time can be reduced by notifying routing-protocol processes directly instead of waiting for protocol hello messages to time out. When hardware mechanisms notice loss of light or component failure, they can immediately notify routing processes and initiate recovery. Similar effects can be obtained by trimming the IGP hello interval from 30 sec to subsecond values.
- Continuous forwarding upon routing changes. Another important behavior of routers is persistent forwarding during routing changes. When a router receives an update that causes traffic to use a different route or next-hop, the router should not drop packets during the transition. This follows the philosophy of make-before-break, as opposed to withdrawing the old route before installing a new one.
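The benefit of BGP next-hop indirection described in the list above can be sketched in a few lines of Python. The two table classes, the interface names, and the 1,000 sample prefixes are assumptions made for illustration; real forwarding tables are far more elaborate. The sketch only counts how many table entries must be rewritten when the IGP path to a shared BGP next-hop changes.

```python
class FlatFib:
    """Toy table where every prefix stores its own outgoing interface."""
    def __init__(self):
        self.prefix_to_next_hop = {}
        self.prefix_to_interface = {}

    def add_route(self, prefix, bgp_next_hop, interface):
        self.prefix_to_next_hop[prefix] = bgp_next_hop
        self.prefix_to_interface[prefix] = interface

    def resolve(self, bgp_next_hop, interface):
        """Rewrite every prefix using this next-hop; return the write count."""
        writes = 0
        for prefix, next_hop in self.prefix_to_next_hop.items():
            if next_hop == bgp_next_hop:
                self.prefix_to_interface[prefix] = interface
                writes += 1
        return writes

    def lookup(self, prefix):
        return self.prefix_to_interface[prefix]

class IndirectFib:
    """Toy table where prefixes point at a shared next-hop entry."""
    def __init__(self):
        self.prefix_to_next_hop = {}
        self.next_hop_to_interface = {}

    def add_route(self, prefix, bgp_next_hop):
        self.prefix_to_next_hop[prefix] = bgp_next_hop

    def resolve(self, bgp_next_hop, interface):
        """One write re-points every prefix that shares this next-hop."""
        self.next_hop_to_interface[bgp_next_hop] = interface
        return 1

    def lookup(self, prefix):
        return self.next_hop_to_interface[self.prefix_to_next_hop[prefix]]

flat, indirect = FlatFib(), IndirectFib()
for i in range(1000):
    prefix = f"10.{i // 256}.{i % 256}.0/24"
    flat.add_route(prefix, "192.0.2.1", "eth0")
    indirect.add_route(prefix, "192.0.2.1")
indirect.resolve("192.0.2.1", "eth0")

# The IGP now reaches 192.0.2.1 through a different interface.
flat_writes = flat.resolve("192.0.2.1", "eth1")
indirect_writes = indirect.resolve("192.0.2.1", "eth1")
print(flat_writes, indirect_writes)
```

Both tables end up forwarding every prefix out the new interface, but the flat table needs one write per prefix while the indirect table needs a single write for the whole group, which is the source of the convergence improvement.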
Routing scalability can be addressed through a combination of network architectures and more powerful router-processing capabilities. Current router-processing engines typically employ moderately powered microprocessors with a moderate amount of memory. In next-generation networks, routing processors will more closely resemble the chips found in high-performance, fault-tolerant workstations (see Table). Thanks to Moore's law, it is feasible with current technology to obtain extremely high route capacities that should provide plenty of headroom.
With increased size, larger routing processors will take on greater responsibility for managing the routes and connectivity of a variety of services. As such, the need for scalable and reliable routing processors will increase. Providers will likely focus on building a routing architecture based on standalone route processors (route reflectors, route servers) that do not forward data. The benefit of such an out-of-band architecture is that it allows the routing to scale without affecting the data plane. Additionally, these routing systems can be based on an open operating system that permits provider management systems to obtain routing and provisioning information for integration with current operating systems.
In short, the convergence of voice and data services onto a single network is already well underway. In the current climate, service providers must evolve their networks to remain competitive. Before voice and data services can migrate to a single IP/MPLS core, the IP control plane must evolve to support the scalability, reliability, and stability required by business services. Fortunately, technologies exist today that enable these advances. Routing technologies such as stateful protocol protection, optimized route convergence, and high-performance scalable routing processors can provide the IP control plane with the resiliency that's needed.
Eric Brendel is senior network architect at Chiaro Networks (Richardson, TX).
*S. Yin and K. Twist, "The Coming of Age of Absolute Availability," RHK, May 2003.