Network availability is vital in the new world of IP
Implementing a high-availability Internet Protocol infrastructure is a key step in establishing IP as the multiservice protocol of choice.
ALEX DOBRUSHIN, Amber Networks
The explosive growth of Internet Protocol (IP) network traffic represents a major revenue opportunity for service providers that can step up to the challenge of delivering mission-critical IP-based services. Unfortunately, today's IP networks lack the resiliency to deliver the high service availability (99.999%, or "five-nines," uptime) that these applications require.
IP's tremendous momentum is forcing service providers to take a new look at how these networks are being built. The days of using IP networks for just Web browsing and e-mail are gone. IP has become a ubiquitous communications protocol that underpins an increasing array of enterprise services. The criticality of services running on IP continues to increase (see Figure 1).
Mission-critical enterprise resource planning (ERP) applications are moving onto IP. Businesses find themselves more reliant on IP-based virtual private networks (VPNs) that serve as conduits for geographically dispersed collaborative engineering, the transfer of financial data, and access to remotely hosted servers.
First-generation IP services (e.g., Internet access, Web hosting) have become a commodity. At the same time, service providers are looking for ways to increase revenue. Service providers are viewing the converged IP network as the basis for their next-generation business models. They are investing in a robust IP infrastructure to take advantage of increasing data revenues.
They also see the synergy of combining data and voice traffic on the same multiservice IP network. Although it is not growing at the same rate as data traffic, voice revenue still represents the bulk of the wireline carrier's revenues. Offering voice-over-IP (VoIP) services is one way to migrate these revenues to IP networks.
Another key source of revenue comes from traditional transmission services, like time-division multiplexing (TDM) private line, frame relay, and ATM. According to industry analyst RHK Inc. (San Francisco), these services will represent $50 billion in revenues in 2003. Using new service transformation technologies now available in the industry, service providers can encapsulate this non-IP traffic for IP transport.
This type of multiservice aggregation represents another way of lowering the total cost of ownership of the IP network. It provides a way to improve the return on investment for the estimated $70 billion in core routing and transmission equipment that service providers will purchase by 2003, according to RHK.
Supporting mission-critical IP services and migrating non-IP services to IP networks requires service providers to achieve the same level of network availability that they support in their traditional public-switched telephone network (PSTN) infrastructure. Yet the industry acknowledges that IP networks have not reached this level of reliability. Sprint spokesperson Charles Fleckenstein recently made this point in the trade press: "When you pick up a telephone and dial, you want it to work, and it does, 99.999% of the time. And IP does not deliver that kind of reliability."
While customers would like to collapse a diverse set of mission-critical applications to the service provider's IP networks, it will not happen in a meaningful fashion until the network's reliability has improved. It is obvious that as IP moves up in the service food chain, achieving the PSTN-level availability will be crucial for meeting customer expectations and for growing IP revenues.
Five-nines availability is a staple in traditional TDM networks, which employ techniques such as real-time monitoring of switches and automatic rerouting at the facilities level to attain maximum network reliability. Five-nines availability translates into approximately 5 minutes of downtime per year. Although IP networks run on top of a resilient optical transmission layer and the IP core can maintain high availability using a diversely routed mesh architecture, the IP edge remains a single point of failure (see Figure 2).
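The "approximately 5 minutes" figure follows directly from the availability percentage. The short sketch below works through the arithmetic; the function name and the choice of an average 365.25-day year are assumptions made for illustration.

```python
# Assumption: an average year of 365.25 days is used for the conversion.
MINUTES_PER_YEAR = 365.25 * 24 * 60


def downtime_budget_minutes(availability: float) -> float:
    """Return the yearly downtime, in minutes, permitted at a given availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR


# Compare the downtime budgets for three, four, and five nines.
for label, availability in [
    ("three-nines", 0.999),
    ("four-nines", 0.9999),
    ("five-nines", 0.99999),
]:
    budget = downtime_budget_minutes(availability)
    print(f"{label}: {budget:.2f} minutes of downtime per year")
```

At five-nines the budget works out to a little over 5 minutes per year, which is the figure quoted above.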
At the IP service edge, hundreds, if not thousands, of subscribers can be connected to a single router. Yet the most popular edge routers today were not designed for high-availability carrier applications, and outages due to software and hardware faults are common. As a result, subscribers may be locked out of their services for 7-10 minutes (or considerably longer) each time the router reboots to recover from a fault. Over a 12-month span, the typical legacy edge router is down at least 40 minutes, more than seven times what PSTN-grade networks allow. This figure assumes that system reboot and routing protocol recovery are handled automatically; in practice, a significant amount of manual intervention is often required.
Clearly, service providers need to implement sufficient fault tolerance at the service edge to avoid indeterminate periods of router downtime. Because a single 7-10-minute reboot already exceeds the roughly 5 minutes of annual downtime that five-nines permits, having to reboot the router just once per year can break the availability goal, making today's IP networks unsuitable for many mission-critical applications. The non-deterministic nature of router recovery also complicates the prediction of service availability, which makes it difficult to establish service level agreements (SLAs).
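To make the comparison concrete, the following sketch converts the downtime figures cited in this article (a 7-minute reboot and 40 minutes of yearly downtime) into achieved availability. The function names are illustrative, and the 365.25-day year is an assumption.

```python
# Assumption: an average year of 365.25 days.
MINUTES_PER_YEAR = 365.25 * 24 * 60
FIVE_NINES = 0.99999


def availability_from_downtime(downtime_minutes: float) -> float:
    """Convert total yearly downtime in minutes into an availability fraction."""
    return 1.0 - downtime_minutes / MINUTES_PER_YEAR


def breaks_five_nines(reboots_per_year: int, minutes_per_reboot: float) -> bool:
    """True if the reboots alone consume more than the five-nines downtime budget."""
    budget = (1.0 - FIVE_NINES) * MINUTES_PER_YEAR
    return reboots_per_year * minutes_per_reboot > budget


# A single reboot at the low end of the 7-10 minute range cited in the article.
print("One 7-minute reboot breaks five-nines:", breaks_five_nines(1, 7.0))

# The 40 minutes of yearly downtime cited for a typical legacy edge router.
legacy = availability_from_downtime(40.0)
print(f"Availability with 40 minutes of downtime: {legacy * 100:.4f}%")
```

With 40 minutes of downtime the router lands at roughly 99.992% availability, which is four-nines territory rather than five.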
Some service providers are attempting to make the edge more tolerant of faults by deploying two routers in parallel. While it garners a slight improvement in availability (router outages still typically require 5 minutes of downtime per incident), it doubles the cost of infrastructure. Twice as many routers, digital-crossconnect ports, and wiring runs must be installed. The deployment and management complexity is also increased. In the long term, this approach will not scale operationally or financially. A different, better solution is needed.
At the heart of the issue is not how quickly the hardware can recover, but what happens to protect the service and network information required to maintain service continuity. Restoring this information is a complex process that must be accomplished by the faulty router, which then must propagate the status of its restored routes throughout the network.
Even when an enterprise-class router that loses its primary control circuitry falls back quickly to its standby controller (if its hardware architecture permits this), it typically takes 7-15 minutes or longer to boot the operating system (OS) and re-converge at the IP layer. The typical edge router recovery sequence is shown in the Table.
The router must be rebooted, followed by restoration of upper-layer protocols that were in effect when the interruption occurred. The two logical functions that must be restored are the routing engine, which terminates all of the routing sessions, and the forwarding engine, which actually forwards the packets.
The routing engine is responsible for getting all the network topology information from its neighbors as well as network segments configured locally. The routing engine parses its routing database to determine the best paths to all the reported destinations. It uses this information to configure the forwarding engine, which contains the packet forwarding information. Since it is the routing engine's job to populate the forwarding engine's memory with valid forwarding table entries, if the routing engine fails, the forwarding table is invalidated and all packets are dropped until the router comes back up.
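The relationship between the two engines can be sketched in a few lines. This is a toy model under stated assumptions, not any vendor's design: the class and method names are hypothetical, routes are matched by exact prefix rather than longest-prefix match, and "best path" simply means lowest cost.

```python
from typing import Dict, List, Optional, Tuple


class ForwardingEngine:
    """Holds the forwarding table that the routing engine programs."""

    def __init__(self) -> None:
        self.table: Dict[str, str] = {}
        self.valid = False

    def install(self, table: Dict[str, str]) -> None:
        self.table = dict(table)
        self.valid = True

    def invalidate(self) -> None:
        self.table.clear()
        self.valid = False

    def forward(self, prefix: str) -> Optional[str]:
        """Return the next hop, or None if the packet is dropped."""
        if not self.valid:
            return None
        return self.table.get(prefix)


class RoutingEngine:
    """Collects routes from neighbors and selects the best path per prefix."""

    def __init__(self, forwarding: ForwardingEngine) -> None:
        self.database: Dict[str, List[Tuple[int, str]]] = {}
        self.forwarding = forwarding

    def learn(self, prefix: str, next_hop: str, cost: int) -> None:
        self.database.setdefault(prefix, []).append((cost, next_hop))

    def program_forwarding(self) -> None:
        # Pick the lowest-cost candidate for every prefix and install the result.
        best = {p: min(cands)[1] for p, cands in self.database.items()}
        self.forwarding.install(best)

    def fail(self) -> None:
        # A routing engine failure invalidates the forwarding table as well.
        self.database.clear()
        self.forwarding.invalidate()


fe = ForwardingEngine()
engine = RoutingEngine(fe)
engine.learn("10.0.0.0/8", "routerA", cost=20)
engine.learn("10.0.0.0/8", "routerB", cost=10)
engine.program_forwarding()
print(fe.forward("10.0.0.0/8"))  # next hop chosen by lowest cost
engine.fail()
print(fe.forward("10.0.0.0/8"))  # None: packets are dropped until recovery
```

The point of the model is the coupling: once the routing engine fails, the forwarding engine has nothing valid to forward with, so traffic stops even though the forwarding hardware itself may be healthy.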
Both the Interior Gateway and Border Gateway protocols (IGP and BGP) must be restored. Internal to the network, the IGP must first converge as it provides BGP with reachability information regarding other BGP routers within the routing domain. BGP is then used to advertise IP network reachability information to other routers, both internal and external. Because of the inherent risks (e.g., packet forwarding loops) associated with utilizing potentially stale packet forwarding information, the router's system controller typically invalidates the packet forwarding engine immediately when a failure is detected.
Before routing sessions can be re-established, system configurations (e.g., frame relay and ATM virtual circuit mappings) must be loaded. Then dropped point-to-point protocol (PPP) sessions must be re-established. For a large IP service edge router, tens of thousands of PPP sessions may need to be restored, which can take several minutes. The outage affects not only subscribers; it may also cause havoc across the service provider's own network, peer networks, and the entire Internet (see Figure 3).
Inside the service provider's own network, the edge router failure has an impact on other routers in the network. When the routing sessions time out, adjacent routers report the loss of communication with the failed router. All of these interior routers must then re-compute the shortest path to each of the destinations that were learned via the IGP. During this computation, the router's CPU is used almost exclusively to process updates.
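The re-computation described here is a shortest-path calculation over the IGP topology. The sketch below uses Dijkstra's algorithm on a small, made-up topology to show how one interior router's view changes when an edge router disappears. The router names, link costs, and helper functions are all hypothetical.

```python
import heapq
from typing import Dict

Graph = Dict[str, Dict[str, int]]


def shortest_paths(graph: Graph, source: str) -> Dict[str, int]:
    """Dijkstra's algorithm: lowest total cost from source to each reachable node."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for neighbor, cost in graph.get(node, {}).items():
            candidate = d + cost
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return dist


def remove_router(graph: Graph, failed: str) -> Graph:
    """Return a copy of the topology without the failed router or its links."""
    return {
        node: {n: c for n, c in links.items() if n != failed}
        for node, links in graph.items()
        if node != failed
    }


# Hypothetical topology: two core routers, two edge routers, one destination.
topology: Graph = {
    "core1": {"core2": 1, "edge1": 1},
    "core2": {"core1": 1, "edge1": 1, "edge2": 1},
    "edge1": {"core1": 1, "core2": 1, "dest": 1},
    "edge2": {"core2": 1, "dest": 5},
    "dest": {"edge1": 1, "edge2": 5},
}

before = shortest_paths(topology, "core1")
after = shortest_paths(remove_router(topology, "edge1"), "core1")
print("cost to dest before failure:", before["dest"])
print("cost to dest after failure:", after["dest"])
```

Every interior router has to repeat this calculation for all destinations when the failure is detected, and again when the router returns.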
The disappearance of the failed router will likely cause some level of network instability as adjacent routers become aware of the problem. The failed router will lose its BGP routing sessions. Its BGP neighbors will report that the destinations learned via the failed router are no longer available. These messages propagate across thousands of networks and the tens of thousands of BGP routers that compose the Internet backbone. When the router recovers, the lengthy notification process must be repeated across the entire Internet backbone.
The act of going down and back up also triggers routing advertisement oscillations, or "route flaps," which may result in BGP route re-computation and updates to the routing and forwarding tables throughout the network. In such cases, BGP route flap dampening is used to penalize misbehaving routes by suppressing them for some amount of time. Suppression of routes may go on for several minutes or even hours, during which time the router's connectivity is lost.
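Flap dampening is commonly described as a penalty that grows with each flap and decays exponentially over time, with one threshold to start suppression and a lower one to end it. The sketch below models that idea only; the class name is hypothetical, and the parameter values (1,000 per flap, suppress at 2,000, reuse at 750, 15-minute half-life) are illustrative assumptions rather than figures from this article.

```python
class DampenedRoute:
    """Toy model of route flap dampening with an exponentially decaying penalty."""

    def __init__(
        self,
        penalty_per_flap: float = 1000.0,  # assumed value
        suppress_limit: float = 2000.0,    # assumed value
        reuse_limit: float = 750.0,        # assumed value
        half_life_s: float = 900.0,        # assumed 15-minute half-life
    ) -> None:
        self.penalty_per_flap = penalty_per_flap
        self.suppress_limit = suppress_limit
        self.reuse_limit = reuse_limit
        self.half_life_s = half_life_s
        self.penalty = 0.0
        self.last_update = 0.0
        self.suppressed = False

    def _decay(self, now: float) -> None:
        # Halve the penalty for every half-life that has elapsed.
        elapsed = now - self.last_update
        self.penalty *= 0.5 ** (elapsed / self.half_life_s)
        self.last_update = now

    def flap(self, now: float) -> None:
        """Record one flap at time `now` (seconds)."""
        self._decay(now)
        self.penalty += self.penalty_per_flap
        if self.penalty >= self.suppress_limit:
            self.suppressed = True

    def is_suppressed(self, now: float) -> bool:
        """Report whether the route is still suppressed at time `now`."""
        self._decay(now)
        if self.suppressed and self.penalty < self.reuse_limit:
            self.suppressed = False
        return self.suppressed


route = DampenedRoute()
for t in (0.0, 60.0, 120.0):  # three flaps, one minute apart
    route.flap(t)
print("suppressed right after third flap:", route.is_suppressed(120.0))
print("suppressed 30 minutes later:", route.is_suppressed(120.0 + 1800.0))
```

With these assumed values, three flaps in two minutes are enough to suppress the route, and it stays suppressed for most of the following half hour even though the router itself may have recovered long before.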
This kind of unpredictable behavior is not suitable for meeting the needs of mission-critical IP services. Nor can this operational model foster the convergence architecture needed to bring the incremental revenue afforded by the transport of traditional non-IP services.
What's needed is an edge routing platform that can recover quickly from faults, thus avoiding the downtime and negative effects on the overall network behavior and performance. A new breed of fault-resistant routers is poised to solve these issues. These routers have purpose-built, redundant hardware complemented by a new software environment capable of providing continuous service availability.
While this requirement may seem straightforward, providing redundancy for routing protocol state (e.g., BGP, Open Shortest Path First [OSPF], and Intermediate System to Intermediate System [IS-IS]) and for the rest of the control plane is extremely complex. It requires a ground-up approach to the router OS. The router's hardware is directed by this fault-tolerant OS, which allows routing information to be dynamically mirrored among multiple router control elements. Implementing such a new OS in the legacy router environment is practically impossible; legacy routers are simply not designed to support dynamic recovery.
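The idea of mirroring routing information among control elements can be illustrated with a deliberately simplified model. It assumes synchronous mirroring of a plain route table between one active and one standby element. Real implementations must also preserve protocol session state, which this sketch does not attempt, and all names here are hypothetical.

```python
from typing import Dict


class ControlElement:
    """One router control element holding a copy of the routing state."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.routes: Dict[str, str] = {}


class RedundantControlPlane:
    """Keeps the standby element's routing state in step with the active one."""

    def __init__(self) -> None:
        self.active = ControlElement("primary")
        self.standby = ControlElement("standby")

    def update_route(self, prefix: str, next_hop: str) -> None:
        # Every change is applied to both elements (assumed synchronous mirroring).
        self.active.routes[prefix] = next_hop
        self.standby.routes[prefix] = next_hop

    def withdraw_route(self, prefix: str) -> None:
        self.active.routes.pop(prefix, None)
        self.standby.routes.pop(prefix, None)

    def fail_over(self) -> ControlElement:
        """Promote the standby; the failed element rejoins as the new standby."""
        failed = self.active
        self.active = self.standby
        self.standby = ControlElement(failed.name + "-restarted")
        # The restarted element resynchronizes from the new active element.
        self.standby.routes = dict(self.active.routes)
        return self.active


plane = RedundantControlPlane()
plane.update_route("10.0.0.0/8", "routerB")
plane.update_route("192.168.0.0/16", "routerC")
new_active = plane.fail_over()
print(new_active.name, "now active with", len(new_active.routes), "routes")
```

Because the standby already holds the same routes at the moment of failure, nothing has to be relearned from neighbors, which is the property that lets forwarding continue without the reboot-and-reconverge cycle described earlier.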
Bringing routing fault tolerance to the edge enables service providers to create a PSTN-grade IP network. It allows them to offer mission-critical IP services and provides access to a large pool of revenue-generating services through multiservice aggregation (ATM, frame relay, TDM leased lines over IP). The benefits are numerous:
- Reduced equipment costs (capital expenses).
- Reduced operational costs due to the consolidation of network elements (power and space).
- Simplified capacity planning and traffic engineering and reduced administration and management.
- Improved performance.
- Improved reliability due to the reduction of cabling and related infrastructure.
- Increased profit margins, and improved customer satisfaction and retention.
With increasing competition, dedicated IP access is becoming more of a commodity than ever. Today, most IP service providers can only rise above competition based on price. With the five-nines IP network, service providers can maintain the value of their IP services by offering premium services with enhanced SLAs. The result is significant competitive differentiation, improved customer satisfaction, and increased profit margins.
Alex Dobrushin is vice president of marketing at Amber Networks (Fremont, CA). He can be reached via the company's Website, www.ambernetworks.com.