Finding light at the end of the tunnel
By BRIAN BURBA, Concord Communications--Integrated fault and performance management gives service providers one view of their entire optical network.
When a freight train carrying toxic chemicals exploded in a ball of fire in a Baltimore tunnel last July, it first appeared to be a tragic accident that affected local transportation. When the accident severed a central optical artery connecting the mid-Atlantic to the rest of the country, it instantly became a national network management problem that ground eastern U.S. Internet traffic to a near halt in the mid-Atlantic states and Northeast.
Web surfers, work-at-home professionals, and IT managers at global enterprises alike suffered the same result--little or no Internet access, slow or nonexistent e-mail, and severely hampered outside communications. Washington, DC, New York, Philadelphia, and Atlanta experienced serious Web traffic flow problems. Service providers such as WorldCom, PSINet, and Genuity scrambled to find alternate routes within their networks for their data.
A lesson learned was that with a real-time, end-to-end, detailed view of their overall optical infrastructures, some service providers could have quickly identified the best alternate routes for redirecting traffic. An integrated fault and performance management system could then have helped service providers monitor how those alternate routes handled traffic so they could clear away bottlenecks before customers noticed.
But beyond enabling carriers to react to calamities, integrated fault and performance management software allows service providers with optical infrastructures to automate and simplify a very complex task: managing multilayered networks that incorporate many different protocols and systems. As fiber-optic networks penetrate deeper into metropolitan areas and telephone companies add new products and services, these service providers will have to monitor more of the handoffs between electrical and optical networks.
An integrated fault and performance management system gives them a complete view of their entire infrastructure, including the optical and electrical network and all the handoffs in between. Being able to see problems as they occur (like the broken line from the tunnel fire) and identifying troubling performance trends that could evolve into real problems help service providers deliver on the promises in their service-level agreements (SLAs).
The optical challenge
Born more than 20 years ago of a need for increased bandwidth, fiber-optic technology has become the backbone for service providers delivering a broad variety of lightning-fast, high-volume telecommunications services, including millions of telephone calls and video feeds.
Today's optical networks are far more complex than the earliest optical systems. The original optical backbones carried only one set of data over dedicated bandwidth. Now these networks support many different applications, protocols, and pieces of equipment, moving data at bit rates ranging from OC-48 (2.5 Gbits/sec) up to OC-192 (10 Gbits/sec). From a management perspective, these networks are very intricate, because a web of LANs, WANs, and MANs now sends digital packets from ATM, frame relay, or Wireless Application Protocol (WAP) sources onto the optical network.
The handoffs between the IP/ATM networks and SONET and DWDM devices require converting electrical digital signals to lightwaves. For now, most of these handoffs occur just outside of large metropolitan areas, where most optical multiplexing originates. Optical networking will move closer to the last mile as service providers continue to offer more bandwidth-hungry services.
For example, when someone in an office building in Boston makes a voice over IP (VoIP) call to a colleague in San Francisco, the signal will begin on an IP network and may switch to ATM before reaching the gateway to the optical backbone. An MPLS device may sit on the routers and optical switches along each leg to ensure that routers are handling the VoIP packets as priority traffic so call quality doesn't suffer.
After a series of hops on the fiber-optic line, the call signal will hit a gateway outside of San Francisco and reenter the electrical domain before arriving at the recipient's desk. That's how the call is supposed to work. Making sure it goes according to plan is another story. Tracking the call's quality and ensuring against problems becomes much more involved than monitoring a fixed signal over dedicated bandwidth.
To ensure the call makes the transcontinental journey without a hitch, the service provider has to manage every hop and handoff as well as the multivendor hardware and software used to transfer data and transform it from electrical to optical signals. Now consider that the service provider may have tens of thousands of "calls" running at the same moment.
A topology view of the service provider's IP and optical networks spanning the country might look like a gigantic pile of silly string at a kindergartner's birthday party. Service providers carrying optical traffic must make sure that, for all its complexity, geographic span, and high traffic volumes, the network continues to deliver peak performance. If it doesn't, they could face sky-high liquidated damages (penalties paid to customers for SLA violations) on the order of $1,000 a minute or more.
Because of the different systems they have driving their IP and optical networks, many service providers have deployed a potpourri of infrastructure management systems. They'll often have separate systems for their IP, ATM, and SONET networks. Some of the larger service providers may break down management even further by assigning different systems to geographic regions or business units.
Service providers with hardware and software from a variety of vendors face another network management challenge. These products are proprietary, and each has a unique performance-reporting scheme. Service providers running different management systems will have to spend time mapping those systems to each proprietary piece of hardware and software to get a clear view of how every component is performing. If they don't, they could miss a problem that affects service.
Most service providers delivering services over optical networks have stitched together a patchwork of different management systems that only "see" one segment of the overall network. That limits their view of the network. If a problem arises, they may need weeks to track down what went wrong because they have to compare information from one management system to the next.
Performance management and capacity planning also become difficult for the same reason. When a catastrophic event like the Baltimore tunnel fire occurs, network managers using a piecemeal management approach will struggle to find good alternate traffic routes because they'll have to analyze performance information from several different management systems. All the while, they're racking up steep liquidated damages.
Moreover, many service providers that have deployed a patchwork of management systems are using technology that monitors and reports on either hard faults or performance (historical trends). Having one without the other could leave a service provider's electrical and optical networks vulnerable to problems that could go unnoticed until after they affect customer service.
One platform, one view
Service providers can simplify digital- and optical-network management with an integrated fault and performance management system that gives them an end-to-end view of their entire network. Marrying fault and performance management on one platform alerts service providers to hard faults in real time and identifies predictive performance trends that could indicate a future problem.
Fault systems on such an integrated platform pinpoint problems in real time using traps that sit on individual components such as servers, routers, switches, and other network elements. In addition, network operations center (NOC) staff can place agents on systems in their optical networks to monitor different functions.
When a device such as an optical amplifier exceeds a given performance threshold, the agent generates a trap for that fault and sends an alarm to the company's management interface. These alarms help NOC staff identify problems before their customers do.
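The trap mechanism described above can be sketched in a few lines. This is a minimal illustration, not a vendor API: the device name, metric, and threshold values are all hypothetical, standing in for whatever the agent actually samples on the amplifier.

```python
# Minimal sketch (assumed names/values) of an agent turning a threshold
# breach on an optical amplifier into an alarm for the management interface.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Alarm:
    device: str
    metric: str
    value: float
    threshold: float

def check_device(device: str, metric: str, value: float,
                 threshold: float) -> Optional[Alarm]:
    """Return an Alarm when the sampled value exceeds its threshold."""
    if value > threshold:
        return Alarm(device, metric, value, threshold)
    return None

# Example: amplifier gain drifting past its configured ceiling.
alarm = check_device("amp-04", "gain_db", 23.7, threshold=22.0)
if alarm:
    print(f"ALARM {alarm.device}: {alarm.metric}={alarm.value} "
          f"(threshold {alarm.threshold})")
```

In a real deployment the alarm would travel to the NOC as an SNMP trap or similar notification rather than a printed line.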
The performance systems on the integrated platform track how different infrastructure elements perform over time, so NOC staff can understand normal behavior and spot trends that could indicate a developing problem. Usually, performance management relies on polling to assess the entire infrastructure's health.
The infrastructure management system periodically sends messages to the various devices in the electrical and optical networks to sample their performance and determine if they are operating within acceptable levels. The devices reply to the management system with the performance data that the system will store in a database.
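The poll-and-store cycle just described can be outlined as follows. This is a hedged sketch: the device addresses are invented, and `poll_device` is a stand-in for whatever query (SNMP GET, TL1, or a vendor interface) the management system actually issues.

```python
# Sketch of one polling cycle: the manager samples each device and keeps
# the readings; a production system would insert them into a database.
import time

def poll_device(address: str) -> dict:
    # Stand-in for the real query against the network element.
    return {"address": address, "util_pct": 41.0, "ts": time.time()}

def poll_cycle(devices: list, store: list) -> list:
    for addr in devices:
        sample = poll_device(addr)
        store.append(sample)  # database insert in a real deployment
    return store

# Two hypothetical elements: a Boston router and a New York add/drop mux.
history = poll_cycle(["rtr-bos-1", "adm-nyc-7"], [])
print(len(history))  # one stored sample per polled device
```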
More advanced integrated fault and performance management applications automatically correlate historical trend data and put it into reports. They will also provide predictive trending so NOC managers can proactively adjust traffic flows as the demand for service fluctuates. The reports give NOC staff a clear view of what's going on in the infrastructure. By seeing how devices operate over time, managers can get a better view of how they normally run--and when something is wrong.
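One simple way to see how predictive trending can work is to fit a straight line to stored utilization samples; a persistently positive slope flags growing demand before it becomes congestion. The sample values below are hypothetical, and real products use more sophisticated models than this least-squares sketch.

```python
# Illustrative trend detection: least-squares slope over evenly spaced
# polling samples. A rising slope warns of demand growth ahead of trouble.
def trend_slope(samples: list) -> float:
    """Per-interval rate of change across evenly spaced samples."""
    n = len(samples)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(samples) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, samples))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

util = [40, 43, 47, 52, 58]  # hypothetical link utilization, percent
print(round(trend_slope(util), 1))  # 4.5 points per polling interval
```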
Keeping service up and running
Service providers facing increased infrastructure demands of next-generation services will need fault and performance working together in one system. This integrated platform gives them the combined view to identify and fix the problems that could affect service so they can deliver the quality of service their customers expect.
For example, if a router falls below a preset threshold for delivering data to a SONET gateway, most fault systems will automatically send an alarm--and keep sending an alarm for as long as the problem exists. The resulting barrage of alarms often prevents NOC staff from seeing what else is going on in the infrastructure. De-duplication capabilities can narrow down the number of alarms, but still won't tell staff if it is a real problem or something they can safely ignore.
An integrated view of fault and performance lets staff refine the threshold for the router. If the router stays below the threshold for more than 20 minutes, the system recognizes that behavior as abnormal and sends an alarm signaling a real problem rather than a random spike.
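The sustained-window rule can be sketched as a simple check over timestamped samples. The threshold, window length, and sample values here are illustrative only; the point is that a brief dip is ignored while a prolonged slump raises one meaningful alarm.

```python
# Sketch of the refined rule: alarm only when the router stays below its
# delivery threshold for a sustained window (here 20 minutes), so a
# momentary dip does not flood the NOC with duplicate alarms.
def sustained_breach(samples, threshold, window) -> bool:
    """samples: list of (minute, value) pairs in time order. True if the
    value stays below `threshold` for at least `window` minutes."""
    run_start = None
    for minute, value in samples:
        if value < threshold:
            if run_start is None:
                run_start = minute
            if minute - run_start >= window:
                return True
        else:
            run_start = None
    return False

# A 5-minute dip is ignored; a 25-minute slump raises the alarm.
dip   = [(t, 80 if 10 <= t < 15 else 95) for t in range(40)]
slump = [(t, 80 if 10 <= t < 35 else 95) for t in range(40)]
print(sustained_breach(dip,   threshold=90, window=20))   # False
print(sustained_breach(slump, threshold=90, window=20))   # True
```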
This integrated view gives service providers better information about how their infrastructures (including IP, ATM, and SONET segments, along with other types of networks) are running, and tells them when a problem occurs and why. Armed with this information, they can quickly find and fix problems before they affect service, without wasting time and money tracking down phantom performance issues in the infrastructure.
These features give service providers the information they need to maintain SLA conditions set with their customers and prioritize what problems they need to address and those they can ignore. The next time a major optical trunk goes down, they can use this technology to gain an integrated view of the entire infrastructure to determine the best alternate route for existing traffic. They can then monitor how that alternate route is handling traffic and compensate for any bottlenecks that arise before customers experience any slowdowns.
Products that manage critical traffic flows, together with predictive real-time management, give service providers a comprehensive view of their entire infrastructures. Service providers can use these fault and performance capabilities to keep their infrastructures running smoothly despite the complexity of weaving together digital and optical networks and the need to manage multiple protocols, systems, and hardware. Telecom companies that expand their service offerings can use integrated fault and performance management to sustain that growth by managing the networks, systems, and applications in their optical networks to provide the reliable quality of service (QoS) customers demand.
Integrated fault and performance management gives service providers the information they need to make sound decisions about capacity planning. They can also prove that they are meeting QoS SLAs they sign with customers. And they can generate additional revenue by selling this performance information back to their customers so they can use it to plan for future needs.
Moreover, service providers using a 20th century approach to troubleshooting problems--rolling trucks when customers complain--will save money and fix problems faster with an integrated fault and performance management system that pinpoints a problem before it affects service. This capability will become increasingly important as optical networks penetrate further into metro areas and problems with optical networks become more apparent to customers.
Brian Burba is vice president of telco and service provider marketing at Concord Communications Inc. (Marlboro, MA). He can be reached via the company's Website at www.concord.com.