Unbiased measurement of QoS on IP networks

April 1, 2001
Premises Networks

Third-party instrumentation that monitors and troubleshoots traffic can verify quality of service for the network administrator and, perhaps more important, for customers.

TIM BEAN, Shomiti Systems Inc.

Given the rapid acceptance of Internet Protocol (IP), quality of service (QoS) has become one of the biggest challenges for network administrators. That's especially true for voice and video applications that require real-time performance, which means minimizing buffering in the network infrastructure. Policy-based systems, gateways, switches, and routers are often configured with a myriad of vendor settings and protocol combinations to work in unison to provide priority for the real-time demands of this multimedia traffic. The challenge is in knowing that these QoS mechanisms are actually working.

Long the dominant architecture for workgroup and departmental LANs, Ethernet is extending its reach throughout the enterprise with the wide deployment of Gigabit Ethernet in backbone and campus architectures. The emerging integration of voice, video, and data in IP networks in the enterprise, metropolitan-area-network, and telecommunications infrastructures presents network managers with new requirements for ensuring QoS.

Figure 1. Internet Protocol telephony points of presence are interconnected by WAN links (T1, T3, and OC-N).

Driven by cost savings and a vision of the yet unrealized killer applications, an integrated voice, video, and data network over IP that supports variable degrees of QoS is a foregone conclusion. The IP QoS challenge is still being solved through a variety of evolving policy, bandwidth management, and call-control mechanisms.

With these QoS mechanisms properly configured, the network will perform flawlessly with regard to streaming video and high-quality voice 99.9% of the time. It's the other 0.1%, which adds up to almost 45 minutes a month, that will test the business relationships of the infrastructure provider, carrier, and enterprise customer. Resolving that 0.1% requires a vendor-independent view of the network, so that no party can shift blame to another. The QoS needs to be seen, understood, and improved as quickly as possible.
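The arithmetic behind that figure is easy to check. The short sketch below is only an illustration; the function name and the choice of 30-day and 31-day months are assumptions made for the example, not part of any instrumentation product.

```python
# Minutes in a month that fall outside a given availability target.
MINUTES_PER_DAY = 24 * 60


def degraded_minutes(availability: float, days: int = 30) -> float:
    """Return the minutes in `days` days that are not covered by `availability`."""
    return (1.0 - availability) * days * MINUTES_PER_DAY


print(round(degraded_minutes(0.999, days=30), 1))  # 30-day month
print(round(degraded_minutes(0.999, days=31), 1))  # 31-day month
```

For a 99.9% target this works out to roughly 43 to 45 minutes, depending on the length of the month.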

Carriers, enterprises, and network-equipment manufacturers all have similar QoS requirements. Carriers can see the problems with each voice and video IP call on their network through high-performance software or hardware instrumentation at each point of presence (PoP). With accurate and comprehensive measurements of their network, they can correct the QoS problems.

Enterprise customers that deploy voice on their traditional "data-only" IP networks will need to enhance their management and analysis capabilities to understand why a particular phone or set of calls is suddenly not working. As killer applications are deployed for the enterprise, it will become imperative for the network administrator to prove that any issues are a result of the application and not the network infrastructure. If the problem is a result of the network configuration or a faulty network device, that information must surface quickly so the problem can be fixed.

Network-equipment manufacturers must measure the results of the tests they perform. While that seems obvious, it is not the easiest task to determine if a video or voice-only call "worked" unless all subjectivity is removed through quantification.

The PoP is where a competitive local-exchange carrier (CLEC), local-exchange carrier (LEC), or regional Bell operating company (RBOC) passes phone calls destined for locations outside its local access and transport area to a long-distance carrier or interexchange carrier (IXC).

Figure 2. The point of presence (PoP) is where all the network traffic is aggregated. Therefore, network administrators can "instrument" the PoPs to view all the traffic using a robust device, which is often controlled remotely from a network operation center.

Within the PoP is the point of termination (PoT), which is where the CLEC or LEC ends. The PoT consists of a gateway that takes the traditional public switched telephone network call from the CLEC, LEC, or RBOC and places it onto an Ethernet pipe. The Ethernet pipe connects to a WAN access device, and the call is carried via IP over disparate WAN technologies to a destination PoP, where the process is reversed (see Figure 1).

The Ethernet within the PoP is the "tap" point where all the network traffic is aggregated. Network administrators can instrument these critical segments to view all the traffic (the packets at 10/100-Mbit/sec Ethernet and Gigabit Ethernet speeds) from one system simultaneously (see Figure 2).

Many PoPs are "lights out" locations with no personnel onsite. Networking devices in such locations must be very robust, requiring minimal maintenance. The equipment must also function in a distributed manner where control and configuration are done securely and remotely. These devices are driven by specialized application-specific integrated circuits and automatically controlled remotely from a centralized network operation center (NOC). For PoPs with large gateway complexes with many switched Ethernet ports, the devices can be configured to rove from port to port to monitor and analyze the ports under scrutiny if there are problems. That also can be done remotely from the centralized NOC.

The key is to capture every call at full line rate on a full-duplex network at each respective PoP and associate the QoS measurements on a call-by-call basis. The monitored and captured IP phone calls at each PoP can trigger alarms or be analyzed from a central location.
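As a rough illustration of what "call-by-call" association means, the sketch below groups captured packets by call and computes a loss figure for each one. It assumes the calls are carried as RTP streams and that the stream's SSRC identifier can stand in for a call or channel; the `RtpPacket` record and the `per_call_loss` function are hypothetical names, and sequence-number wraparound is ignored for brevity.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class RtpPacket:
    ssrc: int         # RTP synchronization source, used here as the call/channel key
    seq: int          # RTP sequence number (assumed not to wrap within the capture)
    arrival_s: float  # capture time stamp in seconds


def per_call_loss(packets):
    """Group packets by SSRC and return {ssrc: (received, expected, loss_fraction)}."""
    seqs = defaultdict(set)
    for p in packets:
        seqs[p.ssrc].add(p.seq)
    report = {}
    for ssrc, seen in seqs.items():
        expected = max(seen) - min(seen) + 1  # span of sequence numbers observed
        received = len(seen)                  # distinct packets actually captured
        report[ssrc] = (received, expected, 1.0 - received / expected)
    return report


# Two interleaved calls; sequence number 3 of call 0xA1 never reaches the tap.
capture = [
    RtpPacket(0xA1, 1, 0.000), RtpPacket(0xB2, 10, 0.001),
    RtpPacket(0xA1, 2, 0.020), RtpPacket(0xB2, 11, 0.021),
    RtpPacket(0xA1, 4, 0.060), RtpPacket(0xB2, 12, 0.041),
]
report = per_call_loss(capture)
for ssrc, (received, expected, loss) in sorted(report.items()):
    print(f"ssrc={ssrc:#x} received={received} expected={expected} loss={loss:.1%}")
```

A per-call report of this kind is what an alarm threshold or a central analysis station would consume; a production instrument would do the grouping in hardware to keep up with line rate.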

Some large enterprise networks require distributed solutions for centralized management, as well. In these scenarios, network costs can be mitigated through a device called a tap. Taps enable monitoring, analysis, or security instrumentation to be dynamically inserted into Ethernet links. External taps allow the network administrator to use distributed monitoring devices over many switch ports or on different key backbones without requiring the purchase of a device for each segment (see Figure 3).

A tap goes beyond the access granted by a switch span or mirrored port because it provides access to the actual network and to any physical errors without impacting the performance of the switch. Taps are fault-tolerant and passive to the network, and they allow physical errors on individual segments to be monitored, captured, and analyzed. Multiport taps can also rove between segments.

Taps provide significant advantages over port mirroring connections, which usually reflect only one side of a full-duplex connection. Taps give simultaneous access to both sides of a full-duplex switched LAN or virtual LAN. Some switches do not have the ability to view virtual LAN traffic.

It is usually apparent when a network has problems. Alarms trigger, computer screens flash red icons, customers call, and the boss is in the network administrator's office. But how could there be any problems? The network administrator has over-provisioned the network for the amount of traffic. There is plenty of bandwidth. Is the router-queuing algorithm malfunctioning given a new piece of data on the network? Is the policy malfunctioning?

Figure 3. Taps enable network monitoring, analysis, or security instrumentation to be dynamically inserted into Ethernet links. External taps allow the network administrator to use distributed monitoring devices over many switch ports or on different key backbones and manage those devices from a centralized network operation center.

Next to bandwidth constraints and packet loss, latency is the most common problem for real-time multimedia applications on an IP network. Latency is caused by the various buffering requirements of codecs and queuing algorithms in network devices. Policy systems can help network designers architect and provision the network to minimize latency. However, policy systems cannot measure the performance of the network that must carry out the policy. Network nodes do not queue, route, digitize, and transmit traffic equally.
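One way to measure the latency that a policy system cannot see is to time-stamp the same packet at two capture points and subtract. The sketch below is a minimal illustration under stated assumptions: the two instruments share a synchronized clock, each packet can be identified by a hypothetical `(ssrc, seq)` key, and the time stamps shown are made-up sample values.

```python
from statistics import mean


def one_way_delays_ms(ingress, egress):
    """Match packets seen at both capture points and return {key: delay_ms}.

    ingress / egress map a packet key, here (ssrc, seq), to its capture
    time in milliseconds. Packets missing at the egress point are skipped.
    """
    return {key: egress[key] - t for key, t in ingress.items() if key in egress}


# Sample captures from two PoPs (illustrative values only).
ingress = {(0xA1, 1): 0.0, (0xA1, 2): 20.0, (0xA1, 3): 40.0}
egress = {(0xA1, 1): 35.0, (0xA1, 2): 58.0}  # seq 3 never arrived

delays = one_way_delays_ms(ingress, egress)
print(f"mean={mean(delays.values()):.1f} ms, worst={max(delays.values()):.1f} ms")
```

Without synchronized clocks the subtraction is meaningless, which is one reason time-stamp accuracy matters so much in this kind of instrumentation.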

When problems occur, network administrators investigate the change management system to find out "who changed what," or look at management information bases from the equipment vendor. They may even send pings and swap gear to troubleshoot the problem. If the problem is still not solved, the network administrator will be grateful for the third-party instrumentation purchased by the network designer. Most networks will have one system deployed strategically on their critical links, which the technicians can access from anywhere. No flights, no guessing at vendor logs, just pure data from the network itself to help technicians troubleshoot the problem.

When internal users call and complain about the quality of their voice call, network administrators need to prove to the powers that be that the problem is not the internal network. In many cases, it is a problem with the phone itself, such as poorly implemented echo cancellers.

With most third-party QoS instrumentation, more than 20 measurements can be performed on each channel, audio and video, within every call on the IP network. These measurements go beyond what is provided by most IP gateway, switch, and router vendors. Because the instrumentation is independent of those vendors, its QoS measurements are unbiased. This measurement information can be exported into a centralized repository where it can be warehoused for capacity planning, network service auditing, and customer-service applications. When a customer calls, the network administrator will know whether the problem was within the network or in some customer-premises equipment or rogue application.
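The export step can be as simple as writing one row per channel per call. The following sketch uses Python's standard `csv` module; the column names and sample values are hypothetical and do not reflect any particular vendor's schema.

```python
import csv
import io

# Hypothetical schema: one row per channel within each call.
FIELDS = ["call_id", "channel", "loss_pct", "mean_delay_ms"]


def export_measurements(rows, out):
    """Write per-channel QoS measurements to `out` as CSV with a header row."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


rows = [
    {"call_id": "c-001", "channel": "audio", "loss_pct": 0.0, "mean_delay_ms": 36.5},
    {"call_id": "c-001", "channel": "video", "loss_pct": 1.2, "mean_delay_ms": 41.0},
]

buffer = io.StringIO()  # stands in for a file or a connection to the repository
export_measurements(rows, buffer)
print(buffer.getvalue())
```

Keeping the call identifier in every row is what lets the repository later join a measurement to a customer complaint or a call detail record.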

Given the inherent chaos of IP, it is important for the network administrator to know what is really happening.

Tim Bean is vice president of engineering at Shomiti Systems Inc. (San Jose, CA).

Monitor, analyze, test, and troubleshoot. Can the deployed network instrument work in a distributed fashion to do more than just monitor and alarm? Can it analyze and capture the packets at full line rate? Can it generate test traffic at bit granularity to deterministically isolate a problem with a particular network device? Can the device summarize information and drill down to actually solve the problem as opposed to just generating an alarm? Can it do all these functions in one box in a cost-effective manner?

Scalability and ease of use. Does the device take less than 15 minutes to configure and install? Can the tool be used with less than 15 minutes of training? Can the same network-operation-center software work across all the 10/100-Mbit/sec Ethernet, Gigabit Ethernet, and 10-Gigabit Ethernet segments? Can this capability be leveraged across to a storage-area network?

High performance. Can the network administrator see every packet? Are the filters hardware-based so that no packets are missed? Can the network administrator see every phone call and associate the quality of service with a particular call detail record on a call-by-call and channel-by-channel basis?

Accuracy. Does the display accurately depict the traffic? Are the descriptions of the packets complete and accurate? Does a time stamp in the nanosecond range help solve the most subtle problems? Can the hardware operate at full line rate in a full-duplex environment?
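To see why full line rate and nanosecond time stamps go together, consider how little time a minimum-size Ethernet frame occupies on the wire. The sketch below uses the standard Ethernet framing overheads (64-byte minimum frame, 8 bytes of preamble and start-of-frame delimiter, 12-byte interframe gap); the function name is only illustrative.

```python
PREAMBLE_AND_SFD = 8  # bytes sent before every frame
INTERFRAME_GAP = 12   # bytes of idle time between frames
MIN_FRAME = 64        # minimum Ethernet frame size in bytes


def frame_slot_ns(link_bps: int, frame_bytes: int = MIN_FRAME) -> float:
    """Time on the wire, in nanoseconds, consumed by one frame plus overhead."""
    bits = (frame_bytes + PREAMBLE_AND_SFD + INTERFRAME_GAP) * 8
    return bits / link_bps * 1e9


links = [
    ("100-Mbit/sec Ethernet", 100_000_000),
    ("Gigabit Ethernet", 1_000_000_000),
    ("10-Gigabit Ethernet", 10_000_000_000),
]
for name, bps in links:
    slot = frame_slot_ns(bps)
    print(f"{name}: {slot:.1f} ns per frame, {1e9 / slot:,.0f} frames/sec per direction")
```

On Gigabit Ethernet a minimum-size frame can arrive every 672 nanoseconds in each direction, so a time stamp with microsecond resolution could not reliably order back-to-back packets.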