Building redundant networks with media translators

May 1, 1999

Darrell Furlong

LANCAST Inc.

With customers logging onto companies' extranets and e-commerce sites, network interruptions can equal lost opportunity. Charging media translators with the novel task of redundancy offers an extra measure of security.

More traffic than ever is traversing the Internet. Not only are users driving increased bandwidth requirements, they are also demanding greater quality of service from networks. Customers accessing companies' extranets and e-commerce applications are even less tolerant of network problems than in-house users.

Network managers concerned about quantifying service-level metrics such as network availability, session integrity, and uptime percentages were once paid lip service by equipment vendors. Today, suppliers acknowledge that companies whose revenue depends on e-commerce, continuous video streaming, or split-second financial decisions based on global market factors cannot afford the slightest network glitch.

Network disruptions

Controllable factors such as scheduled support, product installation, upgrades, and routine maintenance including backups can affect network availability and reduce uptime. But it's the uncontrollable factors such as power outages, brownouts, equipment failures, electromagnetic interference (EMI), and operator errors that predominantly affect the network at the worst possible moment.

PCs and workstations in a large network are subject to infrequent (monthly) but significant power interruptions, according to a recent study. Although blackouts and voltage spikes are the most noticeable of these events, they account for only 12% of power problems. Roughly 80% of power disturbances are caused by brownouts, overvoltages, surges, and power sags. Power problems specific to the local area network, such as internode communications interference (caused by ground loops between two devices linked by a data cable), noise on the data cable from EMI, and other seemingly minor interruptions, can bring down the network.

Most operators take steps to protect their networks by using uninterruptible power supplies or by installing fiber-optic cables with noise immunity. These measures provide a large degree of protection for workstations and servers, but the network is left vulnerable at the cable and switch port levels.

Why 99% uptime won't suffice

Any power disruption can result in expensive downtime. According to another study, the cost of downtime can range from $300 per minute for a medium-sized local area network (LAN) to $633 per minute for a Unix network--not including the cost of lost revenue. Market data from Contingency Planning Research puts the cost of downtime at $1200 per minute for retail businesses and as high as $108,000 per minute for brokerage operations. If we assume that a small company's 24-hour × seven-day network uptime is 99%, and that every minute of downtime costs at least $1000, a conservative estimate of lost revenue and opportunity cost is $5.25 million per year. Another penalty is the cost of technical support personnel, who typically spend 75% of their time on network problem resolution.
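The arithmetic behind that estimate can be checked with a short sketch. The function names are illustrative, and the 99.9% and 99.99% rows are added only for comparison; they are not figures from the studies cited above. At 99% uptime the result is about 5256 minutes of downtime per year, or roughly $5.26 million at $1000 per minute, which the estimate above rounds to $5.25 million.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes of 24x7 operation


def annual_downtime_minutes(uptime_fraction: float) -> float:
    """Minutes per year the network is unavailable at a given uptime."""
    return (1.0 - uptime_fraction) * MINUTES_PER_YEAR


def annual_downtime_cost(uptime_fraction: float, cost_per_minute: float) -> float:
    """Yearly cost of downtime, assuming a flat cost per minute."""
    return annual_downtime_minutes(uptime_fraction) * cost_per_minute


if __name__ == "__main__":
    # 0.99 is the article's assumption; the other two are illustrative.
    for uptime in (0.99, 0.999, 0.9999):
        minutes = annual_downtime_minutes(uptime)
        cost = annual_downtime_cost(uptime, 1000.0)
        print(f"{uptime:.2%} uptime: {minutes:,.0f} min/yr down, ${cost:,.0f}/yr")
```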

Traditional approaches to fault tolerance

Vendors offer many methods to increase network reliability and availability--all based on adding hardware. How the redundant hardware is switched into service is a common classification. The following are some of the methods used to achieve fault tolerance:

adopting software-based protocols such as Spanning Tree, which is universally used by vendors for link management

trunking of multiple parallel links, which typically include redundant features, for increased bandwidth

installing network interface cards (NICs) with failover software in the server

deploying server clustering.

Which method to use depends on the network topology, the number of switches, and the required recovery time. All of these methods require additional hardware (e.g., duplicate ports, switches, or routers) operating in conjunction with protocols such as Spanning Tree or with drivers for the system software. And coordinating these elements is often difficult. Network managers must wrestle with hardware and software compatibility issues, predict the worst-case switchover time, and determine whether the network layer sessions will survive the chosen redundancy method.

Deploying a distributed switch architecture increases the number of points of failure on the network. Demand for onboard reliability, network availability, and other fault-tolerant features has resulted in vendors incorporating these functions in low-end stackable switches and hubs. Many switch vendors offer redundant, load-sharing, "hot swappable" N+1 AC power supplies, dual fans (to prevent overheating), redundant management modules, or duplicate port expansion/network modules in the chassis to ensure high network availability.

Some network managers take preventive measures by duplicating all the equipment in the network, using two switches, for example. This method allows fully meshed network topologies and active, redundant links. It works well for video applications that require low bandwidth and can tolerate slight delays in recovery time. For time-critical, high-bandwidth applications, however, the use of Spanning Tree or other software-based protocols may cause network sessions to time out during the re-convergence process.

The Spanning Tree protocol is used by switch vendors to provision multiple paths through an Ethernet network. It facilitates a completely fault-tolerant design by allowing multiple links at every point in the network. Although Spanning Tree offers a measure of loop-free network redundancy, its notoriously slow failover times do not provide "instantaneous" recovery in the event of a link failure. According to the available research, it can take up to 50 sec (counting parameters such as the forwarding delay time and the listening and learning states) to recalculate a "spanning tree" following a network change. Tuning the parameters to optimize convergence results, at best, in a recovery time of 30 sec, which does not include delay times in message delivery, session interruptions, or timeouts. The recalculation can also create backbone problems with the increased traffic flow.
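One way to see where figures of this size come from is to add up the standard Spanning Tree timers. The sketch below assumes the IEEE 802.1D default values (a maximum age of 20 sec and a forward delay of 15 sec, applied once in the listening state and once in the learning state); the class and function names are illustrative. Reading the 30-sec figure as the listening and learning delays alone is an interpretation, not something the research quoted above states.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StpTimers:
    """Spanning Tree timers in seconds (IEEE 802.1D default values)."""
    max_age: int = 20        # time before stale topology information expires
    forward_delay: int = 15  # time spent in each of listening and learning


def worst_case_recovery(t: StpTimers) -> int:
    """Indirect failure: wait out max_age, then listening, then learning."""
    return t.max_age + 2 * t.forward_delay


def direct_failure_recovery(t: StpTimers) -> int:
    """Failure seen on a local port: only listening and learning remain."""
    return 2 * t.forward_delay


if __name__ == "__main__":
    timers = StpTimers()
    print("worst case:", worst_case_recovery(timers), "sec")
    print("direct link failure:", direct_failure_recovery(timers), "sec")
```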

Some vendors have championed using various load-balancing modules to distribute incoming traffic to multiple servers while ensuring fault tolerance. Load balancers offer redundancy to some degree, but are not designed solely for this purpose. Their primary mission is to provide higher bandwidth by aggregating multiple links. The disadvantage is that the failover to re-route a connection in a fully active, meshed topology takes an average of 3 sec, and convergence time increases as the number of network switches goes up. Convergence for five or more network switches can take up to 30 sec. For applications requiring a continuous flow of traffic, this loss of up to 3 billion bits of data on a Fast Ethernet connection may be unacceptable.
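The 3-billion-bit figure follows from multiplying the outage time by the line rate. A minimal sketch, assuming the link runs at the nominal Fast Ethernet rate of 100 Mbit/sec for the whole outage (an upper bound, since real links are rarely saturated); the names are illustrative:

```python
FAST_ETHERNET_BPS = 100_000_000  # 100 Mbit/sec nominal line rate


def bits_lost(outage_seconds: float, line_rate_bps: int = FAST_ETHERNET_BPS) -> float:
    """Upper bound on bits that cannot be delivered during an outage."""
    return outage_seconds * line_rate_bps


if __name__ == "__main__":
    # Failover times quoted in the paragraph above.
    for label, seconds in (("average failover", 3), ("five or more switches", 30)):
        print(f"{label}: {seconds} sec -> {bits_lost(seconds):,.0f} bits")
```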

Another fault-tolerance method is to install a second NIC in a server with failover software. The software allows the backup adapter to kick in if the primary link fails. Network managers can also bind a single network address to multiple NICs, load-balance across multiple adapters, or use active connections to a single switch for increased fault tolerance and better performance. This straightforward approach allows the network to interoperate with multiple switch vendors' equipment, boasts a relatively short failover time of 3 to 6 sec, and is cost-effective at $300 to $500 per server card. However, it requires two adapters and uses up an additional Peripheral Component Interconnect (PCI) slot in the server.

Server clustering is another option. It employs duplicate servers (each with a NIC) connected via a SCSI or Fibre Channel link. The servers typically run the Microsoft Windows NT operating system. Server clustering can meet real-time or near-real-time switchover requirements; in the event of a link failure, the backup server activates within a minute. But not all applications can run in an NT clustering environment.

Most fault-tolerance methods take a finite amount of time to re-engage the network link and require the Spanning Tree protocol to route traffic. Factors such as Internet protocol (IP) session-layer timeouts, incomplete packet transmission, and network performance degradation are also a concern. Using media translators to establish redundant links between components in Ethernet networks is a little-known, but effective, safeguard to ensure fault tolerance.

Media translators for fault tolerance

Media translators are commonly used to integrate fiber optics with Ethernet technology to support network demands for increased distance and enhanced data security. To establish fault tolerance, the most efficient devices are 100Base-TX-to-TX/FX media translators, which offer full redundant paths for Fast Ethernet devices like hubs, routers, servers, and switches.

The best of these devices offer data-link duplication to ensure network integrity and provide nonstop networking capability essential for high-priority traffic and mission-critical applications. These media translators actively monitor the primary link and, upon link failure, will automatically redirect traffic to the secondary link with no interruption to network operation. When the signal is re-established, the primary link is reactivated with the secondary link on standby mode transparent to the end user. The failover time is imperceptible; fewer than 1500 bits of data are lost.
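The behavior described above (forward on the primary link while it has signal, switch to the secondary on failure, revert when the primary returns) can be modeled as a small selector. This is a hypothetical sketch of the decision logic only; actual media translators implement it in hardware, and the class and method names here are assumptions for illustration.

```python
from enum import Enum


class Link(Enum):
    PRIMARY = "primary"
    SECONDARY = "secondary"


class RedundantLinkSelector:
    """Hypothetical model: prefer the primary link whenever it has signal."""

    def __init__(self) -> None:
        self.active = Link.PRIMARY

    def update(self, primary_up: bool, secondary_up: bool) -> Link:
        """Pick the forwarding link from the latest link-status samples."""
        if primary_up:
            self.active = Link.PRIMARY      # revert as soon as signal returns
        elif secondary_up:
            self.active = Link.SECONDARY    # primary lost, standby takes over
        # If both links are down, keep the last selection; no traffic can flow.
        return self.active


if __name__ == "__main__":
    selector = RedundantLinkSelector()
    samples = [(True, True), (False, True), (False, True), (True, True)]
    for primary_up, secondary_up in samples:
        active = selector.update(primary_up, secondary_up)
        print(f"primary_up={primary_up} -> forwarding on {active.value}")
```

For scale, 1500 bits at the Fast Ethernet rate of 100 Mbit/sec corresponds to about 15 microseconds of traffic.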

Using a media translator, operators can build a fully redundant switched network without using Spanning Tree or other software-based protocols. The translator is typically connected to a standards-based NIC in the server with either copper or fiber media. By deploying this device, network managers can cost-effectively safeguard a fully meshed, switched network with multiple paths.

Many fault-tolerant approaches have limitations and drawbacks. Most methods involve significant capital investment in duplicate equipment or additional support personnel, especially in the case of highly technical clustering software. Other approaches can significantly reduce re-convergence time, but rely on proprietary protocols and equipment. Often, reliance on software can lengthen recovery times because it forces continual retransmissions of TCP (transmission control protocol)/IP packets, causing session timeouts or breakdowns. Re-convergence times also grow with the number of switches in the network--large switched networks will suffer longer delays. In addition, Spanning Tree protocol parameters dictate failover times of as long as 50 sec.

Media translators offer a few unique advantages over traditional redundancy methods and give the network administrator an alternative for providing fault tolerance for time-sensitive, mission-critical applications. Used in conjunction with standards-based NICs and server failover software, these devices can help eliminate all points of failure in the network, preventing data loss due to cable, port, or catastrophic switch failures. Network managers running applications requiring session integrity and demanding 24-hour/seven-day reliability cannot afford to have a network failure. Employing media translators is an effective way to add an extra measure of reliability in today's LAN market.

Darrell Furlong is the senior vice president of research and development and the chief technology officer at LANCAST Inc. (Nashua, NH).