Advanced Network Automation for Next-Generation Optical Transport Networks
With the introduction of 4G and 5G mobile technology, the increasing number of networked devices, and the exploding number of services relying on connectivity, such as Internet of Things (IoT), virtual reality (VR), augmented reality (AR), gaming, and any other cloud offering, communications networking in general and optical transport networking in particular have become an increasingly central infrastructure for a large part of the global economy.
The optical networking industry is successfully addressing this important role through speed of innovation. Photonic and electronic integration have enabled smaller, more power-efficient optical transmission systems with higher throughput. With a two-year cadence, the capacity carried on a single wavelength within a DWDM wavelength band has moved from the already disruptive introduction of coherent 100G transmission per channel to 200G, 400G, 600G, and 800G wavelength capacities. The wavelength spectrum that has traditionally been used, the C-Band, has been widened to encompass the L-Band as well, providing mechanisms to further enhance reach, spectral efficiency, and optical layer connectivity. To drive this fast innovation, network operators increasingly combine best-in-class products from different vendors into their networks.
All these innovations and capabilities have improved the scale, capacity, and power efficiency, but have also resulted in a significant amount of complexity in managing the impressive pool of bandwidth available across vendor and technology domains. To leverage this huge amount of scalability and flexibility in networks today, developing suitable new paradigms for control and network operation has now become the focus for network operators and the optical networking industry.
Combined with the rise of internet content providers (ICPs), a planet-scale networking infrastructure has been built in the last decade to sustain social media and rich media content exchange. With ICPs, we have seen significant investments in data centers, which represent an amalgamation of compute, storage, and networking infrastructure. This has also had a transformational impact on traditional telco carriers/service providers (CSPs). For CSPs, the imminent arrival of 5G – which is expected to act as a force multiplier for other innovative technologies, including artificial intelligence (AI), IoT, and edge computing – is going to result in a new cycle of network capacity growth and the associated capital investment in infrastructure. Advanced automation is going to play an important role to improve operational costs and network availability.
It is worthwhile to briefly discuss some common methods used by network operators to achieve automation today. The motivation is to automate repeatable tasks that inherently do not require any enhanced “intelligence.” That is, the tasks are fairly contained, are well-defined, and achieve very specific goals. The scope of automation itself encompasses the overall spectrum of FCAPS operations; automation tasks could include performing custom monitoring beyond the capabilities offered by the network element, or a slightly more advanced task could involve the movement of bandwidth (wavelengths or circuits) based on time-of-day considerations. Given the progressive evolution of telecom systems over time, automation tasks that are initially defined and maintained by operators gradually become first-class features that eventually are implemented natively by the networking equipment vendors as part of their software offerings.
Building Next-Generation Network Operations Centers
The changing operations paradigm from classic telco networking towards a new IT networking environment also affects the role of a network operations center (NOC). Instead of a central hub with operators in front of the network management system (NMS) graphical user interfaces, the general objective now is to establish autonomous networks.
Autonomous networks are the eventual goal beyond advanced automation (Figure 1). Contrasting autonomous networks with automated/automatic operations is akin to comparing a Level 3+ capable “hands-off” self-driving car to the automatic/semi-automatic cars that are ubiquitous today. Autonomous networks encompass several self-* properties, namely self-bootstrapping, self-forming, self-managing, and self-healing. These networks can potentially run in a driverless auto-pilot mode with zero human-to-machine interaction. Where current NOCs include network engineers who have vendor-specialized certifications, an autonomous network pushes towards NOC-less operations. This involves retasking network engineers from existing responsibilities (i.e., deploy ⇔ provision ⇔ monitor ⇔ debug) to “DevOps” roles, where they double as software developers (Dev), developing automation systems for operational (Ops) tasks. Changes in the operational tools’ landscape and the associated paradigms imply that it is imperative to make operations personnel an active part of this transformation and equip them for this new environment.
Other new and exciting technologies will further enhance this picture, in particular, artificial intelligence and machine learning (AI/ML), which are expected to play an important role in moving toward a NOC-less network.
The ability to learn from existing information and predict future capacity, throughput, faults, or other events will turn into an important operations tool. ML algorithms are probabilistic rather than deterministic and rely purely on data for accurate predictions. ML approaches are cost-effective in comparison to conventional heuristic or analytical approaches in cases where the problem being solved suffers from either a model deficit or an algorithm deficit. ML is a powerful data-driven “tool” to improve/extend existing solutions – automation is still the “solution” that operators will deploy, with AI/ML as technologies to achieve this goal. We briefly highlight three optical networking use cases where ML techniques have the potential to make an impact:
- Failure Prediction and Preventive Maintenance: Detect anomalous network parameters that could cause network failures, identify the root cause, and prescribe preventive actions.
- Cognitive Service Provisioning: Using historical data, an SDN-controlled autonomous network can predict traffic volume/growth and dynamically (re-)allocate resources (spectrum, wavelengths, circuits, etc.).
- Quality of Transmission (QoT) Estimation: Use ML to overcome model deficits of existing heuristic approaches and improve accuracy. The optical lightpath performance can be estimated by learning the characteristics from optical devices, especially in open disaggregated optical networks.
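To make the first use case concrete, the following is a minimal sketch of anomaly detection on a stream of performance readings. It is a simple rolling z-score baseline, not a trained ML model and not a method described in this article; the function name, window size, threshold, and the sample received-power values are all hypothetical and chosen only for illustration.

```python
from collections import deque
from statistics import mean, stdev


def find_anomalies(samples, window=5, threshold=3.0):
    """Return indices of samples that deviate from the mean of the
    preceding `window` samples by more than `threshold` standard deviations."""
    history = deque(maxlen=window)  # trailing window of recent readings
    anomalies = []
    for i, value in enumerate(samples):
        if len(history) == window:
            mu = mean(history)
            sigma = stdev(history)
            # Skip the check when the window is perfectly flat (sigma == 0).
            if sigma > 0 and abs(value - mu) > threshold * sigma:
                anomalies.append(i)
        history.append(value)
    return anomalies


# Hypothetical received optical power readings in dBm (illustrative data only).
rx_power_dbm = [-10.1, -10.0, -10.2, -10.1, -10.0, -10.1, -13.5, -10.1, -10.0]
print(find_anomalies(rx_power_dbm))  # index of the sudden power drop
```

A production system would typically replace the fixed threshold with a learned model and feed the flagged indices into root-cause analysis; the sketch shows only the detection step.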
A NOC operator today uses vendor-specific NMSs to provision connectivity services, monitor alarms and performance data, and manually trigger actions to fix issues. With the prediction capabilities, operational practices will change substantially: The cognitive network will be expected to proactively predict impending issues, take preventive action whenever possible, and trigger steps to deal with those issues. Using AI/ML methodologies will enable operators to cover a much wider set of topics for a more versatile reaction to events. Once there are no repetitive tasks that need to be done from a central location, we move closer to achieving NOC-less operations, allowing operators to concentrate on optimizing their automated environments.
Operator and Vendor Collaboration
Finally, while the promise of automation is apparent, there are several challenges in migrating to next-generation autonomous optical networks that involve all the players in the automation ecosystem.
Standardized tools/APIs are crucial for multi-vendor integration. Progress in standardization activities (IETF, ONF, OIF, MEF, TIP) is slow, with parallel efforts trying to achieve the same outcomes due to siloed development. For operators, expanding the skills of the operational workforce is necessary, but involves financial/business investments and a change towards a mindset supporting a DevOps model (from executives to network engineers). Onboarding AI/ML-based solutions poses further challenges. As the accuracy of learning-based systems is fully dependent upon the quality of data, organizational and business boundaries within operator organizations make it challenging to collect, sanitize, and share collected data, which results in disparate pools of data.
Meanwhile, equipment vendors face their own set of challenges. The primary challenge is to facilitate interaction with operators and data sharing by operators – which is difficult for privacy, business, and in some cases intellectual property reasons. Vendors need to learn from operators how their networking gear behaves in the field to improve performance and reliability. Operators should be encouraged to share their operational experiences with vendors to, in turn, enable vendors to build use-case-driven, high-value automation solutions for operators. Recently, in the context of ML, several key software and hardware industry players have joined forces to establish the Open Neural Network Exchange, which strives to define common data formats and open source building blocks for ML and deep learning models. A similar initiative is required in the networking community to bring the key players together to build reusable automation and AI/ML frameworks.
Parthiban Kandappan is chief technology officer at Infinera.
Sidebar: Components of Automation
The introduction of software-defined networking (SDN) has pushed the networking industry (optical included) towards open and standardized interfaces, from routing to photonic layers, allowing greater control and visibility into the network. With well-defined APIs from the NEs, the SDN controller can manage the data, control and management planes of the NE. As the lines between legacy NMSs and an SDN controller blur, the “SDN controller layer” is coalescing many other network functions, hosting automation scripts & tasks, archiving of performance monitoring (PM) data, real-time planning, and in the near future, AI/ML frameworks for cognitive, predictive/proactive analytics. A few key components that are necessary for automation in the near term:
- Programmable Optical Hardware and NEs: The ability to automate requires fundamental capabilities in the optical data plane to enact the operator's intent. These capabilities span from colorless/directionless/contentionless (CDC) ROADMs and hybrid ODU/packet switches to highly spectrally efficient coherent WDM interfaces (that enable fine-grained tradeoff between capacity and reach). These characteristics enable optical networks to be remotely reconfigured in response to changing traffic conditions, for instance, by automatically migrating a circuit to a low-latency/high-priority path based on time-of-day.
- Automated Network Equipment Provisioning: Zero-touch provisioning (ZTP) is the ability to commission optical devices with very little human involvement. Field personnel install equipment within the NOC and only perform mechanical and power installation procedures.
- SDN Control, Programmability, and APIs: SDN transport has brought in extensive API frameworks based on YANG data models known as model-driven networking (MDN). MDN normalizes network functions across vendor implementations through data abstractions, which are specified as YANG models. MDN results in separation of intent (what) from actuation (how), which is critical to scale operations in multi-vendor environments.
- Streaming Telemetry: Modern devices support streaming telemetry-based performance monitoring, which resolves the limitations of legacy SNMP pull-based monitoring. The devices can stream (push) data at varying frequencies (from seconds to minutes), which allows improved monitoring and observability. For AI/ML applications, where timely data is key to improved prediction accuracy, telemetry allows real-time tracking of key performance metrics, which further enhances predictive/proactive analytics.
- Analytics and Machine Learning Frameworks: Popular public cloud providers and other software vendors are increasingly providing “AI/ML and analytics as a service” offerings, allowing an easier starting point than having to build software stacks from scratch. These approaches increasingly provide benefit to network operations, e.g., through the analyses of traffic flows and traffic prediction, optical performance analytics and optimization, as well as identifying security breaches or denial of service attacks.
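The separation of intent (what) from actuation (how) described under model-driven networking can be illustrated with a short sketch. The vendor names, payload shapes, and function names below are hypothetical; in a real deployment the abstraction would be a YANG model and the actuation would go through a protocol such as NETCONF or RESTCONF, neither of which is shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WavelengthIntent:
    """Vendor-neutral 'what': a wavelength service between two nodes."""
    src: str
    dst: str
    rate_gbps: int


def render_vendor_a(intent):
    # Hypothetical vendor A payload: nested structure with a string line rate.
    return {"service": {"a-end": intent.src, "z-end": intent.dst,
                        "line-rate": f"{intent.rate_gbps}G"}}


def render_vendor_b(intent):
    # Hypothetical vendor B payload: flat structure with a numeric capacity.
    return {"endpoints": [intent.src, intent.dst],
            "capacity_gbps": intent.rate_gbps}


# The 'how': one renderer per vendor, selected at actuation time.
RENDERERS = {"vendor-a": render_vendor_a, "vendor-b": render_vendor_b}


def actuate(intent, vendor):
    """Translate a single intent into the payload a given vendor expects."""
    return RENDERERS[vendor](intent)


intent = WavelengthIntent(src="NYC", dst="CHI", rate_gbps=400)
for vendor in RENDERERS:
    print(vendor, actuate(intent, vendor))
```

The operator's automation code only ever constructs the intent; adding a vendor means adding a renderer, not changing the callers, which is what lets this pattern scale in multi-vendor environments.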
Operators are also considering policy-driven cognitive systems that are used to (re-)configure the network to accommodate dynamicity, helping automate daily tasks depending on user-specified conditions. Such a policy-based engine can be integrated with or without AI/ML capabilities, providing an optimal ecosystem for automation. Integration with AI/ML components provides the foundation for closed-loop autonomous actions. One such example was recently demonstrated by a North American operator when a policy-based system was able to dynamically reallocate/migrate optical capacity to support Ethernet bandwidth-on-demand services utilizing multiple best-in-class open source tools for streaming, messaging, data collection, and learning.
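As a rough illustration of a policy-driven engine acting on user-specified conditions, the sketch below evaluates a list of rules against a snapshot of network state and returns the actions whose conditions hold. The policy names, state keys, thresholds, and action strings are all hypothetical and do not describe the operator demonstration mentioned above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

State = Dict[str, float]  # snapshot of observed values, e.g. from telemetry


@dataclass
class Policy:
    name: str
    condition: Callable[[State], bool]  # user-specified trigger
    action: str                         # label of the action to request


def evaluate(policies: List[Policy], state: State) -> List[str]:
    """Return the actions of all policies whose condition holds for `state`."""
    return [p.action for p in policies if p.condition(state)]


# Hypothetical policies: a time-of-day rule and a utilization rule.
policies = [
    Policy("peak-hours-low-latency",
           lambda s: 8 <= s["hour"] < 18,
           "move-circuit-to-low-latency-path"),
    Policy("relieve-congestion",
           lambda s: s["link_utilization"] > 0.8,
           "add-wavelength"),
]

print(evaluate(policies, {"hour": 9, "link_utilization": 0.85}))
print(evaluate(policies, {"hour": 22, "link_utilization": 0.40}))
```

In a closed-loop setup, the condition functions could consume predictions from an ML model instead of raw measurements, while the rule structure stays the same.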
About the Author
Parthiban Kandappan
Chief Technology Officer, Infinera