The Challenge of Delivering Continuous Network Operations
Introduction
• Continuous system operations: Redundant system design with intelligent software providing carrier-class service availability
• Network resiliency: Redundant network design with intelligent software to protect against unexpected network link and node failures
Continuous System Operations Overview
Hardware Redundancy
• Route switch processors (RSPs): RSPs are deployed in an "active/standby" configuration. The Cisco ASR 9000 Route Switch Processor is designed with load-shared redundancy to support software upgrades and software patches.
• Switch fabric: An active/active configuration distributes the traffic load across both switch fabrics, taking advantage of the processing capacity of both. Because both switch fabrics are active and forwarding traffic, each is ready to assume the full traffic load. If one fails, the remaining switch fabric continues to forward traffic in the system, with hardware support for zero packet loss on fabric online insertion and removal (OIR).
• Power supplies: Redundant power supplies are deployed in load-balanced configurations that share the load across all power supplies. You can also configure the power supplies in 1:1 and 1:N modes to provide power-supply redundancy. An example of the latter is 3:1 redundancy, in which one redundant power supply backs up the other three.
• Fan trays: Redundancy is offered within a single fan tray (that is, through multiple fans) and between fan trays. If a fan tray fails, the remaining fan tray keeps the system cooled while the system raises an alarm, giving service providers a time window in which to replace the failed fan tray.
• Line cards: The Cisco ASR 9000 Series handles line-card faults by bundling and protecting ports across multiple line cards using IEEE 802.3ad Link Aggregation. Up to eight interfaces on different line cards can be bundled into a single logical Layer 2 or Layer 3 connection. If any port in a bundle fails, traffic fails over quickly to the remaining ports, which provides more flexibility than simple line-card redundancy and allows service providers to support stringent customer SLAs.
• Operating system: System infrastructure components are distributed to all cards in the system, and relevant data is replicated on different cards based on usage. This setup avoids single points of failure and allows distribution of applications based on resource availability.
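The line-card bundling described above can be illustrated with a minimal sketch. It is a toy model, not the Cisco ASR 9000 or IEEE 802.3ad algorithm: the `Bundle` class, the port names, and the CRC-based flow hash are all assumptions made for illustration. It shows how a flow is pinned to one member port and moves to a surviving member when that port fails.

```python
import zlib


class Bundle:
    """Toy model of a link bundle whose member ports sit on different line cards."""

    MAX_MEMBERS = 8  # the text cites bundles of up to eight interfaces

    def __init__(self, members):
        if not 1 <= len(members) <= self.MAX_MEMBERS:
            raise ValueError("a bundle holds between 1 and 8 member ports")
        self.members = list(members)
        self.up = set(members)

    def fail(self, port):
        """Mark a member port as down."""
        self.up.discard(port)

    def restore(self, port):
        """Mark a known member port as up again."""
        if port in self.members:
            self.up.add(port)

    def pick(self, flow):
        """Map a flow key to an active member port (hypothetical hash choice)."""
        active = [m for m in self.members if m in self.up]
        if not active:
            raise RuntimeError("bundle is down: no active member ports")
        return active[zlib.crc32(flow.encode()) % len(active)]


bundle = Bundle(["lc0/p0", "lc0/p1", "lc1/p0", "lc1/p1"])
flow = "10.0.0.1->10.0.0.2:443"
before = bundle.pick(flow)
bundle.fail(before)            # simulate a port or line-card failure
after = bundle.pick(flow)      # the flow is rehashed onto a surviving port
print(before, "->", after)
```

Hashing on a flow key keeps packets of one flow on one port, which preserves ordering; the cost in this simple scheme is that a failure can also remap flows that were not on the failed port.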
Software Resiliency Through Cisco IOS XR Software
Modularity
• Release modularity: Cisco IOS XR Software is based on a development model in which features consist of components. These components are aggregated into installation packages and composites that can be independently upgraded, and are pretested and certified for use in service provider networks.
• Run-time modularity: Deployed features and components are broken down into processes, supporting fine granularity of fault isolation, restarts, and upgrade capability. Cisco IOS XR Software avoids a performance penalty by supporting multiple threads that perform tasks in parallel, taking full advantage of the Cisco ASR 9000 Series hardware architecture.
• Physical distribution of components: Software components are distributed and replicated across line cards and RSPs, creating fault isolation for resiliency.
• Logical distribution of components: Cisco IOS XR Software separates software into three distinct planes -- the control, management, and data planes. Planned or unplanned outages on any of the planes do not affect services on others.
Process Independence and Restart
Fault Handling
• Fault detection and correction: Both Cisco ASR 9000 Series hardware and software support fault detection and correction. In hardware, the router offers error correcting code (ECC)-protected memory. If a memory corruption occurs, the system automatically restarts the affected processes to fix the problem with minimum effect. If the problem is persistent, the Cisco ASR 9000 supports switchover and OIR capabilities to allow replacement of defective hardware without affecting services on other hardware components in the system.
• Resource management: As part of its fault-handling capabilities, the Cisco ASR 9000 Series supports resource threshold monitoring for CPU and memory usage to improve out-of-resource (OOR) management. When threshold conditions are met or exceeded, the system generates an OOR alarm to notify operators. The system then automatically attempts recovery, and allows the operator to configure flexible policies using the Cisco IOS Software Embedded Event Manager (EEM). The system also reserves some system memory so that the operator can log in and clean up the system during worst-case OOR conditions. This setup provides a proactive, rather than a reactive, solution that avoids router reset and network reconvergence.
• Switchover design: Cisco IOS XR Software allows system processes such as the TCP/IP stack, device drivers, routing protocols, and signaling stacks to be restarted on individual RSPs or line cards without causing a service outage. When process dependencies are distributed across separate, failed hardware or software components, however, recovery can take a long time. To support continuous system operations, the Cisco ASR 9000 therefore supports a fast switchover of traffic when line-card protection mechanisms are enabled. It also uses redundant RSPs in a rapid and flexible RSP switchover configuration while maintaining NSF and NSR, allowing services to be designed to the most appropriate active or standby mode based on their scale, performance, and availability requirements.
• Event management: Cisco ASR 9000 embedded manageability offers mechanisms such as fault-injection testing to detect hardware faults during lab testing, a system watchdog mechanism to recover failed processes, and tools such as the Route Consistency Checker to diagnose inconsistencies between the routing and forwarding tables.
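The resource threshold monitoring described under "Resource management" can be sketched as follows. This is an illustration only: the three alarm levels, the percentage values, and the function names are assumptions, not the Cisco IOS XR implementation. The sketch classifies each utilization sample and reports an alarm only when the level changes, using "met or exceeded" (`>=`) as the trigger condition, as in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """Percent-used levels at which alarms are raised (hypothetical names)."""
    minor: float
    severe: float
    critical: float


def classify(used_percent, thresholds):
    """Return the alarm level for one utilization reading."""
    if not 0.0 <= used_percent <= 100.0:
        raise ValueError("utilization must be between 0 and 100 percent")
    if used_percent >= thresholds.critical:
        return "critical"
    if used_percent >= thresholds.severe:
        return "severe"
    if used_percent >= thresholds.minor:
        return "minor"
    return "normal"


def alarms(readings, thresholds):
    """Yield (reading, level) each time the level changes between samples."""
    previous = "normal"
    for value in readings:
        level = classify(value, thresholds)
        if level != previous:
            yield value, level
            previous = level


limits = Thresholds(minor=80.0, severe=90.0, critical=95.0)
for reading, level in alarms([50, 85, 86, 96, 40], limits):
    print(f"memory at {reading}%: {level}")
```

Reporting only transitions keeps a sustained condition from flooding the operator with repeated alarms; a recovery policy, such as the EEM policies the text mentions, would hook in at the point where a transition is yielded.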
Upgradability
• OIR: In addition to supporting fault handling, when hardware needs to be upgraded to add scale, features, or performance, the Cisco ASR 9000 supports OIR for system components such as RSPs and line cards, while the system is in service and performing at full capacity.
• Programmable network processor: To support high service velocity, system longevity for capital investments, lower operational costs, and the lowest possible MTTR, Cisco ASR 9000 Linecards support software feature upgrades through the programmable network processor.
• Simplified software upgrades: Cisco IOS XR Software release modularity simplifies the installation of software upgrades. Most Cisco IOS XR Software fixes are non-service affecting, allowing customers to update a specific process or group of processes without affecting service. Operators may target particular system components for upgrades based on software packages or composites that group selected features. Cisco preconfigures and tests these packages and composites to help ensure system compatibility.
• Software maintenance upgrades: To simplify point fixes at component-level granularity, Cisco IOS XR Software uses software maintenance upgrades (SMUs) that can cross package boundaries, depending on what needs to be updated. Because SMUs are typically a short-term fix, permanent fixes roll into maintenance releases.
• Security and integrity: To support system security and integrity, Cisco IOS XR Software authenticates packages being installed and verifies version compatibility between the new packages and those in operation. If these two checks pass, the software restarts only those processes that a package advertises as changed, hence decreasing MTTR and improving system availability.
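The two installation checks and the selective restart described under "Security and integrity" can be sketched as below. This is a simplified model under stated assumptions: the package layout (a dictionary with `name`, `payload`, `requires`, and `changed_processes` keys) and the function name are hypothetical, and a SHA-256 digest comparison stands in for whatever authentication mechanism the software actually uses.

```python
import hashlib


def plan_install(package, trusted_digests, running_versions):
    """Return the processes to restart, or raise if either check fails.

    package: dict with 'name', 'payload' (bytes), 'requires' (component ->
    version), and 'changed_processes' (list of process names). This layout
    is hypothetical and exists only for illustration.
    """
    # Check 1: authenticate the package (digest comparison as a stand-in).
    digest = hashlib.sha256(package["payload"]).hexdigest()
    if trusted_digests.get(package["name"]) != digest:
        raise PermissionError("package failed authentication")

    # Check 2: verify compatibility with the versions already in operation.
    for component, version in package["requires"].items():
        if running_versions.get(component) != version:
            raise RuntimeError(f"incompatible version of {component}")

    # Both checks passed: restart only what the package advertises as changed.
    return sorted(package["changed_processes"])


payload = b"example fix"
package = {
    "name": "routing-fix",
    "payload": payload,
    "requires": {"base": "1.0"},
    "changed_processes": ["ospf", "bgp"],
}
trusted = {"routing-fix": hashlib.sha256(payload).hexdigest()}
print(plan_install(package, trusted, {"base": "1.0"}))
```

Because the restart list comes from the package itself, processes that the package does not touch keep running throughout the installation.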
Continuous Forwarding
• RSP SSO: The Cisco ASR 9000 Series maintains state information on a per-protocol basis to support stateful switchovers between the RSP modules. Critical protocols protected in this way include Open Shortest Path First (OSPF), Intermediate System-to-Intermediate System (IS-IS), Border Gateway Protocol (BGP), and Label Distribution Protocol (LDP).
• NSF: Cisco IOS XR Software supports forwarding without traffic loss during a brief outage of the control plane through signaling and routing protocol implementations for Graceful Restart extensions as standardized by the IETF. In addition to standards compliance, this implementation has been compatibility tested with Cisco IOS Software and third-party operating systems.
• Graceful Restart: This control-plane mechanism ensures high availability by allowing detection and recovery from failure conditions while preserving NSF services. Graceful Restart is a way to recover from signaling and control-plane failures without affecting the forwarding plane. Cisco IOS XR Software uses this feature together with checkpointing, mirroring, RSP redundancy, and other system resiliency features to recover before the restart timers expire and so avoid service downtime caused by network reconvergence.
• NSR: This feature allows for the forwarding of data packets to continue along known routes while the routing protocol information is being refreshed following a processor switchover. NSR maintains protocol sessions and state information across SSO functions for services such as Multiprotocol Label Switching (MPLS) VPN. TCP connections and the routing protocol sessions are migrated from the active RSP to the standby RSP after the RSP failover without letting the peers know about the failover. The sessions terminate locally on the failed RSP, and the protocols running on the standby RSP reestablish the sessions after the standby RSP goes active, without the peer detecting the change. You can also use NSR with Graceful Restart to protect the routing control plane during switchovers.
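The idea behind stateful switchover and NSR, mirroring session state to the standby RSP so that it can continue a session where the active RSP left off, can be sketched with a toy model. Everything here is an assumption for illustration: the class names, the single sequence counter used as "session state", and the synchronous mirroring step are not taken from Cisco IOS XR.

```python
class Rsp:
    """Toy route switch processor holding per-peer session state."""

    def __init__(self, name):
        self.name = name
        self.sessions = {}   # peer -> next sequence number to send


class RedundantPair:
    """Active/standby RSP pair with state mirrored on every update."""

    def __init__(self):
        self.active = Rsp("RSP0")
        self.standby = Rsp("RSP1")

    def send(self, peer):
        """Send one message and checkpoint the new state to the standby."""
        seq = self.active.sessions.get(peer, 0)
        self.active.sessions[peer] = seq + 1
        self.standby.sessions[peer] = seq + 1   # mirrored copy
        return seq

    def switchover(self):
        """Promote the standby; the replaced card resyncs from the new active."""
        failed = self.active
        self.active = self.standby
        self.standby = Rsp(failed.name)
        self.standby.sessions = dict(self.active.sessions)


pair = RedundantPair()
print(pair.send("peer-a"), pair.send("peer-a"))   # sequence numbers before failover
pair.switchover()
print(pair.active.name, pair.send("peer-a"))      # sequence continues after failover
```

Because the new active RSP resumes from the mirrored sequence number, the peer in this model sees no gap or restart in the session, which is the property NSR aims for.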
Carrier Ethernet Network Resiliency Overview
• IP routing
• MPLS Traffic Engineering-Fast Reroute (MPLS TE-FRR)
• Multicast fast convergence
• Layer 2 VPNs
IP Routing
• Fast link-failure detection: Cisco ASR 9000 Linecards support interrupt-based loss-of-signal detection, which can detect link- and port-level hardware failure in a few milliseconds. Such failures are signaled to the RSP, which can then trigger Interior Gateway Protocol (IGP) and MPLS reconvergence.
• Distributed Bidirectional Forwarding Detection (BFD): BFD can be used to quickly detect forwarding-path failures and trigger the routing protocol to provide fast convergence. It can be used with IS-IS, OSPF, MPLS TE-FRR, and Protocol Independent Multicast (PIM). The Cisco ASR 9000 Series implements BFD in a distributed fashion, where the line cards are equipped with a powerful local CPU and intelligent Cisco IOS XR Software system, enabling the support of thousands of BFD sessions per line card with a configurable hello timer as low as 15 msec.
• Prefix prioritization: This feature lets the network administrator specify which prefixes converge first. For example, the IPTV source prefix of a video server application can be given a high priority. When the routing topology changes (for example, because of a link failure), the high-priority IPTV source prefix is reconverged first, reducing downtime for video services.
• IP FRR: This Cisco IOS XR Software innovation provides subsecond IP fast convergence for both IS-IS and OSPF routing protocols in a properly designed network topology. By taking advantage of these protocols, the Cisco ASR 9000 can extend superior routing performance and fast convergence into Carrier Ethernet transport networks to increase network resiliency.
• BGP fast convergence: The Cisco ASR 9000 supports many advanced BGP fast-convergence features in Cisco IOS XR Software, including BGP next-hop tracking, BGP local convergence upon provider edge-customer edge link failure, and BGP prefix-independent convergence (PIC) for the core and edge. For example, the BGP PIC feature provides fast convergence in a scalable way. The Internet BGP routing table has hundreds of thousands of routes, and many BGP routes share the same provider-edge next hop. The Cisco ASR 9000 Series implements the forwarding table hierarchically, so that during network reconvergence it does not need to update every BGP prefix in the forwarding table. Only the forwarding entry for the common BGP next hop is updated, resulting in a faster convergence time that is independent of the number of BGP prefixes. This feature is just one of the many Cisco IOS XR Software routing features that can help to maximize network availability.
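The hierarchical forwarding table behind BGP PIC can be illustrated with a short sketch. It is a simplified model, not the Cisco ASR 9000 data structure: the `HierarchicalFib` class and the interface names are hypothetical, and exact-match dictionary lookups stand in for longest-prefix matching. The point it demonstrates is that repairing the path to a shared next hop takes one write, however many prefixes resolve through it.

```python
class HierarchicalFib:
    """Two-level forwarding table: prefix -> shared next hop -> outgoing path."""

    def __init__(self):
        self.prefix_to_nexthop = {}   # BGP prefix -> BGP next hop
        self.nexthop_to_path = {}     # BGP next hop -> outgoing interface
        self.writes = 0               # table updates performed so far

    def add_prefix(self, prefix, nexthop):
        self.prefix_to_nexthop[prefix] = nexthop
        self.writes += 1

    def set_path(self, nexthop, interface):
        self.nexthop_to_path[nexthop] = interface
        self.writes += 1

    def lookup(self, prefix):
        """Resolve a prefix through its shared next-hop entry."""
        return self.nexthop_to_path[self.prefix_to_nexthop[prefix]]


fib = HierarchicalFib()
fib.set_path("PE1", "if0")
prefixes = [f"10.{i // 256}.{i % 256}.0/24" for i in range(1000)]
for prefix in prefixes:
    fib.add_prefix(prefix, "PE1")

before = fib.writes
fib.set_path("PE1", "if1")            # core link fails; path to PE1 moves
print(fib.writes - before)            # updates needed to reconverge
print(fib.lookup(prefixes[0]), fib.lookup(prefixes[-1]))
```

In a flat table, the same failure would require rewriting all 1,000 prefix entries, so convergence time would grow with the size of the BGP table.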
MPLS TE-FRR
Multicast Fast Convergence
Layer 2 VPN
• Pseudowire redundancy: Pseudowire redundancy creates both primary and backup pseudowires that are connected to different remote nodes in the network. When the primary pseudowire goes down, traffic can quickly switch over to the backup pseudowire, preserving access to the network.
• Hierarchical Virtual Private LAN Service (H-VPLS) pseudowire redundancy with VPLS MAC withdrawal: When an access pseudowire is used to connect into the VPLS network -- a technology known as Hierarchical VPLS (H-VPLS) -- pseudowire redundancy can be extended to protect the access pseudowire. Combined with VPLS MAC withdrawal, pseudowire redundancy in the H-VPLS scenario can be used to avoid black-holing packets after a switchover.
• Multisegment pseudowire redundancy: When an L2VPN pseudowire crosses different administrative domains, a multisegment pseudowire is typically used to stitch multiple pseudowire segments together. Pseudowire redundancy technology can be applied to multisegment pseudowire scenarios as well.
• IEEE Multiple Spanning Tree (MST) protocol: To support native IEEE Layer 2 bridging environments, the Cisco ASR 9000 supports the standard IEEE 802.1s MST protocol to protect native bridging traffic.
• MST Access Gateway: In many scenarios where native Layer 2 access and Layer 2 VPN technologies are combined to provide Layer 2 service for the end user, traditional Layer 2-based redundancy protocols such as MST do not provide sufficient protection to avoid Layer 2 forwarding loops. In these cases, a mechanism is required to connect Layer 3 MPLS and Layer 2 access for both the control and data planes. To solve this problem, the Cisco ASR 9000 MST access gateway solution was developed to provide a unique solution for aggregating Layer 2 access networks regardless of the access network topology and protocols used.
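The primary and backup pseudowire behavior described under "Pseudowire redundancy" can be sketched as a small selection function. The model is an assumption made for illustration: the `PseudowireGroup` class and the node names are hypothetical, and the choice to revert to the primary as soon as it recovers is one possible policy, not a statement about the Cisco ASR 9000.

```python
class PseudowireGroup:
    """Primary and backup pseudowires terminating on different remote nodes."""

    def __init__(self, primary, backup):
        self.primary = primary
        self.backup = backup
        self.is_up = {primary: True, backup: True}

    def set_state(self, pseudowire, up):
        """Record a state change for one of the two pseudowires."""
        if pseudowire not in self.is_up:
            raise KeyError(f"unknown pseudowire: {pseudowire}")
        self.is_up[pseudowire] = up

    def active(self):
        """Prefer the primary; fall back to the backup; None if both are down."""
        if self.is_up[self.primary]:
            return self.primary
        if self.is_up[self.backup]:
            return self.backup
        return None


group = PseudowireGroup(primary="pw-to-node-a", backup="pw-to-node-b")
print(group.active())                      # primary carries traffic
group.set_state("pw-to-node-a", False)     # primary pseudowire goes down
print(group.active())                      # traffic switches to the backup
```

Terminating the two pseudowires on different remote nodes, as the text describes, means a failure of the remote node itself does not take down both paths.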
Summary