Networking Technical Note: HA: NSB NSR ISSU

Modern high-performance routers architecturally separate the forwarding plane and the control plane into separate physical components, each with its own memory and processors. The control plane runs the routing protocols, maintains the necessary databases for route processing, and derives a forwarding table (FIB). The FIB is given to the forwarding plane, which is responsible for packet forwarding.

In fact, the control plane could stop functioning altogether and the forwarding plane, being a separate entity with its own processors, could continue forwarding packets based on its copy of the FIB. This is Non-Stop Forwarding (NSF): the ability of the forwarding plane to continue running “headless” if the control plane stops.
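As a rough illustration of "headless" forwarding, here is a minimal Python sketch, not any vendor's implementation. The class names, the `publish` and `forward` methods, and the interface names are all hypothetical; the only point is that the forwarding plane keeps answering lookups from its own FIB copy after the control plane stops.

```python
import ipaddress


class ForwardingPlane:
    """Holds its own copy of the FIB, independent of the control plane."""

    def __init__(self):
        self.fib = {}  # ip_network -> next-hop interface (hypothetical names)

    def install_fib(self, routes):
        # The control plane hands over a table; the forwarding plane keeps a copy.
        self.fib = {ipaddress.ip_network(p): nh for p, nh in routes.items()}

    def forward(self, dst):
        # Longest-prefix match against the local FIB copy.
        addr = ipaddress.ip_address(dst)
        matches = [net for net in self.fib if addr in net]
        if not matches:
            return None
        best = max(matches, key=lambda net: net.prefixlen)
        return self.fib[best]


class ControlPlane:
    """Derives the FIB and pushes it down; can fail on its own."""

    def __init__(self, forwarding_plane):
        self.fp = forwarding_plane
        self.alive = True

    def publish(self, routes):
        if not self.alive:
            raise RuntimeError("control plane is down; FIB cannot be updated")
        self.fp.install_fib(routes)


fp = ForwardingPlane()
cp = ControlPlane(fp)
cp.publish({"10.0.0.0/8": "ge-0/0/1", "10.1.0.0/16": "ge-0/0/2"})

cp.alive = False              # the control plane stops
hop = fp.forward("10.1.2.3")  # forwarding continues from the stale FIB copy
```

Once `cp.alive` is false, no new routes can be installed, so the FIB copy can only go stale.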

Of course this is dangerous; if the network topology changes while the control plane is down there is no way to process new route information and the forwarding plane’s FIB can become invalid, resulting in incorrectly forwarded packets. So why would you even want NSF?

The answer is redundant control planes (Cisco calls their control planes Route Processors; Juniper calls them Routing Engines). NSF allows you to switch from a primary to a backup control plane without disrupting forwarding. The FIB could still become invalid during the period between when the primary control plane goes down and the backup control plane takes over, but the risk in this period is usually an acceptable compromise.

Switchover time still matters, though. If the backup control plane maintains a copy of the active configuration and the current state of system components such as interfaces, it can become active much faster than if it first had to learn all this information. Cisco calls this Stateful Switchover (SSO) and Juniper calls it Graceful Routing Engine Switchover (GRES).
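The difference between a stateful and a non-stateful switchover can be sketched roughly as follows. This is a toy model under stated assumptions, not how SSO or GRES is actually implemented: the `RedundantChassis` and `RouteProcessor` names, and the idea of returning a list of "still missing" state, are invented for illustration.

```python
import copy


class RouteProcessor:
    """One control plane (hypothetical model: just config and interface state)."""

    def __init__(self, name):
        self.name = name
        self.config = {}
        self.interface_state = {}


class RedundantChassis:
    def __init__(self, stateful):
        self.active = RouteProcessor("rp0")
        self.standby = RouteProcessor("rp1")
        self.stateful = stateful

    def update(self, config=None, interface_state=None):
        self.active.config.update(config or {})
        self.active.interface_state.update(interface_state or {})
        if self.stateful:
            # Stateful mode: mirror every change to the standby as it happens.
            self.standby.config = copy.deepcopy(self.active.config)
            self.standby.interface_state = copy.deepcopy(self.active.interface_state)

    def switchover(self):
        """Promote the standby; return what it still has to learn before it is useful."""
        failed_name = self.active.name
        self.active, self.standby = self.standby, RouteProcessor(failed_name)
        missing = []
        if not self.active.config:
            missing.append("config")
        if not self.active.interface_state:
            missing.append("interface_state")
        return missing


sso = RedundantChassis(stateful=True)
sso.update(config={"hostname": "r1"}, interface_state={"ge-0/0/1": "up"})
sso_missing = sso.switchover()    # nothing left to learn

cold = RedundantChassis(stateful=False)
cold.update(config={"hostname": "r1"}, interface_state={"ge-0/0/1": "up"})
cold_missing = cold.switchover()  # must relearn everything first
```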

The problem with control plane switchovers as described so far, even when stateful procedures are used to decrease the switchover time, is that routing protocol adjacencies are broken by the switchover. When a primary control plane goes down, any neighboring router that had a peering session with it sees the peering session fail. When the backup control plane becomes active it re-establishes the adjacency, but in the interim the neighbor has advertised to its own neighbors that router X is no longer a valid next hop to any destinations beyond it, and that they should find another path. And of course when the backup control plane comes online and re-establishes adjacencies, its neighbors advertise that router X is again available as a next hop and everyone should again recalculate best paths. All of this can be highly disruptive to the network.

The objective of Non-Stop Routing (NSR) is to prevent, or at least minimize, the effect of broken peering sessions.

A first attempt at controlling broken adjacencies during control plane switchovers is Graceful Restart (GR) protocol extensions. Each routing protocol has its own specific GR extensions, but they all work pretty much the same way. When a router’s control plane goes down, its neighbors, rather than immediately reporting to their own neighbors that the router has become unavailable, wait a certain amount of time (the grace period). If the router’s control plane comes back up and re-establishes its peering sessions before the grace period expires, as would be the case during a control plane switchover, the temporarily broken peering sessions do not affect the network beyond the neighbors.

There are, however, a couple of problems with GR:
1. Neighbors are required to support the GR protocol extensions, yet small customer edge (CE) routers are less likely to support GR.
2. If there is a complete control plane or router failure rather than just a switchover, the GR grace period can slow network reconvergence.
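The grace-period logic and both caveats above can be sketched from the neighbor's point of view. This is a simplified timing model only; the `GracefulNeighbor` class, its fields, and the 120-second value are assumptions chosen for illustration and are not taken from any protocol specification.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GracefulNeighbor:
    """A neighbor of the restarting router (hypothetical model)."""

    grace_period: float       # seconds the neighbor is willing to wait
    supports_gr: bool = True  # small CE routers may not support GR at all

    def withdrawal_time(self, session_lost_at: float,
                        session_restored_at: Optional[float] = None) -> Optional[float]:
        """Return when this neighbor tells its own peers the router is gone.

        None means the outage is never propagated beyond this neighbor.
        """
        if not self.supports_gr:
            # No GR support: the failure is reported immediately.
            return session_lost_at
        deadline = session_lost_at + self.grace_period
        if session_restored_at is not None and session_restored_at <= deadline:
            # Session came back in time: nothing is advertised.
            return None
        # Real failure: reconvergence is delayed until the grace period expires.
        return deadline


gr_peer = GracefulNeighbor(grace_period=120.0)
switchover = gr_peer.withdrawal_time(session_lost_at=0.0, session_restored_at=30.0)
hard_failure = gr_peer.withdrawal_time(session_lost_at=0.0)

ce_peer = GracefulNeighbor(grace_period=120.0, supports_gr=False)
no_gr = ce_peer.withdrawal_time(session_lost_at=0.0, session_restored_at=30.0)
```

In this model a switchover that completes in 30 seconds is invisible beyond the neighbor, a hard failure is not reported until second 120, and a neighbor without GR support reports the loss at once.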

A newer generation of NSR uses internal processes to keep the backup control plane aware of routing protocol state and adjacency maintenance activities, so that after a switchover the backup control plane can take charge of the existing peering sessions rather than having to establish new ones. The switchover is then transparent to the neighbors, and because the NSR process is internal (and vendor specific) there is no need for the neighbors to support any kind of protocol extension.

Here’s where the confusion comes in: Different vendors use these terms differently. Juniper, for example, calls its graceful restart implementation Graceful Restart, whereas Cisco calls its graceful restart implementation Non-Stop Forwarding Awareness (even though GR applies to routing, not forwarding). Juniper users often confuse GRES and GR: Although the “G” in both acronyms stands for “Graceful,” GRES and GR are two different things. And both Cisco and Juniper have internal NSR capabilities, but the circumstances in which each can be used are quite different.

So enjoy the circus, but be aware that different vendors sometimes use different names for essentially the same act. When a vendor talks about NSF, GR, and NSR, be sure you know that vendor’s definition of each term.

RPR
Route Processor Redundancy (RPR) enables a quicker switchover between an active and standby Route Switch Processor (RSP) if the active RSP experiences a fatal error. When you configure RPR, the standby RSP loads a Cisco IOS image on bootup and initializes itself in standby mode. In the event of a fatal error on the active RSP, the system switches to the standby RSP, which reinitializes itself as the active RSP, reloads all of the line cards, and restarts the system.

RPR+
The RPR+ feature is an enhancement of the RPR feature. RPR+ keeps the Versatile Interface Processors (VIPs) from being reset and reloaded when a switchover occurs between the active and standby RSPs. Because the VIPs are not reset, their microcode is not reloaded, and the time needed to parse the configuration is eliminated, switchover time is reduced to 30 seconds.

SSO
SSO establishes one of the supervisor engines as active while the other supervisor engine is designated as standby, and then SSO synchronizes information between them. A switchover from the active to the redundant supervisor engine occurs when the active supervisor engine fails, or is removed from the router, or is manually shut down for maintenance. This type of switchover ensures that Layer 2 traffic is not interrupted.
SSO switchover preserves FIB and adjacency entries and can forward Layer 3 traffic after a switchover. Configuration information and data structures are synchronized from the active to the redundant supervisor engine at startup and whenever changes to the active supervisor engine configuration occur.

ISSU: In-Service Software Upgrade (Cisco)
Requires dual supervisors.
1. Primary and Standby Supervisors Running Current Image
2. Load New Image on Standby Supervisor
3. Make Standby Supervisor “Active” (<150ms)—Switch Now Running New Image
4. Rapid Rollback Option (<150ms) if Necessary
5. Load New Image on Primary Supervisor and Commit Change

Monday, August 23, 2010
