Networking Technical Note: July 2010

Tuesday, July 27, 2010

BPDU protection

EX-series switches provide Layer 2 loop prevention through Spanning Tree Protocol (STP), Rapid Spanning Tree protocol (RSTP), and Multiple Spanning Tree Protocol (MSTP). Configure BPDU protection on interfaces to prevent them from receiving BPDUs that could result in STP misconfigurations, which could lead to network outages.

Enable BPDU protection on switch interfaces connected to user devices or on interfaces on which no BPDUs are expected, such as edge ports. If a BPDU is received on a BPDU-protected interface, the interface is disabled and stops forwarding frames.

Configure BPDU protection, use below CLI:

set protocols rstp interface ge-0/0/5 edge
set protocols rstp interface ge-0/0/6 edge
set protocols rstp bpdu-block-on-edge

Use below cli to check edge bpdu protoction is configured correctly:

ser@switch> show spanning-tree interface

Spanning tree interface parameters for instance 0

Interface Port ID Designated Designated Port State Role
port ID bridge ID Cost
ge-0/0/2.0 128:515 128:515 32768.0019e2503f00 20000 BLK DIS
ge-0/0/3.0 128:516 128:516 32768.0019e2503f00 20000 FWD DESG
ge-0/0/4.0 128:517 128:517 32768.0019e2503f00 20000 FWD DESG
ge-0/0/5.0 128:518 128:518 32768.0019e2503f00 20000 BLK DIS (Bpdu—Incon) <<<<<<<
ge-0/0/6.0 128:519 128:519 32768.0019e2503f00 20000 BLK DIS (Bpdu—Incon) <<<<<<<
ge-0/0/7.0 128:520 128:1 16384.00aabbcc0348 20000 FWD ROOT
ge-0/0/8.0 128:521 128:521 32768.0019e2503f00 20000 FWD DESG

When BPDUs arereceived from interface ge-0/0/5.0 and interface ge-0/0/6.0 the output from the operational mode command show spanning-tree interface shows that the interfaces have transitioned to a BPDU inconsistent state. The BPDU inconsistent state makes the interfaces block and prevents them from forwarding traffic.

Disabling the BPDU protection configuration on an interface does not unblock the interface. If the disable-timeout statement has been included in the BPDU configuration, the interface automatically returns to service after the timer expires. Otherwise, use the operational mode command clear ethernet-switching bpdu-error to unblock the interface.

Monday, July 12, 2010

pvst/pvst+

In Ethernet switched environments where multiple Virtual LANs exist, spanning tree can be deployed per Virtual LAN. Cisco's name for this is per VLAN spanning tree (PVST and PVST+, which is the default protocol used by Cisco switches). Both PVST and PVST+ protocols are Cisco proprietary protocols and they cannot be used on 3rd party switches, although Force10 Networks and Extreme Networks support PVST+, Extreme Networks does so with two limitations (lack of support on ports where the VLAN is untagged/native and also on the VLAN with ID 1). PVST works only with ISL (Cisco's proprietary protocol for VLAN encapsulation) due to its embedded Spanning tree ID. Due to high penetration of the IEEE 802.1Q VLAN trunking standard and PVST's dependence on ISL, Cisco defined a different PVST+ standard for 802.1Q encapsulation. PVST+ can tunnel across a MSTP Region.

PVST works only on ISL trunks. That is because an ISL trunk natively
supports multiple spanning trees per vlan. A ISL header has a bit
dedicated to indicate that the packet is a BPDU packet or not, so it is
very easy to seperate different BPDUs of different VLANs:

BPDU Flag VLAN TAG
BPDU VLAN 10 1 10
Normal packet VLAN 10 0 10
BPDU VLAN 15 1 15
Normal packet VLAN 15 0 15
etc.....

PVST+ is a modification of PVST which allows per vlan spanning trees
over standard 802.1q links.
802.1q does NOT natively support multiple spanning tree instances, only
one instance. BPDU packets on a 802.1q link are not tagged and are
transported on the native vlan. So only one spanning tree could be
supported.

Also, Cisco could not change the packet header of an 802.1q packet,
since it was a standard. So how did they manage to transport different
spanning trees over a standard 802.1q trunk ???

A standard BPDU packet is sent to the mac address: 01-80-C2-00-00-00,
untagged.

in PVST+, the BPDUs of the native vlan are transported like the
standard, untagged. the BPDUs of the other vlans are transported TAGGED
to the Cisco shared spanning tree mac address: 01-00-0C-CC-CC-CD

The other end - also a Cisco - understands this mac address and finds
the BPDUs of the other vlans. If the other end is NOT a Cisco, the mac address is flooded across the native vlan. This allows Cisco switches to maintain a per-vlan spanning tree across non-cisco switches.

----------
Cisco switches run different types of STP protocol, depending on whether the connected port is access, ISL trunk or 802.1q trunk. Natively, a Cisco switch runs a separate STP instance for each configured and active VLAN (this is called Per-VLAN Spanning Tree or PVST) and standard IEEE compliant switches run just one instance of STP protocol shared by all VLANs.

Access Ports
Cisco switches run classic version of IEEE STP protocol on the access ports. The IEEE STP BPDUs are sent to IEEE reserved multicast MAC address “0180.C200.0000” using IEEE 802.2 LLC SAP encapsulation with both SSAP and DSAP fields equal to “0×42”. Note that you can plug any standard IEEE compliant switch into a Cisco switch access port and they will interoperate perfectly, joining the respective access VLAN STP instance with the IEEE STP instance (MST).

ISL Trunks
Across ISL trunks, Cisco switches run PVST (Per-VLAN Spanning Tree). (Note that PVST feature is limited to ISL trunks only). The same IEEE STP BPDUs are sent for each VLAN, encapsulated in additional ISL header (which also carries the VLAN number). Since PVST BPDUs have the same format as IEEE BPDUs (that is IEEE 802.2 LLC SAP) they can be matched using the same SSAP/DSAP values of “0×42” for the purpose of Layer 2 filtering. The group of Cisco switches connected using ISL trunks only is called PVST region.

802.1q Trunks
Across 802.1q trunks, Cisco switches run PVST+ (Per VLAN Spanning Tree Plus). The goal of PVST+ is to interoperate with standard IEEE STP (MST) and allow transparent tunneling of PVST instance BPDUs across MST region (to potentially connect to other Cisco switches across the MST region).

interoperability
a group of Cisco switches connected using ISL trunks only is called PVST region.
a group of Cisco switches connected using 802.1q trunks as PVST+ region.

PVST+ region may connect to a PVST region using an ISL trunk and connect to MST region using a 802.1q trunk. The STP instances in PVST and PVST+ regions maps directly to each other, so no special interoperability solution is required. However, on MST side only one STP instance exists, contrary to many STP instances of PVST+ region. The first question is: if we want to interoperate with MST, which PVST VLAN’s STP instance should be joined with MST? Cisco chooses VLAN 1 for this purpose. The joined together instances of Cisco VLAN 1 STP and MST are called “Common Spanning Tree” or CST (naturally, CST spans PVST, PVST+ and MST regions).

Case 1: Cisco switch connects to MST switch across a 802.1q trunk with default native VLAN (VLAN 1)

MST (standard IEEE switch) side sends IEEE STP BPDUs to IEEE multicast MAC address. Those BPDUs are consumed and processed by VLAN 1 STP instance on Cisco switch (PVST+ region).

PVST+ side (Cisco switch) sends IEEE STP BPDUs corresponding to local VLAN 1 STP to IEEE MAC address as untagged frames across the link. At the same time, special new SSTP (shared spanning tree, synonym to PVST+) BPDUs are being sent to SSTP multicast MAC address “0100.0ccc.cccd” also untagged. Those SSTP BPDUs are encapsulated using IEEE 802.2 LLC SNAP header (SSAP=DSAP=”0xAA” and SNAP PID=”0×010B”). The BPDUs contain the same information as the parallel IEEE STP BPDUs for VLAN 1, but have some additional fields, notably special TLV with the source VLAN number. Note that IEEE switches do not interpret the SSTP BPDUs, but simply them flood through the respective VLAN topology, in case there are other Cisco switches connected to MST cloud.

As for non-native VLANs (VLANs 2-4095) Cisco switch sends only SSTP BPDUs, tagged with respective VLAN number and destined to the SSTP MAC address. (Please remember that all SSTP BPDUs carry a VLAN number they belong to). The respective VLAN STP instances are “transparently expanded” across the MST region, considering it as a “virtual hub”. (Note that this may have some traffic engineering implications, since to non-CST VLANs the cost of traversing MSTP region equals to the cost of the link used to connect to the first MSTP switch).

Now the question is, why would Cisco switch send the same VLAN1 BPDU twice – towards IEEE and SSTP multicast MAC addresses? Isn’t it supposed for the Cisco switch to join its VLAN 1 STP instance with the MST? The reason for sending additional SSTP BPDUs across VLAN 1 is purely informational, to perform consistency checking. The idea is to inform all other potential Cisco switches attached to MST cloud about our native VLAN. The receiving switch will only use IEEE BPDUs for VLAN 1 (CST) computations and will ignore SSTP BPDUs sent on VLAN 1.

Lastly, for the purpose of layer 2 filtering, remember that you can match SSTP BPDUs using an ethertype value “0×010B”.This works with multilayer switches even though SSTP BPDUs are SNAP encapsulated, and the actual field is not “ethertype” but rather a SNAP Protocol ID.

Case 2: Cisco switch connects to MST switch across a 802.1q trunk with non-default native VLAN (e.g VLAN 100).

MST (standard switch) side sends IEEE STP BPDUs to IEEE multicast MAC address and those BPDUs are processed by VLAN 1 (CST) STP instance in the Cisco switch.

PVST+ side (Cisco switch) sends untagged IEEE STP BPDUs corresponding to VLAN 1 (CST) STP to IEEE MAC address across the link. This is done for the purpose of joining the local VLAN 1 instance and the MSTP instance into CST. At the same time, VLAN 1 BPDUs are replicated to SSTP multicast address, tagged with VLAN 1 number (to inform other Cisco switches that VLAN 1 is non-native on our switch). Finally, BPDUs of the native VLAN instance (VLAN 100 in our case) are sent untagged using SSTP encapsulation and destination address. Of course, native VLAN100 BPDUs, (even though they are untagged) carry VLAN number inside a special TLV SSTP header.

As in Case 1 for the remaining non-native VLANs (VLANs 2-4095) Cisco switch sends SSTP BPDU only, tagged with respective VLAN tag and destined to the SSTP MAC address. The other Cisco switches connected to the MSTP cloud receive the SSTP BPDUs and process the using the respective VLAN STP instances.

VSTP

VLAN Spanning Tree Protocol (VSTP) allows switches to run one or more STP or RSTP instances for each VLAN on which VSTP is enabled. For networks with multiple VLANs, this enables more intelligent tree spanning, because each VLAN can have interfaces enabled or disabled depending on the paths available to that specific VLAN.

Prior to Junos 10.2, only one spanning tree protocol can be enabled: IEEE or VSTP.

Starting with 10.2, concurrent spanning-tree protocol (rstp + vstp ) is allowed, can interoperate with either PVST+ or R-PVST+.

Scalability numbers:
. 253 vstp instances plus 1 rstp

=========================
example 1:
show protocols
rstp;
vstp {
vlan 2;
vlan 3;
vlan 4;
}

Any vlans that are not configured under vstp is part of the RSTP instance, vlan 2-4 are in each of their own vstp instance.

Example 2:
show protocols
rstp ;
vstp {
vlan-group {
group TRY {
vlan 2-100 ;
bridge-priority 4k;
}
group BACK {
vlan 101-200;
bridge-priority 16k;
}
}
Any vlans that ate configured under vstp is part of rstp, vlans 2-200 are in each of their own vstp instance. vlans that are grouped together will inherit there same stp parameters.

Wednesday, July 7, 2010

MSTP II: Outside a Region

The concept of CIST
Every MSTP region runs special instance of spanning-tree known as IST or Internal Spanning Tree (=MSTI0). This instance is active on all links inside a region and serves the purpose of disseminating STP topology information for other STP instances. As usual, IST has a root bridge, elected based on the lowest bridge ID (priority/MAC address). However, situation changes when you have different regions in the network (e.g. switches with different region names, different revisions, etc). When a switch detects BPDU messages sourced from another region on any link, it marks the link as MSTP boundary.

Now two regions should build a common spanning tree known as CIST – Common and Internal Spanning Tree. This tree is result of joining ISTs of each region in a special manner. Here is a detailed description of the process.

1) In addition to sending IST configuration BPDUs, every switch initially declares itself as the root of CIST. The switches pass CIST configuration information along with IST information in additional BPDU fields. Switches inside a region never change the path cost to the CIST Root, known as CIST External Root Path Cost. Instead of that, the external path cost only changes on the boundary ports. Thus, this external cost only accounts for the cost of boundary links, not the cost of the paths inside a region. Essentially, CIST External Path Cost information “tunnels” across a region.

2) On boundary ports, switches exchange their CIST BPDU information ONLY. That is, switches hide IST information between regions, but pass CIST metrics. The usual RSTP synchronization process takes places between switches on a border link, and eventually ONE switch with the lowest BID (Bridge ID = Priority + MAC Address) among all regions is elected as the CIST Root. Note that each region still elects a local IST root, known as CIST Regional Root, as described further.

3) The region that contains the CIST Root, declares this switch as the root of local IST as well. However, things start to differ for regions that don’t contain the CIST root. All of them elect one of the border switches (switches connected to other regions) as IST root (aka CIST Regional Root). This procedure elects the IST Root (CIST Regional Root) based on the lowest CIST External Root Path Cost. Note that this procedure differs from elections based purely on BID, which take place inside a single region. In this case, the procedure uses BIDs as tiebreaker, if two or more switches have the same CIST Root External Path Cost. MSTP blocks all redundant boundary “uplinks” marking them as “alternate” paths to the CIST Root. The boundary switches do so by receiving “extra” CIST BPDUs on top of IST BPDUs with external root path cost values and comparing them with the ones received on local boundary ports.

4) Inside a region, switches build regular IST, using the CIST Regional Root as the root of IST. Note that this tree uses so-called Internal Root Path Cost stored in the local IST BPDUs. This cost increments along all the links inside a region, but it never leaks out of the region. Between regions, switches exchange information about CIST External Root Path cost only.

Note that switch with non-optimal priority value may become the CIST Regional Root (local IST Root). For example, if you configure a switch inside a region with a lowest BID among all switches it may not necessarily become the CIST Regional Root. Only if the switch has the lowest BID among all regions it would be elected as CIST Root.

The concept of CST and STP interoperation

From the above information, we conclude that CIST essentially has organization of a two-level hierarchy. The first level treats all regions as “pseudo-bridges” and operates with the External Root Path Cost. The first-level spanning tree roots in CIST Root Bridge and encompasses the pseudo-bridges. They call it CST or Common Spanning Tree. Effectively, this has no idea of the internal MSTP regions structure, but sees each region as a virtual bridge.

This is the point where MSTP interoperates with legacy IEEE STP/RSTP regions as well. The legacy switch regions have no concept of IST, so they simply join their STP instance with the CST and perceive MSTP regions as “transparent” pseudo-bridges, staying unaware of their internal topology. (Note that it may happen so that a switch with the lowest BID belongs to RSTP/STP region. This situation results in all MSTP regions electing local CIST Regional Roots and considering the new CIST Root located outside MSTP “domain”). Naturally, MSTP detects the appropriate STP version on a boundary link and switches to the respective mode of operations (e.g. RSTP/STP).

The second level of CIST hierarchy consists of various regional ISTs. Every MSTP region builds IST instance using the internal path costs and following the optimal “internal” topology. As you remember, this “internal topology” transparently transports the “external” BPDU information to the border switches, so that they can elect Regional CIST Root. The fist level topology (CST) is somewhat independent of the second level topology (IST) and bases upon the CIST root and cost of the boundary links. The second level topology (“internal”, IST) may change in case if boundary link cost changes or something happens to the CIST Root, since this affect the election of the Regional CIST Root.

MSTP pseudo-bridges do not strictly emulate the real bridges. For example, different boundary switches send their own BID in BPDUs, so the pseudo-bridge may appear to have many BIDs on various boundary links. However, this has no impact on the process of transparent tunneling of CIST Root information across the pseudo-bridge. Other things that don’t see pseudo-bridge as non-transparent include MSTP hop count or MaxAge timer value, for they may change asynchronously, as information travels along different paths inside a region.

MSTIs and CIST

Now, what could be said about the MSTIs – individual STP instances used inside regions? From what we learned so far, it is easy to conclude that the only logical solution is to map all MSTP instances to the CIST on the boundary links. This implies that you cannot load-balance VLAN traffic on the boundary links by mapping VLANs to different instances. All VLANs use the same non-blocking uplink that CIST elected as the optimal path to the CIST Root. But this only applies to the “CST” paths connecting the regional virtual bridges – inside any region VLANs follow the internal topology paths, based on the respective MSTI configurations.

It is important to note that MSTIs have no idea of the CIST Root whatsoever; they only use internal paths and internal MSTI root to build the spanning tree. However, all MSTP instances see the root port (towards the CIST Root) of the CIST Regional Bridge as the special “Master Port” connecting them to the “outside” world. This port serves the purpose of the “gateway” linking MSTI’s to other regions. Notice that switches do not send M-records (MSTI information) out of boundary ports, only CIST information and thus . Thus, the CIST and MSTI’s may converge independently and in parallel. The master port will ony beging forwarding when all respective MSTI ports are in sync and forwarding to avoid temporary bridging loops.

MSTP and Fault Isolation

Ethernet is known for its broadcast nature that tends propagating issues across the whole Layer 2 domain. There are tree main problems with Ethernet that affect MSTP designs:

* Unknown unicast flooding results in traffic surges under topology changes. Every topology change may cause massive invalidation of MAC address tables and unicast traffic flooding. This process is the result of Ethernet topology unawareness – the bridges don’t know MAC addresses location.
* Broadcas and Multicast flooding. This is a separate problem as many core protocols (ARP, IGP, PIM) rely on multicasting or broadcasting. Thos packets should be delivered to every node in a broadcast domain and under intense load network could be congested at every point.
* Spanning-Tree Convergence. MSTP uses RSTP procedure for STP re-negotiation. Since it is based on distance-vector behavior, it is prone to convergence issue, such as counting to infinity (old information circulation). This is especially noticeable in larger topologies with 10 switches and more and under special conditions, such as failure of the root bridge.

The concept of MSTP region allows for bounding STP re-computations. Since MSTIs in every region are independent, any change affecting MSTI in one region will not affect MSTIs in other regions. This is a direct result of the fact that M-record information is not exchanged between the regions. However, CIST recalculations affect every region and might be slow converging. This is why it is a good idea not to map any VLAN to CIST.

Topology changes in MSTP are treated the same way as in RSTP. That is, only non-edge links going to forwarding state will cause a topology change. A single physical link may be forwarding for one MSTI and blocking for another. Thus, a single physical change may have different effect on MSTIs and the CIST. Topology changes in MSTIs are bounded to a single region, while topology changes to the CIST propagate through all regions. Every region treats the TC notification from another region as “external” and applies them to CIST-associated ports only.

A topology change to CST (the tree connecting the virtual bridges) will affect all MSTIs in all regions and the CIST. This is due to the fact that new link becoming forwarding between the virtual bridges may change all paths in the topology and thus require massive MAC address re-learning. Thus, from the standpoint of topology change, something happening to the CST will have most massive impact of flooding in the set of interconnected MSTP regions.

The above observations advise a good design rule for MSTP networks – separated “meshy” topologies in their own regions and interconnect regions using “sparse” mesh, keeping in mind balance between redundancy and topology changes effect. This is an adaptation of well-know design principle – separate complexity from complexity to keep networks more stable and isolate fault domains.

Interoperating with PVST+

This task poses a tough issue. We know that PSVST+ runs an STP instance for every VLAN. On a contrary, MSTP maps VLANs to MSTIs, so one-to-one mapping between VLAN and STP instance no longer holds true. How should an MSTP switch operate on a border link connected to the PVST+ domain? As we remember, on the border with IEEE STP domain, PVST+ simply joins VLAN 1 STP with IEEE STP and tunnels SSTP BPDUs across the IEEE STP domain (refer to PVST+ Explained article for more information).

However, MSTP runs multiple MSTIs inside a region and maps them all to CIST on the border link. That means we need to make sure that internal MSTIs could be aware of changes in PVST+ trees. It’s hard to automatically map VLAN-based STPs to MSTI and so the simplest way to accomplish the desired behavior is to join all PVST+ trees with CIST. This way, changes in any of PVST+ STP instances propagate to CIST/IST and affect all MSTIs in result. While not the optimal solution, it ensure that no changes go unnoticed and no black holes occur in a single VLAN due to the topology changes.

The MSTP implementation simulates PVST+ by replicating CIST BPDUs on the link facing the PVST+ domain and sending the BPDUs on ALL VLANs active on the trunk. The MSTP switch consumes all BDPUs received from PVST+ domain and processes them using the CIST/IST instance. The PSVT+ side sees the MSTP domain as a special PVST+ domain with all per-VLAN instances claiming the CIST Root as the root of their STP. Note that PVST+ also interprets the whole region as a single pseudo-bridge, but operating in PSVT+ mode. The two possible options are allowed here:

1) MSTP domain (either a single region or multiple regions) contains the root bridge for ALL VLANs. This is only true if CIST Root BID is better than any PVST+ STP root BID. This is the preferred design, for you can manipulate uplink costs on the PVST+ side and obtain optimal traffic engineering results.

2) PVST+ contains the root bridges for ALL VLANs, including VLAN1, which maps to CST of STP. This is only true is all PVST+ root bridges BIDs for all VLANs are better than CIST Root BID. This is not the preferred design, since all MSTIs map to CIST on the border link, and you cannot load-balance the MSTIs as the enter the PVST+ domain.

Cisco implementation does not support the second option. MSTP domain should contain the bridge with the best BID, to ensure that the CIST Root is also the root for all PVST+ trees. If any other case, MSTP border switch will complain and place the ports that receive superior BPDUs from PVST+ region in root-inconsistent state. To fix this issue, ensure that PVST+ domain does not have any bridges with BIDs better than the CIST Root Bridge ID.

MSTP I: Inside a Region

In the beginning, there was IEEE STP protocol (originally, there also was DEC variant [the original] invented by Radia Perlman and IBM STP protocols, but those are fossils now), which was adapted for use with multiple VLANs and 802.1q trunks. A single shared tree, sometimes called Mono Spanning Tree by Cisco, or more often – Common Spanning Tree is shared by all VLANs. The obvious drawback of this design is impossibility to perform VLAN traffic engineering across redundant links: if a link is blocked, it is blocked for all VLANs. To overcome this, Cisco suggested its proprietary PVST/PVST+ solution, running a separate STP instance for each VLAN. This solution permits using different logical topology for each VLAN, effectively allowing for L2 traffic engineering. However, with the number of VLANs growing, PVST becomes a waste of switch resources and management burden, for the number of logical topologies is usually much smaller than the number of active VLANs.

As time passed, STP evolved into RSTP and Cisco answered with Rapid-PVST+: the fast STP, but with the same per-VLAN instance concept. The single spanning-tree instance used by IEEE and per-VLAN STP implemented by Cisco represents two poles in the space of possible solutions. Seeing the limitations of PVST approach, Cisco came with idea of decoupling the STP instance from a VLAN (they were bound together in PVST). The initial implementation was called MISTP (Multiple Instances Spanning Tree) and later evolved into new IEEE 802.1s standard called MSTP (Multiple Spanning Trees Protocol).

Logical and Physical Topologies
Instead of running an STP instance for each VLAN, run a number of VLAN-independent STP instances (representing logical topologies) and then map each VLAN to the most appropriate logical topology (instance). Thus, the number of STP instances is kept to minimum (saving switch resources), but the network capacity is utilized in optimal fashion, by using all possible paths for VLAN traffic. The switch forwarding logic for VLAN traffic was changed a little bit. In order for a frame to be forwarded out of a port, two conditions must be met: first, VLAN must be active on this port (e.g. not filtered) and second, the STP instance the VLAN maps to, must be in non-discarding state for this port. Obviously, due to multiple logical topologies a single port could be blocking for one instance and forwarding for another.

Implementing MSTP

The following questions need to be answered:

* Topology Calculation. How to build multiple STP instances (logical topologies) in a single physical topology? Should we run multiple STP instances each with own BPDUs? If yes, then how would we distinguish every instance’s BPDUs: PVST+ uses VLAN tags for that, but now STP instances are independent of VLANs?
* Information Distribution.How to make all switches aware of VLAN to instance mappings? Should we distribute instance ID along with VLAN number? If yet, then how could we ensure all switches use consistent numbering?
* Consistency Check. How to ensure the above mapping is consistent across all switches? That is, would switch1 know that switch2 maps VLAN2 to the same instance 1 and not insance2?

IEEE’s implementation, MSTP region is a collection of switches, sharing the same view of physical topology partitioning into set of logical topologies. For two switches to become members of the same region, the following attributes must match:

* Configuration name
* Configuration revision number (16 bits)
* The table of 4096 elements which map the respective VLAN to STP instance number

IEEE 802.1s implementation does not send a BDPU for each active STP instance, nor does it encapsulate VLAN list in each configuration message. Instead of that, a special STP instance number 0 called Internal Spanning Tree (IST or MSTI0) is designated to carry all “signaling” information. The BPDUs for IST contain all standard RSTP information for IST itself, as well as carry additional informational fields. Among others fields there are configuration name, revision number and a hash value computed over VLAN to STP instance mapping table contents. Using just this compact information it’s easy to detect misconfiguration on two neighboring switches.

What about other instances, besides the IST thing? Well, obviously, all VLANs could be mapped to IST – this is the default configuration. Effectively, this represents the case of classic IEEE RSTP with all VLANs sharing the same spanning-tree. Of course, other instances also exit, and they are called MSTIs – multiple spanning tree instances. Each MSTI may assign different priorities to switches, may have different link costs, port priorities and thus end up with it’s own logical topology. Now if the 802.1s standard implementation does not send separate BDPUs for each MSTI, how does it accomplish separate topologies? The MSTIs information is piggybacked into IST BPDUs in special MRecord fields (one for every active MSTI), which carries root priority, designated bridge priority, port priority and root path cost among others. Let’s see how this whole thing works.

First of all, since MSTP convergence mechanism stems from RSTP, there is no BDPU relaying process downstream from the root bridge. Every switch emits configuration BPDUs on it’s own, every Hello interval seconds. Every BDPU has full information about IST, and also MRecord for every MSTI . Using the RSTP convergence mechanics, separate STP instances are built for IST and every MSTI, using the information from IST BPDU and MRecords (root/designated bridge priorities, port priority, root path cost etc). Note that STP timers such as Hello, ForwardTime, MaxAge could only be tuned for IST, the instance 0. All other instances (MSTIs) inherit the timers from IST – this is the natural result of all MSTI information being piggybacked in IST BPDUs. Just as a side note, MSTP does not use MaxAge timer to age out old information, like RSTP/STP do. Instead of this, IST BDPUs has special field called MaxHops. IST root sends BPDUs with hop count equal to MaxHops and every other downstream switch decrements the hop count field on reception of IST BPDU. As soon as hop count becomes zero, the information in BPDU is ignored, and the switch may start declaring itself as new IST root. The old MaxAge/ForwardDelay timers are still used when MSTP interacts with RSTP, STP or (R)PVST+ bridges.

Caveats arising from VLAN/STP decoupling

There are some issues, which may arise from the fact that spanning-tree instances now are not directly tied to VLANs. The general rule should be as following: “If a VLAN is active on a particular primary link (e.g. this link is non-backup in your logical topology), ensure the STP instance it maps to is forwarding on this link”.
So, do not use “VLAN pruning” static method of distributing VLANs across trunks when you have MSTP enabled.
Use separate STP for each logical topology and avoid mapping VLANs to IST. Keep IST only for information distribution, but load-balance traffic using MSTIs.

MSTP

CST

The original IEEE 802.1q standard defines much more than simply trunking. This standard defines a Common Spanning Tree (CST) that only assumes one spanning tree instance for the entire bridged network, regardless of the number of VLANs.
In a network running the CST, these statements are true:
. No load balancing is possible; one Uplink needs to block for all VLANs.
. The CPU is spared; only one instance needs to be computed.

MST Case
Several VLANs can be mapped to a reduced number of spanning tree instances because most networks do not need more than a few logical topologies.
. The desired load balancing scheme can still be achieved.
. The CPU is spared because only limited instances are computed.

MST Region
The main enhancement introduced by MST is that several VLANs can be mapped to a single spanning tree instance. This raises the problem of how to determine which VLAN is to be associated with which instance. More precisely, how to tag BPDUs so that the receiving devices can identify the instances and the VLANs to which each device applies.

The IEEE 802.1s committee adopted a much easier and simpler approach that introduced MST regions. Think of a region as the equivalent of Border Gateway Protocol (BGP) Autonomous Systems, which is a group of switches placed under a common administration.

MST Configuration and MST Region

Each switch running MST in the network has a single MST configuration that consists of these three attributes:

1. An alphanumeric configuration name (32 bytes)
2. A configuration revision number (two bytes)
3. A 4096-element table that associates each of the potential 4096 VLANs supported on the chassis to a given instance

In order to be part of a common MST region, a group of switches must share the same configuration attributes.
Note: If for any reason two switches differ on one or more configuration attribute, the switches are part of different regions.

Region Boundary

In order to ensure consistent VLAN-to-instance mapping, it is necessary for the protocol to be able to exactly identify the boundaries of the regions. For that purpose, the characteristics of the region are included in the BPDUs. The exact VLANs-to-instance mapping is not propagated in the BPDU, because the switches only need to know whether they are in the same region as a neighbor. Therefore, only a digest of the VLANs-to-instance mapping table is sent, along with the revision number and the name. Once a switch receives a BPDU, the switch extracts the digest (a numerical value derived from the VLAN-to-instance mapping table through a mathematical function) and compares this digest with its own computed digest. If the digests differ, the port on which the BPDU was received is at the boundary of a region.

In generic terms, a port is at the boundary of a region if the designated bridge on its segment is in a different region or if it receives legacy 802.1d BPDUs.

MST Instances

According to the IEEE 802.1s specification, an MST bridge must be able to handle at least these two instances:
. One Internal Spanning Tree (IST)
. One or more Multiple Spanning Tree Instance(s) (MSTIs)

The terminology continues to evolve, as 802.1s is actually in a pre-standard phase. It is likely these names will change in the final release of 802.1s.

IST Instances

In order to clearly understand the role of the IST instance, remember that MST originates from the IEEE. Therefore, MST must be able to interact with 802.1q-based networks, because 802.1q is another IEEE standard. For 802.1q, a bridged network only implements a single spanning tree (CST). The IST instance is simply an RSTP instance that extends the CST inside the MST region.

The IST instance receives and sends BPDUs to the CST. The IST can represent the entire MST region as a CST virtual bridge to the outside world.

These are two functionally equivalent diagrams. Notice the location of the different blocked ports. In a typically bridged network, you expect to see a blocked port between Switches M and B. Instead of blocking on D, you expect to have the second loop broken by a blocked port somewhere in the middle of the MST region. However, due to the IST, the entire region appears as one virtual bridge that runs a single spanning tree (CST). This makes it possible to understand that the virtual bridge blocks an alternate port on B. Also, that virtual bridge is on the C to D segment and leads Switch D to block its port.

MSTIs

The MSTIs are simple RSTP instances that only exist inside a region. These instances run the RSTP automatically by default, without any extra configuration work. Unlike the IST, MSTIs never interact with the outside of the region. Remember that MST only runs one spanning tree outside of the region, so except for the IST instance, regular instances inside of the region have no outside counterpart. Additionally, MSTIs do not send BPDUs outside a region, only the IST does.

MSTIs do not send independent individual BPDUs. Inside the MST region, bridges exchange MST BPDUs that can be seen as normal RSTP BPDUs for the IST while containing additional information for each MSTI. Each switch only sends one BPDU, but each includes one MRecord per MSTI present on the ports.

Note: The first information field carried by an MST BPDU contains data about the IST. This implies that the IST (instance 0) is always present everywhere inside an MST region. However, the network administrator does not have to map VLANs onto instance 0, and therefore this is not a source of concern.

Unlike regular converged spanning tree topology, both ends of a link can send and receive BPDUs simultaneously. This is because, each bridge can be designated for one or more instances and needs to transmit BPDUs. As soon as a single MST instance is designated on a port, a BPDU that contains the information for all instances (IST+ MSTIs) is to be sent. The diagram shown here demonstrates MST BDPUs sent inside and outside of an MST region.

The MRecord contains enough information (mostly root bridge and sender bridge priority parameters) for the corresponding instance to calculate its final topology. The MRecord does not need any timer-related parameters such as hello time, forward delay, and max age that are typically found in a regular IEEE 802.1d or 802.1q CST BPDU. The only instance in the MST region to use these parameters is the IST; the hello time determines how frequently BPDUs are sent, and the forward delay parameter is mainly used when rapid transition is not possible (remember that rapid transitions do not occur on shared links). As MSTIs depend on the IST to transmit their information, MSTIs do not need those timers.

Thursday, July 1, 2010

RSTP

The 802.1D Spanning Tree Protocol (STP) standard was designed at a time when the recovery of connectivity after an outage within a minute or so was considered adequate performance. It is not the case anymore.

Original 802.1D was enhanced with features such as Uplink Fast, Backbone Fast, and Port Fast to speed up the convergence time of a bridged network. The drawback is that these mechanisms are proprietary and need additional configuration.

Rapid Spanning Tree Protocol (RSTP; IEEE 802.1w) can be seen as an evolution of the 802.1D standard more than a revolution. The 802.1D terminology remains primarily the same. Most parameters have been left unchanged so users familiar with 802.1D can rapidly configure the new protocol comfortably. 802.1w can also revert back to 802.1D in order to interoperate with legacy bridges on a per-port basis. This drops the benefits it introduces.

The new edition of the 802.1D standard, IEEE 802.1D-2004, incorporates IEEE 802.1t-2001 and IEEE 802.1w standards.

Port States and Port Rolest
802.1D is defined in these four different port states: listening, learning,blocking,forwarding.

The state of the port is mixed, whether it blocks or forwards traffic, and the role it plays in the active topology (root port, designated port, and so on). RSTP decouples the role and the state of a port.
There are only three port states left in RSTP that correspond to the three possible operational states: discarding, Leanring, Forwarding. The 802.1D disabled, blocking, and listening states are merged into a unique 802.1w discarding state.

Port Roles
The role is now a variable assigned to a given port. The root port and designated port roles remain, while the blocking port role is split into the backup and alternate port roles. The Spanning Tree Algorithm (STA) determines the role of a port based on Bridge Protocol Data Units (BPDUs).

Root Port Roles
The port that receives the best BPDU on a bridge is the root port. This is the port that is the closest to the root bridge in terms of path cost. The STA elects a single root bridge in the whole bridged network (per-VLAN). The root bridge is the only bridge in the network that does not have a root port.

Designated Port Role
A port is designated if it can send the best BPDU on the segment to which it is connected. On a given segment, there can only be one path toward the root bridge. All bridges connected to a given segment listen to the BPDUs of each and agree on the bridge that sends the best BPDU as the designated bridge for the segment. The port on that bridge that corresponds is the designated port for that segment.

Alternate and Backup Port Roles
These two port roles correspond to the blocking state of 802.1D. A blocked port is defined as not being the designated or root port. A port absolutely needs to receive BPDUs in order to stay blocked.
An alternate port receives more useful BPDUs from another bridge and is a port blocked.
A backup port receives more useful BPDUs from the same bridge it is on and is a port blocked.

New BPDU
Only two flags, Topology Change (TC) and TC Acknowledgment (TCA), are defined in 802.1D. RSTP now uses all six bits of the flag byte.

01234567
0 - Topology change
1 - Proposal
2 3 - port role: 00 Unknown, 01 Alternate/Backup, 10 Root, 11 Designated
4 - Learning
5 - Forwarding
6 - Agreement
7 - Topology change Ack

RSTP BPDU is type 2, version 2. The implication is that legacy bridges must drop this new BPDU.

BPDU are Sent Every Hello-Time
BPDU are sent every hello-time. With 802.1D, a non-root bridge only generates BPDUs when it receives one on the root port. In 802.1w, A bridge now sends a BPDU with its current information every seconds (2 by default), even if it does not receive any from the root bridge.

Faster Aging of Information
On a given port, if hellos are not received three consecutive times, protocol information can be immediately aged out (or if max_age expires). BPDUs are now used as a keep-alive mechanism between bridges. A bridge considers that it loses connectivity to its direct neighbor root or designated bridge if it misses three BPDUs in a row. This fast aging of the information allows quick failure detection.

Note: Failures are detected even much faster in case of physical link failures.

Accepts Inferior BPDUs
This concept is what makes up the core of the BackboneFast engine. When a bridge receives inferior information from its designated or root bridge, it immediately accepts it and replaces the one previously stored.

Rapid Transition to Forwarding State
The legacy STA passively waited for the network to converge before it turned a port into the forwarding state. The new rapid STP is able to actively confirm that a port can safely transition to the forwarding state without having to rely on any timer configuration. There is now a real feedback mechanism that takes place between RSTP-compliant bridges. In order to achieve fast convergence on a port, the protocol relies upon two new variables: edge ports and link type.

Edge Ports
All ports directly connected to end stations cannot create bridging loops in the network. Therefore, the edge port directly transitions to the forwarding state, and skips the listening and learning stages. Neither edge ports or PortFast enabled ports generate topology changes when the link toggles. An edge port that receives a BPDU immediately loses edge port status and becomes a normal spanning tree port.

Link Type
RSTP can only achieve rapid transition to the forwarding state on edge ports and on point-to-point links. The link type is automatically derived from the duplex mode of a port. A port that operates in full-duplex is assumed to be point-to-point, while a half-duplex port is considered as a shared port by default. This automatic link type setting can be overridden by explicit configuration.

Convergence with 802.1w
A link between the root bridge and Bridge A is added. Suppose there already is an indirect connection between Bridge A and the root bridge (via C - D in the diagram).

Both ports on the link between A and the root are put in designated blocking as soon as they come up. At this stage, a negotiation takes place between Switch A and the root. As soon as A receives the BPDU of the root, it blocks the non-edge designated ports. This operation is called sync. Once this is done, Bridge A explicitly authorizes the root bridge to put its port in the forwarding state.

The link between Switch A and the root bridge is blocked, and both bridges exchange BPDUs.

Once Switch A blocks its non-edge designated ports, the link between Switch A and the root is put in the forwarding state and you reach the situation:

There still cannot be a loop. Instead of blocking above Switch A, the network now blocks below Switch A. However, the potential bridging loop is cut at a different location. This cut travels down the tree along with the new BPDUs originated by the root through Switch A. At this stage, the newly blocked ports on Switch A also negotiate a quick transition to the forwarding state with their neighbor ports on Switch B and Switch C that both initiate a sync operation. Other than the root port towards A, Switch B only has edge designated ports. Therefore, it has no port to block in order to authorize Switch A to go to the forwarding state. Similarly, Switch C only has to block its designated port to D. The state shown in this diagram is now reached:

The final network topology is reached, just in the time necessary for the new BPDUs to travel down the tree. No timer is involved in this quick convergence. The only new mechanism introduced by RSTP is the acknowledgment that a switch can send on its new root port in order to authorize immediate transition to the forwarding state, and bypasses the twice-the-forward-delay long listening and learning stages.

Proposal/Agreement Sequence
When a port is selected by the STA to become a designated port, 802.1D still waits twice seconds (2x15 by default) before it transitions it to the forwarding state. In RSTP, this condition corresponds to a port with a designated role but a blocking state. These diagrams illustrate how fast transition is achieved step-by-step. Suppose a new link is created between the root and Switch A. Both ports on this link are put in a designated blocking state until they receive a BPDU from their counterpart.

When a designated port is in a discarding or learning state (and only in this case), it sets the proposal bit on the BPDUs it sends out. This is what occurs for port p0 of the root bridge, as shown in step 1 of the preceding diagram. Because Switch A receives superior information, it immediately knows that p1 is the new root port. Switch A then starts a sync to verify that all of its ports are in-sync with this new information. A port is in sync if it meets either of these criteria:
* The port is in the blocking state, which means discarding in a stable topology.
* The port is an edge port.

UplinkFast
Another form of immediate transition to the forwarding state included in RSTP is similar to the UplinkFast spanning tree extension. Basically, when a bridge loses its root port, it is able to put its best alternate port directly into the forwarding mode.
UplinkFast does not need to be configured further because the mechanism is included natively and enabled in RSTP automatically.

Topology Change Detection

In RSTP, only non-edge ports that move to the forwarding state cause a topology change. When a RSTP bridge detects a topology change, these occur:

* It starts the TC While timer with a value equal to twice the hello-time for all its non-edge designated ports and its root port, if necessary.
* It flushes the MAC addresses associated with all these ports.

Topology Change Propagation
When a bridge receives a BPDU with the TC bit set from a neighbor, these occur:
* It clears the MAC addresses learned on all its ports, except the one that receives the topology change.
*It starts the TC While timer and sends BPDUs with TC set on all its designated ports and root port (RSTP no longer uses the specific TCN BPDU, unless a legacy bridge needs to be notified).

This way, the TCN floods very quickly across the whole network. The initiator of the topology change floods this information throughout the network, as opposed to 802.1D where only the root did.

Compatibility with 802.1D

RSTP is able to interoperate with legacy STP protocols. However, the inherent fast convergence benefits of 802.1w are lost when it interacts with legacy bridges.