Wednesday, July 7, 2010

MSTP I: Inside a Region

In the beginning, there was IEEE STP protocol (originally, there also was DEC variant [the original] invented by Radia Perlman and IBM STP protocols, but those are fossils now), which was adapted for use with multiple VLANs and 802.1q trunks. A single shared tree, sometimes called Mono Spanning Tree by Cisco, or more often – Common Spanning Tree is shared by all VLANs. The obvious drawback of this design is impossibility to perform VLAN traffic engineering across redundant links: if a link is blocked, it is blocked for all VLANs. To overcome this, Cisco suggested its proprietary PVST/PVST+ solution, running a separate STP instance for each VLAN. This solution permits using different logical topology for each VLAN, effectively allowing for L2 traffic engineering. However, with the number of VLANs growing, PVST becomes a waste of switch resources and management burden, for the number of logical topologies is usually much smaller than the number of active VLANs.

As time passed, STP evolved into RSTP and Cisco answered with Rapid-PVST+: the fast STP, but with the same per-VLAN instance concept. The single spanning-tree instance used by IEEE and per-VLAN STP implemented by Cisco represents two poles in the space of possible solutions. Seeing the limitations of PVST approach, Cisco came with idea of decoupling the STP instance from a VLAN (they were bound together in PVST). The initial implementation was called MISTP (Multiple Instances Spanning Tree) and later evolved into new IEEE 802.1s standard called MSTP (Multiple Spanning Trees Protocol).

Logical and Physical Topologies
Instead of running an STP instance for each VLAN, run a number of VLAN-independent STP instances (representing logical topologies) and then map each VLAN to the most appropriate logical topology (instance). Thus, the number of STP instances is kept to minimum (saving switch resources), but the network capacity is utilized in optimal fashion, by using all possible paths for VLAN traffic. The switch forwarding logic for VLAN traffic was changed a little bit. In order for a frame to be forwarded out of a port, two conditions must be met: first, VLAN must be active on this port (e.g. not filtered) and second, the STP instance the VLAN maps to, must be in non-discarding state for this port. Obviously, due to multiple logical topologies a single port could be blocking for one instance and forwarding for another.

Implementing MSTP

The following questions need to be answered:

* Topology Calculation. How to build multiple STP instances (logical topologies) in a single physical topology? Should we run multiple STP instances each with own BPDUs? If yes, then how would we distinguish every instance’s BPDUs: PVST+ uses VLAN tags for that, but now STP instances are independent of VLANs?
* Information Distribution.How to make all switches aware of VLAN to instance mappings? Should we distribute instance ID along with VLAN number? If yet, then how could we ensure all switches use consistent numbering?
* Consistency Check. How to ensure the above mapping is consistent across all switches? That is, would switch1 know that switch2 maps VLAN2 to the same instance 1 and not insance2?

IEEE’s implementation, MSTP region is a collection of switches, sharing the same view of physical topology partitioning into set of logical topologies. For two switches to become members of the same region, the following attributes must match:

* Configuration name
* Configuration revision number (16 bits)
* The table of 4096 elements which map the respective VLAN to STP instance number

IEEE 802.1s implementation does not send a BDPU for each active STP instance, nor does it encapsulate VLAN list in each configuration message. Instead of that, a special STP instance number 0 called Internal Spanning Tree (IST or MSTI0) is designated to carry all “signaling” information. The BPDUs for IST contain all standard RSTP information for IST itself, as well as carry additional informational fields. Among others fields there are configuration name, revision number and a hash value computed over VLAN to STP instance mapping table contents. Using just this compact information it’s easy to detect misconfiguration on two neighboring switches.

What about other instances, besides the IST thing? Well, obviously, all VLANs could be mapped to IST – this is the default configuration. Effectively, this represents the case of classic IEEE RSTP with all VLANs sharing the same spanning-tree. Of course, other instances also exit, and they are called MSTIs – multiple spanning tree instances. Each MSTI may assign different priorities to switches, may have different link costs, port priorities and thus end up with it’s own logical topology. Now if the 802.1s standard implementation does not send separate BDPUs for each MSTI, how does it accomplish separate topologies? The MSTIs information is piggybacked into IST BPDUs in special MRecord fields (one for every active MSTI), which carries root priority, designated bridge priority, port priority and root path cost among others. Let’s see how this whole thing works.

First of all, since MSTP convergence mechanism stems from RSTP, there is no BDPU relaying process downstream from the root bridge. Every switch emits configuration BPDUs on it’s own, every Hello interval seconds. Every BDPU has full information about IST, and also MRecord for every MSTI . Using the RSTP convergence mechanics, separate STP instances are built for IST and every MSTI, using the information from IST BPDU and MRecords (root/designated bridge priorities, port priority, root path cost etc). Note that STP timers such as Hello, ForwardTime, MaxAge could only be tuned for IST, the instance 0. All other instances (MSTIs) inherit the timers from IST – this is the natural result of all MSTI information being piggybacked in IST BPDUs. Just as a side note, MSTP does not use MaxAge timer to age out old information, like RSTP/STP do. Instead of this, IST BDPUs has special field called MaxHops. IST root sends BPDUs with hop count equal to MaxHops and every other downstream switch decrements the hop count field on reception of IST BPDU. As soon as hop count becomes zero, the information in BPDU is ignored, and the switch may start declaring itself as new IST root. The old MaxAge/ForwardDelay timers are still used when MSTP interacts with RSTP, STP or (R)PVST+ bridges.


Caveats arising from VLAN/STP decoupling

There are some issues, which may arise from the fact that spanning-tree instances now are not directly tied to VLANs. The general rule should be as following: “If a VLAN is active on a particular primary link (e.g. this link is non-backup in your logical topology), ensure the STP instance it maps to is forwarding on this link”.
So, do not use “VLAN pruning” static method of distributing VLANs across trunks when you have MSTP enabled.
Use separate STP for each logical topology and avoid mapping VLANs to IST. Keep IST only for information distribution, but load-balance traffic using MSTIs.

No comments:

Post a Comment