How to create SIP resiliency through HA SBCs

This week I read an interesting blog article about creating SIP resiliency with high-availability Session Border Controllers (SBCs). The article, which I would like to share here, is by Andrew Prokop.


It’s important to understand how high availability session border controllers are enabled, and more importantly, under what conditions failover rules are invoked.

Any subject worth learning is like an onion. No, not because it will make you cry (although some sure bring out the tears for me). Rather, it’s because a worthwhile subject consists of multiple layers, and every time you think you completely understand the whole thing, you find something new to capture your time and attention.

In another blog/article (Peeling Back the SIP Resiliency Layers) I discussed a number of techniques that an enterprise can employ to create a resilient SIP infrastructure. As with the onion, there are layers of subsystem resiliency that all together make an entire system durable and robust. Take one away and you risk a single point of failure that might be responsible for total system failure.

Today, I would like to peel the onion back a bit more and discuss one of the most critical aspects of SIP resiliency: high availability session border controllers (HA SBC). While the overall concept of an HA SBC is fairly obvious, it’s important to understand how it’s enabled and more importantly, under what conditions failover rules are invoked.

Double Your Fun
You may be surprised to know that there really is no such thing as a high availability SBC. In reality, a high availability SBC is made up of two standalone SBCs with a private connection between the two. While vendors may create product structures that position two separate SBCs as one logical package, you are still buying two boxes. It takes software and a pseudo network between them to turn them into an HA pair.

In all cases, one of the SBCs will be designated as active, and the second SBC as standby. That doesn’t mean, however, that the standby SBC is just sitting around twiddling its thumbs. No, it’s listening to all sorts of device health, configuration, and call state information that the active SBC is sending across the pseudo network. This allows the standby SBC to know exactly what the active SBC is doing in case it has to take control.

Since an HA SBC is really two separate SBCs, something more is needed to make it look like only one to the rest of the world. This is accomplished by creating virtual MAC and IP addresses that can freely float between the two standalone SBCs. These are the only SBC MAC and IP addresses that the network is aware of. Only one SBC at a time will assume control of these virtual addresses, so to the network, it really looks like one device. It works like this:

  • Each SBC has its own set of MAC and IP addresses that are not advertised to the communications components.
  • The active SBC controls the virtual MAC and IP addresses.
  • All communications components (SIP carrier, session management servers, call recorders, etc.) send to and receive from the virtual addresses. This includes SIP signaling as well as any media paths.
  • The standby SBC constantly monitors the health of the active SBC.
  • If the standby SBC realizes that the active SBC is no longer able to properly deliver SIP services, it sends out gratuitous ARP (address resolution protocol) messages to take control of the virtual addresses. A gratuitous ARP is a sort of advance notification. It updates the ARP cache of other systems before it is asked (no ARP request) and essentially moves an IP address from one physical interface to another.
  • Since the other communications elements continue to send and receive from the virtual addresses (the only addresses they are aware of), SBC functionality is uninterrupted. Existing calls stay up, and new call requests are immediately processed.
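To make the gratuitous ARP step a little more concrete, here is a minimal sketch of what such a frame looks like on the wire. It only builds the bytes and does not send anything. The function name, the MAC address, and the IP address are hypothetical, and a real SBC may use the reply form of the message or additional vendor-specific behavior.

```python
import socket
import struct


def build_gratuitous_arp(virtual_mac: bytes, virtual_ip: bytes) -> bytes:
    """Build a broadcast Ethernet frame carrying a gratuitous ARP request."""
    if len(virtual_mac) != 6 or len(virtual_ip) != 4:
        raise ValueError("expected a 6-byte MAC and a 4-byte IPv4 address")

    broadcast = b"\xff" * 6
    # Ethernet header: destination, source, EtherType 0x0806 (ARP).
    ethernet = broadcast + virtual_mac + struct.pack("!H", 0x0806)
    # ARP header: hardware type 1 (Ethernet), protocol 0x0800 (IPv4),
    # hardware length 6, protocol length 4, opcode 1 (request).
    arp_header = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)
    # What makes it "gratuitous": sender IP and target IP are the same
    # address, so nobody is expected to answer. The target MAC is zeroed.
    arp_body = virtual_mac + virtual_ip + b"\x00" * 6 + virtual_ip
    return ethernet + arp_header + arp_body


# Hypothetical virtual addresses shared by the HA pair.
frame = build_gratuitous_arp(
    bytes.fromhex("02005e000001"), socket.inet_aton("192.0.2.10")
)
```

Every system on the segment that receives this broadcast refreshes its ARP cache entry for the virtual IP, and the switches relearn which port the virtual MAC now lives on.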

It’s important to know under what conditions the standby determines it’s time to step up to the plate and assume the role of active. These conditions may vary by vendor, but here are the main triggers:

  • The active SBC loses power
  • The active SBC is restarted
  • The active SBC loses physical connectivity (broken cable, dead NIC, etc.) to an essential communications element
  • There is a loss of ping response from a default gateway
  • There is a system or individual service crash

Note that losing physical connectivity is not the same as losing logical connectivity. By this, I mean the difference between a broken cable and an unresponsive call processing server. In the case of logical connectivity failures, different routes can be taken rather than failing over to another SBC that will have the exact same connectivity problems. Failover from one SBC to another will not fix an unresponsive SIP carrier.
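The distinction between the two kinds of failure can be sketched as a simple decision rule. The fault names below are my own labels for the triggers listed above, not any vendor's terminology, and real products will have more conditions and more nuance.

```python
from enum import Enum, auto


class Fault(Enum):
    POWER_LOSS = auto()
    RESTART = auto()
    LINK_DOWN = auto()            # broken cable, dead NIC
    GATEWAY_PING_LOST = auto()
    SERVICE_CRASH = auto()
    PEER_UNRESPONSIVE = auto()    # logical: carrier or call server not answering


# Faults that are local to the active SBC, so a healthy standby can help.
FAILOVER_TRIGGERS = frozenset({
    Fault.POWER_LOSS,
    Fault.RESTART,
    Fault.LINK_DOWN,
    Fault.GATEWAY_PING_LOST,
    Fault.SERVICE_CRASH,
})


def react(fault: Fault) -> str:
    """Return the action an HA pair would take for a given fault."""
    if fault in FAILOVER_TRIGGERS:
        return "fail over to standby"
    # The standby would see the same unresponsive peer, so rerouting
    # on the active SBC is the only thing that can help.
    return "reroute on active"
```

With this rule, `react(Fault.LINK_DOWN)` hands control to the standby, while `react(Fault.PEER_UNRESPONSIVE)` keeps the active SBC in place and looks for another route.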

As important as it is to know when to fail over from active to standby, it’s just as important to know when not to fail over. You don’t want the standby SBC jumping the gun and unnecessarily taking control when the active SBC is just a tad slow to respond. For this, SBCs have timers to determine when a problem really is a problem.
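A timer of this kind might look something like the sketch below. It assumes the active SBC sends periodic heartbeats and that the standby waits for a configurable hold time before declaring a failure. The class name and the three-second hold time are illustrative only, and actual vendor timers will differ.

```python
from typing import Optional


class HeartbeatMonitor:
    """Standby-side view of the active SBC's heartbeats."""

    def __init__(self, hold_time: float) -> None:
        self.hold_time = hold_time          # seconds of silence tolerated
        self.last_seen: Optional[float] = None

    def heartbeat(self, now: float) -> None:
        """Record a heartbeat received from the active SBC at time `now`."""
        self.last_seen = now

    def active_is_down(self, now: float) -> bool:
        """True only once the silence has lasted longer than the hold time."""
        if self.last_seen is None:
            return False                    # nothing heard yet, no verdict
        return now - self.last_seen > self.hold_time


monitor = HeartbeatMonitor(hold_time=3.0)
monitor.heartbeat(now=10.0)
slow = monitor.active_is_down(now=12.5)     # a tad slow, not a failure
dead = monitor.active_is_down(now=13.5)     # silence exceeded the hold time
```

Passing the time in as an argument, rather than reading the clock inside the class, keeps the logic easy to test. A production monitor would read a monotonic clock.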

In addition to handling unexpected runtime problems, being highly available allows SBCs to be upgraded without stopping SIP traffic. The steps for this are:

  1. Upgrade the software or hardware on the standby SBC
  2. Take the active SBC down, which forces the standby SBC to become active
  3. Upgrade the software or hardware on the formerly active SBC

At this point, you can leave things running as they are, or fail back to the way it was at the start of the upgrade process. The beauty of this method is that a significant upgrade can be completely unnoticed by the outside world.
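The upgrade sequence can be modeled as a small simulation. This is only a sketch of the ordering described above. The `Sbc` class, the device names, and the version strings are made up for illustration, and a real upgrade is driven through the vendor's management tools.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sbc:
    name: str
    version: str
    role: str  # "active" or "standby"


def swap_roles(a: Sbc, b: Sbc) -> None:
    a.role, b.role = b.role, a.role


def rolling_upgrade(first: Sbc, second: Sbc, new_version: str,
                    fail_back: bool = False) -> List[str]:
    """Upgrade both members of an HA pair with one unit always active."""
    log: List[str] = []
    active, standby = (first, second) if first.role == "active" else (second, first)
    original_active = active

    # Upgrade the standby while the active keeps carrying calls.
    standby.version = new_version
    log.append(f"upgraded standby {standby.name}")

    # Taking the active down forces the standby to take over.
    swap_roles(active, standby)
    active, standby = standby, active
    log.append(f"{active.name} is now active")

    # The formerly active unit is now the standby and can be upgraded.
    standby.version = new_version
    log.append(f"upgraded standby {standby.name}")

    # Optionally return to the arrangement the pair started with.
    if fail_back:
        swap_roles(active, standby)
        log.append(f"failed back to {original_active.name}")
    return log


sbc_a = Sbc("sbc-a", "7.0", "active")
sbc_b = Sbc("sbc-b", "7.0", "standby")
steps = rolling_upgrade(sbc_a, sbc_b, "7.2")
```

After the run, both units are on the new version and `sbc-b` is the active member. Passing `fail_back=True` adds one more role swap so that `sbc-a` ends up active again.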

Vendor Specifics
While every SBC on the market pretty much supports high availability as I described above, there is plenty of room for vendors to differentiate their products from the competition. I spoke with my friends at AudioCodes and learned about some of the features it considers unique to its SBCs:

  • Single management interface (IP) for both systems
  • An optional parameter called “Revertive Mode” that allows a failed device to automatically regain its role as the primary SBC after it has recovered
  • The HA synchronization between members synchronizes not just call state and associated SIP/socket state, but auxiliary files, too (music on hold files, pre-recorded tone files, firmware, etc.)
  • Software upgrade of one member of the pair will cause the second member to get upgraded automatically
  • Migration from one to another (e.g. server to virtual)

My friends at Sonus stressed the depth and flexibility of its HA solution (e.g. copper or fiber for the synchronization link) while touting how its disaster recovery licensing saves an enterprise money when deploying SIP trunks at separate data centers.

Layer 2 vs. Layer 3
Remember the connection I spoke of between the active and standby SBC? Known by some vendors as the synchronization link, it’s essentially one or two Ethernet cables that directly connect the active SBC with the standby SBC. Depending on the vendor, it may be a straight or a crossover cable.

It’s absolutely essential to know that this is a Layer 2 connection. This means that the SBCs must be on the same subnet. While nearly every telecom director I speak with wishes that he or she could spread the active and standby SBCs across data centers, high availability is limited to two SBCs in very close physical proximity.
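As a quick planning aid, the same-subnet requirement can be checked with Python's standard `ipaddress` module before any hardware is racked. The addresses below are hypothetical, and matching subnets on paper do not prove that the two units actually share a Layer 2 segment. It is only a first sanity check.

```python
import ipaddress


def same_subnet(interface_a: str, interface_b: str) -> bool:
    """True if two 'address/prefix' strings fall in the same IP network."""
    a = ipaddress.ip_interface(interface_a)
    b = ipaddress.ip_interface(interface_b)
    return a.network == b.network


# Hypothetical addresses for a proposed active/standby pair.
same_site = same_subnet("10.1.20.11/24", "10.1.20.12/24")
split_sites = same_subnet("10.1.20.11/24", "10.2.20.12/24")
```

Here `same_site` is true and `split_sites` is false, which is exactly the situation the telecom director hoping to split a pair across data centers runs into.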

 

Mischief Managed
I hope this article helped make a somewhat complicated and slightly mysterious subject easier to understand. A little knowledge applied in the right way will save money while avoiding costly downtime.

Andrew Prokop writes about all things unified communications on his popular blog, SIP Adventures.