Azure Virtual WAN migration – tales from the frontline

We recently conducted an Azure Virtual WAN migration for a customer to help resolve a number of issues in their legacy Azure networking environment. Some statistics about their environment included:

  • 60 connected virtual networks
  • 20 virtual network gateways deployed
  • 6 firewall instances
  • 2 SDWAN fabrics
  • 300+ route tables
  • multiple hub and spoke deployments

The goals for the program were set out as follows:

  • Simplified network design
  • Simplified network management
  • Centralised security controls for all traffic
  • Support for SDWAN platforms
  • Right sizing of resources such as ExpressRoute
  • Ability to support additional regions and office locations

I came up with the following high level design which was accepted:

The Azure Virtual WAN migration approach was run along the following lines:

  • Virtual WAN to be deployed into a parallel environment including new ExpressRoute circuits
  • ExpressRoute routers to be linked to provide access between legacy environment and the new virtual WAN environment
  • Staged migration to allow for dev/test environments to be migrated first to complete testing and prove capability

This meant during migration, we had an environment running similar to this:

[Figure: Azure Virtual WAN migration]

What follows is an outline of some of the issues we faced during the migration, from the “that seems obvious” to the “not so obvious”.

The “That seems obvious” issues we faced

Conflicting Network Address Ranges

This one seems the most obvious and hindsight is always 20-20. Some of the networks migrated were connected in strange and undocumented ways:

In this configuration, the network range was not automatically routed and could only be seen by its immediate neighbour. The migration process, however, broke and reconnected all peerings to meet the centrally managed traffic requirement. Once the network was migrated to the Virtual WAN, everything could see it, including a remote office site that used the same subnet for its local servers.
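A pre-migration check for this kind of clash can be sketched in a few lines of Python using the standard `ipaddress` module. The network names and address ranges below are hypothetical and only illustrate the idea; in practice the inventory would come from your own exports of virtual network and on-premises address space.

```python
import ipaddress
from itertools import combinations


def find_overlaps(prefixes):
    """Return (name_a, name_b) pairs whose address ranges overlap."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in prefixes.items()}
    return [
        (a, b)
        for (a, net_a), (b, net_b) in combinations(nets.items(), 2)
        if net_a.overlaps(net_b)
    ]


# Hypothetical inventory: one previously "hidden" VNet, one remote office range.
prefixes = {
    "vnet-hidden-spoke": "10.50.0.0/24",
    "remote-office-servers": "10.50.0.0/24",
    "vnet-app": "10.60.0.0/16",
}

overlaps = find_overlaps(prefixes)
for a, b in overlaps:
    print(f"Conflict: {a} overlaps {b}")
```

Running a check like this across every range that will become reachable after migration, not only the ranges that are routed today, would have surfaced the conflict before cutover.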

Network Policy for Private Endpoints required if using “Routing Intent and Routing Policies”

This one is also obvious in hindsight. It was missed initially due to inconsistent deployment configurations on subnets. Not all subnets were subject to route tables and user defined routes, so some subnets with private endpoints had been deployed without this configured:

Enabling “Routing Intent and Routing Policies” is effectively the same as applying a route table to the subnet, so a private endpoint network policy is required on any subnet that hosts private endpoints.
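One way to catch inconsistent subnets ahead of time is to audit an export of the subnet configuration. The sketch below assumes a simplified, flattened list of subnet records with `privateEndpoints` and `privateEndpointNetworkPolicies` fields, loosely modelled on the subnet properties Azure exposes. The record shape, the sample names, and the choice to treat a missing value as `Disabled` are assumptions made for illustration.

```python
def subnets_needing_policy(subnets):
    """Return names of subnets that host private endpoints but have
    private endpoint network policies disabled (or not set)."""
    return [
        s["name"]
        for s in subnets
        if s.get("privateEndpoints")
        # Assumption: a missing value is treated the same as "Disabled".
        and s.get("privateEndpointNetworkPolicies", "Disabled") == "Disabled"
    ]


# Hypothetical export of subnet settings.
subnets = [
    {"name": "snet-data", "privateEndpoints": ["pe-sql"],
     "privateEndpointNetworkPolicies": "Disabled"},
    {"name": "snet-app", "privateEndpoints": ["pe-storage"],
     "privateEndpointNetworkPolicies": "Enabled"},
    {"name": "snet-web", "privateEndpoints": [],
     "privateEndpointNetworkPolicies": "Disabled"},
]

flagged = subnets_needing_policy(subnets)
print(flagged)
```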

Propagate Gateway Routes

Some of the virtual networks contained firewalls that provided zone isolation between subnets. The default route table for a network with Routing Intent enabled sends Virtual Network traffic to VNETLocal. To shift the functionality to the new centralised firewalls, a local Virtual Network route via the managed appliance was needed.

Without “Propagate Gateway Routes” enabled, the locally generated route table at the device included the new Virtual Network route plus the default set of routes that Microsoft apply to all network interfaces, including a default route to the internet.

The “Not So Obvious” issues we faced

Enabling “Routing Intent and Routing Policies” for Private traffic only

When deciding how internet egress would be handled, the initial directive from the Cyber team was to ingress and egress through the same point. As there was a “default route” coming up the ExpressRoute from the on-premises connection, I turned on “Routing Intent and Routing Policies” for Private traffic only:

The unexpected behaviour of managing only internal traffic is that, in all connected Virtual Networks, the applied route table sends the RFC1918 address ranges via the managed application but otherwise keeps the default route table you would normally see on any standard virtual network. None of the routes advertised via the ExpressRoute gateways (including the on-premises default route) are propagated to the end devices. In the end, we needed to apply the “Internet traffic” policy via Routing Intent and Routing Policies as well, so that egress also went through our central managed applications.
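The effect is easiest to see with a longest-prefix-match lookup over a simplified effective route table. This is only a sketch: the route tables below are hand-written approximations of what we observed, the next-hop name `central-firewall` is hypothetical, and real effective routes contain more entries (such as the local Virtual Network range).

```python
import ipaddress


def next_hop(routes, destination):
    """Return the next hop of the most specific route matching destination."""
    dest = ipaddress.ip_address(destination)
    matches = [
        (ipaddress.ip_network(prefix), hop)
        for prefix, hop in routes
        if dest in ipaddress.ip_network(prefix)
    ]
    return max(matches, key=lambda m: m[0].prefixlen)[1]


# Private traffic policy only: RFC1918 goes to the firewall, but the
# platform default route still points straight at the internet.
private_only = [
    ("10.0.0.0/8", "central-firewall"),
    ("172.16.0.0/12", "central-firewall"),
    ("192.168.0.0/16", "central-firewall"),
    ("0.0.0.0/0", "Internet"),
]

# Private and Internet traffic policies: the default route also goes
# via the firewall.
private_and_internet = private_only[:3] + [("0.0.0.0/0", "central-firewall")]

print(next_hop(private_only, "10.20.1.4"))
print(next_hop(private_only, "8.8.8.8"))
print(next_hop(private_and_internet, "8.8.8.8"))
```

With the private-only table, the lookup for a public address falls through to the platform default rather than to the default route learned from on-premises, which is why internet egress bypassed the intended path.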

Asymmetric Routing

Asymmetric routing, the bane of Azure network administrators everywhere. With ExpressRoute connections to the on-premises network in two locations, all Virtual WAN networks are made available via two paths into the environment.

Hub-to-hub communication adds AS-Path information to the route, which should influence route selection, but in our case the on-premises router connections added even more. Traffic originating in Hub2 would therefore route onto the network via the local ExpressRoute, but the return path via Hub1 was the same length or shorter and so was preferred. With firewalls in play, the return traffic was dropped because it arrived asymmetrically and matched no known session.
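The following sketch models only the AS-Path length step of BGP path selection to show how the return path ends up on the other hub. The AS numbers, path contents, and circuit names are hypothetical, and real routers evaluate other attributes (such as local preference) before AS-Path length.

```python
def best_path(paths):
    """Pick the path with the shortest AS path; on a tie the first listed wins."""
    return min(paths, key=lambda p: len(p["as_path"]))


# Hypothetical view from the on-premises routers of a prefix that lives
# behind Hub2. The path via the Hub2-side circuit carries extra hops
# added by the on-premises router connections.
paths_to_hub2_prefix = [
    {"via": "ExpressRoute-Hub2", "as_path": [64601, 64602, 12076, 65515]},
    {"via": "ExpressRoute-Hub1", "as_path": [12076, 65515, 65515]},
]

outbound_via = "ExpressRoute-Hub2"  # traffic left Azure through the local circuit
return_via = best_path(paths_to_hub2_prefix)["via"]
asymmetric = return_via != outbound_via

print(f"outbound: {outbound_via}, return: {return_via}, asymmetric: {asymmetric}")
```

Any fix, whether Route Maps or policy on the ExpressRoute routers, amounts to making the path through the local circuit the preferred one in both directions.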

There were two ways to handle this. The new (Preview) Route Map feature for Virtual Hubs is designed to assist with this issue by using policies to add AS-Path information to the route. The problem is that, at the time of writing, this feature is in preview and we are in production.

The alternative was to use BGP community tags and allow the ExpressRoute routers to apply policy based routing information to the AS path.

BGP Community tags

On the surface, this looked to be a simple solution. Apply a BGP community tag to each VNET based on the hub it is peered with. By default, the BGP community region tag is also automatically applied and this information is shared with the ExpressRoute routers.

Except Virtual WAN does not fully support BGP community tagging. Community tags are shared with the local ExpressRoute service but are stripped in inter-hub communication. Applying region-specific policies to the routing path is not possible if both regions' community tags are not shared.

Next-gen firewalls

The next-gen managed applications that were deployed also presented a number of issues in our migration configuration. Some of these are vendor agnostic and not specific to our deployment; others are specific to the brand.

I will cover these in another post.
