Managing Inconsistent AOVPN Disconnects

With the radical change to working methods that COVID-19 has brought worldwide, more of us are working from home, either by requirement or because we now enjoy the flexibility this style of working has brought to the modern workplace.

This increased demand requires technology to enable it, and Microsoft’s Always On VPN (AOVPN) is being adopted by a large number of businesses due to its ease of deployment and the functionality it brings.

The Problem…

So you have Always On deployed, clients are connecting, connectivity is relatively stable and working well for you; connections come in across a perimeter firewall because you’ve designed it right. But sometimes, the connection for the client – which was working perfectly fine – suddenly gives an 809 error (cannot contact VPN Server).

No changes have occurred on the client, the server or the firewall, and Richard Hicks’ articles have been consulted and confirm that all is as it should be. Yet removing the client’s session from the firewall for the user’s public IP remediates the issue, and the client is able to connect again. Not ideal, but a workaround at least.

AOVPN Mobility Behaviour

The IKEv2 mobility functionality allows the VPN server to keep the connection “up” for a period of time after the client has been abruptly disconnected. The Keepalives should time out after a period, but observation in the wild, across multiple environments, suggests that they do not. This by itself does not cause issues for the VPN service, as when the client comes back online it either reconnects the session or creates a new one. However, it can cause unexpected results when dealing with other devices on the network, such as a perimeter firewall.

Often, these devices are configured to disconnect “dead” links after a period of inactivity; the Palo Alto firewall in this case was set to 30 seconds. Because the unwanted Keepalives kept crossing the wire, the firewall never closed the session, so when the client tried to reconnect the firewall refused the connection, resulting in the 809 error. The only workaround in this case was to clear the firewall session, after which the client was able to connect again without any configuration or remediation on the server or client.
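To make the interaction easier to picture, here is a minimal Python sketch of an idle-timeout session table. It is a toy model, not how any particular firewall is implemented: the function names, the 10-second Keepalive interval and the packet counts are all assumptions for illustration. Only the 30-second idle timeout comes from the case described above.

```python
def session_expiry(packet_times, idle_timeout):
    """Return the time (seconds) at which an idle-timeout firewall would
    age out a session, given the arrival times of packets on that session."""
    times = sorted(packet_times)
    last_seen = times[0]
    for t in times[1:]:
        if t - last_seen > idle_timeout:
            break  # the gap exceeded the timeout, so the session was already gone
        last_seen = t  # any packet, including a Keepalive, resets the idle timer
    return last_seen + idle_timeout


def keepalive_times(start, interval, count):
    """Hypothetical helper: `count` Keepalives sent every `interval` seconds after `start`."""
    return [start + interval * i for i in range(1, count + 1)]


IDLE_TIMEOUT = 30        # seconds, as configured on the firewall in this case
KEEPALIVE_INTERVAL = 10  # seconds, assumed purely for illustration
client_dropped_at = 0    # the client's last packet

# No Keepalives after the drop: the session ages out after the idle timeout.
no_keepalives = session_expiry([client_dropped_at], IDLE_TIMEOUT)

# Ten Keepalives after the drop: the session lingers a little longer, then clears.
ten_keepalives = session_expiry(
    [client_dropped_at] + keepalive_times(client_dropped_at, KEEPALIVE_INTERVAL, 10),
    IDLE_TIMEOUT,
)

# An hour of Keepalives: the stale session outlives any realistic reconnect attempt.
hour_of_keepalives = session_expiry(
    [client_dropped_at] + keepalive_times(client_dropped_at, KEEPALIVE_INTERVAL, 360),
    IDLE_TIMEOUT,
)

print(no_keepalives, ten_keepalives, hour_of_keepalives)
```

The point of the model is simply that every Keepalive resets the idle timer, so as long as the server keeps sending them the firewall has no reason to age out the stale session.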

A Potential Solution

Looking further into this issue for a customer, to avoid having to employ the workaround, we found little or no documentation on restricting the frequency of IKEv2 Keepalives, aside from reducing the Idle or Network Outage times of the Mobility functionality.

Working on a test system, we were able to replicate these unwanted disconnects by initiating Airplane mode on the test device, and we saw Keepalives being generated past the Network Outage threshold. Reducing this from the default 30 minutes to a more aggressive threshold of 5 minutes might not be right for every environment, but in initial testing normal VPN tunnel connectivity was not affected. However, the Keepalive issue had not been resolved, and so the firewall issue persisted.
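If you want to repeat this check against your own packet capture, something like the following Python sketch could help. It assumes you have already exported the server's Keepalive timestamps (in seconds) from the capture; the function name and the sample data are made up for illustration and are not our test results.

```python
def keepalives_past_outage(keepalive_times, dropped_at, outage_minutes):
    """Return the Keepalive timestamps (seconds) seen after the
    Network Outage time has elapsed following the client drop."""
    cutoff = dropped_at + outage_minutes * 60
    return [t for t in keepalive_times if t > cutoff]


# Made-up capture: one Keepalive every 60 seconds for 40 minutes after the drop.
capture = [60 * i for i in range(1, 41)]

late_default = keepalives_past_outage(capture, dropped_at=0, outage_minutes=30)
late_aggressive = keepalives_past_outage(capture, dropped_at=0, outage_minutes=5)

print(len(late_default), len(late_aggressive))
```

Any non-empty result means the server was still sending Keepalives after the configured Network Outage time, which is the behaviour described above.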

An older Microsoft reference article for L2TP and PPTP protocol settings outlines settings that can be added to limit the frequency and timing of the Keepalives. As IKEv2 also uses HelloMS and ACK for its Keepalive mechanism, the following should theoretically work for the IKEv2 service.

This is experimental and is currently undergoing testing so please take these suggestions as unsupported/undocumented.  PowerON takes no responsibility for issues or outages that may occur following implementation of the below.

Locate the appropriate registry key hive for IKEv2 and make the following changes to add the DWORDs as required:

After implementing the changes and rebooting the server, we repeated the Airplane mode test. Once the VPN connection shown in the RRAS console had been dropped uncleanly, only a further ten Keepalive packets were observed from the server, as desired.

So we now potentially have a mechanism to regulate this known behaviour of the AOVPN server when using the IKEv2 protocol for VPN tunnels.

Results of this change will be provided to Microsoft, who are reportedly looking into potential future patches for this and other issues with the AOVPN service.

Interested in AOVPN?

If you would like to try AOVPN for your organisation as a proof of concept or emergency stop-gap solution for your business, we’ve created a guide which details how to do this.  

This free resource is not designed to be an enterprise grade solution, but if you do want to evaluate this technology, it will guide you through the first steps. 

If you have any questions around the documentation or AOVPN solutions in general, feel free to email info@poweronplatforms.com and we’ll be happy to help.
