Thursday, December 15, 2011

The art of GRE troubleshooting with keepalives

Today I was presented with a couple of interesting problems surrounding GRE tunnels.  Because I am no expert with GRE, it took me a while to figure out what was wrong and how to fix it.

I've heard a lot of engineers say that keepalives will break a tunnel and that you shouldn't use them.  This is a half-truth.  Since GRE tunnels are supposed to be stateless, they should never go down, and from a router's perspective they are up unless you explicitly shut the tunnel interface.  But the tunnel can be broken without the router knowing about it.  Keepalives are the router's way of testing that the tunnel is functioning in both directions.

Here's how this works.  Router A builds a keepalive packet and wraps it in a GRE header addressed back to itself, as if Router B had sent it.  Then it wraps that packet in another GRE header destined for Router B and sends the result to Router B.  When Router B receives this packet, it removes the outer encapsulation and is left with a packet destined for Router A, so it simply returns the packet to Router A.
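
The nesting described above can be sketched in a few lines of Python.  This is only an illustration, not Cisco's implementation: the function names are made up for this example, the IP checksum is left at zero, and the addresses are borrowed from the debug output later in this post.  It assumes, as I understand Cisco's behavior, that the inner GRE header carries protocol type 0 and the outer one carries the IPv4 protocol type.

```python
import socket
import struct

GRE_PROTO_IPV4 = 0x0800       # outer GRE header: the payload is an IPv4 packet
GRE_PROTO_KEEPALIVE = 0x0000  # inner GRE header: assumed protocol type 0, no payload


def ipv4_header(src, dst, payload_len, ttl=255):
    """Minimal 20-byte IPv4 header carrying GRE (IP protocol 47); checksum left at 0."""
    version_ihl = (4 << 4) | 5
    return struct.pack("!BBHHHBBH4s4s", version_ihl, 0, 20 + payload_len,
                       0, 0, ttl, 47, 0,
                       socket.inet_aton(src), socket.inet_aton(dst))


def gre_header(proto_type):
    """Basic 4-byte GRE header: no flags, version 0, then the protocol type."""
    return struct.pack("!HH", 0, proto_type)


def build_keepalive(router_a, router_b):
    """Router A's view: an inner packet addressed B -> A, wrapped in an outer A -> B packet."""
    inner = ipv4_header(router_b, router_a, 4) + gre_header(GRE_PROTO_KEEPALIVE)
    outer = ipv4_header(router_a, router_b, 4 + len(inner)) + gre_header(GRE_PROTO_IPV4)
    return outer + inner


def decapsulate(packet):
    """Strip one IPv4 + GRE layer and return (src, dst, gre_proto, payload)."""
    src = socket.inet_ntoa(packet[12:16])
    dst = socket.inet_ntoa(packet[16:20])
    _flags, proto_type = struct.unpack("!HH", packet[20:24])
    return src, dst, proto_type, packet[24:]


packet = build_keepalive("10.19.2.1", "10.4.11.1")
# Router B strips the outer layer and finds a packet addressed to Router A.
outer_src, outer_dst, outer_proto, inner = decapsulate(packet)
# Router A strips the returned packet and finds an empty keepalive.
inner_src, inner_dst, inner_proto, rest = decapsulate(inner)
print(outer_src, "->", outer_dst, "carries", inner_src, "->", inner_dst)
```

In this sketch the inner packet is 24 bytes (a 20-byte IP header plus a 4-byte GRE header), which seems to line up with the len=24 reported by the debug output further down.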

When Router A receives the packet and removes the GRE header, it's left with the original keepalive packet and knows the tunnel is functional.  If the packets stop returning, Router A counts the misses and, after enough consecutive keepalives fail, takes the state of the tunnel to DOWN.
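
The counting logic can be modeled roughly like this.  It is a simplified sketch with hypothetical names (KeepaliveMonitor, on_send, on_receive), and the retry count of 3 is an assumption rather than something taken from this router's configuration.  The exact point at which IOS declares the tunnel down may differ by one interval from what is shown here.

```python
class KeepaliveMonitor:
    """Toy model of a tunnel keepalive counter (not Cisco's actual code)."""

    def __init__(self, retries=3):
        self.retries = retries  # assumed number of misses tolerated
        self.counter = 0        # keepalives sent since the last reply
        self.up = True

    def on_send(self):
        """Called at every keepalive interval; returns the counter value for that send."""
        # If the allowed number of keepalives is already outstanding, give up.
        if self.counter >= self.retries:
            self.up = False
        self.counter += 1
        return self.counter

    def on_receive(self):
        """Called when a keepalive comes back: reset the counter, tunnel is up."""
        self.counter = 0
        self.up = True


monitor = KeepaliveMonitor(retries=3)
for _ in range(4):
    counter = monitor.on_send()
    print("sending keepalive, counter=%d, up=%s" % (counter, monitor.up))
```

With no replies arriving, the counter climbs 1, 2, 3 just as it does in the failing debug output below, and the model marks the tunnel down on the following interval.  A single call to on_receive() resets the counter and brings it back up.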

On a Cisco router, the keepalive command is pretty simple to implement.


RouterA# conf t
RouterA(config)# interface tunnel 0
RouterA(config-if)# keepalive


To see the status of the keepalives, you can run "debug tunnel keepalive" and see output like the following:


Dec 15 23:16:29.413: Tunnel19: sending keepalive, 10.0.0.2->10.0.0.3 (len=24 ttl=255), counter=1
Dec 15 23:16:39.413: Tunnel19: sending keepalive, 10.0.0.2->10.0.0.3 (len=24 ttl=255), counter=2
Dec 15 23:16:49.413: Tunnel19: sending keepalive, 10.0.0.2->10.0.0.3 (len=24 ttl=255), counter=3


As you can see, we're not seeing the return traffic: keepalives go out and the counter climbs, but nothing comes back.  In this case, the issue was a NAT problem where the GRE packets were being NAT'd to the outside instead of being put into the IPsec tunnel.  A packet capture on the firewall showed that the packets were hitting the PAT overload rule and never matching the VPN crypto map.  With this fixed, we see the normal behavior.



Now we are seeing happy keepalives and the tunnel is back up.

Dec 15 23:21:52.425: Tunnel19: sending keepalive, 10.19.2.1->10.4.11.1 (len=24 ttl=255), counter=1
Dec 15 23:21:52.497: Tunnel19: keepalive received, 10.19.2.1->10.4.11.1 (len=24 ttl=252), resetting counter
Dec 15 23:22:02.425: Tunnel19: sending keepalive, 10.19.2.1->10.4.11.1 (len=24 ttl=255), counter=1
Dec 15 23:22:02.497: Tunnel19: keepalive received, 10.19.2.1->10.4.11.1 (len=24 ttl=252), resetting counter
Dec 15 23:22:12.425: Tunnel19: sending keepalive, 10.19.2.1->10.4.11.1 (len=24 ttl=255), counter=1
Dec 15 23:22:12.497: Tunnel19: keepalive received, 10.19.2.1->10.4.11.1 (len=24 ttl=252), resetting counter
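
If you have a lot of this debug output to wade through, a small script can tell you at a glance whether replies are coming back.  The following is a rough sketch that assumes the log lines look exactly like the samples in this post; the function name summarize and the regular expressions are my own, and other IOS versions may format these lines differently.

```python
import re

SENT = re.compile(r"(Tunnel\d+): sending keepalive, .*counter=(\d+)")
RECEIVED = re.compile(r"(Tunnel\d+): keepalive received")


def summarize(debug_text):
    """Return {tunnel: {"sent": n, "received": n, "max_counter": n}} from debug output."""
    stats = {}
    for line in debug_text.splitlines():
        sent = SENT.search(line)
        received = RECEIVED.search(line)
        match = sent or received
        if not match:
            continue  # not a keepalive line
        entry = stats.setdefault(match.group(1),
                                 {"sent": 0, "received": 0, "max_counter": 0})
        if sent:
            entry["sent"] += 1
            entry["max_counter"] = max(entry["max_counter"], int(sent.group(2)))
        else:
            entry["received"] += 1
    return stats


# Sample lines copied from the two debug captures in this post.
BROKEN = """\
Dec 15 23:16:29.413: Tunnel19: sending keepalive, 10.0.0.2->10.0.0.3 (len=24 ttl=255), counter=1
Dec 15 23:16:39.413: Tunnel19: sending keepalive, 10.0.0.2->10.0.0.3 (len=24 ttl=255), counter=2
Dec 15 23:16:49.413: Tunnel19: sending keepalive, 10.0.0.2->10.0.0.3 (len=24 ttl=255), counter=3
"""

HEALTHY = """\
Dec 15 23:21:52.425: Tunnel19: sending keepalive, 10.19.2.1->10.4.11.1 (len=24 ttl=255), counter=1
Dec 15 23:21:52.497: Tunnel19: keepalive received, 10.19.2.1->10.4.11.1 (len=24 ttl=252), resetting counter
Dec 15 23:22:02.425: Tunnel19: sending keepalive, 10.19.2.1->10.4.11.1 (len=24 ttl=255), counter=1
Dec 15 23:22:02.497: Tunnel19: keepalive received, 10.19.2.1->10.4.11.1 (len=24 ttl=252), resetting counter
"""

for label, text in (("broken", BROKEN), ("healthy", HEALTHY)):
    print(label, summarize(text))
```

A tunnel with keepalives sent but none received, and a counter that keeps climbing, is the signature of the one-way problem described above.  A counter that never gets past 1 means every keepalive is making the round trip.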