Thursday, December 15, 2011

The art of GRE troubleshooting with keepalives

Today I was presented with a couple of interesting problems surrounding GRE tunnels.  Because I am no expert with GRE, it took me a while to figure out what was wrong and how to fix it.

I've heard a lot of engineers say that keepalives will break a tunnel and that you shouldn't use them.  This is a half truth.  Since GRE tunnels are supposed to be stateless, they should never go down, and from a router's perspective they are up unless you explicitly shut the tunnel interface.  But the tunnel can be down without the router knowing about it.  Keepalives are the router's way of testing the tunnel to make sure it is functioning in both directions.

Here's how this works.  Router A builds a keepalive packet addressed back to itself and wraps it in a GRE header, so that Router B will send it back.  Then it wraps that packet in another GRE header destined for Router B.
Router A now sends this to Router B.
When Router B receives this packet, it removes the outer encapsulation and is left with a packet destined to Router A, so it returns the packet to Router A.

When Router A receives the packet and removes the GRE header, it's left with the original keepalive packet and knows the tunnel is functional.  If the packet does not return, it knows the tunnel is down and will eventually take the state of the tunnel to DOWN after enough keepalives fail.
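
To make the double wrapping concrete, here is a rough Python sketch of the packet layout.  The addresses (192.0.2.1 for Router A, 192.0.2.2 for Router B) and the helper names are made up for illustration, checksums are left at zero, and the inner GRE protocol type of 0 is my understanding of how Cisco builds the keepalive, so treat this as a sketch rather than a reference implementation.

```python
import socket
import struct

IPV4_HEADER_LEN = 20
GRE_HEADER_LEN = 4
IPPROTO_GRE = 47


def ipv4_header(src: str, dst: str, payload_len: int, ttl: int = 255) -> bytes:
    """Build a minimal 20-byte IPv4 header (checksum left at 0 for brevity)."""
    version_ihl = (4 << 4) | 5  # IPv4, header length of 5 x 32-bit words
    total_len = IPV4_HEADER_LEN + payload_len
    return struct.pack(
        "!BBHHHBBH4s4s",
        version_ihl, 0, total_len,
        0, 0,                      # identification, flags/fragment offset
        ttl, IPPROTO_GRE, 0,       # TTL, protocol, checksum (not computed)
        socket.inet_aton(src), socket.inet_aton(dst),
    )


def gre_header(protocol_type: int) -> bytes:
    """Build a basic 4-byte GRE header: no flags, version 0."""
    return struct.pack("!HH", 0, protocol_type)


def build_keepalive(router_a: str, router_b: str) -> bytes:
    """What Router A sends: outer (A -> B) wrapped around inner (B -> A)."""
    # Inner packet is addressed from B back to A (assumed GRE protocol type 0).
    inner = ipv4_header(router_b, router_a, GRE_HEADER_LEN) + gre_header(0)
    # Outer packet is ordinary tunnel encapsulation; 0x0800 marks an IPv4 payload.
    outer = ipv4_header(router_a, router_b, GRE_HEADER_LEN + len(inner))
    return outer + gre_header(0x0800) + inner


def decapsulate(packet: bytes) -> bytes:
    """What Router B does: strip the outer IPv4 and GRE headers."""
    return packet[IPV4_HEADER_LEN + GRE_HEADER_LEN:]


packet = build_keepalive("192.0.2.1", "192.0.2.2")
inner = decapsulate(packet)
print(len(packet), len(inner))  # 48 24
```

The inner packet comes out at 24 bytes (a 20-byte IP header plus a 4-byte GRE header), which appears to line up with the len=24 that shows up in the keepalive debug output.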

On a Cisco router, the keepalive command is pretty simple to implement.

RouterA# conf t
RouterA(config)# interface tunnel 0
RouterA(config-if)# keepalive
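
As far as I know, the keepalive command also accepts an optional period and retry count, and defaults to 10 seconds and 3 retries when you leave them off.  The bookkeeping behind it can be modeled roughly like this in Python.  The class name, the field names, and the exact point at which the tunnel is declared down are my assumptions, not something taken from Cisco's implementation.

```python
from dataclasses import dataclass


@dataclass
class KeepaliveTracker:
    """Toy model of a tunnel keepalive counter (all names are hypothetical)."""

    retries: int = 3       # assumed default retry count
    counter: int = 0       # keepalives sent since the last reply
    line_up: bool = True

    def on_send(self) -> None:
        # Assumption: the tunnel is declared down once `retries` keepalives
        # in a row have gone unanswered by the time the next one is due.
        if self.counter >= self.retries:
            self.line_up = False
        self.counter += 1

    def on_receive(self) -> None:
        # Any returned keepalive resets the counter and brings the tunnel up.
        self.counter = 0
        self.line_up = True


tracker = KeepaliveTracker()
for _ in range(4):          # four keepalives sent, none returned
    tracker.on_send()
print(tracker.line_up)      # False
tracker.on_receive()        # a reply finally makes it back
print(tracker.line_up)      # True
```

With the assumed defaults, a tunnel whose replies stop coming back would be marked down roughly 30 to 40 seconds later.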

To see the status of the keepalives, you can do a "debug tunnel keepalive" and see the following output:

Dec 15 23:16:29.413: Tunnel19: sending keepalive,> (len=24 ttl=255), counter=1
Dec 15 23:16:39.413: Tunnel19: sending keepalive,> (len=24 ttl=255), counter=2
Dec 15 23:16:49.413: Tunnel19: sending keepalive,> (len=24 ttl=255), counter=3

As you can see, the counter keeps climbing and no keepalives are coming back.  In this case, the issue was a NAT problem where the GRE packets were being NAT'd to the outside instead of being put into the IPsec tunnel.  Running a packet capture on the firewall showed that the packets were hitting the PAT overload rule and never matching the VPN crypto map.  With that fixed, we see the normal behavior.

Now we are seeing happy keepalives and the tunnel is back up.

Dec 15 23:21:52.425: Tunnel19: sending keepalive,> (len=24 ttl=255), counter=1
Dec 15 23:21:52.497: Tunnel19: keepalive received,> (len=24 ttl=252), resetting counter
Dec 15 23:22:02.425: Tunnel19: sending keepalive,> (len=24 ttl=255), counter=1
Dec 15 23:22:02.497: Tunnel19: keepalive received,> (len=24 ttl=252), resetting counter
Dec 15 23:22:12.425: Tunnel19: sending keepalive,> (len=24 ttl=255), counter=1
Dec 15 23:22:12.497: Tunnel19: keepalive received,> (len=24 ttl=252), resetting counter
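
If you want to pull numbers out of this kind of output, a small Python sketch like the one below pairs each "sending" line with the next "received" line on the same tunnel.  The function name and regular expression are my own, and the two sample lines are copied from the debug output as it appears in this post.

```python
import re
from datetime import datetime, timedelta

# Sample lines copied from the debug output in this post.
LOG = """\
Dec 15 23:21:52.425: Tunnel19: sending keepalive,> (len=24 ttl=255), counter=1
Dec 15 23:21:52.497: Tunnel19: keepalive received,> (len=24 ttl=252), resetting counter
"""

LINE = re.compile(
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d{3}): (?P<tunnel>\S+): "
    r"(?P<event>sending keepalive|keepalive received).*?ttl=(?P<ttl>\d+)"
)


def keepalive_round_trips(log: str):
    """Yield (tunnel, round-trip ms, TTL drop) for each send/receive pair."""
    pending = {}  # tunnel name -> (send time, TTL at send)
    for line in log.splitlines():
        m = LINE.search(line)
        if not m:
            continue
        when = datetime.strptime(m["time"], "%H:%M:%S.%f")
        ttl = int(m["ttl"])
        if m["event"] == "sending keepalive":
            pending[m["tunnel"]] = (when, ttl)
        elif m["tunnel"] in pending:
            sent_at, sent_ttl = pending.pop(m["tunnel"])
            # Integer division of timedeltas gives whole milliseconds.
            rtt_ms = (when - sent_at) // timedelta(milliseconds=1)
            yield m["tunnel"], rtt_ms, sent_ttl - ttl


print(list(keepalive_round_trips(LOG)))  # [('Tunnel19', 72, 3)]
```

For the lines above, that works out to a 72 ms round trip, with the TTL dropping from 255 to 252 by the time the keepalive returns.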

Tuesday, December 6, 2011

First Day: Why I'm blogging

Today was a graduation day of sorts for me.  Sure, I've had actual graduations before, but this one is a little special.  I was just recently promoted from an "Internal Systems Engineer" to a "Systems Engineer."  Big deal, right?  For me it actually is.  I've escaped the world of Operations and am now out designing and implementing networks for clients.  I have a new boss, different responsibilities and (hopefully) better pay.

But today the graduation wasn't about the new job.  It's about graduating from a consumer of information to a producer of information.  I saw a tweet by local company @metageek talking about being a delegate for Tech Field Day.  I'm really interested in the Tech Field Day stuff.  Not only do I think it's a great idea to get people excited about new technology, but I also think it's time for me to start giving back.

Now, I'm not a high-level networking guy (yet), but I feel the things I learn in my day-to-day role at a VAR would be useful to those in the field.  So let the fun begin.