The Art of Network Engineering
Join us as we explore the world of Network Engineering! In each episode, we explore new topics, talk about technology, and interview people in our industry. We peek behind the curtain and get insights into what it's like being a network engineer - and spoiler alert - it's different for everyone!
For more information check out our website https://artofnetworkengineering.com | Be sure to follow us on Twitter and Instagram as well @artofneteng | Co-Host Twitter Handle: Andy @andylapteff
Bytes and Books: EVPN VXLAN, with author Aninda Chatterjee
This episode discusses the complexities of VXLAN and EVPN in modern networking, highlighting the transition from traditional layer 2 designs to more efficient layer 3 systems. With insights from expert Aninda Chatterjee, listeners learn the importance of understanding underlying technologies for effective network management.
• Personal experiences with home lab setups and troubleshooting
• Explanation of VXLAN as a data plane encapsulation method
• Challenges faced in traditional data center architectures
• Transitioning from layer 2 to layer 3 using leaf and spine architectures
• How EVPN enhances VXLAN functionality by managing MAC addresses
• Importance of understanding technology for effective troubleshooting
• Aninda’s insights on becoming a technical author in the networking field
Find everything AONE right here: https://linktr.ee/artofneteng
This is the Art of Network Engineering podcast. In this podcast, we'll explore tools, technologies and talented people. We aim to bring you information that will expand your skill sets and toolbox and share the stories of fellow network engineers. Welcome to the Art of Network Engineering. I am AJ Murray and for this episode I am joined by Andy Lapteff. Andy, how you doing man?
Speaker 2:AJ, I'm good. I was just telling you guys right before the show started, I've been banging away in my home lab for two days trying to tear out ESXi and replace it with Proxmox, for reasons. And I have advice for anyone listening, if you didn't know this: do one thing at a time. The reason it's taken me two days to get Proxmox working is because I also decided to segment my home network at the same time. So I'm creating all these zones and VLANs, and I have a Ubiquiti UDM Pro, and I thought I knew what I was doing. But the connectivity wouldn't work. I'm on my PC in one VLAN, I'm trying to get over to the VM in another. I thought I had inter-VLAN communication. I'm like, just give me a CLI, this ClickOps stuff. So I finally just tore everything out, put it all in one blast radius. Everything's in a default VLAN. Yay, it works. So do one thing at a time is my advice.
Speaker 1:What's the expression? Like, don't half-ass a bunch of things, whole-ass one thing, or whatever.
Speaker 2:Yeah, and I couldn't figure out what was wrong. I tried to install Proxmox probably 15 times in the past two days. I just kept reinstalling it because I'm like, well, maybe the SD card's bad, or maybe it had to be in the array, or maybe RAID 6, I mean, whatever, right? I'm making stuff up, and I had no idea how to troubleshoot it. And then it occurred to me, I'm like, huh, I wonder if it's routing. And yeah, so anyway. Well, that's a good troubleshooting step, right?
Speaker 1:Like, you know, make only one change at a time and then test it and see what's happening.
Speaker 2:Yeah, and the public, the community, had plenty of feedback. People were trying to give me Linux commands and tcpdump and check this and look at that. I mean, ultimately I broke something and then got mad at the wrong thing. I'm like, why won't Proxmox work? Oh, I apparently still don't understand routing. That's how my two days went. But it's good right now. Tomorrow I'll be able to do what I need to do and start labbing. So I'm always learning, right? Like, I didn't realize, I configured a RAID 5 array years ago on my server, and then I had a disk fail. So I'm like, hey buddy, you get another disk failure, you're screwed. Oh, RAID 6 is better, two disks can fail. Okay. So I did that too. I'm learning stuff as I go. How are you doing, AJ?
Speaker 1:Doing good, man, doing good. I've been getting back to my roots. I've been playing on the CLI at work a little bit, which has been fun.
Speaker 1:I haven't done that for a while, so yeah, it's been good, something different. I am excited to welcome our guest, Aninda, this evening, because we are going to talk about EVPN VXLAN. I love VXLAN. I'm not an expert in it; I've cut my fingers, bloodied my knuckles, and had a lot of fun playing with it in various situations. But, Aninda, thank you so much for joining us. Thanks for having me, guys. Happy to be here. So we know who you are, but what do you do, and what's your relationship status with VXLAN?
Speaker 3:with VXLAN. So I am always I'm a TME slash solutions engineer at Nokia as of now, so previously Cisco and Juniper and I work in the DC space, and I have been working in the DC space for several years now. I do all sorts of things under this umbrella, which includes a lot of hands-on, so it's a lot of lab work, obviously in the EVP and VXN space, and then previously I was doing a lot of SD access for Cisco when I was a TME at Cisco. Last few years have mostly been DC and has mostly been EVP and VXN across enterprise telco clouds, web scalers, hyperscalers, you name it. So it's been an interesting few years in this space and, of course, the fun AI stuff that's been going on.
Speaker 1:Right, yeah, absolutely. So you've been working with VXLAN for a while. EVPN VXLAN, it's kind of a mouthful, and there are actually a couple of different things going on there. So let's start first with VXLAN. What is it?
Speaker 3:So I mean VXLAN is just a data plane encapsulation method. So you're essentially adding a bunch of layers to elevate services from a different layer to a layer they call the overlay, and think of it like MPLS or GRE as an example. So MPLS would add a bunch of labels, and then the labels would essentially take you from one PE to the other. So VXLAN does something similar. It's the actual data plane headers that you add when you want to take the packet from one endpoint to another, and in VXLAN terminology they call that a VXLAN tunnel endpoint, or a VTEP. So the VTEP essentially is a node that supports this functionality, or it has the ability to encap and decap VXLAN packets. So you're just adding a bunch of headers, and the headers are an Ethernet header, an IP header, a UDP header and a VXLAN header. So that's 50 bytes of overhead that you add on top of the original payload or packet.
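To make that arithmetic concrete, here's a minimal Python sketch of the 50-byte overhead Aninda describes. The byte counts assume an untagged outer Ethernet header and an IPv4 underlay, and the VXLAN header layout follows RFC 7348; this is an editorial illustration, not anything discussed verbatim on the show.

```python
import struct

# Outer headers a VTEP prepends to the original Ethernet frame
# (assuming an untagged outer Ethernet header and an IPv4 underlay).
OUTER_ETHERNET = 14  # dst MAC (6) + src MAC (6) + EtherType (2)
OUTER_IPV4 = 20      # standard IPv4 header, no options
OUTER_UDP = 8        # src port, dst port 4789, length, checksum
VXLAN = 8            # flags, reserved bits, 24-bit VNI

def vxlan_overhead() -> int:
    """Total VXLAN encapsulation overhead in bytes."""
    return OUTER_ETHERNET + OUTER_IPV4 + OUTER_UDP + VXLAN

def vxlan_header(vni: int) -> bytes:
    """Build the 8-byte VXLAN header per RFC 7348: I-flag set, 24-bit VNI."""
    flags = 0x08000000                       # "VNI present" bit in the top word
    return struct.pack("!II", flags, (vni & 0xFFFFFF) << 8)

print(vxlan_overhead())   # 50
```

The practical consequence of those 50 bytes is the usual advice to raise the underlay MTU (jumbo frames) so the encapsulated payload still fits.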
Speaker 2:Did you say four different headers? Yeah, I thought it was one.
Speaker 3:Yeah, should have been 10. Might have been easier with maybe 10.
Speaker 2:Why does this exist? Like what problem does VXLAN solve? Why are we adding four headers to every frame in our data center?
Speaker 3:So you kind of sort of have to go back to how, in general, DCs have evolved over time and the problems that we're trying to solve for today that existed with how we designed things in the past. What you would have is a three-tiered architecture as an example: an access layer, a distribution layer and a core. And then occasionally you would collapse the core and the distribution, and you would just have a collapsed core design, right? And predominantly you would have Layer 2 links between the access and the distribution, and good old spanning tree to make sure there's no broadcast storms, no Layer 2 loops, because that's bad. And then at the distribution and above, you would typically have Layer 3 routing. So it would be: just get me out of the network as quickly as I can towards the core, and the core takes me out towards the WAN.
Speaker 3:Obviously, spanning tree had a lot of inefficiencies. You couldn't predict failures, and when something fails it is very hard to predict how you would converge. You had convergence issues, and of course you were always susceptible to a Layer 2 loop, even if you've designed the network really well and you have all of your guards in place. You know, there were some pretty horrible things that could go wrong that would melt your infrastructure across your domain, right?
Speaker 2:Yeah, go ahead. Quick question, sorry. So, did vMotion exist yet at this point in time that we're talking about?
Speaker 3:Yeah, so you obviously had these Layer 2 requirements at that point in time also, which is why Layer 2 was such a big driving factor behind some of these designs, right. But the other factor was that the network was always looked at from the perspective of scaling up, which was the other problem. So you would have like a monolithic architecture from a server point of view, and what that means is, you know, let's say I have a web application and it has to talk to a database, and maybe it has a browser engine which caters to all of your inbound HTTP requests, right? Now, all of these different constructs were typically coupled into a single server, and you would have that single server doing everything, right? So any conversation that had to happen between these applications to serve the umbrella application was all within one single piece of physical hardware that would connect to your access, or your leaf, right?
Speaker 2:I call these the good old days. Good old, simpler days, yes.
Speaker 3:But the problem that presented was: now let's say you want to scale up and you want to handle more requests, right? So you start scaling up that particular server. This could be in terms of CPU, memory, the NIC speeds themselves. But that starts to have a direct impact on the network infrastructure it's connected to, because now you have to scale up your immediately connected network infra as well, so you have to scale up the speeds on your network infrastructure side, and so on.
Speaker 3:So again, scaling up was not a very good design. So eventually the industry started moving away from Layer 2 to Layer 3. So it's a fully routed architecture between your leaf (so, your access) and your distribution and your core, and you get a lot of benefits from this. Now you can start to do ECMP across the fabric. I could have a routing protocol that distributes all of my server infrastructure information, and then I could start to do ECMP from the access up and down to the other access layer where my other servers are connected. ECMP is obviously more predictable: routing is more predictable, convergence is more predictable, latency is more predictable. So you get a host of benefits through this design, right?
Speaker 3:And then of course, there was this shift from this three-tier core-distribution-access design towards a Clos sort of architecture.
Speaker 3:So, sorry, going really far back: Charles Clos was the guy who formalized a way of doing fabrics for telephony systems. For telephony systems you would use crossbars, right, and crossbars would have X number of inputs and Y number of outputs, and as you start connecting them, you could get an input line going to an output line and somebody talking to somebody else, right.
Speaker 3:So you take the same sort of logic. What he did was say the way to reduce complexity but continue to scale these telephony systems was to start building the fabric in stages. So I would have an ingress stage where the call comes in, then I would have a middle stage which acts like a fan-out for the egress stage, where the call exits and connects to somebody on the other side, right. So we took the same logic and eventually applied it to data center fabrics, where you have an ingress stage, which we call the leaf, you have the middle stage, which we started calling the spines, and then you have the egress stage where traffic exits out, which we call the leaves again. So it's essentially collapsed into this leaf-spine architecture, which is the terminology you'll hear thrown around quite a bit today.
Speaker 2:Thank you for pronouncing his name correctly.
Speaker 3:Yeah, he was French, so it's Charles Clos, pronounced "Clo".
Speaker 2:Almost everyone says "Closs", and I don't know why. It drives me crazy.
Speaker 1:It's one of those things where, as a native English speaker, an American is probably going to look at that and say "Closs" because they see C-L-O-S, unless you have some sort of linguistic knowledge. You know, being where I am up near Canada, a lot of our local language is driven from French, right? So when I looked at it, I was like, oh okay, all right, I got it.
Speaker 3:I mean, I didn't know the correct pronunciation either. So credit to Russ White, who is probably the guru in terms of everything DC.
Speaker 2:So I've learned a ton from him, including pronunciations of names I never knew. I've got to love Russ. Thank you for all your knowledge, Russ.
Speaker 2:When I was working in a Verizon central office, I worked on a telco frame, and 20 years later I learned that was a Charles Clos design I was running wires on.
Speaker 1:So let's try to summarize what we've just talked about here. We've got VXLAN, we have our spines, we have our leaves, and we have our border leaves where traffic will ingress and egress out of the fabric. We have a lot of flexibility, in that we can scale in a couple of different ways. If we need additional bandwidth between our leaves, we can add spines.
Speaker 3:Correct, we're just scaling out. Right, scaling out. And if we need to add additional ports, then we can just add additional leaves. Correct, exactly. So you move away from scaling up, where I've got to, you know, swap out my NICs for higher-speed NICs, or I've got to swap out the leaf for something that supports more line cards or more speeds, to just scaling horizontally instead.
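The scale-out trade Andy and Aninda are describing (leaves buy ports, spines buy fabric bandwidth) can be sketched with a few lines of back-of-the-envelope math. The topology numbers are hypothetical, and the sketch assumes each leaf has exactly one equal-speed uplink to every spine.

```python
def fabric_capacity(leaves: int, spines: int, server_ports_per_leaf: int,
                    server_gbps: float, uplink_gbps: float) -> dict:
    """Back-of-the-envelope numbers for a leaf-spine fabric, assuming
    every leaf has exactly one uplink to every spine and all server
    ports run at the same speed."""
    downlink_bw = server_ports_per_leaf * server_gbps   # toward servers
    uplink_bw = spines * uplink_gbps                    # toward the fabric
    return {
        "server_ports": leaves * server_ports_per_leaf,
        # server-facing bandwidth divided by fabric-facing bandwidth
        "oversubscription": downlink_bw / uplink_bw,
    }

# Need more ports? Add leaves. Need more fabric bandwidth? Add spines:
small = fabric_capacity(leaves=8, spines=2, server_ports_per_leaf=48,
                        server_gbps=25, uplink_gbps=100)
wider = fabric_capacity(leaves=8, spines=4, server_ports_per_leaf=48,
                        server_gbps=25, uplink_gbps=100)
```

Doubling the spines here halves the oversubscription ratio (6:1 to 3:1) without touching a single server port, which is exactly the horizontal-scaling property being discussed.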
Speaker 1:Right. And then all of these interconnects are Layer 3, so you have the benefit of no spanning tree. We're using all of our links, and we're using ECMP rather than, you know, LACP or something similar. Right? I think a lot of people just assume that you're going to get equal load balancing out of LACP, and you don't; there's a lot of hashing and guessing going on. Whereas with ECMP, it is what it is: it's equal-cost multipathing. So if you have two spines, you're sending the traffic equally to both of those spines; if you have three, so on and so forth.
Speaker 3:Again, I would say that it depends. It's not as equal as you think, even though it is ECMP, because most of the hashing that you do today is per flow and not per packet. If you start doing per packet, then you kind of have to deal with reordering at the server side, and with TCP, obviously, that's going to cause a host of problems, and even with UDP it's problematic. So you do per-flow load balancing, and it's typically like a five-tuple hash. So you're taking your source and destination MAC, your source and destination IP and, let's say, the port number, the UDP port or TCP port, as an example, right? So as long as you have enough variability in your flows, you should get a good load balancing mechanism where it's more or less equal across all your ECMP paths.
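A toy version of the per-flow hashing Aninda describes, with the caveat that real switch ASICs use their own vendor-specific hash functions and seed values; MD5 and the addresses below are purely illustrative.

```python
import hashlib

def ecmp_path(src_ip: str, dst_ip: str, src_port: int, dst_port: int,
              proto: str, num_paths: int) -> int:
    """Pick an ECMP next hop per flow: hash the flow tuple so that every
    packet of the same flow takes the same path, avoiding reordering.
    Real ASICs use vendor-specific hash functions; MD5 here is only an
    illustration of the idea."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    digest = hashlib.md5(key).digest()
    # Map the hash onto one of the available equal-cost paths
    return int.from_bytes(digest[:4], "big") % num_paths

# The same flow always lands on the same one of 4 spines:
first = ecmp_path("10.0.0.1", "10.0.1.9", 49152, 443, "tcp", 4)
again = ecmp_path("10.0.0.1", "10.0.1.9", 49152, 443, "tcp", 4)
```

This also shows why a single elephant flow can't be spread across spines by per-flow hashing: its tuple never changes, so neither does its path.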
Speaker 1:Gotcha, okay, all right. So if everything is Layer 3, what are we doing Layer 2-wise? Are we dedicating entire leaves to particular Layer 2 domains? Like, how are we using our network now?
Speaker 3:Yeah, so that's sort of now the fundamental problem that you run into is hey, I moved my infrastructure to be a smart layer three infrastructure, but I still want to live in the olden days and I want to extend layer two. And then of course there are some fundamental requirements of vMotion and things like that, which does require a layer 2 adjacency. So how do you solve for that? Because I obviously stop my broadcast domain as soon as I hit the leaf, because everything above is now layer 3.
Speaker 3:So that's where VXLAN as an encapsulation also helps. What you're essentially doing is transporting Layer 2 within this Layer 3 infrastructure. So, through various constructs and various means of disseminating Layer 2 information across the fabric, which is where BGP EVPN comes in, I can start learning Layer 2 MAC addresses against a particular VXLAN tunnel endpoint, a VTEP, and I can say: hey, if you want to go to this MAC address, it sits behind this VTEP, right, against this bridge domain or broadcast domain, which translates to a VXLAN network identifier, a VNI, on the VXLAN side. So then I can start to encapsulate the packet and provide this Layer 2 extension, even though you have a fully routed Layer 3 infrastructure underneath it.
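The lookup Aninda describes can be sketched as a table keyed on MAC and bridge domain, populated by the control plane and consulted on each frame. All the MACs, loopback addresses and VNIs below are made up for illustration.

```python
# Hypothetical MAC table a leaf might build from BGP EVPN updates:
# (MAC, bridge domain) -> (remote VTEP loopback, VXLAN VNI)
evpn_mac_table = {
    ("00:50:56:aa:bb:cc", "bd-10"): ("192.0.2.11", 10010),
    ("00:50:56:dd:ee:ff", "bd-10"): ("192.0.2.12", 10010),
}

def forward(dst_mac: str, bridge_domain: str):
    """Decide how to forward a Layer 2 frame across the routed fabric:
    if the MAC is known, VXLAN-encapsulate it toward the owning VTEP;
    otherwise fall back to BUM (flood) handling."""
    entry = evpn_mac_table.get((dst_mac, bridge_domain))
    if entry is None:
        return ("flood", None)          # unknown unicast
    vtep, vni = entry
    return ("encap", {"outer_dst_ip": vtep, "vni": vni})

action, info = forward("00:50:56:aa:bb:cc", "bd-10")
```

The key point the sketch captures: the underlay only ever routes toward a VTEP loopback; the MAC reachability lives entirely in this overlay table.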
Speaker 1:And, as you said, that's where EVPN and BGP come in, because now you're sharing the MAC addresses over Layer 3 and you're essentially making forwarding decisions based off MAC location, correct?
Speaker 3:So I know, andy, you have a question, so I'll let you ask and then maybe I'll expand on this a little bit more.
Speaker 2:I have all the questions. I'm sorry. Yeah, so when I learned that EVPN is BGP exchanging MAC addresses instead of routes, my brain exploded. I'm like wait, I didn't know BGP could advertise MAC addresses. I thought it was for routes. So try to remember where you are, because I have a couple of questions.
Speaker 3:Can I break your brain a little bit more?
Speaker 2:Well, yeah, yeah.
Speaker 3:You know this isn't new, right? In fact, did you know that IS-IS does the same thing? What Cisco implemented as their version of TRILL, which was FabricPath back in the day: they had IS-IS exchanging MACs too.
Speaker 2:I did not know that. Thanks for pointing out to the audience something else Andy doesn't know. No, that's fascinating.
Speaker 1:I did not know that myself. That's good yeah.
Speaker 2:So I'm following the historical evolution, and the two reasons I thought we went from three-tier to Clos, or leaf-spine, at least what I learned in the books, was: 80% used to be north-south traffic with the old designs, and then, once we went to microservices and everything everywhere in the DC, it switched, 80% went east-west, and we needed more speed and less latency, and everything's one or two hops away. What is it, two hops? Everything's two hops, no matter where it has to go. So that traffic shifted, right? And what was behind that? Was it the microservices? Yeah, it was virtualization.
Speaker 3:Yeah, so the network evolution was pushed, I think, by server evolution. There was so much virtualization happening, and now of course it's all microservices, it's all containerized, and all of your constructs of an application that would sit on one physical server now exist either as VMs or containers, and they sit behind different attachment points, like different leafs. And now, when a single HTTP request comes in to be serviced, I actually have to talk more east-west between all of these application constructs that sit across the fabric. So now I'm, just within my fabric, going east-west before I actually go north-south. So you're right, a lot of it was driven by the advent of virtualization and microservices and things like that.
Speaker 2:Yep. The other thing I wanted to ask: I thought that EVPN VXLAN solved the problem with vMotion, because you need all the VLANs available everywhere. So I guess the question I wanted to ask earlier was: when it was three-tier, Layer 2 at the access, how did vMotion work? Like, how could you move a workload if a VLAN only existed in one place? You couldn't move it somewhere else.
Speaker 3:Even in three-tier, the VLANs would exist in multiple places. But what would also be different is you would have a centralized Layer 3 gateway for all of them on the distribution. You would typically have all of your SVIs, in Cisco terminology, or IRBs, in Juniper and Nokia terminology, that would exist there, and everything below the distribution was Layer 2. So for all your services, the distribution was your first L3 hop, which was the gateway, and you would typically have some form of FHRP, some first-hop redundancy protocol, running: HSRP, VRRP, whatever, right? So even if you're vMotioning across, you're doing it across your Layer 2 adjacent access layer, but your Layer 3 gateway never changes, right?
Speaker 3:The problem now in a Layer 3 infrastructure is, well, I don't have Layer 2 adjacency anymore unless I provide it in some form through this tunneling mechanism. And then the other problem is, once I start positioning my Layer 3 gateways, where do I actually position them? Because every leaf is now L3 up towards the spines, right? So that's where we also brought in this concept of an anycast distributed gateway. I would have the IRBs with the same IP and same MAC that now exist across all your leafs. So I don't really care if I move from leaf 1 to leaf 10, because when I move, I retain my default gateway; I'm not changing that, which is a core construct that is necessary. And as long as that default gateway exists on the leaf that I have now moved under, from my point of view there's really no change, right? The resolution of the gateway's IP to the MAC is going to be exactly the same.
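The anycast distributed gateway behavior can be sketched like this: every leaf is configured with the identical gateway IP and MAC, so the VM's cached binding stays valid wherever it lands. The addresses are hypothetical (the MAC is just styled after the VRRP virtual-MAC range).

```python
# Every leaf is configured with the *same* gateway IP and MAC for the
# subnet (an anycast distributed gateway); values here are made up.
GATEWAY = {"ip": "10.10.0.1", "mac": "00:00:5e:00:01:01"}
leaves = {name: dict(GATEWAY) for name in ("leaf1", "leaf5", "leaf10")}

def arp_reply(leaf: str, target_ip: str):
    """Any leaf answers ARP for the gateway IP with the shared MAC."""
    gw = leaves[leaf]
    return gw["mac"] if target_ip == gw["ip"] else None

# The VM vMotions from leaf1 to leaf10: its cached gateway IP-to-MAC
# binding resolves identically, so no re-ARP and no blackholed traffic.
before = arp_reply("leaf1", "10.10.0.1")
after = arp_reply("leaf10", "10.10.0.1")
```

The design choice this illustrates: instead of one FHRP pair at the distribution, the first L3 hop is replicated everywhere, so mobility never changes the host's view of its gateway.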
Speaker 2:Okay, last question in my torrent of questions, and then I hand it back to AJ. So, earlier in the conversation, we said we've completely gone to a Layer 3 infrastructure, right? Everything's Layer 3, but we have to maintain Layer 2, and I guess that's what VXLAN is supporting: that maintenance of Layer 2 down at, I guess, the servers, so that we can transport it across. Why? I guess there's no other solution? Like, we can't go Layer 3 from the leaf down to the server and get rid of the Layer 2?
Speaker 3:No, you absolutely can.
Speaker 3:Yeah, you can.
Speaker 3:So a lot of people actually do, especially when you look at, let's say, a lot of cloud services, such as an anycast DNS as an example. A lot of people build that infrastructure with L3 all the way down to the servers.
Speaker 3:Right, because you have multiple nodes that are advertising the same anycast address, as an example, right? And by doing L3 all the way down, you're able to achieve ECMP to all of those nodes. All these leafs will advertise the same anycast address, so I don't care which leaf I eventually come in on; that leaf will then ECMP down to any of the nodes that support these anycast DNS services, right? And that's a non-hardware way of doing load balancing for anycast DNS, because otherwise what you would typically do is purchase a very expensive hardware load balancer, put it in the middle, put all of your DNS servers behind it, and the load balancer does all of that, right? So this is a different, protocol-only way of doing anycast DNS services as well. And especially for cloud services, Kubernetes clusters, CNIs like Calico as an example, a lot of these are now delivered through L3 to the leaves.
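A sketch of the anycast pattern just described: several leaves advertise the same host route, so the routing table accumulates multiple equal-cost next hops and the fabric itself does the load balancing. The prefix and leaf names are invented for illustration.

```python
from collections import defaultdict

# Hypothetical fabric routes: several leaves advertise the same
# anycast /32, so the table accumulates multiple equal-cost next hops.
rib = defaultdict(list)

def advertise(prefix: str, next_hop: str) -> None:
    """Install a route; a repeat advertisement from the same leaf is ignored."""
    if next_hop not in rib[prefix]:
        rib[prefix].append(next_hop)

# Three leaves each front a DNS server behind the same anycast address:
for leaf in ("leaf1", "leaf2", "leaf3"):
    advertise("198.51.100.53/32", leaf)

# ECMP then load-balances queries across every leaf that advertised it,
# with no dedicated load balancer in the path.
paths = rib["198.51.100.53/32"]
```

A nice side effect of doing this in the routing protocol: when a server dies, its leaf simply withdraws the route and traffic converges to the survivors.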
Speaker 2:I guess it's dependent on the use case and the problem you're solving. If you were building a greenfield data center, you wouldn't just completely get rid of L2 because that's the best practice now. It depends on what you're doing, I guess.
Speaker 3:It depends on what your application needs. Like, sometimes these applications need Layer 2 keepalives between them. As an organization, if you bought into that application, what do you do? You have to facilitate a way of transporting that, even if you want to go towards more modern ways of building your DCs, right?
Speaker 2:So once again, it's the application's fault.
Speaker 3:It is. I mean, that's the funny thing, right? We are building all of this network infrastructure to solve and service your business use cases and the outcome you want, so it is driven by the application, 100%. Yeah, it's a lot. I can see the brain spinning a little bit, Andy.
Speaker 2:Why is it so hard? Like, is it just me? So, you know, full transparency to the audience: you spent three days of your life with me teaching me EVPN VXLAN, and we labbed it and I videoed it. And I keep telling myself I'm going to go back to those videos and build it at home, and really, until I build it, I don't think this stuff's going to be in my bones. But it seems so easy to you, and for the rest of us, why is it so complicated?
Speaker 3:It's not, really. I don't think it's easy. I think it's just consistency; it's having a day job that requires me to do this every day. It's the same thing with automation, right? You've started and stopped, I think, several times before. I have done the same thing, and I suck at it. Like, really, I'm terrible at it, right? But if that became part of my day job and I had to do it every day, I think that changes the dynamics a little bit, right? You're forced to learn. You're forced to encounter customers who are doing this. You're forced to look at designs and make sense of them, and it just starts to sit in your brain a little bit.
Speaker 2:So when you build an EVPN VXLAN fabric, let's say in your lab, are you doing it artisanally by hand? You're not automating it.
Speaker 3:No, so it depends, right. So I did a lot of DC when I was at Cisco, because I was an escalation engineer from the engineering side of things, but I was heavily involved on the platform side, right? So we wouldn't get involved in initial troubleshooting. A lot of the stuff that would come to us would be, you know, the ASIC is dropping the packet here, and why is that happening? On the Nexus side of things, since Nexus was obviously heavily deployed on the DC side. So I would cater to a lot of DC designs and DC infrastructure-side VXLAN there, but not as much as I've done it in the last three years, when I was at Juniper, which is where I really got involved with customer deployments and I was in the thick of things.
Speaker 3:So I made sure that I was labbing. Like, I was building a simple topology, let's say two spines, four to five leafs and some services, L2 and L3, right? I was doing it by hand at least three or four times a week: break it, build it all over again. And I did it by hand for the entirety of my time at Juniper, and then again at Nokia, because the CLI is obviously different. Unless I wanted to spin up something very, very quickly, then I could either use Juniper Apstra or, now, Nokia EDA as an example, right? But I made sure to do it by hand, because that forces you to make a lot of mistakes, forces you to troubleshoot when you haven't configured things properly, and it forces you to learn, obviously.
Speaker 2:And it's a lot of config. That's why I asked, right? Like, it's dense.
Speaker 1:It is, it is.
Speaker 2:This isn't a 15-line stanza. Like, you don't just enable VXLAN and walk away.
Speaker 3:No, exactly. There's so many moving parts, and it's so difficult to understand why you need certain things. Only when you start to configure it, and then you don't configure something you need, it hits you in the head really hard and you're like, oh shit, I really needed this for such-and-such a reason, right? And then you do it again 200 times, and then maybe it sits in your head for some time.
Speaker 2:So yeah. Hey, AJ, like, you've built fabrics, right? I mean, are people building this by hand? Is this something most folks rely on, like, an ACI or a fabric automation tool for?
Speaker 1:You know, I don't know how many people are building it by hand manually for a production environment. I think a lot of the VXLAN implementation that I've personally seen has usually been tied to some sort of product package, right? So Cisco's ACI, and then there's a bunch of others out there that use, you know, VXLAN under the hood.
Speaker 1:When I initially started learning it, even though I was going to do it for ACI, I took the time to learn it manually so I could understand how it worked, I could troubleshoot it to a degree, and I'd have an appreciation of, you know, what ACI is doing for me as a network operator. But I haven't seen many manual implementations of it throughout the course of my career. Now, that being said, I haven't spent a ton of time with it. Just before I left my last job as a pro services engineer, I was actually working on a manual implementation of VXLAN, because it was just between two data centers, and they were two very small data centers too. But they needed to stretch Layer 2 between the two, and this was really the best way to do it.
Speaker 2:And is it a safe assumption that writing it by hand is definitely the way to hammer what's happening into your head, like, so you understand how this is all playing together? Yeah, absolutely.
Speaker 1:You know, because, as Aninda said, you're going to make mistakes and you're going to have to troubleshoot it. And then, once you fix it and understand how this stuff goes together, you're just going to use some form of orchestrator or some form of automation framework, Ansible, whatever, right?
Speaker 3:And more than that, the ROI when you're doing it through these orchestrators is that you want some guarantee of how the network's actually functioning. So you want to look at the day two stuff, all of the ops stuff, right? Are all my peerings up? Is the fabric deployed as intended, or has somebody gone in and changed some stuff so my fabric's no longer adhering to what it should be, and things like that? So that's where a lot of the benefits come in. The automation problem has been solved for a long time now, but the day two stuff, that's where things are really happening. And for DC, of course, it's very important and critical that you know what's actually happening in your infrastructure.
Speaker 2:And this is creeping into enterprise, right? It's not just data center.
Speaker 3:Yeah, of course.
Speaker 1:Because I think one of the things that it can do for you is that now you're not tying an IP address to a specific location, right? So now you don't have to worry about writing the same ACLs over and over again for all your different networks. You can stretch that network and put it wherever you want, and now you have, like, a unified set of rules that you don't necessarily have to worry about.
Speaker 3:Yeah, especially for enterprise use cases where there's so much wireless and so much roaming. Cisco obviously pushed SD-Access, which was originally LISP plus VXLAN, so LISP was doing all of the control plane stuff and then VXLAN was your data plane. But I think from an adoption perspective LISP was much harder, because nobody really used it as much. They also moved to BGP EVPN, and I think you could do both. I haven't done anything in the SDA space in three or four years or more, but I'm pretty sure you could do BGP EVPN with VXLAN today on their campus side too, through DNAC, or Catalyst Center, whatever they call it today.
Speaker 2:Is there anything else to say about EVPN, besides that it's the control plane that carries MAC addresses? Is there anything there?
Speaker 3:So again, a lot of this is through evolution. BGP EVPN was not how VXLAN was deployed originally, right? You actually had VXLAN flood-and-learn, which is very similar to, if you've done VPLS in the past, flooding a data plane packet, like an ARP as an example, through your fabric, which enables all of your terminating endpoints to learn the source addresses. But there are a lot of consequences of doing that. Obviously, flooding is not that great, and to support flooding you need some form of infrastructure that facilitates it, right? Typically this would be a multicast underlay that you would build through PIM sparse mode, as an example, and this multicast underlay would flood all of your data plane packets, and then you would have data-plane-driven learning. A lot of drawbacks, especially around layer 2 mobility: when things move, there's a lot of delay and convergence issues and things like that. So the industry evolved to figure out that you could actually use BGP through a new address family and a sub-address family, which was L2VPN EVPN, in short, BGP EVPN. And now you could use BGP EVPN as a control plane to disseminate all of your learning across these VXLAN tunnel endpoints, the leafs or the VTEPs, right? So as soon as I have a data plane learn on my leaf, let's say a GARP comes in or an ARP comes in, I'll process that ARP in whatever way I have to, but I'll also go and tell BGP EVPN: hey, I had a local learn, and this is the MAC address on this port, on this bridge domain, which maps to this VXLAN identifier, so why don't you go and tell the rest of the fabric where this MAC address exists, right?
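To make the local-learn-then-advertise flow concrete, here is a minimal Python sketch. All the names (`Type2Route`, `Leaf`, the addresses, the VNI) are invented for illustration; this models the idea of an EVPN Type-2 MAC advertisement, not any vendor's actual implementation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Type2Route:
    """Simplified stand-in for a BGP EVPN Type-2 (MAC/IP advertisement) route."""
    mac: str
    vni: int        # VXLAN Network Identifier the bridge domain maps to
    next_hop: str   # VTEP loopback where the MAC was locally learned

class Leaf:
    def __init__(self, name: str, loopback: str):
        self.name = name
        self.loopback = loopback
        self.remote_macs: dict[str, str] = {}  # MAC -> remote VTEP next-hop

    def local_learn(self, mac: str, vni: int, fabric: list["Leaf"]) -> None:
        # A data plane learn (e.g. a GARP) triggers a control plane update
        # to every other VTEP, instead of relying on flood-and-learn.
        route = Type2Route(mac=mac, vni=vni, next_hop=self.loopback)
        for peer in fabric:
            if peer is not self:
                peer.receive_update(route)

    def receive_update(self, route: Type2Route) -> None:
        # Remote leaves now know which VTEP to encapsulate toward for this MAC.
        self.remote_macs[route.mac] = route.next_hop

leaf1, leaf2 = Leaf("leaf1", "10.0.0.1"), Leaf("leaf2", "10.0.0.2")
fabric = [leaf1, leaf2]
leaf1.local_learn("aa:bb:cc:dd:ee:01", vni=10100, fabric=fabric)
print(leaf2.remote_macs["aa:bb:cc:dd:ee:01"])  # -> 10.0.0.1
```

The point of the sketch is the direction of information flow: learning happens once, locally, in the data plane, and BGP EVPN pushes the result out, so remote leaves never need to see flooded traffic to build their tables.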
Speaker 3:So BGP EVPN takes that, builds a BGP update and sends it to everybody else. Now everybody else processes that update and knows: okay, MAC A resides behind leaf one, right? And if I want to get to MAC A, I can encap the packet with all of these headers and send it to leaf one. The only thing I'll add there is that it's more than just MAC addresses, Andy. Eventually they also realized that there are very clear layer 3 use cases as well, like: I want to transport just IP addresses. So you have a host of EVPN routes that do different things. Some of them do MAC only, some do MAC plus IP, and some do just an IP prefix, as an example. So I'm just decoupling MAC addresses from IP addresses.
Speaker 2:Are there route types, or am I confusing protocols?
Speaker 3:Yeah, so they're BGP EVPN route types, one through, I want to say, 11 or 12.
Speaker 2:So yeah, type five is stuck in my head for some reason.
Speaker 3:Yeah, that's the IP prefix route. That's the one that does away with MAC addresses, and you're only advertising, you know, a v4 or v6 prefix. That's correct, you passed the quiz. Good job.
Speaker 2:Yay, do I get the job or not?
Speaker 3:You got the job, man.
Speaker 3:Yeah, it's a mouthful. So if you have any questions, let's talk about it.
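For reference, the route types touched on in this exchange can be summarized in a few lines. This is a sketch of the commonly discussed subset only; the full registry of EVPN route types runs further than five, as Aninda notes.

```python
# Commonly discussed BGP EVPN route types. RFC 7432 defines types 1-4;
# RFC 9136 adds the type 5 IP Prefix route; later documents extend the list.
EVPN_ROUTE_TYPES = {
    1: "Ethernet Auto-Discovery Route",
    2: "MAC/IP Advertisement Route",              # MAC only, or MAC plus IP
    3: "Inclusive Multicast Ethernet Tag Route",  # BUM traffic handling
    4: "Ethernet Segment Route",
    5: "IP Prefix Route",                         # the "just an IP prefix" case
}
print(EVPN_ROUTE_TYPES[5])  # -> IP Prefix Route
```

Type 2 is the MAC-carrying route discussed above, and type 5 is the one Andy remembered, which drops MAC addresses entirely and advertises only a v4 or v6 prefix.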
Speaker 2:Stunned silence. Well, I have a completely unrelated question: what the hell is a five-tuple hash? I'm going to get us off of EVPN and VXLAN for a second. You said five-tuple hash and I'm like, question mark, exclamation point. I don't even know what a tuple is. I've heard people talk about tuples, and well, there's been a lot of information here. I didn't want to stop the train for that dumb question, but here we are. There was a break.
Speaker 3:I mean, you have to have an algorithmic way of determining, if you have equal cost paths, how to send traffic out. This is either layer 3 ECMP or, let's say, you have a port channel and you have multiple links in the port channel. Now, you can't just spray traffic across every link, because you can't guarantee the order in which the packets arrive: the receiver could get packet two before packet one, which causes a host of problems. So typically you would do per-flow load balancing, right? And to do per-flow, you need some algorithmic way of determining which flow is put on which link.
Speaker 3:If I have, let's say, two links to send traffic out, that algorithm essentially uses different variables, which, let's say, you could call tuples here. Typically, by default, these would include your source MAC and destination MAC from the Ethernet header, your source IP and destination IP from the IP header, if this is an IP packet, and, let's say, your destination port from a UDP or TCP header, as an example. So these five things form your five-tuple that goes into the algorithm, and the algorithm just spits out a hash that says: hey, you take this link, and you go out of this link, right?
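The mechanics can be sketched in a few lines of Python. This is illustrative only: real switches compute the hash in hardware, often as a CRC over the selected fields, and the exact field set is platform-dependent; sha256 and the field names here are stand-ins.

```python
import hashlib

def pick_link(src_mac, dst_mac, src_ip, dst_ip, dst_port, num_links):
    """Hash five header fields and map the digest onto an egress link index."""
    key = f"{src_mac}|{dst_mac}|{src_ip}|{dst_ip}|{dst_port}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % num_links

# The same flow always hashes to the same link, preserving packet order;
# different flows spread across the available links.
flow_a = ("aa:bb:cc:00:00:01", "aa:bb:cc:00:00:02", "192.0.2.1", "192.0.2.2", 443)
flow_b = ("aa:bb:cc:00:00:01", "aa:bb:cc:00:00:02", "192.0.2.1", "192.0.2.2", 8080)
print(pick_link(*flow_a, num_links=2), pick_link(*flow_b, num_links=2))
```

Because the function is deterministic, every packet of a given flow takes the same link, which is exactly the per-flow pinning discussed next.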
Speaker 2:So do I understand that each of these flows is pinned to a link? Does that sound accurate?
Speaker 3:Yes, unless of course the number of available links changes, and then you sort of have to recompute how many links are still available and assign a bucket to each of these links.
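The recompute Aninda mentions has a visible side effect worth noting: with a simple modulo-style hash, changing the link count remaps a large share of existing flows. A rough sketch (invented names, sha256 as an illustrative stand-in for a hardware hash):

```python
import hashlib

def bucket(flow_id: str, num_links: int) -> int:
    """Map a flow onto one of num_links buckets via a hash (illustrative)."""
    digest = hashlib.sha256(flow_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_links

flows = [f"flow-{i}" for i in range(1000)]
before = {f: bucket(f, 4) for f in flows}   # four links available
after = {f: bucket(f, 3) for f in flows}    # one link fails, hash recomputed
moved = sum(1 for f in flows if before[f] != after[f])
print(f"{moved} of {len(flows)} flows now map to a different link")
```

Most flows land on a different link after the recompute, which is one reason a link failure in an ECMP group is more disruptive than just losing that link's share of bandwidth.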
Speaker 2:And is it pinning flows to a link to avoid the out-of-order packets that you mentioned earlier? Is that why?
Speaker 3:It's just a consequence of the algorithm, right? If there's no variability in the input, the outcome will always be the same, which leads to a very interesting question when it comes to VXLAN. So here's a problem statement. Once I do the VXLAN encap, I add an outer Ethernet header, I add an outer IP header, then I add a UDP header, and then the VXLAN header. These are the four headers that I'm adding, right?
Speaker 3:Now, the goal of my outer Ethernet and IP headers is purely transport, right? I want to take the original payload from one VXLAN tunnel endpoint to the other, and these endpoints are identified by, you know, their loopback addresses, as an example. So how do I do that? I add an IP header that says: hey, the source IP will be VXLAN tunnel endpoint one, the destination IP is going to be VXLAN tunnel endpoint two. And then my Ethernet header, basically hop by hop, takes it from my leaf to the spine and down to the other leaf, right?
Speaker 3:But the problem that now presents is, let's say I have 10 different flows, but they all have to go between the same two VTEPs, right? The outer IP header and the outer Ethernet headers that I add will always be the same, even though my original flows, the original payloads, are actually changing and have plenty of variability. So from a load balancing perspective, that's a big problem. How would I solve it? Otherwise I would always pin this traffic to one particular link going out towards my spines, even though I have multiple ECMP links available. So do you want to take a guess at how we solve this, or do you want me to answer?
Speaker 2:No, my brain is probably full, so I'm going to let you answer this.
Speaker 2:I have one other question, and then...
Speaker 1:I'd like to maybe pivot to your authoring?
Speaker 2:Okay, sure. Honestly, my brain... this is no reflection on you or the technology. This is what happens to me when I try to talk about this stuff: I get about 20 to 30 minutes in. It happened when you were working with me, remember, trying to teach me. We'd get to about this point, and you'd ask me a question and I would just stare. I think my brain is full. I think you've reached the perimeter of my intelligence.
Speaker 3:So I posed this problem statement because you asked a really important question, which is: how are we deciding where to place this flow?
Speaker 3:And that presents a problem when you're doing encap like this, because you abstract away the variability of the original flows, and you're adding something on top which is almost always the same if the services sit behind the same two leaves. You need some mechanism that adds that variability back, right? That's why the UDP header exists before the VXLAN header. Obviously it identifies that what comes next is a VXLAN header, but what we also do is use the original payload and do a five-tuple hash on it to generate the UDP source port. So as long as the original payload varies, the UDP source port varies. And then, to load balance this VXLAN-encapsulated packet, I'll use my outer Ethernet header, my outer IP header, and then the UDP source port, which is now changing. So I'm able to introduce variability, which allows me to load balance even encapsulated packets across these equal cost paths.
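The entropy trick can be sketched end to end. A note on the assumptions: RFC 7348 specifies deriving the outer UDP source port from a hash of the inner headers, but the function names, the sha256 stand-in, and the addresses below are all invented for illustration.

```python
import hashlib

def h32(*fields) -> int:
    """Illustrative 32-bit hash over arbitrary header fields."""
    key = "|".join(map(str, fields)).encode()
    return int.from_bytes(hashlib.sha256(key).digest()[:4], "big")

def vxlan_source_port(inner_five_tuple) -> int:
    # Per RFC 7348 the outer UDP *source* port is derived from a hash of
    # the inner packet; 49152-65535 is the suggested ephemeral range.
    return 49152 + h32(*inner_five_tuple) % (65536 - 49152)

def outer_link(outer_src_ip, outer_dst_ip, udp_src_port, num_links) -> int:
    # A spine hashing only the outer headers still sees variability,
    # because the UDP source port carries the inner flow's entropy.
    return h32(outer_src_ip, outer_dst_ip, udp_src_port) % num_links

vtep1, vtep2 = "10.0.0.1", "10.0.0.2"  # same outer IPs for every flow
flows = [("192.0.2.1", "192.0.2.2", "tcp", 33000 + i, 443) for i in range(10)]
links = {outer_link(vtep1, vtep2, vxlan_source_port(f), 4) for f in flows}
print(f"10 flows between the same two VTEPs spread across {len(links)} of 4 links")
```

Without the varying source port, every one of those ten flows would hash identically on the outer headers and pin to a single spine-facing link; with it, they spread.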
Speaker 2:AJ, I wanted to be a TME until I met Aninda and then I realized I might not be as smart as I thought I was.
Speaker 3:Hey, don't say that.
Speaker 2:No, it's a compliment, man. I'm just always blown away by your depth of knowledge.
Speaker 3:I'll say this: I honestly don't think I'm smart. There are people who get this quickly. I would take six months to get this, and there are people who would get it in a couple of weeks. What I feel I do a little differently is that I lab up so much of this, day in and day out. That's the only thing that has really made a difference for me. It's just in my memory because I've done it so much, right? So it's not that I'm smart, it was just a crazy amount of hard work and a crazy amount of time in the lab, looking at broken shit, day in and day out.
Speaker 2:Yeah, you've put in the reps. The last question before we pivot away into your life as an author: these are standards, correct? EVPN, VXLAN?
Speaker 3:Yeah, some of them are RFCs, some of them are IETF drafts. But yes, they're sort of standardized, correct.
Speaker 2:But I've heard people say that different vendors implement them differently. So I guess, is that accurate? And if it is, how do vendors implement standards differently? Maybe they don't know what the standard means?
Speaker 3:So, yeah, sometimes the verbiage in the draft or the RFC is a little vague, and it's up to the vendor, it depends on how the vendor chooses to implement it in software. That's where some of the differences come in. And then, for example, I don't want to get too deep into this, but data center interconnects are obviously a big deal, because when you're building DCs you're probably building multiple DCs. The simplest form would be a primary DC and a DR DC, and you want to connect the two and have them talk to each other. So you need some form of interconnect.
Speaker 3:Now, as an example, there is an RFC that defines how to do a DCI, which is called an integrated interconnect. But then Cisco decided they wanted to do things differently, perhaps, in their eyes, more efficiently. So they put in another draft for doing DCI, which Cisco now uses, but a lot of the other vendors in the industry use the RFC that already exists. So now you're in a position where, if I have mixed vendors, Cisco on one side and, let's say, Juniper and Arista on the other side, and I want to do DCI, well, they're doing two different forms of DCI. Now, to Cisco's credit, they also figured out that interoperability is a big deal, so they do have some knobs where you could make it compatible with other vendors. But you see where the complexity starts to creep in, right? So yeah, it is a problem.
Speaker 2:So what's it like writing a book? Why did you write a book? What is your book? I mean, you're a family guy. I've worked with you, you're a very hard worker, you put in a lot of time. You have a wife, you have a young daughter. I can't imagine how much time it took you. So why did you write your book?
Speaker 3:I guess the book stuff has been on my mind for a long time. I've been a tech blogger for quite a while now, off and on, and then I really heavily got into it when I was a TME at Cisco doing all of their SDA stuff, and I realized there's such a big gap between just pointing and clicking in the UI and understanding what's actually happening under the hood. Because it's cool to do all the UI stuff, and you can probably figure that out and have the fabric up in maybe 30 minutes to an hour, but when things break, I don't think the UI is really going to solve it for you. Maybe to a small extent, but you really need to understand how all of the stuff under the hood works, right? So I started taking all of that and writing very tech-heavy blogs.
Speaker 3:The marketing part of being a TME was non-existent for me, it was all tech. We weren't really doing all the fluffy marketing stuff, right? I haven't really done that ever, and I don't particularly enjoy it either. So I was just doing all of the tech stuff, and I realized that, okay, there's an audience for it, and I really like doing it anyway. So I actually tried to do an SDA book for some time, but that really didn't gather steam, and I couldn't really see a way in, because Cisco was just so humongous, so big, and I was just a very tiny part of it.
Speaker 2:But to approach your manager and say, I'd like to write a book... I'm sorry to interrupt you, but how do you even do that? Who do you even talk to at Cisco about writing a book?
Speaker 3:Yeah, I've tried that a few times.
Speaker 3:I just couldn't see a way in, right? And then I had sort of given up, or I was actually thinking about just self-publishing some SDA stuff at that point, but I didn't know if I saw a lot of value in it at that time. Then I moved to Juniper, again a lot of heavy-hitting DC stuff, and again I realized, man, this stuff is hard. So I went back to a lot of tech blogging. And then Mike Bushong, who was my big boss at that point and who is my big boss again, Kathy Gadecki, who was under him, and then another manager I had at Juniper, Rida Hameedi. Exceptional people, very supportive of things you want to do outside of just your day job. So I pitched the idea of this Juniper book, focused just on data centers, because nothing existed from a Juniper perspective at that time, so I knew there was a gap we could fill. Kathy got me in touch with Russ, and Russ was working at Juniper at that point, so I was very lucky, I had some overlap with him for about a year, and then he introduced me to this gentleman at Pearson called Brett, and Russ works with Brett quite heavily. So Brett was really interested.
Speaker 3:Originally this was actually supposed to be video content, and I wasn't too keen on doing a tutorial or on-demand stuff. I said, I think my forte is writing, that's what I'm more passionate about. So I pitched an entire outline for the book, and they were really happy to take it up. And yeah, there was no looking back after that. It took me about a year and three months in total, from starting to write to publishing.
Speaker 3:It was hard, man. I was working nights, I was working every weekend, and again, I don't think there's a way to balance it. There's sacrifice. There's a lot of family time that is sacrificed, for sure, and then of course you're doing your day job, but then you have to write the book as well. I was okay with that, because I felt a lot of my day job was contributing back to the book, because it was so heavily DC-specific. I actually told all of my bosses, don't take me out of my day job, because I feel that makes the book more realistic. I'm not writing in a bubble, I know what's actually happening outside, and I want to make it relatable. So I continued a hundred percent on the day job and then 150 percent on the book, and I lost a lot of family time.
Speaker 2:I guess you had full support of your wife, right? We talk a lot on the show about how important it is to have your support system when you take on something like that.
Speaker 3:A hundred percent. It can be very frustrating, because there are times when you're trying to build labs for the book and it just doesn't work and stuff's breaking, and sometimes you think it's a bug, sometimes it's a mistake, and you spend days troubleshooting. A lot of that comes out as frustration when you're with your family, away from the book stuff. So yeah, my wife is a pillar of support, and this wouldn't exist without her, for sure.
Speaker 2:And I think I heard you correctly: the reason you wrote the book is that you wanted people to understand what was happening under the hood for when things broke, right? Like, here's how it's all working, so when it breaks, you'll have the tools you need to know where to look. Is that why you wanted to write this book?
Speaker 3:That is the general idea, yeah, behind why I do a lot of the tech blogging. Even with a lot of the vendor stuff, Juniper, Nokia, right, it's not easy to figure out what's happening under the hood, and sometimes the vendors don't make it easy. I feel it's up to us as the community to help each other out, especially the people who are managing this day in, day out, who are on the ops side of things, because I have been on that side, right? When hospitals go down, when people call in and say they're losing millions of dollars per second and shit hits the fan, you want the knowledge to be able to troubleshoot some of these things. So yeah, it's a helping hand, it's a way for the community to understand how this stuff works behind the scenes, even if you weren't able to put in the time to lab it all up yourself. I just love giving back.
Speaker 3:And I have another book that I'm writing for Juniper, because I had committed to it. And then I have a third book that we just signed a contract for, which is interesting: it's a multi-vendor DC book that's going to cover Cisco, Arista, Juniper and Nokia, so maybe 2,000 pages, it's going to be big. But yeah, I have a lot of fun with it.
Speaker 2:So now you're on fire. You've gotten started and you're on fire. I just want to say, I think it's important to have people on staff who have the deep knowledge for when things break, and here's why I think it's important.
Speaker 2:I don't know if you guys have ever had this perception from, like, vendor marketing, and I'm not pointing a finger in any particular direction, but there'll be vendors who say their automation platform is so good that things won't break, right? Oh well, if you use this, you won't break it. I feel like that's a big marketing angle. And listen, it's usually better than doing it artisanally by hand in the CLI, so I get it. But things break. You hit bugs, things get wedged, weird stuff happens, right? And, like you said earlier, if you're relying on the UI and this tool to manage it, fine, when you've got to change connectivity or add or delete. But man, when stuff breaks, you'd better have an Aninda around, because I don't know if that automation platform is going to fix it for you, even if there's a rollback. It just seems to me that you really need somebody around who knows what the hell's happening under the hood.
Speaker 1:Yeah, that makes a big difference, for sure.
Speaker 2:If you want to, you know, get your network back up. Aninda, this has been amazing, and I'm sorry that I had a dumb look on my face for half of the conversation. It's just so much to take in.
Speaker 1:There are so many layers of complexity, which is why I thought it'd be helpful to have this conversation. And trying to suss out what's going to be the most important stuff to cram into an hour is difficult with VXLAN.
Speaker 2:So I mean, I'm halfway through studying for the CCNP, and there's a VXLAN chapter in that book, right? So I guess vendors have material people can consume to learn this stuff. But say someone comes to you saying, hey man, this overlay stuff, EVPN and VXLAN, it's all the rage. Where would you direct people who want to get started learning about everything we're talking about here?
Speaker 3:So I learned a majority of what I wrote in the book from Dinesh Dutt, who is a pioneer in this space, from Russ White, Jeff Doyle, Jeff Tantsura. A lot of them have books or content, or, like Jeff and Jeff, they do Between Two Nerds, and they have a lot of good content on EVPN and VXLAN, a lot of interesting guests who come in and talk about this. Russ has a huge amount of free stuff out there: rule11.tech, and rule11.academy, which is his new academy for teaching.
Speaker 3:And then Dinesh Dutt has excellent books. I think they're called BGP in the Data Center, EVPN in the Data Center, and Cloud Native Data Center Networking. So I was watching their stuff and reading their stuff day in, day out when I joined Juniper, when I was writing the book. So go check them out, they are the original sources of truth for any of this. And then, yeah, maybe check out the book I wrote, because I hope I took a lot of what they were teaching me and translated it into easily consumable information, with some implementation information around Juniper.
Speaker 2:What's the name of your book and how can people find it?
Speaker 3:The book is called Deploying Juniper Data Centers with EVPN VXLAN. You can either get it on Amazon or buy it straight from Pearson, either as a hard copy or a PDF.
Speaker 1:Awesome, and where can people find more from you?
Speaker 3:I am on LinkedIn, I am on Twitter for now, and recently Bluesky, and at all of these places it's aninchat, that's A-N-I-N-C-H-A-T.
Speaker 1:And you said you're a tech blogger. Were you blogging for yourself? Do you have a website that you still blog at?
Speaker 3:Yeah, I blog for myself. It's not very consistent, because there's so much work stuff going on, and a lot of the stuff I want to blog about goes into the book, but I do have a lot of good content there. It's called theasciiconstruct.com.
Speaker 1:Awesome. We will put links to all of that great stuff in the show notes, so you can check out his blog, the books and the other resources he mentioned. Aninda, thank you so much for joining us tonight. This has been great.
Speaker 3:Thank you for having me. It was awesome. Thanks, guys.
Speaker 1:I feel like there's a lot here we still could have unpacked, so we might have to have you back for another follow-up episode.
Speaker 3:Yeah, I would love that. It's always good to talk to you guys.
Speaker 1:Awesome. All right, we will see you next time on another episode of the Art of Network Engineering podcast. Thanks for joining us.
Speaker 2:Hey everyone, this is Andy. If you like what you heard today, then please subscribe to our podcast in your favorite podcatcher. Click that bell icon to get notified of all of our future episodes. Also follow us on Twitter and Instagram, we are @artofneteng, that's Art of N-E-T-E-N-G. You can also find us on the web at artofnetworkengineering.com, where we post all of our show notes, blog articles and general networking nerdery. You can also see our pretty faces on our YouTube channel, named The Art of Network Engineering. Thanks for listening.