(Sponsored) Simplifying Data Center Management with Apstra

The Art of Network Engineering

The Art of Network Engineering blends technical insight with real-world stories from engineers, innovators, and IT pros. From data centers on cruise ships to rockets in space, we explore the people, tools, and trends shaping the future of networking, while keeping it authentic, practical, and human.

We tell the human stories behind network engineering so every engineer feels seen, supported, and inspired to grow in a rapidly changing industry.

For more information, check out https://linktr.ee/artofneteng

November 06, 2024 • A.J., Andy, Dan, and Kevin • Episode 158

To learn more about Juniper Networks Apstra visit juniper.net/aone

Unlock the potential of AI Ops in data centers with insights from Shean Leigon, a senior product manager at Juniper Networks. Discover how AI operations, often seen as mystical, are instead practical tools that can enhance network engineering, especially in wireless and access layers. Shean delves into Juniper's strategic integration of AI through platforms like Mist and Marvis, offering a roadmap for data centers eager to embrace enhanced automation and reliability without succumbing to hype.

Explore the nuances of Juniper's Apstra solution, designed to ensure your network's intended state aligns impeccably with its actual state. Apstra utilizes a cutting-edge graph database to maintain a clear network topology, effectively managing challenges like cabling mismatches and day-two operations. With Marvis, the Virtual Network Assistant, the complexity of data is transformed into intuitive insights for network operators, streamlining issue resolution and improving communication within organizations. Shean emphasizes the importance of service mapping in reducing downtime and highlights the tools that empower network engineers to tackle application issues with confidence.

As the episode unfolds, the discussion tackles the integration of new security solutions in large enterprise environments, spotlighting the transformative potential of AI Ops in simplifying data correlation. Shean shares advancements in predictive analysis and impact assessment within data centers, underscoring the critical role of service awareness in prioritizing crucial actions. The conversation wraps up with a look at AI and automation's role in capacity planning and network maintenance, encouraging a healthy skepticism while acknowledging AI as a significant ally in the evolving landscape of network management.

Find everything AONE right here: https://linktr.ee/artofneteng

Speaker 1: 0:00

This is the Art of Network Engineering podcast. In this podcast, we'll explore tools, technologies and talented people. We aim to bring you information that will expand your skill sets and toolbox and share the stories of fellow network engineers. Welcome to the Art of Network Engineering. I am AJ Murray and tonight I am joined by a very special co-host. He is Tim McConnaughy from Cables2Clouds. Tim, good to see you.

Speaker 2: 0:33

Hey man, it's good to be here. It's been a while.

Speaker 1: 0:36

Yeah, it's been a while since we've had you on the show. Our guest tonight is Shean Leigon. He is no stranger to the show either. He's been on the show a few times. Shean, thanks again for joining us.

Speaker 3: 0:45

Yeah, thanks, guys, appreciate it. Tim, thanks for having me.

Speaker 1: 0:48

So Shean is a senior product manager for data center at Juniper Networks, and we're here to talk about AIOps in the data center. And I've got to tell you, up until now we've heard a lot about AIOps in the enterprise, particularly around the access layer, wireless and stuff like that. Is AIOps ready for the data center?

Speaker 3: 1:06

Yeah, that's a good question, and I love approaching the subject with a healthy amount of skepticism.

Speaker 1: 1:13

You know, I try, I try.

Speaker 3: 1:17

Absolutely, as you should. Quite frankly, that's how we approach it internally as well. Look, it's a little bit cliche, but it's a bit of a journey. So when we ask, is it ready for the data center? Yeah, I think it is. We're seeing some things that we're bringing into the data center within Juniper that we think are compelling, but it's also a start, right? This isn't like, hey, we've launched a few things around AIOps, put a bow on it and let's call it a day.

Speaker 3: 1:44

For me, I really kind of look at this as a tool in the toolkit, and that's the way I would recommend folks approach it as well. It's not a magic wand you're going to wave that makes all your problems go away. Here I am as a vendor on a podcast telling you that, right? It's not a magic wand. But look, we think that there's something there and, like I said, it's kind of the start of the journey around AIOps and the data center. So, yeah, looking forward to the conversation. And for the folks that are familiar with Mist and Marvis and what we're doing there, you'll be able to relate some of that to what we've done in the data center. It's similar in a lot of ways.

Speaker 2: 2:28

I've heard some really good stuff about the AI in your Mist product, so I mean, if it's the same general idea under the hood, it's probably pretty good, I imagine.

Speaker 3: 2:39

Yeah, it's the same idea, but the use cases are a bit different, right? The way I kind of think about it: when you think about the campus, especially on Wi-Fi (which, caveat, I'm not a wireless guy; I like packets on the wire, you start doing RF and I get lost, I'll leave that to others), on the campus side of the house it's kind of like, okay, I have a bunch of clients and those clients are trying to connect to applications that exist somewhere. You might control those applications; they might reside in your data center. They might not; they might be some SaaS service, or it just might be general internet traffic going out to other things. So there you're really concerned with how the client connects back to these different services. In the data center I think it's quite a bit different. There you're saying, well, I have these services that I'm advertising out, and I have all these other clients that I don't necessarily control connecting to my services. It's the opposite side of it.

Speaker 3: 3:42

And so you know a little bit about kind of like what we're doing, you know, in the data center.

Speaker 3: 3:46

First of all, it does start with, you know, essentially the Mist slash Marvis platform, and so what do I mean by that?

Speaker 3: 3:58

We've kind of taken the platform built on Mist, leveraging all the Marvis AI stuff, and essentially expanded that throughout the product portfolio in Juniper. So you'll hear about things outside of the data center too where we're doing AIOps, in other areas within our product portfolio. But when we look at it from the data center perspective, I like to think of it like a PaaS layer. We've taken Mist and Marvis, not the controller and the access points but the AI engine aspect of it, and created this PaaS layer, and we've said, okay, what are the use cases that are worthwhile to go solve in the data center, and are there things here where we can use what's been done in Mist and Marvis to do that? We've found a good handful of them and we have a bunch more that we're working on too.

Speaker 1: 4:48

Well, I mean, let's dive into it a little bit. Just thinking about what I've seen of Marvis in enterprise and Wi-Fi, some of the common things are misconfigurations, right? Like maybe it's a wrong native VLAN on a trunk, or a mismatch somewhere else. Is that going to apply here in the data center? Let's start to build on top of that. Let's take what we know and kind of go from there.

Speaker 3: 5:12

Yeah, I love it. It does apply and build on top of that. And I think it's worthwhile mentioning the approach that we've taken around AIOps in the data center: first and foremost, it revolves around Apstra. Without spending a ton of time on it, there are a couple of reasons for that that are pretty important. Some folks may be familiar with Apstra. For those that aren't, essentially it's Juniper's data center fabric management solution. But it does more than just manage a bunch of switches together. One of the things that it really focuses on is understanding the operational state of the network. So we have this concept of what we call intended state and actual state of the network, and it's a matter of reconciling those two. This all gets built and put into a graph database that operates under the hood as the engine of Apstra. The reason that's important is because we know things like, what does the network topology look like? We know that from the graph database, so we can easily go render that in a UI. Quite frankly, I don't need AI to tell me what the topology is, right? So there's a bunch of things that we've done there. When we think about a cabling mismatch, we know from the cabling map that Apstra builds out exactly what the cabling should be between all the leafs and spines, and we also know what it should be going down to the server as well. So we can essentially map all that out. If there's a cabling mismatch, Apstra knows that as a single source of truth. But how do we convey that back to users in a meaningful way?

Speaker 3: 6:51

I have a couple of videos, and I grabbed some screenshots, that I can show that might be helpful. But look at that exact use case around a cabling mismatch. If you're just deploying the network and there's a cabling mismatch, okay, there's really no operational impact, right? You might have to get back on the phone with smart hands or whoever was racking everything, or maybe you're the one racking it. It all kind of depends. You figure out what's happening, and so we do things like fetch LLDP information just to easily tell you, hey, we think the cable should be in port xe-0/0/2 and it's actually in xe-0/0/3. Do you want us to reconfigure it, or go move the cable back?
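
The intended-versus-actual comparison described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Apstra's implementation or API: the function name `find_cabling_mismatches`, the dictionary layout, and the device and port names are all hypothetical.

```python
# Hypothetical data shapes for illustration; not Apstra's schema or API.
# Each map goes from (local_device, local_port) to (remote_device, remote_port).

def find_cabling_mismatches(intended, lldp_seen):
    """Compare the intended cabling map with LLDP-discovered neighbors."""
    mismatches = []
    for local, expected_remote in intended.items():
        actual_remote = lldp_seen.get(local)  # None if no neighbor was discovered
        if actual_remote != expected_remote:
            mismatches.append(
                {"local": local, "expected": expected_remote, "actual": actual_remote}
            )
    return mismatches


# Intended state: what the cabling map says should be plugged in where.
intended = {
    ("leaf1", "xe-0/0/0"): ("spine1", "xe-0/0/2"),
    ("leaf2", "xe-0/0/0"): ("spine1", "xe-0/0/4"),
}

# Actual state: what LLDP reports. leaf1's uplink landed one port over.
lldp_seen = {
    ("leaf1", "xe-0/0/0"): ("spine1", "xe-0/0/3"),
    ("leaf2", "xe-0/0/0"): ("spine1", "xe-0/0/4"),
}

mismatches = find_cabling_mismatches(intended, lldp_seen)
for m in mismatches:
    print(f"{m['local']}: expected {m['expected']}, found {m['actual']}")
```

Either side can then be corrected: update the intended map to match what LLDP saw, or move the cable back.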

Speaker 2: 7:37

How are you expressing the intent of how it should be cabled? And then Apstra is saying, oh well, it's actually miscabled, and that's how it's communicating to you? Because the part I'm missing is, you're saying, hey, it should be cabled this way. But are you, as an administrator, expressing that intent and then Apstra's kind of keeping you to it, or how does that work?

Speaker 3: 7:58

Yeah, great question, appreciate that. So what happens in Apstra is you start to build out your network. You define what a rack looks like. All the network devices that are supported are basically pre-populated in there, whether it's Juniper or Cisco or Arista, and we support a handful more, Dell, SONiC. You could select, like, hey, it's a three-stage Clos fabric or it's five-stage. You can go in there and pick an EVPN-VXLAN fabric or an IP fabric. You specify what all that looks like, and what will happen is Apstra builds out the cabling map for you, so you don't have to go and specify each particular physical link.

Speaker 3: 8:47

Apstra kind of does it, and there's a bunch of value in that. It's repeatability. The great thing is, Tim, if you use Apstra to go build one blueprint in the data center, and then I turn around and use Apstra to build the other blueprint in the data center, the cool thing is all the ports are laid out the same way. It's not like, well, I decided to put all the leaf-spine connections in the middle of the switch and use port 26 at random, and you use 0/0 and 0/1, for example. So it keeps it consistent.

Speaker 3: 9:16

And so by being the operator using Apstra and expressing that intent, yes, Apstra turns around, builds it out, stores it all in the graph database, and that's how it keeps track and it knows. And the cool thing about that: you can do it before any hardware shows up, so you can actually go build all that out. You take the cabling map, you export it and you hand it to somebody, maybe smart hands or whoever's doing the install if it's not yourself, and then they're able to go implement everything. And Apstra alone will alert you if there's a cabling issue. Yeah, great question. Hopefully that helps.
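
To show why a generated cabling map gives that repeatability, here is a minimal Python sketch that derives leaf-spine links from a fabric description before any hardware exists. It assumes a simple three-stage Clos with one link per leaf-spine pair; `build_cabling_map` and the `uplink-N`/`port-N` naming are hypothetical and do not reflect how Apstra actually assigns ports.

```python
def build_cabling_map(spines, leafs):
    """Derive every leaf-spine link from the device lists.

    Port assignment is purely positional (an assumption for this sketch):
    a leaf's uplink N goes to spine N, and a spine's port M goes to leaf M.
    """
    links = []
    for leaf_idx, leaf in enumerate(leafs):
        for spine_idx, spine in enumerate(spines):
            links.append((leaf, f"uplink-{spine_idx}", spine, f"port-{leaf_idx}"))
    return links


# Two operators building the same design get the same map, every time.
map_a = build_cabling_map(["spine1", "spine2"], ["leaf1", "leaf2", "leaf3"])
map_b = build_cabling_map(["spine1", "spine2"], ["leaf1", "leaf2", "leaf3"])

for leaf, leaf_port, spine, spine_port in map_a:
    print(f"{leaf}:{leaf_port} <-> {spine}:{spine_port}")
```

Because the map is a pure function of the design, it can be exported and handed to whoever is doing the install, and later compared against what is actually discovered on the wire.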

Speaker 2: 9:53

Hey, that was great man. That's exactly what I was looking for.

Speaker 3: 9:55

Yeah, so you know we kind of take that issue. But you know, let's take it a step further. Let's think about this like as a day two issue. Right, like, your network's up and running, you know it's online and active, there's a bunch of applications and services running out of it. You know, and I don't know, maybe something happens and you know there's some sort of day two change that needs to happen and you know it could be a number of things.

Speaker 3: 10:19

And all of a sudden somebody unplugs a cable, and they decide to go plug it back in, and they misidentify the port. They plug it into port three instead of port two. So in this case there are a couple of different things that'll happen. One, yes, Apstra, without any AIOps stuff, is just going to tell you that. Same thing applies, you get that cabling event. But what we're doing with the suite of services that we're essentially throwing AIOps against is making a correlation. One: what are the services and applications running on that switch, and out of what ports are they running, what physical interfaces? We can render that, visualize it and show it to you.

Speaker 3: 11:05

But then the other one is actually turning around and saying, okay, maybe that was a connection between a leaf and a spine, and maybe a BGP session went down. Traditionally, all these messages would get generated. We all know from looking at show log messages that you're going to get a bunch of cryptic stuff and you're going to get some things that actually just make sense. And then, well, maybe you just saw BGP went down and you don't know that it's an interface issue yet. How do you separate the signal from the noise, if you will? That was the first thing that we really focused on with the AIOps stuff that we're doing: we're essentially running a bunch of correlation

Speaker 3: 11:42

against that to say, okay, these types of events are grouped together, we know that these are all related to one another. And go look here. Go look at the interface, don't look at BGP. That's a symptom, that's not actually the cause of the issue here. Go look at the cabling that's been reconfigured or set up differently. Either correct it by fetching LLDP and updating the config, or correct it by having somebody move the cable back. You have some options there.
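
As a rough illustration of that kind of grouping, the sketch below clusters events that hit the same link within a short time window and labels the lowest-layer event as the probable cause. This is a plain rule-based stand-in, not the correlation that Marvis or Apstra actually performs; the event fields, the `LAYER_RANK` table and the 30-second window are all assumptions.

```python
from collections import defaultdict

# Lower rank = closer to the physical layer = more likely to be the cause.
# These event types and ranks are hypothetical.
LAYER_RANK = {"cabling_mismatch": 0, "interface_down": 1, "bgp_session_down": 2}


def correlate(events, window_seconds=30):
    """Group events on the same link that occur close together in time and
    pick the lowest-layer event in each group as the probable cause."""
    by_link = defaultdict(list)
    for ev in events:
        by_link[ev["link"]].append(ev)

    groups = []
    for evs in by_link.values():
        evs.sort(key=lambda e: e["ts"])
        current = [evs[0]]
        for ev in evs[1:]:
            if ev["ts"] - current[-1]["ts"] <= window_seconds:
                current.append(ev)
            else:
                groups.append(current)
                current = [ev]
        groups.append(current)

    results = []
    for group in groups:
        cause = min(group, key=lambda e: LAYER_RANK[e["type"]])
        symptoms = [e for e in group if e is not cause]
        results.append({"link": group[0]["link"], "cause": cause, "symptoms": symptoms})
    return results


events = [
    {"ts": 104, "link": "leaf1-spine1", "type": "bgp_session_down"},
    {"ts": 100, "link": "leaf1-spine1", "type": "cabling_mismatch"},
    {"ts": 101, "link": "leaf1-spine1", "type": "interface_down"},
    {"ts": 500, "link": "leaf2-spine2", "type": "bgp_session_down"},
]

results = correlate(events)
for r in results:
    symptom_types = [s["type"] for s in r["symptoms"]]
    print(f"{r['link']}: look at {r['cause']['type']}; symptoms: {symptom_types}")
```

For the first link the output points at the cabling change rather than the BGP session, which is the "that's a symptom, not the cause" behavior described above.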

Speaker 1: 12:11

I love that because you know we've all troubleshot something right. And you're like you said. You're looking at this sea of log messages and you spot something like oh, BGP, oh, I know BGP, I can troubleshoot that. And then you just grab onto that for dear life and think like maybe I can fix this problem if I troubleshoot BGP and then, like you said, it's an interface down issue. You're never going to get that session back up if your cable's plugged into the wrong port. So just to be able to use this to, hey, no, that's like you said, that's a symptom, that's not the problem. Go look here.

Speaker 3: 12:42

That's a time saver, right? That was kind of our target, really, too. What we wanted was the simple use cases, not because we're afraid of the complexity, but it was like, I want the things that are high value and happen often. I don't want to go after the corner case that you ran into three years ago in your data center that was really hard. Those are good to solve too, but it's just like, man, where can we make it easier for users to troubleshoot these things? And so that's the approach that we've taken.

Speaker 2: 13:10

Yeah, aim for 80% and you'll get most of your absolute use cases.

Speaker 3: 13:18

Yeah, that's definitely been the goal, so it makes a ton of sense. We've done other things too, if people are familiar with the Marvis chatbot. I think for everybody now, chatbots are going to become table stakes, right? Let's just be real, they're going to become table stakes.

Speaker 3: 13:35

This kind of goes back to the other thing, though. I was like, the first thing we should do with our chatbot is make sure that people don't have to go dig in docs ever again. I should be able to go ask a chatbot a question and get an answer back with a high degree of confidence that it's accurate, and then also get some proof behind that. So we've taken the Marvis chatbot that people are used to with Wi-Fi and we've expanded that everywhere. I'm here talking to you about data center: we've added in all the documentation, so now you can just go query the chatbot, ask your questions and get information back. But we've added other Juniper products in there as well, as far as documentation goes. And the next step, where we're going next, is actually using the chatbot as a conversational interface to pull state data out of the network. If you think about that graph database that I talked about, we have this really powerful single source of truth. What if I now wrap a conversational interface in front of it that allows you to ask normal questions in normal human language and get back meaningful information about the live state of your network, and again, not on a switch-by-switch level?

Speaker 3: 14:46

I think that's the other part of this. That is really nice, right? You could take a chatbot and say okay, let me go ask on a switch-by-switch level what's happening. That doesn't help, right. Especially like in modern data centers. You know the scale-out's pretty big.

Speaker 1: 15:00

Yeah, absolutely. So we've kind of talked about the physical interface, the cabling and stuff like that, but now let's go up a layer right, Like can it help us troubleshoot routing protocol issues?

Speaker 3: 15:13

Yeah, it can. We'll call out things like a BGP mismatch, so for example, if the peer ASN is different than what you have configured, we'll call that out. But we won't just call that out in a message where you have to go dig through. We actually use the virtual network assistant to show you what that looks like.
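
The peer ASN check mentioned here amounts to comparing what each side is configured to expect against what the neighbor actually announces. A minimal Python sketch, assuming a hypothetical list of session records (the field names and `find_asn_mismatches` are invented for illustration and are not an Apstra or Marvis API):

```python
def find_asn_mismatches(sessions):
    """Return sessions whose configured peer ASN differs from the ASN
    the neighbor actually presented."""
    return [
        s for s in sessions
        if s["configured_peer_asn"] != s["observed_peer_asn"]
    ]


sessions = [
    {"local": "leaf1", "peer": "spine1", "configured_peer_asn": 64512, "observed_peer_asn": 64512},
    {"local": "leaf2", "peer": "spine1", "configured_peer_asn": 64513, "observed_peer_asn": 64512},
]

asn_mismatches = find_asn_mismatches(sessions)
for s in asn_mismatches:
    print(
        f"{s['local']} -> {s['peer']}: configured AS {s['configured_peer_asn']}, "
        f"peer announced AS {s['observed_peer_asn']}"
    )
```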

Speaker 2: 15:30

AIOps is great, but if you don't know what to chat for or what to look for, it almost doesn't help you that much. So how does Apstra surface these problems to you as the...

Speaker 3: 15:43

Yeah. So again, great question. We're using what's called the Marvis Virtual Network Assistant, and we have the Marvis Virtual Network Assistant for data center. So again, the Wi-Fi folks are going to be pretty familiar with it. It gives a little bit of this octopus layout view and tells you, here are the Marvis actions to go take, and it summarizes different information under categories. What we've done on the data center side is we've said, okay, here are the different categories, like layer one and two, and we bundle things underneath that: connectivity, device stuff, things going on around traffic capacities, so we might be looking at hot and cold interfaces. We'll surface this in an easy-to-consume view. Some folks call it a coffee cup view, and I kind of like that because it's abstracting away a lot of the low-level detail and just providing it to you at a really high level. You come in in the morning, you're in a NOC or something, and while you're sipping your coffee you take a look at it and you say, hey, is there something here that I need to pay attention to? Maybe you look at it and you say, all right, man, this device has an issue and it's triggering an environmental alarm with a power supply. So we're going to call that to the forefront. And you might look at it and go, oh man, that's no big deal, that's a new rack getting installed. We put that switch in last night. All right, I'll get somebody to look at that, but it's not really impacting.

Now we then take that a step further. The other thing that we've done, and to me this one's pretty important, is what we call service awareness. What service awareness does is map the services and applications running in your data center to the infrastructure and the resources that they depend upon.

Speaker 3: 17:33

So what do I mean by that? I kind of think back to my days as a network engineer doing NetOps, and there would be some application issue. When you're running a big enough company, there's always something going on. There's some application issue in an enterprise, and the application is doing a database query and it's slow. The database team's like, database looks fine. Everybody's like, must be a network issue, right? It's slow, so it's got to be a network problem.

Speaker 3: 18:01

So there's a P1 going on because, I don't know, maybe it's a booking engine for a hotel. That's a pretty big impact. So you're like, all right, we've got to get this fixed. You're the network guy, you get called, you jump on the P1, you're like, what application is it? Oh, okay, well, it's the lodging booking app. Okay, what server does it use? Well, there's this load balancer VIP. Okay, well, which server is it actually using right now, though, for that one query? You spend all this time going back and forth. Then you're like, all right, let me go look at an ARP table and figure out what that server's connected to. And then, oh, I see it's on leaf 6, port 14. So you're looking at that and then you realize, oh, actually port 14 is part of a LAG.

Speaker 3: 18:47

I'm kind of going on and on about this, but I think anybody who's worn a NetOps hat is familiar with this. You've gone down this track. And then you go look at everything and you're like, well, I looked at the BGP table, I looked at the type twos, I'm good, everything looks fine there. And so they're like, hey, you've been quiet on this P1. Hey, network engineer, how are things looking? And you're like, well, I checked the EVPN table and I see the type two routes. Everybody else on the call is like, I don't know what this guy's talking about. It's utterly meaningless to the rest of your organization.

Speaker 3: 19:26

And what you really need to be able to do is, how do we get to mean time to resolution? I actually think mean time to innocence is such a... in the networking world. What do you do? You hang up the phone and you're like, good luck troubleshooting this, the network's fine. You're not helping your employer, right? So to me this is all about how we help the network person. You're still the network engineer, you're not the application guy, you're not the DBA. But how do we empower you to be able to talk back to your organization in ways that are meaningful?

Speaker 3: 20:01

And so, with service awareness, what we've done is take flow data plus the topology and map it out. We'll show you, hey, this server is connected to this leaf, and out of that server we have these services that are running. What I mean by that: the SQL service is running out of server one, but it's also running out of server four and server seven and server eight, out of these different ESXi clusters, because, again, in the data center everything's distributed. So you can say, well, these services are running out of here, and I see where this is going. You don't have to ask those questions; you automatically know already. And even if that's in a particular VRF or a routing zone, again, with Apstra, and Apstra Cloud Services as an extension (it's just another way that we're delivering services around this), it tells you all of that. It does all the discovery for you. All of that is essentially built in; you're not having to go do this. So now you can map the services that make up your critical application, or any application, quite frankly, to it.
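
Joining flow data with topology, as described, can be pictured as a simple lookup. The Python sketch below is only an illustration of the idea, not how Apstra's service awareness is built: the flow record fields, the `WELL_KNOWN` port-to-label table, `map_services`, and all server, leaf and port names are assumptions.

```python
from collections import defaultdict

# Assumed port-to-service labels for this sketch.
WELL_KNOWN = {1433: "sql", 443: "https"}


def map_services(flows, server_links):
    """Join flow records with topology.

    flows: records with the server name and the port it serves on.
    server_links: server name -> (leaf, port) from the topology.
    Returns service label -> sorted list of (server, leaf, port).
    """
    service_map = defaultdict(set)
    for flow in flows:
        port = flow["server_port"]
        service = WELL_KNOWN.get(port, f"port-{port}")
        leaf, leaf_port = server_links[flow["server"]]
        service_map[service].add((flow["server"], leaf, leaf_port))
    return {svc: sorted(locations) for svc, locations in service_map.items()}


server_links = {
    "server1": ("leaf6", "port14"),
    "server4": ("leaf2", "port3"),
}

flows = [
    {"server": "server1", "server_port": 1433, "bytes": 5_000},
    {"server": "server4", "server_port": 1433, "bytes": 7_000},
    {"server": "server1", "server_port": 443, "bytes": 1_200},
]

service_map = map_services(flows, server_links)
for service, locations in service_map.items():
    print(service, "->", locations)
```

With a map like this, the question "which leaf and port is the SQL service on right now?" is answered up front instead of through a chain of ARP-table lookups during an incident.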

Speaker 3: 21:10

And the other thing that we've done, to bring what we think is important back to the user in a way that we can build into a product: we bundle traffic together. When you first go look at this, we're going to show you, within a given time frame (it could be a 15-minute increment, it could be an hour, it could be a day), hey, here are some chunks of data. So these 14 services make up 700 gigabytes of data that was transferred in the last hour, for example. And then there are 27 other services, but they made up, I don't know, two gigabytes of data. So we show that to you in these different blocks.

Speaker 3: 21:51

But the idea is to draw your attention to the 700 gigabytes of data, because more than likely either, A, that's the one you'd be interested in, or B, the inverse could be true: it could be that there's no data because there's a network issue and nobody can access the application. But either way, you can start to divide these things up. With everything that we've done, and I think this is an important approach, you've got to abstract things out a little bit to figure out, how do I give a high-level view first before people dive deep? Because if you dive deep too quick, you put yourself down a rabbit hole and you might be going in the complete wrong direction from where you need to be.
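
One simple way to produce those blocks is to rank services by bytes transferred in the time window and split them where the cumulative share crosses a threshold. The Python sketch below assumes a 90% cutoff and made-up service names and byte counts; it is a stand-in for illustration, not the bundling logic the product actually uses.

```python
def bundle_by_volume(bytes_by_service, heavy_share=0.9):
    """Split services into a 'heavy' block that together carries roughly
    heavy_share of the traffic, and a 'light' block with everything else."""
    total = sum(bytes_by_service.values())
    ranked = sorted(bytes_by_service.items(), key=lambda kv: kv[1], reverse=True)

    heavy, light, running = [], [], 0
    for name, nbytes in ranked:
        # Keep adding to the heavy block until it covers the target share.
        if total and running < heavy_share * total:
            heavy.append(name)
            running += nbytes
        else:
            light.append(name)
    return {"heavy": heavy, "light": light, "heavy_bytes": running, "total_bytes": total}


# Bytes transferred in the last hour (hypothetical numbers).
last_hour = {
    "sql": 400_000_000_000,
    "web": 300_000_000_000,
    "dns": 1_000_000_000,
    "ntp": 500_000_000,
}

blocks = bundle_by_volume(last_hour)
print("heavy:", blocks["heavy"], "light:", blocks["light"])
```

A service that normally sits in the heavy block and suddenly shows up in the light one is also worth a look, since no traffic can itself be the symptom.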

Speaker 1: 22:33

Well, I think this is great because, when I think back to when I owned a network, I had all of these tools that checked a box, right? It was a requirement to have a log collector and a requirement to have this and a requirement to have that. I didn't use it, and even when it came to troubleshooting I forgot that I had it at my disposal. And even if I did use it, it was just so full of information I couldn't make anything of it. It caused more confusion and heartache than it actually helped me. So to have something that can examine this data, make it useful and provide me some, I don't know, decisions or some really useful information, rather than cloudiness, I think this is great, man.

Speaker 3: 23:14

I know that pain all too well. I'm right there with you. I remember working in a fairly large enterprise. We had 9,000 switches and routers in our environment. It was a pretty decent size: four data centers, 58 different campuses. And the security team brought in a new SIEM solution.

Speaker 3: 23:36

And it was like, hey, we need you to send syslog messages from all your devices to this thing. It was like, okay, what do you want me to send? They're like, any-any, and I'm like, I'm not sending any-any. So we kind of had this: well, why not? Why are you trying to keep stuff from us? No, I'm not. My network is not going to be the top talker; my switches are not going to be the number one thing using bandwidth on this network. All the servers, everybody... And they wouldn't let me get access to it. Our team can't troubleshoot. I just want read access to be able to... And they're like, no, no, no, that's a security risk. Okay.

Speaker 2: 24:19

You know, so yeah.

Speaker 3: 24:21

I feel the pain.

Speaker 2: 24:22

Yeah, I think the number one thing that AIOps at any level can do for network teams is that correlation piece, to be able to draw insights and surface them in a meaningful way, because we all have access to thousands and thousands of data sources. We've all got SolarWinds or some version of NetFlow that's running, and we've all got the SIEM, or maybe not the SIEM, but syslog and all the boxes checked, like AJ said. But when it comes down to it and there's a P1 and everybody's breathing down your neck, you don't have time to correlate 40 different data streams and try to figure out where they line up, right?

Speaker 3: 25:04

Yeah, 100%. And the other thing that I think about in that same aspect of it: you're absolutely right, you're getting all this data. The other thing is, you end up with people that spend a ton of time configuring and setting up the tools because, and don't get me wrong, the intent is there, it's like, well, what if a user wants to do this? And what if a user wants to do that? Well, let's give them every option so they can configure this thing however they like. And a lot of times I don't want to configure it however I want. I don't want to become an expert in this one tool. I just want it to set up and work and give me value.

Speaker 3: 25:46

And that's the other thing that I really like about the idea behind AIOps as well, right? It's like, if I have a whole set of data and I could just kind of point something at it and it gives me meaning back, right? And I mean, that's a gross oversimplification of what's happening. But if I can go leverage a RAG and, you know, understand how to go do that, and now I just turn around and use the RAG to go do that, right? To me, that's the part that gets me excited. It's like, oh wait, I don't have to... you know, yeah, here's a tool that we have for you, but by the way, here's a 1,400-page book on all the different configuration options. It'll take you a month to go figure out what it's capable of and another three months to set it up, right?

Speaker 2: 26:33

Like that, or you've got to get a vendor to set it up for you, whose whole, sole business is going to be setting up this application.

Speaker 3: 26:43

Absolutely, yeah, no, absolutely. And I think that's the part that kind of gets me excited. And then even better, right, the technology... I mean, as we've been working on this at Juniper, we've watched the paradigm shift in the advancements in... You hear claims, and it's like, well, let's go actually kind of validate that. But in the networking space, maybe, there's all these kinds of tools and interesting things that have been happening everywhere else in the infrastructure stack. There's tools that the server teams have had for quite some time.

Speaker 3: 27:23

Let's be honest, we've been pretty slow to adopt these things in the network realm, and we're at, like, the source of all the actual information, right? Like, we could actually see packet-level data, right?

Speaker 1: 27:36

It's like what are we doing?

Speaker 3: 27:38

So yeah, I'm pretty, you know, healthy skepticism, as I said, but pretty optimistic. I've seen proof in the pudding so far and you know I like what I've been seeing around AIOps and I think there's some real value there.

Speaker 1: 27:52

Yeah, Shean, I'm a longtime VMware and Windows guy. Like, I did infrastructure for a really long time. So to kind of put an analogy on this: a long time ago, on early versions of Windows, you would have to deploy things like Active Directory or any Windows service in a very manual fashion. But as new versions of Windows come out, there's wizards where you click a few boxes and put in a little bit of information and the wizard does it for you. You know, in early versions of VMware, if you wanted to deploy VMware vCenter, you had to install Windows Server, install vCenter, and set up the database server. And then in later versions, you put in some information and it deploys it all for you, and it uses the best practices and security standards and all this other stuff.

Speaker 1: 28:36

So why do I want to go hand jam a bunch of those standards and best practices and stuff on a bunch of switches in a data center when I can put in some basic information about what I want my network to look like and have something else go do that for me? It saves me a ton of time. It lets me raise up my skills and focus on other things that AIOps can't handle, that I can handle, that I should be handling. I think this is long overdue, right, like you were saying.

Speaker 3: 29:04

Yeah, agreed, and it's a great analogy and rings so true, right? Definitely rings true, for sure. And, you know, the next kind of phase of this, when we look at it too, and I think, Tim, you mentioned something a little bit earlier that kind of made me think about it: we're also going to get to this point, and we're actually fairly close, because we can get to it pretty quick, of predictive analysis around stuff. So let me bring up an example, one that's there today. We have this piece that's called... so, I talked about service awareness. So if you grab the mental model, and apologies, I can't share my screen, but if you grab the mental model: okay, so I have a visual topology that maps ports and protocols and the services in the data center.

Speaker 3: 30:01

And oh, by the way, we also show all the clients that are connecting to it as well. Again, we aggregate that, right? Who wants to see, you know, hey, there's 30,000 clients connecting to the service in the data center? You don't really need to drill down into one. We give you the option to do that, but you're more interested in, hey, there's 30,000 clients connected to the service, that's a great thing, right? That's good. Hey, there should be 30,000 and there's only one, that's a problem, right? And so, one, we surface that right away in service awareness.
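To make the "there should be 30,000 and there's only one" check concrete, here is a minimal Python sketch. It is not Apstra code and does not use any Juniper API; the `ServiceClients` record, the `client_count_alerts` function, and the 50% ratio are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ServiceClients:
    """Aggregated client counts for one service (hypothetical record)."""
    name: str
    expected_clients: int  # assumed to come from some earlier baseline
    observed_clients: int  # what is connected right now


def client_count_alerts(services, min_ratio=0.5):
    """Return the names of services whose observed client count has
    dropped below min_ratio of the expected count."""
    alerts = []
    for svc in services:
        if svc.expected_clients == 0:
            continue  # nothing expected, so nothing to compare against
        if svc.observed_clients / svc.expected_clients < min_ratio:
            alerts.append(svc.name)
    return alerts


services = [
    ServiceClients("sql", expected_clients=30_000, observed_clients=29_500),
    ServiceClients("web", expected_clients=30_000, observed_clients=1),
]
print(client_count_alerts(services))
```

The point of the sketch is the aggregation: the operator looks at one number per service rather than 30,000 individual client sessions.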

Speaker 3: 30:35

But then you get to impact analysis, and what we've done there is we're taking those anomalies that occur, that I talked about from Apstra, right? Like the cabling mismatch, we call that an anomaly. And we have a bunch of others: BGP mismatch, and there's a bunch of predefined probes in Apstra itself that we highlight these anomalies on. And with impact analysis, we send these anomalies from your Apstra cluster into Apstra Cloud Services, which is where all this stuff lives. This, again, is that PaaS layer. And so now we're able to say, okay, you have this service and these applications running, and then you have this event or this anomaly that happened on your network.

Speaker 3: 31:19

And then we turn around and we say, all right, because of these events, one, we're going to group them together where there's already correlation. We're doing that correlation for you, right? So we're going to group those together. And then, two, we also turn around and tell you, hey, these are the services that that could impact. Case in point: let's just say that you have a fairly large data center, and you come in and one of the power supplies is down on a leaf in a rack, and you're like, oh well, if I don't fix it, it's got dual power supplies, so I'm okay.

Speaker 3: 31:54

Maybe the other power supply is running a little hot right now, but, all right, I'm all right. And you're like, man, do I need to go get that power supply replaced immediately, or have somebody reseat the power cable, whatever? Do I need to get eyes on that right away? And what we're able to show you, one, is again mapping those services. So the power supply has an issue. We'll turn around and tell you, these are the services that could be affected if you don't do something about this. We're not saying that they're impacted right now, but we're saying that if something doesn't happen, they could be impacted. So you can easily draw the distinction between that and a switch or a leaf that has nothing connected to it, with no services running on it. Or maybe there are physical servers connected to it, but there are no active services being run on them, for whatever reason, who knows? Right, okay, I don't need to prioritize that one. But maybe I have four leafs in my data center where this is the case. I'm going to go prioritize the one that I do have active services running on. The other one I bring up is, like, RMAing a switch. It happens, right? We'd love to say switches never fail. RMAing a switch, or even differently, right, I need to swap some switches.

Speaker 3: 33:01

Maybe you're doing, like, a generation update of your fabric inside the data center, and you have to go to a change board and you have to go coordinate with another team to say, hey, I need to take this switch offline, can you vMotion some servers away from the switch? We automatically map all that stuff out for you, so you know right away where all these things are connected and what services are going to be impacted. You'll know, oh, okay, again, hey, it's the SQL service that's running, and it's running in, like, you know, a VRF.

Speaker 3: 33:32

You know, for this VRF, for this point-of-sale system that I have in its own virtual network, and it's in a VRF. And so you'll know, okay, it's the SQL service that's running in that VRF. Let me go and talk to that team, because I need to go kind of map that out. So it helps with capacity planning, maintenance actions, all those types of things. Again, they're not always the sexiest thing to go and talk about, or make a big marketing splash about, but these are impactful for everyday type of stuff, right, to people that are doing the job.
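The impact-analysis idea described above (group related anomalies, then list the services that could be affected) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the anomaly dictionaries, the `SERVICES_BY_SWITCH` inventory, and both function names are hypothetical, and nothing here reflects how Apstra actually stores or correlates its data.

```python
from collections import defaultdict

# Hypothetical inventory: active services behind each leaf switch.
SERVICES_BY_SWITCH = {
    "leaf1": ["sql-point-of-sale", "web-frontend"],
    "leaf2": [],  # servers may be cabled, but no active services
}


def group_anomalies(anomalies):
    """Group anomaly types by the switch they were raised on."""
    grouped = defaultdict(list)
    for anomaly in anomalies:
        grouped[anomaly["switch"]].append(anomaly["type"])
    return dict(grouped)


def prioritize(anomalies, services_by_switch):
    """Return one entry per affected switch, ordered so that switches
    with the most services at risk come first."""
    report = []
    for switch, types in group_anomalies(anomalies).items():
        report.append({
            "switch": switch,
            "anomalies": types,
            "services_at_risk": services_by_switch.get(switch, []),
        })
    report.sort(key=lambda entry: len(entry["services_at_risk"]), reverse=True)
    return report


anomalies = [
    {"switch": "leaf2", "type": "power_supply_down"},
    {"switch": "leaf1", "type": "power_supply_down"},
    {"switch": "leaf1", "type": "psu_temperature_high"},
]
for entry in prioritize(anomalies, SERVICES_BY_SWITCH):
    print(entry)
```

In this toy data, both leafs have a failed power supply, but only `leaf1` carries active services, so it sorts to the top. "Services at risk" here means "could be impacted if nothing is done", not "impacted right now", which matches the distinction drawn in the conversation.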

Speaker 1: 34:02

Yeah, I mean, how many times have I had to like temporarily move a service or whatever and forgot it was there and did?

Speaker 3: 34:11

Yeah, and other things, just, you know, in Apstra too, right, because it's just an extension of it. We have things like drain mode, so you can go put the switch in drain mode, and when you do that, it's going to basically drain all the traffic off that switch. You don't have to go in there and figure out how to manually configure it and change BGP values and all those types of things and manipulate the NLRI. It'll turn around and do that for you to drain all the traffic off of it. And then we'll actually alert you if there's traffic over a certain threshold. It's configurable, but there's a default threshold. So it'd be like, hey, if there's over, you know, a megabit per second of traffic on the switch as a whole, it'll generate an anomaly. And it only does it when it's in drain mode, by the way.

Speaker 3: 34:53

So it's like, that switch is in drain mode and you're getting more than one megabit per second of traffic on there? You know there's something going on.
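The drain-mode rule just described boils down to a two-part condition, sketched below in Python. The function name and parameters are hypothetical, and the 1 Mbit/s default is taken from the rough figure mentioned in the conversation, not from product documentation.

```python
# Assumed default, based on the "megabit per second" figure mentioned above.
DEFAULT_THRESHOLD_BPS = 1_000_000


def drain_anomaly(in_drain_mode, traffic_bps, threshold_bps=DEFAULT_THRESHOLD_BPS):
    """Raise an anomaly only when the switch is in drain mode AND its
    total traffic is still above the (configurable) threshold."""
    return in_drain_mode and traffic_bps > threshold_bps


# A drained switch still carrying 5 Mbit/s is suspicious.
print(drain_anomaly(True, 5_000_000))
# The same traffic on a switch that is not draining is normal.
print(drain_anomaly(False, 5_000_000))
```

Gating the check on drain mode is the design choice that matters: the same traffic level on a switch in normal service is expected and should stay quiet.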

Speaker 1: 34:59

Yeah, right.

Speaker 3: 34:59

Yeah.

Speaker 2: 35:01

That makes sense.

Speaker 3: 35:02

Yeah, so there's some cool stuff like that. And then, you know where this is going too, right? Predictive type of analysis that's very specific to your environment. These are the things that, to me, are always important as well. Like, we might try and say, hey, these interfaces on the switch are cold, right? But if you go back and you think about whatever your type of business is, maybe your type of operation is that Monday through Friday is when all the traffic is, and on the weekends nobody's using your data center that much. So rather than getting alerts that the interfaces are cold on a Saturday, now it could be like, all right, well, look, we know for your environment, because we've turned around and... Essentially, I remember, trending analysis isn't new. That's not a new thing. There's all sorts of tools that could do that.

Speaker 3: 35:57

I remember having to do that, and you had to baseline it for at least 30 days, and usually 60, because you had to get the full cycle of whatever your normal business operation was in, to do even a minor trend, right? Now, with ML, you can go run that calculation that was being run before much faster, and that's the cool part about it. You can do this much faster and get value out of it a lot quicker. You don't have to be like, well, buy this tool and in six months we'll know what your baseline looks like in your environment. Like, no, you can do this a lot faster. Awesome.
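For readers who want to picture the environment-specific baseline, here is a small Python sketch of the classic trending approach the speaker says predates ML: keep a separate baseline per day of the week, so a quiet Saturday is compared with other Saturdays rather than with a busy Monday. It uses plain mean and standard deviation, not machine learning, and the function names, the `k` multiplier, and the `floor` value are all assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev


def build_baseline(samples):
    """samples: iterable of (weekday, utilization_percent),
    with weekday 0 = Monday ... 6 = Sunday.
    Returns {weekday: (mean, population standard deviation)}."""
    by_day = defaultdict(list)
    for weekday, value in samples:
        by_day[weekday].append(value)
    return {day: (mean(vals), pstdev(vals)) for day, vals in by_day.items()}


def is_unusual(baseline, weekday, value, k=3.0, floor=1.0):
    """Flag a reading that is more than k standard deviations (and at
    least `floor` percentage points) away from that weekday's mean."""
    avg, sd = baseline[weekday]
    return abs(value - avg) > max(k * sd, floor)


history = [(0, 70), (0, 72), (0, 68), (0, 71),   # Mondays: busy
           (5, 2), (5, 3), (5, 1), (5, 2)]       # Saturdays: quiet
baseline = build_baseline(history)

print(is_unusual(baseline, 5, 2))   # quiet Saturday, as usual
print(is_unusual(baseline, 0, 2))   # same reading on a Monday
```

With this toy history, 2% utilization is normal on a Saturday but is flagged on a Monday. What the conversation credits to ML is reaching a usable baseline faster, which this sketch does not attempt to model.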

Speaker 1: 36:28

This is, uh, I'm surprised that I... I'm glad I brought a healthy dose of skepticism, but you're winning me over.

Speaker 3: 36:47

Well, that's good. And, like I said, look, the skepticism is good. We have different ways people can go take a look at this. For Apstra customers, it's really easy to use. For Marvis customers, there's also more integration between Marvis and what we're doing in the data center too. It has the same look and feel, so that part is really nice. And then, for folks that aren't a Marvis customer or an Apstra customer, look, there's marketing videos and demo videos out there. But of course, anybody at Juniper will be happy to walk you through it, and you can always reach out to me as well. So, yeah, the skepticism again is, like I said, I think, healthy, and people should approach it and ask questions and be inquisitive and, you know, ask to see it, right? From any vendor, from anybody, us included. Right, I agree. You know, vendors should be earning your business every day. Yeah, I'm a big believer of that.

Speaker 1: 37:43

And, you know, two important points here. One is, we're talking about Apstra, and that's not just Juniper. That's vendor agnostic. So everything that you've talked about here tonight works...

Speaker 3: 37:54

That's a good point.

Speaker 1: 37:57

Works with...

Speaker 3: 37:58

Good point. Works with any switch, right? Well, anything... there's switches that we qualify. Yeah, yeah, any switch qualified by Apstra. So yes, you know, obviously we first and foremost qualify all the Juniper QFXs and Juniper devices. But of course, yes, Cisco devices, Arista devices, and we support SONiC as well.

Speaker 1: 38:18

And then I feel like whenever you start to talk about AI, there's always somebody in the group that rolls their eyes and is like, AI is going to take our jobs. And I don't hear that in this conversation. This sounds like the coworker I wish I had, not somebody that's going to replace me.

Speaker 3: 38:37

Well, you know the way I look at it, and I wish I remembered exactly who said it, because I'd want to give them appropriate credit, but it was like: AI is not going to take your job. Somebody that uses AI is going to take your job. Yeah, and that's the difference, right? I mean, I use different AI tools all the time in my day to day, and I was slow to do it.

Speaker 3: 39:05

To be honest with you, it took me a little while to be like, okay, I need to change my way of thinking around this, right? But it just allows you to be more efficient. And so, really, it's that same approach here: it allows you to be more efficient. That's the goal. So, again, AI is not the magical wave of a wand. Let me just throw AI at something and it'll make it all better? No, you need good data, right? Data in, data out still applies.

Speaker 3: 39:38

You can't get rid of that. If you think about it, and I'm going to oversimplify it quite a bit, it's a lot of math. It's a ton and a ton of math and math transactions. It still needs good data to run that math against. And this is why, again, kind of tooting the Juniper horn a little bit, but this is why, in the Apstra framework, because we have that graph database, this is why we went that route in the data center, right? We have the data that we're collecting, and we also collect all this telemetry data already. So now it's like, I'm going to throw AI at that. Since I know the state, I already know all these things, I know what it should look like. So now let me go help solve some kind of operational challenges here. So yeah, it's a much different kind of approach, I would say, than people are probably used to.

Speaker 1: 40:35

I mean, 100%. If you're responsible for updating documentation and you run into a problem and you pull out your network drawings and you go, God, these network drawings suck, they're not up to date... Well, like you said, good data in, good data out. If you don't update your network drawings, you can't use them to troubleshoot. Now, with the same concept, it's a live network map that just updates itself as changes are made, and when problems happen, it's correct data right there. 100%.

Speaker 3: 41:04

I've been, uh, I've been sipping on my water cup, but I um have my network engineering stick awesome.

Speaker 1: 41:13

Well, we appreciate that. Shean, this has been a fun conversation. Is there anything else you want to add before we put a bow on it?

Speaker 3: 41:31

in the Juniper space. You know, if you have further questions, reach out. And if you're just interested... I mean, at the end of the day, I'm a network nerd at heart, so I don't need to come around and give everyone a product pitch. If people are just interested in our experience and kind of what we've done around this, feel free to reach out as well. So, happy to share.

Speaker 1: 41:55

Awesome. Well, if you want to learn more, you can go to juniper.net/aone. And he is at Sean LV on X slash Twitter, whatever we're calling it these days, so if you have any follow-up questions, you can certainly reach out to him there or leave a comment wherever you found the show. Shean, thank you so much for joining us. I really appreciate it. And Tim, thanks for co-hosting tonight.

Speaker 2: 42:16

Yeah, that's been great, man. Thanks for dropping by. Awesome.

Speaker 1: 42:18

Yeah, thank you, gentlemen. Appreciate it. Great conversation, absolutely, and we'll see you next time on another episode of the Art of Network Engineering podcast. Hey everyone, this is AJ. If you like what you heard today, then make sure you subscribe to our podcast in your favorite podcatcher. Smash that bell icon to get notified of all of our future episodes. Also, follow us on Twitter and Instagram. We are at Art of NetEng, that's Art of N-E-T-E-N-G. You can also find us on the web at artofnetworkengineering.com, where we post all of our show notes. You can read blog articles from the co-hosts and guests, and also a lot more news and info from the networking world. Thanks for listening. We'll see you next time.

People on this episode

Andy Lapteff

Jeff Clark

Shean Leigon