The Art of Network Engineering

Terrifying Network Outage Tales: Real-Life IT Nightmares and How to Survive Them

October 26, 2022 The Art of Network Engineering Episode 104

You sent them in and we’re sharing them! These are some of your scary network outage stories! We even share some of our own scary stories too!

Find everything AONE right here: https://linktr.ee/artofneteng


This is the Art of Network Engineering podcast. In this podcast, we'll explore tools, technologies, and talented people. We aim to bring you information that will expand your skill sets and toolbox and share the stories of fellow network engineers. Come in, sit down, and hear my story. But listener, beware. This content may not be suitable for change control managers. This is the legend of Stevie McChesey. Stevie was a fresh-faced newcomer, ready for a challenge. He prepared for this fateful night for months. The config scripts were written, checked, then rewritten. Then there were labs. Oh, so many labs. At the stroke of midnight on All Hallows Eve, Stevie carefully entered the dark and ominous data center. Arrr Ding! Stevie really should get his badge checked by security. Anyway, Stevie got right to work. The server rack shook, the floor tiles squeaked, and when he replaced one line card, an actual dust bunny had evolved into a dust jack rabbit and escaped into the night. Finally he finished. However, he saw forty-two missed calls from the NOC. After patiently listening to all of the problems, Stevie just grinned, then screamed, "What do you expect when I'm only allowed one maintenance window a year?" Happy Halloween, all, and enjoy this extra spooky episode of the Art of Network Engineering. I think I just swapped out that switch this past week. That was so good. We have a story like that. Excellent. So welcome. Welcome to a very spooky episode of the Art of Network Engineering. We've posted on Twitter looking for your scary stories. And we're going to shout them out here this evening. But before we do that, Tim at Timbertino, how are you doing, Tim? I've got a full tank of gas, half a pack of cigarettes, it's dark and I'm wearing sunglasses. Hit it. Nice. I'm good, AJ. What's going on, man? Not too much. I, well, I don't have a costume. 
I tried to pass myself off as a Cisco tech engineer because, you know, that's scary and in line with our episode, but Andy gave me too much crap for that. And so I'm just going to confess that I did not prepare as well as I could have for tonight. Put on your light on your face. It was very scary. What the hell I... Oh, there it is. You lost the flashlight in 30 seconds? Yep, that's me. That's me. Ah, there you go. Woo, spooky. Now I can't see. Excellent. And Girl Scout Lexi, how are you? Yeah, I'm good. I don't know any like Girl Scout things. No, just say you can't talk about it. It's been like 25 years, so. Super, Super Girl Scouts. God damn it. Oh, that's only rocket stuff, not Girl Scout stuff. Don't make some up. I know they're hard to do. Have the rocket badge, but yeah, I'm good. I don't I don't think so. No, I have a science badge, but I have a lot of badges. What do you have to do to get the science badge? OK, but you remember? Wow. That's impressive. Look at you legit. The badges. Yeah, I don't. I don't legitimately remember anything that went into this. Is that a name tag? Oh, sorry. No, it's a, wait, where? There's so many things on this vest. I don't know. I thought, yeah, it looks awesome. Oh, it says junior aid. Yeah. Pretty cool. Some of these are like vintage now because they came from. I won't believe it. I don't age myself anymore. But, uh. Some of these are discontinued, many of these are discontinued badges that are being sold on eBay now for a little bit of money. So I feel special. Vintage Girl Scout vest. Yeah, so I'm feeling great tonight because I know I'm vintage. I was gonna make an old joke, but I thought I better lay low. Yeah, you're in the clear. Dan's not here, so you're kind of in the clear, Andy. Speaking of Andy, what are you wearing Andy? What is he not wearing? Can you describe it for listeners? I am the... Did you just do a little twirl? You did. I was trying to show you my cape. I have a cape. I am the Network Outage... 
Excuse me. I am the Network Outage fairy. What's on the back? I don't know. I just... Yeah. This, you know... Is there an image on your teeth on the back? Register trademark Marvel. You turned around. That's AJ. Nice save AJ, nice save. Did I just get us in legal trouble? So interestingly enough, how am I doing? Thank you for the ask. I'm doing great. I was having some fun at work yesterday, working on some content believe it or not. That's, I don't think product managers, I'm not your typical product manager, I'll tell you that. So I was working on some content yesterday. that I'm excited about and I was dressed up as a fairy godmother of sorts. Is that content Juniper internal only or are we going to get to see this content? I just sent it to leadership today so if they decide to keep paying me then they'll have to decide. I'm not sure. First I'm trying to see if they're like, oh my god, what have we done? But if I get to stay and keep playing then I guess they'll have to make some decisions. I mean, ultimately I would love. I'm doing a series, I think, and I would love it to be released. I think it would be really cool and useful and funny. But I just went in the basement and the kids have their stuff and this is my outfit. But work is going good and I'm excited to be here for our spooky Halloween show. I love it. I'm excited. We got a lot of tweets and messages, DMs and emails. We got a lot of good stuff. Plus, we have some of our own. So I'm very excited to get into this. Where should we start? So I've got a list from Twitter. We can start there. Let's do it. So I took down some of these and gave them titles that I saw fit. See if they hold up. By the way, awesome Twitter people who sent in your stories. We got, yeah, thank you. We got a lot. We may not be able to get to all of them, but we will try our best. And just fair warning, I'm reading these as they were written on Twitter. So no, these are not all first person me. 
Are you gonna put like a voice on, like a Vincent Price? If anybody knows who that is. Put me on the spot, Andy. Thank you. Think Michael Jackson, Thriller. Remember Vincent Price? Oh, the ghoul should, I don't know. I forget what he said, but he's pretty creepy. I think he should. Well, I'll shut up then. Go ahead. Sorry, Andy. Not that it's ever easy to take you seriously. It's exceptionally difficult this evening. All right. Uh, this first one is from Pat Allen over there with, uh, the Breaking Down the Bytes podcast, at Larry packing on Twitter. My boy, Pat and PA. That's right. That's right. I'm going to call this one The Sound of Silence. Pat writes: Had dual-homed BGP neighbors to our, quote, meet-me rooms for hosted voice provider. Did a maintenance one night bringing a second circuit into the fold and had no phone testers. Black-holed our entire hosted voice platform for the next morning. Fun times. Have any of you ever had to do maintenance like that where, you know, they want you to do it in the middle of the night so it won't mess with anybody? But the problem with that is that there's nobody there using it in the middle of the night. My entire networking career. Yes. Has been gigantic, scary changes with no goddamn testers or validation. And then they want to yell at me the next day because what the hell happened? What was the root cause? We had no validators. Ah, yeah. I worked night shift for three years. And so I was the one doing a lot of the maintenances that they didn't want to happen during the day. But then of course, if something went wrong, a lot of the time, it would be, you know, day shift comes in and they get really angry because, you know, night shift broke something, but it's like we didn't have anyone to give us feedback until everyone on the East Coast woke up. There's so much that can break that isn't like hard down, right? Like you didn't get a spectrum, you know, you didn't get an alarm in your monitoring service. 
So it looked fine, right? But uh oh. Everything's broken. Whoopsie daisy. Right. Yeah. Something got black-holed, whoopsies. Something got black-holed, whoopsies. Sorry East Coast, my bad. Yeah, I have had a few issues personally. Nothing like taking the eastern seaboard of the US down. It's been, yeah, my biggest problems for some reason, my, my biggest personal failures in maintenances have, like, largely affected mostly just the. When I was at an ISP, the Golf Channel was down for like eight hours, or no, it was down for like 12 hours and we couldn't figure out why. And we were told later it was like an $8 million outage in lost, like, ad revenue because everything was IP based. Like their whole production stuff was. Wow. Good times. Never. I didn't have to pay any of it. All right, moving right along. I think we all know this guy, Ethan Banks of Packet Pushers, on Twitter. Ethan, I'm calling this one Opposites Attract. This one time I made an Arista switch and a Cisco ASA be OSPF neighbors. I'm telling you, it was scary. Like, really scary, when the adjacency would bounce randomly for fun, usually when everyone was looking. Does anybody have any of those crazy stories? I think AJ's got one where you're peering routers that nobody would have ever thought they should probably peer together. So I have a question. I have a question and yes, I do have an experience with bouncing nonsense that we couldn't figure out why... I'm looking at myself on camera. It is hard to take me seriously. Why? Is it bad to peer those two switches up with the IEEE standard routing protocol? Ah, it's a valid question. It's, thank you. It's a standard, but it's just a standard on paper. It's not like, at least it's my understanding that it's not like a technical standard. And each manufacturer's interpretation of the standard can sometimes cause problems. Yeah, that's why I'm asking on purpose. 
I was kind of prodding you with that because I've heard that with the EVPN VXLAN stuff too. It is an IEEE standard, but there's, I've, I've read a lot of stuff, I think Packet Pushers even did a piece on it. It's not really interoperable between vendors yet. It's an IEEE standard. Like, hello. Yeah. If I've learned anything from, like, studying some of the things I have the past year or so, it's that, you know, the standard can exist, but how vendors implement it is... Yeah. Bouncing neighbors. I spent a week trying to figure out why my EIGRP neighbors were bouncing in my data center. I was building new connectivity and I stand up a couple of things and, you know, I asked everybody, I'm pounding my head against the white papers. I'm opening cases and it turns out I had an MTU mismatch. You know, it's 9,100, whatever it is in the data center, right? Everything's like jumbo. And, uh, I forgot, you know, the new gear I stood up, it was the default, whatever the hell it is, 1500, and they'll bounce like every three minutes. But a senior guy was like, did you check MTU? This after a week, like, smashing my head into the wall. I think I was on Twitter, like, anybody, you know, and damn, MTU mismatch. I don't think of MTU when I think of, you know, the things that have to match up in the protocol to get the neighbor up, right? Like. Because in my mind and the way they teach it, everything's 1500. Why would you touch MTU unless you're in an environment like a data center where it's never 1500? I know in OSPF you need to have the MTU match, but I thought in EIGRP, the MTU is a tiebreaker, not a requirement. Man, I can lab it up and show you, you cannot get. Oh, I believe you. It'll come off. This is the point, there's the standard and then it's how the manufacturer implemented the standard, right? Right. According to the standard, MTU is a tiebreaker, not a requirement. 
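Editor's aside: the check the senior engineer suggested (compare the MTU on both ends of every link before chasing protocol bugs) is easy to script. What follows is a minimal sketch in Python, not anything the hosts used. The device names, interface names, and the `LinkEnd` and `find_mtu_mismatches` helpers are hypothetical, and the 9100 and 1500 values simply echo the numbers mentioned in the story. In practice the inventory would come from whatever source of truth or device polling is available.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinkEnd:
    """One side of a point-to-point link (all values here are illustrative)."""
    device: str
    interface: str
    mtu: int


def find_mtu_mismatches(links):
    """Return the (a, b) link pairs whose two ends have different MTUs."""
    return [(a, b) for a, b in links if a.mtu != b.mtu]


# Hypothetical inventory: an existing jumbo-frame link, plus new gear
# that was left at the 1500-byte default.
links = [
    (LinkEnd("dc-core-1", "Eth1/1", 9100), LinkEnd("dc-core-2", "Eth1/1", 9100)),
    (LinkEnd("dc-core-1", "Eth1/2", 9100), LinkEnd("new-leaf-1", "Eth1/49", 1500)),
]

mismatches = find_mtu_mismatches(links)
for a, b in mismatches:
    print(f"MTU mismatch: {a.device} {a.interface} ({a.mtu}) "
          f"<-> {b.device} {b.interface} ({b.mtu})")
```

Running a comparison like this as a pre-check and post-check in the maintenance window would flag the new link before anyone spends a week on it. It says nothing about how a given vendor's EIGRP or OSPF implementation reacts to the mismatch.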
When I stood it up in the maintenance window, I was happy and like, yeah, awesome, it looks good. And then I start checking some other stuff and I turn around and I'm like, wait, did it just bounce? It was like every three minutes it would bounce. Comes up, you feel nice and happy about yourself. I don't know about EIGRP, but at least with OSPF, if that happens, I don't think that's something you'll just see like in Syslog. You got to turn on a debug to see that. So it's not like necessarily apparent. You got to go digging for it. Yep. All right, next one in the list, from Jason Gintert, at bits in flight on Twitter. We shall title this one Automatically Afraid. This reminds me of like scary stories you used to read in the dark. You guys remember those books when we were kids? I don't. I know Goosebumps. They have like the, you know what? Oh, Goosebumps too, yeah. Scary stories you read in the dark is like a short form Goosebumps. Like they had a bunch of different stories. I don't know that one. Granted, I just learned to read a couple weeks ago. Oh, all right. All right, Jason, that's not funny. Literacy is not funny. I'm sorry. Sorry. I just said it. You're the one that laughed. I'm apologizing. 22 years ago, I created a Perl/Expect script that deleted the administrative account for 34 DSLAMs in three cities. It required a technician to perform a, quote, Nindy procedure on each box, which involved putting a jumper on the controller to reset the config. My first automation nightmare. I haven't had a similar experience on the automation side, but I'm sure many people can relate to that one time you click the button and then all of a sudden your cursor locked up and went away and you smash the keyboard trying to get it to come back, but it's already a lost cause. I think our show is haunted. Yep. Lexi disappeared. Lexi just got gobbled up by the internet outage goblins. So what about those DSLAMs, Tim? 
I'm sorry, it was hard to pay attention with Lexi being abducted by technology. So basically how I'm reading this is Jason was trying to make a change and did it with a Perl script or an Expect script. And basically not only cut his connectivity out, but essentially, like, bricked these devices, and somebody had to physically, like, console in or put some sort of a jumper on the controller of these devices to get them reconfigured. So that could not have been quick and easy. There's a good example of why out of band is pretty helpful. Maybe they should have had a better out of band solution. I don't know. Shout out Opengear. This is just by Opengear. It is not. It could have been. No, I'm just kidding. All right. I'm going to jump right into this next one because it's a doozy and I'm probably going to, um, mess it up here and there, but I had to read this one. This is from our buddy, uh, Pete Lumbis, at PeteCCDE on Twitter. I love Pete. Pete wrote a damn novel. I'm not even sure how Twitter allowed this to happen, but congrats. We got 45 minutes left, Tim. So I'm impressed. I'm calling this one All Hallows Groundhog Day. All right, buckle up. Every 50 minutes, the customer would see outages for some destination prefixes. Not always the same prefixes, not always the same length of outage. We had many live troubleshooting sessions. We could see high CPU on the central processor and the line card. We couldn't figure out why that would cause packet loss, particularly such non-deterministic loss. I put it in the lab and could recreate every symptom down to the line card CPU spikes, but not the packet loss. Working with a buddy, he changed the test: instead of pinging just a few endpoints in our 300,000-plus-route routing table, let's ping every endpoint. Bam! We achieved packet loss. Go to development feeling proud AF. Show them our data. Devs tell us nothing is wrong. The high CPU is just reprogramming routes on a path change. 
That would be fine, but this is MPLS traffic engineering. Make before break is the motto. The new routes should be up and running before we cut over. The hardware platform devs were right. The high CPU was the routing updates, but that doesn't explain the loss. After talking to the MPLS TE devs, we learned there is no way for the hardware to tell the software, all right, I'm done, go ahead and change shit now. Instead, what software does is push a hardware update and set a clock for five minutes. At that point, it just assumes everything is okay. The software changes the MPLS TE tunnel, hardware status be damned. What we realized is the TE tunnel was flipping over, but the hardware hadn't finished programming. In my lab tests, I was pinging prefixes that were programmed early. My buddy, by pinging all prefixes, exposed that something wasn't ready yet. Imagine 1 million route updates that happen in order. Pings to route one work immediately. When the MPLS TE tunnel updates, this is the first route in hardware, and it's ready when software updates. Pings to route 999,999 fail because software gives up and flips over, but hardware didn't program the route yet. How long is your outage? Quote, it depends. It's a factor of the speed of hardware, the prefix that's being installed, and where it falls in the order of prefixes installed in hardware after the software timer pops. If you're lucky, it's sub-second. If you're not, it's five plus seconds. In classic vendor fashion, what's the conclusion? Quote, expected behavior. Everyone loses: customer, TAC engineer, humanity. But there was nothing that could be done. The hardware lacked the ability to send a notification to software that it was fully programmed. And this large ISP had pushed the hardware past its expected limit. To tie up loose ends, why was it 50 minutes? They had set a custom MPLS TE timer that recalculated paths every 50 minutes, triggering the change to cause the outage. Why not the same destination? 
Depends on the prefix programming order. Maybe I should have just started with this. The TLDR of this, the packet loss is coming from inside the router. New hardware was the only answer. It was a function of the line card ASIC. Only new silicon could have a callback method. Wow. Man. That's crazy. I have attention issues, man. I was lost. Packet loss is the last thing I heard you say. Well, you can bring that up with Pete. Pete in the chat wrote, Pete, why did you go bald so young? I don't know why he's yelling at himself. That's funny. All right. This one is one that I think is definitely easy to happen to anybody. and where it shows that you really got to pay attention. So I'm, I am not pointing fault at this person because this could happen to anybody. At Stefan KWL on Twitter writes, and I'm calling this the access point to hell. I'm making the those part. I'm making that part up. So hopefully people don't get mad anyway Stefan writes I once took my wireless with hundreds of APs and thousands of connected clients down midday The plan was to do a software upgrade I did that many times already load new software on the controllers preload the APs and then set a timer for reboot loaded the software Hold on, time out, time out, time out, hold on. You said this could have been any of us and that Stefan's a wonderful person. Did you just say he did a software upgrade in the middle of the day? Well, hold on. With a wireless controller, he's prepping it. So you can technically have the code on the controller, push the code to the APs as a secondary image. I'm paraphrasing. That's fine, okay, fair enough. All right, I'll stand down. I'll stand down. I get what you're saying. It reads like he just went and hit reboot. I did a reload in 10 at 1 in the afternoon, and something bad happened. Uh-huh. So there is the possibility to set, for example, quote, reboot at 4 AM. So like you said, reload in 10, reload at 4 AM. 
So I clicked on reboot, chose all the controllers, and thought, huh, the time setting must be on the next page. But it wasn't. I was on the wrong page, so I selected all the controllers and clicked reboot and thought that there has to come something now, but it didn't. Instead, I saw the wireless connection for my laptop drop and thought, oh shit, took around 20 minutes for all the controllers to reboot and the APs to rejoin. Of course, many users felt that. I do it at 4 a.m. normally because no one is on and when I go to the office in the morning, I can check on things. But this way, my phone did not stop ringing. Lesson is don't do code upgrades during the day. That's, that's what I got out of that. It's you say it with so much authority with that wand, you know, is there any, you know, Einstein said time is relative, right? Is there anything longer or more painful than waiting for that continuous ping to, to come back alive when you're rebooting something? You brought up out of band earlier. Oh. You have out of band you can watch that shit in the console Yeah, you're right Do you guys have any of your own stories? I mean I love hearing Tim tell the listener stories but let's break it up a little bit and One I've got a couple stories one word mannequins Mannequins mannequins that was not the word. I was expecting you to use right now Alright Tim, you got me hook line and sinker. Tell me about your mannequins. Would you like me to tell more? You know, I think this is timely that you say mannequins. I'm watching Jeffrey Dahmer on Netflix and I don't know if he's gonna die yet. I got it. Yeah. Didn't he date neat people? What's the mannequin tie in there? Oh, well you'd have to watch this. I'd have to watch, okay. I can tell you later though. I don't wanna give it away on the show. I don't want nightmares. Anyway Tim, what's your mannequin story? So there I was. 
in the basement of an education facility of an educational facility and is training for these classes you work on mannequin so you're not you know testing on real people And I had to do some work down there and it's, it's almost like, like an interrogation room where you've got the room where the mannequin is in the bed. And then, you know, you got the painted glass and like a little control room back there so you can watch. Are these mannequins or cadavers? What's that? Is this a hospital? College training, training to become healthcare workers. It wasn't real bodies. Thank you. So I'm sitting there in that little control room and I'm just, I can't really focus on what I'm doing because you can see them through the glass. So I'm like every two seconds, I'm just looking up cause I'm waiting for one of these damn things to just pop up out of bed. And I'm already, there was nobody else really there. It was probably summer when school's out of session and, and yeah, I was like down there by myself and yeah, not cool. But yeah, that's the mannequin story. What's up with you, AJ? So we did a switch upgrade at a company that I worked at. And completely unrelated to the switch upgrade, so with the switch upgrade, we got more ports in the data center, which is part of the reason why we're doing the upgrade. And so as a result, we wanted to add some network devices or. additional devices to the network so we could monitor them and use them to get remote access. One of them was our UPS. We wanted to be able to monitor it, see its health, whatever. So we go to put it on the network. At the time I was the IT manager, my sysadmin heads upstairs. He announces he's going to do this thing, and then he heads upstairs to go do it. And about five, 10 minutes later, the network goes hard down. And it's just like, OK, wasn't expecting that. So I head on upstairs to the server room, open the server room door. And about that time, things started to come back up in some form or fashion. 
But I walk into the server room, and my poor sysadmin is literally on the floor in front of the server rack, clutching his chest, like, recovering. And it's like, dude, are you OK? What happened? What the shit? So he tried to go install a console cable into this particular brand of UPS. And apparently, they have two different styles of console cable. And if you use the wrong one, it causes the whole thing to shut down, like shut down hard. Like even though there's a fully charged good battery in there, it completely shuts down the UPS. So it shut down our entire server rack, including all of our switches. Ouch. The worst part was that when everything came back up, there was this weird problem that our virtual servers had talking, establishing network connectivity with our new switches. It wasn't a problem with the new switches. The new switches were working just fine. We troubleshot that issue for close to a day, and it ended up being a firmware problem on the server hardware and the way that it dealt with virtual MAC addresses, because the virtual servers would start to come back on the network, but then they would drop off. And then they'd start to come back on the network, and then they would drop off. And we just could not figure out what was going on. We troubleshot the issue with TAC, with the vendor that we were working with, and they said, you're learning MAC addresses, you've got connectivity, layers one through three are working as designed. This is another vendor's issue. And finally, in the end, it was a firmware update that we had to do on the server, and it corrected the issue. So a simple "I'm going to go put this device on the network" turns into a network hard down for almost a day. I hope that UPS vendor at least, I hope you at least got a free console cable out of it, correct? I don't think we ever did. But it was crazy. If you had both of those console cables side by side, there is no obvious way to tell which one is which. 
It's like magnifying glass scrutinizing. Oh, this is version two. Don't plug that into a version one device so you don't be screwed. Hey AONE fans, still getting calls at 3 a.m.? Rest easy knowing that you've deployed Opengear, the market leader, the gold standard. Their award-winning network resilience platform gives you always-on access on day one, every day, and during an outage. Sure, there's alternatives, but when it comes to your environment, wouldn't you prefer the real thing, you know, the OG? 18 years in business, seven best in class products, one comprehensive platform. What more could you need? There's plenty of reasons to choose Opengear as your smart out of band vendor. What reason will you choose? Schedule a demo today to see how they stack up. Accept no limitations, trust your network to the market leader. I got a couple of stories. Let's hear it. I'm listening. Let's hear it for you, Jelendy. I... I worked at a NOC for two years and I think everyone should work in a NOC for at least six months because it's some of the scariest shit in networking that exists, I think. Your whole job is getting broken stuff fixed. And if you work in a big place like I did, it's big and scary and complicated. And I didn't know what I was doing. So yes, NOCs are terrifying. My story though isn't really a NOC one, but every day at the NOC is scary. So if anybody out there has worked at a NOC, you know where I'm coming from. I know Lex did. My first outage, and these will be quick, to give a little context, it was my first quote unquote real network engineering job. I worked at the NOC for two years. I got this job in fintech and I was going to build stuff, not just watch spectrum all day and dispatch people out to fix fiber cuts. It was a little more than that, but you get it. Um, so I had no idea what I was doing and change windows. Everything they gave me was scary. 
Um, you know, the, the training that I did up until that point and the break fix, I learned a lot, but when you're, uh, deploying new services line by line in the CLI, that was just way scary for me. So my first, uh, big outage there. Um, I left the super secure NOC job. I took this really risky contracting job. I had a six month old at home. We just got married, bought a house. And I took down one of the biggest New York City stock exchange trading houses, uh, one night because I didn't understand tunnel timers and overriding policies and that it would take 20 something hours for that 24 hour default tunnel timer policy to expire to then realize I overwrote a policy that was needed. But what was so scary about it was the next morning I'm in there and I sat across from the NOC and they're all, you know, what's going on? And then they're calling me, or, hey, man, you were the last one in there. I'm like, yo, that was like a day ago. How did it break today? Like, what do you, you know what I mean? Wasn't me, bro, right? Long and short of it, they realize it was me and they pull me in. I'm like, oh bloody hell. And you know, my boss at the time, he's a super laid back guy. Mike, he's like, hey man, don't worry about it. It's fine. Go home, have dinner. We got your back, you know? And then at five thirty, I'm sitting down with my wife and that same laid back guy calls back and he's like, hey man, uh, just to let you know. Turns out their CEO got with our CEO and he called him directly while the guy was sitting down for dinner with his family and yeah, they're pretty pissed and upset and they might be coming to get you, but I'm off tomorrow. You'll be fine. Just don't worry about it. Tell the truth. And so, this is all hand-to-God truth. So I go in the next morning and I'm like, this is it, man. Like I'm gonna have to go home, look at my baby, tell my wife like, you know, I tried, you know, I went for it guys. I went for it. I'm gonna be a network engineer. 
And I broke the stock exchange. You know, fortunately the leadership there had my back and they protected me and thank God, because they didn't have to and I was a contractor. And you know, really good. Oh wow, yeah, I didn't realize you were a contractor. Oh, I was a contractor there for like a couple of months with no clue what I was doing. And they knew it and people there, like the culture there, they had all, they were tenured. People were around a long time. And a lot of the people are like, why is this dude here? He doesn't know anything. But you know, they were giving me a shot, right? Like he needs somebody to give you a shot. And the people who knew. They're like, oh, you know, he's fine. He's learning. He's great. But some of the other people are like, what, dude, what is this dude's problem? The other one was what turned out to be a 28 hour maintenance window that I work nonstop, um, the long and short of it, we had quarterly maintenance windows. So four times a year we could break really big stuff. And this was at FinTech again. Um, and you know, this isn't like an office or a couple of wifi things. You're like, you know, we're, we're talking like the main, uh, data center that serves like 85%. of the company's customers and applications, right? So like, you know, you're impacting- Which is like 85% of the country. Well, yeah, exactly. Like everybody's using this stuff, right? So, you know, the federal government's moving money around with the services, right? So anyway, it was a quarterly maintenance window. You're supposed to break stuff. And my job was to migrate from the old 6509 distribution. You know, it was layer three core distribution layer and then access in the WAN, pretty big WAN. So we were going from 6509s to 9504s and that went pretty good, I thought. Simultaneously, our firewall folks were upgrading from checkpoints to Palos. 
So probably a little too much happening in the same window because invariably, you know, I flew to Georgia for this because you had to be on site because it was a big scary thing and I do my maintenance window and, you know, I go back to the hotel at like, I don't know, four or five in the morning. It was like an all-night thing. All hands on deck, all of us around a table, right? And the phone just starts ringing at like 5:30 in the morning. And it just never stopped ringing until, whatever the hell it was, I remember 28 hours sticking out. And it was just call after call after call of people saying connectivity wasn't working. And then we'd have to parse, like, well, was it the firewall change? And each and every one of them wound up being problems with the firewall and the rules, blah, blah, zones, right? We had 28 hours nonstop. You know, not as much scary as just fricking exhausting and a marathon of mental torture, right? But they're the two big ones that stand out when I think of like, oh, just network pain, spooky Halloween, you know, theme nonsense, man. No mannequins that I was afraid were going to hop up and get me, but endless hours. Yeah. Endless hours of pain. What's going on in this chat over here? I do have one more that I want to read from Twitter 'cause this one is, I think it's bonkers. That's right. This is from Marek Esalski or at Masnu on Twitter. I'm calling this sabotage. While on a project abroad in a foreign country, a quote "sudden emergency" in the client's other data center needed a rush trip across town. Quote, "Oh, you won't need your laptop with you. It'll be safe here in the office. Our guys will drive you over there now." Nothing of consequence at the other data center, a knocked cable in a comms room in a very secure facility. I was driven back. Yes, dear reader, someone had tampered with my laptop, but that's why I'd brought a burner laptop. Still got the drive for the memories. 
And the jump host the burner had access to, just in case there had been an emergency needing connectivity back to prod? Long gone, drive shredded and IP address space returned to RIPE NCC. That's crazy. You're on a job for a customer and they lure you away to screw with your laptop. That's crazy. That is crazy. I have not heard of that. I don't think you've worked overseas. You went to Montana once. It was Wyoming. God damn it. And I have worked overseas, but it was London. OK. And it was at a time during this country's history where, I don't know how to say it politely. Oh boy, you better. I can't imagine we were targets at that time. But anyway, no, I've never felt like I've been a target when I was traveling. I couldn't imagine being targeted for some reason. So I wonder, did they have access to some super secrets? I don't even know. I have so many questions. Yeah, so many follow-up questions that they probably can't tell us. Yeah. You know what's spooky? Your wig. How dare you? Got a point. And there's a reminder that we have a YouTube channel. If you want to see Andy's wig, make sure you go on to our YouTube channel, subscribe, hit that bell button to get notified of all of Andy's future wig usage. Bring your tickets to the gun show. Oh boy, you're going to want your money back. You know what's spooky is when I'm shopping, I'm on Amazon shopping for something. And then I go other places and there's ads for that all over the place. That's spooky. This isn't a privacy show, but there's some spooky, weird tracking going on all over the place. I have one of those stories and there's a reason I think it's really spooky. So it's actually a Halloween story. So a couple of years ago, we were taking the kids trick-or-treating and we were walking and there were some people that had, out in front of their house, a really nice, cool, modern-looking stainless steel fire pit thing. So I walk up, my phone's in my pocket. 
I walk up to these people and I never say the word fire pit. Never, none of that. Just walk up and say, hey, that's really cool. That's all I really said. That's really cool. I'm almost positive I never even said the word fire pit. We get home from trick-or-treating. Pick up my phone. I'm just scrolling through Facebook. And I had never seen one of these things before I saw it at these folks' house. Scrolling through Facebook and a sponsored ad is one of these fucking things. And I'm like, how did that even happen? Uh, creepy. I used to work at a Verizon frame. So that's where all of the copper lines come in from all the houses and then come to a switch. A Clos fabric, if you will. Crossbars, yes, Clos. I didn't know it at the time. That's what I was working on. And it was against federal law to listen to any of the calls. So I had a butt set and each pair on there, you could, you know, boop and listen to whatever calls were happening. So that was against federal law and that was a no-no and you could get in a lot of trouble for that. All right, right now it's like old man on a soapbox. I didn't mean to get here, but I'm amazed that we live in the kind of world technologically where all of our devices are watching, listening, recording all the time and then being leveraged by, you know, databases and AI so that they can sell it to like advertisers. I mean, Tim, the fact that you didn't say fire pit, you know, like was it, was the microphone, you know, was an app on your phone listening, heard crackling? You said you liked it. Now you got me real freaked out. I don't know. Right? I don't know. I used to give, so more spookiness, right? It seems like we need some content. So go down this road for a second. I used to give everybody a hard time about the digital assistants. 
You know, my buddy had a, I can't say it 'cause there's one sitting here, but you know, one of the vendors' smart assistants that you say their name and it lights up and wants to do things for you. I'm like, dude, do you have any idea? Like, they are listening to every single word you say. They are recording it all forever. He's like, oh no, man, only the key word, blah, blah. And I remember looking at him and I go, dude, how do you think it knows you said its name? It has to listen to everything you're saying. And he kind of looked at me like, ah, well, I like to play music on it. You know what I mean? That was it. But we've accepted, right? We're giving up. What am I saying? Like, I don't know if it's free. We're giving up privacy for convenience, right? And, you know, I'm a hypocrite because they gave me a free one when I bought something on Amazon and it sits here and I installed it because it has a clock on it, which I find helpful at work tracking my time. But it's ridiculous that, you know, this thing is listening to everything I say, recording it. And the way I know that is there's been a lot of cases where the police subpoena this company for the audio records of what happened so that they could use it as evidence in a crime, right? So like they're listening to everything and recording it. It wasn't like, hey, personal assistant, something bad's happening. It was recorded. They came in and they're like, were there any listening devices in there that we can subpoena, you know? So it's, and I even have a security device and you know, like security system in my house, which is cloud-based. And there's plenty of news out there. My buddy's even a cop. Like they don't have to ask me for it. They tell you they do when you sign up for it. Oh, you know, nobody can take your footage without your consent. The police would reach out if they needed some doorbell footage or something. 
And my buddy who's a cop's like, nah man, we just, you know, if we get a judge to sign off, we get whatever we want. So anyway, you know, creepy factor, right? Like every device that we pay a lot of money for and put in our homes, in our cars, in our kids' bedrooms, are listening to everything all the time. And we're cool with it, 'cause you know, I can play SpongeBob without touching a remote. I guess that's okay. I don't know, right? Yeah, and I have them. I'm not trying to sound like a hypocrite, but you know, so creepy. Your fire pit story just got me on a tangent. Sorry. Yeah, that's a new one. Any more, Tim? I've got an email. Yeah, read the email. I was just opening up email. Go for it. All right. So this one was sent in by Charlie. He's got an outage horror story. I don't have a fancy name for it like Tim did, I'm sorry. But Charlie writes that he had just gotten home from work around 4 p.m. on Friday of Memorial Day weekend. And he's in a very popular New England vacation spot. He was a network engineer for a local hospital when he got a call from his manager informing him of a possible outage slash emergency. So after fighting his way through touristy traffic back to the hospital and troubleshooting on the phone the entire way over, they find out that the sup modules on a pair of VSS-enabled 6509s were experiencing some sort of memory leak. The chassis acted as a collapsed core for the entire hospital and many of the outpatient buildings. Traffic was effectively unable to forward through the core. The impact: the staff in the emergency center weren't capable of switching over to a downtime procedure, which is just writing everything down on pen and paper during any sort of electrical or network outage. So they called code black, which, and this is like Tim's wheelhouse, 'cause I'm sure he knows exactly what that means, but for those of you that don't know, code black is basically, you know, the hospital's shut down, the emergency room is not accepting patients. 
So this is like one of the busiest emergency centers in New England. And they outright stopped accepting new emergency patients that were on the way, like already on the way to the hospital. They had to turn around and go to another hospital 90 minutes further away. So if you called 911, asked for emergency services, they were experiencing up to three hours wait time. Thankfully, they were able to get the network back online and everything fixed. And during that time, there was no reported injury, no loss of life, nothing like that. So nothing severe happened during this outage. But wow. What a story. In the end, they found a temporary workaround by enabling distributed CEF on the chassis to offload the FIB and the adjacency tables from the route processor to the line cards. And that saved them until they were able to do a code upgrade later on. Crazy story, thanks for sharing. Was this at a hospital, you said? Yep, at a hospital. Wow, man. You know, I would love some Tim stories. We'll never get them from him, but, you know, working in a healthcare environment, I would imagine that, you know, when that stuff goes down, right, like, I guess there's lives on the line. Not a good day. Yeah, yeah, you know, like, there's literally lives on the line. I've had this conversation with you, Tim. I've worked at a lot of places and a lot of people get upset when stuff goes down and my barometer for success is like, well, nobody died, right? Like, but you can't say that. I don't envy you. I will say it's tough in healthcare because a lot of industries are adopting, you know, that mentality. I don't want to call it DevOps-y, but that DevOps-y mentality of, you know, move fast and break stuff, just small intermittent changes all the time, just constantly moving forward. And no matter how appealing that is, healthcare isn't always the right environment for that. Right. For obvious reasons. So yeah, it's difficult. You know what's funny? 
I think I have blocked out a lot of my, you know, PTSD, trauma, scary network stuff. Like honestly, you know, like I spent a decade doing it and so many things went wrong. And most of the time I felt like I didn't know what I was doing. And that any day they were going to walk me out the door and it was just so traumatic and scary all the time. Yeah. You know, I think my mind has just blocked a lot of it out. Like when we talked about this episode the other day, I tried to think of stuff and I could only pull up those two. And I know that there's dozens, if not hundreds of either things I've broken or things that I've been involved with, you know, that broke. I mean, like going back to the NOC, you know, somebody hits a big bundle of fiber and, you know, like, I don't know, a gigantic region goes down. And having your director, who's this big, scary person in charge, just up your gig constantly. What's going on? And when are they there? And why aren't they there? And why aren't the splicing crew? And I'm like, yo, man, like I'm on the bridge. There's 130 people on. We're waiting for the splicing crew. I don't know what you want me to do. But like, her job was, "faster, Andy," right? But like a lot of that happened at the NOC. Like, her job was to tighten the screws on us. I think she needed to motivate us. But like, yeah, we have a ticket. Yeah, there's 800 cell towers down. I get it. Like, it's bad. I don't know what to tell you. I can't go, and I think even once, I don't think she liked me, because I think once I had it, and I'm like, do you want me to drive over there and learn how to splice? What do you want me to do? You know what I mean? Yeah, I mean, I understand why she might think you're being mean about it, but if she's putting that much pressure on you and you're literally in a position where you can't do anything about it, that's a fair response, in my opinion. 
What are you gonna do with a fiber cut that's 3,000 miles away from you? Yeah. Sitting in a NOC, right? Like I dispatched fiber. I don't know what to tell you. Yep. What's, you know, what's the mean time to repair? I don't know. They're not there yet. They haven't even surveyed the damage. I mean, it was just a whole job, right? You go home. How was your day? Oh, great. I got yelled at all day for stuff I didn't do. And it was awesome. Yeah. That was one thing we didn't really touch on in this, um, that is probably some of the scariest parts of the scary stories: getting told, hey, you need to get on this conference bridge. And those are so hard to manage because you, just by nature, you want to put your head down and just figure out what's going on. And it's really difficult to do when 17 people are asking you what the status is. Can you guys do that? Because I've never been, I was trained, the guy who, you know, was senior to me was like, dude, I just put my headset down. I shut them out. I said, I'll be right back. You know, I'll be back in a couple of minutes. I need some... I can't have people yelling in my ear and me cognitively work through a problem. I just can't. I started that way. Um, I feel like our process from our leadership has gotten a lot better. So they've taken the approach of, um, which I had my doubts about at first. They took the approach of when something like this happens, we want all technical people on the call so they can collaborate, including vendors. So if we have TAC support, something, we're sending them a link to join our local, um, call and, you know, I was a little apprehensive to it 'cause I'm like, I'm going to have a hard time troubleshooting and making progress if I've got a bunch of people asking me questions that may or may not be relevant. I mean, it could be somebody who is, you know, a manager of another department asking this off-the-wall question that derails you. 
And then you got to try to get back on track, but really, I mean, we had done it a few times and it actually worked out pretty well. Um, the one thing that I will say though, is if you don't want people to derail you with questions, don't be silent. So like, I don't know about you guys, but there are times where, especially if I'm sitting by myself and I'm troubleshooting a problem, what am I doing? I'm probably talking to myself, just talking it out, you know? Oh, it's just me? Okay, nevermind. But if you're on those calls, just be vocal about what it is you're doing. 'Cause if you're constantly going through what you're doing and just kind of talking it out, people are gonna be less likely to interrupt you with a question and just throw you completely off. So yeah, again, I was apprehensive to that kind of style, but it's worked well. I don't think there's anything scarier than a large outage and a large bridge that you have to get on. And like people higher up the food chain than you ever wanted to have know your name are on there. And they're all, you know, they're looking to you. And I mean, it takes a while, right? We were very manual, we weren't automated. And I just got to go through and try to figure it out. Like I have a process and it's in the middle of the night, right? You get called in the middle of the night, it's 1:45 in the morning. You know, you're exhausted. You don't even know your name. Like, wait, what? Like, you know, I can't even get into VPN. I'm so tired. And they want to know like why the building's burning down and why, you know, all the customers hate us. Like, man. I tell you the one thing that I think is actually scarier than having to get on that bridge is, and this is back in the, you know, when-you're-in-the-office days in the cubes, when all of a sudden there's a network issue and people just start standing up, looking around and they know who to look for. 
And you have 17 people looking at you like, what's the matter? Like I am as clueless as you right now. Give me two minutes. Those are, oh man, just chills thinking about some of those. Yeah, yeah, you know, I'm not jealous of you guys. I don't think I've ever had to sit on a major bridge. I will say that, you know, when I was troubleshooting issues, even if it was on a call, I would establish: I will give you updates as and when they are available. Like, don't sit here and keep asking me if I have information to give you. I will give it to you. If it looks like it's going to be like an extended duration thing, then I'll establish further boundaries. I will provide an update to you every 15 minutes and I'll set a timer on my phone. 15 minutes. I will provide you updates at the top of the hour, you know, whatever the case might be. But I always try to establish like, I need time. I can't do this with you staring at me standing in the doorway. Know that I'm working on it. Know that my team is working on it. And we will get you updates as soon as information is available. And that's where... Eddie in the chat has a good point. He says maybe we should delve into a little bit more what we're talking about, for people that may not know. So what we're really discussing right here is when there's a major issue, there's a large downtime, a large outage, and a lot of organizations will request a conference bridge. Now, hopefully it's not a large enough outage to where you can't even get on the damn phone. But they'll ask for this conference bridge, which is really just a bunch of people on a phone call, on a meeting, teleconference, whatever. And you're giving status updates, you're doing collaboration and troubleshooting, all that kind of stuff. If it's not managed well, it can be some of the most stressful situations of your life. 
And even when it is managed well, if it's a large enough issue, it's still pretty rough, but yeah, that's what we're talking about. These big calls, getting all the right people on the right call to try to fix a major outage. And for me, there's two levels of outages. So there's one that just something broke and like, oh, that sucks. Let's all band together and figure it out. And then there's the one where I'm in a maintenance window and I'm pushing buttons and I'm making changes and all hell breaks loose. 'Cause that's a totally different level of passive-aggressive anger at the engineer when, what did you do? It's one thing to get called into a problem. It's another thing to cause the problem. You are the problem. You know, to which I just go back and say, hey, my change request is in. It said I may cause an outage. Yeah. Well, right, Tim. And so you reminded me, I mean, the line I used to love telling, you know, executives who would try to come down and it's like, listen, man, we're not plugging in toasters here. You know what I mean? This stuff is complicated. It's complex. There's a ton of technical debt. Like it's a miracle any of this works. And we step on landmines when we touch stuff and you know, we're doing the best we can and we had a process, plenty of change management and reviews and lab, you know, blah, blah. Right. But you can go through all that and step on a bug. Like there's just whatever, right? There's some things you can't, you know, I love that analogy. You've used that before. Hey, hey, we're not plugging in toasters here. And people that don't know, it's just, hey, if somebody's in- People talk to you like it's simple, right? Like, well, what happened? Like, you know, how did this happen? Huh? I think it's easy- Well, let me explain BGP to you. You know what I mean? 
I think it's easy for people outside of IT to see, hey, you're in a technical role. You have engineer in your title or senior engineer in your title. Nothing should be new to you. You should know all of this. There should never be any problems. And I think that's a pretty big misconception. Yeah, that 6509 to 9504 distribution layer that I was telling you about. We did it in our three major data centers. And along with that, we were implementing a lot of traffic control that was never there. You never knew where your traffic was going to be. They just threw a bunch of redundancy and bandwidth at it and figured, oh, I guess it'll work. But my second year there, we hired this brilliant guy, Carl, our architect. And he's like, guys, we need to, you know, if somebody calls you at 1:30 in the morning, you need to know that all your AT&T traffic should be on the A side of this router. Because if it's not, now you know where to look, right? And we never knew any of that. So anyway, as we were implementing all these traffic controls so that the traffic would do what you wanted it to do, not just, you just pray and hope that things miraculously work out on their own. Um, it took us probably eight weeks. So we would have these big maintenance windows on Sunday night, like five-hour maintenance windows. And every Sunday for two months, we brought the company's entire WAN down for like an hour, an hour and a half because something weird was happening. I forget what it was. It was one of the weird, there were two things coming together at once that took us a really long time to find out. It was some technical debt. 
And then it was something weird in the new 9504 IOS stuff that like just, you know, a couple of things lined up, but every time we would try to do this change to steer traffic, the entire company WAN would go down. Now, three o'clock in the morning on a Sunday shouldn't be a big deal, but man, we caught a lot of hell on Monday and it's two weeks of calls and an RCA that you have to write up and like, well, why did this happen? Like guys, you've been buying other companies and absorbing their data centers for 20 years and just leaving everything alone and just stitching it all together with duct tape. And now we're trying to clean it up. And every other time we touch it, it explodes, right? But it was just scary. Every night was scary, right? We finally got it cleaned up and it was great, right? And then our stuff stopped breaking as much and we knew where traffic should be, but you go into a place with a lot of problems and you're there to help and clean it up, man. Every time you touch the network, you have no idea what's gonna happen. Yeah, that's a scary position to be in. Well, this has been a fun episode. It was really fun in part that we got to share stories of all of our listeners. We do appreciate everybody following along, sharing your scary stories with us and letting us share those scary stories. We didn't have to sign any NDAs, which I was really happy about. I thought we weren't gonna be able to share any of these stories at all, but we made it happen. And we made it happen through the support of listeners like you. If you wanna continue to support what we do here at the Art of Network Engineering podcast, the best way you can do that is to go to iTunes or Spotify or wherever you listen to our show and give us a rating. Leave a comment if you feel so inclined. All of that stuff really helps the show. 
You can also go to our YouTube channel, leave some comments, subscribe, smash that bell button to get notified of all of our future episodes because all of those metrics really help this little show. Thank you so much for your support and everything we do here. We'll see you next week on another episode of the Art of Network Engineering podcast. Boo. Hey y'all, this is Lexi. If you vibe with what you heard us talking about today, we'd love for you to subscribe to our podcast in your favorite podcatcher. Also, go ahead and hit that bell icon to make sure you're notified of all our future episodes right when they come out. If you wanna hear what we're talking about when we're not on the podcast, you can totally follow us on Twitter and Instagram at Art of NetEng. That's Art of N-E-T-E-N-G. You can also find a bunch more info about us and the podcast at artofnetworkengineering.com. Thanks for listening.
