
The Art of Network Engineering
The Art of Network Engineering blends technical insight with real-world stories from engineers, innovators, and IT pros. From data centers on cruise ships to rockets in space, we explore the people, tools, and trends shaping the future of networking, while keeping it authentic, practical, and human.
We tell the human stories behind network engineering so every engineer feels seen, supported, and inspired to grow in a rapidly changing industry.
For more information, check out https://linktr.ee/artofneteng
Ep 107 – Augtera Networks
In this episode, we interview the Founder and CEO of Augtera Networks, Rahul Aggarwal. We learn about Augtera and how they can help network engineers!
More from Augtera and Rahul:
https://www.linkedin.com/in/raggarwa/
https://twitter.com/raggarwa
https://twitter.com/Augtera
https://www.linkedin.com/company/augtera/
https://augtera.com/
Subscribe to our YouTube channel and click the notification icon to get notified of our free Python Party Livestreams, where we learn Python fundamentals together: https://www.youtube.com/@artofneteng
This episode has been sponsored by Meter.
Go to meter.com/aone to book a demo now!
You can support the show at https://www.buzzsprout.com/2127872/support or from the "Support The Show" link at https://linktr.ee/artofneteng.
Thanks for listening and for your continued support :)
Find everything AONE right here: https://linktr.ee/artofneteng
This is the Art of Network Engineering podcast. In this podcast, we explore tools, technologies, and talented people. We aim to bring you information that will expand your skill sets and toolbox, and share the stories of fellow network engineers.

Welcome to the Art of Network Engineering. I am Tim Bertino, @timbertino on Twitter, and buckle up, because from the AONE side I will be driving the ship solo for this episode. You poor, poor souls. That being said, we don't do too many episodes without a co-host round table, so what I think I'll do quickly is predict how it might have gone down this week. I'll say, "AJ, start with me." I'd try my very best to say as little about myself as possible. Then Dan might send a zinger or two toward Andy, Andy would probably give a most elegant retort, and finally Lexi would probably tell Andy to watch his mouth, and AJ would just try to keep everything together. There, I think that went quite well.

Never fear, I will not be talking to myself the entire time. This week we have a sponsored episode brought to you by Augtera Networks, the first Network AI for the modern enterprise, and joining me from Augtera is founder and CEO Rahul Aggarwal. Thank you so much for joining me, Rahul, and for your support of the Art of Network Engineering. How are you?

I'm doing well. Thank you for having me on the show.

So Rahul, we are the Art of Network Engineering. We are very passionate about the technology, everything in play across network infrastructure, but we are also very passionate about the humans behind the technology, the people behind everything that makes networks go. So if you wouldn't mind, take a few minutes and give us a background on you: what got you originally interested in technology, what have been some of the different roles you've held, and ultimately what brought you to create Augtera?

Yeah. In a nutshell, I've been a technologist most of my life. I got my bachelor's from the Indian Institute of Technology, then did my grad school at the University of Minnesota in computer science and got into networking. I spent my early years at FORE Systems and then at Siara, went on to Redback after the acquisition, and then was fortunate enough to be at Juniper for a long time, where I co-created several industry-changing technologies in both the MPLS and the VPN areas: multicast VPNs, MPLS multicast, and more recently EVPN, which is now very widely deployed in the data center. After that I left Juniper and started my first startup, which was in the content analytics space. It was acquired by a company called Jan Reign, which was then itself acquired. After that, I really felt that my heart was in networking, that that's where my passion is. I saw an opportunity and a big gap in the landscape, and that's how Augtera started.

That's exciting. You have been incredibly busy in your career, and that's really interesting. One of our co-hosts, Andy, whom I mentioned in the intro, has now been employed by Juniper for, I think, just under a year, so that's quite the story. So let's jump right into Augtera. What is the base Augtera platform, and what does it do for its customers?

The gap that we saw in the market, or in the landscape, is that there has been a whole movement of big data and machine learning that has transformed multiple industries in the last 15 years or so, but networking has really been left behind. There are a lot of tools that exist for network operations, so many different companies, so many tools, but they follow an old paradigm. There's a lot of data; they collect the data, often with poor fidelity. Whatever data they collect, sometimes high fidelity, often poor, ends up in a massive data haystack that humans manually mine. So when anything goes wrong, it's often the applications and the customers who complain, and the network operations
teams then scramble to figure out what went wrong, where, and what the root causes are. What we set out to solve was to bring predictive and early-detection capabilities to that environment, where we apply machine learning to see what's actually going wrong, well in advance, before it does. So instead of getting a call at 2 a.m., the operations teams can get ahead of the problem, or well ahead of the problem, and fix it in advance. And we do this across very large-scale environments: data center, hybrid cloud, public cloud, SD-WAN. We do it in a way where the problems you find and solve in the network are really application-impacting things. You can connect the dots from the application layer to the networking layer and make sure that what you find and solve has a very direct tie to the business, to business continuity, to revenue. That's really the problem we solve, and we have a platform which solves it, which I can talk more about.

Yeah, that would be excellent. I will say, from my own experience, I've been in network operations and network engineering for most of my career, in the enterprise space, and I can definitely relate to that being an issue. Even today, and especially up until recently, network operations and troubleshooting have been, and in some cases continue to be, a very manual process. It's very difficult, and it can be, like you said, very reactive. Sometimes, and maybe more often than not, we don't know there's a problem until somebody picks up the phone and reports it. So I definitely agree that that's been a gap, at least in enterprise networking: we always seem to be in a reactive space. So what is Augtera doing to help us become more proactive in network operations?

At a high level, what Augtera does is take data from literally any data source you can imagine, and I'll talk more about that in a minute, and from any part of the network you can imagine. We take this data and normalize it, and it goes through a very sophisticated software stack, which can be deployed on a set of VMs either on a customer's premises or consumed as our SaaS service. The data goes through our machine learning pipeline, and, very importantly, the realization we came to was that we needed to build purpose-built algorithms for the network: using off-the-shelf commodity algorithms was creating a lot of false positives. You really need a very domain-specific approach. So we've built a whole set of different algorithms for different metrics: different algorithms for optics, for temperature, for traffic, for congestion, for TCP, for latency, for application loss. For logs, we have completely different natural-language-processing algorithms. What we do is figure out changes in patterns without humans telling us what's normal.

For example, at one of our customers, the temperature started to creep up on one of their units out in the field, and they automatically got a notification, through our anomaly detection, that the temperature was misbehaving. That was a trigger for them to start looking at the graph and the data, and they realized, yes, that unit is actually misbehaving; it had been stable for months, and now it's not. So they dispatched a field unit to take a look, and the air conditioning was broken. They were able to put in a temporary patch and avoid an outage.

Another example: suddenly there are ASIC parity errors showing up on a switch. Our log-based AI, which uses natural language processing, can look at billions of logs and surface, in real time, what's anomalous and unique, what you haven't seen before. So all of a sudden you get this automatic notification which
telemetry? How do I get it aggregated in the first place?

Yeah, that's a really good question, and that part of our product has had significant engineering go into it. I wouldn't call that part of the product the secret sauce, but it is by no means to be underestimated, because it's a very complex engineering problem. Let me give you the big picture first. The entire stack is end to end and self-contained. We can take SNMP data natively, and we can take syslog; both are widely deployed, as we know. We can take OpenConfig and gRPC, which is what the world wants to move to but has failed to for the last decade. We can take sFlow and IPFIX. Essentially, we cover everything from topology to system metrics to interface metrics to events to logs, across all the standards you can imagine. Then you get into proprietary APIs: we have a mechanism to take JSON-encoded data, and that data can be sent to Augtera using Kafka or using our API. So we can connect to a Kafka bus, and the JSON-encoded data can be completely proprietary; it can be specific to a customer, to an enterprise, or to a third-party tool, and we can dynamically normalize that data using configuration the customer provides. We can also take VPC flow logs and other logs in the cloud, any other form of JSON logs. In addition to all these data sources, we have our own agent, which can be deployed purely for synthetic probes, for latency and loss measurements across public and private infrastructure, at scale, with a high degree of fidelity.

So it's a very wide range of data. The other thing is that we have significant engineering in making, for example, SNMP scale: we can poll tens of thousands of endpoints, switches or servers, with a very economical in-house SNMP implementation. At the same time, we have a highly economical big-data implementation for streaming telemetry and IPFIX, and we can take hundreds of millions of logs an hour and do real-time analytics, prediction, and machine learning on them. Hopefully that gives you a sense. Our goal is to take literally any data source you can imagine, and that's critical: if you can't take all the data, you have a problem. One of the issues in this space has been that a lot of tools have tried to optimize for single data sources: maybe there's a tool for SNMP, a tool for sFlow, a tool for IPFIX. That's a completely broken model, and it gets more and more difficult to maintain or manage in today's world. We are completely changing that.

That's really interesting, and I want to key on the topic of being reactive versus proactive. You brought up something I thought was really interesting: you have an agent that can go on devices to do synthetic testing. I think that's really important, because for performance reasons you constantly want to be testing applications and performance, at all times, and having an agent doing that synthetic testing is a really interesting piece of the puzzle. So what kinds of operating systems and form factors can that agent run on?

Our agent is highly optimized to consume very little memory. It can run with about one virtual core available to it and a few tens of megabytes; it can be as low as that if needed. It's a Linux binary, so it can run in any environment as a Linux binary, either on bare metal, on a VM, or containerized. We've got customers running it on switches that are Linux switches; there are plenty of switches that support or are friendly to Linux, and we can run our agent on those, or on servers, or on VMs in the public cloud, or, for that matter, in containers in the public or private cloud. So it's a range of environments and a range of use cases. You can check, for example: is my latency between AWS and Azure beginning to misbehave? Is my latency from a private data center to the public cloud, or within the data center across my servers, beginning to misbehave? Am I seeing significant packet loss between two availability zones in my private cloud, or within one? And is that correlated to packet drops happening on a switch? Because Augtera can actually get the switch packet drops using SNMP or gRPC/gNMI streaming. That's when it starts to get interesting, when you can really use the agent.

There are other people who have agents. I think where it gets interesting is, first, having agents at scale; that's a tough problem, both from a technology and an economics point of view. Second, agents produce so much data that you come back to our basic thesis: you need machine learning to tell you when that data is misbehaving. And once you know you have loss that is suddenly misbehaving, how do you know the cause? Where is the root cause for this loss? Is a switch dropping packets? Is a hypervisor dropping packets? Is there congestion on the top-of-rack-to-server links? And so on. So we use synthetic testing as an important tool in the puzzle to bring it all together.

Wow, okay. I want to jump into that a little more. We initially talked about ingesting data from these devices
and seeing if there are specific issues with a single device, but a lot of times, like you said with path monitoring, you want to see anything in the path that could be causing an issue. So can you help paint the picture, obviously without getting into the secret sauce, of how Augtera takes all this data from different devices? Is it internally building some sort of topology to see how everything connects, to be able to show where on a path there's an issue? What does that look like?

That's a very good question, and it gets at a bigger topic we haven't talked about so far: what is network-specific about Augtera? The semantics of the network have a lot to do with topology. It's not the only thing, but it's a big thing: understanding the protocols, layer 2, layer 3, BGP, SD-WAN, OSPF, IS-IS, you name it. So we auto-discover the topology of the network. We actually understand the topology and connectivity of the network, we can visualize it, and we build a model out of it. We use that to understand, if packets are going from server A to server B, all the possible paths they can take; we understand equal-cost multipath, all of that. We can take it even a step further: using flow data, we can figure out the actual links that packets for a specific flow went through. So that's the answer to your question.

Okay, that's really interesting. So let's jump to the network operator's perspective. I want to be able to see where I have issues in the network, and not only go in and see them but, like you mentioned, get alerted. How does the typical network operator, network administrator, or network operations center leverage Augtera on a daily basis? Are they typically in there proactively, or do they have alerting set up so well that they really only need to get involved when there's an issue? What does that typically look like?

Our customers really want to move away from a world where someone has to stare at a screen 24/7, because if you look at large enterprises and large providers, they sometimes have armies of people staring at screens. So the way our customers consume Augtera notifications is through ServiceNow and other ticketing platforms, which are automatically consumed by the NOC; through Slack or Teams, collaboration platforms, where the notification ends up in someone's Slack channel; through syslog, where the notifications feed the notification pipelines that already exist; or through email, and so on. We can also notify over Kafka. So it's really a paradigm where all the examples I gave today, congestion, latency, application loss, optics misbehavior, and so on, trigger notifications that humans consume, and then they come to Augtera to start doing deeper troubleshooting. We have a very comprehensive UI, but customers don't typically use our UI as a destination on a day-to-day basis; it's more of an automated workflow.

The other way customers consume the product is that when they get a notification, their software goes and drives a change in the network. Let's say there's a link flapping in the data center. It's much better to take that link out of service than to leave it the way it is, because there is so much equal-cost multipath that you don't really want any packets going over it. So there are customers who take our notifications into their automation pipelines and drive changes back to the network.

Okay, that makes a lot of sense, being able to remediate those issues without somebody even having to get involved. I can see that being very beneficial, and that's pretty impressive. So back to starting from day zero: what would be the initial effort for a customer to get Augtera up and running, and what does the care and feeding look like? Let's say, back to that enterprise use case, I've got my Augtera VMs stood up. What does it take for me to get it configured? And once I'm ingesting data from all my different devices and points in the network, how much tweaking of these alerts am I having to do? Or does pretty much everything out of the box get you going a long way right away?

Seventy percent of our customers use Augtera software as a service; 30 percent are fully on-prem. Let's take both scenarios to answer your question. For our SaaS customers, deployment is very straightforward. They deploy an Augtera collector, or a set of collectors, which take data from their infrastructure, whether it's SNMP, syslog, sFlow, Kafka, all the things I talked about. This collector acts as a proxy, and then, in a secure manner, it sends the data to the Augtera cloud. Deployment of the collector is very straightforward: sometimes customers do it themselves, sometimes our support team gets on the phone with them and does it in an hour or so. The collector is set up, they open some firewall rules, and sometimes they don't even need to, so that the collector can get the data. Maybe they push a couple of config changes; say they're using Splunk, they can point the syslog to the collector, and so on. Once the collector gets the data, it pushes it to the Augtera cloud, and all the customer has really done is deploy a collector. Then our onboarding teams train the customer on the configuration they can do to customize. On your second question about customizing and tailoring the anomalies they're getting: are they getting TCP anomalies, for example? Do they really care about TCP congestion? Do they care about loss? There is an onboarding session in which the customer expresses what's most important to them, and based on that, our team spends some time training them. The software ships by default with all sorts of anomalies and configurations supported; it's just a matter of enabling the ones the customer cares about. So they go enable them, and boom, now they're starting to get machine learning anomalies. There is no configuration at all to tell the software what the baseline is. People use the word "baseline" a lot in the industry; it's an overloaded word. In Augtera's case, we learn from the patterns, so it's very low touch.

Now, in large enterprise environments there can be cases where customers do need to do some customization. Maybe there are metrics in their data which are not standard. Maybe they want to tweak the notifications, or tweak what they consider operationally useful. One operator might say, "I care about packet drops only when they happen on spines and are anomalous." Another might say, "I care about anomalous packet drops even on top-of-rack switches, because my application is so sensitive." All of those things they can configure on their own in our UI, and there is training and very comprehensive documentation. So for a lot of our customers it's very low touch, and they can spend more time if they want to be highly customized. If all of this is deployed fully on-prem, the only difference is that the customer now needs to manage that on-prem stack, just like any on-prem enterprise software: making sure it's functioning and behaving correctly. It has all the right instrumentation and gives them all the right alerts, but they are responsible for it at that point.

So let's at least at a high level talk about
some of your customers. Is there a specific industry where you see the bulk of your customers, or are you really cross-industry, with a lot of different types of customers out there?

We've been fortunate, and I think we're really proud of the progress we've made across verticals. We've got very large-scale deployments, most of them Fortune 500, in retail, in finance, across providers and managed service providers, across different types of financial companies, and at technology companies. I would say the common theme is anyone who has a network that is important to them, whether it is hybrid or fully private. Actually, among large enterprises I don't know of anyone that has a completely public network, or is completely in the public cloud; it's mostly hybrid. So anyone for whom the network is important to the business is a really good candidate to use our software, and often that's very small providers or very small enterprises; it doesn't have to be someone very large. We've got a range of customers in all sorts of verticals and industries using us, sometimes at very large scale, tens of thousands of switches, and sometimes at small scale, hundreds of switches.

And you mentioned hybrid networks, industries that may have workloads in both private and public clouds. Is that a fairly easy endeavor from an Augtera standpoint, to have collectors on-prem and also be monitoring networks in the public cloud?

Yeah, and there are many ways to do it. We can take data directly from the public cloud, for example from S3 buckets or Google Stackdriver; VPC flow logs are a good example. We've done a fair bit of work, and we'll continue to do more and more, to make those integrations easier.

Oh, that's excellent. So we've been doing a lot of discussion around what I'll call the main use case here: how do we better operate our networks, how do we automate troubleshooting, which is huge, and it seems like that's what you're doing at Augtera day in and day out. Do you see any other use cases outside of straight network operations, management, and maintenance? Do you have any organizations leveraging Augtera as, for instance, a security play, or anything else?

That's a really good question. Our genesis and our focus are all around network operations, but the platform and the algorithms we've built often have a much broader scope, and as a startup we made a very conscious decision to stay focused. What we're finding is that as our customers use us more, they find more and more use cases. We have a lot of customers using us for network security. For example, we can find things like TCP SYN floods, because we're looking at packet data, and TCP SYN floods could be caused by an attack. We can find distributions that indicate some sort of DDoS issue. We can find rare logs that might indicate a security issue. So clearly there is a whole slew of use cases beyond network operations emerging, and more will emerge. Network security is one; applications are another. Our customers are telling us, "What if we give you application data, application metrics, for example stock ticks, and you tell us if there is misbehavior in a specific application metric?" We can do that, because, as I mentioned earlier, we can take JSON data and extract metrics from it in real time. So we're finding more and more use cases emerge, and you'll see us doing more with security and applications over time, but right now we continue to be very focused on network operations.

Again, I don't want to make you feel like I'm trying to have you talk about the secret sauce, but I do want to discuss the AI concept behind this. I think "artificial intelligence" can be a nebulous term and can mean different things to different people. Can you, at least at a high level, tell us what AI means to Augtera?

Yeah, absolutely. I think that's a very reasonable question, and there are many parts to it. One part is that Augtera has a fundamental belief that, for the network domain, and it actually applies beyond the network domain as well, the software needs to be able to learn from the data. You can't expect humans to tell you what's normal and what's abnormal. If you think about that, it gets into some very core differentiations in machine learning, around supervised versus unsupervised approaches: whether you expect someone to tell you what's normal, or whether you learn it from the data. We learn from the data; we learn from patterns. Then you have to ask yourself: when you learn from patterns, how do you do that? There are so many different patterns that one algorithm cannot satisfy them all. So we have multiple algorithms for different distributions, for different patterns, and we've figured out which algorithm to use when. There's a lot of secret sauce there; it's not easy to do, because you really need to purpose-build the algorithms for certain distributions. A lot of our algorithms are very network-specific, because they apply specifically to network data, and some of our algorithms are beginning to have broader applicability, because it turns out those distributions exist outside the network as well, in other domains.
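The idea Rahul keeps returning to, learning what "normal" looks like from the data itself rather than from operator-supplied thresholds, can be illustrated with a minimal sketch. This is a hypothetical toy (an exponentially weighted running mean and variance with a z-score check), not Augtera's actual algorithms; the class name, parameter values, and the temperature series are all invented for illustration:

```python
from dataclasses import dataclass

@dataclass
class EwmaAnomalyDetector:
    """Toy unsupervised baseliner: flags samples that sit far outside an
    exponentially weighted running mean/variance learned from the data."""
    alpha: float = 0.1      # how quickly "normal" adapts to new data
    threshold: float = 4.0  # z-score beyond which a sample is anomalous
    warmup: int = 10        # samples to observe before trusting the baseline
    n: int = 0
    mean: float = None
    var: float = 0.0

    def observe(self, x: float) -> bool:
        self.n += 1
        if self.mean is None:        # first sample seeds the baseline
            self.mean = x
            return False
        std = max(self.var ** 0.5, 1e-6)
        z = abs(x - self.mean) / std
        anomaly = self.n > self.warmup and z > self.threshold
        # keep relearning: the baseline tracks the data as "normal" drifts
        delta = x - self.mean
        self.mean += self.alpha * delta
        self.var = (1 - self.alpha) * (self.var + self.alpha * delta * delta)
        return anomaly

# A temperature series that is stable for a long stretch, then creeps up,
# loosely mirroring the field-unit example from earlier in the episode.
detector = EwmaAnomalyDetector()
readings = [45.0, 45.2, 44.9, 45.1, 45.0] * 20 + [52.0, 58.0, 63.0]
flags = [detector.observe(r) for r in readings]
```

The stable stretch produces no alerts, while the creep is flagged without anyone configuring what a "normal" temperature is. Because the baseline keeps updating, a sustained shift eventually becomes the new normal, echoing the "keep relearning" behavior described in the episode; a production system would layer per-metric algorithms, seasonality handling, and cross-layer correlation on top of anything this simple.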
uh so I think that's that's one one part part of it second part of it is we also understand that when the machine finds results they need to be operationally useful and to make them operationally useful you need metadata you need context so I gave an example earlier around spines at top of racks it could be around different data centers and different geographies or different vpcs so we do a lot of enrichment of the data to make it easy for the operator to mine the data so that's another belief we have that AI is is a tool but you need to marry it with a lot of Rich context from the domain to make it operationally useful and we have done that in the networking domain you know in in a very very deep way and we also learn the the models of the networks the topologies of the network the other thing is that we there are cases and I won't get into it it gets a bit into Secret Sauce where we do uh learn from the operator what the operator thinks is uh is a real issue and we can do that collective learning across operators so let's say one example I can talk about is if we our machine learning can learn as I mentioned using logs that there is a unique log for Asic parity error on let's say a Cisco or a journal or Arista device on a specific OS uh we can transfer that learning across our customer base so now if we learn that pattern in one network we can actually leverage that in another Network we call that collective learning um so I mean that's a very unique thing we've got and uh that takes a slightly different approach than some of the other approaches I talked about so hopefully that gives you a sense one more thing I'll say actually we have another very very rich algorithm which can do what we call autocorrelation so as we learn the topology of the network across less and as events take place let's say you cut a link in the data center and because of that the uh layer 2 flaps bgp flaps traffic distributions change all of this results in different events and in ocular 
terminology, different machine-learning anomalies. We can put all this together, using our machine-learning algorithm which leverages the network topology and all the events and anomalies that we find or create, into what we call a single incident. So that's another form of machine learning we do. It's usually not one thing, it's many different things. Hopefully that gives you some context. No, that gives me a lot of context. In fact, I think the whole concept of that issue correlation, being able to take different data from different devices and points in the network, correlate it, and call it out and say all of these different things are one single issue, I think that's incredibly beneficial. And you brought something up a minute ago that I really appreciate, but it's not something I had really thought about much until you said it: you really can't rely on your end users, or even on your technical staff, to always know whether what is happening is normal or expected behavior. And to have all of those different data points not only collected and aggregated but correlated, and be able to show you, hey, you need to look here because we think we have an issue here, or, like you alluded to earlier, have automation built in so you can have workflows that remediate whatever it is seeing, I think that's incredibly beneficial, and I'm sure your customers get tons of value out of that. You know, I want to go back and highlight what you just repeated, because this is one of the least understood issues: people don't know what's normal. This is the part which, often, when we talk to our customers, once they understand it, and a lot of them often know it, then everything opens up. Let's take a very simple example. Let's say I'm a large enterprise and I have a global network; it doesn't have to be global, say a national network across multiple regions, and I have
interconnects which I have bought from different providers, and maybe I have my own backbone, but I bought some interconnect from AT&T here, Verizon here, Level 3 there, and so on and so forth. Now these interconnects all have their own latency and loss characteristics: one might be at 50 milliseconds latency, another might be at 70; one might be having 0.01 percent loss, another 0.02, and that's normal. You have no idea as an operator what the normal behavior of that circuit is. Well, if you don't know what's normal, how can you know what's abnormal? So the magic of AI is to first discover what's normal, to actually learn from the patterns, and as the normal keeps changing, to keep relearning. That's another big thing that we do. This is not humanly possible, right? Humans can't actually figure this out unless they keep staring at screens, and that's a big part of the value that we provide. Definitely, I can definitely see a lot of value there. You know, I've always been a proponent that if there's an issue that can be seen automatically through network automation, I want documentation of it, I want that ticket generation. But we touched on it towards the beginning of this, and I'd like to dive deeper into it: that is beneficial, but then somebody has to open up their ServiceNow and all of a sudden they see tens or hundreds or even more incidents sitting there waiting for them, and they don't know where to begin. So can we kind of talk through, I want to talk specifically about your policy-based noise elimination, can you delve deeper into that? Yeah, that's actually a really good question, so let me maybe start from the other end. I'll take one of our customer examples to illustrate the power of what we do. In a very large network, we are getting billions of data points a day, and by data points I mean, you know, one metric on one interface in five minutes is one data point. So we're getting billions or
literally tens of billions a day. From this we generate about 100 or so anomalies, using all the different techniques I've talked about, and of these about 10 to 15 get ticketed. This part is very important: before we came into the picture, this particular operator literally had hundreds of tickets a day, so that's a lot of fatigue. With the power of all our tech, and I'll take you through a little bit of it, they are getting 10 to 15 tickets a day, which is very manageable. The mean time to detect has really dropped by almost 90 percent plus, like 90 to 93 percent or something like that, and the time to remediate has dropped as a result. But it comes down a lot to much fewer tickets, much higher fidelity, much more actionable. How does that happen? When we get the data in, one of the things we do is remove noise from the data, because with the machine learning we learn from the patterns, and we only ticket what's actually a really significant change. And we do take operator criteria, or policy, into account; you used the word policy, which is a good term, we use the operator policy. Things like: don't ticket anything if, for example, I don't really care about any anomalies which are happening with CRC errors until they're repetitive. That's a good example: I want some repetition to happen. So we have the ability to, first, well, if you go to the older systems, they're very threshold-based. Let's say you're looking at just temperature or optics or latency; you would say, you know what, give me an alert if latency goes over 100 milliseconds. Well, that's a problem, because latency keeps changing, and it's different on every circuit. So because of the machine learning you automatically reduce noise, because it's
learning from the distributions, instead of you selecting a threshold which doesn't really apply all the time to different objects. Right, now once you have these anomalies, then there is the concept of deduplication. I've created an anomaly on a specific interface on a device; well, maybe that anomaly repeats itself. You don't want to keep alerting or keep ticketing; you want to be able to either suppress those alerts or, with something like ServiceNow, we can do even better: we can add events to an open ticket. So you create a ticket, and now if that event keeps reoccurring, you keep adding to the ticket. That's the second technique. The third technique is our autocorrelation. I'll give you a production example. There is a switch which had an issue; it was a leaf switch connected to four spines. That leaf switch had a hardware corruption, and the four spines started to flap, drop LLDP and BGP sessions, and give alerts. So you've got eight alerts coming from these switches, while the switch which is misbehaving is silent; it's not even giving any alerts. Augtera's autocorrelation technology put these eight together into one, because we understand the topology, we understand they're connected to this common leaf, and we can autocorrelate. So instead of getting eight tickets or eight events, the operator got one, because we autocorrelated those eight into one. That's sort of another technique we have to reduce the noise, or the amount of tickets you get. And I'm sure I've not gone through all our techniques around this; we have a lot of things, for example finite state machines which dampen control-plane states. I can go on and on, but there's a lot which has happened to make this happen, so it's not so easy to do. I go back to that correlation being incredibly valuable, to be able to take those eight different alerts and really convert them into one incident, because I know I've been in situations where we may have had
physical power issues at a site, and all of a sudden we're getting alerts from every single device. And if every single device's alert was then converted into an incident, we'd have tons of incidents, and you're doing manual correlation at that point. To have a system that's doing that for you is incredibly valuable, because not only does it drop that number of incidents down to a minimum, but it also de-stresses the operations staff, because when they open up that ticketing system they're not completely overwhelmed; they're able to focus on the smaller number of things that are now on their plate. So there's tons of value there for sure. Exactly, yeah. So I do want to pivot a little bit. With the evolution of networks over the last few years, and more and more companies wanting to use commodity broadband and different types of internet circuits, not only to connect to the greater internet but to connect to things like public cloud, we have this wonderful thing called SD-WAN. Now there are a lot of moving pieces to SD-WAN; there are a lot of places where things can go wrong, and a lot of the SD-WAN providers have their own built-in analytics and can help you point to and correlate issues. But I do have to call this out, because I see a blog on your website about using Augtera to detect SD-WAN network brownouts before there's a failure. Can you pick that apart a little bit for us? Do you do a lot with SD-WAN? So an example would be, let's say there is an SD-WAN internet circuit, and that internet circuit begins to either drop packets intermittently or, again, its latency spikes. We can take data from the SD-WAN providers; most of them have a fairly good bit of IPFIX data. We can also take some proprietary logs from people like Versa, and we can figure out the misbehavior: loss, packet drops, jitter, or, you know, TCP retransmits, which tell us, you know what, the circuit is misbehaving, and we
can do it better than those vendors do, because our machine learning, with all the capabilities I mentioned, actually learns from the patterns. So often what can happen is that the SD-WAN itself might, for example, be running on a single leg: you might have two circuits, and one circuit has a gray failure. We can detect the gray failure and we can correlate it. We can also point out things like flaps in the control plane; if you've got a lot of flaps going on in the IPsec layer or in the tunnel layer, we can find that. And another thing which is very interesting in SD-WAN is we can connect the dots from the network to the application very quickly. So let's say a particular application is starting to see retransmits, let's say Zoom. We can really point out, look, Zoom is seeing retransmits, we can do it automatically, and, oh, you know what, it's happening because there are packet drops on this link. So there's a fair bit we can do with SD-WAN. Now I want to go back to alert fatigue in that kind of situation. Let's say you have a customer that's running an SD-WAN network. The SD-WAN provider obviously has their own logging, their own alerting, potentially their own capability to create incidents in an ITSM, and your customer also owns Augtera. So do you see those customers disabling some of that alerting in the SD-WAN provider, or maybe just not generating incidents from the SD-WAN provider's alerting system and leveraging Augtera to do that? How have you seen customers handle that? It's a good question. It's an evolution. What we see is that customers often start by just complementing whatever they've got with us, to find issues which other tools can't find. That's often a starting point, for example some of the gray failures I mentioned, some of the correlations I mentioned. As they get more comfortable with our software, yes, our customers do often turn off
other monitoring systems and just rely on us to really feed a more holistic view across the infrastructure. So it's a mix: we've got customers doing both, we've got customers who are very comfortable with both, and we've got customers very comfortable with just using our complementary capabilities on top of what exists. Yeah, that makes a lot of sense. I'm glad that you really highlighted that there are definitely transitional periods where people may leverage both, and maybe continue to leverage both, or become confident enough, because they like what they see, especially in that correlation of issues and minimizing the number of incidents that get created, that they get value out of Augtera and really entrust their alerting to it. So that's excellent. Rahul, this has been an excellent conversation; I appreciate you coming on. Is there anything that we haven't touched on that you want to dive into before we round this out? I think it's been a great conversation. No, I think we've had a pretty comprehensive discussion, so I really appreciate you giving us the opportunity to come onto the show, and for the great dialogue. Thank you, Rahul, yourself and the rest of the Augtera team, for supporting the Art of Network Engineering. If you want to check us out, you can find us at artofnetworkengineering.com and at Art of NetEng on Twitter. Rahul, where can we point people to go to learn more about Augtera? Augtera.com, that's the website; it's really one of the best places to go and get information about us. On LinkedIn we are really active and we've got a great community, so I would encourage people to join our LinkedIn page as well. I think those would be two good places. Excellent. Thank you again, Rahul, and to the rest of the team at Augtera. This has been another episode of The Art of Network Engineering. Thanks for joining us; we'll see you next time. Lexi:
If you vibe with what you heard us talking about today, we'd love for you to subscribe to our podcast in your favorite podcatcher. Also, go ahead and hit that bell icon to make sure you're notified of all our future episodes right when they come out. If you want to hear what we're talking about when we're not on the podcast, you can totally follow us on Twitter and Instagram at Art of NetEng, that's art of n-e-t-e-n-g. You can also find a bunch more info about us and the podcast at artofnetworkengineering.com. Thanks for listening.
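Editor's note for the show notes: the baselining idea discussed in the episode, learning each circuit's own "normal" latency from its history instead of applying one static threshold to every circuit, can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not Augtera's actual algorithm; the class name, window sizes, and the simple mean/standard-deviation model are all hypothetical stand-ins for real machine learning.

```python
from collections import deque
import math
import random

class CircuitBaseline:
    """Learn a per-circuit 'normal' from a sliding window of samples and
    flag values that deviate strongly from it. A toy stand-in for the
    ML baselining described in the episode (not Augtera's real model)."""

    def __init__(self, window=288, min_samples=30, sigmas=4.0):
        self.samples = deque(maxlen=window)  # e.g. one day of 5-minute samples
        self.min_samples = min_samples       # don't judge until we've learned enough
        self.sigmas = sigmas                 # how far from normal counts as anomalous

    def observe(self, value):
        """Return True if `value` is anomalous versus the learned baseline,
        then fold it into the window so 'normal' keeps adapting over time."""
        anomalous = False
        if len(self.samples) >= self.min_samples:
            mean = sum(self.samples) / len(self.samples)
            var = sum((s - mean) ** 2 for s in self.samples) / len(self.samples)
            std = math.sqrt(var)
            # Guard against a zero-variance window (perfectly flat metric)
            if std > 0 and abs(value - mean) > self.sigmas * std:
                anomalous = True
        self.samples.append(value)
        return anomalous

# Two circuits with very different normals, as in the interview: one sits
# near 50 ms latency, another near 70 ms. A single "alert over 100 ms"
# threshold treats them identically; a learned baseline does not.
random.seed(42)
circuit_a = CircuitBaseline()
circuit_b = CircuitBaseline()
for _ in range(200):
    circuit_a.observe(random.gauss(50, 1))
    circuit_b.observe(random.gauss(70, 1))

print(circuit_a.observe(58))  # far outside circuit A's learned normal -> True
print(circuit_b.observe(72))  # well within circuit B's learned normal -> False
```

The same sketch also shows why the normal "keeps changing": because every observation is folded back into the window, a circuit that drifts to a new steady state gradually stops alerting, which a fixed threshold cannot do.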