
Open Core, Real-Time Observability Born in the Cloud with Martin Mao


About Martin

Martin Mao is the co-founder and CEO of Chronosphere. He was previously at Uber, where he led the development and SRE teams that created and operated M3. Prior to that, he was a technical lead on the EC2 team at AWS and has also worked for Microsoft and Google. He and his family are based in our Seattle hub and he enjoys playing soccer and eating meat pies in his spare time.

Links:

Transcript

Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.

Corey: This episode is sponsored in part by Thinkst. This is going to take a minute to explain, so bear with me. I linked to an early version of their tool, canarytokens.org, in the very early days of my newsletter, and what it does is relatively simple and straightforward. It winds up embedding credentials, files, that sort of thing in various parts of your environment, wherever you want to; it gives you fake AWS API credentials, for example. And the only thing that these things do is alert you whenever someone attempts to use them. It’s an awesome approach. I’ve used something similar for years. Check them out. But wait, there’s more. They also have an enterprise option that you should be very much aware of: canary.tools. You can take a look at this, but what it does is it provides an enterprise approach to drive these things throughout your entire environment. You can get a physical device that hangs out on your network and impersonates whatever you want it to. When it gets Nmap scanned, or someone attempts to log into it, or access files on it, you get instant alerts. It’s awesome. If you don’t do something like this, you’re likely to find out that you’ve gotten breached the hard way. Take a look at this. It’s one of those few things that I look at and say, “Wow, that is an amazing idea. I love it.” That’s canarytokens.org and canary.tools. The first one is free. The second one is enterprise-y. Take a look. I’m a big fan of this. More from them in the coming weeks.

Corey: If your mean time to WTF for a security alert is more than a minute, it's time to look at Lacework. Lacework will help you get your security act together for everything from compliance service configurations to container app relationships, all without the need for PhDs in AWS to write the rules. If you're building a secure business on AWS with compliance requirements, you don't really have time to choose between antivirus or firewall companies to help you secure your stack. That's why Lacework is built from the ground up for the Cloud: low effort, high visibility and detection. To learn more, visit lacework.com.

Corey: Welcome to Screaming in the Cloud. I’m Corey Quinn. I’ve often talked about observability, or as I tend to think of it when people aren’t listening, hipster monitoring. Today, we have a promoted episode from a company called Chronosphere, and I’m joined today by Martin Mao, their CEO and co-founder. Martin, thank you for coming on the show and suffering my slings and arrows.

Martin: Thanks for having me on the show, Corey, and looking forward to our conversation today.

Corey: So, before we dive into what you’re doing now, I’m always a big sucker for origin stories. Historically, you worked at Microsoft and Google, but then you really sort of entered my sphere of things that I find myself having to care about when I’m lying awake at night and the power goes out by working on the EC2 team over at AWS. Tell me a little bit about that. You’ve hit the big three cloud providers at this point. What was that like?

Martin: Yeah, it was an amazing experience, I was a technical lead on one of the EC2 teams, and I think when an opportunity like that comes up on such a core foundational project for the cloud, you take it. So, it was an amazing opportunity to be a part of leading that team at a fairly early stage of AWS and also helping them create a brand new service from scratch, which was AWS Systems Manager, which was targeted at fleet-wide management of EC2 instances, so—

Corey: I’m a tremendous fan of Systems Manager, but I’m still looking for the person who named Systems Manager Session Manager because, at this point, I’m about to put a bounty out on them. Wonderful service; terrible name.

Martin: That was not me. So, yes. But yeah, no, it was a great experience, for sure, and I think just seeing how AWS operated from the inside was an amazing learning experience for me. And being able to create foundational pieces for the cloud was also an amazing experience. So, only good things to say about my time at AWS.

Corey: And then after that, you left and you went to Uber where you led development and SRE teams that created and operated something called M3. Alternately, I’m misreading your bio, and you bought an M3 from BMW and went to drive for Uber. Which is it?

Martin: I wish it was the second one, but unfortunately, it is the first one. So yes, I did leave AWS and joined Uber in 2015 to lead a core part of their monitoring and eventually larger observability team. And that team did go on to build open-source projects such as M3—which perhaps we should have thought about the name and the conflict with the car when we named it at the time—and other projects such as Jaeger for distributed tracing as well, and a logging backend system, too. So, yeah, definitely spent many years there building out their observability stack.

Corey: We’re going to tie a theme together here. You were at Microsoft, you were at Google, you were at AWS, you were at Uber, and you look at all of this and decide, “All right. My entire career has been spent in large companies doing massive globally scaled things. I’m going to go build a small startup.” What made you decide that, all right, this is something I’m going to pursue?

Martin: So, definitely never part of the plan. As you mentioned, a lot of big tech companies, and I think I always got a lot of joy building large distributed systems, handling lots of load, and solving problems at a really grand scale. And I think the reason for doing a startup was really the situation that we were in. So, at Uber as I mentioned, myself and my co-founder led the core part of the observability team there, and we were lucky to happen to solve the problem, not just for Uber but for the broader community, especially the community adopting cloud-native architecture. And it just so happened that we were solving the problem of Uber in 2015, but the rest of the industry has similar problems today.

So, it was almost the perfect opportunity to solve this now for a broader range of companies out there. And we already had a lot of the core technology built-in open-source as well. So, it was more of an opportunity rather than a long-term plan or anything of that sort, Corey.

Corey: So, before we dive into the intricacies of what you’ve built, I always like to ask people this question because it turns out that the only thing that everyone agrees on is that everyone else is wrong. What is the dividing line, if any, between monitoring and observability?

Martin: That’s a great question, and I don’t know if there’s an easy answer.

Corey: I mean, my cynical approach is that, “Well, if you call it monitoring, you don’t get to bring in SRE-style salaries. Call it observability and no one knows what the hell we’re talking about, so sure, it’s a blank check at that point.” It’s cynical, and probably not entirely correct. So, I’m curious to get your take on it.

Martin: Yeah, for sure. So, you know, there’s definitely a lot of overlap there, and they’re not really two separate things. In my mind at least, monitoring, which has been around for a very long time, has always been about notification and having visibility into your systems. And then as the systems got more complex over time, being able to not just have visibility into them but understand them a little bit more required, perhaps, additional new data types to go and solve those problems. And that’s how, in my mind, monitoring sort of morphed into observability. So, perhaps one is a subset of the other, and they’re not competing concepts there. But at least that’s my opinion. I’m sure there are plenty out there that would, perhaps, disagree with that.

Corey: On some level, it almost hits to the adage of, past a certain point of scale with distributed systems, it’s never a question of is the app up or down, it’s more a question of how down is it? At least that’s how it was explained to me at one point, and it was someone who was incredibly convincing, so I smiled and nodded and never really thought to question it any deeper than that. But I look back at the large-scale environments I’ve been in, and yeah, things are always on fire, on some level, and ideally, there are ways to handle and mitigate that. Past a certain point, the approach of small-scale systems stops working at large scale. I mean, I see that over in the costing world where people will put tools up on GitHub of, “Hey, I ran this script, and it works super well on my 10 instances.”

And then you try and run the thing on 10,000 instances, and the thing melts into the floor, hits rate limits left and right because people don’t think in terms of those scales. So, it seems like you’re sort of going from the opposite end. Well, this is how we know things work at large scale; let’s go ahead and build that out as an initially smaller team. Because I’m going to assume, not knowing much about Chronosphere yet, that it’s the sort of thing that will help a company before they get to the hyperscaler stage.

Martin: A hundred percent, and you’re spot on there, Corey. And it’s not even just a company going from small-stage, small-scale simple systems to more complicated ones; actually, if you think about this shift in the cloud right now, it’s really going from cloud to cloud-native. So, going from VMs to containers on the infrastructure tier, and going from monoliths to microservices. So, it’s not even the growth of the company, necessarily, or the growth of the load that the system has to handle, but this shift to containers and microservices heavily accelerates the growth of the amount of data that gets produced, and that is causing a lot of these problems.

Corey: So, Uber was famous for disrupting, effectively, the taxi market. What made you folks decide, “I know. We’re going to reinvent observability slash monitoring while we’re at it, too.” What was it about existing approaches that fell down and, I guess, necessitated you folks to build your own?

Martin: Yeah, great question, Corey. And actually, it goes to the first part; we were disrupting the taxi industry, and I think the ability for Uber to iterate extremely fast and respond as a business to changing market conditions was key to that disruption. So, monitoring and observability was a key part of that because, you can imagine, it was providing all of the real-time visibility into not only what was happening in our infrastructure and applications, but the business as well. So, it really came out of a necessity more than anything else. We found that in order to be more competitive, we had to adopt what is probably today known as cloud-native architecture, adopt running on containers and microservices so that we could move faster, and along with that, we found that all of the existing monitoring tools we were using weren’t really built for this type of environment. And that was the forcing function for us to create our own technologies, purpose-built for this modern type of environment, that gave us the visibility we needed to be competitive as a company and a business.

Corey: So, talk to me a little bit more about what observability is. I hear people talking about it in terms of having three pillars; I hear people talking about it, to be frank, in a bunch of ways so that they’re trying to, I guess, appropriate the term to cover what they already are doing or selling because changing vocabulary is easier than changing an entire product philosophy. What is it?

Martin: Yeah, we actually had a very similar view on observability, and originally we thought that it was a combination of metrics, logs, and traces, and that’s a very common view. You have the three pillars, almost like three checkboxes; you tick them off, and you have, quote-unquote, “observability.” And that’s actually how we looked at the problem at Uber: we built solutions for each one of those and we checked all three boxes. What we’ve come to realize since then is that perhaps that was not the best way to look at it, because just having all three doesn’t really help you with the ultimate goal of what you want from this platform, and having more of each of the types of data didn’t really help us with that, either. So, taking a step back and really looking at it, the lesson we learned is that our view on observability is really more from an end-user perspective, rather than a data type or data input perspective.

And really, from an end-user perspective, if you think about why you want to use your monitoring tool or your observability tool, you really want to be notified of issues and remediate them as quickly as possible. And to do that, it really just comes down to answering three questions. “Can I get notified when something is wrong? Yes or no? Do I even know something is wrong?”

The second question is, “Can I triage it quickly to know what the impact is? Do I know if it’s impacting all of my customers or just a subset of them, and how bad is the issue? Can I go back to sleep if I’m being paged at two o’clock in the morning?”

And the third one is, “Can I figure out the underlying root cause to the problem and go and actually fix it?” So, this is how we think about the problem now, is from the end-user perspective. And it’s not that you don’t need metrics, logs, or distributed traces to solve the problem, but we are now orienting our solution around solving the problem for the end-user, as opposed to just orienting our solution around the three data types, per se.

Corey: I’m going to self-admit to a fun billing experience I had once with a different monitoring vendor whom I will not name because it turns out, you can tell stories, you can name names, but doing both gets you in trouble. It was a more traditional approach in a simpler time, and they wound up sending me a message saying, “Oh, we’re hitting rate limits on CloudWatch. Go ahead and open a ticket asking for them to raise it.” And in a rare display of foresight, AWS responded to my ticket with a, “We can do this, but understand at this level of concurrency, it will cost something like $90,000 a month in increased charges, with that frequency, for that many metrics.” And that was roughly twice what our AWS bill was in those days, and, “Oh.” So, I’m curious as to how you can offer predictable pricing when you can have things that emit so much data so quickly. I believe you when you say you can do it; I’m just trying to understand the philosophy of how that works.

Martin: As I said earlier, we started to approach this by trying to solve it in a very engineering fashion, where we just wanted to create more efficient backend technology so that it would be cheaper for the increased amount of data. What we realized over time is that no matter how much cheaper we make it, the amount of data being produced, especially from monitoring and observability, kept increasing, and not even in a linear fashion but in an exponential fashion. And because of that, it really shifted the focus of the problem away from how efficiently we can store this data and toward how our users are using it, and do they even understand the data that’s being produced? So, in addition to the couple of properties I mentioned earlier, around cost accounting and rate-limiting—those are definitely required—the other thing we try to make available for our end-users is introspection tooling such that they understand the type of data that’s being produced. It’s actually very easy in the monitoring and observability world to write a single line of code that produces a lot of data, and most developers don’t understand that that single line of code produces so much data.

So, our approach to this is to provide a tool so that developers can introspect and understand what is produced on the backend side, not just what is being inputted from their code, and then not only have an understanding of that but also dynamic ways to deal with it. So that again, when they hit the rate limit, they don’t just have to monitor less; they understand that, “Oh, I inserted this particular label and now I have 20 times the amount of data that I had before. Do I really need that particular label in there? And if not, perhaps dropping it dynamically on the server-side is a much better way of dealing with that problem than having to roll back your code and change your metric instrumentation.” So, for us, the way to deal with it is not just to make the backend even more efficient, but really to have end-users understand the data that they’re producing, and make decisions on which parts of it are really useful and which parts of it they, perhaps, do not want or want to retain for shorter periods of time, for example, and then allow them to actually implement those changes on that data on the backend. And that is really how the end-users control the bills and the cost themselves.
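
To make that cardinality point concrete, here is a minimal sketch using the official prometheus_client Python library; the metric and label names are hypothetical. Each unique combination of label values becomes its own time series, which is how one instrumented line fans out into an enormous amount of backend data.

```python
# A minimal sketch of metric cardinality, using the official
# prometheus_client library. All metric and label names are hypothetical.
from prometheus_client import Counter

# With no labels, this is exactly one time series.
requests_total = Counter("requests_total", "Total requests served")

# Each unique combination of label values is its own series. With, say,
# 10 endpoints x 5 status codes x 2,000 customer IDs, this single line of
# instrumentation produces 100,000 distinct series on the backend.
requests_by_customer = Counter(
    "requests_by_customer_total",
    "Total requests served, broken down by customer",
    ["endpoint", "status", "customer_id"],  # customer_id drives the explosion
)

requests_by_customer.labels(
    endpoint="/rides", status="200", customer_id="c-1042"
).inc()
```

Dropping the customer_id label on the server side, as Martin describes, would collapse those 100,000 series back to 50 without a code rollback.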

Corey: So, there are a number of different companies in the observability space that have different approaches to what they solve for. In some cases, to be very honest, it seems like, well, I have 15 different observability and monitoring tools. Which ones do you replace? And the answer is, “Oh, we’re number 16.” And it’s easy to be cynical and down on that entire approach, but then you start digging into it and they’re actually right.

I didn’t expect that to be the case. What was your perspective that made you look around the, let’s be honest, fairly crowded landscape of observability companies’ tools that gave insight into the health status and well-being of various applications in different ways, and say, “You know, no one’s quite gotten this right, yet. I have a better idea.”

Martin: Yeah, you’re completely correct, and perhaps in the previous environments that everybody was operating in, there were a lot of different tools for different purposes. A company would purchase an infrastructure monitoring tool, or perhaps even a network monitoring tool, and then they would have, perhaps, an APM solution for the applications, and then perhaps BI tools for the business. So, there was always historically a collection of different tools to go and solve this problem. And I think, again, what has really happened with this recent shift to cloud-native is that the need for a lot of this data to be in a single tool has become more important than ever. So, if you think about your microservices running on a single container today: if a single container dies, that fact in isolation, without knowing, perhaps, which microservice was running on it, doesn’t mean very much, and just having that visibility is not going to be enough, just like if you don’t know which business use case that microservice was serving, that’s not going to be very useful for you, either.

So, with cloud-native architecture, there is more of a need to have all of this data and visibility in a single tool, which hasn’t historically happened. And also, none of the existing tools today—both the existing APM solutions out there and the existing hosted solutions in the world today—were really built for a cloud-native environment, because if you think about the timing that these companies were created at, you know, back in the early 2010s, Kubernetes and containers weren’t really a thing. So, a lot of these tools weren’t really built for the modern architecture that we see most companies shifting towards. So, the opportunity was really to build something for where we think the industry and everyone’s technology stack is going, as opposed to where the technology stack has been in the past. And it just so happened that we had built a lot of these solutions for a similar type of environment at Uber many years before. So, leveraging a lot of our lessons learned there put us in a good spot to build a new solution that we believe is fairly different from everything else that exists today in the market, and it’s going to be a good fit for companies moving forward.

Corey: So, on your website, one of the things that you, I assume, put up there just to pick a fight—because if there’s one thing these people love, it’s fighting—is a use case: outgrowing Prometheus. The entire story behind Prometheus is, “Oh, it scales forever. It’s what the hyperscalers would use. This came out of the way that Google does things.” And everyone talks about Google as if it’s this mythical Valhalla place where everything is amazing and nothing ever goes wrong. I’ve seen the conference talks. And that’s great. What does outgrowing Prometheus look like?

Martin: Yeah, that’s a great question, Corey. So, if you look at Prometheus—and it is the graduated, recommended monitoring tool for cloud-native environments—and the way it scales, it’s actually a single-binary solution, which is great because it’s really easy to get started. You deploy a single instance, and you have ingestion, storage, visibility, dashboarding, and alerting, all packaged together into one solution, and that’s definitely great. And it can scale by itself to a certain point and is definitely the recommended starting point, but as you really start to grow your business, increase your cluster sizes, and increase the number of applications you have, it actually isn’t a great fit for horizontal scale. There isn’t really high availability or horizontal scale built into Prometheus by default, and that’s why other projects in the CNCF, such as Cortex and Thanos, were created to solve some of these problems.

So, we looked at the problem in a similar fashion, and when we created M3, the open-source metrics platform that came out of Uber, we were also approaching it from this different perspective: we built it to be horizontally scalable and highly reliable from the beginning, and yet we didn’t really want it to be a, let’s say, competing project with Prometheus. So, it is actually something that works in tandem with Prometheus, in the sense that it can ingest Prometheus metrics and you can issue Prometheus query language queries against it, and it will fulfill those. But it is really built for a more scalable environment. And I would say that once a company starts to grow, they run into some of these pain points, and these pain points are around how reliable a Prometheus instance is and how you can scale it up beyond just giving it more resources on the VM that it runs on; vertical scale runs out at a certain point. Those are some of the pain points that a lot of companies do run into and need to solve eventually. And there are various solutions out there, both in open-source and in the commercial world, that are designed to solve those pain points, M3 being one of the open-source ones and, of course, Chronosphere being one of the commercial ones.
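
For readers who want to see what “works in tandem with Prometheus” means mechanically, the usual wiring is Prometheus’s remote read/write protocol pointed at M3’s coordinator. This is a hedged sketch of a prometheus.yml fragment; the m3coordinator hostname and port 7201 reflect a typical M3 deployment and may differ in yours.

```yaml
# Sketch of a prometheus.yml fragment forwarding data to M3. The
# m3coordinator host and port here are assumptions about the deployment.
remote_write:
  - url: "http://m3coordinator:7201/api/v1/prom/remote/write"
remote_read:
  - url: "http://m3coordinator:7201/api/v1/prom/remote/read"
```

Prometheus keeps doing local scraping and alerting, while M3 provides the horizontally scalable, replicated storage behind it, and PromQL queries can be served from either side.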

Corey: This episode is sponsored in part by Salesforce. Salesforce invites you to “Salesforce and AWS: What’s Ahead for Architects, Admins and Developers” on June 24th at 10AM, Pacific Time. It’s a virtual event where you’ll get a first look at the latest innovations of the Salesforce and AWS partnership, and have an opportunity to have your questions answered. Plus you’ll get to enjoy an exclusive performance from Grammy Award-winning artist The Roots! I think they’re talking about a band, not people with superuser access to a system. Registration is free at salesforce.com/whatsahead.

Corey: Now, you’ve also gone ahead and more or less dangled raw meat in front of a tiger in some respects here because one of the things that you wind up saying on your site of why people would go with Chronosphere is, “Ah, this doesn’t allow for bill spike overages as far as what the Chronosphere bill is.” And that’s awesome. I love predictable pricing. It’s sort of the antithesis of cloud bills. But there is the counterargument, too, which is with many approaches to monitoring, I don’t actually care what my monitoring vendor is going to charge me because they wind up costing me five times more, just in terms of CloudWatch charges. How does your billing work? And how do you avoid causing problems for me on the AWS side, or other cloud provider? I mean, again, GCP and Azure are not immune from this.

Martin: So, if you look at the built-in solutions from the cloud providers, a lot of the metrics and monitoring you get from tools like CloudWatch or Stackdriver is included for free with your AWS bill already. It’s only if you want additional data and additional retention that you choose to pay more there. So, I think a lot of companies do use those solutions for the default set of monitoring that they want, especially for the AWS services, but generally, a lot of companies have custom monitoring requirements outside of that in the application tier, or require even more detailed monitoring in the infrastructure, especially if you think about Kubernetes.

Corey: Oh, yeah. And then I see people using CloudWatch as basically a monitoring, or metric, or log router, which at its price point, don’t do that. [laugh]. It doesn’t end well for anyone involved.

Martin: A hundred percent. So, our solution and our approach is a little bit different. It doesn’t actually go through CloudWatch or any of these other inbuilt cloud-hosted solutions as a router because, to your point, there’s a lot of cost there as well. It actually goes and collects the data from the infrastructure tier or the applications directly. And what we have found is that the bill for monitoring climbs exponentially—not just as you grow, but especially as you shift towards cloud-native architecture—so our very first take at solving that problem was to make the backend a lot more efficient than before, so that it would just be cheaper overall.

And we approached it that way at Uber, and we had great results there. So, originally, before M3, 8% of Uber’s infrastructure bill was spent on monitoring all of the infrastructure and the applications. And by the time we were done with M3, that cost was a little over 1%. So, the very first solution was just to make it more efficient. And that worked for a while, but what we saw is that over time, this grew again.

And there wasn’t any more efficiency we could crank out of the backend storage system; there’s only so much optimization you can do to the compression algorithms in the backend and how much you can get there. So, what we realized was that the problem shifted: it was no longer about whether we can store this data more efficiently, because we were already reaching limitations there, and more about getting the users of this data—so, individual developers themselves—to start to understand what data is being produced, how they’re using it, whether it’s even useful, and then taking control from that perspective. And this is not a problem isolated to the SRE team or the observability team anymore; if you think about modern DevOps practices, every developer needs to take control of monitoring their own applications. So, this responsibility is really in the hands of the developers.

And the way we approached this from a Chronosphere perspective is really in four steps. The first one is that we have cost accounting so that every developer, and every team, and the central observability team know how much data is being produced. Because it’s actually a hard thing to measure, especially in the monitoring world. It’s—

Corey: Oh, yeah. Even AWS bills get this wrong. Like if you’re sending data between one availability zone to another in the same region, it charges a penny to leave an AZ and a penny to enter an AZ in that scenario. And the way that they reflect this on the bill is they double it. So, if you’re sending one gigabyte across AZ link in a month, you’ll see two gigabytes on the bill and that’s how it’s reflected. And that is just a glimpse of the monstrosity that is the AWS billing system. But yeah, exposing that to folks so they can understand how much data their application is spitting off? Forget it. That never happens.

Martin: Right. Right. And it’s not even exposing it to the company as a whole, it’s to each use case, to each developer so they know how much data they are producing themselves. They know how much of the bill is being consumed. And then the second step in that is to put up bumper lanes to that so that once you hit the limit, you don’t just get a surprise bill at the end of the month.

When each developer hits that limit, they rate-limit themselves and they only impact their own data; there is no impact to the other developers or to the other teams, or to the rest of the company. So, we found that those two were necessary initial steps, and then there were additional steps beyond that, to help deal with this problem.

Corey: So, in order for this to work, given what can be a multi-day lag in the billing data in some cases, it’s a near certainty that you’re looking at what is happening and the expense that is being incurred in real-time, not waiting for it to pass its way through the AWS billing system and then doing some tag attribution back.

Martin: A hundred percent. It’s in real-time for the stream of data. And as I mentioned earlier, the monitoring data we are collecting goes straight from the customer environment to our backend, so we’re not waiting for it to be routed through the cloud providers because, rightly so, there is a multi-day or multi-hour delay there. So, as the data is coming straight to our backend, we are actively measuring it in real-time and cost accounting it to each individual team. And in real-time, if the usage goes above what is allocated, we will actually limit that particular team or that particular developer, and prevent them by default from using more. And with that mechanism, you can imagine, that’s how the bill is controlled, and controlled in real-time.
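
The shape of that mechanism, real-time attribution plus per-team bumper lanes, can be sketched in a few lines of Python. This is purely illustrative and not Chronosphere’s actual implementation; the quota model (a cap on active series per team) and all names are assumptions for the sake of the example.

```python
# Illustrative sketch only -- not Chronosphere's actual implementation.
# Incoming series are attributed to a team in real time; once a team
# exceeds its quota, only that team's new series are limited.
from collections import defaultdict

class PerTeamSeriesLimiter:
    def __init__(self, quotas):
        self.quotas = quotas                # team -> max active series
        self.active = defaultdict(set)      # team -> series IDs seen so far

    def admit(self, team, series_id):
        seen = self.active[team]
        if series_id in seen:
            return True                     # updates to a known series pass
        if len(seen) >= self.quotas.get(team, 0):
            return False                    # this team alone is rate-limited
        seen.add(series_id)
        return True

limiter = PerTeamSeriesLimiter({"payments": 2, "maps": 100})
print(limiter.admit("payments", 'cpu{host="a"}'))  # True
print(limiter.admit("payments", 'cpu{host="b"}'))  # True
print(limiter.admit("payments", 'cpu{host="c"}'))  # False: over quota
print(limiter.admit("maps", 'cpu{host="a"}'))      # True: others unaffected
```

Because the accounting happens on the ingest path rather than on a delayed bill, the rejection, and the surprise, happens in real time.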

Corey: So, help me understand, on some level; is your architecture then agent-based? Is it a library that gets included in the application code itself? All of the above and more? Something else entirely? Or is this just such a ridiculous question that you can’t believe that no one has ever asked it before?

Martin: No, it’s a great question, Corey, and I would love to give some more insight there. So, it is an agent that runs in the customer environment because there does need to be something there that goes and collects all the data we’re interested in to send to the backend. This agent is unlike a lot of APM agents out there, which do, sort of, introspection and things like that. We really believe in the power of the open-source community, and in particular, open-source standards like the Prometheus format for metrics. So, what this agent does is it actually goes and discovers Prometheus endpoints exposed by the infrastructure and applications, and scrapes those endpoints to collect the monitoring data to send to the backend.

And that is the only piece of software that runs in our customer environments. And then from that point on, all of the data is in our backend, and that’s where we go and process it, provide visibility to the end-users, store it, and make it available for alerting and dashboarding purposes as well.
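
Because the agent leans on the open Prometheus exposition format rather than APM-style introspection, the scrape side reduces to plain HTTP plus a standard parser. Here is a hedged sketch; the endpoint URL is hypothetical, and a real agent would discover its targets dynamically rather than hard-coding one.

```python
# Minimal sketch of scraping a Prometheus /metrics endpoint and parsing it
# with the official prometheus_client parser. The URL is hypothetical; a
# real agent would discover targets instead of hard-coding them.
import urllib.request

from prometheus_client.parser import text_string_to_metric_families

def scrape(url):
    body = urllib.request.urlopen(url, timeout=5).read().decode("utf-8")
    for family in text_string_to_metric_families(body):
        for sample in family.samples:
            # Each sample carries a name, a label set, and a value -- ready
            # to ship to a backend for storage, alerting, and dashboards.
            print(sample.name, sample.labels, sample.value)

scrape("http://localhost:9100/metrics")  # e.g., a node_exporter target
```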

Corey: So, when did you found Chronosphere? I know that you folks recently raised a Series B—congratulations on that, by the way; that generally means, at least if I understand the VC world correctly, that you’ve established product-market fit and now we’re talking about let’s scale this thing. My experience in startup land was, “Oh, we’ve raised a Series B, that means it’s probably time to bring in the first DevOps hire.” And that was invariably me, and I wound up screaming and freaking out for three months, and then things were better. So, that was my exposure to Series B.

But it seems like, given what you do, you probably had a few SRE folks kicking around, even on the product team, because everything you’re saying so far absolutely resonates with the experience of someone who has run these large-scale things in production. No big surprise there. Is that where you are? I mean, how long have you been around?

Martin: Yeah, so we’ve been around for a couple of years thus far—so still a relatively new company, for sure. A lot of the core team were the team that both built the underlying technology and ran it in production for many years at Uber, and that team is now here at Chronosphere. So, you can imagine from the very beginning, we had DevOps and SREs running this hosted platform for us. And it’s the folks that actually built the technology and ran it for years running it again, outside of Uber now. And then to your first question, yes, we did establish product-market fit fairly early on, and I think that is also because we could leverage a lot of the technology that we had built at Uber, and it sort of gave us a boost to have a product ready for the market much faster.

And what we’re seeing in the industry right now is that the adoption of cloud-native is so fast that it’s accelerating the need for a new monitoring solution, because historical solutions, perhaps, cannot handle a lot of the use cases there. It’s a new architecture, it’s a new technology stack, and we have the solution purpose-built for that particular stack. So, we are seeing fairly fast acceleration and adoption of our product right now.

Corey: One problem that an awful lot of monitoring slash observability companies have gotten into in the last few years—at least it feels this way, and maybe I’m wildly incorrect—is that it seems that the target market is the Ubers of the world, the hyperscalers, where once you’re at that scale, then you need a tool like this, but if you’re just building a standard three-tier web app, oh, you’re nowhere near that level of scale. And the problem with go-to-market in those stories inherently seems to be that by the time you are a hyperscaler, you have already built a somewhat significant observability apparatus, otherwise you would not have survived or stayed up long enough to become a hyperscaler. How do you find that the on-ramp looks? I mean, your website does talk about, “When you outgrow Prometheus.” Is there a certain point of scale that customers should be at before they start looking at things like Chronosphere?

Martin: I think if you think about the companies that are born in the cloud today and how quickly they are running and iterating their technology stack, monitoring is so critical to that. The real-time visibility into changes that are going out multiple times a day is critical to the success and growth of a lot of new companies. And because of how critical that piece is, we’re finding that you don’t have to be a giant hyperscaler like Uber to need technology like this. And as you rightly pointed out, you need technology like this as you scale up. And what we’re finding is that while a lot of large tech companies can invest a lot of resources into hiring these teams and building out custom software themselves, generally, it’s not a great investment on their behalf because those are not companies that are selling monitoring technology as their core business.

So generally, what we find is that it is better for companies to perhaps outsource or purchase, or at least use open-source solutions to solve some of these problems rather than custom-build in-house. And we’re finding that earlier and earlier on in a company’s lifecycle, they’re needing technology like this.

Corey: Part of the problem I always ran into was—again, I come from the old world of grumpy Unix sysadmins—for me, using Nagios was my approach to monitoring. And that’s great when you have a persistent, stateful single node or a couple of single nodes. And then you outgrow it because well, now everything’s ephemeral, and by the time you realize that there’s an outage or an issue with a container, the container hasn’t existed for 20 minutes. And you better have good telemetry into what’s going on and how your application behaves, especially at scale because at that point, edge cases, one-in-a-million events, happen multiple times a second, depending upon scale, and that’s a different way of thinking. I’ve been somewhat fortunate in that, in my experience at least, I’ve not usually had to go through those transformative leaps.

I’ve worked with Prometheus, I’ve worked with Nagios, but never in the same shop. That’s the joy of being a consultant. You go into one environment, you see what they’re doing and you take notes on what works and what doesn’t, you move on to the next one. And it’s clear that there’s a definite defined benefit to approaching observability in a more modern way. But I despair the idea of trying to go from one to the other. And maybe that just speaks to a lack of vision for me.

Martin: No, I don’t think that’s the case at all, Corey. I think we are seeing a lot of companies do this transition. I don’t think a lot of companies go and ditch everything that they’ve done and the things that they put years of investment into; there’s definitely a gradual migration process here. And what we’re seeing is that a lot of the newer projects, newer environments, newer efforts that have been kicked off are being monitored and observed using modern technology like Prometheus.

And then there are also a lot of legacy systems which are still going to be around, and legacy processes which are still going to be around for a very long time. It’s actually something we had to deal with at Uber as well; we were actually using Nagios and a StatsD/Graphite stack for a very long time before switching over to a more modern tag-based system like Prometheus. So—

Corey: Oh, modern Nagios. What was it, uh… that’s right, Icinga. That’s what it was.

Martin: Yes, yes. It was actually the system that we were using at Uber. And I think for us, it’s not just about ditching all of that investment; it’s really about supporting this migration as well. And this is why, in the open-source technology M3, we actually support both the more legacy data types, like StatsD and the Graphite query language, as well as the more modern types like Prometheus and PromQL. And having support for both allows for a migration and a transition.

And not even a complete transition; I’m sure there will always be StatsD and Graphite data in a lot of these companies because there are just legacy applications that nobody owns or touches anymore, and they’re just going to be lying around for a long time. So, it’s actually something that we proactively get ahead of and ensure that we can support both use cases, even though we see a lot of companies trending towards the modern technology solutions, for sure.
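
To make the two dialects concrete, here is the same hypothetical measurement rendered both ways; the metric path and label names are made up for illustration.

```text
# StatsD/Graphite: a hierarchical dotted path, a value, and a type (|g = gauge)
servers.us-east-1.host42.cpu.user:63|g

# Prometheus exposition format: a flat name plus labels, queryable with PromQL
cpu_user_percent{region="us-east-1",host="host42"} 63
```

A backend that accepts both, as M3 aims to, lets the legacy dotted-path emitters keep running while new services move to the labeled form.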

Corey: The last point I want to raise has always been a personal, I guess, area of focus for me. I allude to it, sometimes; I’ve done a Twitter thread or two on it, but on your website, you say something that completely resonates with my entire philosophy, and to be blunt is why in many cases, I’m down on an awful lot of vendor tooling across a wide variety of disciplines. On the open-source page on your site, near the bottom, you say, and I quote, “We want our end-users to build transferable skills that are not vendor or product-specific.” And I don’t think I’ve ever seen a vendor come out and say something like that. Where did that come from?

Martin: Yeah. If you look at the core of the company, it is built on top of open-source technology. So, it is a very open core company here at Chronosphere, and we really believe in the power of the open-source community and in particular, perhaps not even individual projects, but industry standards and open standards. So, this is why we don’t have a proprietary protocol, or proprietary agent, or proprietary query language in our product because we truly believe in allowing our end-users to build these transferable skills and industry-standard skills. And right now that is using Prometheus as the client library for monitoring and PromQL as the query language.

And I think it’s not just a transferable skill that you can bring with you across multiple companies; it is also the power of that broader community. So, you can imagine now that there is a lot more sharing of, “Hey, I am monitoring, for example, MongoDB. How should I best do that?” Those skills can be shared because the common language that they’re all speaking, the queries that everybody is sharing with each other, the dashboards everybody is sharing with each other, are all, sort of, open-source standards now. And we really believe in the power of that, and we really do everything we can to promote it. And that is why, in our product, there isn’t any proprietary query language, or definitions of dashboarding, or [learning 00:35:39] or anything like that. So yeah, it is definitely just a core tenet of the company, I would say.

Corey: It’s really something that I think is admirable, I’ve known too many people who wind up, I guess, stuck in various environments where the thing that they work on is an internal application to the company, and nothing else like it exists anywhere else, so if they ever want to change jobs, they effectively have a black hole on their resume for a number of years. This speaks directly to the opposite. It seems like it’s not built on a lock-in story; it’s built around actually solving problems. And I’m a little ashamed to say how refreshing that is [laugh] just based upon what that says about our industry.

Martin: Yeah, Corey. And I think what we’re seeing is actually the power of these open-source standards, let’s say. Prometheus is actually having effects on the broader industry, which I think is great for everybody. So, while a company like Chronosphere is supporting these from day one, you see how pervasive the Prometheus protocol and the query language are that actually all of these probably more traditional vendors providing proprietary protocols and proprietary query languages all actually have to have Prometheus—or not ‘have to have,’ but we’re seeing that more and more of them are having Prometheus compatibility as well. And I think that just speaks to the power of the industry, and it really benefits all of the end-users and the industry as a whole, as opposed to the vendors, which we are really happy to be supporters of.

Corey: Thank you so much for taking the time to speak with me today. If people want to learn more about what you’re up to, how you’re thinking about these things, where can they find you? And I’m going to go out on a limb and assume you’re also hiring.

Martin: We’re definitely hiring right now. And you can find us on our website at chronosphere.io or feel free to shoot me an email directly. My email is martin@chronosphere.io. Definitely massively hiring right now, and also, if you do have problems trying to monitor your cloud-native environment, please come check out our website and our product.

Corey: And we will, of course, include links to that in the [show notes 00:37:41]. Thank you so much for taking the time to speak with me today. I really appreciate it.

Martin: Thanks a lot for having me, Corey. I really enjoyed this.

Corey: Martin Mao, CEO and co-founder of Chronosphere. I’m Cloud Economist Corey Quinn, and this is Screaming in the Cloud. If you enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you’ve hated this podcast, please leave a five-star review on your podcast platform of choice, along with an insulting comment speculating about how long it took to convince Martin not to name the company ‘Observability Manager Chronosphere Manager.’

Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.

Announcer: This has been a HumblePod production. Stay humble.

  continue reading

549 에피소드

Artwork
icon공유
 
Manage episode 295622553 series 2937944
Corey Quinn에서 제공하는 콘텐츠입니다. 에피소드, 그래픽, 팟캐스트 설명을 포함한 모든 팟캐스트 콘텐츠는 Corey Quinn 또는 해당 팟캐스트 플랫폼 파트너가 직접 업로드하고 제공합니다. 누군가가 귀하의 허락 없이 귀하의 저작물을 사용하고 있다고 생각되는 경우 여기에 설명된 절차를 따르실 수 있습니다 https://ko.player.fm/legal.

About Martin

Martin Mao is the co-founder and CEO of Chronosphere. He was previously at Uber, where he led the development and SRE teams that created and operated M3. Prior to that, he was a technical lead on the EC2 team at AWS and has also worked for Microsoft and Google. He and his family are based in our Seattle hub and he enjoys playing soccer and eating meat pies in his spare time.

Links:

Transcript

Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.

Corey: This episode is sponsored in part by Thinkst. This is going to take a minute to explain, so bear with me. I linked against an early version of their tool, canarytokens.org in the very early days of my newsletter, and what it does is relatively simple and straightforward. It winds up embedding credentials, files, that sort of thing in various parts of your environment, wherever you want to; it gives you fake AWS API credentials, for example. And the only thing that these things do is alert you whenever someone attempts to use those things. It’s an awesome approach. I’ve used something similar for years. Check them out. But wait, there’s more. They also have an enterprise option that you should be very much aware of canary.tools. You can take a look at this, but what it does is it provides an enterprise approach to drive these things throughout your entire environment. You can get a physical device that hangs out on your network and impersonates whatever you want to. When it gets Nmap scanned, or someone attempts to log into it, or access files on it, you get instant alerts. It’s awesome. If you don’t do something like this, you’re likely to find out that you’ve gotten breached, the hard way. Take a look at this. It’s one of those few things that I look at and say, “Wow, that is an amazing idea. I love it.” That’s canarytokens.org and canary.tools. The first one is free. The second one is enterprise-y. Take a look. I’m a big fan of this. More from them in the coming weeks.

Corey: If your mean time to WTF for a security alert is more than a minute, it's time to look at Lacework. Lacework will help you get your security act together for everything from compliance service configurations to container app relationships, all without the need for PhDs in AWS to write the rules. If you're building a secure business on AWS with compliance requirements, you don't really have time to choose between antivirus or firewall companies to help you secure your stack. That's why Lacework is built from the ground up for the Cloud: low effort, high visibility and detection. To learn more, visit lacework.com.

Corey: Welcome to Screaming in the Cloud. I’m Corey Quinn. I’ve often talked about observability, or as I tend to think of it when people aren’t listening, hipster monitoring. Today, we have a promoted episode from a company called Chronosphere, and I’m joined today by Martin Mao, their CEO and co-founder. Martin, thank you for coming on the show and suffering my slings and arrows.

Martin: Thanks for having me on the show, Corey, and looking forward to our conversation today.

Corey: So, before we dive into what you’re doing now, I’m always a big sucker for origin stories. Historically, you worked at Microsoft and Google, but then you really sort of entered my sphere of things that I find myself having to care about when I’m lying awake at night and the power goes out by working on the EC2 team over at AWS. Tell me a little bit about that. You’ve hit the big three cloud providers at this point. What was that like?

Martin: Yeah, it was an amazing experience, I was a technical lead on one of the EC2 teams, and I think when an opportunity like that comes up on such a core foundational project for the cloud, you take it. So, it was an amazing opportunity to be a part of leading that team at a fairly early stage of AWS and also helping them create a brand new service from scratch, which was AWS Systems Manager, which was targeted at fleet-wide management of EC2 instances, so—

Corey: I’m a tremendous fan of Systems Manager, but I’m still looking for the person who named Systems Manager Session Manager because, at this point, I’m about to put a bounty out on them. Wonderful service; terrible name.

Martin: That was not me. So, yes. But yeah, no, it was a great experience, for sure, and I think just seeing how AWS operated from the inside was an amazing learning experience for me. And being able to create foundational pieces for the cloud was also an amazing experience. So, only good things to say about my time at AWS.

Corey: And then after that, you left and you went to Uber where you led development and SRE teams that created and operated something called M3. Alternately, I’m misreading your bio, and you bought an M3 from BMW and went to drive for Uber. Which is it?

Martin: I wish it was the second one, but unfortunately, it is the first one. So yes, I did leave AWS and joined Uber in 2015 to lead a core part of their monitoring and eventually larger observability team. And that team did go on to build open-source projects such as M3—which perhaps we should have thought about the name and the conflict with the car when we named it at the time—and other projects such as Jaeger for distributed tracing as well, and a logging backend system, too. So, yeah, definitely spent many years there building out their observability stack.

Corey: We’re going to tie a theme together here. You were at Microsoft, you were at Google, you were at AWS, you were at Uber, and you look at all of this and decide, “All right. My entire career has been spent in large companies doing massive globally scaled things. I’m going to go build a small startup.” What made you decide that, all right, this is something I’m going to pursue?

Martin: So, definitely never part of the plan. As you mentioned, a lot of big tech companies, and I think I always got a lot of joy building large distributed systems, handling lots of load, and solving problems at a really grand scale. And I think the reason for doing a startup was really the situation that we were in. So, at Uber as I mentioned, myself and my co-founder led the core part of the observability team there, and we were lucky to happen to solve the problem, not just for Uber but for the broader community, especially the community adopting cloud-native architecture. And it just so happened that we were solving the problem of Uber in 2015, but the rest of the industry has similar problems today.

So, it was almost the perfect opportunity to solve this now for a broader range of companies out there. And we already had a lot of the core technology built-in open-source as well. So, it was more of an opportunity rather than a long-term plan or anything of that sort, Corey.

Corey: So, before we dive into the intricacies of what you’ve built, I always like to ask people this question because it turns out that the only thing that everyone agrees on is that everyone else is wrong. What is the dividing line, if any, between monitoring and observability?

Martin: That’s a great question, and I don’t know if there’s an easy answer.

Corey: I mean, my cynical approach is that, “Well, if you call it monitoring, you don’t get to bring in SRE-style salaries. Call it observability and no one knows what the hell we’re talking about, so sure, it’s a blank check at that point.” It’s cynical, and probably not entirely correct. So, I’m curious to get your take on it.

Martin: Yeah, for sure. So, you know, there’s definitely a lot of overlap there, and there’s not really two separate things. In my mind at least, monitoring, which has been around for a very long time, has always been around notification and having visibility into your systems. And then as the system’s got more complex over time, being able to understand that and not just have visibility into it but understand it a little bit more required, perhaps, additional new data types to go and solve those problems. And that’s how, in my mind, monitoring sort of morphed into observability. So, perhaps one is a subset of the other, and they’re not competing concepts there. But at least that’s my opinion. I’m sure there are plenty out there that would, perhaps, disagree with that.

Corey: On some level, it almost hits to the adage of, past a certain point of scale with distributed systems, it’s never a question of is the app up or down, it’s more a question of how down is it? At least that’s how it was explained to me at one point, and it was someone who was incredibly convincing, so I smiled and nodded and never really thought to question it any deeper than that. But I look back at the large-scale environments I’ve been in, and yeah, things are always on fire, on some level, and ideally, there are ways to handle and mitigate that. Past a certain point, the approach of small-scale systems stops working at large scale. I mean, I see that over in the costing world where people will put tools up on GitHub of, “Hey, I ran this script, and it works super well on my 10 instances.”

And then you try and run the thing on 10,000 instances, and the thing melts into the floor, hits rate limits left and right because people don’t think in terms of those scales. So, it seems like you’re sort of going from the opposite end. Well, this is how we know things work at large scale; let’s go ahead and build that out as an initially smaller team. Because I’m going to assume, not knowing much about Chronosphere yet, that it’s the sort of thing that will help a company before they get to the hyperscaler stage.

Martin: A hundred percent, and you’re spot on there, Corey. And it’s not even just a company going from small-stage, small-scale simple systems to more complicated ones, actually, if you think about this shift in the cloud right now, it’s really going from cloud to cloud-native. So, going from VMs to container on the infrastructure tier, and going from monoliths to microservices. So, it’s not even the growth of the company, necessarily, or the growth of the load that the system has to handle, but this shift to containers and microservices heavily accelerates the growth of the amount of data that gets produced, and that is causing a lot of these problems.

Corey: So, Uber was famous for disrupting, effectively, the taxi market. What made you folks decide, “I know. We’re going to reinvent observability slash monitoring while we’re at it, too.” What was it about existing approaches that fell down and, I guess, necessitated you folks to build your own?

Martin: Yeah, great question, Corey. And actually, it goes to the first part; we were disrupting the taxi industry, and I think the ability for Uber to iterate extremely fast and respond as a business to changing market conditions was key to that disruption. So, monitoring and observability was a key part of that because you can imagine it was providing all of the real-time visibility to not only what was happening in our infrastructure and applications, but the business as well. So, it really came out of a necessity more than anything else. We found that in order to be more competitive, we had to adopt what is probably today known as cloud-native architecture, adopt running on containers and microservices so that we can move faster, and along with that, we found that all of the existing monitoring tools we were using, weren’t really built for this type of environment. And it was that that was the forcing function for us to create our own technologies that were really purpose-built for this modern type of environment that gave us the visibility we needed to, to be competitive as a company and a business.

Corey: So, talk to me a little bit more about what observability is. I hear people talking about it in terms of having three pillars; I hear people talking about it, to be frank, in a bunch of ways such that they’re trying to, I guess, appropriate the term to cover what they’re already doing or selling, because changing vocabulary is easier than changing an entire product philosophy. What is it?

Martin: Yeah, we actually had a very similar view on observability, and originally we thought it was a combination of metrics, logs, and traces. That’s a very common view: you have the three pillars, almost like three checkboxes; you tick them off, and you have, quote-unquote, “observability.” And that’s actually how we looked at the problem at Uber: we built solutions for each one of those and checked all three boxes. What we’ve come to realize since then is that perhaps that was not the best way to look at it, because we had all three, but just having all three doesn’t really help you with the ultimate goal of what you want from the platform, and having more of each of the data types didn’t really help us with that, either. So, taking a step back, the lesson we learned, and our view on observability now, is really more from an end-user perspective, rather than a data type or data input perspective.

And really, from an end-user perspective, if you think about why you want to use your monitoring tool or your observability tool, you really
want to be notified of issues and remediate them as quickly as possible. And to do that, it really just comes down to answering three questions. “Can I get notified when something is wrong? Yes or no? Do I even know something is wrong?”

The second question is, “Can I triage it quickly to know what the impact is? Do I know if it’s impacting all of my customers or just a subset of them, and how bad is the issue? Can I go back to sleep if I’m being paged at two o’clock in the morning?”

And the third one is, “Can I figure out the underlying root cause of the problem and go and actually fix it?” So, this is how we think about the problem now: from the end-user perspective. And it’s not that you don’t need metrics, logs, or distributed traces to solve the problem, but we are now orienting our solution around solving the problem for the end-user, as opposed to orienting it around the three data types, per se.

Corey: I’m going to self-admit to a fun billing experience I had once with a different monitoring vendor, whom I will not name because it turns out, you can tell stories, you can name names, but doing both gets you in trouble. It was a more traditional approach in a simpler time, and they wound up sending me a message saying, “Oh, we’re hitting rate limits on CloudWatch. Go ahead and open a ticket asking for them to raise it.” And in a rare display of foresight, AWS responded to my ticket with a, “We can do this, but understand that at this level of concurrency, with that frequency, for that many metrics, it will cost something like $90,000 a month in increased charges.” And that was roughly twice what our AWS bill was in those days, and, “Oh.” So, I’m curious as to how you can offer predictable pricing when you have things that emit so much data so quickly. I believe you when you say you can do it; I’m just trying to understand the philosophy of how that works.

Martin: As I said earlier, we started to approach this in a very engineering fashion, where we just wanted to create more efficient backend technology so that it would be cheaper to handle the increased amount of data. What we realized over time is that no matter how much cheaper we made it, the amount of data being produced, especially from monitoring and observability, kept increasing, and not even in a linear fashion but in an exponential one. Because of that, the problem shifted from how efficiently we can store this data to how our users are using it, and whether they even understand the data that’s being produced. So, in addition to the couple of properties I mentioned earlier, around cost accounting and rate-limiting—those are definitely required—the other thing we try to make available to our end-users is introspection tools, so that they understand the type of data that’s being produced. It’s actually very easy in the monitoring and observability world to write a single line of code that produces a lot of data, and most developers don’t understand that that single line of code produces so much data.

So, our approach to this is to provide a tool so that developers can introspect and understand what is produced on the backend side, not just what is being inputted from their code, and then not only have an understanding of that but also dynamic ways to deal with it. So that, again, when they hit the rate limit, they don’t just have to monitor less; they understand that, “Oh, I inserted this particular label and now I have 20 times the amount of data that I had before. Do I really need that particular label in there? And if not, perhaps dropping it dynamically on the server side is a much better way of dealing with that problem than having to roll back my code and change my metric instrumentation.” So, for us, the way to deal with it is not just to make the backend even more efficient, but really to have end-users understand the data that they’re producing, make decisions on which parts of it are really useful and which parts they perhaps don’t want, or perhaps want to retain for shorter periods of time, for example, and then allow them to actually implement those changes on that data on the backend. And that is really how the end-users control the bills and the cost themselves.
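
To make that cardinality point concrete, here is a minimal sketch using Python’s prometheus_client library; the metric and label names are invented for illustration. One added label with many distinct values is all it takes to multiply the number of time series an application emits:

```python
from prometheus_client import Counter

# One line of instrumentation: a counter with a low-cardinality
# "endpoint" label. Series count = number of distinct endpoints.
REQUESTS = Counter('http_requests_total', 'HTTP requests served',
                   ['endpoint'])

# The same counter with a "user_id" label added: every distinct user
# now gets its own time series, so one code change can multiply the
# data volume by orders of magnitude.
REQUESTS_BY_USER = Counter('http_requests_by_user_total',
                           'HTTP requests served, per user',
                           ['endpoint', 'user_id'])

def handle_request(endpoint: str, user_id: str) -> None:
    REQUESTS.labels(endpoint=endpoint).inc()
    # 10 endpoints x 1,000,000 users = 10,000,000 series instead of 10.
    REQUESTS_BY_USER.labels(endpoint=endpoint, user_id=user_id).inc()
```

Dropping the user_id label dynamically on the server side, as Martin describes, collapses those millions of series back down to a handful without touching the application code.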

Corey: So, there are a number of different companies in the observability space that have different approaches to what they solve for. In some cases, to be very honest, it seems like, well, I have 15 different observability and monitoring tools. Which ones do you replace? And the answer is, “Oh, we’re number 16.” And it’s easy to be cynical and down on that entire approach, but then you start digging into it and they’re actually right.

I didn’t expect that to be the case. What was the perspective that made you look around the, let’s be honest, fairly crowded landscape of observability companies’ tools, which gave insight into the health and well-being of various applications in different ways, and say, “You know, no one’s quite gotten this right yet. I have a better idea.”

Martin: Yeah, you’re completely correct. In the environments everybody was operating in previously, there were a lot of different tools for different purposes. A company would purchase an infrastructure monitoring tool, or perhaps even a network monitoring tool, and then they would have, perhaps, an APM solution for the applications, and then perhaps BI tools for the business. So, there was always, historically, a collection of different tools to go and solve this problem. And I think, again, what has really happened recently with this shift to cloud-native is that the need for a lot of this data to be in a single tool has become more important than ever. Think about your microservices running on containers today: if a single container dies, that in isolation, without knowing which microservice was running on it, doesn’t mean very much, and just having that visibility is not going to be enough. Just like if you don’t know which business use case that microservice was serving, that’s not going to be very useful for you, either.

So, with cloud-native architecture, there is more of a need to have all of this data and visibility in a single tool, which historically hasn’t happened. And none of the existing tools today, both the existing APM solutions and the existing hosted solutions out there, were really built for a cloud-native environment, because if you think about the timing these companies were created, back in the early 2010s, Kubernetes and containers weren’t really a thing. So, a lot of these tools weren’t really built for the modern architecture that we see most companies shifting towards. The opportunity was really to build something for where we think the industry and everyone’s technology stack is going, as opposed to where the technology stack has been in the past. And it just so happened that we had built a lot of these solutions for a similar type of environment at Uber many years before, so leveraging a lot of our lessons learned there put us in a good spot to build a new solution that we believe is fairly different from everything else in the market today, and a good fit for companies moving forward.

Corey: So, on your website, one of the things that you, I assume, put up there just to pick a fight—because if there’s one thing these people love, it’s fighting—is a use case of outgrowing Prometheus. The entire story behind Prometheus is, “Oh, it scales forever. It’s what the hyperscalers would use. This came out of the way that Google does things.” And everyone talks about Google as if it’s this mythical Valhalla place where everything is amazing and nothing ever goes wrong. I’ve seen the conference talks. And that’s great. What does outgrowing Prometheus look like?

Martin: Yeah, that’s a great question, Corey. So, Prometheus is the graduated, recommended monitoring tool for cloud-native environments, and if you look at the way it scales, it’s actually a single-binary solution, which is great because it’s really easy to get started: you deploy a single instance, and you have ingestion, storage, visibility, dashboarding, and alerting all packaged together into one solution. And it can scale by itself to a certain point, and it is definitely the recommended starting point, but as you really start to grow your business, increase your cluster sizes, and increase the number of applications you have, it actually isn’t a great fit for horizontal scale. There isn’t really high availability or horizontal scale built into Prometheus by default, and that’s why other projects in the CNCF, such as Cortex and Thanos, were created to solve some of these problems.

So, we looked at the problem in a similar fashion, and when we created M3, the open-source metrics platform that came out of Uber, we also approached it from this different perspective: we built it to be horizontally scalable and highly reliable from the beginning. Yet we don’t really want it to be a, let’s say, competing project with Prometheus; it is actually something that works in tandem with Prometheus, in the sense that it can ingest Prometheus metrics, and you can issue Prometheus query language queries against it and it will fulfill them. But it is really built for a more scalable environment. And I would say that once a company starts to grow, they run into some of these pain points: around how reliable a Prometheus instance is, and how you can scale it up beyond just giving it more resources on the VM it runs on; vertical scale runs out at a certain point. Those are pain points that a lot of companies do run into and need to solve eventually. And there are various solutions out there, both in the open-source and the commercial world, designed to solve them: M3 being one of the open-source ones and, of course, Chronosphere being one of the commercial ones.
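
As a rough illustration of that compatibility: M3’s query endpoint speaks the standard Prometheus HTTP query API, so the same PromQL can be issued against either backend. A sketch in Python, assuming an M3 Coordinator on its usual port 7201; the hostnames and the metric name are placeholders:

```python
import requests

# The standard Prometheus instant-query API: GET /api/v1/query.
# Point it at a Prometheus server or at an M3 Coordinator; the query
# language and the response shape are the same.
PROMETHEUS_URL = "http://prometheus:9090/api/v1/query"
M3_URL = "http://m3coordinator:7201/api/v1/query"  # assumed endpoint

QUERY = 'sum(rate(http_requests_total[5m])) by (endpoint)'

def instant_query(base_url: str, promql: str) -> dict:
    resp = requests.get(base_url, params={"query": promql}, timeout=10)
    resp.raise_for_status()
    return resp.json()

for url in (PROMETHEUS_URL, M3_URL):
    print(url, instant_query(url, QUERY)["status"])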

Corey: This episode is sponsored in part by Salesforce. Salesforce invites you to “Salesforce and AWS: What’s Ahead for Architects, Admins and Developers” on June 24th at 10AM, Pacific Time. It’s a virtual event where you’ll get a first look at the latest innovations of the Salesforce and AWS partnership, and have an opportunity to have your questions answered. Plus you’ll get to enjoy an exclusive performance from Grammy Award-winning artist The Roots! I think they’re talking about a band, not people with super user access to a system. Registration is free at salesforce.com/whatsahead.

Corey: Now, you’ve also gone ahead and more or less dangled raw meat in front of a tiger in some respects here, because one of the things that you say on your site about why people would go with Chronosphere is, “Ah, this doesn’t allow for bill spike overages as far as what the Chronosphere bill is.” And that’s awesome. I love predictable pricing. It’s sort of the antithesis of cloud bills. But there is the counterargument, too, which is that with many approaches to monitoring, I don’t actually care what my monitoring vendor is going to charge me because they wind up costing me five times more just in terms of CloudWatch charges. How does your billing work? And how do you avoid causing problems for me on the AWS side, or other cloud providers? I mean, again, GCP and Azure are not immune from this.

Martin: So, if you look at the built-in solutions from the cloud providers, a lot of the metrics and monitoring you get from those, like CloudWatch or Stackdriver, is included for free with your AWS bill already; it’s only if you want additional data and additional retention that you choose to pay more. So, I think a lot of companies do use those solutions for the default set of monitoring that they want, especially for the AWS services, but generally, a lot of companies have custom monitoring requirements outside of that in the application tier, or need even more detailed monitoring in the infrastructure, especially if you think about Kubernetes.

Corey: Oh, yeah. And then I see people using CloudWatch as basically a monitoring, or metric, or log router, which at its price point, don’t
do that. [laugh]. It doesn’t end well for anyone involved.

Martin: A hundred percent. So, our solution and our approach are a little bit different. It doesn’t actually go through CloudWatch or any of these other built-in cloud-hosted solutions as a router because, to your point, there’s a lot of cost there as well; it goes and collects the data from the infrastructure tier or the applications directly. And what we have found is that the bill for monitoring climbs exponentially, not just as you grow but especially as you shift towards cloud-native architecture, so our very first take at solving that problem was to make the backend a lot more efficient than before, so it’s just cheaper overall.

And we approached it that way at Uber, and we had great results there. Originally, before M3, 8% of Uber’s infrastructure bill was spent on monitoring all the infrastructure and the applications, and by the time we were done with M3, the cost was a little over 1%. So, the very first solution was just to make it more efficient. And that worked for a while, but what we saw is that over time, this grew again.

And there wasn’t any more efficiency we could crank out of the backend storage system; there’s only so much optimization you can do to the compression algorithms in the backend. So, what we realized was that the problem shifted away from whether we can store this data more efficiently, because we were already reaching limitations there, and more towards getting the users of this data—so individual developers themselves—to start to understand what data is being produced, how they’re using it, and whether it’s even useful, and then taking control from that perspective. And this is not a problem isolated to the SRE team or the observability team anymore; if you think about modern DevOps practices, every developer needs to take control of monitoring their own applications. So, this responsibility is really in the hands of the developers.

And the way we approached this from a Chronosphere perspective is really in four steps. The first one is that we have cost accounting, so that every developer, every team, and the central observability team know how much data is being produced. Because it’s actually a hard thing to measure, especially in the monitoring world. It’s—

Corey: Oh, yeah. Even AWS bills get this wrong. Like, if you’re sending data from one availability zone to another in the same region, it charges a penny to leave an AZ and a penny to enter an AZ in that scenario. And the way that they reflect this on the bill is they double it. So, if you’re sending one gigabyte across an AZ link in a month, you’ll see two gigabytes on the bill, and that’s how it’s reflected. And that is just a glimpse of the monstrosity that is the AWS billing system. But yeah, exposing that to folks so they can understand how much data their application is spitting off? Forget it. That never happens.

Martin: Right. Right. And it’s not even exposing it to the company as a whole; it’s to each use case, to each developer, so they know how much data they are producing themselves and how much of the bill is being consumed. And then the second step is to put up bumper lanes, so that once you hit the limit, you don’t just get a surprise bill at the end of the month.

When each developer hits that limit, they rate-limit themselves and they only impact their own data; there is no impact on the other developers, the other teams, or the rest of the company. So, we found that those two were necessary initial steps, and then there were additional steps beyond that to help deal with this problem.

Corey: So, in order for this to work, with a multi-day lag in the billing data in some cases, it’s a near certainty that you’re looking at what is happening and the expense being incurred in real time, not waiting for it to pass through the AWS billing system and then doing some tag attribution after the fact.

Martin: A hundred percent. It’s in real time for the stream of data. And as I mentioned earlier, the monitoring data we are collecting goes straight from the customer environment to our backend, so we’re not waiting for it to be routed through the cloud providers because, rightly so, there is a multi-day or multi-hour delay there. As the data comes straight to our backend, we are actively measuring it in real time and cost-accounting it to each individual team. And in real time, if the usage goes above what is allocated, we’ll actually limit that particular team or that particular developer and prevent them, by default, from using more. And with that mechanism, you can imagine, that’s how the bill is controlled, and controlled in real time.
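
A toy sketch of that mechanism: per-team accounting on the ingest stream with a hard cap, enforced as data arrives rather than reconciled after the fact. The team names and limits are invented, and this illustrates the concept rather than Chronosphere’s actual implementation:

```python
from collections import defaultdict

# Hypothetical per-team budgets: the maximum number of active time
# series each team may produce. In a real system these would come
# from configuration, not constants.
LIMITS = {"payments": 50_000, "search": 20_000}
DEFAULT_LIMIT = 10_000

active_series = defaultdict(set)  # team -> series IDs seen so far

def ingest(team: str, series_id: str, value: float) -> bool:
    """Account for a datapoint in real time; reject new series for a
    team once it exceeds its budget. Other teams are unaffected."""
    seen = active_series[team]
    if series_id not in seen and len(seen) >= LIMITS.get(team, DEFAULT_LIMIT):
        return False  # rate-limited: only this team's overflow is dropped
    seen.add(series_id)
    store(team, series_id, value)
    return True

def store(team: str, series_id: str, value: float) -> None:
    ...  # write to the time series database
```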

Corey: So, help me understand, on some level; is your architecture then agent-based? Is it a library that gets included in the application code itself? All of the above and more? Something else entirely? Or is this just such a ridiculous question that you can’t believe that no one has ever asked it before?

Martin: No, it’s a great question, Corey, and I would love to give some more insight there. So, it is an agent that runs in the customer environment, because there does need to be something there that goes and collects all the data we’re interested in to send to the backend. This agent is unlike a lot of APM agents out there that do, sort of, introspection and things like that. We really believe in the power of the open-source community, and in particular, open-source standards like the Prometheus format for metrics. So, what this agent does is actually go and discover Prometheus endpoints exposed by the infrastructure and applications, and scrape those endpoints to collect the monitoring data to send to the backend.

And that is the only piece of software that runs in our customer environments. And then from that point on, all of the data is in our backend, and that’s where we go and process it, provide visibility to the end-users, store it, and make it available for alerting and dashboarding purposes as well.
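
In practice, “exposing a Prometheus endpoint” just means serving the plain-text Prometheus exposition format over HTTP. A minimal sketch with Python’s prometheus_client; any Prometheus-format scraper, whether a Prometheus server or a collector agent of the kind Martin describes, can then discover and scrape it. The port and metric here are arbitrary choices:

```python
import random
import time

from prometheus_client import Gauge, start_http_server

QUEUE_DEPTH = Gauge('worker_queue_depth', 'Items waiting in the queue')

if __name__ == '__main__':
    # Serves the Prometheus text format at http://localhost:8000/metrics.
    start_http_server(8000)
    while True:
        QUEUE_DEPTH.set(random.randint(0, 100))  # stand-in for real work
        time.sleep(5)
```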

Corey: So, when did you found Chronosphere? I know that you folks recently raised a Series B—congratulations on that, by the way; that generally means, at least if I understand the VC world correctly, that you’ve established product-market fit and now we’re talking about let’s scale this thing. My experience in startup land was, “Oh, we’ve raised a Series B, that means it’s probably time to bring in the first DevOps hire.” And that was invariably me, and I wound up screaming and freaking out for three months, and then things were better. So, that was my exposure to Series B.

But it seems like, given what you do, you probably had a few SRE folks kicking around, even on the product team, because everything you’re saying so far absolutely resonates with the experience of someone who has run these large-scale things in production. No big surprise there. Is that where you are? I mean, how long have you been around?

Martin: Yeah, so we’ve been around for a couple of years thus far—still a relatively new company, for sure. A lot of the core team here were the team that both built the underlying technology and ran it in production for many years at Uber, and that team is now here at Chronosphere. So, you can imagine that from the very beginning, we had DevOps and SREs running this hosted platform for us: the folks who actually built the technology and ran it for years, running it again outside of Uber now. And then to your first question: yes, we did establish product-market fit fairly early on, and I think that is also because we could leverage a lot of the technology that we had built at Uber, which sort of gave us a boost to have a product ready for the market much faster.

And what we’re seeing in the industry right now is that the adoption of cloud-native is so fast that it’s accelerating the need for a new monitoring solution, because historical solutions perhaps cannot handle a lot of the use cases there. It’s a new architecture, it’s a new technology stack, and we have a solution purpose-built for that particular stack. So, we are seeing fairly fast acceleration and adoption of our product right now.

Corey: One problem that an awful lot of monitoring slash observability companies have gotten into in the last few years—at least it feels this way, and maybe I’m wildly incorrect—is that it seems that the target market is the Ubers of the world, the hyperscalers, where once you’re at that scale, then you need a tool like this; but if you’re just building a standard three-tier web app, oh, you’re nowhere near that level of scale. And the problem with go-to-market in those stories inherently seems to be that by the time you are a hyperscaler, you have already built a somewhat significant observability apparatus; otherwise you would not have survived or stayed up long enough to become a hyperscaler. How do you find that the on-ramp looks? I mean, your website does talk about, “When you outgrow Prometheus.” Is there a certain point of scale that customers should be at before they start looking at things like Chronosphere?

Martin: I think if you look at the companies that are born in the cloud today and how quickly they are iterating their technology stack, monitoring is critical to that. The real-time visibility into changes that are going out multiple times a day is critical to the success and growth of a lot of new companies. And because of how critical that piece is, we’re finding that you don’t have to be a giant hyperscaler like Uber to need technology like this. As you rightly pointed out, you need technology like this as you scale up. And what we’re finding is that while a lot of large tech companies can invest a lot of resources into hiring these teams and building out custom software themselves, generally, it’s not a great investment on their behalf because those are not companies that are selling monitoring technology as their core business.

So generally, what we find is that it is better for companies to perhaps outsource or purchase, or at least use open-source solutions to solve some of these problems rather than custom-build in-house. And we’re finding that earlier and earlier on in a company’s lifecycle, they’re needing technology like this.

Corey: Part of the problem I always ran into was—again, I come from the old world of grumpy Unix sysadmins—for me, using Nagios was my approach to monitoring. And that’s great when you have a persistent stateful, single node or a couple of single nodes. And then you outgrow it because well, now everything’s ephemeral and by the time you realize that there’s an outage or an issue with a container, the container hasn’t existed for 20 minutes. And you better have good telemetry into what’s going on and how your application behaves, especially at scale because at that point, edge cases, one-in-a-million events happen multiple times a second, depending upon scale, and that’s a different way of thinking. I’ve been somewhat fortunate in that, in my experience at least, I’ve not usually had to go through
those transformative leaps.

I’ve worked with Prometheus, I’ve worked with Nagios, but never in the same shop. That’s the joy of being a consultant. You go into one environment, you see what they’re doing and you take notes on what works and what doesn’t, you move on to the next one. And it’s clear that there’s a definite defined benefit to approaching observability in a more modern way. But I despair the idea of trying to go from one to the other. And maybe that just speaks to a lack of vision for me.

Martin: No, I don’t think that’s the case at all, Corey. I think we are seeing a lot of companies make this transition. I don’t think a lot of companies go and ditch everything that they’ve done and that they put years of investment into; there’s definitely a gradual migration process here. And what we’re seeing is that a lot of the newer projects, newer environments, and newer efforts that have been kicked off are being monitored and observed using modern technology like Prometheus.

And then there are also a lot of legacy systems and legacy processes which are still going to be around for a very long time. It’s actually something we had to deal with at Uber as well; we were actually using Nagios and a StatsD/Graphite stack for a very long time before switching over to a more modern tag-based system like Prometheus. So—

Corey: Oh, modern Nagios. What was it, uh… that’s right, Icinga. That’s what it was.

Martin: Yes, yes. It was actually the system that we were using at Uber. And I think for us, it’s not just about ditching all of that investment; it’s really about supporting the migration as well. And this is why, in the open-source technology M3, we actually support both the more legacy data types, like StatsD and the Graphite query language, as well as the more modern ones like Prometheus and PromQL. And having support for both allows for a migration and a transition.

And not even a complete transition; I’m sure there will always be StatsD and Graphite data in a lot of these companies, because there are just legacy applications that nobody owns or touches anymore, and they’re going to be lying around for a long time. So, it’s actually something that we proactively get ahead of and ensure that we can support both use cases, even though we see a lot of companies trending towards the modern technology solutions, for sure.
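
From the application side, supporting both worlds can be as simple as emitting the same event through both clients during a migration. A hedged sketch; the statsd package on PyPI, a local StatsD agent on its default UDP port 8125, and the metric names are all assumptions:

```python
from prometheus_client import Counter
from statsd import StatsClient  # the `statsd` package on PyPI

# Legacy path: a StatsD metric, queryable with the Graphite query
# language once it lands in Graphite.
statsd = StatsClient(host='localhost', port=8125)

# Modern path: a labeled Prometheus metric, queryable with PromQL.
SIGNUPS = Counter('signups_total', 'User signups', ['plan'])

def record_signup(plan: str) -> None:
    statsd.incr(f'app.signups.{plan}')   # Graphite: sumSeries(app.signups.*)
    SIGNUPS.labels(plan=plan).inc()      # PromQL: sum(rate(signups_total[5m]))
```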

Corey: The last point I want to raise has always been a personal, I guess, area of focus for me. I allude to it, sometimes; I’ve done a Twitter thread or two on it, but on your website, you say something that completely resonates with my entire philosophy, and to be blunt is why in many cases, I’m down on an awful lot of vendor tooling across a wide variety of disciplines. On the open-source page on your site, near the bottom, you say, and I quote, “We want our end-users to build transferable skills that are not vendor or product-specific.”
And I don’t think I’ve ever seen a vendor come out and say something like that. Where did that come from?

Martin: Yeah. If you look at the core of the company, it is built on top of open-source technology. So, it is a very open core company here at Chronosphere, and we really believe in the power of the open-source community and in particular, perhaps not even individual projects, but industry standards and open standards. So, this is why we don’t have a proprietary protocol, or proprietary agent, or proprietary query language in our product because we truly believe in allowing our end-users to build these transferable skills and industry-standard skills. And right now that is using Prometheus as the client library for monitoring and PromQL as the query language.

And I think it’s not just a transferable skill that you can bring with you across multiple companies; it is also the power of that broader community. So, you can imagine now that there is a lot more sharing of, “Hey, I am monitoring, for example, MongoDB. How should I best do that?” Those skills can be shared because the common language that everybody is speaking, the queries everybody is sharing with each other, and the dashboards everybody is sharing with each other are all, sort of, open-source standards now. And we really believe in the power of that, and we do everything we can to promote it. That is why in our product there isn’t any proprietary query language, or definitions of dashboarding, or [learning 00:35:39] or anything like that. So yeah, it is definitely just a core tenet of the company, I would say.

Corey: It’s really something that I think is admirable. I’ve known too many people who wind up, I guess, stuck in various environments where the thing they work on is an application internal to the company, and nothing like it exists anywhere else, so if they ever want to change jobs, they effectively have a black hole on their resume for a number of years. This speaks directly to the opposite. It seems like it’s not built on a lock-in story; it’s built around actually solving problems. And I’m a little ashamed to say how refreshing that is [laugh] just based upon what that says about our industry.

Martin: Yeah, Corey. And I think what we’re seeing is actually the power of these open-source standards. Prometheus is having effects on the broader industry, which I think is great for everybody. While a company like Chronosphere has supported these from day one, you can see how pervasive the Prometheus protocol and query language are; even among the more traditional vendors with proprietary protocols and proprietary query languages, we’re seeing more and more of them add Prometheus compatibility as well. And I think that just speaks to the power of the industry, and it really benefits all of the end-users and the industry as a whole, as opposed to the vendors, and we are really happy to be supporters of that.

Corey: Thank you so much for taking the time to speak with me today. If people want to learn more about what you’re up to, how you’re thinking about these things, where can they find you? And I’m going to go out on a limb and assume you’re also hiring.

Martin: We’re definitely hiring right now. And you can find us on our website at chronosphere.io or feel free to shoot me an email directly. My email is martin@chronosphere.io. Definitely massively hiring right now, and also, if you do have problems trying to monitor your cloud-native environment, please come check out our website and our product.

Corey: And we will, of course, include links to that in the [show notes 00:37:41]. Thank you so much for taking the time to speak with me
today. I really appreciate it.

Martin: Thanks a lot for having me, Corey. I really enjoyed this.

Corey: Martin Mao, CEO and co-founder of Chronosphere. I’m Cloud Economist Corey Quinn, and this is Screaming in the Cloud. If you enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you’ve hated this podcast, please leave a five-star review on your podcast platform of choice, along with an insulting comment speculating about how long it took to convince Martin not to name the company ‘Observability Manager Chronosphere Manager.’

Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.

Announcer: This has been a HumblePod production. Stay humble.
