Looking for observability dashboards that have the right balance between looking good, easy to use, powerful, and intuitive? Ben Rometsch has a web application just for you. In this episode, he sits down with the creator and project lead of Grafana, Torkel Ödegaard, to talk about how he started the company. Torkel takes us inside the work they put into creating a great user interface as well as the behind-the-scenes of meeting co-founders, when they decided to fundraise, and scaling. He then shares stories on architecture decisions, some advice on how to navigate the open-source industry, and more.
Thanks for your time. Please introduce yourself and your project.
My name is Torkel Ödegaard. The project that I’ve been working on for the last seven years is Grafana. It started as a hobby project and it became a company about a year after the first open source release. I hooked up with an American, Raj Dutt, and an Australian, Anthony Woods. Together, we cofounded Grafana Labs with the goal to build an open SaaS monitoring and observability solution built on open source components. That was our initial vision. Back then, we didn’t know that Grafana was going to be as big as it is now. In the early days, we weren’t called Grafana Labs. We called ourselves Raintank. We got tired of referring to ourselves as, “We’re the company behind Grafana.” A lot of open source companies fell into that trap in the early days, having a separation between a company name and the project name. Many companies later rebranded to have a single name that people can recognize and know.
Take us back to that first moment where you started the project. You almost invented the segment.
The metrics, monitoring, application instrumentation, and the whole movement, a lot of credit goes to Graphite and StatsD. Also, Etsy pushed a lot of this. Etsy was the company that created StatsD and also blogged a lot about DevOps, instrumenting the application, and monitoring dashboards. StatsD is 2012 or 2013. Graphite is old, maybe even 2011. These were evolutions of other monitoring and operations-related tools for visualizing infrastructure metrics, CPU, and memory.
What Graphite did and what StatsD also pushed was also including application metrics into that mix. It’s not just getting graphs on infrastructure, memory, CPU, and network, but also have custom metrics, application metrics, business metrics, everything there to correlate. That’s something that I fell in love with. I’m a developer and an application engineer. I wanted to see where the bottlenecks were in the applications and services that I was building and see what was going on in production. Adding a few instrumentation lines and sending metrics made it super easy. I could get beautiful graphs, almost real-time graphs for what was going on in production, and no worries about filling up on centralized logging or disk. These metrics were cheap compared to logs. I fell in love with metrics, instrumentation.
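As a rough illustration of what "a few instrumentation lines" can look like, here is a minimal sketch in Go that formats metrics in the StatsD plain-text line format (`name:value|type`) and sends them over UDP. The metric names, the helper functions, and the address `127.0.0.1:8125` are assumptions made for the example; nothing here is taken from the services discussed in the interview.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// formatCounter builds a StatsD counter line, e.g. "checkout.orders:1|c".
func formatCounter(name string, n int64) string {
	return fmt.Sprintf("%s:%d|c", name, n)
}

// formatTiming builds a StatsD timer line in whole milliseconds,
// e.g. "checkout.duration:320|ms".
func formatTiming(name string, d time.Duration) string {
	return fmt.Sprintf("%s:%d|ms", name, d.Milliseconds())
}

// send writes one metric line to a StatsD daemon over UDP.
// UDP is fire-and-forget, so a missing daemon does not block the application.
func send(addr, line string) error {
	conn, err := net.Dial("udp", addr)
	if err != nil {
		return err
	}
	defer conn.Close()
	_, err = conn.Write([]byte(line))
	return err
}

func main() {
	start := time.Now()
	// ... handle a request here (placeholder for real application work) ...
	elapsed := time.Since(start)

	// Assumed address: StatsD conventionally listens on UDP port 8125.
	const addr = "127.0.0.1:8125"
	lines := []string{
		formatCounter("checkout.orders", 1),
		formatTiming("checkout.duration", elapsed),
	}
	for _, line := range lines {
		if err := send(addr, line); err != nil {
			fmt.Println("metric not sent:", err)
		}
	}
}
```

Each metric is a single short datagram, which is part of why metrics like these stay cheap compared to shipping log lines.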
The tooling to build dashboards was poor at the time. The most popular time-series database back then was Graphite. It was cumbersome to build dashboards. The graphs were generated PNGs. There was no tooling to help you understand the queries or what they were doing. The impetus for Grafana was a UI to help make the awesome power of time-series databases more user-friendly and more visually pleasing and interesting as well. You get incentivized to put that on a TV on the wall. That was the thing that also kick-started Grafana to make it viral and popular. People fell in love with building these dashboards for applications and services and putting them up on a TV on the wall. Other teams walk by and see that and they say, “I want a cool dashboard like that.” It quickly spreads throughout the organization. Also, it changes how you develop software.
One of the things that I always think about is how much effort went into the UI and the beauty of it. I’m remembering software like Nagios. The graphs always looked awful. They’re still being generated from the same Unix tool that was written 30 years ago. When I think of Grafana, the first thing that comes into my head is that dark and beautiful dashboard. Was that something that you realized you had to hit right from day one?
One thing that stuck with me from Graphite as well is that you could have these dark screenshots, dark graphs, and powerful graphs with lots of customization, customizing every series, multiple Y-axes. I was also trying to make sure that the UI felt clean and elegant, without a lot of buttons that distracted from the graphs and the data, and to have something that looked awesome, vibrant, and something that you want. I’ve always felt that UIs that are good-looking are fun to use. You need some of that fun to incentivize you to add instrumentation to build a dashboard. That process is not always fun, to do the thing that is required to get the end product. If the end product is good-looking and cool, it gives you that extra incentive to take the time to do all the things you need to do. One thing that motivated me to build UIs that look good is that they’re so much more fun to work with.
Who was responsible for that in the early days? Was that you or someone else?
It was me. I’m a back-end and a front-end developer. I’ve been that for a long time. I’m not a designer. I cannot draw at all. I do have an eye for UX and design there. I know what makes a polished and good-looking UI. I have some intuitive sense there. I’ve learned a ton over the years going from just development to working more with UX, UI, and product management as well. I have come to appreciate UX a lot. There is so much problem-solving. There are a lot of similarities between development and UX. You have a lot of conflicting requirements. The space of possibilities is almost endless. You have to navigate this space to find something that meets most of the requirements and has the right balance between looking good, easy to use, powerful, and intuitive.
There are some designers I worked with at the software agency that I started. The tooling around design seems to be converging more with all the processes around software engineering. They have Sketch and almost a commit history and shared collaboration against things. That’s interesting. I hadn’t considered that.
The whole profession of UX design has evolved a lot over the last 15 or so years. It has become much closer to development. There’s a lot more iterative collaboration between UX and engineers. It’s been interesting to see that.
I agree with that. In the early days, you were working on it by yourself and pushing to GitHub. At what point did you realize that there was something to it?
I worked on the first version by myself for maybe three months or so. We started using it a little bit in the company I was consulting for. As soon as I released the first open source version, it became popular within the Graphite community. They instantly took to it because they had been craving a polished dashboarding alternative that not only solved the dashboard building aspect but also made query building easy and intuitive. It took off incredibly fast. I added support for other data sources as well, which helped the growth curve by introducing Grafana to other communities that also use time-series databases, like OpenTSDB and InfluxDB, which were also trending at the time, and later on Prometheus. In the early days, it was me. I quickly realized that there was something here. I resigned from my contract three months after the first open source version.
It was fast.
I started working on it full-time about six months after the first release. I worked on it for six months by myself full-time. I got open source sponsorships. I had plans to have a pro version and a paid differentiated version to support some business around it. I got open source sponsorships from many companies like Squarespace and MediaMath that helped postpone some of those commercial plans. I met up with others that were also looking to do something within the open source space to build a monitoring and observability company. The first year when we had the company, it was still mainly me. I had some design help as we created the company. I hired two friends. They’re still with me, now leading teams here at Grafana Labs. Now we’re probably around 60 people spread around eight or so teams working on Grafana or Grafana Plugins or related things.
How did you meet the other cofounders? Did you know them prior?
No. They contacted me and they said that they were planning to build something and maybe leverage Grafana in a new SaaS product. They invited me to work with them for a week as a consultant in New York. I was already planning to go to New York to visit Squarespace, who was sponsoring the project. I spent a week with them. I connected personally with Raj and Anthony, my cofounders. We complemented each other well. Raj previously had run a hosting company. He’s a business guy. He started the hosting company in his dorm. He’s super technical and knows the hosting space. They saw a lot of opportunity in the monitoring space, knowing that this was a big problem for all the customers of their hosting service.
Anthony Woods, my other cofounder, is the CTO. He ran the technical aspects of these data centers. He worked with Raj before. He has the technical experience of hosting large-scale services. I have more of a traditional engineering and architecture background, leading engineering teams. I’m a developer. I love coding. We became a good foundation for a company: a business guy with a technical background and awesome technical sense, one who knows how to operate large services, and an engineer and developer.
The donations you were getting, were they all corporate entities donating? Did you have individuals as well?
No. I only had three companies supporting the project at that early stage.
When you founded the company, were you worried about people stealing a march on you and building the Grafana business without you?
Not so much in the early days. We didn’t know how big Grafana was going to be, as a brand or as a platform, like it is now. In the early days, I licensed the trademark to own Grafana as a trademark. That’s the thing we did as protection. We put our effort into building a SaaS platform. That was the initial take: to monetize the open source project through SaaS services. Later, we had a lot of requests for enterprise features, enhanced authentication, and on-premises software. So about three years ago, we started working on a differentiated Grafana Enterprise version. It has LDAP features, permission models, reporting, and other features that were requested by large companies. That became the second basis for our revenue beyond Cloud services: the differentiated enterprise version.
Which license had you chosen from day one?
Grafana is Apache. Most of our product has been Apache. Not our enterprise plugins and differentiated features. They’re licensed under a commercial license.
Has that seen you well, do you think?
It has. There’s a fear that other big SaaS vendors can take Grafana and host it for themselves, potentially robbing the company that makes the vast majority of the contributions of the revenue that supports the project. There is a fear there. I do fully understand why many companies are switching to other license models that help protect the business that is supporting the project the most. I have lots of sympathy for Elastic, MongoDB, and others that have other license models to make sure that the company that has built the open source project can sustain it.
At what point did you decide to raise money? Was that something that you always intended to do?
Yes. That was always something we intended to do. We started with a small seed. The first two years were with a small team of under ten people. We managed to get by on a small initial seed, along with some support contracts and some roadmap assurance deals: for features that we do want to build ourselves but maybe further in the future, companies can pay us to prioritize them.
I hadn’t heard of that before.
It’s been working pretty well in terms of being an added incentive to buy Grafana Enterprise. You can also get us to reprioritize some features that we do want to do but maybe much further in the future than a customer wants. If a customer says, “We need this feature,” we can consider raising its priority if they buy Grafana Enterprise. That’s something we have done, and it’s something we did early on as well, even before we had Grafana Enterprise. We had managed to go quite a long time without raising any serious amount of money. Our Series A was in September of 2019. We had gone almost 5 years on the seed round and grown to around maybe 60 people. We had managed to do that by building a sustainable business on enterprise revenue and Cloud revenue. When we saw the foundation of the company business model working and we had grown to a place where the organization also could support more rapid growth, that was when we raised our Series A.
Nowadays, it must be a fairly standard and well-understood thing from the people financing these deals. Do you think it would have been a different story if you’d have tried to do that five years ago?
Yeah. The open source space has evolved a lot over the last 5 years. It’s a much more hot area for investors. It’s a much more proven business model and space as well. If you have a successful open source project and have clear ways to monetize it, there are a lot of easier pathways now to find interest from investors. The first Series A was in 2019 and then we did a Series B last summer.
How did you figure out what stuff stays open and what goes into your enterprise products? Is that fairly natural?
Mostly. Also, many times it’s hard. There’s an internal debate in terms of what goes into the enterprise version and what goes into open source. The general sniff test that we’ve tried to develop is, on average, what does the consumer of this feature look like? Is this a pain point that a hobbyist or a small company is going to encounter? Is this a problem that only large-scale users will have? If it’s a feature that everyone will want to use and everyone will feel they are missing out on, it’s probably something that should go into open source. We don’t want to make the open source project feel like crippleware, where obvious good additions are left out. That’s been our philosophy so far.
What was the reaction from your open source community when you announced that there was going to be an enterprise version?
There’s been little reaction to that. Maybe it’s on small things and specific features around LDAP enhancements that people would want. It’s been pretty well received.
How do you go about managing that community? Is there a dedicated team within the company to manage the open source project?
The biggest Grafana teams are only working on the open source side of things. We’re maybe balancing up a little bit now because we’re growing our enterprise and Cloud teams a lot. If you look at the previous years of the company, the vast majority of the engineers have been working on open source projects. The open source project is the foundation of everything. It’s where we’re putting a lot of our effort.
Talk a little bit about how you scaled that. Looking at your GitHub page, it’s like this behemoth monster. Was that tough on you or other core members of the team originally? Did you feel like you’re on the back of a big dragon that you couldn’t keep control over?
In terms of managing the development and scaling the number of people working on Grafana, it can be exhausting, especially for major versions. When we do a major version, we try to make it big, impactful, and a big reason for people to upgrade. It’s also the opportunity to make any breaking changes to the platform or plugins. We never try to break plugins, but if we do, it should be in a major version. We try to do as much as possible in a major version. The last month or so of a major release can be stressful because there’s so much riding on it being great and being awesome.
Otherwise, it’s been super exciting to see how much stuff happens every week because we have many teams working on Grafana. We have a big UX team that does all of the design. Years ago, I posted new designs in our UX channel but no one commented because it was mainly me working on UX issues. Now, we have a huge UX team. It’s changed in terms of the amount of stuff that happens every day and every week. It’s exciting to see this small project that I started. Now, it has a life of its own with many people working and contributing every day.
That’s awesome to hear. What tools do you use to manage that both internally and externally?
The teams that work mainly on the open source side use GitHub Projects primarily. We try to keep as much of what we’re doing as possible in public so community members can also see what’s going on. We have community calls. When we do UX changes, we have a UX feedback session where we can talk through them, and we record it. These sessions are internal, but we put many of them on YouTube so other community members can see the discussions that we are having regarding the UX changes.
They can understand the thought processes behind it.
Some other teams are using ZenHub and other tools that make it easier to gain visibility into GitHub if you work on multiple repos. GitHub is the primary process tool for us in terms of day-to-day.
You don’t have a public Discord or Slack.
We have a public Slack where we can also engage with the community, especially plugin developers and others that are using Grafana as a platform.
How have you managed to avoid building a database? I would be thinking, “Now is the day I’m going to write a time-series database.” You must have woken up thinking that once.
Yes, in the early days. We saw the rise of Prometheus and we jumped early on the Prometheus bandwagon. Together, Grafana and Prometheus have ridden the wave of Kubernetes, monitoring, and observability over the last 3 years. That made it so that we didn’t have to do that. At the same time, we also invested in Cortex, the multi-tenant Prometheus version that we use to provide our hosted Prometheus service. Grafana Labs employs many of the contributors to Prometheus and Cortex. We are building our own database. We also started a project called Loki for logs. We’re building a logs database. We have a tracing database as well in Tempo. In many ways, we have been building our own set of databases to cover all the different use cases around observability.
Are you sure there’s not a folder on your machine that’s five years old and has five files in it? It’s amazing self-control to not do that as an engineer. Everyone wants to build a database at one point in their life.
I have sketches of query languages. Something that always annoyed me with the time-series databases that Grafana works with is that I saw limitations in their query language design. Luckily, Grafana can interoperate with many different data sources. Having a primary Grafana database hasn’t been that relevant. It’s much more useful for Grafana to be more open and composable and use different data sources because that’s the reality we see with many users. Many of our Grafana open source users, companies, and customers have more than one type of data source they use with Grafana. In many cases, they have 4 or 5 primary data sources that they use. They use maybe Splunk for logs and Oracle for some SQL BI. They have Prometheus or Loki for a new team that is trying out some new technology. Having that ability to use both new and old and mix both on-premises data and data you might have with a Cloud provider, that’s a key thing in Grafana.
Was the plugin system something that you had in your head right from day one?
Three or four months after the first release of Grafana, I added support for InfluxDB. To do that, I had to create some internal abstractions around the data source. Using that abstraction, we added support for Prometheus, Elasticsearch, and a few others. It wasn’t until Grafana 3 in 2016 that we made a plugin architecture and a plugin repository and made it easy for end-users to install plugins. Third-party plugins did exist before that, but they were not official or easy to install. The investment we made in plugins has been amazing. The number of community plugins that people have developed is incredible. That platform aspect is something that we leverage a lot internally as well with different apps and integrations that we’re building.
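To make the idea of an internal data source abstraction concrete, here is a minimal sketch in Go. It is not Grafana's actual code or plugin API: the `DataSource` interface, the `Registry` type, and the in-memory backend are all hypothetical names invented for this illustration. The point is only that once every backend answers the same query call, adding another backend means adding one more implementation.

```go
package main

import (
	"errors"
	"fmt"
)

// Point is one sample in a time series.
type Point struct {
	UnixSec int64
	Value   float64
}

// DataSource is a hypothetical abstraction: every backend exposes the same
// query call, so dashboard code does not care which database answers.
type DataSource interface {
	Name() string
	Query(expr string, from, to int64) ([]Point, error)
}

// memorySource is a toy backend that keeps its series in memory.
type memorySource struct {
	name   string
	series map[string][]Point
}

func (m memorySource) Name() string { return m.name }

// Query returns the points of the named series within [from, to].
func (m memorySource) Query(expr string, from, to int64) ([]Point, error) {
	pts, ok := m.series[expr]
	if !ok {
		return nil, errors.New("unknown series: " + expr)
	}
	var out []Point
	for _, p := range pts {
		if p.UnixSec >= from && p.UnixSec <= to {
			out = append(out, p)
		}
	}
	return out, nil
}

// Registry maps a data source name to its implementation.
type Registry map[string]DataSource

// Register adds a data source under its own name.
func (r Registry) Register(ds DataSource) { r[ds.Name()] = ds }

// Query routes a query to the named data source.
func (r Registry) Query(source, expr string, from, to int64) ([]Point, error) {
	ds, ok := r[source]
	if !ok {
		return nil, errors.New("unknown data source: " + source)
	}
	return ds.Query(expr, from, to)
}

func main() {
	reg := Registry{}
	reg.Register(memorySource{name: "graphite", series: map[string][]Point{
		"cpu": {{1, 0.5}, {2, 0.7}, {3, 0.9}},
	}})
	pts, err := reg.Query("graphite", "cpu", 2, 3)
	fmt.Println(len(pts), err)
}
```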
How did you go about designing that plugin system in terms of where the code was hosted and what languages you were going to support? If you want it to have approved versions or things like that; that sounds like a minefield.
The cool thing with the backend plugins is that you can write them in Go. This is important if you want to have a data source that supports alerting, for example, or that talks to hardware directly to read sensor data. The backend architecture for the plugins is platform agnostic. It’s using gRPC and the HashiCorp go-plugin library. You can write backend plugins in almost any language you want. All the big supported languages work because the API to talk to backend plugins is a remote protocol that is cross-platform, built on the gRPC framework.
Are there any decisions you would have done differently having your experience in terms of that system? It sounds like there are many options that you can go for. That’s one of those designs where you’re going to end up pissing a few people off or they’re going to say, “This isn’t how it should be.”
Overall, the choices we’ve done in the plugin architecture have worked out surprisingly well. One of the things that has always been a big focus in Grafana has been backward compatibility, both with dashboards and plugins. When you update Grafana, your dashboard should still work and they should look the same. That part has mostly been true. Occasionally, we have changed some features or removed some features or broken some things intentionally or unintentionally but those remain minor. You can import a dashboard from one of the first versions of Grafana into Grafana 7. It will migrate through all the versions of that dashboard.
It’s the equivalent of running the Microsoft Calculator from Windows 2.
We have migrations for every version. It will migrate it from this version to the next. That worked well. We try hard not to break plugins without reason. The plugin architecture has been relatively stable for Angular plugins. With the React plugins, we have been much more ambitious in having published packages and a much bigger API surface with a ton of React components and interfaces. It’s a lot easier to have subtle breaking changes.
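The "migrate from this version to the next" approach can be sketched as a chain of small upgrade steps applied in order. The Go code below is an illustration under stated assumptions, not Grafana's implementation: the `schemaVersion` field name is borrowed from Grafana's dashboard JSON, but the `Dashboard` type, the `Migrate` function, and both migration steps are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// Dashboard is a loosely typed dashboard document.
type Dashboard map[string]interface{}

// migrations[i] upgrades a dashboard from schema version i+1 to i+2.
// Both steps are invented for this example.
var migrations = []func(Dashboard){
	// v1 -> v2: rename the "name" field to "title".
	func(d Dashboard) {
		if v, ok := d["name"]; ok {
			d["title"] = v
			delete(d, "name")
		}
	},
	// v2 -> v3: add a default refresh interval if none is set.
	func(d Dashboard) {
		if _, ok := d["refresh"]; !ok {
			d["refresh"] = "30s"
		}
	},
}

// latestVersion is the schema version reached after all migrations have run.
func latestVersion() int { return len(migrations) + 1 }

// Migrate applies every step between the dashboard's version and the latest,
// one version at a time, so an old dashboard passes through each format.
func Migrate(d Dashboard) error {
	v, ok := d["schemaVersion"].(int)
	if !ok || v < 1 || v > latestVersion() {
		return errors.New("missing or unsupported schemaVersion")
	}
	for ; v < latestVersion(); v++ {
		migrations[v-1](d) // upgrade from version v to v+1
		d["schemaVersion"] = v + 1
	}
	return nil
}

func main() {
	old := Dashboard{"schemaVersion": 1, "name": "Production overview"}
	if err := Migrate(old); err != nil {
		fmt.Println("migration failed:", err)
		return
	}
	// prints: 3 Production overview 30s
	fmt.Println(old["schemaVersion"], old["title"], old["refresh"])
}
```

Because each step only knows about its own pair of versions, supporting a new format means appending one function rather than revisiting the old ones.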
React has been moving pretty fast as well for quite a while.
A lot of our UI components and some other data infrastructure and plugin interfaces are much bigger and more complex than they were for the Angular versions. It’s been a challenge getting those interfaces to be stable, those components to be stable, and to have more of a stable plugin platform. That took almost a year between version six, when we started these React plugin packages, and version seven, when we said, “It’s better now. You can start using these components without them breaking every minor release.”
In terms of telemetry, are you guys getting telemetry on self-hosted installs?
Yeah. We have phone-home usage and completely anonymized usage reports that say, “Grafana server running version X.” Those reports, you can disable them. They are on by default. That’s one of the primary ways we can gauge the growth of the project and how many are installing and running Grafana. It’s been useful as a metric for success.
Do they report the version they’re running?
Yeah. We can see the update trends and how many are running specific versions.
What’s next for your project? You’ve done two raises fairly close together. You’ve got a lot of opportunities to go in different directions. Is that fair to say?
Yeah. The focus for the company is around Grafana Cloud and focusing a little bit more on our Cloud services. At the same time, we’re also bringing out new enterprise products in our metrics enterprise solutions, such as our on-premises database for Prometheus metrics called Grafana Metrics Enterprise. We’re doing the same for logs soon. We’re investing in that area as well. A lot of our other efforts are going to the Cloud to make the Cloud the best place to run Grafana and the best place to get started with Grafana. That’s the short-term focus.
For Grafana, we’re about to release v7.4. After that, it’s another big version again. We’ve been working on version eight for some time already but that’s going to be the big focus in the coming months. Some awesome plans there for a much more expanded range of visualization capabilities to be able to take advantage of high resolution streaming and more analytical sides. Not only supporting simple time-series graphs but other types of scatter plots and more on the BI side maybe. A full host of graphing panels that take advantage of our new architecture.
We’re getting some interest on the enterprise side. Quite often, it’s a bit of an iceberg in terms of working with enterprises, deploying your software within the enterprise. How have you gone around that? Are you guys deploying to Kubernetes and OpenShift all the time?
Our own hosted services are all running on Kubernetes. For our enterprise on-premises installations, we have no control. We provide the binaries, the Docker images, and packages such as Linux (Ubuntu) packages and Windows installers. The whole gamut there is available for an enterprise to decide how they run Grafana. This has worked fairly well because Grafana is easy to set up and run. It would have been different if it were split up into a microservice architecture where running Grafana requires you to spin up multiple services that talk to different databases and things. Then we’d have to maybe have a more complex environment to run it on-prem. So far, on-prem Grafana is super easy to run.
You can lean into that simplicity. Your architecture is like one box.
It has its disadvantages. When you want to run Grafana in a Cloud setting, you want the microservice architecture and you want fine-grained deployment models. From an open source perspective, where you want it to be super easy to install and you want to optimize for adoption and ease of getting started for an on-premises open source consumer and user, you need a single-binary, no-dependency install. That’s what I designed Grafana to be. That’s why I used Go. That’s why it supports SQLite, so you don’t have to install a SQL database to run it. A lot of decisions early on were optimized for adoption and ease of getting started with it.
Torkel, thank you so much for your time. I appreciate it. Your software is running on my Raspberry Pi that’s under my TV. It’s been pretty faultless for a couple of years. I want to thank you for that.
This was fun.
Thank you. Good luck in the future. We’d like to see what happens, maybe light mode.
There is a light mode. There’s enterprise mode. On user preference, you can have a happier light theme in Grafana.
I like the dark one. Have a good day. Thank you for your time.
About Torkel Ödegaard
Currently an open source entrepreneur, working full time on development of the popular Grafana open source metrics and analytics platform. Leading the development of the project and a core team of 20+ people. Using technologies like Go, React, AngularJS, Graphite, Docker & Kubernetes.
Available for talks, coaching, and workshops on:
- Grafana and live & trend monitoring
- Continuous integration & Automated deployment
- Time series databases
- Application metrics & instrumentation