Observability as Code: A Guide to Managing Monitors with Terraform and GitOps for Datadog (Part 1)

Faceless Nomad
5 min read · Mar 16, 2023

Observability is crucial for ensuring the reliability and performance of modern software systems. Datadog is a popular observability platform that enables teams to monitor and analyze metrics, logs, and traces from their applications and infrastructure. However, setting up and configuring Datadog resources can be challenging, especially when dealing with distributed systems and microservices architectures.

In this article, we will explore how to create an Observability as Code (OAC) pipeline for Datadog using Terraform and GitOps. We will cover the key concepts and components of OAC, including the Datadog provider for Terraform, and demonstrate how to automate the deployment and configuration of Datadog monitors or alerts using Terraform. We will also show how GitOps can be used to manage changes to the OAC pipeline, ensuring consistency and traceability across different environments.

By the end of this article, you will have a solid understanding of how to implement OAC for Datadog with Terraform and GitOps, enabling you to monitor and optimize your applications and infrastructure with ease.

Prerequisites:

  1. Terraform installed on your local machine (just so you can test it beforehand)
  2. Github account
  3. Datadog account (Datadog API and application keys)

This article assumes that you have a running instance of Datadog working already.

Let’s get to Terraforming!

The first step, of course, is to Terraform the creation of Datadog resources. In this case it will be monitors, but you will see later on that these same steps can be replicated to create other resources.

Just in case you don’t know, Terraform is an open-source infrastructure as code tool used to define and manage infrastructure resources in a declarative way. It allows developers to write code to create, modify, and delete infrastructure resources, such as virtual machines, load balancers, and databases, across multiple cloud providers and on-premises environments. In other words, no more bash commands nor UI for managing infrastructure or other resources.

This will be the directory tree:

.
└── Observability as Code/
    ├── deployment/
    │   ├── vars/
    │   │   ├── monitors.yaml
    │   │   ├── slo.yaml
    │   │   ├── dashboards.yaml
    │   │   └── downtimes.yaml
    │   ├── main.tf
    │   ├── providers.tf
    │   ├── variables.tf
    │   └── version.tf
    └── modules/
        ├── monitors/
        │   ├── main.tf
        │   ├── variables.tf
        │   └── version.tf
        ├── slo/
        ├── dashboards/
        └── downtimes/

Note that there are four subfolders inside the modules folder. We will not touch the slo, dashboards, or downtimes folders; they are there just to show that you can have lots of modules, all of which will be called later on by the main.tf file in the deployment folder.

Module folder

The main.tf file will be like this:

Terraform main file that defines the module “monitors”
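A minimal sketch of what this module’s main.tf could look like, assuming the module exposes variables such as name, type, query, message, and thresholds (the variable names are assumptions; the resource attributes come from the Datadog provider’s datadog_monitor resource):

```hcl
# modules/monitors/main.tf (sketch; variable names are assumptions)
resource "datadog_monitor" "this" {
  name    = var.name
  type    = var.type
  query   = var.query
  message = var.message
  tags    = var.tags

  notify_no_data    = var.notify_no_data
  renotify_interval = var.renotify_interval

  monitor_thresholds {
    warning  = var.warning_threshold
    critical = var.critical_threshold
  }
}
```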

You’ll notice lots of variables here. And that’s because we don’t want everything set up in stone and we will explicitly define them later in this tutorial.

The variables.tf file will then be like this (please remember to write validations for the variables that need them):

Terraform variables file for the module “monitors”
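Since these variables mirror the datadog_monitor attributes, a sketch of the file could look like this (variable names, defaults, and the validation are assumptions):

```hcl
# modules/monitors/variables.tf (sketch; names and defaults are assumptions)
variable "name" {
  type        = string
  description = "Name displayed for the monitor in Datadog."
}

variable "type" {
  type        = string
  description = "Monitor type, e.g. \"metric alert\" or \"rum alert\"."

  validation {
    condition     = contains(["metric alert", "query alert", "rum alert"], var.type)
    error_message = "Unsupported monitor type."
  }
}

variable "query" {
  type        = string
  description = "The query that determines when the monitor triggers."
}

variable "message" {
  type        = string
  description = "Notification message, including @-handles."
}

variable "tags" {
  type    = list(string)
  default = []
}

variable "notify_no_data" {
  type    = bool
  default = false
}

variable "renotify_interval" {
  type    = number
  default = 0
}

variable "warning_threshold" {
  type    = number
  default = null
}

variable "critical_threshold" {
  type = number
}
```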

And last but not least, the version.tf file:
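A sketch of that file, pinning the provider the module uses (the version constraint is an assumption; the provider source DataDog/datadog is the official one):

```hcl
# modules/monitors/version.tf (sketch; version constraints are assumptions)
terraform {
  required_version = ">= 1.0"

  required_providers {
    datadog = {
      source  = "DataDog/datadog"
      version = "~> 3.20"
    }
  }
}
```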

Deployment folder

You’ll notice there is a folder called vars here. This folder will contain a bunch of .yaml files with the definition of multiple alerts. Why yaml? Because I like it and you should too. It’s way easier to read and maintain.

Let’s check the main.tf file first:

Terraform main.tf for deployment folder
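A sketch of how this file could look, loading the YAML definitions with yamldecode and wiring each value into the module (the variable names passed through are assumptions matching the module sketch):

```hcl
# deployment/main.tf (sketch; the forwarded variable names are assumptions)
locals {
  # Decode every monitor definition in vars/monitors.yaml into a map.
  monitors = yamldecode(file("${path.module}/vars/monitors.yaml"))
}

module "monitor" {
  source   = "../modules/monitors"
  for_each = local.monitors

  name               = each.value.name
  type               = each.value.type
  query              = each.value.query
  message            = each.value.message
  critical_threshold = each.value.critical_threshold
}
```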

First, we use a locals block to define a variable called monitors that will contain the monitors declared in a specific file inside the vars folder, in this case monitors.yaml (remember, we will not touch any file aside from monitors.yaml; the others are there just so you can see that you can have more than one).

Then, we define a module called monitor (or whatever name you want to give it). In this block, you’ll need the source and for_each arguments. The source argument takes the path of the module, and the for_each argument iterates over the local.monitors variable you defined earlier, creating a resource for each set of values.

Remember the variables you defined in the module? Now you’ll have to tell the module monitor block which values those variables will get. That’s why, for each one of the variables, you’ll have to write:
nameOfTheVariable = each.value.nameOfTheVariable

Let’s see the variables.tf file of the deployment folder now:

variables.tf file for deployment
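A sketch of what these deployment-level variables could look like, assuming the provider credentials are the only inputs (the variable names are assumptions):

```hcl
# deployment/variables.tf (sketch; variable names are assumptions)
variable "datadog_api_key" {
  type        = string
  description = "Datadog API key."
  sensitive   = true
}

variable "datadog_app_key" {
  type        = string
  description = "Datadog application key."
  sensitive   = true
}
```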

These variables should be defined in the CI/CD variables of your tool of choice, prefixing the variable names with TF_VAR_. Otherwise you’ll have to define them in a terraform.tfvars file.
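For example, Terraform picks up any environment variable prefixed with TF_VAR_ (the variable names here are assumptions, and the values are placeholders):

```shell
# Terraform reads TF_VAR_* environment variables as input variables.
# Variable names are assumptions; values are placeholders, not real keys.
export TF_VAR_datadog_api_key="your-datadog-api-key"
export TF_VAR_datadog_app_key="your-datadog-app-key"
```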

providers.tf file:

providers.tf file for deployment
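A sketch of the provider configuration, passing the credentials from the variables above (api_key and app_key are the Datadog provider’s real arguments):

```hcl
# deployment/providers.tf (sketch)
provider "datadog" {
  api_key = var.datadog_api_key
  app_key = var.datadog_app_key
}
```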

version.tf file:

Note! This will save your .tfstate file on your local machine. You should save that state in a remote bucket instead, for example a GCS bucket on GCP. You’ll find a way to do that in this tutorial.
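A sketch of the deployment version.tf, with a commented-out gcs backend block you could enable to keep the state remote (the bucket name is a placeholder; the version constraints are assumptions):

```hcl
# deployment/version.tf (sketch; version constraints are assumptions)
terraform {
  required_version = ">= 1.0"

  required_providers {
    datadog = {
      source  = "DataDog/datadog"
      version = "~> 3.20"
    }
  }

  # Uncomment to store state remotely in a GCS bucket instead of locally.
  # backend "gcs" {
  #   bucket = "your-terraform-state-bucket"   # placeholder
  #   prefix = "observability-as-code"
  # }
}
```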

The last part of the deployment is writing the YAML file with the definitions of the monitors to be created. Let’s create two of them.

The first monitor resource, called awesome_monitor_number_one, will get the average system load over the past 5 minutes, aggregated every 1 minute (because that’s what the system.load.1 metric does), and if it surpasses the threshold of 80% it will trigger the alert. It will send a notification to whichever person or channel you specify in the message value, for example @Slack-your-slack-channel.

The same goes for the second one, except this one gets its data from Real User Monitoring. It counts all sessions in the android service for the production environment, and if those sessions total fewer than 10 over the past 5 minutes, it triggers an alert.
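Based on the descriptions above, vars/monitors.yaml could look like this (the queries, thresholds, and the Slack handle are illustrative assumptions, and the keys must match whatever your module variables expect):

```yaml
# vars/monitors.yaml (sketch; queries and thresholds are illustrative)
awesome_monitor_number_one:
  name: "High average system load"
  type: "metric alert"
  query: "avg(last_5m):avg:system.load.1{*} > 0.8"
  message: "System load is above the threshold. @Slack-your-slack-channel"
  critical_threshold: 0.8

awesome_monitor_number_two:
  name: "Low RUM session count on Android"
  type: "rum alert"
  query: 'rum("@type:session service:android env:production").rollup("count").last("5m") < 10'
  message: "Android production sessions dropped below 10. @Slack-your-slack-channel"
  critical_threshold: 10
```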

Now we can run the command terraform plan inside the deployment folder to check what this Terraform will do, and after that you’ll see something like this:

Terraform message after running “terraform plan”

Only then should you run terraform apply -auto-approve, and you’ll see this:

Terraform message after running “terraform apply -auto-approve”

Next steps

In the next part of the article we will be creating a Github Actions pipeline to deploy the monitors we have created to your Datadog account.

Read it here!

And you can see the whole code here!

Share, follow or comment?

Hey there! Share, follow, or drop a line? Got something specific in mind? Missing a sprinkle of magic? Think I missed the mark or could spice things up? Dive into the comments and let’s stir up some conversation!

I’m all about honest feedback to keep the learning journey spicy and exciting. And hey, if you enjoyed the ride, hit that follow button (it’s a freebie) and sprinkle some thumbs-up magic (limit’s 50, let’s not hold back)!

