Datadog agent checks are not just for collecting metrics; they are the primary mechanism for integrating external systems into Datadog’s monitoring and alerting.

Here’s a simple Python check that monitors a local HTTP server, reporting its status and latency.

# my_http_check.py
import time

import requests
from datadog_checks.base import AgentCheck
from datadog_checks.base.errors import CheckException


class MyHttpCheck(AgentCheck):
    def check(self, instance):
        url = instance.get("url")
        if not url:
            raise CheckException("No 'url' defined in instance configuration.")

        # Tag every metric with the URL so multiple instances can be told apart.
        tags = ["url:%s" % url]

        # A monotonic clock (Python 3) is not skewed by system clock adjustments.
        start_time = time.monotonic()
        try:
            response = requests.get(url, timeout=5)
            response.raise_for_status()  # Raises HTTPError for 4xx and 5xx responses
            elapsed_time = time.monotonic() - start_time

            self.gauge("my_http_check.response_time", elapsed_time, tags=tags)
            self.increment("my_http_check.up", tags=tags)
            self.log.info("Successfully connected to %s in %.2f seconds.", url, elapsed_time)

        except requests.exceptions.Timeout:
            # Timeout is a subclass of RequestException, so it must be caught first.
            self.increment("my_http_check.down", tags=tags)
            self.log.error("Request to %s timed out.", url)
            raise CheckException("Request to %s timed out." % url)
        except requests.exceptions.RequestException as e:
            self.increment("my_http_check.down", tags=tags)
            self.log.error("Error connecting to %s: %s", url, e)
            raise CheckException(str(e))

To use this, you’d place my_http_check.py in your Datadog agent’s checks.d directory (e.g., /etc/datadog-agent/checks.d/). Then, create a configuration file in conf.d (e.g., /etc/datadog-agent/conf.d/my_http_check.d/conf.yaml):

init_config:

instances:
  - url: http://localhost:8000

After restarting the Datadog agent (for example, sudo systemctl restart datadog-agent on a systemd-based Linux host), this check will run every 15 seconds by default. Each run sends my_http_check.response_time and my_http_check.up on success, or my_http_check.down on failure. The instance dictionary passed to the check method is populated directly from one entry of the instances list in your conf.yaml.

The core of any custom check is the check method, which the agent calls periodically and which holds your custom logic. This HTTP check uses the requests library to make a GET request, measures elapsed_time as the latency, and calls response.raise_for_status() so that 4xx and 5xx responses raise an exception instead of passing silently. The self.gauge() method reports a value that can rise or fall between runs (like response time), while self.increment() adds one to a counter each time it is called (here, once per successful or failed request). Tags are crucial for filtering and grouping metrics in Datadog; tagging by url keeps each monitored endpoint’s metrics separate.
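To make the difference between the two submission styles concrete, here is a minimal sketch. MetricRecorder is a hypothetical stand-in written for this illustration, not the agent's aggregator; it only mimics the idea that a gauge keeps the latest value while a counter accumulates.

```python
from collections import defaultdict


class MetricRecorder:
    """Hypothetical stand-in for metric aggregation, for illustration only."""

    def __init__(self):
        self.gauges = {}
        self.counters = defaultdict(int)

    def gauge(self, name, value, tags=None):
        # A gauge keeps only the most recent value for each name/tag combination.
        self.gauges[(name, tuple(sorted(tags or [])))] = value

    def increment(self, name, value=1, tags=None):
        # A counter accumulates every submission for each name/tag combination.
        self.counters[(name, tuple(sorted(tags or [])))] += value


recorder = MetricRecorder()
tags = ["url:http://localhost:8000"]

# Two check runs: the gauge is overwritten, the counter adds up.
recorder.gauge("my_http_check.response_time", 0.12, tags=tags)
recorder.gauge("my_http_check.response_time", 0.30, tags=tags)
recorder.increment("my_http_check.up", tags=tags)
recorder.increment("my_http_check.up", tags=tags)

key = ("url:http://localhost:8000",)
print(recorder.gauges[("my_http_check.response_time", key)])  # 0.3
print(recorder.counters[("my_http_check.up", key)])  # 2
```

Because the tags are part of the key, a second URL would get its own gauge and counter, which is what makes filtering by tag possible.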

The check method can raise CheckException to signal a failure to the agent. This is important for alerting: the agent catches the exception, logs it, shows the failed run in the output of datadog-agent status, and reports its built-in datadog.agent.check_status service check as critical for this check. We catch the requests library’s Timeout first and then the more general RequestException (its parent class) to provide detailed error messages and to ensure the my_http_check.down metric is incremented in both cases. The self.log object works like a logger from Python’s standard logging module, sending messages to the agent’s logs.
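Besides raising an exception, a check can report health explicitly through self.service_check(). The sketch below shows one way to turn a request outcome into a status and message. The probe helper and the fetch callable are hypothetical names introduced for this example; the numeric constants mirror AgentCheck.OK, WARNING, CRITICAL, and UNKNOWN.

```python
# Numeric values of AgentCheck.OK, WARNING, CRITICAL and UNKNOWN.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def probe(fetch, url):
    """Call fetch(url) and translate the outcome into (status, message).

    `fetch` is a hypothetical callable standing in for requests.get plus
    raise_for_status(); it is expected to raise on any failure.
    """
    try:
        fetch(url)
    except Exception as e:
        return CRITICAL, "%s: %s" % (type(e).__name__, e)
    return OK, ""


def healthy(url):
    # Simulates a request that succeeds.
    return None


def unreachable(url):
    # Simulates a request that fails to connect.
    raise ConnectionError("connection refused")


print(probe(healthy, "http://localhost:8000"))      # (0, '')
print(probe(unreachable, "http://localhost:8000"))  # (2, 'ConnectionError: connection refused')
```

Inside check, the result could then be submitted with self.service_check("my_http_check.can_connect", status, tags=tags, message=message), where the service check name is an assumption chosen for this example.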

When you define a new check, the agent matches it to its configuration by name: the module checks.d/<check_name>.py pairs with conf.d/<check_name>.d/conf.yaml, so my_http_check.py and the my_http_check.d directory must share the same base name. The init_config section holds configuration shared by every instance of the check, while instances lets you configure multiple, distinct runs of the same check. For example, you could list several instances in conf.yaml to monitor different URLs, each producing its own set of metrics tagged with its specific URL.

The Datadog Agent’s internal plumbing for custom checks involves a Python interpreter embedded within the agent process. When the agent starts, it reads the configuration files in conf.d, imports the matching check module from checks.d (such as the one defining MyHttpCheck), and creates one check object per entry in instances, passing it the init_config and that instance’s parameters. Each object is then added to a scheduler, which calls its check method at the configured interval (15 seconds by default, adjustable per instance with min_collection_interval). The metrics and events reported by the check method are then serialized and sent to the Datadog backend via the agent’s main communication channel. For more structured health reporting, a check can also submit service checks with self.service_check(); these are distinct from metrics.
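The lifecycle described above can be modeled in a few lines. This is a simplified sketch, not the agent's actual loader or scheduler: FakeCheck, build_checks, and run_once are hypothetical names, and the config dictionary stands in for a parsed conf.yaml with a second, made-up URL added to show multiple instances.

```python
class FakeCheck:
    """Hypothetical stand-in for an AgentCheck subclass; records what it was given."""

    def __init__(self, name, init_config, instances):
        self.name = name
        self.init_config = init_config
        self.instance = instances[0]
        self.runs = 0

    def check(self, instance):
        self.runs += 1
        return "checked %s" % instance["url"]


def build_checks(check_class, name, config):
    # One check object per entry in `instances`, each receiving the shared init_config.
    init_config = config.get("init_config") or {}
    return [check_class(name, init_config, [inst]) for inst in config.get("instances", [])]


def run_once(checks):
    # A real scheduler would repeat this at each check's collection interval.
    return [c.check(c.instance) for c in checks]


config = {
    "init_config": None,  # an empty `init_config:` key in YAML parses as None
    "instances": [
        {"url": "http://localhost:8000"},
        {"url": "http://localhost:9000"},
    ],
}

checks = build_checks(FakeCheck, "my_http_check", config)
results = run_once(checks)
print(results)  # ['checked http://localhost:8000', 'checked http://localhost:9000']
```

Each instance gets its own object and its own call to check, which is why per-instance tags are enough to keep the resulting metrics apart.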

The next step is often to explore how to configure alerts based on these custom metrics within the Datadog UI.
