PostgreSQL Monitoring with Datadog


If you’ve managed PostgreSQL databases in the past, you know how important it is to have proper monitoring in place for them. One bad query at the wrong volume or time of day can cascade into a plethora of issues: crippling CPU utilization on your database server, or exhaustion of the available database connections.

If you’re already using Datadog to monitor the rest of your infrastructure, leveraging their PostgreSQL monitoring integration can provide some great insights into what’s happening inside your databases.

In this blog, I’ll detail some of the useful metrics provided by Datadog’s PostgreSQL monitoring integration, as well as how to configure it.

Also, just to be clear, I am not partnered with Datadog or any of the vendors mentioned in this post. I’m just here to provide some tips that will help you extend your current implementation to PostgreSQL!

Metrics Worth Monitoring In PostgreSQL

Some of the key metrics in Postgres that I like to keep tabs on are:

  • Database connections
  • Tables being scanned
    • Also known as sequential scans in Postgres
  • How many rows are being scanned within a given table
    • Sequential scans can happen for benign reasons, but if you see a large number of rows scanned, you should look to refactor the code, add an index (where appropriate), etc.
  • Fragmented tables that need vacuuming/analyzing
    • Autovacuum in Postgres normally takes care of this, but tuning on a per-table basis is frequently required
    • If tables are highly fragmented, or autovacuum can’t clear out deleted rows or update statistics, database queries can slow to a crawl!
  • Transaction rollbacks
    • Rollbacks can be a natural part of your application, but if you identify deviations from your baseline, there could be an underlying problem that requires investigation!
  • Databases or individual tables that could grow out of hand

    • I like to catch subtle application issues by monitoring the size of tables via Datadog’s metrics. Maybe a cleanup cron job isn’t working, or we’re not cleaning up old records for some reason.

These are just some of the metrics I tend to monitor within Postgres. Datadog’s integration checks all of these boxes. We’ll cover a more comprehensive list of useful metrics later on in this piece.
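For example, here’s a toy sketch (plain Python, hypothetical numbers — not a Datadog feature) of the kind of fragmentation check the dead-row metrics enable:

```python
def dead_row_ratio(live_rows: int, dead_rows: int) -> float:
    """Fraction of a table's rows that are dead and awaiting vacuum."""
    total = live_rows + dead_rows
    return dead_rows / total if total else 0.0

# Hypothetical table stats, e.g. as reported by pg_stat_user_tables.
ratio = dead_row_ratio(live_rows=900_000, dead_rows=300_000)
if ratio > 0.2:  # arbitrary threshold; tune per table
    print(f"Consider a manual VACUUM ANALYZE ({ratio:.0%} dead rows)")
```

In practice you’d alert on the equivalent Datadog metric rather than compute this yourself, but the threshold-per-table idea carries over directly.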


If you’re hosting your database server on-prem or on virtual machines, you can couple these metrics with the foundational system metrics provided by the Datadog agent to build a solid base for database monitoring.

Are you using Amazon RDS or Google Cloud SQL? I’d highly recommend integrating your AWS or GCP account into Datadog by providing them with a read-only account to pull metrics with. Once this information is pulled in, you can combine the Stackdriver or CloudWatch metrics for your infrastructure with your Postgres integration’s metrics to build a holistic monitoring suite for your backend Postgres instances.

Step One – Create User For Datadog

As per best practices, you should create a dedicated monitoring database user. This helps with tracking down connections and performance issues should anything pop up.

For Postgres 10+:

CREATE USER datadog_monitoring WITH PASSWORD '';
GRANT pg_monitor TO datadog_monitoring;

Postgres Versions < 10:

CREATE USER datadog_monitoring WITH PASSWORD '';
GRANT SELECT ON pg_stat_database TO datadog_monitoring;

-- You also need a special definer since the pg_monitor role isn't available in PG < 10 :(
CREATE FUNCTION pg_stat_activity() RETURNS SETOF pg_catalog.pg_stat_activity AS
$$ SELECT * FROM pg_catalog.pg_stat_activity; $$
LANGUAGE sql VOLATILE SECURITY DEFINER;

CREATE VIEW pg_stat_activity_dd AS SELECT * FROM pg_stat_activity();
GRANT SELECT ON pg_stat_activity_dd TO datadog_monitoring;

Step Two – Create Your Monitoring YAML File(s)

If you didn’t already know, Datadog automatically loads YAML files placed under the agent’s conf.d/ directory when the agent starts up.

Here’s the directory I put my Datadog Postgres monitoring YAML files in:

/etc/datadog-agent/conf.d/postgres.d/

Side Note: Configuration Management

Since I use Terraform to manage multiple environments, I opted to dynamically build my YAML config files based on outputs from Terraform modules (i.e. RDS instances), but you can manually configure yours if it suits you.
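As a minimal sketch of that dynamic approach, here’s how you might render an instance stanza from per-database values using Python’s stdlib `string.Template` (the database values below are hypothetical placeholders, not Terraform itself):

```python
from string import Template

# Hypothetical per-database values, e.g. gathered from Terraform module outputs.
databases = [
    {"database_username": "datadog_monitoring",
     "database_user_password": "s3cret",
     "database_name": "orders",
     "environment": "prod",
     "project": "shop"},
]

# One YAML stanza per database instance; ${...} matches Template syntax.
instance_tpl = Template("""\
  - host: localhost
    port: 5432
    username: ${database_username}
    password: ${database_user_password}
    dbname: ${database_name}
    collect_activity_metrics: true
    tags:
      - environment:${environment}
      - project:${project}
      - db:${database_name}
""")

config = "init_config:\n\ninstances:\n" + "".join(
    instance_tpl.substitute(db) for db in databases
)
print(config)
```

A real pipeline would write `config` out to conf.d/postgres.d/ as part of provisioning; the same substitution idea applies whether the values come from Terraform outputs, Puppet, or Ansible facts.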

Using another Configuration Management tool like Puppet or Ansible would work great too; that’s another route I’ve taken.

Datadog Postgres YAML Config File

init_config:

instances:
  # X Database
  - host: localhost
    port: 5432
    username: ${database_username}
    password: ${database_user_password}
    dbname: ${database_name}
    # This allows Datadog to pull statement/transaction metrics
    collect_activity_metrics: true
    # Tags are a critical concept in Datadog to enable monitor grouping/filtering, etc.
    tags:
      - environment:${environment}
      - project:${project}
      - db:${database_name}
    # Optionally, whitelist only certain relations you want to collect metrics for
    relations:
      - table_name_1
      - table_name_2

  # Y Database
  - host: localhost
    port: 5432
    username: ${database_username}
    password: ${database_user_password}
    dbname: ${database_name_2}
    # This allows Datadog to pull statement/transaction metrics
    collect_activity_metrics: true
    # Tags are a critical concept in Datadog to enable monitor grouping/filtering, etc.
    tags:
      - environment:${environment}
      - project:${project}
      - db:${database_name_2}
    # Optionally, whitelist only certain relations you want to collect metrics for
    relations:
      - table_name_1
      - table_name_2

When I first implemented Postgres monitoring in Datadog, I had to specify all of the relations I wanted to monitor, but their latest documentation suggests that all are included by default. If you find this not to be the case, try specifying the relations list as detailed above.

After getting the config files deployed to your environment, restart the Datadog agent.

If you’re on Version 6 of the Datadog Agent, you can issue this command to confirm that the integration is working:

datadog-agent status

Datadog Metrics Of Interest For Postgres:

Here’s a quick list of some of the metrics reaped from the Datadog Postgres integration that I’ve found valuable:

  • postgresql.connections
    • Number of active connections opened in the database
  • postgresql.rollbacks
    • Number of transactions that were rolled back in the database
  • postgresql.rows_returned
    • Number of rows returned by queries in the database
  • postgresql.rows_inserted
    • Number of rows inserted by queries in the database
  • postgresql.rows_updated
    • Number of rows updated by queries in the database
  • postgresql.database_size
    • Size of the database, in bytes (Datadog will display this as GiB, MiB, etc. in graphs)
  • postgresql.bgwriter.checkpoints_requested
    • Number of checkpoints that were forced, rather than happening organically
  • postgresql.locks
    • Number of locks currently in the database
  • postgresql.seq_scans
    • Number of table scans happening in the table/database
  • postgresql.seq_rows_read
    • Number of rows scanned in the table/database
  • postgresql.index_scans
    • Number of index scans performed on table
  • postgresql.index_rows_fetched
    • Number of live rows fetched by index scans performed in table
  • postgresql.dead_rows
    • Number of dead rows awaiting vacuuming in a table
  • postgresql.table_size
    • Size of the table in bytes (also scaled to GiB/MiB in Datadog graphs)
  • postgresql.index_size
    • Size of the target index
  • postgresql.max_connections
    • Maximum number of non-super user connections supported
  • postgresql.percent_usage_connections
    • Percentage of connections used, relative to max_connections
  • postgresql.replication_delay
    • Current seconds of experienced replication delay
  • postgresql.transactions.open
    • Number of transactions that are in an open state within the target database
  • postgresql.active_queries
    • Number of queries in an active state within the database
  • postgresql.waiting_queries
    • Number of queries waiting to become active
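To illustrate how a couple of these relate, here’s a toy sketch (plain Python, hypothetical values) of the ratio behind postgresql.percent_usage_connections:

```python
def connection_usage(connections: int, max_connections: int) -> float:
    """Fraction of available connections in use,
    i.e. connections relative to max_connections."""
    return connections / max_connections

# Hypothetical sample: 180 open connections against max_connections = 200.
usage = connection_usage(180, 200)
print(f"{usage:.0%}")  # → 90%
```

Alerting when this ratio approaches 100% gives you time to react before Postgres starts refusing new connections.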

Quick-Tip For Automatic Tags:

Datadog automatically adds tags to some of the data points, so be sure to tinker with averaging by table, database, etc.

False Alarms? No thanks; Enter Actionable Monitoring

I become SO fired up talking about monitoring; it’s a really interesting topic. With that being said, we can all go a little overboard with our monitoring if we’re not careful.

One of the things I’ve tried to do over my career is to partition metrics into three categories:

  • Actionable (Active)
    • Things that need to be addressed right meow (did you just say meow?)
    • Examples:
      • Database server crashed or CPU is pegged at 100%
      • Database server is nearing max_connections
      • A mission-critical application is offline or degraded
      • A user-facing web service is experiencing high error rates
  • Passive/Investigative
    • These are things you might want to keep tabs on but aren’t worth waking up at night for. Can it wait until Monday?
  • Dashboard-Worthy
    • If the – you know what – hits the fan, it would be nice to see these metrics as part of a dashboard.
    • Examples:
      • Number of rows scanned per table/database
      • Database connections, grouped by username
      • API calls, grouped by route and/or method

I’ve found that taking this approach from the beginning helps you choose the appropriate notification medium for each metric’s behavior.

Let’s be honest – we don’t like waking up in the middle of the night to answer PagerDuty or VictorOps alerts, but it’s part of the job. What we can do to make it better is to only push the big red alarm button when immediate action is necessary. Passive alerts can go to a dedicated Slack channel for DevOps/SRE to observe and investigate as needed.

--

Clouductivity - Marcus Bastian About the Author

Marcus Bastian is a Senior Site Reliability Engineer who works with AWS and Google Cloud customers to implement, automate, and improve the performance of their applications.

Find his LinkedIn here.
