So you’ve launched a new application in production, and your applications are performing snappily as you might expect. The cloud is infinitely scalable, right? A few months or a year down the road, you start noticing seemingly random slowdowns for Postgres queries that normally execute quickly. CPU and memory utilization look fine, so you start digging into the queries. The queries execute just fine in the staging environment, so you’re bewildered.
Scaling Postgres in an on-premise data center or in the cloud have their similarities and differences. One of the critical choke points in the world of Postgres is the management of open connections.
In this article, we’re going to cover some of the issues you might run into while scaling up the demand for your Postgres backend. More importantly, I’m going to detail some of the ways you can mitigate them.
In the past, prior to really dealing too much with Postgres, I – to a certain extent – understood that an RDBMS like MySQL had its limits for the number of practical open connections. Most of the time, the solution would be to increase max_connections and make sure that the system had enough memory to serve all of the per-connection buffers; not a trivial task, but somewhat easy to mitigate.
In my experiences working as an SRE at Malwarebytes, I learned how vastly different MySQL and Postgres deal with connections.
As you may know, in the standard MySQL configuration, one thread is created for each connection. In other words, the number of processes used is somewhat minimal.
In contrast, Postgres follows a process-per-user model. Because of this, creating new connections too rapidly or simply having too many connections open burdens the system more significantly.
The days of large monolithic application servers are sunsetting, and the new world of virtualized, containerized, autoscaling and load-balanced cloud applications is here. While the infrastructure’s instance counts were static, it was somewhat easy to manage connections, but with elastic infrastructure, careful considerations must be made to ensure that as the apps scale out, they don’t push Postgres to the point of no return.
While helping debug a production issue as a Senior SRE at Malwarebytes, I noticed that queries that ran quickly before were slowing down dramatically. The application didn’t yet have connection pooling implemented, so I looked into the number of connections opened to Postgres. We weren’t anywhere near our max connections, but everything was slowing down. CPU was getting fairly high at the time as well. The demand for connections kept growing as queries were retried and the client application that was making requests to our Postgres-backed API. This exacerbated the issue further.
The development team hot-fixed the connection pooling issue and I was able to effectively manage the number of connections allocated per application container.
This bought us some time until our applications needed even more connections. We were scaling up the number of clients rapidly.
There’s always a balance to be struck between allocating too many connections for an application and starving it. I scaled up the RDS master instance (no replicas at this point), which seemed to give us some breathing room, but it still kicked the can further down the road.
So now that we understand the problem with opening too many connections, let’s highlight some mitigation techniques.
In my experience, PgBouncer appears to perform extremely efficiently. While running load tests against one of the applications I manage (2000 rq/s) that’s backed by PgBouncer, the PgBouncer container only used 8 MB of memory and very minimal CPU.
One of the most critical things to consider if you’re looking towards an external connection pooling software is the transaction isolation method. This effectively dictates how well PgBouncer is able to reduce the total number of backend connections required. I tend to shoot for the ‘transaction’ pooling mode, as it will schedule inbound queries on a per-transaction basis towards the available backend connections. This makes it possible to serve 100 connections from your application across, say, 10 connections to the backend. Please see their documentation for a deeper dive on the available pooling modes.
First, you’ll need to configure/bootstrap PgBouncer to:
Here’s an example config file that I used in my Dockerized version of PgBouncer:
[databases] $DATABASE_NAME = host=$DATABASE_HOST dbname=$DATABASE_NAME port=$DATABASE_PORT [pgbouncer] ignore_startup_parameters = client_min_messages # Listener config listen_port = 6432 listen_addr = * # Additional networking specs listen_backlog = 4096 # Authentication auth_type = md5 auth_file = userslist.txt # Stats / admin users for PGBouncer admin_users = $DATABASE_USERNAME stats_users = $STATS_COLLECTOR_USER # File management logfile = pgbouncer.log pidfile = pgbouncer.pid log_pooler_errors = 1 # Controls behavior of pooling mechanism pool_mode = $PGBOUNCER_POOL_MODE # Define the range of the pool default_pool_size = $PGBOUNCER_POOL_SIZE_MAX max_client_conn = $PGBOUNCER_MAX_CLIENT_CONNECTIONS max_user_connections = $PGBOUNCER_POOL_SIZE_MAX min_pool_size = $PGBOUNCER_POOL_SIZE_MIN max_db_connections = $PGBOUNCER_POOL_SIZE_MAX reserve_pool_size = $PGBOUNCER_POOL_ADDITIONAL_RESERVED # Timeouts server_idle_timeout = $PGBOUNCER_BACKEND_IDLE_TIMEOUT
Since high availability is usually the goal, you’ll want at least 2 instances of PgBouncer running. Try to keep the total number of connections to the backend below 100 if possible, provided that you’re using transaction pooling mode.
Remember that each instance of PgBouncer will be able to serve that number of requests, so if you had two instances of PgBouncer…
100 / 2 = 50 max connections per PgBouncer
4 instances of PgBouncer:
100 / 4 = 25 max connections per PgBouncer
The auth_file was one of the most confusing parts about PgBouncer for me personally. Effectively, you need to generate a hashed version of the password and username and store it in the userslist.txt file. This ensures that requests to PgBouncer are authenticated and enables you to connect to the backend database. The passwords between userlist.txt and the backend RDS/CloudSQL instance should match up.
Here’s a bash function that you can use to generate this ad-hoc or in an automated fashion:
generate_pg_hash_auth_token() {
local user=$1
local pass=$2
echo "Building PG user MD5 for user $user in auth file..."
pg_auth_md5=$(echo -n "$pass$user" | md5sum | awk '{ printf "md5%s", $1}')
echo "\"$user\" \"$pg_auth_md5\"" >> userslist.txt
}
So now that we’ve covered how PgBouncer works, let’s talk about what it would take to get it running.
Let’s say, for the sake of simplicity, that you want to use EC2 instances to host your PgBouncer (basically your new reverse proxy for database connections).
Get your instances rolled out and confirm that, from the PGBouncer instances, you’re able to connect to your RDS instance.
Create a DNS record for your PgBouncer instances (or to a TCP load balancer) and configure your applications to connect to that instead of the RDS instance’s DNS record. The RDS instance’s DNS record should already be in your PgBouncer’s database host configuration.
Of course, you’ll want to make sure all of your VPC and networking components allow access to the subnets and VPC security groups used by your RDS instance. This includes VPC routes, network ACLs, security groups and so on.
By now, if everything checks out, your applications should now be using PgBouncer to connect to your RDS instance.
If you’re using CloudSQL in your Google environment, then you’re likely already familiar with the concept of using CloudSQL Proxy to access your CloudSQL instance. I don’t like to make my database servers publically available, so I chose to go the CloudSQL proxy route.
For those of you that aren’t familiar, CloudSQL proxy is usually used as a sidecar to your application in order to facilitate accessing CloudSQL instances privately. It uses a service account key to authenticate with Google’s APIs in order to authorize queries.
In the case of Kubernetes, whereby the proxy is sidecar’d, your application will likely use ‘localhost’ as the database host, and whatever port that you have the CloudSQL proxy container running on (e.g. 5432).
The problem with using the proxy is that it doesn’t pool connections at a higher level. You still have a one-to-one relationship between the application container and CloudSQL proxy. As your app scales out, it can gobble up as many connections as it wants.
When I migrated our applications away from the CloudSQL proxy sidecar approach, I chose to create a separate Kubernetes service and deployment for each application’s PgBouncer component. Within each PgBouncer pod, I would sidecar CloudSQL proxy, instead of with the application. I’m going to give an example to hopefully make this clearer.
Let’s say I had two applications called “telem-api” and “user-api”.
In that case, I would have:
The idea here is to isolate the impact of any given application going haywire. If we had a shared database instance with multiple databases within it which are leveraged by multiple apps, we’d be in a safer position.
With this approach, we can also control how many database connections each application can open in said RDS/CloudSQL instance. Honestly, it also helped simplify the automation of dynamically building the PgBouncer configuration file too!
Whether you’re on RDS or CloudSQL, there are certainly other things that can degrade performance, which you might want to look into as well, such as:
EXPLAIN ${query}
These are just some of the things that could be hindering the performance of your PostgreSQL instances. I’ve found implementing connection pooling to have been one of the most profound contributors to improving the performance and scalability of our PostgreSQL-backed applications.
--I work at a staffing agency. Day in and day out, I get to hear our recruiters prep candidates as…
Getting into DevOps is like any other career, in the sense that it is a life-altering decision - and, with…
If you've managed PostgreSQL databases in the past, you know how important it is to have proper monitoring in place…
Not too long ago, I was searching for information on the AWS Solutions Architect Professional certification exam. I was trying…
It's Not Easy Doing Everything There's an oft-circulated image in the technology sector - a company looking for a Swift…
Run a Task on Each ECS Cluster Node If you're using Amazon ECS with tools such as Datadog or local…