Fri 11 February 2022
So you need a database for an application you're developing. You've looked around and decided that PostgreSQL fits the bill. Excellent choice! Now it's time to start coding. How do you get postgres running locally to develop and test against it?
The typical suggestion for many web application frameworks is to install PostgreSQL on your system using your package manager of choice - brew install postgresql or apt install postgresql - then configure it for your application (maybe tweaking some settings in /etc/postgresql/ as the root user), start a background process with your system supervisor of choice (sudo systemctl start postgresql), hook it up to your app, and you're off to the races.
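Concretely, that conventional flow looks something like this (Debian/Ubuntu shown; package names, the version directory, and service names vary by system):

$ sudo apt install postgresql
$ sudoedit /etc/postgresql/14/main/postgresql.conf   # tweak settings as root
$ sudo systemctl start postgresql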
But what happens when you're working on a project that needs a different major version of postgresql, with different extensions or entirely different settings? I often found myself in a scenario where my system was full of cruft, having been reworked many times over to swap out different postgresql instances. Additionally, there is only a single data directory (/var/lib/postgresql/<version>/main), so if you need the data to persist for more than a single project, you have to manage backup and restore each time you switch contexts.
A traditional system install just doesn't cut it. We need a way to run many different postgres instances, independent of each other with isolated data, settings and software versions. We can use Docker containers to run postgresql in a more flexible way that allows for greater experimentation, data stability, and greatly improved ease of use.
Running postgres in Docker, the naive approach
There's no real secret to running Docker containers. We know that postgresql docker images exist and we should be able to run them like any other.
$ docker run postgres:14.1
Unable to find image 'postgres:14.1' locally
14.1: Pulling from library/postgres
...
Status: Downloaded newer image for postgres:14.1
Error: Database is uninitialized and superuser password is not specified.
You must specify POSTGRES_PASSWORD to a non-empty value for the
superuser. For example, "-e POSTGRES_PASSWORD=password" on "docker run".
You may also use "POSTGRES_HOST_AUTH_METHOD=trust" to allow all
connections without a password. This is *not* recommended.
See PostgreSQL documentation about "trust":
https://www.postgresql.org/docs/current/auth-trust.html
Ah, clearly there are a few tricks specific to running postgres in a container. If we set a postgres password, we can get a running postgres instance.
$ docker run -e POSTGRES_PASSWORD=password postgres:14.1
...
2022-02-03 18:23:38.823 UTC [1] LOG: database system is ready to accept connections
The container startup script will initialize your database, create users and start the process, listening for connections. But where is it listening? We can't yet connect to it. And where is the data? We can't see any data anywhere on our host system. Everything is, well, contained within the running Docker container.
To make this workflow viable for local development, we'd like
- An open TCP port on the host system so we can connect to it.
- The data to live on the host system, not in the container's overlay filesystem.
- To give postgres access to files from the host system so that we can import datasets.
- Settings to live on the host system so that we can adjust them and optionally check them into source control.
Of course the official PostgreSQL Docker documentation covers these exact scenarios, showing us how we can use port forwarding and volume mounts.
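For reference, a minimal sketch along those lines (using the official image's default data path; the host directory name is arbitrary) looks like this:

$ docker run -e POSTGRES_PASSWORD=password \
    -p 5432:5432 \
    -v "$(pwd)/pgdata:/var/lib/postgresql/data" \
    postgres:14.1

That covers the port and the data, but I prefer a slightly more structured setup, described below.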
An alternative to system-wide PostgreSQL installs
Here is my opinionated take on how to set up an ergonomic postgres environment for local development.
First, create a database directory in your project to hold all things postgres.
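From the project root:

$ mkdir database
$ cd database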
Then create database/postgresql.conf to specify the postgres settings. The example below is a subset of the full postgres config: the settings that I typically need to adjust when doing any serious performance-sensitive development.
# PostgreSQL configuration file
# See https://github.com/postgres/postgres/blob/master/src/backend/utils/misc/postgresql.conf.sample
#------------------------------------------------------------------------------
# CONNECTIONS AND AUTHENTICATION
#------------------------------------------------------------------------------
listen_addresses = '*'
port = 5432 # (change requires restart)
max_connections = 100 # (change requires restart)
#------------------------------------------------------------------------------
# RESOURCE USAGE (except WAL)
#------------------------------------------------------------------------------
shared_buffers = 2048MB # min 128kB
work_mem = 40MB # min 64kB
maintenance_work_mem = 640MB # min 1MB
dynamic_shared_memory_type = posix # the default is the first option
max_parallel_workers_per_gather = 6 # taken from max_parallel_workers
max_parallel_workers = 12 # maximum number of max_worker_processes that can be used in parallel operations
#------------------------------------------------------------------------------
# WRITE-AHEAD LOG
#------------------------------------------------------------------------------
checkpoint_timeout = 40min # range 30s-1d
max_wal_size = 1GB
min_wal_size = 80MB
checkpoint_completion_target = 0.75 # checkpoint target duration, 0.0 - 1.0
#------------------------------------------------------------------------------
# REPORTING AND LOGGING
#------------------------------------------------------------------------------
logging_collector = off
log_autovacuum_min_duration = 0
log_checkpoints = on
log_connections = on
log_disconnections = on
log_error_verbosity = default
log_min_duration_statement = 20ms
log_lock_waits = on
log_temp_files = 0
log_timezone = 'UTC'
#------------------------------------------------------------------------------
# AUTOVACUUM
#------------------------------------------------------------------------------
autovacuum_vacuum_scale_factor = 0.02 # fraction of table size before vacuum
autovacuum_analyze_scale_factor = 0.01 # fraction of table size before analyze
#------------------------------------------------------------------------------
# CLIENT CONNECTION DEFAULTS
#------------------------------------------------------------------------------
datestyle = 'iso, mdy'
timezone = 'UTC'
lc_messages = 'C.UTF-8'
lc_monetary = 'C.UTF-8'
lc_numeric = 'C.UTF-8'
lc_time = 'C.UTF-8'
default_text_search_config = 'pg_catalog.english'
shared_preload_libraries = 'pg_stat_statements'
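Note that shared_preload_libraries only loads the pg_stat_statements module; to actually query its statistics view you also need to create the extension once in your database. Run this in any psql session later on:

CREATE EXTENSION IF NOT EXISTS pg_stat_statements;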
Create a database/pg_hba.conf
to control access to the database. You might need to adjust this to experiment with different networking setups, different users, etc. Usually the defaults here are fine.
# PostgreSQL Client Authentication Configuration File
# ===================================================
# TYPE  DATABASE        USER            CIDR-ADDRESS            METHOD
# Database administrative login by UNIX sockets
# "local" is for Unix domain socket connections only
local   all             postgres                                ident
local   all             all                                     ident
# IPv4 local connections:
host    all             all             172.17.0.0/16           md5
# IPv6 local connections:
host    all             all             ::1/128                 md5
Make two subdirectories to hold the data: database/mnt_data
to hold data you intend to import/export and database/pgdata
to hold the actual database.
$ mkdir mnt_data
$ mkdir pgdata
You probably don't want to check your datasets or database into source control. Create a database/.gitignore
to ignore them
# .gitignore
pgdata
mnt_data
Finally, create a run-postgres.sh
script to launch the docker container with everything hooked up.
#!/bin/bash
# run-postgres.sh
set -e

HOST_PORT=5432
NAME=postgres-dev
DOCKER_REPO=postgres
TAG=14.1

docker run --rm --name $NAME \
  --volume "$(pwd)/pgdata:/var/lib/pgsql/data" \
  --volume "$(pwd)/mnt_data:/mnt/data" \
  --volume "$(pwd)/pg_hba.conf:/etc/postgresql/pg_hba.conf" \
  --volume "$(pwd)/postgresql.conf:/etc/postgresql/postgresql.conf" \
  -e POSTGRES_PASSWORD=password \
  -e POSTGRES_USER=postgres \
  -e PGDATA=/var/lib/pgsql/data/pgdata14 \
  -e POSTGRES_INITDB_ARGS="--data-checksums --encoding=UTF8" \
  -e POSTGRES_DB=db \
  -p ${HOST_PORT}:5432 \
  ${DOCKER_REPO}:${TAG} \
  postgres \
  -c 'config_file=/etc/postgresql/postgresql.conf' \
  -c 'hba_file=/etc/postgresql/pg_hba.conf'
Note the HOST_PORT
variable. If you've already got another database running on 5432, this won't work. This is where you need to get a bit creative and tune the process to your needs. What I typically do is use port 6432 and increment by one for every project so they don't conflict. This allows you to run all of your databases at the same time on one machine. The only downside is you need to remember which port maps to which database!
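If you do forget which port belongs to which project, the mapping is easy to recover from the running containers:

$ docker ps --format 'table {{.Names}}\t{{.Ports}}'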
Running it
$ ./run-postgres.sh
...
2022-02-03 19:13:09.673 UTC [1] LOG: starting PostgreSQL 14.1 (Debian 14.1-1.pgdg110+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit
2022-02-03 19:13:09.673 UTC [1] LOG: listening on IPv4 address "0.0.0.0", port 5432
2022-02-03 19:13:09.673 UTC [1] LOG: listening on IPv6 address "::", port 5432
2022-02-03 19:13:09.677 UTC [1] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2022-02-03 19:13:09.685 UTC [26] LOG: database system was shut down at 2021-11-13 21:34:06 UTC
2022-02-03 19:13:09.700 UTC [1] LOG: database system is ready to accept connections
Using this setup, the logs are sent directly to stdout
so you'll see everything in the terminal. The ports and paths in the logs are inside the container, so don't get fooled trying to find them on your host system.
To connect, we use the host port defined in HOST_PORT (shown here with 6432, following the per-project convention above)
$ psql postgres://postgres:password@localhost:6432/postgres
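If you don't have a psql client installed on the host, a handy alternative is to run the one inside the container, as the postgres OS user so the ident rule in pg_hba.conf matches (container and database names come from the script above):

$ docker exec -it -u postgres postgres-dev psql -U postgres db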
You can put data in mnt_data
from the host system, which will be exposed to postgresql as the /mnt/data
directory inside the container. For example, load it with psql using COPY data FROM '/mnt/data/my.csv' WITH CSV HEADER;
. Likewise, any data dumps or exports from postgres can be output to this directory, immediately accessible to the host system.
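For example, exporting a table to the shared directory might look like this (my_table is a hypothetical table name; the path is the container-side path, which appears as ./mnt_data on the host):

COPY my_table TO '/mnt/data/my_table.csv' WITH (FORMAT csv, HEADER);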
To stop the server, use Ctrl-C. The data will persist in your pgdata directory. Resist the temptation to touch any files therein as they are managed internally by postgres. But you can move the directory as a whole around the filesystem or to another machine. It's not quite as convenient as a process-less, single-file SQLite database but it's close.
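As a minimal sketch of that kind of move (the hostname and paths are illustrative), after stopping the container:

$ sudo tar czf pgdata.tar.gz pgdata   # sudo is typically needed: the data files are owned by the container's postgres user
$ scp pgdata.tar.gz other-machine:projects/myapp/database/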
Because the pgdata directory is created by postgres, which guarantees that the on-disk data format is consistent within a major version, we can even use a different image altogether to access the same underlying dataset. This can be very handy for, e.g., switching between vanilla postgres and postgis, or for testing different versions of extensions. As long as the image follows the basic rules of the postgres container behavior and uses the same major version, it should just work.
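As a sketch of such a swap, you could point the script at a PostGIS image instead (the postgis/postgis repository and tag below are illustrative; check Docker Hub for a tag matching your postgres major version):

# in run-postgres.sh: keep the same major version (14) and swap the image
DOCKER_REPO=postgis/postgis
TAG=14-3.2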
What about in production?
Installing postgresql on a VM or bare-metal server is still viable, especially if automated with configuration tools like Ansible or Chef. But there are other options.
If your project is all-in on containers in production, consider checking out some of the Kubernetes operators for postgres. You can use the exact same container image in production that you test on locally, albeit with some additional operational concerns around availability and stateful data. Operator software like Crunchy PostgreSQL for Kubernetes and Kubegres can be configured for load balancing, high availability, backups, monitoring, etc., which can ease the operational burden should your database require such things.
Of course, there is always the cloud hosted option. I've used postgresql on both GCP Cloud SQL and AWS RDS and, while you give up some control of the environment and are no longer able to run the exact same database locally as you do in prod, the ease of administering these hosted databases might be worth it.
Conclusion
Docker containers provide a robust way to run postgres in local development, with very few compromises. A container-based workflow makes it easier to maintain multiple parallel databases, and to move data freely between systems. For my money,
there's no need to apt install
postgres again.