Tue 17 September 2024
There's a lot of well-deserved excitement around graphics processing units (GPUs), not just in the stock market but in the software world too. The success stories are too good to ignore. I've heard GPUs described as an "accelerant", and one might get the impression that sprinkling a little GPU on a project is magic pixie dust that will make it go faster. But like any accelerant, it's also likely to explode if you're not careful.
There are obvious and impressive applications in graphics, gaming, simulation, statistics, deep learning - any field that requires heavy number crunching with linear algebra routines. And these are many of the hottest fields in computing at the moment for a reason.
But there is a downside for software developers. In order to see any benefit at all from a GPU, and to avoid wasting time, energy, and money, you need to keep the hardware busy with large yet simple math problems. Not all software is amenable to this style of programming, and when it isn't, the humble and ubiquitous CPU outperforms the GPU in many meaningful ways. But how do you know?
In this article, I'll demonstrate a workflow using matrix multiplication, perhaps the best example of a simple but time-consuming math problem. When you see a problem successfully leverage a GPU, it's a good bet matrix multiplication is involved. Put another way: if you can express your problem as a matrix multiplication, it's a good bet GPUs will help you.
We'll generate random matrices of varying sizes, then perform multiplication using both the CPU and GPU. By timing the simplest of operations carefully, we can make some inferences about the behavior of the two processing units, and how that guides hardware decisions.
Core questions
Audience: developers facing a decision about whether to add a GPU to the mix.
- Will a GPU make my numeric code faster, in clock time? Under all inputs?
- How much time does the numeric code take relative to the rest of the overall process?
- Will a GPU be economically efficient, in operations per dollar?
- Will the burden of GPU hardware drivers and the relative scarcity of hardware itself be an issue?
If you're already deep into GPU work, or have no choice but to use it, you know all of this already. If you have questions about why, this is a great dive into the details: Making Deep Learning Go Brrrr From First Principles.
I'm not going to cover deep CUDA knowledge or details of how to implement various algorithms with it.
The implementation is yours. That's the point - in order to make this assessment, you need to test real code on real hardware.
Matrix math in Clojure
In order to perform identical matrix multiplication operations on both the CPU and GPU, I've chosen to use the Clojure Neanderthal library. Why? Neanderthal allows you to write high-level code for GPU and CPU in the same language. While not identical, the code is very similar across the two platforms. This lets us translate easily between GPU and CPU code, meaning we only have to figure out the logic once.
Here are the Clojure dependencies, typically found in a deps.edn file.
;; deps.edn
{:paths ["src"]
 :deps {;; clojure itself is just a library
        org.clojure/clojure {:mvn/version "1.12.0-alpha12"}
        ;; linear algebra
        uncomplicate/neanderthal {:mvn/version "0.49.1"}
        ;; NVIDIA GPU support
        uncomplicate/clojurecuda {:mvn/version "0.19.0"}
        ;; Intel MKL CPU support
        org.bytedeco/mkl-platform-redist {:mvn/version "2024.0-1.5.10"}}}
Our Clojure namespace requires the top-level APIs we'll be working with.
(ns linalg
  (:require
   [uncomplicate.clojurecuda.core :as cuda]
   [uncomplicate.commons.core :refer [with-release]]
   [uncomplicate.neanderthal.cuda :refer [cuv cuge with-default-engine with-engine]]
   [uncomplicate.neanderthal.native :refer [dv dge fge]]
   [uncomplicate.neanderthal.random :refer [rand-uniform! rand-normal!]]
   [uncomplicate.neanderthal.core :refer [mm! zero]]))
Next, we create two n x n matrices, X and Y, and do a matrix multiplication into a pre-allocated output matrix. Here, dge stands for Double General Matrix.
(def n 1024)

(timed "cpu" n
  (let [X (rand-uniform! (dge n n))
        Y (rand-uniform! (dge n n))
        output (zero X)]
    (timed "cpu-multiply" n
      (mm! 1.0 X Y output))))
And the code for the GPU. The three differences:
- Note the use of cuge (CUDA General Matrix) instead of dge.
- We need to wrap the operation in a few CUDA-specific lines to set up the engine.
- Because GPU calls are asynchronous, we need to call synchronize! to make sure we capture the actual work.
(timed "gpu" n
(cuda/with-default
(with-default-engine
(with-release [X (rand-uniform! (cuge n n))
Y (rand-uniform! (cuge n n))
output (zero X)]
(timed "gpu-multiply" n
(do
(mm! 1.0 X Y output)
(cuda/synchronize!)))))))
The timed call here is a customized version of the built-in Clojure time macro. As Clojure macros are a bit off topic, the details aren't important; I'll just leave the code here for anyone interested.
(defmacro timed
  "Evaluate expr, print a comma-separated line (label,n,milliseconds)
   to stdout, and return the value of expr."
  [label n expr]
  `(let [start# (. System (nanoTime))
         return# ~expr
         elapsed# (/ (double (- (. System (nanoTime)) start#)) 1000000.0)]
     (println (str ~label "," ~n "," elapsed#))
     return#))
TLDR: it allows us to instrument our code and test it at various input sizes.
The macro outputs comma-separated lines like this to stdout.
gpu-multiply,8194,14.245
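The numbers discussed below come from running these blocks across a range of matrix sizes. Here's a minimal driver sketch of one way to do that; cpu-bench, gpu-bench, and the list of sizes are illustrative choices of mine, not part of the original setup.

;; Wrap the timed CPU and GPU blocks above in functions of n,
;; then run both across a range of matrix sizes.
(defn cpu-bench [n]
  (let [X (rand-uniform! (dge n n))
        Y (rand-uniform! (dge n n))
        output (zero X)]
    (timed "cpu-multiply" n
      (mm! 1.0 X Y output))))

(defn gpu-bench [n]
  (cuda/with-default
    (with-default-engine
      (with-release [X (rand-uniform! (cuge n n))
                     Y (rand-uniform! (cuge n n))
                     output (zero X)]
        (timed "gpu-multiply" n
          (do
            (mm! 1.0 X Y output)
            (cuda/synchronize!)))))))

;; The outer timed captures allocation and setup as well as the multiply.
(doseq [n [256 512 1024 2048 4096 8192]]
  (timed "cpu" n (cpu-bench n))
  (timed "gpu" n (gpu-bench n)))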
When run on both the CPU and GPU for a variety of matrix sizes and plotted with a log-scale y-axis, several clear patterns emerge.
Some takeaways:
- Below a certain size threshold, the CPU wins hands down. The GPU is busy initializing; note the gap between the multiplication time and the overall time. All of that time is effectively a sunk cost.
- There is a narrow band around the threshold where the CPU and GPU are roughly equal, but the advantage changes quickly.
- Above the threshold, the GPU is orders of magnitude faster. It's not even in the same ballpark; the speed difference is so great that it could take processes that currently run for days down to seconds. That's not just performance optimization, that's completely game changing.
So, Answer 1: the GPU's advantage is highly sensitive to the amount of data you can feed it. If you can't keep it busy, or you rely on it for tasks it really can't do well, you're wasting resources initializing GPU memory.
Asking a GPU to crunch a few MBs of data is like asking a container ship to mail a letter.
The rest of the process
Let's assume your number-crunching code has proven to be a good match for GPU work.
Most code will still need to run in the context of a larger process. At the very least, something has to load the data into memory and onto the GPU - off of disk or network or other input streams. Then it presumably takes the result from the GPU and does the reverse - writes to disk or pushes it to the network or output streams. We are not allowed to cheat and pretend that these don't count as part of the overall cost.
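To make that bracketing concrete, here's a minimal sketch of the host-to-device round trip, assuming Neanderthal's transfer! from uncomplicate.neanderthal.core (not in the requires above) and using fge for the host matrix so it shares the single-precision element type of the default CUDA engine.

;; Host -> device -> host round trip that brackets any GPU computation.
;; Loading data from disk or network into host-X, and writing the result
;; back out, would still sit on either side of this.
(cuda/with-default
  (with-default-engine
    (with-release [host-X (rand-uniform! (fge n n))
                   gpu-X  (cuge n n)]
      (transfer! host-X gpu-X)      ;; copy the input onto the GPU
      ;; ... GPU work on gpu-X happens here ...
      (transfer! gpu-X host-X))))   ;; copy the result back to the host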
Amdahl's Law provides a good description of the problem:
"the overall performance improvement gained by optimizing a single part of a system is limited by the fraction of time that the improved part is actually used"
So Answer 2: GPU optimization is only worth it if the relative gains make the overall process significantly faster.
No cheating by ignoring the non-parallelized parts.
Microeconomics of matrix math
Even if our numeric code is faster on the GPU, and even if it does speed up the process overall to make it "worth it" from a clock time perspective, how does that translate to cost?
Let's start with what we know for sure. GPUs are significantly more expensive per hour.
If your organization views data products on a cost-of-goods-sold basis, your increase in compute costs needs to be offset by a decrease in processing time. A GPU may cost 20x more per hour, but if you can process the same volume of data 20x faster, you break even. If you process things 100x faster, you save money! If you process the data only 4x faster, your cost per unit went up 5x and you're not going to be happy.
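That arithmetic is simple enough to sketch directly; the function name is mine, and the 20x price ratio is the illustrative figure from the example above.

;; Relative cost per unit of work: hourly price ratio divided by speedup.
(defn relative-cost-per-unit [price-ratio speedup]
  (/ price-ratio speedup))

(relative-cost-per-unit 20.0 20.0)  ;; => 1.0, break even
(relative-cost-per-unit 20.0 100.0) ;; => 0.2, 5x cheaper per unit
(relative-cost-per-unit 20.0 4.0)   ;; => 5.0, 5x more expensive per unit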
Or maybe cost per unit isn't relevant. Maybe you are happy spending a premium to get faster speeds, or to do things that are practically impossible on a CPU.
Either way, dollars need to enter the equation at this point.
We can scale the time by the hourly cost of an average EC2 instance and get a similar chart with a new y-axis. Note that the shape doesn't really change, but the threshold of "worth it" moves up.
Answer 3: It depends on your willingness to pay for the decreased clock time. If it's speed at any cost, go for it! If you're constrained by efficiency, you need to normalize by the unit cost over time and adjust your threshold accordingly.
CUDAHell
So you've committed to writing code for the GPU, now comes the fun part - putting it into production.
At the very least, software teams will need access to GPU-capable development environments,
CI testing environments, and servers to deploy. Making that part of your process, if it's not already,
is a heavy lift for two reasons: expensive hardware and proprietary software drivers.
The market is tough for GPU hardware. Since it's exposed to fields like AI, gaming and cryptocurrency, demand and prices can swing wildly.
Getting adequate developer laptops is a challenge.
Then there are the software drivers.
Proprietary GPU drivers are terrible; I don't think I need to elaborate.
Installing CUDA or OpenCL on one machine is a challenge.
Installing it consistently on a fleet of developer and production machines can be a full time job.
The operational challenges of adding GPUs are significant. Those with unlimited AWS budgets and devops departments that take care of provisioning for you may not think of it as a big deal. But that's missing the point - proposing to add GPUs to your software stack is fundamentally an economic proposition, even if someone else is paying. Adding a GPU means adding complexity and costs and that needs to be justified.
Answer 4: Run through the full software dev-test-release process in a staging environment to make sure you can live with the additional operational challenges. Incorporate those as costs, again bumping up your threshold.
Review
I've outlined a rough methodology to justify using a GPU (or not) in your work.
- Write code that works on both GPU + CPU. I realize this is easier said than done!
- Run both on the domain of input sizes.
- Find the performance threshold that makes it worth it for your process.
- Find your economic threshold.
- Adjust your developer experience, testing and release process.
We can't just assume a GPU will be a silver bullet. All 5 of the above require a commitment of resources from your organization, with the realization that you may need to bail out if it does not pay off. GPUs are risky, but the potential upsides are too compelling to ignore.