Sat 10 October 2015
With the recent announcement that AWS Lambda now supports Python, I decided to take a look at using it for geospatial data processing.
Previously, I had built queue-based systems with Celery that allow you to run discrete processing tasks in parallel on AWS infrastructure. Just start up as many workers on EC2 instances as you need, set up a broker and a results store, add jobs to the queue and collect the results. The problem with this system is that you have to manage all of the infrastructure and services yourself.
Ideally you wouldn't need to worry about infrastructure at all. That is the promise of AWS Lambda. Lambda can respond to events, fire up a worker and run the task without you needing to worry about provisioning a server. This is especially nice for sporadic workloads in response to events like user-uploaded data, where you need to scale up or down regularly.
The reality of AWS Lambda is that you do need to worry about infrastructure in a different way. The constraints of the runtime environment mean that you need to get creative if you're doing anything beyond the basics. If your task relies on compiled code, either Python C extensions or shared libraries, you have to jump through some hoops. And for any geo data processing you are going to use a good amount of compiled code to call into C libs (see numpy, rasterio, GDAL, geopandas, Fiona, and so on).
This article describes my approach to solving the problem of running Python with calls to native code on AWS Lambda.
Outline
The short version goes like this:
- Start an EC2 instance using the official Amazon Linux AMI (based on Red Hat Enterprise Linux)
- On the EC2 instance, build any shared libraries from source.
- Create a virtualenv with all your Python dependencies.
- Write a Python handler function to respond to events and interact with other parts of AWS (e.g. fetch data from S3)
- Write a Python worker, as a command line interface, to process the data
- Bundle the virtualenv, your code and the binary libs into a zip file
- Publish the zip file to AWS Lambda
The deployment process is a bit clunky but the benefit is that, once it works, you don't have any servers to manage! A fair tradeoff IMO.
The process will take a raster dataset uploaded to the input S3 bucket and automatically extract the shape of the valid data region, placing the resulting GeoJSON in the output S3 bucket.
Start EC2
Under the hood, your Lambda functions are running on EC2 with Amazon Linux. You don't have to think about that at runtime but, if you're calling native compiled code, it needs to be compiled on a similar OS. Theoretically you could do this with your own version of RHEL or CentOS but to be safe it's easier to use the official Amazon Linux since we know that's the exact environment our code will be run in.
I'm not going to go over the details of setting up EC2, so I'll assume we already have our account set up. The AMI IDs are listed here; pick the appropriate one for your region:
aws ec2 run-instances --image-id ami-9ff7e8af \
--count 1 --instance-type t2.micro \
--key-name your-key --security-groups your-sg
And ssh in
ssh -i your-key.pem ec2-user@your.public.ip
Make sure everything's up to date:
sudo yum -y update
sudo yum -y upgrade
Build shared libraries from source
Because your Lambda function will run in a clean Amazon Linux environment, you can't assume any system libraries will be there. Compiling from source isn't the only option - you could install binaries from the Enterprise Linux GIS effort - but those tend to be older versions. To get more recent libs, compiling from source is an effective approach.
First install some compile-time deps
sudo yum install python27-devel python27-pip gcc libjpeg-devel zlib-devel gcc-c++
Then build and install proj4 to a local prefix
wget https://github.com/OSGeo/proj.4/archive/4.9.2.tar.gz
tar -zvxf 4.9.2.tar.gz
cd proj.4-4.9.2/
./configure --prefix=/home/ec2-user/lambda/local
make
make install
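The GDAL configure step below also points at a geos-config in our local prefix. If you want GEOS support, build it into the same prefix first; a minimal sketch, assuming GEOS 3.4.2 (the exact version is an assumption - any recent release should build the same way):

wget http://download.osgeo.org/geos/geos-3.4.2.tar.bz2
tar -xjvf geos-3.4.2.tar.bz2
cd geos-3.4.2
./configure --prefix=/home/ec2-user/lambda/local
make
make install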
And build GDAL, statically linking proj4
wget http://download.osgeo.org/gdal/1.11.3/gdal-1.11.3.tar.gz
tar -xzvf gdal-1.11.3.tar.gz
cd gdal-1.11.3
./configure --prefix=/home/ec2-user/lambda/local \
--with-geos=/home/ec2-user/lambda/local/bin/geos-config \
--with-static-proj4=/home/ec2-user/lambda/local
make
make install
This should leave us with a nice shared library at /home/ec2-user/lambda/local/lib/libgdal.so.1 that can be safely moved to another Amazon Linux box.
Create a virtualenv
Pretty straightforward, but keep in mind that some of the dependencies here are compiled extensions, so these builds are platform-specific - which is why we need to build them on the target Amazon Linux OS.
virtualenv env
source env/bin/activate
export GDAL_CONFIG=/home/ec2-user/lambda/local/bin/gdal-config
pip install rasterio
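At this point it's worth a quick sanity check that the freshly built extension can actually find our GDAL (using the same prefix as above):

LD_LIBRARY_PATH=/home/ec2-user/lambda/local/lib \
    python -c "import rasterio; print(rasterio.__version__)"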
Python handler function
The handler's job is to respond to the event (e.g. a new file created in an S3 bucket), perform any Amazon-specific tasks (like fetching data from S3) and invoke the worker. Importantly, in the context of this article, the handler must set the LD_LIBRARY_PATH to point to any shared libraries that the worker may need.
import os
import subprocess
import uuid

import boto3

libdir = os.path.join(os.getcwd(), 'local', 'lib')

s3_client = boto3.client('s3')


def handler(event, context):
    results = []
    for record in event['Records']:
        # Find input/output buckets and key names
        bucket = record['s3']['bucket']['name']
        output_bucket = "{}.geojson".format(bucket)
        key = record['s3']['object']['key']
        output_key = "{}.geojson".format(key)

        # Download the raster locally
        download_path = '/tmp/{}{}'.format(uuid.uuid4(), key)
        s3_client.download_file(bucket, key, download_path)

        # Call the worker, setting the environment variables
        command = 'LD_LIBRARY_PATH={} python worker.py "{}"'.format(
            libdir, download_path)
        output_path = subprocess.check_output(command, shell=True)

        # Upload the output of the worker to S3
        s3_client.upload_file(output_path.strip(), output_bucket, output_key)
        results.append(output_path.strip())

    return results
It's important that the handler function does not import any modules which require dynamic linking. For example, you cannot import rasterio in the main Python handler since the dynamic linker doesn't yet know where to look for the GDAL shared library. You can control the linker paths using the LD_LIBRARY_PATH environment variable, but only before the process is started. Lambda doesn't give you any control over the environment variables of the handler function itself. I tried hacks like creating new processes within the handler using os.execv or multiprocessing pools, but the user running the Lambda function doesn't have the necessary permissions to do that (both give you OSErrors - [Errno 13] Permission Denied and [Errno 38] Function not implemented respectively).
Fortunately, Lambda lets you call out to the shell so we can just do our real work through a worker script exposed as a command line interface (details in the next section). While at first this feels clunky, it has the side benefit of forcing separation of your AWS code from your business logic which can be written and tested separately.
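As an aside, the same effect as the LD_LIBRARY_PATH=... prefix in the handler above can be had by passing an explicit environment to subprocess, which sidesteps shell quoting. A sketch, reusing the libdir and download_path names from the handler:

import os
import subprocess

# Copy the current environment and point the dynamic linker at our bundled libs
env = os.environ.copy()
env['LD_LIBRARY_PATH'] = libdir  # defined at module level in the handler
output_path = subprocess.check_output(
    ['python', 'worker.py', download_path], env=env)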
Worker
The worker script can be written in any language, compiled or interpreted, so long as it follows the basic rules of command line interfaces. We're using Python in the handler to set up the appropriate environment. For this example, the worker will also be written in Python because of its awesome support for geospatial data processing. But it could be written in Bash or C or just about anything, so long as its runtime environment can be configured with environment variables and arguments.
In this case, the handler is calling worker.py which looks like:
import rasterio
from tempfile import NamedTemporaryFile
import json
import sys
from rasterio import features


def raster_shape(raster_path):
    with rasterio.open(raster_path) as src:
        # read the first band and create a binary mask
        arr = src.read(1)
        ndv = src.nodata
        binarray = (arr == ndv).astype('uint8')

        # extract shapes from raster
        shapes = features.shapes(binarray, transform=src.transform)

        # create geojson feature collection
        fc = {
            'type': 'FeatureCollection',
            'features': []}

        for geom, val in shapes:
            if val == 0:  # not nodata, i.e. valid data
                feature = {
                    'type': 'Feature',
                    'properties': {'name': raster_path},
                    'geometry': geom}
                fc['features'].append(feature)

    # Write to file
    with NamedTemporaryFile(suffix=".geojson", delete=False) as temp:
        temp.file.write(json.dumps(fc))

    return temp.name


if __name__ == "__main__":
    in_path = sys.argv[1]
    out_path = raster_shape(in_path)
    print(out_path)
Notice how the worker itself has no knowledge of AWS events or S3 - it works entirely on the local filesystem and thus can be used in other contexts and tested much more easily.
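For example, a minimal test of the worker (runnable with something like pytest) might look like this; the test module name and the sample raster path are hypothetical, not part of the project:

# test_worker.py
import json

from worker import raster_shape


def test_raster_shape_produces_geojson():
    # assumes a small test raster with a nodata value sits at this (hypothetical) path
    out_path = raster_shape('tests/data/sample.tif')
    with open(out_path) as f:
        fc = json.load(f)
    assert fc['type'] == 'FeatureCollection'
    assert len(fc['features']) > 0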
Bundle
In order to deploy to Lambda, you need to package it up in a zip file in a slightly unusual manner. All of your Python packages and your handler script should be at the root, while the shared libraries can be put in a directory (local/lib in this case):
cd ~/lambda
zip -9 bundle.zip handler.py
zip -r9 bundle.zip worker.py
zip -r9 bundle.zip local/lib/libgdal.so.1
cd $VIRTUAL_ENV/lib/python2.7/site-packages
zip -r9 ~/lambda/bundle.zip *
cd $VIRTUAL_ENV/lib64/python2.7/site-packages
zip -r9 ~/lambda/bundle.zip *
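It's easy to get this layout wrong, so a quick look at the archive before uploading doesn't hurt:

unzip -l bundle.zip | grep -E 'handler.py|worker.py|libgdal'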
Publish
The details of setting up a Lambda function are far too verbose for this article - I would suggest running through the AWS S3 walkthrough to get the basic S3 example working first. Then use the AWS CLI to update your existing Lambda function:
aws lambda update-function-code \
--function-name testfunc1 \
--zip-file fileb://bundle.zip
The end result
Uploading a raster dataset to your S3 bucket should now trigger the Lambda function which will create a new GeoJSON in the output bucket. All automatically invoked based on the S3 events and completely scalable without having to worry about managing or provisioning servers. Nifty!
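For example, with hypothetical bucket and file names (remember the handler writes to an output bucket named after the input bucket plus .geojson):

aws s3 cp elevation.tif s3://your-input-bucket/elevation.tif
# once the function has run, the result should show up as elevation.tif.geojson
aws s3 ls s3://your-input-bucket.geojson/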
The worker and handler code above are intentionally kept short to be more readable. In real usage they would need significantly more error handling and conditionals to handle edge cases, malformed inputs, etc.
It occurred to me after writing this that there really is nothing Python-specific about this approach - the handler could just as easily have been written in JavaScript and the worker in some other language. But this should provide a general approach for incorporating native code of any sort in AWS Lambda.
It remains to be seen if this approach is faster or cheaper than a queue-based system with autoscaled EC2 instances. If you're doing a constantly-high workload with lots of data, it's probably safe to say that Lambda is not appropriate. If you're doing sporadic workloads with some discrete processing task based on user-uploaded data, Lambda might be the ticket. The primary advantage is not necessarily speed or cost but reduced infrastructure complexity and hands-off autoscaling.