perrygeo

About | Articles | CV

Zonal statistics: histograms as user-defined aggregate functions

Mon 23 February 2015

Introduction

Zonal statistics allow you to summarize raster datasets based on vector geometries by aggregating all pixels associated with each vector feature, typically to a single scalar value. For example, you might want the mean elevation of each country against an SRTM Digital Elevation Model (DEM). This is easily accomplished in python using rasterstats:

from rasterstats import zonal_stats
stats = zonal_stats('countries.shp', 'elevation.tif', stats="mean", copy_properties=True)
for s in stats:
    print s['name'], s['mean']

Which would give us output similar to below, with the mean elevation (meters) for each country:

Afghanistan 1826.38
Netherlands 8.78
Nepal 2142.28
Zimbabwe 980.85

Zonal Histograms

Using the built-in aggregate functions in rasterstats can reveal a lot about about the underlying raster dataset (see statistics for full list). Most of the time the standard descriptive statistics like min, max, mean, median, etc. can tell us everything we need to know.

But what if we want to retain more information about the underlying distribution of values? Instead of simply stating

Afghanistan is, on average, 1826.38 meters above sea level

supposed we wanted to see how much of the country is in high vs low elevation areas. We could bin the elevations into meaningful ranges (say 0-200 meters, 200 to 400 meters, etc) and create a histogram of pixel counts to show the shape of the underlying distribution. In this case, the aggregate function does not return a scalar value but a dictionary with each bin as a key.

>> stats['elevation_histogram']
{'0 to 500m': ...,
 '500 to 1000m': ...,
 '1000 to 3000m':...,
 '3000 to 5000m':...,
 '5000m+'
}

That's the goal, now how do we accomplish this?

User-defined aggregate functions

Because a histogram might need to specify a number of arguments to customize the results, it's not feasible for rasterstats to define a generic histogram function. However, as of version 0.6, we have the ability to create custom, user-defined aggregate functions such as the zonal histogram idea described above.

First, we have to write our function. The first and only argument is a masked numpy array and will typically be handled by numpy functions. The function's return value will be added to the stats output for each feature. The returned value does not need to be a scalar, it can be any valid python value (though it's probably best to stick with dicts, lists and other simple data structures that are easily serializable).

import numpy as np
import itertools

def elevation_histogram(x):
    bin_edges = [0, 400, 1000, 3000, 5000, 10000]
    hist, _ = np.histogram(x, bins=bin_edges)
    data = {}  
    for upper, lower, value in itertools.izip(bin_edges, bin_edges[1:], hist):
        key = "{} to {}m".format(upper, lower)
        data[key] = value
    return data

And then add our custom elevation_histogram function to our zonal_stats call using the add_stats keyword argument:

stats = zonal_stats('countries.shp', 'elevation.tif', copy_properties=True,
                    add_stats={'elevation_histogram': elevation_histogram})
for s in stats:
    print s['name'], s['mean'], s['elevation_histogram']

which gives us output similar to the following which gives you raw pixel counts for each of the elevation bins (formatted for readability)

Afghanistan 1826.38 {
    '3000 to 5000m': 1099730, 
    '0 to 400m': 1754317,
    '1000 to 3000m': 2884917, 
    '5000 to 10000m': 83158, 
    '400 to 1000m': 1907790}

The only caveat with using this technique is that nested dictionaries and other non-scalar values might cause difficulty when trying to serialize this data structure to other formats. For example, most GIS formats don't support hierarchical properties (nested dictionaries) so you might have to flatten the data before writing to e.g. PostGIS or an ESRI shapefile.

With the ability to write user-defined aggregate functions, I can keep the core of rasterstats light while allowing for the possibility of complex aggregate analysis that might be needed in the future. Good stuff.