Mon 23 February 2015
                        Introduction
Zonal statistics allow you to summarize raster datasets based on vector geometries 
by aggregating all pixels associated with each vector feature, typically to a single scalar value. For example, you might want the mean elevation of each country against an SRTM Digital Elevation
Model (DEM). This is easily accomplished in python using rasterstats:
from rasterstats import zonal_stats
stats = zonal_stats('countries.shp', 'elevation.tif', stats="mean", copy_properties=True)
for s in stats:
    print s['name'], s['mean']
 
Which would give us output similar to below, with the mean elevation (meters) for each country:
Afghanistan 1826.38
Netherlands 8.78
Nepal 2142.28
Zimbabwe 980.85
 
Zonal Histograms
Using the built-in aggregate functions in rasterstats can reveal a lot about 
about the underlying raster dataset (see statistics for full list). Most of the time the standard descriptive statistics
like min, max, mean, median, etc. can tell us everything we need to know.
But what if we want to retain more information about the underlying distribution of
values? Instead of simply stating 
Afghanistan is, on average, 1826.38 meters above sea level
supposed we wanted to see how much of the country is in high vs low elevation areas.
We could bin the elevations into meaningful ranges (say 0-200 meters, 200 to 400 meters, etc) and create a histogram of pixel counts to show the shape of the underlying distribution. In this case, the aggregate function does not return a scalar value but a dictionary with 
each bin as a key.
>> stats['elevation_histogram']
{'0 to 500m': ...,
 '500 to 1000m': ...,
 '1000 to 3000m':...,
 '3000 to 5000m':...,
 '5000m+'
}
 
That's the goal, now how do we accomplish this? 
User-defined aggregate functions
Because a histogram might need to specify a number of arguments to customize the results, it's not
feasible for rasterstats to define a generic histogram function. However, as of version 0.6, we
have the ability to create custom, user-defined aggregate functions such as the zonal histogram 
idea described above.
First, we have to write our function. The first and only argument is a masked numpy array 
and will typically be handled by numpy functions. The function's return value will be added to the stats output for each feature. The returned
value does not need to be a scalar, it can be any valid python value (though it's probably
best to stick with dicts, lists and other simple data structures that are easily 
serializable).
import numpy as np
import itertools
def elevation_histogram(x):
    bin_edges = [0, 400, 1000, 3000, 5000, 10000]
    hist, _ = np.histogram(x, bins=bin_edges)
    data = {}  
    for upper, lower, value in itertools.izip(bin_edges, bin_edges[1:], hist):
        key = "{} to {}m".format(upper, lower)
        data[key] = value
    return data
 
And then add our custom elevation_histogram function to our zonal_stats call
using the add_stats keyword argument:
stats = zonal_stats('countries.shp', 'elevation.tif', copy_properties=True,
                    add_stats={'elevation_histogram': elevation_histogram})
for s in stats:
    print s['name'], s['mean'], s['elevation_histogram']
 
which gives us output similar to the following which gives you raw pixel counts
for each of the elevation bins (formatted for readability)
Afghanistan 1826.38 {
    '3000 to 5000m': 1099730, 
    '0 to 400m': 1754317,
    '1000 to 3000m': 2884917, 
    '5000 to 10000m': 83158, 
    '400 to 1000m': 1907790}
 
The only caveat with using this technique is that nested dictionaries and other 
non-scalar values might cause difficulty when trying to serialize this 
data structure to other formats. For example, most GIS formats don't support hierarchical 
properties (nested dictionaries) so you might have to flatten the data before
writing to e.g. PostGIS or an ESRI shapefile. 
With the ability to write user-defined aggregate functions, I can keep the core
of rasterstats light while allowing for the possibility of complex aggregate analysis
that might be needed in the future. Good stuff.