Mon 23 February 2015
Introduction
Zonal statistics allow you to summarize raster datasets based on vector geometries
by aggregating all pixels associated with each vector feature, typically to a single scalar value. For example, you might want the mean elevation of each country against an SRTM Digital Elevation
Model (DEM). This is easily accomplished in python using rasterstats
:
from rasterstats import zonal_stats
stats = zonal_stats('countries.shp', 'elevation.tif', stats="mean", copy_properties=True)
for s in stats:
print s['name'], s['mean']
Which would give us output similar to below, with the mean elevation (meters) for each country:
Afghanistan 1826.38
Netherlands 8.78
Nepal 2142.28
Zimbabwe 980.85
Zonal Histograms
Using the built-in aggregate functions in rasterstats
can reveal a lot about
about the underlying raster dataset (see statistics for full list). Most of the time the standard descriptive statistics
like min, max, mean, median, etc. can tell us everything we need to know.
But what if we want to retain more information about the underlying distribution of
values? Instead of simply stating
Afghanistan is, on average, 1826.38 meters above sea level
supposed we wanted to see how much of the country is in high vs low elevation areas.
We could bin the elevations into meaningful ranges (say 0-200 meters, 200 to 400 meters, etc) and create a histogram of pixel counts to show the shape of the underlying distribution. In this case, the aggregate function does not return a scalar value but a dictionary with
each bin as a key.
>> stats['elevation_histogram']
{'0 to 500m': ...,
'500 to 1000m': ...,
'1000 to 3000m':...,
'3000 to 5000m':...,
'5000m+'
}
That's the goal, now how do we accomplish this?
User-defined aggregate functions
Because a histogram might need to specify a number of arguments to customize the results, it's not
feasible for rasterstats
to define a generic histogram function. However, as of version 0.6, we
have the ability to create custom, user-defined aggregate functions such as the zonal histogram
idea described above.
First, we have to write our function. The first and only argument is a masked numpy array
and will typically be handled by numpy
functions. The function's return value will be added to the stats output for each feature. The returned
value does not need to be a scalar, it can be any valid python value (though it's probably
best to stick with dicts, lists and other simple data structures that are easily
serializable).
import numpy as np
import itertools
def elevation_histogram(x):
bin_edges = [0, 400, 1000, 3000, 5000, 10000]
hist, _ = np.histogram(x, bins=bin_edges)
data = {}
for upper, lower, value in itertools.izip(bin_edges, bin_edges[1:], hist):
key = "{} to {}m".format(upper, lower)
data[key] = value
return data
And then add our custom elevation_histogram
function to our zonal_stats
call
using the add_stats
keyword argument:
stats = zonal_stats('countries.shp', 'elevation.tif', copy_properties=True,
add_stats={'elevation_histogram': elevation_histogram})
for s in stats:
print s['name'], s['mean'], s['elevation_histogram']
which gives us output similar to the following which gives you raw pixel counts
for each of the elevation bins (formatted for readability)
Afghanistan 1826.38 {
'3000 to 5000m': 1099730,
'0 to 400m': 1754317,
'1000 to 3000m': 2884917,
'5000 to 10000m': 83158,
'400 to 1000m': 1907790}
The only caveat with using this technique is that nested dictionaries and other
non-scalar values might cause difficulty when trying to serialize this
data structure to other formats. For example, most GIS formats don't support hierarchical
properties (nested dictionaries) so you might have to flatten the data before
writing to e.g. PostGIS or an ESRI shapefile.
With the ability to write user-defined aggregate functions, I can keep the core
of rasterstats
light while allowing for the possibility of complex aggregate analysis
that might be needed in the future. Good stuff.