18. Geospatial Raster Data in Python#
Attribution: Parts of this tutorial are developed based on the content from the following great sources: Parts of this tutorial are developed based on the content from the following great sources: Introduction to Python for Geographic Data AnalysisIntroduction to Python for Geographic Data Analysis; Introduction to Raster Data in PythonIntroduction to Raster Data in Python; and Introduction to Geospatial Raster and Vector Data with PythonIntroduction to Geospatial Raster and Vector Data with Python
The purpose of this lecture is to learn rasterio
and rioxarray
for loading and working with raster data.
18.1. Find a Raster Scene#
First, Let’s find a Sentinel-2 scene to work with. We will use the PySTAC
client to retrieve the latest cloud free scene across Clark University campus:
from pystac_client import Client
api_url = "https://earth-search.aws.element84.com/v1"
client = Client.open(api_url)
collections = client.get_collections()
point = dict(type="Point", coordinates=(-71.822833, 42.250809))
search_results = client.search(
collections=["sentinel-2-l2a"],
intersects=point,
query=["eo:cloud_cover<5", "s2:nodata_pixel_percentage>20"],
sortby=["-properties.datetime"],
max_items=5,
)
items = search_results.item_collection()
(Note that we also added s2:nodata_pixel_percentage
to our query parameters to find scenes that have some missing pixels. We want to use these to demonstrate how you would handle missing data later on)
18.2. Working with Rasterio#
Rasterio is a highly useful module for raster processing which you can use for reading and writing several different raster formats in Python. Rasterio
is based on GDAL and Python automatically registers all known GDAL drivers for reading supported formats when importing the module.
18.2.1. Read the Metadata#
import rasterio
import os
import numpy as np
Let’s open the nir
band of the first item:
(Reminder: you can inspect the metadata of each item, and see what bands are available in the catalog using the assets.keys()
property. e.g. items[0].assets.keys())
worcester_nir = rasterio.open(items[0].assets["nir"].href)
And let’s see what is the type of worcester_nir
:
type(worcester_nir)
rasterio.io.DatasetReader
You can see that our raster variable is a rasterio.io.DatasetReader
type which means that we have opened the file for reading. This does not contain the value of the bands yet.
You can retrieve the metadata of this raster as following:
print("CRS of the scene is:", worcester_nir.crs)
print("Transform of the scene is:", worcester_nir.transform)
print("Width of the scene is:", worcester_nir.width)
print("Height of the scene is:", worcester_nir.height)
print("Number of bands in the scene is:", worcester_nir.count)
print("Bounds of the scene are:", worcester_nir.bounds)
print("The no data values for all channels are:", worcester_nir.nodatavals)
CRS of the scene is: EPSG:32618
Transform of the scene is: | 10.00, 0.00, 699960.00|
| 0.00,-10.00, 4700040.00|
| 0.00, 0.00, 1.00|
Width of the scene is: 10980
Height of the scene is: 10980
Number of bands in the scene is: 1
Bounds of the scene are: BoundingBox(left=699960.0, bottom=4590240.0, right=809760.0, top=4700040.0)
The no data values for all channels are: (0.0,)
If you need a reminder about what is a transform
, check this reference.
Alternatively, you can retrieve all metadata of the scene using the following:
print("All metada of the scene:")
print(worcester_nir.meta)
All metada of the scene:
{'driver': 'GTiff', 'dtype': 'uint16', 'nodata': 0.0, 'width': 10980, 'height': 10980, 'count': 1, 'crs': CRS.from_epsg(32618), 'transform': Affine(10.0, 0.0, 699960.0,
0.0, -10.0, 4700040.0)}
18.2.2. Get Band Data#
You can read the data from one or multiple bands using the read
method:
nir = worcester_nir.read(1)
Note: rasterio band numbering starts from 1
unlike Python that starts from 0
.
Let’s examine the data type this band:
nir
array([[ 0, 0, 0, ..., 2784, 2754, 2574],
[ 0, 0, 0, ..., 2958, 2906, 2512],
[ 0, 0, 0, ..., 3000, 2814, 2354],
...,
[ 0, 0, 0, ..., 767, 758, 754],
[ 0, 0, 0, ..., 772, 738, 778],
[ 0, 0, 0, ..., 784, 794, 771]], dtype=uint16)
print(nir.dtype)
uint16
This data type in 16-bit unsigned integer. Unsigned integer is always equal or greater than zero and signed integer can store also negative values. For example, an unsigned 16-bit integer can store 2^16 (=65,536) values ranging from 0 to 65,535.
Let’s now visualize this band. We can use the plot
function from rasterio
:
18.2.3. Visualize Band Data#
from rasterio import plot
plot.show(nir)
<Axes: >
As you can see in this case the band data is treated as a regular (non-geospatial) array and the axes start from 0 instead of the actual coordinates of the raster. This happens as the nir
variable is a numpy array and has no geospatial attirbutes. You can use the rasterio dataset worcester_nir
and get the plot with the actual bounds:
plot.show(worcester_nir)
<Axes: >
18.3. Rioxarray#
rioxarray
based on the popular rasterio
package for working with rasters and xarray
for working with multi-dimensional arrays. rioxarray
extends xarray
by providing top-level functions (e.g. the open_rasterio
function to open raster datasets) and by adding a set of methods to the main objects of the xarray package (the Dataset
and the DataArray
). These additional methods are made available via the rio
accessor and become available from xarray
objects after importing rioxarray
.
We can implement the steps we carried out with rasterio
in the previous section using rioxarray
as following:
import rioxarray
worcester_nir_xr = rioxarray.open_rasterio(items[0].assets["nir"].href)
Let’s look at the worcester_nir_xr
:
worcester_nir_xr
<xarray.DataArray (band: 1, y: 10980, x: 10980)> [120560400 values with dtype=uint16] Coordinates: * band (band) int64 1 * x (x) float64 7e+05 7e+05 7e+05 ... 8.097e+05 8.097e+05 8.098e+05 * y (y) float64 4.7e+06 4.7e+06 4.7e+06 ... 4.59e+06 4.59e+06 spatial_ref int64 0 Attributes: AREA_OR_POINT: Area OVR_RESAMPLING_ALG: AVERAGE _FillValue: 0 scale_factor: 1.0 add_offset: 0.0
The output tells us that we are looking at an xarray.DataArray
, with 1 band, 10980 rows, and 10980 columns. We can also see the number of pixel values in the DataArray
, and the type of those pixel values, which is unsigned integer (or uint16
). The DataArray
also stores different values for the coordinates of the DataArray
. When using rioxarray
, the term coordinates refers to spatial coordinates like x
and y
but also the band
coordinate. Each of these sequences of values has its own data type, like int64
for the spatial coordinates and float64
for the x and y coordinates.
Similar to rasterio
, we can inspect the attributes of the raster. They can be accessed from the rio
accessor like .rio.crs
, .rio.nodata
, and .rio.bounds()
, which contain the metadata for the file we opened.
print("CRS of the scene is:", worcester_nir_xr.rio.crs)
print("Transform of the scene is:", worcester_nir_xr.rio.transform())
print("Width of the scene is:", worcester_nir_xr.rio.width)
print("Height of the scene is:", worcester_nir_xr.rio.height)
print("Number of bands in the scene is:", worcester_nir_xr.rio.count)
print("Bounds of the scene are:", worcester_nir_xr.rio.bounds())
print("The no data values for all channels are:", worcester_nir_xr.rio.nodata)
CRS of the scene is: EPSG:32618
Transform of the scene is: | 10.00, 0.00, 699960.00|
| 0.00,-10.00, 4700040.00|
| 0.00, 0.00, 1.00|
Width of the scene is: 10980
Height of the scene is: 10980
Number of bands in the scene is: 1
Bounds of the scene are: (699960.0, 4590240.0, 809760.0, 4700040.0)
The no data values for all channels are: 0
You can get the band data using .data
:
worcester_nir_xr.data
array([[[ 0, 0, 0, ..., 2784, 2754, 2574],
[ 0, 0, 0, ..., 2958, 2906, 2512],
[ 0, 0, 0, ..., 3000, 2814, 2354],
...,
[ 0, 0, 0, ..., 767, 758, 754],
[ 0, 0, 0, ..., 772, 738, 778],
[ 0, 0, 0, ..., 784, 794, 771]]], dtype=uint16)
You can also use .values
to get the band data but .values
will try to conver the data to numpy.array
if the data is not in that type (e.g. if it’s a dask.array
). So you should be careful with calling .values
.
You can also visualize the raster:
worcester_nir_xr.plot()
<matplotlib.collections.QuadMesh at 0x7fa7c2604410>
You can also use built-in xarray
functions to calculate statistics of the data:
print("Min of the data is:", worcester_nir_xr.min())
print("Max of the data is:", worcester_nir_xr.max())
print("Mean of the data is:", worcester_nir_xr.mean())
print("Std of the data is:", worcester_nir_xr.std())
print("25th, and 75th quantiles of the data are:", worcester_nir_xr.quantile([0.25, 0.75]))
Min of the data is: <xarray.DataArray ()>
array(0, dtype=uint16)
Coordinates:
spatial_ref int64 0
Max of the data is: <xarray.DataArray ()>
array(17704, dtype=uint16)
Coordinates:
spatial_ref int64 0
Mean of the data is: <xarray.DataArray ()>
array(1485.22314339)
Coordinates:
spatial_ref int64 0
Std of the data is: <xarray.DataArray ()>
array(1531.22790953)
Coordinates:
spatial_ref int64 0
25th, and 75th quantiles of the data are: <xarray.DataArray (quantile: 2)>
array([ 0., 3086.])
Coordinates:
* quantile (quantile) float64 0.25 0.75
But there is an issue with these statistics. As you can see in the plot, this scene has a lot of missing pixels that are set to 0
(the nodata value). These are included in the calculation of the stats.
To avoid this, you need to open the file appropriately by required rasterio
to mask those pixels:
worcester_nir_xr_masked = rioxarray.open_rasterio(items[0].assets["nir"].href, masked=True)
Now let’s recalculate the stats:
print("Min of the data is:", worcester_nir_xr_masked.min())
print("Max of the data is:", worcester_nir_xr_masked.max())
print("Mean of the data is:", worcester_nir_xr_masked.mean())
print("Std of the data is:", worcester_nir_xr_masked.std())
print("25th, and 75th quantiles of the data are:", worcester_nir_xr_masked.quantile([0.25, 0.75]))
Min of the data is: <xarray.DataArray ()>
array(1.)
Coordinates:
spatial_ref int64 0
Max of the data is: <xarray.DataArray ()>
array(17704.)
Coordinates:
spatial_ref int64 0
Mean of the data is: <xarray.DataArray ()>
array(2913.7654, dtype=float32)
Coordinates:
spatial_ref int64 0
Std of the data is: <xarray.DataArray ()>
array(1522.73022461)
Coordinates:
spatial_ref int64 0
25th, and 75th quantiles of the data are: <xarray.DataArray (quantile: 2)>
array([2728., 3312.])
Coordinates:
* quantile (quantile) float64 0.25 0.75
These numbers are notably different from the old ones (check the 25th percentile as an example!)
Lastly, you can use the spatial_ref
to retrive the spatial attirbutes of the data as following:
worcester_nir_xr_masked.spatial_ref.attrs
{'crs_wkt': 'PROJCS["WGS 84 / UTM zone 18N",GEOGCS["WGS 84",DATUM["WGS_1984",SPHEROID["WGS 84",6378137,298.257223563,AUTHORITY["EPSG","7030"]],AUTHORITY["EPSG","6326"]],PRIMEM["Greenwich",0,AUTHORITY["EPSG","8901"]],UNIT["degree",0.0174532925199433,AUTHORITY["EPSG","9122"]],AUTHORITY["EPSG","4326"]],PROJECTION["Transverse_Mercator"],PARAMETER["latitude_of_origin",0],PARAMETER["central_meridian",-75],PARAMETER["scale_factor",0.9996],PARAMETER["false_easting",500000],PARAMETER["false_northing",0],UNIT["metre",1,AUTHORITY["EPSG","9001"]],AXIS["Easting",EAST],AXIS["Northing",NORTH],AUTHORITY["EPSG","32618"]]',
'semi_major_axis': 6378137.0,
'semi_minor_axis': 6356752.314245179,
'inverse_flattening': 298.257223563,
'reference_ellipsoid_name': 'WGS 84',
'longitude_of_prime_meridian': 0.0,
'prime_meridian_name': 'Greenwich',
'geographic_crs_name': 'WGS 84',
'horizontal_datum_name': 'World Geodetic System 1984',
'projected_crs_name': 'WGS 84 / UTM zone 18N',
'grid_mapping_name': 'transverse_mercator',
'latitude_of_projection_origin': 0.0,
'longitude_of_central_meridian': -75.0,
'false_easting': 500000.0,
'false_northing': 0.0,
'scale_factor_at_central_meridian': 0.9996,
'spatial_ref': 'PROJCS["WGS 84 / UTM zone 18N",GEOGCS["WGS 84",DATUM["WGS_1984",SPHEROID["WGS 84",6378137,298.257223563,AUTHORITY["EPSG","7030"]],AUTHORITY["EPSG","6326"]],PRIMEM["Greenwich",0,AUTHORITY["EPSG","8901"]],UNIT["degree",0.0174532925199433,AUTHORITY["EPSG","9122"]],AUTHORITY["EPSG","4326"]],PROJECTION["Transverse_Mercator"],PARAMETER["latitude_of_origin",0],PARAMETER["central_meridian",-75],PARAMETER["scale_factor",0.9996],PARAMETER["false_easting",500000],PARAMETER["false_northing",0],UNIT["metre",1,AUTHORITY["EPSG","9001"]],AXIS["Easting",EAST],AXIS["Northing",NORTH],AUTHORITY["EPSG","32618"]]',
'GeoTransform': '699960.0 10.0 0.0 4700040.0 0.0 -10.0'}
Let’s also visualize this raster:
worcester_nir_xr_masked.plot()
<matplotlib.collections.QuadMesh at 0x7fa7c252c0d0>
18.4. Working with Multi-Band Data#
So far we looked into a single band raster, i.e. the nir
band of a Sentinel-2 scene. However, to get a smaller, version of the scene, one may also want to visualize the true-color overview of the region. This is provided as a multi-band raster – a raster dataset that contains more than one band.
worcester_overview = rioxarray.open_rasterio(items[0].assets['visual'].href, overview_level=1)
worcester_overview
<xarray.DataArray (band: 3, y: 2745, x: 2745)> [22605075 values with dtype=uint8] Coordinates: * band (band) int64 1 2 3 * x (x) float64 7e+05 7e+05 7.001e+05 ... 8.097e+05 8.097e+05 * y (y) float64 4.7e+06 4.7e+06 4.7e+06 ... 4.59e+06 4.59e+06 spatial_ref int64 0 Attributes: AREA_OR_POINT: Area OVR_RESAMPLING_ALG: AVERAGE _FillValue: 0 scale_factor: 1.0 add_offset: 0.0
The band number comes first when GeoTiffs are read with the .open_rasterio()
function. As we can see in the xarray.DataArray
object, the shape is now (band: 3, y: 2745, x: 2745)
, with three bands in the band
dimension. It’s always a good idea to examine the shape of the raster array you are working with and make sure it’s what you expect. Many functions, especially the ones that plot images, expect a raster array to have a particular shape.
One can visualize the multi-band data with the DataArray.plot.imshow()
function:
worcester_overview.plot.imshow(size=5, aspect=1)
<matplotlib.image.AxesImage at 0x7fa7c268ca10>
18.4.1. Clipping Raster#
You can use the .rio.clip()
function to clip your raster.
Let’s define a geometry for clipping:
from shapely.geometry import box
minx = 760000
miny = 4620000
maxx = 790000
maxy = 4660000
clipping_geom = box(minx, miny, maxx, maxy)
worcester_overview.rio.clip([clipping_geom])
<xarray.DataArray (band: 3, y: 1000, x: 750)> array([[[ 33, 33, 33, ..., 171, 169, 173], [ 35, 34, 34, ..., 177, 175, 175], [ 36, 36, 36, ..., 177, 175, 174], ..., [166, 167, 167, ..., 153, 156, 159], [166, 167, 166, ..., 153, 157, 161], [166, 165, 164, ..., 154, 153, 162]], [[ 40, 40, 40, ..., 187, 183, 187], [ 43, 42, 43, ..., 192, 191, 191], [ 46, 45, 43, ..., 192, 191, 190], ..., [186, 188, 187, ..., 168, 171, 174], [186, 185, 186, ..., 167, 171, 175], [185, 185, 185, ..., 167, 167, 173]], [[ 43, 43, 42, ..., 178, 177, 181], [ 45, 44, 43, ..., 184, 183, 185], [ 46, 47, 45, ..., 187, 183, 183], ..., [183, 182, 182, ..., 160, 163, 168], [181, 181, 182, ..., 161, 164, 168], [182, 182, 181, ..., 161, 160, 166]]], dtype=uint8) Coordinates: * band (band) int64 1 2 3 * x (x) float64 7.6e+05 7.601e+05 7.601e+05 ... 7.899e+05 7.9e+05 * y (y) float64 4.66e+06 4.66e+06 4.66e+06 ... 4.62e+06 4.62e+06 spatial_ref int64 0 Attributes: AREA_OR_POINT: Area OVR_RESAMPLING_ALG: AVERAGE scale_factor: 1.0 add_offset: 0.0 _FillValue: 0
Another way to clip your raster, is using slice_xy
to slice the array by x,y bounds. In this case, and unlike clip
you do not need a geojson like geometry dict. Here is how it works:
worcester_overview.rio.slice_xy(minx, miny, maxx, maxy)
<xarray.DataArray (band: 3, y: 1000, x: 750)> [2250000 values with dtype=uint8] Coordinates: * band (band) int64 1 2 3 * x (x) float64 7.6e+05 7.601e+05 7.601e+05 ... 7.899e+05 7.9e+05 * y (y) float64 4.66e+06 4.66e+06 4.66e+06 ... 4.62e+06 4.62e+06 spatial_ref int64 0 Attributes: AREA_OR_POINT: Area OVR_RESAMPLING_ALG: AVERAGE _FillValue: 0 scale_factor: 1.0 add_offset: 0.0
While clip
and slice_xy
seem to work the same way, clip
has more functionality and several arguments that can costumize the task. Check the description here.
You can also use the built-in xarray.sel()
method to select values for specific values across each coordinate. For example, we can select the value as following:
worcester_overview.sel(x=[761000], y=[4630000], method="nearest")
<xarray.DataArray (band: 3, y: 1, x: 1)> [3 values with dtype=uint8] Coordinates: * band (band) int64 1 2 3 * x (x) float64 7.61e+05 * y (y) float64 4.63e+06 spatial_ref int64 0 Attributes: AREA_OR_POINT: Area OVR_RESAMPLING_ALG: AVERAGE _FillValue: 0 scale_factor: 1.0 add_offset: 0.0
The argument method
is needed when the provided x
and y
are not the actual value present in the requested coordinate. Xarray
will use the method
to interpolate the values and find the corresponding value for the requested coordinate.
18.5. Handling Large Rasters#
When using xarray.open_rasterio
(and in general any open_dataset function in xarray
the data is loaded from the underlying datastore in memory as NumPy arrays by default.
But you may want to avoid this if you are working with a very large dataset. To do that, you can either pass the cache = False
argument to xarray.open_rasterio
or specify the chunks argument to use dask.
worcester_nir_xr_not_cached = rioxarray.open_rasterio(items[0].assets["nir"].href, cache = False)
From here, you can work with the DataArray
as before and xarray
will only load the parts of DataArray
that you request. For example, if you use sel
to retrieve a value, it will only read that part of the DataArray
:
worcester_nir_xr_not_cached.sel(x=[761000], y=[4630000], method="nearest")
<xarray.DataArray (band: 1, y: 1, x: 1)> [1 values with dtype=uint16] Coordinates: * band (band) int64 1 * x (x) float64 7.61e+05 * y (y) float64 4.63e+06 spatial_ref int64 0 Attributes: AREA_OR_POINT: Area OVR_RESAMPLING_ALG: AVERAGE _FillValue: 0 scale_factor: 1.0 add_offset: 0.0
18.6. Writing Data#
You can use the .rio.to_raster
function to write a rioxarray.DataArray
to disk. Let’s write the clipped version of our multi-band imagery to disk:
worcester_clipped = worcester_overview.rio.clip([clipping_geom])
worcester_clipped.rio.to_raster("worcester_clipped.tif")
Task: Load the clipped raster and the original overview raster into QGIS and inspect their differences.
The difference is in their spatial resolution. Our clipped raster is for overview_level=1
which has a 40m spatial resolution. But the original oveerview has a native spatial resolution of 10m.
Question: What does x
and y
represent? Are they the center coordinates of each pixel, or the edge?
They represent the center coordinates of each pixel. You can verify this by comparing the bounds of the raster to the x and y coordinates of the top left pixel.