18. Geospatial Raster Data in Python#

Attribution: Parts of this tutorial are developed based on the content from the following great sources: Parts of this tutorial are developed based on the content from the following great sources: Introduction to Python for Geographic Data AnalysisIntroduction to Python for Geographic Data Analysis; Introduction to Raster Data in PythonIntroduction to Raster Data in Python; and Introduction to Geospatial Raster and Vector Data with PythonIntroduction to Geospatial Raster and Vector Data with Python

The purpose of this lecture is to learn rasterio and rioxarray for loading and working with raster data.

18.1. Find a Raster Scene#

First, Let’s find a Sentinel-2 scene to work with. We will use the PySTAC client to retrieve the latest cloud free scene across Clark University campus:

from pystac_client import Client

api_url = "https://earth-search.aws.element84.com/v1"
client = Client.open(api_url)
collections = client.get_collections()
point = dict(type="Point", coordinates=(-71.822833, 42.250809))
search_results = client.search(
    collections=["sentinel-2-l2a"],
    intersects=point,
    query=["eo:cloud_cover<5", "s2:nodata_pixel_percentage>20"],
    sortby=["-properties.datetime"],
    max_items=5,
)
items = search_results.item_collection()

(Note that we also added s2:nodata_pixel_percentage to our query parameters to find scenes that have some missing pixels. We want to use these to demonstrate how you would handle missing data later on)

18.2. Working with Rasterio#

Rasterio is a highly useful module for raster processing which you can use for reading and writing several different raster formats in Python. Rasterio is based on GDAL and Python automatically registers all known GDAL drivers for reading supported formats when importing the module.

18.2.1. Read the Metadata#

import rasterio
import os
import numpy as np

Let’s open the nir band of the first item:

(Reminder: you can inspect the metadata of each item, and see what bands are available in the catalog using the assets.keys() property. e.g. items[0].assets.keys())

worcester_nir = rasterio.open(items[0].assets["nir"].href)

And let’s see what is the type of worcester_nir:

type(worcester_nir)
rasterio.io.DatasetReader

You can see that our raster variable is a rasterio.io.DatasetReader type which means that we have opened the file for reading. This does not contain the value of the bands yet.

You can retrieve the metadata of this raster as following:

print("CRS of the scene is:", worcester_nir.crs)
print("Transform of the scene is:", worcester_nir.transform)
print("Width of the scene is:", worcester_nir.width)
print("Height of the scene is:", worcester_nir.height)
print("Number of bands in the scene is:", worcester_nir.count)
print("Bounds of the scene are:", worcester_nir.bounds)
print("The no data values for all channels are:", worcester_nir.nodatavals)
CRS of the scene is: EPSG:32618
Transform of the scene is: | 10.00, 0.00, 699960.00|
| 0.00,-10.00, 4700040.00|
| 0.00, 0.00, 1.00|
Width of the scene is: 10980
Height of the scene is: 10980
Number of bands in the scene is: 1
Bounds of the scene are: BoundingBox(left=699960.0, bottom=4590240.0, right=809760.0, top=4700040.0)
The no data values for all channels are: (0.0,)

If you need a reminder about what is a transform, check this reference.

Alternatively, you can retrieve all metadata of the scene using the following:

print("All metada of the scene:")
print(worcester_nir.meta)
All metada of the scene:
{'driver': 'GTiff', 'dtype': 'uint16', 'nodata': 0.0, 'width': 10980, 'height': 10980, 'count': 1, 'crs': CRS.from_epsg(32618), 'transform': Affine(10.0, 0.0, 699960.0,
       0.0, -10.0, 4700040.0)}

18.2.2. Get Band Data#

You can read the data from one or multiple bands using the read method:

nir = worcester_nir.read(1)

Note: rasterio band numbering starts from 1 unlike Python that starts from 0.

Let’s examine the data type this band:

nir
array([[   0,    0,    0, ..., 2784, 2754, 2574],
       [   0,    0,    0, ..., 2958, 2906, 2512],
       [   0,    0,    0, ..., 3000, 2814, 2354],
       ...,
       [   0,    0,    0, ...,  767,  758,  754],
       [   0,    0,    0, ...,  772,  738,  778],
       [   0,    0,    0, ...,  784,  794,  771]], dtype=uint16)
print(nir.dtype)
uint16

This data type in 16-bit unsigned integer. Unsigned integer is always equal or greater than zero and signed integer can store also negative values. For example, an unsigned 16-bit integer can store 2^16 (=65,536) values ranging from 0 to 65,535.

Let’s now visualize this band. We can use the plot function from rasterio:

18.2.3. Visualize Band Data#

from rasterio import plot
plot.show(nir)
../_images/332031d1484e9fe7f5ca67b921460bf983ac5a85d91b2a47ca4d3348b783077b.png
<Axes: >

As you can see in this case the band data is treated as a regular (non-geospatial) array and the axes start from 0 instead of the actual coordinates of the raster. This happens as the nir variable is a numpy array and has no geospatial attirbutes. You can use the rasterio dataset worcester_nir and get the plot with the actual bounds:

plot.show(worcester_nir)
../_images/c7eafc5e578337e2744843cbf8a702b31c7390441a17e46a35fd6c46ccc19872.png
<Axes: >

18.3. Rioxarray#

rioxarray based on the popular rasterio package for working with rasters and xarray for working with multi-dimensional arrays. rioxarray extends xarray by providing top-level functions (e.g. the open_rasterio function to open raster datasets) and by adding a set of methods to the main objects of the xarray package (the Dataset and the DataArray). These additional methods are made available via the rio accessor and become available from xarray objects after importing rioxarray.

We can implement the steps we carried out with rasterio in the previous section using rioxarray as following:

import rioxarray
worcester_nir_xr = rioxarray.open_rasterio(items[0].assets["nir"].href)

Let’s look at the worcester_nir_xr:

worcester_nir_xr
<xarray.DataArray (band: 1, y: 10980, x: 10980)>
[120560400 values with dtype=uint16]
Coordinates:
  * band         (band) int64 1
  * x            (x) float64 7e+05 7e+05 7e+05 ... 8.097e+05 8.097e+05 8.098e+05
  * y            (y) float64 4.7e+06 4.7e+06 4.7e+06 ... 4.59e+06 4.59e+06
    spatial_ref  int64 0
Attributes:
    AREA_OR_POINT:       Area
    OVR_RESAMPLING_ALG:  AVERAGE
    _FillValue:          0
    scale_factor:        1.0
    add_offset:          0.0

The output tells us that we are looking at an xarray.DataArray, with 1 band, 10980 rows, and 10980 columns. We can also see the number of pixel values in the DataArray, and the type of those pixel values, which is unsigned integer (or uint16). The DataArray also stores different values for the coordinates of the DataArray. When using rioxarray, the term coordinates refers to spatial coordinates like x and y but also the band coordinate. Each of these sequences of values has its own data type, like int64 for the spatial coordinates and float64 for the x and y coordinates.

Similar to rasterio, we can inspect the attributes of the raster. They can be accessed from the rio accessor like .rio.crs, .rio.nodata, and .rio.bounds(), which contain the metadata for the file we opened.

print("CRS of the scene is:", worcester_nir_xr.rio.crs)
print("Transform of the scene is:", worcester_nir_xr.rio.transform())
print("Width of the scene is:", worcester_nir_xr.rio.width)
print("Height of the scene is:", worcester_nir_xr.rio.height)
print("Number of bands in the scene is:", worcester_nir_xr.rio.count)
print("Bounds of the scene are:", worcester_nir_xr.rio.bounds())
print("The no data values for all channels are:", worcester_nir_xr.rio.nodata)
CRS of the scene is: EPSG:32618
Transform of the scene is: | 10.00, 0.00, 699960.00|
| 0.00,-10.00, 4700040.00|
| 0.00, 0.00, 1.00|
Width of the scene is: 10980
Height of the scene is: 10980
Number of bands in the scene is: 1
Bounds of the scene are: (699960.0, 4590240.0, 809760.0, 4700040.0)
The no data values for all channels are: 0

You can get the band data using .data:

worcester_nir_xr.data
array([[[   0,    0,    0, ..., 2784, 2754, 2574],
        [   0,    0,    0, ..., 2958, 2906, 2512],
        [   0,    0,    0, ..., 3000, 2814, 2354],
        ...,
        [   0,    0,    0, ...,  767,  758,  754],
        [   0,    0,    0, ...,  772,  738,  778],
        [   0,    0,    0, ...,  784,  794,  771]]], dtype=uint16)

You can also use .values to get the band data but .values will try to conver the data to numpy.array if the data is not in that type (e.g. if it’s a dask.array). So you should be careful with calling .values.

You can also visualize the raster:

worcester_nir_xr.plot()
<matplotlib.collections.QuadMesh at 0x7fa7c2604410>
../_images/b138d563bf168b9ff32d88ad99032c095188186a9a44f9419f29ee288c3fbb96.png

You can also use built-in xarray functions to calculate statistics of the data:

print("Min of the data is:", worcester_nir_xr.min())
print("Max of the data is:", worcester_nir_xr.max())
print("Mean of the data is:", worcester_nir_xr.mean())
print("Std of the data is:", worcester_nir_xr.std())
print("25th, and 75th quantiles of the data are:", worcester_nir_xr.quantile([0.25, 0.75]))
Min of the data is: <xarray.DataArray ()>
array(0, dtype=uint16)
Coordinates:
    spatial_ref  int64 0
Max of the data is: <xarray.DataArray ()>
array(17704, dtype=uint16)
Coordinates:
    spatial_ref  int64 0
Mean of the data is: <xarray.DataArray ()>
array(1485.22314339)
Coordinates:
    spatial_ref  int64 0
Std of the data is: <xarray.DataArray ()>
array(1531.22790953)
Coordinates:
    spatial_ref  int64 0
25th, and 75th quantiles of the data are: <xarray.DataArray (quantile: 2)>
array([   0., 3086.])
Coordinates:
  * quantile  (quantile) float64 0.25 0.75

But there is an issue with these statistics. As you can see in the plot, this scene has a lot of missing pixels that are set to 0 (the nodata value). These are included in the calculation of the stats.

To avoid this, you need to open the file appropriately by required rasterio to mask those pixels:

worcester_nir_xr_masked = rioxarray.open_rasterio(items[0].assets["nir"].href, masked=True)

Now let’s recalculate the stats:

print("Min of the data is:", worcester_nir_xr_masked.min())
print("Max of the data is:", worcester_nir_xr_masked.max())
print("Mean of the data is:", worcester_nir_xr_masked.mean())
print("Std of the data is:", worcester_nir_xr_masked.std())
print("25th, and 75th quantiles of the data are:", worcester_nir_xr_masked.quantile([0.25, 0.75]))
Min of the data is: <xarray.DataArray ()>
array(1.)
Coordinates:
    spatial_ref  int64 0
Max of the data is: <xarray.DataArray ()>
array(17704.)
Coordinates:
    spatial_ref  int64 0
Mean of the data is: <xarray.DataArray ()>
array(2913.7654, dtype=float32)
Coordinates:
    spatial_ref  int64 0
Std of the data is: <xarray.DataArray ()>
array(1522.73022461)
Coordinates:
    spatial_ref  int64 0
25th, and 75th quantiles of the data are: <xarray.DataArray (quantile: 2)>
array([2728., 3312.])
Coordinates:
  * quantile  (quantile) float64 0.25 0.75

These numbers are notably different from the old ones (check the 25th percentile as an example!)

Lastly, you can use the spatial_ref to retrive the spatial attirbutes of the data as following:

worcester_nir_xr_masked.spatial_ref.attrs
{'crs_wkt': 'PROJCS["WGS 84 / UTM zone 18N",GEOGCS["WGS 84",DATUM["WGS_1984",SPHEROID["WGS 84",6378137,298.257223563,AUTHORITY["EPSG","7030"]],AUTHORITY["EPSG","6326"]],PRIMEM["Greenwich",0,AUTHORITY["EPSG","8901"]],UNIT["degree",0.0174532925199433,AUTHORITY["EPSG","9122"]],AUTHORITY["EPSG","4326"]],PROJECTION["Transverse_Mercator"],PARAMETER["latitude_of_origin",0],PARAMETER["central_meridian",-75],PARAMETER["scale_factor",0.9996],PARAMETER["false_easting",500000],PARAMETER["false_northing",0],UNIT["metre",1,AUTHORITY["EPSG","9001"]],AXIS["Easting",EAST],AXIS["Northing",NORTH],AUTHORITY["EPSG","32618"]]',
 'semi_major_axis': 6378137.0,
 'semi_minor_axis': 6356752.314245179,
 'inverse_flattening': 298.257223563,
 'reference_ellipsoid_name': 'WGS 84',
 'longitude_of_prime_meridian': 0.0,
 'prime_meridian_name': 'Greenwich',
 'geographic_crs_name': 'WGS 84',
 'horizontal_datum_name': 'World Geodetic System 1984',
 'projected_crs_name': 'WGS 84 / UTM zone 18N',
 'grid_mapping_name': 'transverse_mercator',
 'latitude_of_projection_origin': 0.0,
 'longitude_of_central_meridian': -75.0,
 'false_easting': 500000.0,
 'false_northing': 0.0,
 'scale_factor_at_central_meridian': 0.9996,
 'spatial_ref': 'PROJCS["WGS 84 / UTM zone 18N",GEOGCS["WGS 84",DATUM["WGS_1984",SPHEROID["WGS 84",6378137,298.257223563,AUTHORITY["EPSG","7030"]],AUTHORITY["EPSG","6326"]],PRIMEM["Greenwich",0,AUTHORITY["EPSG","8901"]],UNIT["degree",0.0174532925199433,AUTHORITY["EPSG","9122"]],AUTHORITY["EPSG","4326"]],PROJECTION["Transverse_Mercator"],PARAMETER["latitude_of_origin",0],PARAMETER["central_meridian",-75],PARAMETER["scale_factor",0.9996],PARAMETER["false_easting",500000],PARAMETER["false_northing",0],UNIT["metre",1,AUTHORITY["EPSG","9001"]],AXIS["Easting",EAST],AXIS["Northing",NORTH],AUTHORITY["EPSG","32618"]]',
 'GeoTransform': '699960.0 10.0 0.0 4700040.0 0.0 -10.0'}

Let’s also visualize this raster:

worcester_nir_xr_masked.plot()
<matplotlib.collections.QuadMesh at 0x7fa7c252c0d0>
../_images/63fb19ea6a16e36abca4d146c88060283347d5da2245dc75391b14d518e4cf9e.png

18.4. Working with Multi-Band Data#

So far we looked into a single band raster, i.e. the nir band of a Sentinel-2 scene. However, to get a smaller, version of the scene, one may also want to visualize the true-color overview of the region. This is provided as a multi-band raster – a raster dataset that contains more than one band.

worcester_overview = rioxarray.open_rasterio(items[0].assets['visual'].href, overview_level=1)
worcester_overview
<xarray.DataArray (band: 3, y: 2745, x: 2745)>
[22605075 values with dtype=uint8]
Coordinates:
  * band         (band) int64 1 2 3
  * x            (x) float64 7e+05 7e+05 7.001e+05 ... 8.097e+05 8.097e+05
  * y            (y) float64 4.7e+06 4.7e+06 4.7e+06 ... 4.59e+06 4.59e+06
    spatial_ref  int64 0
Attributes:
    AREA_OR_POINT:       Area
    OVR_RESAMPLING_ALG:  AVERAGE
    _FillValue:          0
    scale_factor:        1.0
    add_offset:          0.0

The band number comes first when GeoTiffs are read with the .open_rasterio() function. As we can see in the xarray.DataArray object, the shape is now (band: 3, y: 2745, x: 2745), with three bands in the band dimension. It’s always a good idea to examine the shape of the raster array you are working with and make sure it’s what you expect. Many functions, especially the ones that plot images, expect a raster array to have a particular shape.

One can visualize the multi-band data with the DataArray.plot.imshow() function:

worcester_overview.plot.imshow(size=5, aspect=1)
<matplotlib.image.AxesImage at 0x7fa7c268ca10>
../_images/2db0b9ec4be45fd1241d9213138cf3f2fd42ca1c3cc991810222ed9b3ed3d995.png

18.4.1. Clipping Raster#

You can use the .rio.clip() function to clip your raster.

Let’s define a geometry for clipping:

from shapely.geometry import box

minx = 760000
miny = 4620000
maxx = 790000
maxy = 4660000

clipping_geom = box(minx, miny, maxx, maxy)
worcester_overview.rio.clip([clipping_geom])
<xarray.DataArray (band: 3, y: 1000, x: 750)>
array([[[ 33,  33,  33, ..., 171, 169, 173],
        [ 35,  34,  34, ..., 177, 175, 175],
        [ 36,  36,  36, ..., 177, 175, 174],
        ...,
        [166, 167, 167, ..., 153, 156, 159],
        [166, 167, 166, ..., 153, 157, 161],
        [166, 165, 164, ..., 154, 153, 162]],

       [[ 40,  40,  40, ..., 187, 183, 187],
        [ 43,  42,  43, ..., 192, 191, 191],
        [ 46,  45,  43, ..., 192, 191, 190],
        ...,
        [186, 188, 187, ..., 168, 171, 174],
        [186, 185, 186, ..., 167, 171, 175],
        [185, 185, 185, ..., 167, 167, 173]],

       [[ 43,  43,  42, ..., 178, 177, 181],
        [ 45,  44,  43, ..., 184, 183, 185],
        [ 46,  47,  45, ..., 187, 183, 183],
        ...,
        [183, 182, 182, ..., 160, 163, 168],
        [181, 181, 182, ..., 161, 164, 168],
        [182, 182, 181, ..., 161, 160, 166]]], dtype=uint8)
Coordinates:
  * band         (band) int64 1 2 3
  * x            (x) float64 7.6e+05 7.601e+05 7.601e+05 ... 7.899e+05 7.9e+05
  * y            (y) float64 4.66e+06 4.66e+06 4.66e+06 ... 4.62e+06 4.62e+06
    spatial_ref  int64 0
Attributes:
    AREA_OR_POINT:       Area
    OVR_RESAMPLING_ALG:  AVERAGE
    scale_factor:        1.0
    add_offset:          0.0
    _FillValue:          0

Another way to clip your raster, is using slice_xy to slice the array by x,y bounds. In this case, and unlike clip you do not need a geojson like geometry dict. Here is how it works:

worcester_overview.rio.slice_xy(minx, miny, maxx, maxy)
<xarray.DataArray (band: 3, y: 1000, x: 750)>
[2250000 values with dtype=uint8]
Coordinates:
  * band         (band) int64 1 2 3
  * x            (x) float64 7.6e+05 7.601e+05 7.601e+05 ... 7.899e+05 7.9e+05
  * y            (y) float64 4.66e+06 4.66e+06 4.66e+06 ... 4.62e+06 4.62e+06
    spatial_ref  int64 0
Attributes:
    AREA_OR_POINT:       Area
    OVR_RESAMPLING_ALG:  AVERAGE
    _FillValue:          0
    scale_factor:        1.0
    add_offset:          0.0

While clip and slice_xy seem to work the same way, clip has more functionality and several arguments that can costumize the task. Check the description here.

You can also use the built-in xarray.sel() method to select values for specific values across each coordinate. For example, we can select the value as following:

worcester_overview.sel(x=[761000], y=[4630000], method="nearest")
<xarray.DataArray (band: 3, y: 1, x: 1)>
[3 values with dtype=uint8]
Coordinates:
  * band         (band) int64 1 2 3
  * x            (x) float64 7.61e+05
  * y            (y) float64 4.63e+06
    spatial_ref  int64 0
Attributes:
    AREA_OR_POINT:       Area
    OVR_RESAMPLING_ALG:  AVERAGE
    _FillValue:          0
    scale_factor:        1.0
    add_offset:          0.0

The argument method is needed when the provided x and y are not the actual value present in the requested coordinate. Xarray will use the method to interpolate the values and find the corresponding value for the requested coordinate.

18.5. Handling Large Rasters#

When using xarray.open_rasterio (and in general any open_dataset function in xarray the data is loaded from the underlying datastore in memory as NumPy arrays by default.

But you may want to avoid this if you are working with a very large dataset. To do that, you can either pass the cache = False argument to xarray.open_rasterio or specify the chunks argument to use dask.

worcester_nir_xr_not_cached = rioxarray.open_rasterio(items[0].assets["nir"].href, cache = False)

From here, you can work with the DataArray as before and xarray will only load the parts of DataArray that you request. For example, if you use sel to retrieve a value, it will only read that part of the DataArray:

worcester_nir_xr_not_cached.sel(x=[761000], y=[4630000], method="nearest")
<xarray.DataArray (band: 1, y: 1, x: 1)>
[1 values with dtype=uint16]
Coordinates:
  * band         (band) int64 1
  * x            (x) float64 7.61e+05
  * y            (y) float64 4.63e+06
    spatial_ref  int64 0
Attributes:
    AREA_OR_POINT:       Area
    OVR_RESAMPLING_ALG:  AVERAGE
    _FillValue:          0
    scale_factor:        1.0
    add_offset:          0.0

18.6. Writing Data#

You can use the .rio.to_raster function to write a rioxarray.DataArray to disk. Let’s write the clipped version of our multi-band imagery to disk:

worcester_clipped = worcester_overview.rio.clip([clipping_geom])
worcester_clipped.rio.to_raster("worcester_clipped.tif")

Task: Load the clipped raster and the original overview raster into QGIS and inspect their differences.

The difference is in their spatial resolution. Our clipped raster is for overview_level=1 which has a 40m spatial resolution. But the original oveerview has a native spatial resolution of 10m.

Question: What does x and y represent? Are they the center coordinates of each pixel, or the edge?

They represent the center coordinates of each pixel. You can verify this by comparing the bounds of the raster to the x and y coordinates of the top left pixel.