15. Raster Processing#

The goal of this lecture is to learn how to do reprojections and mosaicking of rasters in Python. We will use rioxarray as the main package for these tasks.

15.1. Reprojecting a Raster File#

If you work with more than one type of raster data, it is very common that you need to reproject them to the same CRS. There are various ways to do this using methods in rasterio which are also available in rioxarray.

In this section, our goal is to reporject a Sentinel-2 scene to the USDA Crop Data Layer (CDL) CRS. USDA CDL is a national map of croplands (some generic land cover classes) which USDS generates every year. This dataset is shared in a single GeoTIFF file for the Contiguous United States (CONUS).

import xarray as xr
import rioxarray as rxr
from rioxarray.merge import merge_datasets
import numpy as np
import pyproj
from rasterio.enums import Resampling

import os
import requests
from tqdm import tqdm
def download_file(url, local_path):
    """
    This function downloads a file from the web to a local path. 

    Inputs:
    url : string
        A public URL to the file on the web
    local_path : string
        A path to a local directory where the file should be downloaded. This should include the file name. 

    Return:
    None

    """
    if not os.path.exists(local_path):
        print(f"File not found locally. Downloading from {url}...")
        
        try:
            response = requests.get(url, stream=True)
            response.raise_for_status()  # Check if the request was successful
            
            with open(local_path, "wb") as file:
                for chunk in tqdm.tqdm(response.iter_content()):
                    file.write(chunk)
            print("Download complete.")
        except requests.exceptions.RequestException as e:
            print(f"An error occurred: {e}")
    else:
        print("File already exists locally. No download needed.")

In the following, we will use a Sentinel-2 scene available on AWS S3. We have already queried the Earth Search STAC API and retrieved the URL for the COG file of the scene:

s2_scene_path = 'https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/19/T/CG/2023/10/S2A_19TCG_20231026_0_L2A/B04.tif'

We also need to retrieve the USDA CDL data. This is available here, and in the following we donwload and unzip the data locally. Loading this file into memory is not feasible on a personal machine or servers with regular memory capacity.

local_path = "/home/gisuser/raster-processing"
download_file(
    "https://www.nass.usda.gov/Research_and_Science/Cropland/Release/datasets/2022_30m_cdls.zip", 
    f"{local_path}/data/2022_30m_cdls.zip"
)
File already exists locally. No download needed.

Now, let’s open the CDL file. We will use the cache=False argument to not load the data into memory, rather read the metadata.

import zipfile
with zipfile.ZipFile(f"{local_path}/data/2022_30m_cdls.zip","r") as zip_ref:
    zip_ref.extractall(f"{local_path}/data/")
cdl_file = f"{local_path}/data/2022_30m_cdls.tif"
cdl_ds = rxr.open_rasterio(cdl_file, cache=False)
cdl_ds
<xarray.DataArray (band: 1, y: 96523, x: 153811)> Size: 15GB
[14846299153 values with dtype=uint8]
Coordinates:
  * band         (band) int64 8B 1
  * x            (x) float64 1MB -2.356e+06 -2.356e+06 ... 2.258e+06 2.258e+06
  * y            (y) float64 772kB 3.173e+06 3.173e+06 ... 2.77e+05 2.769e+05
    spatial_ref  int64 8B 0
Attributes: (12/22)
    TIFFTAG_SOFTWARE:           E
    TIFFTAG_XRESOLUTION:        1
    TIFFTAG_YRESOLUTION:        1
    TIFFTAG_RESOLUTIONUNIT:     2 (pixels/inch)
    AREA_OR_POINT:              Area
    LAYER_TYPE:                 thematic
    ...                         ...
    STATISTICS_SKIPFACTORX:     1
    STATISTICS_SKIPFACTORY:     1
    STATISTICS_STDDEV:          56.056102758529
    scale_factor:               1.0
    add_offset:                 0.0
    long_name:                  Layer_1

Next, we define functions that are needed for reprojection:

def point_transform(coor, src_crs, target_crs=5070):
    """
    This function transforms a coordinate (x, y) from a source crs to a target crs. 
    
    Inputs:
        coor : list
        a list or tuple containing the coordinates in x, y format. 
        src_crs : `proj4` CRS
            The source CRS of the coordinate (it should be in a format known to `proj4`)
        target_crs : `proj4` CRS
            The target CRS for transforming the coordinate. Default is `EPSG:5070` which is the CRS of USDA CDL data.
    
    Return:
        transformed_coor : list
            A list with the transformed coordinates in x, y format. 
    """
    
    proj = pyproj.Transformer.from_crs(src_crs, target_crs, always_xy=True)
    projected_coor = proj.transform(coor[0], coor[1])
    transformed_coor = [projected_coor[0], projected_coor[1]]
    
    return transformed_coor

def find_nearest(array, value):
    """
    This function returns the closet number to `value` in the `array`.

    Inputs:
        array : numpy array
            An array 
        value : numpy array
            The value to search for nearest in the `array`. 

    Returns
        target : numpy array
            The number closes to the `value` in the `array`
    
    """
    
    idx = (np.abs(array - value)).argmin()
    target = array[idx]
    
    return target


def reproject_s2_to_cdl(tile_path,
                        cdl_ds,
                        resampling_method = Resampling.bilinear):
    
    """
    This function receives the path to a Sentinel-2 scene and reproject it to the target cdl_ds.
    
    Inputs:
        tile_path : string
            The full path to a sentinel-2 scene. This can be the url to an object on the cloud store. 
        cdl_ds : rioxarray dataset
            A rioxarray dataset that is opened with `cache=False` settings
        resampling_method : rasterio resampling method
            The method that rioxarray uses to reproject, default is bilinear. The method should be imported from rasterio.
    
    Returns:
       xds_new : rioxarray dataset
           The reprojected tile in rioxarray dataset format
    """

    s2_ds = rxr.open_rasterio(tile_path)
    
    half_scene_len = np.abs(np.round((s2_ds.x.max().data - s2_ds.x.min().data) / 2))
    
    coor_min = point_transform([s2_ds.x.min().data - half_scene_len, s2_ds.y.min().data - half_scene_len], s2_ds.rio.crs)
    coor_max = point_transform([s2_ds.x.max().data + half_scene_len, s2_ds.y.max().data + half_scene_len], s2_ds.rio.crs)
    
    x0 = find_nearest(cdl_ds.x.data, coor_min[0])
    y0 = find_nearest(cdl_ds.y.data, coor_min[1])
    x1 = find_nearest(cdl_ds.x.data, coor_max[0])
    y1 = find_nearest(cdl_ds.y.data, coor_max[1])
    
    cdl_for_reprojection = cdl_ds.rio.slice_xy(x0, y0, x1, y1)
    
    xds_new = s2_ds.rio.reproject_match(cdl_for_reprojection, resampling = resampling_method)
    
    return xds_new
s2_nearest = reproject_s2_to_cdl(
    s2_scene_path,
    cdl_ds,
    resampling_method = Resampling.nearest
)
s2_nearest.rio.to_raster("s2_nearest.tif")
s2_bilinear = reproject_s2_to_cdl(
    s2_scene_path,
    cdl_ds,
    resampling_method = Resampling.bilinear
)
s2_bilinear.rio.to_raster("s2_bilinear.tif")

Inspect the reprojected scenes in QGIS, and explain the differences between the two methods and the differences between the reprojected scenes and the original scene.

15.2. Using rioxarray with Dask#

In this section, we will review how to use Dask with rioxarray to improve reading COG files. For maximum read performance, the chunking pattern you request with rio.open_rasterio should align with the internal chunking of the COG. Typically this means reading the data in a “row major” format: your chunks should be as wide as possible along the columns. (Check out this page for more information about this)

Let’s try this on a sample scene to read the raster, and calculate the mean value. We have a sample scene from green band of Sentinel-2 available on AWS S3:

import dask 
from dask.distributed import Client, LocalCluster
cluster = LocalCluster()
client = Client(cluster)
print(client.dashboard_link)
http://127.0.0.1:8787/status
sample_scene = "https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/19/T/CF/2023/10/S2A_19TCF_20231026_0_L2A/B03.tif"

First, we use rioxarray without Dask:

%%timeit -n 1 -r 1
green_band = rxr.open_rasterio(sample_scene)
green_band.mean()
20.2 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

Now, let’s add Dask.

%%timeit -n 1 -r 1
green_band_dask = rxr.open_rasterio(sample_scene, chunks=(1, 'auto', -1))
green_band_dask.mean().compute()
24.1 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
%%timeit -n 1 -r 1
green_band_dask = rxr.open_rasterio(sample_scene, lock=False, chunks=(1, 'auto', -1))
green_band_dask.mean().compute()
12.6 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

15.3. Mosaicking Rasters#

Let’s say you are interested in generating an RGB image of Sentinel-2 across your area of interest defined in the following aoi_box.

from shapely.geometry import box
import shapely
import json
aoi_box = box(-70.74295109234716, 41.01426426532743, -69.46303157927986, 42.0218426094371)
aoi_box
../_images/1b4395c77e09f16022e94a486fdf8ecf249d712b925c630e04c887fc035b9cb7.svg

First, let’s visualize this bbox

import leafmap
m = leafmap.Map(center=[41.633484, -70.154371], zoom=8, height="500px")
m.add_geojson(json.loads(shapely.to_geojson(aoi_box)))
m

We use the PySTAC client to search for imagery across this area:

import pystac_client
import planetary_computer

def retrieve_stac_items_bbox(url, collection, bbox, max_items = 10, pc_flag = False, query = None): 
    """
    This function retrieves the latest STAC items from a STAC API
    
    Parameters
    ----------
    url : str
        The STAC API URL
    collection : str
        Collection ID to search in the STAC API
    bbox : dict
       A tuple or list with coordinates of the target area for API search in the format (min x, min y, max x, max y)
    pc_flag : boolean
        A boolean flag to specify whether the STAC API is Microsoft's planetary computer API. Default is set to False.
    query : list
        A list of query parameters for STAC API. Default is set to None. 
    max_items: integer
        Maximum number of items to retrieve from the STAC API. Default is 1. 

    Returns
    -------
    items : Generator
        The STAC item that has the latest date of acquisition given all the query parameters. 
        
    """
    if pc_flag:
        modifier = planetary_computer.sign_inplace
    else:
        modifier = None
        
    catalog = pystac_client.Client.open(
        url = url,
        modifier = modifier,
    )

    search_results = catalog.search(
        collections = [collection],
        bbox = bbox,
        query = query,
        sortby = ["-properties.datetime"],
        max_items = max_items
    )
    items = search_results.item_collection()

    return items
items = retrieve_stac_items_bbox(url = "https://earth-search.aws.element84.com/v1",
                                collection = "sentinel-2-l2a",
                                bbox = aoi_box.bounds,
                                max_items = 4,
                                pc_flag = False,
                                query = ["s2:nodata_pixel_percentage<1"]
                                )
for item in items:
    m.add_geojson(item.geometry, style = {"color": "red"})
m

So, as we can see in the map there are four scenes that overlap with our area of interest. We need to mosaick the overlapping Sentinel-2 scenes to generate one raster.

We will use merge_datasets from rioxarray

def gen_stac_asset_urls(items, asset):
    """
    This function receives an items collection returned by a STAC API, and returns
    the urls of the requested `asset` in a list. 

    Inputs:
        items : json collection
            A STAC items collection returned by STAC API
        asset : string
            Name of an asset present in the `items` collection

    Returns:
        urls : list
            List of all usls related to the `asset`
            
    """
    
    urls = []
    for item in items:
        urls.append(item.assets[asset].href)
    
    return urls
red_urls = gen_stac_asset_urls(items, "red")
green_urls = gen_stac_asset_urls(items, "green")
blue_urls = gen_stac_asset_urls(items, "blue")
blue_urls
['https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/19/T/CF/2024/10/S2A_19TCF_20241030_0_L2A/B02.tif',
 'https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/19/T/DF/2024/10/S2A_19TDF_20241030_0_L2A/B02.tif',
 'https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/19/T/CG/2024/10/S2A_19TCG_20241030_0_L2A/B02.tif',
 'https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/19/T/DG/2024/10/S2A_19TDG_20241030_0_L2A/B02.tif']
temp_dataarray = rxr.open_rasterio(red_urls[0])
temp_dataset = rxr.open_rasterio(red_urls[0]).to_dataset(name="red")
temp_dataarray
<xarray.DataArray (band: 1, y: 10980, x: 10980)> Size: 241MB
[120560400 values with dtype=uint16]
Coordinates:
  * band         (band) int64 8B 1
  * x            (x) float64 88kB 3e+05 3e+05 3e+05 ... 4.098e+05 4.098e+05
  * y            (y) float64 88kB 4.6e+06 4.6e+06 4.6e+06 ... 4.49e+06 4.49e+06
    spatial_ref  int64 8B 0
Attributes:
    OVR_RESAMPLING_ALG:  AVERAGE
    AREA_OR_POINT:       Area
    _FillValue:          0
    scale_factor:        1.0
    add_offset:          0.0
temp_dataset
<xarray.Dataset> Size: 241MB
Dimensions:      (band: 1, x: 10980, y: 10980)
Coordinates:
  * band         (band) int64 8B 1
  * x            (x) float64 88kB 3e+05 3e+05 3e+05 ... 4.098e+05 4.098e+05
  * y            (y) float64 88kB 4.6e+06 4.6e+06 4.6e+06 ... 4.49e+06 4.49e+06
    spatial_ref  int64 8B 0
Data variables:
    red          (band, y, x) uint16 241MB ...
red_datasets = [
    rxr.open_rasterio(url, lock=False, chunks=(1, 'auto', -1)).to_dataset(name="red") for url in red_urls
]
red_datasets[0]
<xarray.Dataset> Size: 241MB
Dimensions:      (band: 1, x: 10980, y: 10980)
Coordinates:
  * band         (band) int64 8B 1
  * x            (x) float64 88kB 3e+05 3e+05 3e+05 ... 4.098e+05 4.098e+05
  * y            (y) float64 88kB 4.6e+06 4.6e+06 4.6e+06 ... 4.49e+06 4.49e+06
    spatial_ref  int64 8B 0
Data variables:
    red          (band, y, x) uint16 241MB dask.array<chunksize=(1, 6111, 10980), meta=np.ndarray>
mosaic_red = merge_datasets(red_datasets)
mosaic_red.red.rio.to_raster("red_mosaic.tif")
mosaic_red = mosaic_red.rio.clip([aoi_box], crs=4326)
mosaic_red.red.rio.to_raster("red_clipped.tif")
green_datasets = [
    rxr.open_rasterio(url, lock=False, chunks=(1, 'auto', -1)).to_dataset(name="green") for url in green_urls
]
blue_datasets = [
    rxr.open_rasterio(url, lock=False, chunks=(1, 'auto', -1)).to_dataset(name="blue") for url in blue_urls
]
mosaic_blue = merge_datasets(blue_datasets)
mosaic_blue = mosaic_blue.rio.clip([aoi_box], crs=4326)
mosaic_green = merge_datasets(green_datasets)
mosaic_green = mosaic_green.rio.clip([aoi_box], crs=4326)
xr.merge([mosaic_red, mosaic_green, mosaic_blue])
<xarray.Dataset> Size: 735MB
Dimensions:      (band: 1, x: 10823, y: 11323)
Coordinates:
  * band         (band) int64 8B 1
  * x            (x) float64 87kB 3.534e+05 3.535e+05 ... 4.617e+05 4.617e+05
  * y            (y) float64 91kB 4.654e+06 4.654e+06 ... 4.54e+06 4.54e+06
    spatial_ref  int64 8B 0
Data variables:
    red          (band, y, x) uint16 245MB 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0
    green        (band, y, x) uint16 245MB 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0
    blue         (band, y, x) uint16 245MB 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0
rgb = xr.merge([mosaic_red, mosaic_green, mosaic_blue])
rgb
<xarray.Dataset> Size: 735MB
Dimensions:      (band: 1, x: 10823, y: 11323)
Coordinates:
  * band         (band) int64 8B 1
  * x            (x) float64 87kB 3.534e+05 3.535e+05 ... 4.617e+05 4.617e+05
  * y            (y) float64 91kB 4.654e+06 4.654e+06 ... 4.54e+06 4.54e+06
    spatial_ref  int64 8B 0
Data variables:
    red          (band, y, x) uint16 245MB 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0
    green        (band, y, x) uint16 245MB 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0
    blue         (band, y, x) uint16 245MB 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0