19. Raster Processing#

The goal of this lecture is to learn how to do reprojections and mosaicking of rasters in Python. We will use rioxarray as the main package for these tasks.

19.1. Reprojecting Raster#

If you work with more than one type of raster data, it is very common to would like to reproject them to the same CRS. There are various ways to do this using functions in rasterio which are also available in rioxarray.

In this section, our goal is to lean how to reporject a Sentinel-2 scene to the USDA Crop Data Layer (CDL) CRS.

import rioxarray
import numpy as np
import pyproj
from rasterio.enums import Resampling

In the following, we will use a Sentinel-2 scene available on AWS S3. We have already queried the Earth Search STAC API and retrieved the URL for the COG file of the scene:

scene_path = 'https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/19/T/CG/2023/10/S2A_19TCG_20231026_0_L2A/B04.tif'

We also need to retrieve the USDA CDL data. This is available on this website, and in the following we donwload and unzip the data locally. This dataset include one GeoTIFF file for all of the Contiguous United States (CONUS). Loading this file into memory is not feasible on a personal machine or cloud servers with normal memory. We will learn how to work with this data:

! wget https://www.nass.usda.gov/Research_and_Science/Cropland/Release/datasets/2022_30m_cdls.zip
! unzip 2022_30m_cdls.zip
--2023-10-30 14:25:03--  https://www.nass.usda.gov/Research_and_Science/Cropland/Release/datasets/2022_30m_cdls.zip
Resolving www.nass.usda.gov (www.nass.usda.gov)... 23.214.229.251, 2a02:26f0:6d00:19b::2938, 2a02:26f0:6d00:1ad::2938
Connecting to www.nass.usda.gov (www.nass.usda.gov)|23.214.229.251|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2070138298 (1.9G) [application/zip]
Saving to: ‘2022_30m_cdls.zip’

2022_30m_cdls.zip   100%[===================>]   1.93G   184MB/s    in 11s     

2023-10-30 14:25:17 (184 MB/s) - ‘2022_30m_cdls.zip’ saved [2070138298/2070138298]

Archive:  2022_30m_cdls.zip
  inflating: 2022_30m_cdls.aux       
  inflating: 2022_30m_cdls.tfw       
  inflating: 2022_30m_cdls.tif       
  inflating: 2022_30m_cdls.tif.ovr   
  inflating: metadata_Cropland-Data-Layer-2022.htm  

Now, let’s open the CDL file. We will use the cache=False argument to not load the data into memory, rather read the metadata.

cdl_file = "2022_30m_cdls.tif"
cdl_ds = rioxarray.open_rasterio(cdl_file, cache=False)
cdl_ds
<xarray.DataArray (band: 1, y: 96523, x: 153811)>
[14846299153 values with dtype=uint8]
Coordinates:
  * band         (band) int64 1
  * x            (x) float64 -2.356e+06 -2.356e+06 ... 2.258e+06 2.258e+06
  * y            (y) float64 3.173e+06 3.173e+06 ... 2.77e+05 2.769e+05
    spatial_ref  int64 0
Attributes: (12/22)
    AREA_OR_POINT:              Area
    TIFFTAG_RESOLUTIONUNIT:     2 (pixels/inch)
    TIFFTAG_SOFTWARE:           E
    TIFFTAG_XRESOLUTION:        1
    TIFFTAG_YRESOLUTION:        1
    LAYER_TYPE:                 thematic
    ...                         ...
    STATISTICS_SKIPFACTORX:     1
    STATISTICS_SKIPFACTORY:     1
    STATISTICS_STDDEV:          56.056102758529
    scale_factor:               1.0
    add_offset:                 0.0
    long_name:                  Layer_1

Next, we define two function that are needed for reprojection:

def point_transform(coor, src_crs, target_crs=5070):
    """
    This function transforms a coordinate (x, y) from a source crs to a target crs. 
    
    Args:
    - coor: a list or tuple containing the coordinates in x, y format. 
    - src_crs: the source CRS of the coordinate (it should be in a format known to `proj4`)
    - target_crs: the targeted CRS for transforming the coordinate (`EPSG:5070` is the CRS of 
                  USDA CDL data.
    
    Return:
    - transformed_coor: a list with the transformed coordinates in x, y format. 
    """
    
    proj = pyproj.Transformer.from_crs(src_crs, target_crs, always_xy=True)
    projected_coor = proj.transform(coor[0], coor[1])
    transformed_coor = [projected_coor[0], projected_coor[1]]
    
    return transformed_coor

def find_nearest(array, value):
    """
    This function returns the closet number to `value` in the `array`.
    
    """
    
    idx = (np.abs(array - value)).argmin()
    
    return array[idx]
def reproject_s2_to_cdl(tile_path,
                        cdl_ds,
                        resampling_method = Resampling.bilinear):
    
    """
    This function receives the path to a Sentinel-2 scene and reproject it to the targeting cdl_ds.
    
    Assumptions:
    - tile_path is a full path that end with .tif. This can be the url to an object on the cloud store. 
    - cdl_ds is a rioxarray dataset that is opened with `cache=False` setting.
    
    
    Args:
    - tile_path: The full path to a sentinel-2 scene
    - resampling_method: The method that rioxarray uses to reproject, default is bilinear
    
    Returns:
    - 
    """

    s2_ds = rioxarray.open_rasterio(tile_path)
    
    half_scene_len = np.abs(np.round((s2_ds.x.max().data - s2_ds.x.min().data) / 2))
    
    coor_min = point_transform([s2_ds.x.min().data - half_scene_len, s2_ds.y.min().data - half_scene_len], s2_ds.rio.crs)
    coor_max = point_transform([s2_ds.x.max().data + half_scene_len, s2_ds.y.max().data + half_scene_len], s2_ds.rio.crs)
    
    x0 = find_nearest(cdl_ds.x.data, coor_min[0])
    y0 = find_nearest(cdl_ds.y.data, coor_min[1])
    x1 = find_nearest(cdl_ds.x.data, coor_max[0])
    y1 = find_nearest(cdl_ds.y.data, coor_max[1])
    
    cdl_for_reprojection = cdl_ds.rio.slice_xy(x0, y0, x1, y1)
    
    xds_new = s2_ds.rio.reproject_match(cdl_for_reprojection, resampling = resampling_method)
    
    return xds_new
s2_nearest = reproject_s2_to_cdl(
    scene_path,
    cdl_ds,
    resampling_method = Resampling.nearest
)
s2_nearest.rio.to_raster("s2_nearest.tif")
s2_bilinear = reproject_s2_to_cdl(
    scene_path,
    cdl_ds,
    resampling_method = Resampling.bilinear
)
s2_bilinear.rio.to_raster("s2_bilinear.tif")

Inspect the reprojected scenes in QGIS, and explain the differences between the two methods and the differences between the reprojected scenes and the original scene.

19.2. Using rioxarray with Dask#

In this section, we will review how to use Dask with rioxarray to improve reading COG files. For maximum read performance, the chunking pattern you request with rio.open_rasterio should align with the internal chunking of the COG. Typically this means reading the data in a “row major” format: your chunks should be as wide as possible along the columns. (Check out this page for more information about this)

Let’s try this on a sample scene to read the raster, and calculate the mean value. We have a sample scene from green band of Sentinel-2 available on AWS S3:

import dask 
from dask.distributed import Client, LocalCluster
cluster = LocalCluster()
client = Client(cluster)
client

Client

Client-9c6bdf3a-7730-11ee-819f-6a0ce35f57ad

Connection method: Cluster object Cluster type: distributed.LocalCluster
Dashboard: /user/halemohammad@clarku.edu/proxy/8787/status

Cluster Info

sample_scene = "https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/19/T/CF/2023/10/S2A_19TCF_20231026_0_L2A/B03.tif"

First, we use rioxarray without Dask:

%%timeit -n 2 -r 2
green_band = rioxarray.open_rasterio(sample_scene)
green_band.mean()
13.6 s ± 1.82 s per loop (mean ± std. dev. of 2 runs, 2 loops each)

Now, let’s add Dask.

%%timeit -n 2 -r 2
green_band_dask = rioxarray.open_rasterio(sample_scene, chunks=(1, 'auto', -1))
green_band_dask.mean().compute()
14.3 s ± 3.93 s per loop (mean ± std. dev. of 2 runs, 2 loops each)
%%timeit -n 2 -r 2
green_band_dask = rioxarray.open_rasterio(sample_scene, lock=False, chunks=(1, 'auto', -1))
green_band_dask.mean().compute()
9.98 s ± 71.9 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)

19.3. Mosaicking Rasters#

Let’s say you are interested to generate an RGB image of Sentinel-2 across your area of interest defined in the following bbox. We have defined our geomtry in three different formats suing bbox, aoi, and clipping_geom which will be used in various parts in the following. But they all refer to the same object.

from shapely.geometry import box
import shapely
import json
aoi_box = box(-70.74295109234716, 41.01426426532743, -69.46303157927986, 42.0218426094371)
shapely.to_geojson(aoi_box)
'{"type":"Polygon","coordinates":[[[-69.46303157927986,41.01426426532743],[-69.46303157927986,42.0218426094371],[-70.74295109234716,42.0218426094371],[-70.74295109234716,41.01426426532743],[-69.46303157927986,41.01426426532743]]]}'
json.loads(shapely.to_geojson(aoi_box))
{'type': 'Polygon',
 'coordinates': [[[-69.46303157927986, 41.01426426532743],
   [-69.46303157927986, 42.0218426094371],
   [-70.74295109234716, 42.0218426094371],
   [-70.74295109234716, 41.01426426532743],
   [-69.46303157927986, 41.01426426532743]]]}

First, let’s visualize this bbox

import leafmap
m = leafmap.Map(center=[41.633484, -70.154371], zoom=8, height="500px")
m.add_geojson(json.loads(shapely.to_geojson(aoi_box)))
m
string indices must be integers, not 'str'

We use the PySTAC client to search for imagery across this area:

from pystac_client import Client

api_url = "https://earth-search.aws.element84.com/v1"
client = Client.open(api_url)
search_results = client.search(
    collections=["sentinel-2-l2a"],
    bbox = aoi_box.bounds,
    sortby=["-properties.datetime"],
    max_items=4,
)
items = search_results.item_collection()
for item in items:
    m.add_geojson(item.geometry, style = {"color": "red"})
m
string indices must be integers, not 'str'
string indices must be integers, not 'str'
string indices must be integers, not 'str'
string indices must be integers, not 'str'

So, as we can see in the map there are four scenes that overlap our area of interest. We need to mosaick the overlapping Sentinel-2 scenes to generate one raster then.

We will use merge_datasets from rioxarray

import xarray as xr
import rioxarray
from rioxarray.merge import merge_datasets
def gen_stac_asset_urls(items, asset):
    """
    This function receives an items collection returned by a STAC API, and returns
    the urls of the requested `asset` in a list. 
    
    """
    urls = []
    for item in items:
        urls.append(item.assets[asset].href)
    
    return urls
red_urls = gen_stac_asset_urls(items, "red")
green_urls = gen_stac_asset_urls(items, "green")
blue_urls = gen_stac_asset_urls(items, "blue")
blue_urls
['https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/19/T/DF/2023/10/S2B_19TDF_20231028_0_L2A/B02.tif',
 'https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/19/T/DG/2023/10/S2B_19TDG_20231028_0_L2A/B02.tif',
 'https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/19/T/CF/2023/10/S2A_19TCF_20231026_0_L2A/B02.tif',
 'https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/19/T/DF/2023/10/S2A_19TDF_20231026_0_L2A/B02.tif']
temp_dataarray = rioxarray.open_rasterio(red_urls[0])
temp_dataset = rioxarray.open_rasterio(red_urls[0]).to_dataset(name="red")
temp_dataarray
<xarray.DataArray (band: 1, y: 10980, x: 10980)>
[120560400 values with dtype=uint16]
Coordinates:
  * band         (band) int64 1
  * x            (x) float64 4e+05 4e+05 4e+05 ... 5.097e+05 5.097e+05 5.098e+05
  * y            (y) float64 4.6e+06 4.6e+06 4.6e+06 ... 4.49e+06 4.49e+06
    spatial_ref  int64 0
Attributes:
    AREA_OR_POINT:       Area
    OVR_RESAMPLING_ALG:  AVERAGE
    _FillValue:          0
    scale_factor:        1.0
    add_offset:          0.0
temp_dataset
<xarray.Dataset>
Dimensions:      (band: 1, x: 10980, y: 10980)
Coordinates:
  * band         (band) int64 1
  * x            (x) float64 4e+05 4e+05 4e+05 ... 5.097e+05 5.097e+05 5.098e+05
  * y            (y) float64 4.6e+06 4.6e+06 4.6e+06 ... 4.49e+06 4.49e+06
    spatial_ref  int64 0
Data variables:
    red          (band, y, x) uint16 ...
red_datasets = [
    rioxarray.open_rasterio(url, lock=False, chunks=(1, 'auto', -1)).to_dataset(name="red") for url in red_urls
]
red_datasets[0]
<xarray.Dataset>
Dimensions:      (band: 1, x: 10980, y: 10980)
Coordinates:
  * band         (band) int64 1
  * x            (x) float64 4e+05 4e+05 4e+05 ... 5.097e+05 5.097e+05 5.098e+05
  * y            (y) float64 4.6e+06 4.6e+06 4.6e+06 ... 4.49e+06 4.49e+06
    spatial_ref  int64 0
Data variables:
    red          (band, y, x) uint16 dask.array<chunksize=(1, 6111, 10980), meta=np.ndarray>
mosaic_red = merge_datasets(red_datasets)
mosaic_red.red.rio.to_raster("red_mosaic.tif")
mosaic_red = mosaic_red.rio.clip([aoi_box], crs=4326)
mosaic_red.red.rio.to_raster("red_clipped.tif")
green_datasets = [
    rioxarray.open_rasterio(url, lock=False, chunks=(1, 'auto', -1)).to_dataset(name="green") for url in green_urls
]
blue_datasets = [
    rioxarray.open_rasterio(url, lock=False, chunks=(1, 'auto', -1)).to_dataset(name="blue") for url in blue_urls
]
mosaic_blue = merge_datasets(blue_datasets)
mosaic_blue = mosaic_blue.rio.clip([aoi_box], crs=4326)
mosaic_green = merge_datasets(green_datasets)
mosaic_green = mosaic_green.rio.clip([aoi_box], crs=4326)
xr.merge([mosaic_red, mosaic_green, mosaic_blue])
<xarray.Dataset>
Dimensions:      (band: 1, x: 10823, y: 11323)
Coordinates:
  * band         (band) int64 1
  * x            (x) float64 3.534e+05 3.535e+05 ... 4.617e+05 4.617e+05
  * y            (y) float64 4.654e+06 4.654e+06 4.654e+06 ... 4.54e+06 4.54e+06
    spatial_ref  int64 0
Data variables:
    red          (band, y, x) uint16 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
    green        (band, y, x) uint16 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
    blue         (band, y, x) uint16 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0