7. Tutorial for Searching a STAC Catalog#

In this tutorial you will learn how to connect to a STAC API and search various datasets available for query. You will also search for specific items using a range of query parameters.

We will use the Earth Search API developed by Element84 for satellite datasets available on AWS S3 Storage (note that only those datasets that have a STAC catalog are served through this API). You can access the STAC catalog of the API here.

We will use the PySTAC pakcage for this tutorial. Check the documentation here, and these specific guides about using ItemSearch

You can find this notebook with a Dockerfile that contains all the required packages to run it on this repository.

Attribution: Parts of this notebook are inspired by the great tutorial on Access satellite imagery using Python (link)

from pystac_client import Client
api_url = "https://earth-search.aws.element84.com/v1"
client = Client.open(api_url)

7.1. Find Collections#

First, we would like to see what collections are available from this API.

collections = client.get_collections()
for collection in collections:
    print(collection)
<CollectionClient id=cop-dem-glo-30>
<CollectionClient id=naip>
<CollectionClient id=sentinel-2-l2a>
<CollectionClient id=sentinel-2-l1c>
<CollectionClient id=landsat-c2-l2>
<CollectionClient id=cop-dem-glo-90>
<CollectionClient id=sentinel-1-grd>

In order to get more information about a specific collection, you can use get_collection function:

s1_collection = client.get_collection("sentinel-1-grd")
s1_collection

7.2. Search Items#

Let’s use Leafmap to select a point where we are interested to find a satellite imagery

import leafmap
m = leafmap.Map(center=[42.250809, -71.822833], zoom=16, height="800px")
m

Pan and zoom the map to find an area of interest, then use the tools on the top left of the map to select a point on the map.

if m.user_rois is not None:
    point = m.user_rois['features'][0]['geometry']
else:
    point = dict(type="Point", coordinates=(42.250809, -71.822833))

We are interested to search for Sentinel-2 imagery that intersects with the point we selected in the previous step. So we will use the search function and insert sentinel-2-l2a as our collection of interest. We also use the intersects property to filter the scenes that intersect with our point of interest.

search_results = client.search(
    collections=["sentinel-2-l2a"],
    intersects=point,
    max_items=12,
)
print(search_results.matched())
1928

The number that is returned in the previous step is more than 12 that we had identified in our search criteria. But this doesn’t mean all the metadata about these scenes haven been retrieved. This just shows how many scenes have matched our search criteria. In the next cell, we will call item_collection() to retrieve the metadata, and check how many of them are retrieved.

items = search_results.item_collection()
len(items)
12

Now, let’s investigate an item

items[0]
print(items[0].datetime)
2023-09-26 15:41:35.094000+00:00
print(items[0].geometry)
{'type': 'Polygon', 'coordinates': [[[-71.79498386317441, 42.40785435022417], [-72.10290249964824, 41.427267029114304], [-71.2944491287618, 41.4040441194201], [-71.23687622798481, 42.390872073839105], [-71.79498386317441, 42.40785435022417]]]}

We can now use the item’s geometry and confirm that the returned scene intersects with our point of interest.

m.add_geojson(items[0].geometry)
m

7.3. Query Metadata#

Items in STAC catalog have much more metadata (in addition to location) that you can query and only return results that match your query parameters. Let’s use the datetime property and only search for scenes in 2023:

search = client.search(
    collections=["sentinel-2-l2a"],
    intersects=point,
    datetime="2023-01-01/2023-09-22"
)
print(search.matched())
173

Another property which is key for multispectral imagery is cloud cover; ideally we would be interested to find scenes with low cloud cover. Cloud cover is recorded in the metadata named eo:cloud_cover, and it ranges from 0 to 1. In the following, we are going to find scenes that only have a cloud cover of less than 5% in 2023.

search = client.search(
    collections=["sentinel-2-l2a"],
    intersects=point,
    datetime="2023-01-01/2023-09-22",
    query=["eo:cloud_cover<5"],
    max_items=10
)
print(search.matched())
26

Another useful property of the STAC API is that you can sort the results using a metadata property. For example, let’s sort our results based on the cloud cover value:

search = client.search(
    collections=["sentinel-2-l2a"],
    intersects=point,
    datetime="2023-01-01/2023-09-22",
    query=["eo:cloud_cover<5"],
    sortby=["+properties.eo:cloud_cover"],
    max_items=10
)
items = search.item_collection()
len(items)
10
for item in items:
    print(item.properties["eo:cloud_cover"])
0.003792
0.003975
0.388506
0.397059
0.437583
0.462606
0.797721
0.808833
0.953016
1.014486

You can also save the results of the search into a JSON file if you need it later on.

items.save_object("search.json")

7.4. Access Assets#

Next, we will use the assets of the returned items and retrieve the actual scene.

#Let's look at the second item:
selected_item = items[1]
# Here are the assets available for this item
assets = selected_item.assets
print(assets.keys())
dict_keys(['aot', 'blue', 'coastal', 'granule_metadata', 'green', 'nir', 'nir08', 'nir09', 'red', 'rededge1', 'rededge2', 'rededge3', 'scl', 'swir16', 'swir22', 'thumbnail', 'tileinfo_metadata', 'visual', 'wvp', 'aot-jp2', 'blue-jp2', 'coastal-jp2', 'green-jp2', 'nir-jp2', 'nir08-jp2', 'nir09-jp2', 'red-jp2', 'rededge1-jp2', 'rededge2-jp2', 'rededge3-jp2', 'scl-jp2', 'swir16-jp2', 'swir22-jp2', 'visual-jp2', 'wvp-jp2'])
for key, asset in assets.items():
    print(f"{key}: {asset.title}")
aot: Aerosol optical thickness (AOT)
blue: Blue (band 2) - 10m
coastal: Coastal aerosol (band 1) - 60m
granule_metadata: None
green: Green (band 3) - 10m
nir: NIR 1 (band 8) - 10m
nir08: NIR 2 (band 8A) - 20m
nir09: NIR 3 (band 9) - 60m
red: Red (band 4) - 10m
rededge1: Red edge 1 (band 5) - 20m
rededge2: Red edge 2 (band 6) - 20m
rededge3: Red edge 3 (band 7) - 20m
scl: Scene classification map (SCL)
swir16: SWIR 1 (band 11) - 20m
swir22: SWIR 2 (band 12) - 20m
thumbnail: Thumbnail image
tileinfo_metadata: None
visual: True color image
wvp: Water vapour (WVP)
aot-jp2: Aerosol optical thickness (AOT)
blue-jp2: Blue (band 2) - 10m
coastal-jp2: Coastal aerosol (band 1) - 60m
green-jp2: Green (band 3) - 10m
nir-jp2: NIR 1 (band 8) - 10m
nir08-jp2: NIR 2 (band 8A) - 20m
nir09-jp2: NIR 3 (band 9) - 60m
red-jp2: Red (band 4) - 10m
rededge1-jp2: Red edge 1 (band 5) - 20m
rededge2-jp2: Red edge 2 (band 6) - 20m
rededge3-jp2: Red edge 3 (band 7) - 20m
scl-jp2: Scene classification map (SCL)
swir16-jp2: SWIR 1 (band 11) - 20m
swir22-jp2: SWIR 2 (band 12) - 20m
visual-jp2: True color image
wvp-jp2: Water vapour (WVP)

Each asset has a link which can be accessed from the href property. This can be a HTTP URL or a link to S3 for this STAC catalog. For example, let’s look at the link for the thumbnail of the scene.

print(assets["thumbnail"].href)
https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/18/T/YM/2023/5/S2B_18TYM_20230527_0_L2A/thumbnail.jpg

Since this is a HTTP link, we can use Python requests package to load the image.

import requests
img_data = requests.get(assets["thumbnail"].href).content
import matplotlib.pyplot as plt
from PIL import Image
import io
plt.figure(figsize=(10, 10))
plt.imshow(Image.open(io.BytesIO(img_data)))
<matplotlib.image.AxesImage at 0xffff79a70520>
../_images/da14649cc49d0b8f796f9f2d57cd6c85621d506e5b768950b2bad396310b5521.png

Now, let’s load one of the COG assets into memory. This is a large scene, and we don’t necessary want to load all the data at once. There are various packages to do this. Here we will use rioxarray, you can also use stacstack and odc-stac.

import rioxarray
nir_href = assets["nir"].href
nir = rioxarray.open_rasterio(nir_href)
nir
<xarray.DataArray (band: 1, y: 10980, x: 10980)>
[120560400 values with dtype=uint16]
Coordinates:
  * band         (band) int64 1
  * x            (x) float64 7e+05 7e+05 7e+05 ... 8.097e+05 8.097e+05 8.098e+05
  * y            (y) float64 4.7e+06 4.7e+06 4.7e+06 ... 4.59e+06 4.59e+06
    spatial_ref  int64 0
Attributes:
    AREA_OR_POINT:       Area
    OVR_RESAMPLING_ALG:  AVERAGE
    _FillValue:          0
    scale_factor:        1.0
    add_offset:          0.0

Note: At this stage the nir DataArray that we defined is empty, and only the metadata of the scene is loaded from the target href. This is consistent with open_rasterio function behavior, and you will learn more about it later in the class. When you run any function on nir or call any of the built-in function (e.g. mean()) the data will be loaded to memory.

First, let’s plot part of the nir array.

import matplotlib.pyplot as plt
plt.imshow(nir[0,1500:2000,6200:7000], cmap='gray')
plt.clim(vmin=10, vmax=5000)
../_images/8e5c5c1f18c023e5341397481b09631aad4927e753fcb74cb94e5d3238593219.png

In this case, only the portion of the array that we wanted to visualize was loaded.

We can actually time the operation of plotting the whole scene vs the small portion we selected, and see the efficiency of using this approach.

%%timeit -n 5 -r 1
plt.imshow(nir[0,1500:2000,6200:7000], cmap='gray')
5.84 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 5 loops each)
../_images/ff290b780b4c391faaa9079a8f1ecae4e7c404b5450ad83f6f041d79894fe876.png
%%timeit -n 5 -r 1
plt.imshow(nir[0, :, :], cmap='gray')
43.2 s ± 0 ns per loop (mean ± std. dev. of 1 run, 5 loops each)
../_images/43c3fb2f741e3216eea5c704a21a73025e72a9c654d4f77daaf09f1c461c1946.png

Lastly, you can save this array into a GeoTIFF file as following:

nir[0,1500:2000,6200:7000].rio.to_raster("nir_subset.tif")