21. Scalable Vector Data Analysis#

You can access this notebook (in a Docker image) on this GitHub repo.

In this lecture, we are going to use the Dask-GeoPandas package to read a large vector dataset from Source Cooperative. We will then use Dask's parallel computation to apply a spatial join operation to two geospatial DataFrames.

Our target dataset is the Google-Microsoft Open Buildings - combined by VIDA dataset hosted on Source Cooperative. This is a combined version of the Google and Microsoft Open Buildings datasets, with files in GeoParquet format hosted in an AWS S3 bucket. Read the dataset description to familiarize yourself with the dataset and its structure.

GeoParquet is a relatively new, open data format for column-oriented geospatial data. It is built on the existing Apache Parquet format, a powerful columnar format that is increasingly replacing CSV. You can check the specification here, and read more about the format on this website. In short, the format is interoperable, compressed, and designed for large-scale datasets.
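To get a feel for the format, here is a minimal sketch of reading and writing GeoParquet with plain GeoPandas (the file names are hypothetical):

import geopandas as gpd

# Read a GeoParquet file into a GeoDataFrame
gdf = gpd.read_parquet("buildings.parquet")

# Write it back out; geometry columns and CRS metadata are preserved
gdf.to_parquet("buildings_copy.parquet")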

21.1. Source Cooperative#

Source Cooperative is a neutral, non-profit data-sharing utility that allows trusted organizations to share data without purchasing a data portal SaaS subscription or managing infrastructure. Source Cooperative is managed by Radiant Earth and hosts tens of datasets on its repository.

All data on Source Cooperative are hosted in AWS S3 buckets. In the following, we will use a set of functions under utils.py to download the data and analyze it.

21.2. Download and Load Buildings Footprint Data into Dask-GeoPandas#

First, we start a new Dask cluster:

from dask.distributed import Client, LocalCluster
cluster = LocalCluster()
client = Client(cluster)
print(client.dashboard_link)
http://127.0.0.1:8787/status
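By default, LocalCluster sizes the cluster based on your machine. If you need more control over workers and memory, you can set these explicitly; a sketch with illustrative resource values:

from dask.distributed import Client, LocalCluster

# Explicitly size the cluster (the values here are assumptions; tune for your machine)
cluster = LocalCluster(n_workers=4, threads_per_worker=2, memory_limit="4GB")
client = Client(cluster)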

In this lecture, we are going to access the data for a specific country. As noted in the dataset description, you need the 3-letter ISO country code (ISO 3166-1 alpha-3) to access the corresponding file. You can look up the code for your country of choice here.

In the following, enter the ISO code for your country of choice, along with the EPSG code of its UTM zone:

# Jamaica
country_code = "JAM"
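For example, Jamaica falls in UTM zone 18N, so a reasonable choice for the EPSG code (used later for area calculations) is:

# UTM zone 18N covers Jamaica (EPSG:32618)
utm_epsg = 32618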

Define a path to download the data:

local_path = "./data/"

Let’s import our function from the utils module and run it. This function uses Dask-GeoPandas to lazily load the data from the GeoParquet file.

from utils import get_google_microsoft_bldgs

The following cell downloads the GeoParquet file from the S3 bucket and loads it into a Dask-GeoPandas GeoDataFrame. We are using the default value of 256M for the blocksize in Dask. If you run into memory issues in the rest of the notebook, lower the blocksize and re-run the following cell.

bldg_ddf = get_google_microsoft_bldgs(country_code, local_path, blocksize = "256M")
File already exists locally. No download needed.
bldg_ddf
Dask-GeoPandas GeoDataFrame Structure:
boundary_id bf_source confidence area_in_meters s2_id country_iso geohash geometry bbox
npartitions=1
int64 object float64 float64 int64 object object geometry object
... ... ... ... ... ... ... ... ...
Dask Name: read_parquet, 1 expression
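The core of the loading step inside get_google_microsoft_bldgs looks roughly like the sketch below (a simplified assumption of the helper's internals that skips the download logic, and assumes blocksize is forwarded to dask.dataframe.read_parquet):

import dask_geopandas

def load_bldgs(parquet_path, blocksize="256M"):
    # Lazily read the GeoParquet file into a Dask-GeoPandas GeoDataFrame;
    # blocksize controls the target partition size
    return dask_geopandas.read_parquet(parquet_path, blocksize=blocksize)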

21.3. Read Administrative Boundaries Dataset#

We also want to load the administrative boundaries dataset for our country of choice and assign each building an administrative unit (Parish) name.

You can download each country's administrative unit JSON files from the GADM website. Each country has a different number of levels for its administrative units (and not all are available on the GADM website).

Check your country of choice and find the highest level of administrative boundaries that is available for it.

Here, we are interested in level 1 data, which the following function will download.

from utils import get_gdam_json
boundaries = get_gdam_json(country_code = country_code, admin_level=1)
boundaries.crs
<Geographic 2D CRS: EPSG:4326>
Name: WGS 84
Axis Info [ellipsoidal]:
- Lat[north]: Geodetic latitude (degree)
- Lon[east]: Geodetic longitude (degree)
Area of Use:
- name: World.
- bounds: (-180.0, -90.0, 180.0, 90.0)
Datum: World Geodetic System 1984 ensemble
- Ellipsoid: WGS 84
- Prime Meridian: Greenwich
boundaries.head()
GID_1 GID_0 COUNTRY Parish VARNAME_1 NL_NAME_1 TYPE_1 ENGTYPE_1 CC_1 HASC_1 ISO_1 geometry
0 JAM.1_1 JAM Jamaica Clarendon NA NA Parish Parish NA JM.CL JM-13 MULTIPOLYGON (((-77.1357 17.8332, -77.1363 17....
1 JAM.2_1 JAM Jamaica Hanover NA NA Parish Parish NA JM.HA JM-09 MULTIPOLYGON (((-78.3474 18.3393, -78.346 18.3...
2 JAM.3_1 JAM Jamaica Kingston KingstonandPortRoyal NA Parish Parish NA JM.KI JM-01 MULTIPOLYGON (((-76.7213 17.9457, -76.7368 17....
3 JAM.4_1 JAM Jamaica Manchester NA NA Parish Parish NA JM.MA JM-12 MULTIPOLYGON (((-77.4873 18.2024, -77.3922 18....
4 JAM.5_1 JAM Jamaica Portland NA NA Parish Parish NA JM.PO JM-04 MULTIPOLYGON (((-76.2634 18.0041, -76.2661 17....
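A minimal sketch of what a helper like get_gdam_json might do, assuming the GADM 4.1 file naming convention and that the file has already been downloaded locally (the path logic is a hypothetical simplification):

import geopandas as gpd

def load_admin_boundaries(country_code, admin_level, local_path="./data/"):
    # GADM 4.1 names its GeoJSON files like gadm41_JAM_1.json
    # (adjust the path to wherever the file actually lives)
    file_name = f"gadm41_{country_code}_{admin_level}.json"
    return gpd.read_file(local_path + file_name)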

21.4. Spatial Join#

Now we will use a spatial join to add the Parish name (the Parish column in boundaries) to bldg_ddf:

bldg_ddf_w_boundaries = bldg_ddf.sjoin(boundaries, how="inner", predicate="intersects")
bldg_ddf_w_boundaries
Dask-GeoPandas GeoDataFrame Structure:
boundary_id bf_source confidence area_in_meters s2_id country_iso geohash geometry bbox index_right GID_1 GID_0 COUNTRY Parish VARNAME_1 NL_NAME_1 TYPE_1 ENGTYPE_1 CC_1 HASC_1 ISO_1
npartitions=1
int64 object float64 float64 int64 object object geometry object int64 string string string string string string string string string string string
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
Dask Name: sjoin, 1 expression
buildings_per_parish = bldg_ddf_w_boundaries["Parish"].value_counts().compute()
buildings_per_parish
Parish
Clarendon         161158
Hanover            56768
Kingston           32197
Manchester        133060
Portland           47859
SaintAndrew       222104
SaintAnn          134296
SaintCatherine    283387
SaintElizabeth    158357
SaintJames        138844
SaintMary          74672
SaintThomas        57412
Trelawny           64095
Westmoreland      137181
Name: count, dtype: int64[pyarrow]
print(f"Total number of buildings is {buildigs_per_parish.sum()}")
Total number of buildings is 1701390

21.4.1. Exercise 1: Plot the number of buildings per Parish on the map#
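One possible starting point (a sketch, not the only approach): merge the computed counts back onto the boundaries GeoDataFrame and draw a choropleth with GeoPandas plotting.

# Join the per-parish counts onto the boundaries and plot a choropleth
counts = buildings_per_parish.reset_index()  # columns: Parish, count
boundaries_with_counts = boundaries.merge(counts, on="Parish")
boundaries_with_counts.plot(column="count", legend=True)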

21.4.2. Exercise 2: Calculate the percentage of the area of each Parish that is covered by buildings#
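A hint for getting started (a sketch, assuming the utm_epsg value defined earlier): parish areas should be computed in a projected CRS, while building areas are already provided in the area_in_meters column.

# Reproject parishes to the UTM CRS so .area returns square meters
parishes_utm = boundaries.to_crs(epsg=utm_epsg)
parish_area = parishes_utm.set_index("Parish").area

# Sum building areas per parish (computed in parallel by Dask), then take the ratio
bldg_area = bldg_ddf_w_boundaries.groupby("Parish")["area_in_meters"].sum().compute()
pct_covered = 100 * bldg_area / parish_area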