16. Scalable Vector Data Analysis#
In this lecture, we are going to use Dask-GeoPandas
package to read a large vector dataset from Source Cooperative. Then use Dask parrallel computation to apply a spatial join operation to two geospatial DataFrames.
Our target dataset is the Google Open Buildings dataset hosted on Source Cooperative. This is an updated version of the Gooel Open Buildings dataset and it has files in GeoParquet format hosted on AWS S3 bucket. Checkout the description of the dataset to learn more about it here.
Check out these links to learn more about the open buildings datasets from Google and Microsoft as well:
Visualize the combined dataset here.
16.1. Download Data#
In order to download the data, you need to create a free account on Source Cooperative. Navigate to the sign in/up page, and follow the steps to create an account.
All data on Source Cooperative, are hosted on AWS S3 bucket. In order to access them, you need a temporary credential that you can generate on Source Cooperative website.
Navigate to the Goole Open Building dataset page, and click on “ACCESS DATA”. Then click on “GENERATE CREDENTIALS” to retrieve your AWS credentials. Copy the values for each of the following keys, and paste them in the following cell:
AWS_DEFAULT_REGION = "<REGION>"
AWS_ACCESS_KEY_ID = "ACCESS_KEY_ID"
AWS_SECRET_ACCESS_KEY = "SECRET_ACCESS_KEY"
AWS_SESSION_TOKEN = "SESSION_TOKEN"
We are going to use boto3 package to connect to AWS and download the data:
import boto3
Create a client for s3 access using the credentials you copied from Source Cooperative:
s3 = boto3.client('s3',
aws_access_key_id = AWS_ACCESS_KEY_ID,
aws_secret_access_key = AWS_SECRET_ACCESS_KEY,
aws_session_token = AWS_SESSION_TOKEN)
Let’s define a function that retrieves the keys of each object on the S3 bucket for Google Open Buildings dataset using the country code.
A little about S3 (Simple Storage Service):
def get_google_open_building_keys(country_code, aws_region, client):
"""
This function returns all the S3 keys associated with a country_code
from the Google Open Building Dataset on Source Cooperative. This function
only lists the files in GeoParquet format.
For more information about the dataset, visit
https://beta.source.coop/repositories/cholmes/google-open-buildings/description/
Args:
country_code: string indicating the country of target. Country code is
the Alpha-2 code based on ISO 3166 standard.
aws_region: string is the AWS region where the data is hosted
client: boto3 client object returned by boto3.client
Returns:
keys: list of all keys that match the country_code
"""
bucket = f"{aws_region}.opendata.source.coop"
prefix = f"google-research-open-buildings/geoparquet-by-country/country_iso={country_code}"
keys = []
kwargs = {'Bucket': bucket, 'Prefix': prefix}
while True:
resp = client.list_objects_v2(**kwargs)
for obj in resp['Contents']:
keys.append(obj['Key'])
try:
kwargs['ContinuationToken'] = resp['NextContinuationToken']
except KeyError:
break
return keys
Next we need to select a country. You can look up the ISO name for your country of choice here
country_code = "UG" #Uganda is our choice
keys = get_google_open_building_keys(country_code, AWS_DEFAULT_REGION, s3)
keys
['google-research-open-buildings/geoparquet-by-country/country_iso=UG/UG.parquet']
# Download all relevant GeoParquet files
for key in keys:
s3.download_file(Bucket = f"{AWS_DEFAULT_REGION}.opendata.source.coop",
Key = key,
Filename = key.split("/")[-1])
16.2. Read Buildings Dataset#
import dask_geopandas as dg
import geopandas as gpd
First, we start a new Dask cluster:
from dask.distributed import Client, LocalCluster
cluster = LocalCluster()
client = Client(cluster)
client
Client
Client-a523a6b5-6dc8-11ee-81b8-ee8fe5e494a2
Connection method: Cluster object | Cluster type: distributed.LocalCluster |
Dashboard: /user/halemohammad@clarku.edu/proxy/8787/status |
Cluster Info
LocalCluster
093f396f
Dashboard: /user/halemohammad@clarku.edu/proxy/8787/status | Workers: 4 |
Total threads: 8 | Total memory: 32.00 GiB |
Status: running | Using processes: True |
Scheduler Info
Scheduler
Scheduler-bde5009a-0c31-4716-b6d3-95cd8d538460
Comm: tcp://127.0.0.1:33079 | Workers: 4 |
Dashboard: /user/halemohammad@clarku.edu/proxy/8787/status | Total threads: 8 |
Started: Just now | Total memory: 32.00 GiB |
Workers
Worker: 0
Comm: tcp://127.0.0.1:38299 | Total threads: 2 |
Dashboard: /user/halemohammad@clarku.edu/proxy/35205/status | Memory: 8.00 GiB |
Nanny: tcp://127.0.0.1:46575 | |
Local directory: /tmp/dask-scratch-space/worker-odgqsn3u |
Worker: 1
Comm: tcp://127.0.0.1:35379 | Total threads: 2 |
Dashboard: /user/halemohammad@clarku.edu/proxy/43761/status | Memory: 8.00 GiB |
Nanny: tcp://127.0.0.1:43495 | |
Local directory: /tmp/dask-scratch-space/worker-o_1_cpl6 |
Worker: 2
Comm: tcp://127.0.0.1:44663 | Total threads: 2 |
Dashboard: /user/halemohammad@clarku.edu/proxy/34169/status | Memory: 8.00 GiB |
Nanny: tcp://127.0.0.1:37681 | |
Local directory: /tmp/dask-scratch-space/worker-0pte6j45 |
Worker: 3
Comm: tcp://127.0.0.1:46233 | Total threads: 2 |
Dashboard: /user/halemohammad@clarku.edu/proxy/43167/status | Memory: 8.00 GiB |
Nanny: tcp://127.0.0.1:40085 | |
Local directory: /tmp/dask-scratch-space/worker-widx5k6f |
bldg_ddf = dg.read_parquet(f"{country_code}*.parquet", blocksize = 128)
bldg_ddf.head()
area_in_meters | confidence | full_plus_code | geometry | quadkey | country_iso | |
---|---|---|---|---|---|---|
0 | 15.4232 | 0.7654 | 6GMGJW5M+5HGM | POLYGON ((30.93397 3.60793, 30.93396 3.60796, ... | 122323031331 | UG |
1 | 22.9853 | 0.7854 | 6GMGJW2F+V6WH | POLYGON ((30.92312 3.60223, 30.92309 3.60227, ... | 122323031331 | UG |
2 | 26.6149 | 0.8368 | 6GMGJW6P+RXM2 | POLYGON ((30.93743 3.61205, 30.93744 3.61210, ... | 122323031331 | UG |
3 | 16.8751 | 0.7007 | 6GMGJW6P+QWWX | POLYGON ((30.93736 3.61197, 30.93736 3.61201, ... | 122323031331 | UG |
4 | 18.6629 | 0.7845 | 6GMGJW4P+WC3W | POLYGON ((30.93607 3.60728, 30.93604 3.60730, ... | 122323031331 | UG |
This is a very large DataFrame, and you can see the number of buildings is more than 18M:
len(bldg_ddf)
18277942
bldg_ddf.crs
<Geographic 2D CRS: EPSG:4326>
Name: WGS 84
Axis Info [ellipsoidal]:
- Lat[north]: Geodetic latitude (degree)
- Lon[east]: Geodetic longitude (degree)
Area of Use:
- name: World.
- bounds: (-180.0, -90.0, 180.0, 90.0)
Datum: World Geodetic System 1984 ensemble
- Ellipsoid: WGS 84
- Prime Meridian: Greenwich
16.3. Read Adminsitrative Boundaries Dataset#
We are interested to load the adminsitrative boundaries dataset for Uganda, and assign each building an administrative unit (Parish) name.
You can download Uganda’s administrative unit shapefiles here. We are interested in GeoJSON level 4 data, and the following cell will download it.
! wget -P data/ https://geodata.ucdavis.edu/gadm/gadm4.1/json/gadm41_UGA_4.json.zip
--2023-10-18 15:12:36-- https://geodata.ucdavis.edu/gadm/gadm4.1/json/gadm41_UGA_4.json.zip
Resolving geodata.ucdavis.edu (geodata.ucdavis.edu)... 128.120.146.30
Connecting to geodata.ucdavis.edu (geodata.ucdavis.edu)|128.120.146.30|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 772826 (755K) [application/zip]
Saving to: ‘data/gadm41_UGA_4.json.zip.2’
gadm41_UGA_4.json.z 100%[===================>] 754.71K 888KB/s in 0.9s
2023-10-18 15:12:38 (888 KB/s) - ‘data/gadm41_UGA_4.json.zip.2’ saved [772826/772826]
boundaries = gpd.read_file("data/gadm41_UGA_4.json.zip")
boundaries.crs
<Geographic 2D CRS: EPSG:4326>
Name: WGS 84
Axis Info [ellipsoidal]:
- Lat[north]: Geodetic latitude (degree)
- Lon[east]: Geodetic longitude (degree)
Area of Use:
- name: World.
- bounds: (-180.0, -90.0, 180.0, 90.0)
Datum: World Geodetic System 1984 ensemble
- Ellipsoid: WGS 84
- Prime Meridian: Greenwich
boundaries.head()
GID_4 | GID_0 | COUNTRY | GID_1 | NAME_1 | GID_2 | NAME_2 | GID_3 | NAME_3 | NAME_4 | VARNAME_4 | TYPE_4 | ENGTYPE_4 | CC_4 | geometry | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | UGA.1.1.1.1_1 | UGA | Uganda | UGA.1_1 | Adjumani | UGA.1.1_1 | EastMoyo | UGA.1.1.1_1 | AdjumaniTc | Biyaya | NA | Parish | Parish | NA | MULTIPOLYGON (((31.80800 3.37270, 31.79310 3.3... |
1 | UGA.1.1.1.2_1 | UGA | Uganda | UGA.1_1 | Adjumani | UGA.1.1_1 | EastMoyo | UGA.1.1.1_1 | AdjumaniTc | Central | NA | Parish | Parish | NA | MULTIPOLYGON (((31.79040 3.38700, 31.79050 3.3... |
2 | UGA.1.1.1.3_1 | UGA | Uganda | UGA.1_1 | Adjumani | UGA.1.1_1 | EastMoyo | UGA.1.1.1_1 | AdjumaniTc | Cesia | NA | Parish | Parish | NA | MULTIPOLYGON (((31.79010 3.39280, 31.79040 3.3... |
3 | UGA.1.1.2.1_1 | UGA | Uganda | UGA.1_1 | Adjumani | UGA.1.1_1 | EastMoyo | UGA.1.1.2_1 | Adropi | Esia | NA | Parish | Parish | NA | MULTIPOLYGON (((31.71120 3.31740, 31.71340 3.3... |
4 | UGA.1.1.2.2_1 | UGA | Uganda | UGA.1_1 | Adjumani | UGA.1.1_1 | EastMoyo | UGA.1.1.2_1 | Adropi | Jihwa | NA | Parish | Parish | NA | MULTIPOLYGON (((31.78740 3.46700, 31.78200 3.4... |
16.4. Spatial Join#
Now, we will use the spatial join to add the Parish name (NAME_4
in boundaries
) to bldg_ddf
:
bldg_ddf_w_boundaries = bldg_ddf.sjoin(boundaries, how="inner", predicate="intersects")
buildigs_per_parish = bldg_ddf_w_boundaries["NAME_4"].value_counts().compute()
buildigs_per_parish
NAME_4
Kyebando 58970
CentralWard 42772
Kichwabugingo 41610
Busabala 40565
Kasenge 40309
...
Nyakiyanja-MaramagamboFr 5
MpangaFr 5
OtuboiSudds 3
KyamburaGr 2
KibaaleFr 2
Name: count, Length: 4681, dtype: int64
16.4.1. Task: plot the number of buildings per Parish on a map#
boundaries_w_count = boundaries.merge(buildigs_per_parish, on="NAME_4")
boundaries_w_count
GID_4 | GID_0 | COUNTRY | GID_1 | NAME_1 | GID_2 | NAME_2 | GID_3 | NAME_3 | NAME_4 | VARNAME_4 | TYPE_4 | ENGTYPE_4 | CC_4 | geometry | count | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | UGA.1.1.1.1_1 | UGA | Uganda | UGA.1_1 | Adjumani | UGA.1.1_1 | EastMoyo | UGA.1.1.1_1 | AdjumaniTc | Biyaya | NA | Parish | Parish | NA | MULTIPOLYGON (((31.80800 3.37270, 31.79310 3.3... | 3955 |
1 | UGA.1.1.1.2_1 | UGA | Uganda | UGA.1_1 | Adjumani | UGA.1.1_1 | EastMoyo | UGA.1.1.1_1 | AdjumaniTc | Central | NA | Parish | Parish | NA | MULTIPOLYGON (((31.79040 3.38700, 31.79050 3.3... | 19118 |
2 | UGA.7.1.3.1_1 | UGA | Uganda | UGA.7_1 | Busia | UGA.7.1_1 | Samia-Bugwe | UGA.7.1.3_1 | BusiaTc | Central | NA | Parish | Parish | NA | MULTIPOLYGON (((34.09350 0.46350, 34.08820 0.4... | 19118 |
3 | UGA.28.1.1.2_1 | UGA | Uganda | UGA.28_1 | Kotido | UGA.28.1_1 | Dodoth | UGA.28.1.1_1 | KaabongT.C. | Central | NA | Parish | Parish | NA | MULTIPOLYGON (((34.13510 3.51400, 34.12630 3.5... | 19118 |
4 | UGA.46.1.7.1_1 | UGA | Uganda | UGA.46_1 | Nakasongola | UGA.46.1_1 | Buruli | UGA.46.1.7_1 | NakasongolaTc | Central | NA | Parish | Parish | NA | MULTIPOLYGON (((32.46840 1.33130, 32.47090 1.3... | 19118 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
5332 | UGA.58.1.7.3_1 | UGA | Uganda | UGA.58_1 | Yumbe | UGA.58.1_1 | Aringa | UGA.58.1.7_1 | Romogi | Limidia | NA | Parish | Parish | NA | MULTIPOLYGON (((31.27220 3.52460, 31.26550 3.5... | 2259 |
5333 | UGA.58.1.7.4_1 | UGA | Uganda | UGA.58_1 | Yumbe | UGA.58.1_1 | Aringa | UGA.58.1.7_1 | Romogi | Locomgbo | NA | Parish | Parish | NA | MULTIPOLYGON (((31.49130 3.51400, 31.45670 3.4... | 1987 |
5334 | UGA.58.1.8.1_1 | UGA | Uganda | UGA.58_1 | Yumbe | UGA.58.1_1 | Aringa | UGA.58.1.8_1 | YumbeTc | Arunga | NA | Parish | Parish | NA | MULTIPOLYGON (((31.27170 3.46040, 31.26970 3.4... | 6386 |
5335 | UGA.58.1.8.2_1 | UGA | Uganda | UGA.58_1 | Yumbe | UGA.58.1_1 | Aringa | UGA.58.1.8_1 | YumbeTc | Charanga | NA | Parish | Parish | NA | MULTIPOLYGON (((31.22280 3.47170, 31.22530 3.4... | 5182 |
5336 | UGA.58.1.8.3_1 | UGA | Uganda | UGA.58_1 | Yumbe | UGA.58.1_1 | Aringa | UGA.58.1.8_1 | YumbeTc | Lukutua | NA | Parish | Parish | NA | MULTIPOLYGON (((31.26800 3.48390, 31.27200 3.4... | 2981 |
5337 rows × 16 columns
boundaries_w_count.plot("count", vmax = 30000)
<Axes: >
boundaries_w_count.explore("count")
16.4.2. Task: calculate percentage of the area of each Parish that is covered by buildings.#
bldg_ddf_w_boundaries = bldg_ddf.sjoin(boundaries, how="inner", predicate="intersects")
bldg_area_per_parish = bldg_ddf_w_boundaries.groupby("NAME_4")["area_in_meters"].sum().compute()
boundaries_w_count = boundaries_w_count.merge(bldg_area_per_parish, on="NAME_4")
boundaries_w_count.crs
<Geographic 2D CRS: EPSG:4326>
Name: WGS 84
Axis Info [ellipsoidal]:
- Lat[north]: Geodetic latitude (degree)
- Lon[east]: Geodetic longitude (degree)
Area of Use:
- name: World.
- bounds: (-180.0, -90.0, 180.0, 90.0)
Datum: World Geodetic System 1984 ensemble
- Ellipsoid: WGS 84
- Prime Meridian: Greenwich
To calculate the correct area of each Parish we need to project the geometries to a CRS that has units of meter. Here we will use EPSG:32636
which is the UTM Zone for Uganda.
boundaries_w_count["percent_bldg"] = boundaries_w_count["area_in_meters"] / boundaries_w_count.to_crs(32636).area * 100
boundaries_w_count.explore("percent_bldg", vmax = 10)