Assignment 4#
The goal of this assignment is to work with vector data and implement a scalable vector data analysis.
You should submit this assignment to your existing geog313-assignments GitHub repository under a new directory named assignment-4. Your notebooks should be able to run on the Microsoft Planetary Computer (PC) Hub. In the instructions for your code, provide guidance on how users should use your notebook within this hub. You can implement all sections of this assignment in one notebook.
Note: As you know, the Python environment within the PC Hub can change over time, so you need to record the versions of all packages you are using in your notebook's documentation so that anyone can reproduce your results.
Download Building Footprint Data#
For this assignment you will be working with the Google-Microsoft Open Buildings Dataset - Combined by VIDA, which is available on Source Cooperative. Check out the dataset's Read Me to understand the data and familiarize yourself with its metadata.
Write a Python function that receives the ISO code for a country and downloads the corresponding geoparquet file or files for that country (Note: you only need to download the data stored under by_country/). Use this function to download the data for Haiti.
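One possible shape for such a function is sketched below using only the standard library. The base URL and file layout here are assumptions for illustration; verify the actual endpoint and partitioning scheme in the dataset's Read Me on Source Cooperative before relying on them.

```python
import os
import urllib.request

# ASSUMPTION: this base URL and the country_iso=<ISO>/<ISO>.parquet layout
# are illustrative -- confirm both against the dataset's Read Me.
BASE_URL = (
    "https://data.source.coop/vida/google-microsoft-open-buildings/"
    "geoparquet/by_country"
)


def country_file_url(iso_code):
    """Construct the (hypothetical) URL of a country's geoparquet file."""
    return f"{BASE_URL}/country_iso={iso_code}/{iso_code}.parquet"


def download_country(iso_code, out_dir="data"):
    """Download one country's building footprints into out_dir."""
    os.makedirs(out_dir, exist_ok=True)
    dest = os.path.join(out_dir, f"{iso_code}.parquet")
    urllib.request.urlretrieve(country_file_url(iso_code), dest)
    return dest
```

For Haiti, the call would be `download_country("HTI")`, since HTI is Haiti's ISO 3166-1 alpha-3 code.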
Load Geoparquet File(s)#
The geoparquet files in this dataset may be large depending on the country you are working with, so ideally you would like to benefit from dask_geopandas functionality. To load the data into dask_geopandas, you first need to convert the geometries from WKB to shapely geometry objects. The following function can help you load the data into a dask_geopandas DataFrame. (Note: this is not an optimal solution, as you have to load the data into a geopandas DataFrame first and then convert it to dask_geopandas. In the optional section below, you will be asked to rewrite this function.)
Use this function to load building footprints for Haiti.
from shapely import wkb
import pandas as pd
import geopandas as gpd
import dask_geopandas as dgpd


def read_geoparquet(path):
    """
    This function receives the path to a geoparquet file from the
    Google-Microsoft Building Footprints dataset and returns a
    dask_geopandas DataFrame of the data.

    The geometry of each building in the original file is recorded
    in WKB format and should be converted to shapely geometries to
    be able to create a geopandas DataFrame.

    Args:
        path: string containing the geoparquet file path

    Returns:
        ddf: a dask_geopandas DataFrame
    """
    # Load the Parquet file into a pandas DataFrame
    df = pd.read_parquet(path)

    # Convert WKB bytes into shapely geometry objects
    df["geometry"] = df["geometry"].apply(wkb.loads)

    # Build a GeoPandas DataFrame
    gdf = gpd.GeoDataFrame(df, geometry="geometry")

    # Set the correct CRS
    gdf.set_crs(epsg=4326, inplace=True)

    # Convert to a dask_geopandas DataFrame
    ddf = dgpd.from_geopandas(gdf, chunksize=100000)

    return ddf
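The WKB conversion at the heart of this function can be seen in isolation: each value in the file's geometry column is a byte string that shapely's `wkb.loads` decodes back into a geometry object geopandas can work with.

```python
from shapely import wkb
from shapely.geometry import Point

# The parquet file stores each footprint as WKB bytes; wkb.loads decodes
# the bytes back into a shapely geometry.
raw = Point(1.0, 2.0).wkb        # bytes, like the 'geometry' column values
geom = wkb.loads(raw)
print(geom.x, geom.y)  # 1.0 2.0
```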
Analyze the Data#
In this section, you will analyze the data using the functionality provided by Dask:
Plot the histogram of the areas of all buildings whose source is Microsoft.
Note: this might have a very skewed distribution. Try passing arguments to your histogram function to create a more even histogram plot, and explain your approach.
Count the number of building footprints that intersect each other.
From the intersecting building footprints, calculate how many:
Google building footprints intersect another Google building footprint
Microsoft building footprints intersect another Microsoft building footprint
Google building footprints intersect a Microsoft building footprint
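The steps above can be sketched on a toy dataset. This sketch uses plain shapely and numpy for clarity; on the real data you would run the equivalent operations on the dask_geopandas DataFrame (e.g., with `dask_geopandas.sjoin` for the intersection step), and the source column name used here is an assumption to check against the dataset's Read Me.

```python
import numpy as np
from shapely.geometry import box
from shapely.strtree import STRtree

# Toy footprints standing in for the Haiti data; 'sources' mimics the
# dataset's provider column (column name is an assumption -- verify it).
footprints = [box(0, 0, 1, 1), box(0.5, 0.5, 2.5, 2.5), box(3, 3, 13, 13)]
sources = ["google", "microsoft", "microsoft"]

# Skewed area distributions flatten out with log-spaced bins
# (pass bins= to your histogram/plotting call).
areas = np.array([g.area for g in footprints])  # 1, 4, 100 -- very skewed
bins = np.logspace(np.log10(areas.min()), np.log10(areas.max()), 10)
counts, _ = np.histogram(areas, bins=bins)

# A spatial index (shapely 2.x STRtree) avoids a full O(n^2) pairwise test.
tree = STRtree(footprints)
pairs = set()
for i, geom in enumerate(footprints):
    for j in tree.query(geom, predicate="intersects"):
        if int(j) != i:  # skip self-matches
            pairs.add((min(i, int(j)), max(i, int(j))))

# Footprints that intersect at least one other footprint
intersecting = {i for pair in pairs for i in pair}
print(len(intersecting))  # 2

# Breaking the pairs down by source gives the Google/Microsoft counts
by_source = [(sources[i], sources[j]) for i, j in pairs]
```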
[Optional] Rewrite the read_geoparquet Function#
The function read_geoparquet provided above is not optimal because it loads all the data into memory at once. Rewrite this function so that it does not load the data before it is needed for computation, while still returning the same dask_geopandas DataFrame.
Use this function to load the data for the Philippines, and run the analyses from the previous section.