Assignment 4#

The goal of this assignment is to work with vector data and implement a scalable vector data analysis.

You should submit this assignment to your existing geog313-assignments GitHub repository under a new directory named assignment-4. Your notebooks should be able to run on the Microsoft Planetary Computer (PC) Hub. In the instructions for your code, explain how users should run your notebook within this hub. You can implement all sections of this assignment in one notebook.

Note: As you know, the Python environment within the PC Hub can change, so record the version of every package your notebook uses in its documentation so that anyone can reproduce your results.
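One way to record package versions is to query the installed distributions from inside the notebook. The following is a minimal sketch using the standard library's importlib.metadata; the helper name package_versions and the package list are illustrative assumptions, so adjust the list to whatever your notebook actually imports.

```python
from importlib.metadata import PackageNotFoundError, version


def package_versions(names):
    """Return a {distribution name: version string} mapping.

    Packages that are not installed are reported as "not installed"
    instead of raising an error.
    """
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = "not installed"
    return found


# Hypothetical package list (an assumption): edit to match your notebook.
PACKAGES = ["geopandas", "dask-geopandas", "shapely", "pandas", "dask"]

# Print one "name==version" line per package, ready to paste into the docs.
for name, ver in package_versions(PACKAGES).items():
    print(f"{name}=={ver}")
```

Running this cell at the top of the notebook and copying its output into your documentation keeps the recorded versions in sync with the environment you actually used.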

Download Building Footprint Data#

For this assignment, you will work with the Google-Microsoft Open Buildings Dataset - Combined by VIDA, which is available on Source Cooperative. Read the dataset's Read Me to understand how the data is organized and to familiarize yourself with its metadata.

  • Write a Python function that receives the ISO code for a country and downloads the corresponding geoparquet file or files for that country (Note: you only need to download the data stored under by_country/).

  • Use this function to download the data for Haiti.
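A download helper along these lines could be sketched as follows. Everything specific here is an assumption to check against the dataset's Read Me: BASE_URL is a placeholder rather than the real location, the ISO code is taken to be a three-letter code, and the country_iso=XXX/XXX.parquet layout with a single file per country is a guess. The function names are hypothetical.

```python
import re
import shutil
import urllib.request
from pathlib import Path

# Placeholder (an assumption): replace with the real by_country/ location
# given in the dataset's Read Me.
BASE_URL = "https://example.com/by_country"


def country_file_url(iso_code, base_url=BASE_URL):
    """Build the URL of one country's file.

    Assumes a <base_url>/country_iso=XXX/XXX.parquet layout and a
    three-letter ISO code; verify both against the Read Me.
    """
    iso = iso_code.strip().upper()
    if not re.fullmatch(r"[A-Z]{3}", iso):
        raise ValueError(f"expected a three-letter ISO code, got {iso_code!r}")
    return f"{base_url.rstrip('/')}/country_iso={iso}/{iso}.parquet"


def download_country(iso_code, out_dir="data", base_url=BASE_URL):
    """Stream the country's file into out_dir and return the local path."""
    url = country_file_url(iso_code, base_url)
    target = Path(out_dir) / url.rsplit("/", 1)[-1]
    target.parent.mkdir(parents=True, exist_ok=True)
    if target.exists():
        return target  # skip the download when the notebook is re-run
    partial = target.with_name(target.name + ".part")
    with urllib.request.urlopen(url) as response, open(partial, "wb") as fh:
        shutil.copyfileobj(response, fh)  # copy in chunks, not all at once
    partial.replace(target)  # rename only after a complete download
    return target
```

With a real base URL in place, download_country("HTI") would fetch the file for Haiti. If a country turns out to be split across several files, the function would need to list the directory and loop over its contents instead.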

Load Geoparquet File(s)#

The geoparquet files in this dataset may be large, depending on the country you are working with, so ideally you would benefit from dask_geopandas functionality. To load the data into dask_geopandas, you first need to decode the geometries from WKB into shapely geometry objects. The following function loads the data into a dask_geopandas DataFrame. (Note: this is not an optimal solution, as it has to load the data into a geopandas DataFrame first and then convert it to dask_geopandas. In the optional section below, you will be asked to rewrite this function.)

Use this function to load building footprints for Haiti.

from shapely import wkb
import pandas as pd
import geopandas as gpd
import dask_geopandas as dgpd

def read_geoparquet(path):
    """
    This function receives the path to a geoparquet file from the 
    Google-Microsoft Building Footprints dataset and returns a 
    dask_geopandas DataFrame of the data. 
    The geometry of each building in the original file is recorded
    in WKB format and must be decoded into shapely geometries to 
    create a geopandas DataFrame. 
    
    Args:
      path: string containing the geoparquet file path
    
    Returns:
      ddf: a dask_geopandas DataFrame  
    """
    
    # Load Parquet file into a Pandas DataFrame
    df = pd.read_parquet(path)
    
    # Decode the WKB bytes into shapely geometry objects
    df['geometry'] = df['geometry'].apply(wkb.loads)
    
    # Load as GeoPandas dataframe
    gdf = gpd.GeoDataFrame(df, geometry='geometry')
    
    # Set the correct CRS
    gdf.set_crs(epsg=4326, inplace=True)
    
    # Convert to a dask_geopandas DataFrame
    ddf = dgpd.from_geopandas(gdf, chunksize=100000)
    
    return ddf

Analyze the Data#

In this section, you will analyze the data using the functionality provided by Dask:

  • Plot a histogram of the areas of all buildings whose source is Microsoft.

    • Note: this might have a very skewed distribution. Try passing arguments to your histogram function to create a more even histogram plot, and explain your approach.

  • Count the number of building footprints that intersect each other.

  • From the intersecting building footprints, calculate how many:

    • Google building footprints intersect another Google building footprint

    • Microsoft building footprints intersect another Microsoft building footprint

    • Google building footprints intersect a Microsoft building footprint
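Two of the steps above can be illustrated without the real data. The sketch below is plain Python and rests on assumptions: log_bin_edges builds logarithmically spaced bin edges, one possible way to even out a skewed area histogram (the edges could be passed to a `bins` argument), and footprints_by_source_pair tallies distinct footprints per source combination from a list of intersecting id pairs, such as the output of a spatial self-join. Both function names, the toy ids, and the "google"/"microsoft" labels are illustrative, not taken from the dataset.

```python
import math
from collections import defaultdict


def log_bin_edges(lowest, highest, n_bins):
    """Return n_bins + 1 edges spaced evenly in log10 between the bounds."""
    if lowest <= 0 or highest <= lowest:
        raise ValueError("need 0 < lowest < highest")
    lo, hi = math.log10(lowest), math.log10(highest)
    step = (hi - lo) / n_bins
    return [10 ** (lo + i * step) for i in range(n_bins + 1)]


def footprints_by_source_pair(pairs, source_of):
    """Count distinct footprints per (own source, other source) combination.

    pairs: iterable of (i, j) ids of intersecting footprints; a self-join
           usually also returns (i, i) and both (i, j) and (j, i).
    source_of: mapping from footprint id to its source label.
    """
    hits = defaultdict(set)
    for i, j in pairs:
        if i == j:
            continue  # every footprint trivially intersects itself
        # Sets make the duplicate (j, i) rows harmless.
        hits[(source_of[i], source_of[j])].add(i)
        hits[(source_of[j], source_of[i])].add(j)
    return {key: len(ids) for key, ids in hits.items()}


# Toy inputs (assumptions, not real data).
edges = log_bin_edges(1, 10_000, 4)
print([round(e) for e in edges])

sources = {0: "google", 1: "google", 2: "microsoft", 3: "microsoft"}
pairs = [(0, 0), (0, 1), (1, 0), (1, 2), (2, 3), (3, 2)]
print(footprints_by_source_pair(pairs, sources))
```

In the result, the key ("google", "microsoft") counts Google footprints that intersect at least one Microsoft footprint, while ("google", "google") counts Google footprints that intersect another Google footprint. On the real data, the pairs would come from a Dask-backed spatial join rather than a Python list, and the tally itself would be better expressed as a grouped aggregation.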

[Optional] Rewrite the read_geoparquet Function#

The read_geoparquet function provided above is not optimal, as it loads all the data into memory at once. Rewrite this function so that it does not load the data before it is needed for computation and still outputs the same dask_geopandas DataFrame.

Use this function to load the data for the Philippines, and run the analyses from the previous section.