18. Introduction to Dask-GeoPandas#

Attribution: This notebook is a revised version of the Basic Introduction notebook from Dask-GeoPandas documentation.

This notebook illustrates the basic API of Dask-GeoPandas and provides a basic timing comparison between operations on geopandas.GeoDataFrame and parallel dask_geopandas.GeoDataFrame.

You can access this notebook (in a Docker image) on this GitHub repo.

import os
import requests
import numpy as np
import geopandas as gpd
import dask_geopandas as dg
import cartopy

18.1. Download a Sample Dataset using Cartopy#

We are going to use the Natural Earth dataset. This dataset has several vector files for different physical and cultural boundaries at different spatial scales. We will use the 110m Admin 0 Countries dataset.

You can use cartopy to download this dataset locally. Note that the file will be downloaded to a local cache folder.

admin_shp = cartopy.io.shapereader.natural_earth(
    resolution='110m',
    category='cultural',
    name='admin_0_countries'
)

18.2. Creating a Dask-GeoPandas GeoDataFrame#

There are many ways to create a dask_geopandas.GeoDataFrame. If your initial data fits in memory, you can create it from a geopandas.GeoDataFrame using the from_geopandas function:

gdf = gpd.read_file(admin_shp)
ERROR 1: PROJ: proj_create_from_database: Open of /opt/conda/envs/vector_tutorial/share/proj failed
gdf.head()
featurecla scalerank LABELRANK SOVEREIGNT SOV_A3 ADM0_DIF LEVEL TYPE TLC ADMIN ... FCLASS_TR FCLASS_ID FCLASS_PL FCLASS_GR FCLASS_IT FCLASS_NL FCLASS_SE FCLASS_BD FCLASS_UA geometry
0 Admin-0 country 1 6 Fiji FJI 0 2 Sovereign country 1 Fiji ... None None None None None None None None None MULTIPOLYGON (((180 -16.06713, 180 -16.55522, ...
1 Admin-0 country 1 3 United Republic of Tanzania TZA 0 2 Sovereign country 1 United Republic of Tanzania ... None None None None None None None None None POLYGON ((33.90371 -0.95, 34.07262 -1.05982, 3...
2 Admin-0 country 1 7 Western Sahara SAH 0 2 Indeterminate 1 Western Sahara ... Unrecognized Unrecognized Unrecognized None None Unrecognized None None None POLYGON ((-8.66559 27.65643, -8.66512 27.58948...
3 Admin-0 country 1 2 Canada CAN 0 2 Sovereign country 1 Canada ... None None None None None None None None None MULTIPOLYGON (((-122.84 49, -122.97421 49.0025...
4 Admin-0 country 1 2 United States of America US1 1 2 Country 1 United States of America ... None None None None None None None None None MULTIPOLYGON (((-122.84 49, -120 49, -117.0312...

5 rows × 169 columns

When creating dask_geopandas.GeoDataFrame we have to specify how to partittion using npartitons or chunksize. Here we use npartitions to split it into N equal chunks.

ddf = dg.from_geopandas(gdf, npartitions=4)
ddf
Dask-GeoPandas GeoDataFrame Structure:
featurecla scalerank LABELRANK SOVEREIGNT SOV_A3 ADM0_DIF LEVEL TYPE TLC ADMIN ADM0_A3 GEOU_DIF GEOUNIT GU_A3 SU_DIF SUBUNIT SU_A3 BRK_DIFF NAME NAME_LONG BRK_A3 BRK_NAME BRK_GROUP ABBREV POSTAL FORMAL_EN FORMAL_FR NAME_CIAWF NOTE_ADM0 NOTE_BRK NAME_SORT NAME_ALT MAPCOLOR7 MAPCOLOR8 MAPCOLOR9 MAPCOLOR13 POP_EST POP_RANK POP_YEAR GDP_MD GDP_YEAR ECONOMY INCOME_GRP FIPS_10 ISO_A2 ISO_A2_EH ISO_A3 ISO_A3_EH ISO_N3 ISO_N3_EH UN_A3 WB_A2 WB_A3 WOE_ID WOE_ID_EH WOE_NOTE ADM0_ISO ADM0_DIFF ADM0_TLC ADM0_A3_US ADM0_A3_FR ADM0_A3_RU ADM0_A3_ES ADM0_A3_CN ADM0_A3_TW ADM0_A3_IN ADM0_A3_NP ADM0_A3_PK ADM0_A3_DE ADM0_A3_GB ADM0_A3_BR ADM0_A3_IL ADM0_A3_PS ADM0_A3_SA ADM0_A3_EG ADM0_A3_MA ADM0_A3_PT ADM0_A3_AR ADM0_A3_JP ADM0_A3_KO ADM0_A3_VN ADM0_A3_TR ADM0_A3_ID ADM0_A3_PL ADM0_A3_GR ADM0_A3_IT ADM0_A3_NL ADM0_A3_SE ADM0_A3_BD ADM0_A3_UA ADM0_A3_UN ADM0_A3_WB CONTINENT REGION_UN SUBREGION REGION_WB NAME_LEN LONG_LEN ABBREV_LEN TINY HOMEPART MIN_ZOOM MIN_LABEL MAX_LABEL LABEL_X LABEL_Y NE_ID WIKIDATAID NAME_AR NAME_BN NAME_DE NAME_EN NAME_ES NAME_FA NAME_FR NAME_EL NAME_HE NAME_HI NAME_HU NAME_ID NAME_IT NAME_JA NAME_KO NAME_NL NAME_PL NAME_PT NAME_RU NAME_SV NAME_TR NAME_UK NAME_UR NAME_VI NAME_ZH NAME_ZHT FCLASS_ISO TLC_DIFF FCLASS_TLC FCLASS_US FCLASS_FR FCLASS_RU FCLASS_ES FCLASS_CN FCLASS_TW FCLASS_IN FCLASS_NP FCLASS_PK FCLASS_DE FCLASS_GB FCLASS_BR FCLASS_IL FCLASS_PS FCLASS_SA FCLASS_EG FCLASS_MA FCLASS_PT FCLASS_AR FCLASS_JP FCLASS_KO FCLASS_VN FCLASS_TR FCLASS_ID FCLASS_PL FCLASS_GR FCLASS_IT FCLASS_NL FCLASS_SE FCLASS_BD FCLASS_UA geometry
npartitions=4
0 string int32 int32 string string int32 int32 string string string string int32 string string int32 string string int32 string string string string string string string string string string string string string string int32 int32 int32 int32 float64 int32 int32 int32 int32 string string string string string string string string string string string string int32 int32 string string string string string string string string string string string string string string string string string string string string string string string string string string string string string string string string string string string int32 int32 string string string string int32 int32 int32 int32 int32 float64 float64 float64 float64 float64 int64 string string string string string string string string string string string string string string string string string string string string string string string string string string string string string string string string string string string string string string string string string string string string string string string string string string string string string string string string string string string string string geometry
45 ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
89 ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
133 ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
176 ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
Dask Name: frompandas, 1 expression

Let’s try computation on a non-geometry column:

ddf["CONTINENT"].value_counts().compute()
CONTINENT
Africa                     51
Asia                       47
Seven seas (open ocean)     1
South America              13
Oceania                     7
Antarctica                  1
Europe                     39
North America              18
Name: count, dtype: int64[pyarrow]
/opt/conda/envs/vector_tutorial/lib/python3.12/site-packages/dask/dataframe/core.py:7175: UserWarning: Geometry is in a geographic CRS. Results from 'area' are likely incorrect. Use 'GeoSeries.to_crs()' to re-project geometries to a projected CRS before this operation.

  df = func(*args, **kwargs)

And calling one of the geopandas-specific methods or attributes:

ddf.geometry.area
/opt/conda/envs/vector_tutorial/lib/python3.12/site-packages/dask_geopandas/expr.py:185: UserWarning: Geometry is in a geographic CRS. Results from 'area' are likely incorrect. Use 'GeoSeries.to_crs()' to re-project geometries to a projected CRS before this operation.

  meta = getattr(self._meta, attr)
Dask Series Structure:
npartitions=4
0      float64
45         ...
89         ...
133        ...
176        ...
Dask Name: area, 3 expressions
Expr=MapPartitions(getattr)

As you can see, without calling compute(), the resulting Series does not yet contain any values. Also note the warning about area calculation. Since the crs of our dataset is EPSG:4326 the area calculation will result in area in degrees of latitude and longitude.

ddf.geometry.area.compute()
0         1.639511
1        76.301964
2         8.603984
3      1712.995228
4      1122.281921
          ...     
172       8.604719
173       1.479321
174       1.231641
175       0.639000
176      51.196106
Length: 177, dtype: float64

18.3. Timing comparison: Point-in-polygon with 10 million points#

The GeoDataFrame used above is a bit small to see any benefit from parallelization using dask (as the overhead of the task scheduler is larger than the actual operation on such a small dataframe), so let’s create a bigger point GeoSeries:

N = 10_000_000
points_df = gpd.GeoDataFrame(geometry=gpd.points_from_xy(np.random.randn(N),np.random.randn(N)))

And creating the dask-geopandas version of this series:

points_ddf = dg.from_geopandas(points_df, npartitions=16)

Let’s create a polygon and check if the points are located within this polygon:

import shapely.geometry
box = shapely.geometry.box(0, 0, 1, 1)

The within operation will result in a boolean Series:

points_ddf.within(box)
Dask Series Structure:
npartitions=16
0          bool
625000      ...
           ... 
9375000     ...
9999999     ...
Dask Name: within, 2 expressions
Expr=UFunc(within)

The relative number of the points within the polygon:

(points_ddf.within(box).sum() / len(points_ddf)).compute()
0.1164894

Let’s compare the time it takes to compute this:

%timeit points_df.within(box)
235 ms ± 2.49 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit points_ddf.within(box).compute()
46.7 ms ± 1.12 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)