15. Introduction to Dask-GeoPandas#

Attribution: This notebook is a revised version of the Basic Introduction notebook from Dask-GeoPandas documentation.

This notebook illustrates the basic API of Dask-GeoPandas and provides a simple timing comparison between operations on a geopandas.GeoDataFrame and the same operations on a parallel dask_geopandas.GeoDataFrame.

import numpy as np
import geopandas
import dask_geopandas

15.1. Creating a parallelized dask_geopandas.GeoDataFrame#

There are many ways to create a parallelized dask_geopandas.GeoDataFrame. If your initial data fits in memory, you can create one from a geopandas.GeoDataFrame using the from_geopandas function:

df = geopandas.read_file(geopandas.datasets.get_path("naturalearth_lowres"))
/tmp/ipykernel_392/1252905281.py:1: FutureWarning: The geopandas.dataset module is deprecated and will be removed in GeoPandas 1.0. You can get the original 'naturalearth_lowres' data from https://www.naturalearthdata.com/downloads/110m-cultural-vectors/.
  df = geopandas.read_file(geopandas.datasets.get_path("naturalearth_lowres"))
df.head()
pop_est continent name iso_a3 gdp_md_est geometry
0 889953.0 Oceania Fiji FJI 5496 MULTIPOLYGON (((180.00000 -16.06713, 180.00000...
1 58005463.0 Africa Tanzania TZA 63177 POLYGON ((33.90371 -0.95000, 34.07262 -1.05982...
2 603253.0 Africa W. Sahara ESH 907 POLYGON ((-8.66559 27.65643, -8.66512 27.58948...
3 37589262.0 North America Canada CAN 1736425 MULTIPOLYGON (((-122.84000 49.00000, -122.9742...
4 328239523.0 North America United States of America USA 21433226 MULTIPOLYGON (((-122.84000 49.00000, -120.0000...

When creating a dask_geopandas.GeoDataFrame, we have to specify how to partition it using npartitions or chunksize. Here we use npartitions to split it into N equal chunks.

ddf = dask_geopandas.from_geopandas(df, npartitions=4)
ddf
Dask-GeoPandas GeoDataFrame Structure:
pop_est continent name iso_a3 gdp_md_est geometry
npartitions=4
0 float64 object object object int64 geometry
45 ... ... ... ... ... ...
89 ... ... ... ... ... ...
133 ... ... ... ... ... ...
176 ... ... ... ... ... ...
Dask Name: from_pandas, 1 graph layer
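
Instead of npartitions, you can pass chunksize to control how many rows go into each partition, and data that does not fit in memory can be read directly into partitions from a file. A minimal sketch of both options, assuming the same df as above; the chunk size and file path below are illustrative only:

# Alternative (sketch): chunksize sets the number of rows per partition
# instead of fixing the number of partitions; 50 is an illustrative value.
ddf_chunked = dask_geopandas.from_geopandas(df, chunksize=50)

# For larger-than-memory data, read a file directly into partitions.
# "countries.gpkg" is a hypothetical path, not shipped with this notebook.
# ddf_from_file = dask_geopandas.read_file("countries.gpkg", npartitions=4)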

Computation on a non-geometry column:

ddf["continent"].value_counts().compute()
continent
Africa                     51
Asia                       47
Europe                     39
North America              18
South America              13
Oceania                     7
Antarctica                  1
Seven seas (open ocean)     1
Name: count, dtype: int64
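
Since non-geometry columns behave like ordinary dask.dataframe columns, other standard operations work the same way; a small sketch (this aggregation is our illustration, not part of the original notebook):

# Sketch: a standard dask.dataframe aggregation on non-geometry columns,
# e.g. total population estimate per continent (lazy until compute()).
ddf.groupby("continent")["pop_est"].sum().compute()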

And calling one of the geopandas-specific methods or attributes:

ddf.geometry.area
/srv/conda/envs/notebook/lib/python3.11/site-packages/dask_geopandas/core.py:125: UserWarning: Geometry is in a geographic CRS. Results from 'area' are likely incorrect. Use 'GeoSeries.to_crs()' to re-project geometries to a projected CRS before this operation.

  meta = getattr(self._meta, attr)
Dask Series Structure:
npartitions=4
0      float64
45         ...
89         ...
133        ...
176        ...
dtype: float64
Dask Name: area, 3 graph layers

As you can see, without calling compute(), the resulting Series does not yet contain any values.

ddf.geometry.area.compute()
/srv/conda/envs/notebook/lib/python3.11/site-packages/dask/dataframe/core.py:7023: UserWarning: Geometry is in a geographic CRS. Results from 'area' are likely incorrect. Use 'GeoSeries.to_crs()' to re-project geometries to a projected CRS before this operation.

  df = func(*args, **kwargs)
0         1.639511
1        76.301964
2         8.603984
3      1712.995228
4      1122.281921
          ...     
172       8.604719
173       1.479321
174       1.231641
175       0.639000
176      51.196106
Length: 177, dtype: float64
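
The warnings above point out that computing area in a geographic CRS gives values in squared degrees, which are rarely meaningful. A sketch of the fix the warning suggests, re-projecting first; EPSG:6933 (an equal-area projection) is our choice here, as the notebook itself does not re-project:

# Re-project to an equal-area CRS before computing area, as the warning
# suggests; EPSG:6933 is illustrative, and areas are then in square metres.
ddf.to_crs(epsg=6933).geometry.area.compute()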

15.2. Timing comparison: Point-in-polygon with 10 million points#

The GeoDataFrame used above is a bit too small to see any benefit from parallelization using Dask (the overhead of the task scheduler is larger than the actual operation on such a tiny dataframe), so let’s create a bigger GeoDataFrame of random points:

N = 10_000_000
points = geopandas.GeoDataFrame(
    geometry=geopandas.points_from_xy(np.random.randn(N), np.random.randn(N))
)

And we create the dask-geopandas version of this GeoDataFrame:

dpoints = dask_geopandas.from_geopandas(points, npartitions=16)

We also define a single polygon, and we will check which points are located within it:

import shapely.geometry
box = shapely.geometry.box(0, 0, 1, 1)

The within operation will result in a boolean Series:

dpoints.within(box)
Dask Series Structure:
npartitions=16
0          bool
625000      ...
           ... 
9375000     ...
9999999     ...
dtype: bool
Dask Name: within, 2 graph layers

The fraction of points located within the polygon:

(dpoints.within(box).sum() / len(dpoints)).compute()
0.1163134
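
As a quick sanity check (our addition, assuming scipy is available): the x and y coordinates are independent standard normals, so the expected fraction inside the unit box is (Φ(1) − Φ(0))² ≈ 0.1165, close to the computed value:

# Sanity check (sketch): with independent standard normal coordinates,
# P(point in unit box) = (Phi(1) - Phi(0))**2 ≈ 0.3413**2 ≈ 0.1165.
from scipy.stats import norm
(norm.cdf(1) - norm.cdf(0)) ** 2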

Let’s compare the time it takes to compute this:

%timeit points.within(box)
820 ms ± 21.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit dpoints.within(box).compute()
362 ms ± 26 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

This was run on a virtual machine with 8 Dask workers, giving roughly a 2x speed-up using multithreading.
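
By default, these collections use Dask's multithreaded scheduler; the choice can be made explicit, or swapped for a local distributed cluster when more isolation is needed. A sketch (the worker count is illustrative, not measured here):

# Sketch: make the default threaded scheduler explicit.
dpoints.within(box).compute(scheduler="threads")

# Or run on a local distributed cluster instead (illustrative worker count):
# from dask.distributed import Client
# client = Client(n_workers=8)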