15. Introduction to Dask-GeoPandas
Attribution: This notebook is a revised version of the Basic Introduction notebook from the Dask-GeoPandas documentation.
This notebook illustrates the basic API of Dask-GeoPandas and provides a timing comparison between operations on a geopandas.GeoDataFrame and a parallel dask_geopandas.GeoDataFrame.
import numpy as np
import geopandas
import dask_geopandas
15.1. Creating a parallelized dask_geopandas.GeoDataFrame
There are many ways to create a parallelized dask_geopandas.GeoDataFrame. If your initial data fits in memory, you can create one from a geopandas.GeoDataFrame using the from_geopandas function:
df = geopandas.read_file(geopandas.datasets.get_path("naturalearth_lowres"))
/tmp/ipykernel_392/1252905281.py:1: FutureWarning: The geopandas.dataset module is deprecated and will be removed in GeoPandas 1.0. You can get the original 'naturalearth_lowres' data from https://www.naturalearthdata.com/downloads/110m-cultural-vectors/.
df = geopandas.read_file(geopandas.datasets.get_path("naturalearth_lowres"))
df.head()
| | pop_est | continent | name | iso_a3 | gdp_md_est | geometry |
|---|---|---|---|---|---|---|
| 0 | 889953.0 | Oceania | Fiji | FJI | 5496 | MULTIPOLYGON (((180.00000 -16.06713, 180.00000... |
| 1 | 58005463.0 | Africa | Tanzania | TZA | 63177 | POLYGON ((33.90371 -0.95000, 34.07262 -1.05982... |
| 2 | 603253.0 | Africa | W. Sahara | ESH | 907 | POLYGON ((-8.66559 27.65643, -8.66512 27.58948... |
| 3 | 37589262.0 | North America | Canada | CAN | 1736425 | MULTIPOLYGON (((-122.84000 49.00000, -122.9742... |
| 4 | 328239523.0 | North America | United States of America | USA | 21433226 | MULTIPOLYGON (((-122.84000 49.00000, -120.0000... |
When creating a dask_geopandas.GeoDataFrame we have to specify how to partition it, using either npartitions or chunksize. Here we use npartitions to split it into N equal chunks.
ddf = dask_geopandas.from_geopandas(df, npartitions=4)
ddf
| | pop_est | continent | name | iso_a3 | gdp_md_est | geometry |
|---|---|---|---|---|---|---|
| npartitions=4 | | | | | | |
| 0 | float64 | object | object | object | int64 | geometry |
| 45 | ... | ... | ... | ... | ... | ... |
| 89 | ... | ... | ... | ... | ... | ... |
| 133 | ... | ... | ... | ... | ... | ... |
| 176 | ... | ... | ... | ... | ... | ... |
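from_geopandas is not the only entry point. If the data does not fit in memory, you can read it from disk in partitions instead. A minimal sketch with hypothetical file paths (chunksize requests a target number of rows per partition):
# Read a vector file in chunks of ~50,000 rows per partition (path is hypothetical)
ddf = dask_geopandas.read_file("countries.gpkg", chunksize=50_000)
# Or read a (Geo)Parquet dataset, one partition per file (path is hypothetical)
ddf = dask_geopandas.read_parquet("countries/*.parquet")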
Computation on a non-geometry column:
ddf["continent"].value_counts().compute()
continent
Africa 51
Asia 47
Europe 39
North America 18
South America 13
Oceania 7
Antarctica 1
Seven seas (open ocean) 1
Name: count, dtype: int64
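Other non-spatial Dask DataFrame operations work the same way; for example, a quick sketch (not part of the original notebook) aggregating the population estimate per continent:
# Lazy groupby aggregation; compute() triggers the actual work
ddf.groupby("continent")["pop_est"].sum().compute()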
And calling one of the geopandas-specific methods or attributes:
ddf.geometry.area
/srv/conda/envs/notebook/lib/python3.11/site-packages/dask_geopandas/core.py:125: UserWarning: Geometry is in a geographic CRS. Results from 'area' are likely incorrect. Use 'GeoSeries.to_crs()' to re-project geometries to a projected CRS before this operation.
meta = getattr(self._meta, attr)
Dask Series Structure:
npartitions=4
0 float64
45 ...
89 ...
133 ...
176 ...
dtype: float64
Dask Name: area, 3 graph layers
As you can see, without calling compute(), the resulting Series does not yet contain any values.
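If you only want to peek at the first few rows, head() is a standard Dask shortcut that computes only as many partitions as needed (a small sketch, not in the original notebook):
ddf.geometry.area.head()  # computes only the partitions needed for the first rows
To materialize the full Series, call compute():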
ddf.geometry.area.compute()
/srv/conda/envs/notebook/lib/python3.11/site-packages/dask/dataframe/core.py:7023: UserWarning: Geometry is in a geographic CRS. Results from 'area' are likely incorrect. Use 'GeoSeries.to_crs()' to re-project geometries to a projected CRS before this operation.
df = func(*args, **kwargs)
0 1.639511
1 76.301964
2 8.603984
3 1712.995228
4 1122.281921
...
172 8.604719
173 1.479321
174 1.231641
175 0.639000
176 51.196106
Length: 177, dtype: float64
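The warning is worth heeding: the dataset is in a geographic CRS, so these values are areas in squared degrees rather than a meaningful unit. A sketch of re-projecting first, with the equal-area EPSG:6933 as one assumed choice of CRS:
# Re-project to an (assumed) equal-area CRS before measuring area
ddf.to_crs(epsg=6933).geometry.area.compute()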
15.2. Timing comparison: Point-in-polygon with 10 million points
The GeoDataFrame used above is a bit small to see any benefit from parallelization with Dask (the overhead of the task scheduler is larger than the actual operation on such a tiny dataframe), so let's create a bigger point GeoDataFrame:
N = 10_000_000
points = geopandas.GeoDataFrame(geometry=geopandas.points_from_xy(np.random.randn(N), np.random.randn(N)))
And we create the dask-geopandas version of this dataframe:
dpoints = dask_geopandas.from_geopandas(points, npartitions=16)
We define a single polygon and will check which points are located within it:
import shapely.geometry
box = shapely.geometry.box(0, 0, 1, 1)
The within operation will result in a boolean Series:
dpoints.within(box)
Dask Series Structure:
npartitions=16
0 bool
625000 ...
...
9375000 ...
9999999 ...
dtype: bool
Dask Name: within, 2 graph layers
The fraction of points located within the polygon:
(dpoints.within(box).sum() / len(dpoints)).compute()
0.1163134
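Because both coordinates are drawn independently from a standard normal distribution, this fraction has a closed form: (Φ(1) − Φ(0))² ≈ 0.1165, where Φ is the standard normal CDF. A quick sanity check, assuming SciPy is available:
from scipy.stats import norm

# P(0 < x < 1) * P(0 < y < 1) for i.i.d. standard normal coordinates
(norm.cdf(1) - norm.cdf(0)) ** 2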
Let’s compare the time it takes to compute this:
%timeit points.within(box)
820 ms ± 21.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit dpoints.within(box).compute()
362 ms ± 26 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
This was run on a virtual machine with 8 Dask workers, giving roughly a 2x speed-up from multithreading.
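The scheduler and worker setup determine how large this speed-up is. To control it explicitly, you could start a local cluster yourself; a sketch using dask.distributed (not part of the original notebook, and the worker counts are illustrative):
from dask.distributed import Client, LocalCluster

# Spin up 8 single-threaded workers; adjust to your machine
cluster = LocalCluster(n_workers=8, threads_per_worker=1)
client = Client(cluster)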