15. Introduction to Dask-GeoPandas#

Attribution: This notebook is a revised version of the Basic Introduction notebook from Dask-GeoPandas documentation.

This notebook illustrates the basic API of Dask-GeoPandas and provides a simple timing comparison between operations on a geopandas.GeoDataFrame and the same operations on a parallel dask_geopandas.GeoDataFrame.

import numpy as np
import geopandas
import dask_geopandas

15.1. Creating a parallelized dask_geopandas.GeoDataFrame#

There are many ways to create a parallelized dask_geopandas.GeoDataFrame. If your initial data fits in memory, you can create one from a geopandas.GeoDataFrame using the from_geopandas function:

df = geopandas.read_file(geopandas.datasets.get_path("naturalearth_lowres"))
/tmp/ipykernel_392/1252905281.py:1: FutureWarning: The geopandas.dataset module is deprecated and will be removed in GeoPandas 1.0. You can get the original 'naturalearth_lowres' data from https://www.naturalearthdata.com/downloads/110m-cultural-vectors/.
  df = geopandas.read_file(geopandas.datasets.get_path("naturalearth_lowres"))
df.head()
pop_est continent name iso_a3 gdp_md_est geometry
0 889953.0 Oceania Fiji FJI 5496 MULTIPOLYGON (((180.00000 -16.06713, 180.00000...
1 58005463.0 Africa Tanzania TZA 63177 POLYGON ((33.90371 -0.95000, 34.07262 -1.05982...
2 603253.0 Africa W. Sahara ESH 907 POLYGON ((-8.66559 27.65643, -8.66512 27.58948...
3 37589262.0 North America Canada CAN 1736425 MULTIPOLYGON (((-122.84000 49.00000, -122.9742...
4 328239523.0 North America United States of America USA 21433226 MULTIPOLYGON (((-122.84000 49.00000, -120.0000...

When creating a dask_geopandas.GeoDataFrame, we have to specify how to partition it using npartitions or chunksize. Here we use npartitions to split it into N equal chunks.

ddf = dask_geopandas.from_geopandas(df, npartitions=4)
ddf
Dask-GeoPandas GeoDataFrame Structure:
pop_est continent name iso_a3 gdp_md_est geometry
npartitions=4
0 float64 object object object int64 geometry
45 ... ... ... ... ... ...
89 ... ... ... ... ... ...
133 ... ... ... ... ... ...
176 ... ... ... ... ... ...
Dask Name: from_pandas, 1 graph layer
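
Instead of npartitions, you can pass chunksize to control how many rows go into each partition, and data that does not fit in memory can be read directly into partitions from a file. A minimal sketch of both options, assuming the same df as above; the chunk size and file path below are illustrative only:

# Alternative (sketch): chunksize sets the number of rows per partition
# instead of fixing the number of partitions; 50 is an illustrative value.
ddf_chunked = dask_geopandas.from_geopandas(df, chunksize=50)

# For larger-than-memory data, read a file directly into partitions.
# "countries.gpkg" is a hypothetical path, not shipped with this notebook.
# ddf_from_file = dask_geopandas.read_file("countries.gpkg", npartitions=4)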

Computation on a non-geometry column:

ddf["continent"].value_counts().compute()
continent
Africa                     51
Asia                       47
Europe                     39
North America              18
South America              13
Oceania                     7
Antarctica                  1
Seven seas (open ocean)     1
Name: count, dtype: int64
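
Since non-geometry columns behave like ordinary dask.dataframe columns, other standard operations work the same way; a small sketch (this aggregation is our illustration, not part of the original notebook):

# Sketch: a standard dask.dataframe aggregation on non-geometry columns,
# e.g. total population estimate per continent (lazy until compute()).
ddf.groupby("continent")["pop_est"].sum().compute()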

And calling one of the geopandas-specific methods or attributes:

ddf.geometry.area
/srv/conda/envs/notebook/lib/python3.11/site-packages/dask_geopandas/core.py:125: UserWarning: Geometry is in a geographic CRS. Results from 'area' are likely incorrect. Use 'GeoSeries.to_crs()' to re-project geometries to a projected CRS before this operation.

  meta = getattr(self._meta, attr)
Dask Series Structure:
npartitions=4
0      float64
45         ...
89         ...
133        ...
176        ...
dtype: float64
Dask Name: area, 3 graph layers

As you can see, without calling compute(), the resulting Series does not yet contain any values.

ddf.geometry.area.compute()
/srv/conda/envs/notebook/lib/python3.11/site-packages/dask/dataframe/core.py:7023: UserWarning: Geometry is in a geographic CRS. Results from 'area' are likely incorrect. Use 'GeoSeries.to_crs()' to re-project geometries to a projected CRS before this operation.

  df = func(*args, **kwargs)
0         1.639511
1        76.301964
2         8.603984
3      1712.995228
4      1122.281921
          ...     
172       8.604719
173       1.479321
174       1.231641
175       0.639000
176      51.196106
Length: 177, dtype: float64
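
The warnings above point out that computing area in a geographic CRS gives values in squared degrees, which are rarely meaningful. A sketch of the fix the warning suggests, re-projecting first; EPSG:6933 (an equal-area projection) is our choice here, as the notebook itself does not re-project:

# Re-project to an equal-area CRS before computing area, as the warning
# suggests; EPSG:6933 is illustrative, and areas are then in square metres.
ddf.to_crs(epsg=6933).geometry.area.compute()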

15.2. Timing comparison: Point-in-polygon with 10 million points#

The GeoDataFrame used above is a bit too small to see any benefit from parallelization using Dask (the overhead of the task scheduler is larger than the actual operation on such a tiny dataframe), so let’s create a bigger GeoDataFrame of random points:

N = 10_000_000
points = geopandas.GeoDataFrame(
    geometry=geopandas.points_from_xy(np.random.randn(N), np.random.randn(N))
)

And we create the dask-geopandas version of this GeoDataFrame:

dpoints = dask_geopandas.from_geopandas(points, npartitions=16)

We also define a single polygon, and we will check which points are located within it:

import shapely.geometry
box = shapely.geometry.box(0, 0, 1, 1)

The within operation will result in a boolean Series:

dpoints.within(box)
Dask Series Structure:
npartitions=16
0          bool
625000      ...
           ... 
9375000     ...
9999999     ...
dtype: bool
Dask Name: within, 2 graph layers

The fraction of points located within the polygon:

(dpoints.within(box).sum() / len(dpoints)).compute()
0.1163134
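
As a quick sanity check (our addition, assuming scipy is available): the x and y coordinates are independent standard normals, so the expected fraction inside the unit box is (Φ(1) − Φ(0))² ≈ 0.1165, close to the computed value:

# Sanity check (sketch): with independent standard normal coordinates,
# P(point in unit box) = (Phi(1) - Phi(0))**2 ≈ 0.3413**2 ≈ 0.1165.
from scipy.stats import norm
(norm.cdf(1) - norm.cdf(0)) ** 2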

Let’s compare the time it takes to compute this:

%timeit points.within(box)
820 ms ± 21.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit dpoints.within(box).compute()
362 ms ± 26 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

This was run on a virtual machine with 8 Dask workers, giving roughly a 2x speed-up using multithreading.
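
By default, these collections use Dask's multithreaded scheduler; the choice can be made explicit, or swapped for a local distributed cluster when more isolation is needed. A sketch (the worker count is illustrative, not measured here):

# Sketch: make the default threaded scheduler explicit.
dpoints.within(box).compute(scheduler="threads")

# Or run on a local distributed cluster instead (illustrative worker count):
# from dask.distributed import Client
# client = Client(n_workers=8)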