{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Introduction to Dask-GeoPandas\n",
"\n",
"**Attribution**: *This notebook is a revised version of the [Basic Introduction](https://dask-geopandas.readthedocs.io/en/stable/guide/basic-intro.html) notebook from Dask-GeoPandas documentation.* \n",
"\n",
"This notebook illustrates the basic API of Dask-GeoPandas and provides a basic timing comparison between operations on `geopandas.GeoDataFrame` and parallel `dask_geopandas.GeoDataFrame`."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"import numpy as np\n",
"import geopandas\n",
"import dask_geopandas"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Creating a parallelized `dask_geopandas.GeoDataFrame`\n",
"\n",
"There are several ways to create a parallelized `dask_geopandas.GeoDataFrame`. If your initial data fits in memory, you can create one from a `geopandas.GeoDataFrame` using the `from_geopandas` function:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/tmp/ipykernel_392/1252905281.py:1: FutureWarning: The geopandas.dataset module is deprecated and will be removed in GeoPandas 1.0. You can get the original 'naturalearth_lowres' data from https://www.naturalearthdata.com/downloads/110m-cultural-vectors/.\n",
" df = geopandas.read_file(geopandas.datasets.get_path(\"naturalearth_lowres\"))\n"
]
}
],
"source": [
"df = geopandas.read_file(geopandas.datasets.get_path(\"naturalearth_lowres\"))"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
" pop_est continent name iso_a3 gdp_md_est \\\n",
"0 889953.0 Oceania Fiji FJI 5496 \n",
"1 58005463.0 Africa Tanzania TZA 63177 \n",
"2 603253.0 Africa W. Sahara ESH 907 \n",
"3 37589262.0 North America Canada CAN 1736425 \n",
"4 328239523.0 North America United States of America USA 21433226 \n",
"\n",
" geometry \n",
"0 MULTIPOLYGON (((180.00000 -16.06713, 180.00000... \n",
"1 POLYGON ((33.90371 -0.95000, 34.07262 -1.05982... \n",
"2 POLYGON ((-8.66559 27.65643, -8.66512 27.58948... \n",
"3 MULTIPOLYGON (((-122.84000 49.00000, -122.9742... \n",
"4 MULTIPOLYGON (((-122.84000 49.00000, -120.0000... "
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When creating a `dask_geopandas.GeoDataFrame`, we have to specify how to partition the data, using either `npartitions` or `chunksize`. Here we use `npartitions` to split it into N roughly equal chunks."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"ddf = dask_geopandas.from_geopandas(df, npartitions=4)"
]
},
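{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a sketch of the alternative, `from_geopandas` also accepts a `chunksize` argument (mirroring `dask.dataframe.from_pandas`), which sets the maximum number of rows per partition instead of a fixed number of partitions:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# At most 50 rows per partition; the 177-row dataframe therefore\n",
"# ends up in 4 partitions.\n",
"ddf_chunked = dask_geopandas.from_geopandas(df, chunksize=50)\n",
"ddf_chunked.npartitions"
]
},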
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"Dask GeoDataFrame Structure:\n",
" pop_est continent name iso_a3 gdp_md_est geometry\n",
"npartitions=4 \n",
"0 float64 object object object int64 geometry\n",
"45 ... ... ... ... ... ...\n",
"89 ... ... ... ... ... ...\n",
"133 ... ... ... ... ... ...\n",
"176 ... ... ... ... ... ...\n",
"Dask Name: from_pandas, 1 graph layer"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ddf"
]
},
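{
"cell_type": "markdown",
"metadata": {},
"source": [
"Each partition of a `dask_geopandas.GeoDataFrame` is an ordinary `geopandas.GeoDataFrame`. A quick sketch using the standard Dask `get_partition` method to pull one partition back into memory:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# Materialize the first of the four partitions as a regular\n",
"# geopandas.GeoDataFrame.\n",
"part = ddf.get_partition(0).compute()\n",
"type(part), len(part)"
]
},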
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Computation on a non-geometry column:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"continent\n",
"Africa 51\n",
"Asia 47\n",
"Europe 39\n",
"North America 18\n",
"South America 13\n",
"Oceania 7\n",
"Antarctica 1\n",
"Seven seas (open ocean) 1\n",
"Name: count, dtype: int64"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ddf[\"continent\"].value_counts().compute()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And calling one of the geopandas-specific methods or attributes:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/srv/conda/envs/notebook/lib/python3.11/site-packages/dask_geopandas/core.py:125: UserWarning: Geometry is in a geographic CRS. Results from 'area' are likely incorrect. Use 'GeoSeries.to_crs()' to re-project geometries to a projected CRS before this operation.\n",
"\n",
" meta = getattr(self._meta, attr)\n"
]
},
{
"data": {
"text/plain": [
"Dask Series Structure:\n",
"npartitions=4\n",
"0 float64\n",
"45 ...\n",
"89 ...\n",
"133 ...\n",
"176 ...\n",
"dtype: float64\n",
"Dask Name: area, 3 graph layers"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ddf.geometry.area"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As you can see, without calling `compute()`, the resulting Series does not yet contain any values."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/srv/conda/envs/notebook/lib/python3.11/site-packages/dask/dataframe/core.py:7023: UserWarning: Geometry is in a geographic CRS. Results from 'area' are likely incorrect. Use 'GeoSeries.to_crs()' to re-project geometries to a projected CRS before this operation.\n",
"\n",
" df = func(*args, **kwargs)\n"
]
},
{
"data": {
"text/plain": [
"0 1.639511\n",
"1 76.301964\n",
"2 8.603984\n",
"3 1712.995228\n",
"4 1122.281921\n",
" ... \n",
"172 8.604719\n",
"173 1.479321\n",
"174 1.231641\n",
"175 0.639000\n",
"176 51.196106\n",
"Length: 177, dtype: float64"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ddf.geometry.area.compute()"
]
},
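{
"cell_type": "markdown",
"metadata": {},
"source": [
"The warning above flags that areas computed in a geographic CRS (degrees) are not meaningful. A minimal sketch of the recommended fix, re-projecting to a projected CRS first (the equal-area EPSG:6933 used here is an arbitrary choice for illustration):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# to_crs is applied lazily, partition by partition; the resulting\n",
"# areas are in square metres rather than square degrees.\n",
"ddf.to_crs(\"EPSG:6933\").geometry.area.compute().head()"
]
},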
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Timing comparison: Point-in-polygon with 10 million points"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `GeoDataFrame` used above is a bit small to see any benefit from parallelization using Dask (the overhead of the task scheduler outweighs the actual operation on such a tiny dataframe), so let's create a bigger `GeoDataFrame` of random points:"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"N = 10_000_000"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"points = geopandas.GeoDataFrame(\n",
"    geometry=geopandas.points_from_xy(np.random.randn(N), np.random.randn(N))\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And create the Dask-GeoPandas version of this dataframe:"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"dpoints = dask_geopandas.from_geopandas(points, npartitions=16)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, we define a single polygon and check which of the points are located within it:"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"import shapely.geometry\n",
"box = shapely.geometry.box(0, 0, 1, 1)"
]
},
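{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the semantics of `within` concrete, here is a tiny sketch with plain GeoPandas: of the two sample points below, only (0.5, 0.5) falls inside the unit box:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"sample = geopandas.GeoSeries(\n",
"    [shapely.geometry.Point(0.5, 0.5), shapely.geometry.Point(2, 2)]\n",
")\n",
"# Expect [True, False]: only the first point lies inside the box.\n",
"sample.within(box)"
]
},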
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `within` operation will result in a boolean Series:"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"Dask Series Structure:\n",
"npartitions=16\n",
"0 bool\n",
"625000 ...\n",
" ... \n",
"9375000 ...\n",
"9999999 ...\n",
"dtype: bool\n",
"Dask Name: within, 2 graph layers"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dpoints.within(box)"
]
},
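{
"cell_type": "markdown",
"metadata": {},
"source": [
"The lazy boolean Series also works as a mask, so you can filter the points without materializing anything until `compute()` is called. A sketch:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# Lazily select the points inside the box; nothing is computed\n",
"# until we ask for the length of the materialized result.\n",
"inside = dpoints[dpoints.within(box)]\n",
"len(inside.compute())"
]
},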
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The fraction of points that fall within the polygon:"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"0.1163134"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(dpoints.within(box).sum() / len(dpoints)).compute()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's compare the time it takes to compute this:"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"820 ms ± 21.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
]
}
],
"source": [
"%timeit points.within(box)"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"362 ms ± 26 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
]
}
],
"source": [
"%timeit dpoints.within(box).compute()"
]
},
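{
"cell_type": "markdown",
"metadata": {},
"source": [
"The speed-up comes from Dask's default multithreaded scheduler. As a sketch, you can force single-threaded execution through `dask.config` to see what the parallelism buys (exact timings will vary by machine):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"import dask\n",
"\n",
"# Force single-threaded execution for comparison.\n",
"dask.config.set(scheduler=\"single-threaded\")\n",
"%timeit dpoints.within(box).compute()\n",
"\n",
"# Restore the default threaded scheduler.\n",
"dask.config.set(scheduler=\"threads\")"
]
},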
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This was run on a virtual machine with 8 Dask workers and gives roughly a 2x speed-up thanks to multithreaded execution."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.4"
}
},
"nbformat": 4,
"nbformat_minor": 4
}