{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Introduction to Dask-GeoPandas\n",
"\n",
"**Attribution**: *This notebook is a revised version of the [Basic Introduction](https://dask-geopandas.readthedocs.io/en/stable/guide/basic-intro.html) notebook from Dask-GeoPandas documentation.* \n",
"\n",
"This notebook illustrates the basic API of Dask-GeoPandas and provides a basic timing comparison between operations on `geopandas.GeoDataFrame` and parallel `dask_geopandas.GeoDataFrame`."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"import numpy as np\n",
"import geopandas\n",
"import dask_geopandas"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Creating a parallelized `dask_geopandas.GeoDataFrame`\n",
"\n",
"There are several ways to create a parallelized `dask_geopandas.GeoDataFrame`. If your initial data fits in memory, you can create one from a `geopandas.GeoDataFrame` using the `from_geopandas` function:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/tmp/ipykernel_392/1252905281.py:1: FutureWarning: The geopandas.dataset module is deprecated and will be removed in GeoPandas 1.0. You can get the original 'naturalearth_lowres' data from https://www.naturalearthdata.com/downloads/110m-cultural-vectors/.\n",
" df = geopandas.read_file(geopandas.datasets.get_path(\"naturalearth_lowres\"))\n"
]
}
],
"source": [
"df = geopandas.read_file(geopandas.datasets.get_path(\"naturalearth_lowres\"))"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
" pop_est continent name iso_a3 gdp_md_est \\\n",
"0 889953.0 Oceania Fiji FJI 5496 \n",
"1 58005463.0 Africa Tanzania TZA 63177 \n",
"2 603253.0 Africa W. Sahara ESH 907 \n",
"3 37589262.0 North America Canada CAN 1736425 \n",
"4 328239523.0 North America United States of America USA 21433226 \n",
"\n",
" geometry \n",
"0 MULTIPOLYGON (((180.00000 -16.06713, 180.00000... \n",
"1 POLYGON ((33.90371 -0.95000, 34.07262 -1.05982... \n",
"2 POLYGON ((-8.66559 27.65643, -8.66512 27.58948... \n",
"3 MULTIPOLYGON (((-122.84000 49.00000, -122.9742... \n",
"4 MULTIPOLYGON (((-122.84000 49.00000, -120.0000... "
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When creating a `dask_geopandas.GeoDataFrame`, we have to specify how to partition the data, using either `npartitions` or `chunksize`. Here we use `npartitions` to split it into N roughly equal chunks."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"ddf = dask_geopandas.from_geopandas(df, npartitions=4)"
]
},
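{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a sketch of the alternative, `from_geopandas` also accepts a `chunksize` argument (mirroring `dask.dataframe.from_pandas`), which sets the maximum number of rows per partition instead of a fixed number of partitions:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# At most 50 rows per partition; the 177-row dataframe therefore\n",
"# ends up in 4 partitions.\n",
"ddf_chunked = dask_geopandas.from_geopandas(df, chunksize=50)\n",
"ddf_chunked.npartitions"
]
},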
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"Dask GeoDataFrame Structure:\n",
" pop_est continent name iso_a3 gdp_md_est geometry\n",
"npartitions=4 \n",
"0 float64 object object object int64 geometry\n",
"45 ... ... ... ... ... ...\n",
"89 ... ... ... ... ... ...\n",
"133 ... ... ... ... ... ...\n",
"176 ... ... ... ... ... ...\n",
"Dask Name: from_pandas, 1 graph layer"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ddf"
]
},
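{
"cell_type": "markdown",
"metadata": {},
"source": [
"Each partition of a `dask_geopandas.GeoDataFrame` is an ordinary `geopandas.GeoDataFrame`. A quick sketch using the standard Dask `get_partition` method to pull one partition back into memory:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# Materialize the first of the four partitions as a regular\n",
"# geopandas.GeoDataFrame.\n",
"part = ddf.get_partition(0).compute()\n",
"type(part), len(part)"
]
},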
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Computation on a non-geometry column:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"continent\n",
"Africa 51\n",
"Asia 47\n",
"Europe 39\n",
"North America 18\n",
"South America 13\n",
"Oceania 7\n",
"Antarctica 1\n",
"Seven seas (open ocean) 1\n",
"Name: count, dtype: int64"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ddf[\"continent\"].value_counts().compute()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And calling one of the geopandas-specific methods or attributes:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/srv/conda/envs/notebook/lib/python3.11/site-packages/dask_geopandas/core.py:125: UserWarning: Geometry is in a geographic CRS. Results from 'area' are likely incorrect. Use 'GeoSeries.to_crs()' to re-project geometries to a projected CRS before this operation.\n",
"\n",
" meta = getattr(self._meta, attr)\n"
]
},
{
"data": {
"text/plain": [
"Dask Series Structure:\n",
"npartitions=4\n",
"0 float64\n",
"45 ...\n",
"89 ...\n",
"133 ...\n",
"176 ...\n",
"dtype: float64\n",
"Dask Name: area, 3 graph layers"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ddf.geometry.area"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As you can see, without calling `compute()`, the resulting Series does not yet contain any values."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/srv/conda/envs/notebook/lib/python3.11/site-packages/dask/dataframe/core.py:7023: UserWarning: Geometry is in a geographic CRS. Results from 'area' are likely incorrect. Use 'GeoSeries.to_crs()' to re-project geometries to a projected CRS before this operation.\n",
"\n",
" df = func(*args, **kwargs)\n"
]
},
{
"data": {
"text/plain": [
"0 1.639511\n",
"1 76.301964\n",
"2 8.603984\n",
"3 1712.995228\n",
"4 1122.281921\n",
" ... \n",
"172 8.604719\n",
"173 1.479321\n",
"174 1.231641\n",
"175 0.639000\n",
"176 51.196106\n",
"Length: 177, dtype: float64"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ddf.geometry.area.compute()"
]
},
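{
"cell_type": "markdown",
"metadata": {},
"source": [
"The warning above flags that areas computed in a geographic CRS (degrees) are not meaningful. A minimal sketch of the recommended fix, re-projecting to a projected CRS first (the equal-area EPSG:6933 used here is an arbitrary choice for illustration):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# to_crs is applied lazily, partition by partition; the resulting\n",
"# areas are in square metres rather than square degrees.\n",
"ddf.to_crs(\"EPSG:6933\").geometry.area.compute().head()"
]
},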
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Timing comparison: Point-in-polygon with 10 million points"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `GeoDataFrame` used above is a bit small to see any benefit from parallelization using Dask (the overhead of the task scheduler outweighs the actual operation on such a tiny dataframe), so let's create a bigger `GeoDataFrame` of random points:"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"N = 10_000_000"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"points = geopandas.GeoDataFrame(\n",
"    geometry=geopandas.points_from_xy(np.random.randn(N), np.random.randn(N))\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And create the Dask-GeoPandas version of this dataframe:"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"dpoints = dask_geopandas.from_geopandas(points, npartitions=16)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, we define a single polygon and check which of the points are located within it:"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"import shapely.geometry\n",
"box = shapely.geometry.box(0, 0, 1, 1)"
]
},
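{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the semantics of `within` concrete, here is a tiny sketch with plain GeoPandas: of the two sample points below, only (0.5, 0.5) falls inside the unit box:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"sample = geopandas.GeoSeries(\n",
"    [shapely.geometry.Point(0.5, 0.5), shapely.geometry.Point(2, 2)]\n",
")\n",
"# Expect [True, False]: only the first point lies inside the box.\n",
"sample.within(box)"
]
},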
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `within` operation will result in a boolean Series:"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"Dask Series Structure:\n",
"npartitions=16\n",
"0 bool\n",
"625000 ...\n",
" ... \n",
"9375000 ...\n",
"9999999 ...\n",
"dtype: bool\n",
"Dask Name: within, 2 graph layers"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dpoints.within(box)"
]
},
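{
"cell_type": "markdown",
"metadata": {},
"source": [
"The lazy boolean Series also works as a mask, so you can filter the points without materializing anything until `compute()` is called. A sketch:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# Lazily select the points inside the box; nothing is computed\n",
"# until we ask for the length of the materialized result.\n",
"inside = dpoints[dpoints.within(box)]\n",
"len(inside.compute())"
]
},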
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The fraction of points that fall within the polygon:"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"0.1163134"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(dpoints.within(box).sum() / len(dpoints)).compute()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's compare the time it takes to compute this:"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"820 ms ± 21.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
]
}
],
"source": [
"%timeit points.within(box)"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"362 ms ± 26 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
]
}
],
"source": [
"%timeit dpoints.within(box).compute()"
]
},
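{
"cell_type": "markdown",
"metadata": {},
"source": [
"The speed-up comes from Dask's default multithreaded scheduler. As a sketch, you can force single-threaded execution through `dask.config` to see what the parallelism buys (exact timings will vary by machine):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"import dask\n",
"\n",
"# Force single-threaded execution for comparison.\n",
"dask.config.set(scheduler=\"single-threaded\")\n",
"%timeit dpoints.within(box).compute()\n",
"\n",
"# Restore the default threaded scheduler.\n",
"dask.config.set(scheduler=\"threads\")"
]
},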
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This was run on a virtual machine with 8 Dask workers and gives roughly a 2x speed-up thanks to multithreaded execution."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.4"
}
},
"nbformat": 4,
"nbformat_minor": 4
}