# Spatial Data

```
import pandas
import osmnx
import geopandas
import rioxarray
import xarray
import datashader
import contextily as cx
from shapely.geometry import box
import matplotlib.pyplot as plt
```

This chapter grounds the ideas discussed in the previous two chapters in a practical context. We consider how data structures, and the models they represent, are implemented in Python, as well as how we can manipulate the data they contain. Concepts and the code that manipulates the data appear together in a single computational notebook, uniting the two.

Further, we will spend most of the chapter discussing how Python represents data *once read* from a file or database, rather than focusing on specific *file* formats used to store data. This is because the libraries we use will read any format into one of a few canonical data structures that we discuss in Chapter 1. We take this approach because these data structures are what we interact with during our data analysis: they are our interface with the data. Indeed, part of the benefit of Python (and other computing languages) is *abstraction*. The complexities, particularities, and quirks associated with each file format are removed as Python represents all data in a few standard ways, regardless of provenance. We take full advantage of this feature here.

We divide the chapter into two main parts. The first part looks at each of the three main data structures reviewed in Chapter 1 (*Geographic Thinking*): geographic tables, surfaces, and spatial graphs. Second, we explore combinations of different data structures that depart from the traditional data model/structure matchings discussed in Chapter 2. We cover how data in one structure can be effectively transferred to another, but also discuss why that might (or might not) be a good idea in some cases. A final note before we delve into the content of this book is in order: this is not a comprehensive account of *everything* that is possible with each of the data structures we present. Rather, you can think of it as a taster that we will build on throughout the book to showcase much of what is possible with Python.

## Fundamentals

As outlined in Chapter 2, there are a few main data structures that are used in geographic data science: geographic tables (which are generally matched to an object data model), rasters or surfaces (which are generally matched to a field data model), and spatial networks (which are generally matched to a graph data model). We discuss these in turn throughout this section.

### Geographic Tables

Geographic objects are usually matched to what we called the *geographic table*. Geographic tables can be thought of as a tab in a spreadsheet where one of the columns records geometric information. This data structure represents a single geographic object as a row of a table; each column in the table records information about the object, its attributes or features, as we will see below. Typically, there is a special column in this table that records the *geometry* of the object. Computer systems that use this data structure are intended to add geography into a *relational database*, such as PostgreSQL (through its PostGIS extension) or SQLite (through its Spatialite extension). Beyond this, however, many data science languages (such as R, Julia, and Python) have packages that adopt this data structure as well (such as `sf`, `ArchGDAL`, and `geopandas`), and it is rapidly becoming the main data structure for object-based geographic data.

Before proceeding, though, a quick clarification on terminology is helpful. Throughout this book, regardless of the data structure used, we will refer to a measurement about an observation as a *feature*. This is consistent with other work in data science and machine learning. Then, one set of measurements is a *sample*. For tables, this means a feature is a column and a sample is a row. Historically, though, geographic information scientists have used the word “feature” to mean an individual observation, since a “feature” in cartography is an entity on a map, and “attribute” to describe characteristics of that observation. Elsewhere, a *feature* may be called a *variable*, and a *sample* may be referred to as a *record*. So, consistent terminology is important: for this book, a *feature* is one measured trait pertaining to an observation (a column), and a *sample* is one set of measurements (a row).

To understand the structure of geographic tables, it will help to read the `countries_clean.gpkg` dataset included in this book, which describes countries in the world. To read this data, we can use the `read_file()` function in `geopandas`:^[We will generally use two curved brackets (such as `method_name()`) to denote a *function*, and will omit them (such as `package`) when referring to an object or package.]

```
gt_polygons = geopandas.read_file("../data/countries/countries_clean.gpkg")
```

And we can examine the top of the table with the `.head()` method:

```
gt_polygons.head()
```

|   | ADMIN | geometry |
|---|-------|----------|
| 0 | Indonesia | MULTIPOLYGON (((13102705.696 463877.598, 13102... |
| 1 | Malaysia | MULTIPOLYGON (((13102705.696 463877.598, 13101... |
| 2 | Chile | MULTIPOLYGON (((-7737827.685 -1979875.500, -77... |
| 3 | Bolivia | POLYGON ((-7737827.685 -1979875.500, -7737828.... |
| 4 | Peru | MULTIPOLYGON (((-7737827.685 -1979875.500, -77... |

Each row of this table is a single country. Each country only has two features: the administrative name of the country and the geometry of the country’s boundary. The name of the country is encoded in the `ADMIN` column using the Python `str` type, which is used to store text-based data. The geometry of the country’s boundary is stored in the `geometry` column, and is encoded using a special class in Python that is used to represent geometric objects. As with other table-based data structures in Python, every row and column have an index that identifies them uniquely and is rendered in bold on the left-hand side of the table. This geographic table is an instance of the `geopandas.GeoDataFrame` object, used throughout Python’s ecosystem to represent geographic data.

Geographic tables store geographic information as an additional column. But how is this information encoded? To see, we can check the type of the object in the first row:

```
type(gt_polygons.geometry[0])
```

```
shapely.geometry.multipolygon.MultiPolygon
```

In `geopandas` (as well as other packages representing geographic data), the `geometry` column has special traits which a “normal” column, such as `ADMIN`, does not. For example, when we plot the dataframe, the `geometry` column is used as the shapes to draw in the plot:

```
gt_polygons.plot()
```

```
<AxesSubplot:>
```

Changing geometries must be done carefully: since the `geometry` column is special, there are special functions to adjust it. For example, if we wanted to represent each country using its *centroid*, a point in the middle of the shape, we would need to take care to set the new geometry column properly, using the `set_geometry()` method. This can be useful when you want to work with two different geometric representations of the same underlying sample.

Let us make a map of both the boundary and the centroid of each country. First, to compute the centroids, we can use the `gt_polygons.geometry.centroid` property. This gives us the point that minimizes the average distance from all other points on the boundary of the shape. We store it in a new column called `centroid`:

```
gt_polygons['centroid'] = gt_polygons.geometry.centroid
```

We now have an additional feature:

```
gt_polygons.head()
```

|   | ADMIN | geometry | centroid |
|---|-------|----------|----------|
| 0 | Indonesia | MULTIPOLYGON (((13102705.696 463877.598, 13102... | POINT (13055431.810 -248921.141) |
| 1 | Malaysia | MULTIPOLYGON (((13102705.696 463877.598, 13101... | POINT (12211696.493 422897.505) |
| 2 | Chile | MULTIPOLYGON (((-7737827.685 -1979875.500, -77... | POINT (-7959811.948 -4915458.802) |
| 3 | Bolivia | POLYGON ((-7737827.685 -1979875.500, -7737828.... | POINT (-7200010.945 -1894653.148) |
| 4 | Peru | MULTIPOLYGON (((-7737827.685 -1979875.500, -77... | POINT (-8277554.831 -1032942.536) |

Despite the fact that `centroid` is a geometry, it is not currently set as the active geometry for our table. We can switch to the `centroid` column using the `set_geometry()` method, which lets us plot both the centroid and the boundary of each country:

```
# Plot centroids
ax = gt_polygons.set_geometry("centroid").plot("ADMIN", markersize=5)
# Plot polygons without color filling
gt_polygons.plot(
    "ADMIN",
    ax=ax,
    facecolor="none",
    edgecolor="k",
    linewidth=0.2,
)
```

```
<AxesSubplot:>
```

Note how we can create a map by calling `.plot()` on a `GeoDataFrame`. We can thematically color each feature based on a column by passing the name of that column to the plot method (as we do with `ADMIN` in this case).

Thus, as should now be clear, nearly any kind of geographic object can be represented in one (or more) geometry column(s). Thinking about the number of different kinds of shapes or geometries one could draw quickly boggles the mind. Fortunately, the Open Geospatial Consortium (OGC) has defined a set of “abstract” types that can be used to define any kind of geometry. This specification, codified in ISO 19125-1 (the “simple features” specification), defines the formal relationships between these types: a `Point` is a zero-dimensional location with an x and y coordinate; a `LineString` is a path composed of a set of more than one `Point`; and a `Polygon` is a surface that has at least one `LineString` that starts and stops with the same coordinate. All of these types *also* have `Multi` variants that indicate a collection of multiple geometries of the same type. So, for instance, Bolivia is represented as a single polygon:

```
gt_polygons.query('ADMIN == "Bolivia"')
```

|   | ADMIN | geometry | centroid |
|---|-------|----------|----------|
| 3 | Bolivia | POLYGON ((-7737827.685 -1979875.500, -7737828.... | POINT (-7200010.945 -1894653.148) |

```
gt_polygons.query('ADMIN == "Bolivia"').plot();
```

while Indonesia is a `MultiPolygon` containing many `Polygons` for each individual island in the country:

```
gt_polygons.query('ADMIN == "Indonesia"')
```

|   | ADMIN | geometry | centroid |
|---|-------|----------|----------|
| 0 | Indonesia | MULTIPOLYGON (((13102705.696 463877.598, 13102... | POINT (13055431.810 -248921.141) |

```
gt_polygons.query('ADMIN == "Indonesia"').plot()
```

```
<AxesSubplot:>
```

In many cases, geographic tables will have geometries of a single type; records will *all* be `Point` or `LineString`, for instance. However, there is no formal requirement that a *geographic table* has geometries that all have the same type.

Throughout this book, we will use geographic tables extensively, storing polygons, but also points and lines. We will explore lines a bit more in the second part of this chapter but, for now, let us stop on points for a second. As mentioned above, these are the simplest type of geometry in that they do not have any dimension, only a pair of coordinates attached to them. This means that points can sometimes be stored in a non-geographic table, simply using one column for each coordinate. We find an example of this in the Tokyo dataset we will use more later. The data is stored as a comma-separated values table, or `.csv`:

```
gt_points = pandas.read_csv("../data/tokyo/tokyo_clean.csv")
```

Since we have read it with `pandas`, the table is loaded as a `DataFrame`, with no explicit spatial dimension:

```
type(gt_points)
```

```
pandas.core.frame.DataFrame
```

If we inspect the table, we find there is no `geometry` column:

```
gt_points.head()
```

|   | user_id | longitude | latitude | date_taken | photo/video_page_url | x | y |
|---|---------|-----------|----------|------------|----------------------|---|---|
| 0 | 10727420@N00 | 139.700499 | 35.674000 | 2010-04-09 17:26:25.0 | http://www.flickr.com/photos/10727420@N00/4545... | 1.555139e+07 | 4.255856e+06 |
| 1 | 8819274@N04 | 139.766521 | 35.709095 | 2007-02-10 16:08:40.0 | http://www.flickr.com/photos/8819274@N04/26503... | 1.555874e+07 | 4.260667e+06 |
| 2 | 62068690@N00 | 139.765632 | 35.694482 | 2008-12-21 15:45:31.0 | http://www.flickr.com/photos/62068690@N00/3125... | 1.555864e+07 | 4.258664e+06 |
| 3 | 49503094041@N01 | 139.784391 | 35.548589 | 2011-11-11 05:48:54.0 | http://www.flickr.com/photos/49503094041@N01/6... | 1.556073e+07 | 4.238684e+06 |
| 4 | 40443199@N00 | 139.768753 | 35.671521 | 2006-04-06 16:42:49.0 | http://www.flickr.com/photos/40443199@N00/2482... | 1.555899e+07 | 4.255517e+06 |

Many point datasets are provided in this format. To make the most of them, it is convenient to convert them into `GeoDataFrame` tables, which involves two steps.

First, we turn the coordinates into geometries:

```
pt_geoms = geopandas.points_from_xy(
    x=gt_points["longitude"],
    y=gt_points["latitude"],
    crs="EPSG:4326",
)
```

Second, we create a `GeoDataFrame` object:

```
gt_points = geopandas.GeoDataFrame(gt_points, geometry=pt_geoms)
```
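
Both steps can also be collapsed into a single expression. The following sketch (an equivalent alternative we add for illustration; `gt_points_alt` is a hypothetical name) builds the geometries inline while creating the table:

```
# Equivalent one-step construction of the geo-table
gt_points_alt = geopandas.GeoDataFrame(
    gt_points,
    geometry=geopandas.points_from_xy(
        gt_points["longitude"], gt_points["latitude"], crs="EPSG:4326"
    ),
)
```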

And now `gt_points` looks and feels exactly like the table of countries we saw before, with the difference that the `geometry` column stores `POINT` geometries:

```
gt_points.head()
```

|   | user_id | longitude | latitude | date_taken | photo/video_page_url | x | y | geometry |
|---|---------|-----------|----------|------------|----------------------|---|---|----------|
| 0 | 10727420@N00 | 139.700499 | 35.674000 | 2010-04-09 17:26:25.0 | http://www.flickr.com/photos/10727420@N00/4545... | 1.555139e+07 | 4.255856e+06 | POINT (139.70050 35.67400) |
| 1 | 8819274@N04 | 139.766521 | 35.709095 | 2007-02-10 16:08:40.0 | http://www.flickr.com/photos/8819274@N04/26503... | 1.555874e+07 | 4.260667e+06 | POINT (139.76652 35.70909) |
| 2 | 62068690@N00 | 139.765632 | 35.694482 | 2008-12-21 15:45:31.0 | http://www.flickr.com/photos/62068690@N00/3125... | 1.555864e+07 | 4.258664e+06 | POINT (139.76563 35.69448) |
| 3 | 49503094041@N01 | 139.784391 | 35.548589 | 2011-11-11 05:48:54.0 | http://www.flickr.com/photos/49503094041@N01/6... | 1.556073e+07 | 4.238684e+06 | POINT (139.78439 35.54859) |
| 4 | 40443199@N00 | 139.768753 | 35.671521 | 2006-04-06 16:42:49.0 | http://www.flickr.com/photos/40443199@N00/2482... | 1.555899e+07 | 4.255517e+06 | POINT (139.76875 35.67152) |

### Surfaces

Surfaces are used to record data from a field data model. In theory, a field is a continuous surface and thus has an infinite number of locations at which it could be measured. In reality, however, fields are measured at a finite sample of locations that, to provide a sense of continuity and better conform with the field model, are uniformly structured across space. Surfaces are thus represented as grids where each cell contains a sample. A grid can also be thought of as a table with rows and columns but, as we discussed in the previous chapter, both rows and columns are directly tied to geographic location. This is in sharp contrast with geographic tables, where geography is confined to a single column.

To explore how Python represents surfaces, we will use an extract, covering the Brazilian city of Sao Paulo, of a global population dataset. This dataset records population counts in cells of the same dimensions uniformly covering the surface of the Earth. Our extract is available as a GeoTIFF file, a variation of the TIFF image format that includes geographic information. We can use the `open_rasterio()` function from the `xarray` package to read in the GeoTIFF:

```
pop = xarray.open_rasterio("../data/ghsl/ghsl_sao_paulo.tif")
```

This reads the data into a `DataArray` object:

```
type(pop)
```

```
xarray.core.dataarray.DataArray
```

`xarray` is a package to work with multi-dimensional labeled arrays. Let’s unpack this: we can use arrays of not only two dimensions, as in a table with rows and columns, but an arbitrary number of them; each of these dimensions is “tracked” by an index that makes it easy and efficient to manipulate. In `xarray`, these indices are called coordinates, and they can be retrieved from our `DataArray` through the `coords` attribute:

```
pop.coords
```

```
Coordinates:
  * band     (band) int64 1
  * y        (y) float64 -2.822e+06 -2.822e+06 ... -2.926e+06 -2.926e+06
  * x        (x) float64 -4.482e+06 -4.482e+06 ... -4.365e+06 -4.365e+06
```

Interestingly, our surface has *three* dimensions: `x`, `y`, and `band`. The first two track the latitude and longitude that each cell in our population grid covers. The third one has a single value (1) and, in this context, is not very useful. But it is easy to imagine contexts where a third dimension would be useful. For example, an optical color image may have three bands: red, blue, and green. More powerful sensors may pick up additional bands, such as near-infrared (NIR) or even radio bands. Or a surface measured over time, like the geocubes that we discussed in Chapter 2, will have one band for each point in time at which the field is measured. A geographic surface will thus have two dimensions recording the location of cells (`x` and `y`), and at least one `band` that records other dimensions pertaining to our data.

An `xarray.DataArray` object also contains additional information about the values, stored under the `attrs` attribute:

```
pop.attrs
```

```
{'transform': (250.0, 0.0, -4482000.0, 0.0, -250.0, -2822000.0),
 'crs': '+proj=moll +lon_0=0 +x_0=0 +y_0=0 +datum=WGS84 +units=m +no_defs=True',
 'res': (250.0, 250.0),
 'is_tiled': 0,
 'nodatavals': (-200.0,),
 'scales': (1.0,),
 'offsets': (0.0,),
 'AREA_OR_POINT': 'Area',
 'grid_mapping': 'spatial_ref'}
```

In this case, we can see this includes the information required to convert pixels in the array into locations on the Earth’s surface (e.g., `transform` and `crs`), the resolution (250 meters by 250 meters), and other metadata that allows us to better understand where the data comes from and how it is stored.

Thus, our `DataArray` has three dimensions:

```
pop.shape
```

```
(1, 416, 468)
```

A common operation will be to reduce this to only the two geographic dimensions. We can do this with the `sel` operator, which allows us to select data by the value of their coordinates:

```
pop.sel(band=1)
```

```
<xarray.DataArray (y: 416, x: 468)>
[194688 values with dtype=float32]
Coordinates:
    band     int64 1
  * y        (y) float64 -2.822e+06 -2.822e+06 ... -2.926e+06 -2.926e+06
  * x        (x) float64 -4.482e+06 -4.482e+06 ... -4.365e+06 -4.365e+06
Attributes:
    transform:      (250.0, 0.0, -4482000.0, 0.0, -250.0, -2822000.0)
    crs:            +proj=moll +lon_0=0 +x_0=0 +y_0=0 +datum=WGS84 +units=m +...
    res:            (250.0, 250.0)
    is_tiled:       0
    nodatavals:     (-200.0,)
    scales:         (1.0,)
    offsets:        (0.0,)
    AREA_OR_POINT:  Area
    grid_mapping:   spatial_ref
```

The resulting object is thus a two-dimensional array. Similar to geographic tables, we can quickly plot the values in our dataset:

```
pop.sel(band=1).plot()
```

```
<matplotlib.collections.QuadMesh at 0x7f0112b751f0>
```

This gives us a first overview of the distribution of population in the Sao Paulo region. However, if we inspect the map further, we can see that it includes negative counts! How could this be? As it turns out, missing data are traditionally stored in surfaces not as a class of their own (e.g., `NaN`) but as an impossible value. If we return to the `attrs` printout above, we can see that the `nodatavals` attribute specifies missing data recorded as -200. With that in mind, we can use the `where()` method to select only values that are *not* -200:

```
pop.where(pop != -200)\
   .sel(band=1)\
   .plot(cmap="RdPu")
```

```
<matplotlib.collections.QuadMesh at 0x7f0111f3edc0>
```

The colorbar now looks more sensible, and indicates *real* counts, rather than including the missing data placeholder values.
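
With the placeholder masked, ordinary reductions also give sensible answers. A small sketch of our own, in case we only need aggregate figures:

```
# Mask the -200 placeholder before reducing; xarray reductions return
# zero-dimensional DataArrays, so float() extracts the plain number
masked = pop.where(pop != -200)
total_population = float(masked.sum())
largest_cell_count = float(masked.max())
```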

### Spatial Graphs

Spatial graphs store connections between objects through space. These connections may derive from geographical topology (e.g., contiguity), distance, or more sophisticated dimensions such as interaction flows (e.g., commuting, trade, communication). Compared to geographic tables and surfaces, spatial graphs are rather different. First, in most cases they do not record measurements about a given phenomenon, but instead focus on *connections*, on storing relationships between objects as they are facilitated (or impeded in their absence) by space. Second, because of this relational nature, the data are organized in a more unstructured fashion: while one sample may be connected to only one other sample, another one can display several links. This is in stark contrast to geographic tables and surfaces, both of which have a clearly defined structure, shape, and dimensionality in which data are organized. These particularities translate into a different set of Python data structures. Unlike the previous cases we have seen, there are several data structures to represent spatial graphs, each optimized for a different context. One such context is the integration of spatial connections into statistical methods such as exploratory data analysis or regression. For this, the most common data structure is the spatial weights matrix, to which we devote the next chapter.

In this chapter, we briefly review a different way of representing spatial graphs that is much closer to the mathematical concept of a graph. A graph is composed of *nodes* that are linked together by *edges*. In a spatial network, *nodes* may represent geographical places, and thus have a specific location; likewise, *edges* may represent geographical paths between these places. Networks require both *nodes* and *edges* to analyze their structure.

For illustration, we will rely on the `osmnx` library, which can query data from OpenStreetMap. For example, we extract the street-based graph of Yoyogi Park, near our earlier data from Tokyo:

```
%%time
graph = osmnx.graph_from_place("Yoyogi Park, Shibuya, Tokyo, Japan")
```

```
CPU times: user 918 ms, sys: 15.2 ms, total: 933 ms
Wall time: 21.1 s
```

The code snippet above sends the query to the OpenStreetMap server to fetch the data. Note that the cell above *requires* internet connectivity to work. If you are working on the book *without* connectivity, a cached version of the graph is available in the data folder and can be read as:

```
graph = osmnx.load_graphml("../data/cache/yoyogi_park_graph.graphml")
```

Once the data is returned to `osmnx`, it gets processed into the `graph` Python representation:

```
type(graph)
```

```
networkx.classes.multidigraph.MultiDiGraph
```

We can quickly inspect the structure of the graph with the `plot_graph` method:

```
osmnx.plot_graph(graph)
```

```
(<Figure size 576x576 with 1 Axes>, <AxesSubplot:>)
```

The resultant `graph` object is actually a `MultiDiGraph` from `networkx`, a graph library written in Python. The graph here is stored as a collection of 106 nodes (street intersections):

```
len(graph.nodes)
```

```
106
```

and 287 edges (streets) that connect them:

```
len(graph.edges)
```

```
287
```

Each of these elements can be queried to obtain more information such as the location and ID of a node:

```
graph.nodes[1520546819]
```

```
{'y': 35.6711267, 'x': 139.6925951, 'street_count': 4}
```

The characteristics of an edge:

```
graph.edges[(1520546819, 3010293622, 0)]
```

```
{'osmid': 138670840,
 'highway': 'footway',
 'oneway': False,
 'length': 59.113,
 'geometry': <shapely.geometry.linestring.LineString at 0x7f0111fad970>}
```

Or how the different components of the graph relate to each other. For example, what other nodes are directly connected to node `1520546819`?

```
list(graph.adj[1520546819].keys())
```

```
[3010293622, 5764960322, 1913626649, 1520546959]
```
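
Because `graph` is a standard `networkx` object, the whole toolbox of graph algorithms applies to it directly. As a sketch, here is a shortest-path query between two of the nodes seen above, weighted by segment length:

```
import networkx

# Sequence of node IDs along the shortest path (by metric length)
route = networkx.shortest_path(
    graph, source=1520546819, target=1913626649, weight="length"
)
```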

Thus, networks are easy to represent in Python, and are one of the three main data structures in geographic data science.

## Hybrids

We have just seen how geographic tables, surfaces, and networks map onto `GeoDataFrame`, `DataArray`, and `Graph` objects in Python, respectively. These represent the conventional pairings that align data models to data structures with Python representations. However, while the conventional pairings are well used, there are others in active use and many more yet to be developed. Interestingly, many new pairings are driven by new developments in technology, enabling approaches that were not possible in the past or creating situations (e.g., large datasets) that make the conventional approach limiting. Therefore, in this second section of the chapter, we step a bit “out of the box” to explore cases in which it may make sense to represent a dataset with a data structure that might not be the most obvious initial choice.

### Surfaces as tables

The first case we explore is treating surfaces as (geo-)tables. In this context, we shift from an approach where each dimension has a clear mapping to a spatial or temporal aspect of the dataset, to one where each sample (each cell of the surface or cube) is represented as a row in a table. This approach runs contrary to the general consensus that fields are best represented as surfaces or rasters, because those structures allow us to index space and time “by default” based on the location of values within the data structure. Shifting to a tabular structure implies either losing that space-time reference, or having to build it manually with auxiliary objects (e.g., a spatial graph). In almost any case, operating on this format is less efficient than it *could* be if we had bespoke algorithms built around surface structures. Finally, from a more conceptual point of view, treating pixels as independent realizations of a process that we *know* is continuous can be computationally inefficient and statistically flawed.

This perspective, however, also involves important benefits. First, sometimes we *don’t* need location for our particular application. Maybe we are interested in calculating overall descriptive statistics; or maybe we need to run an analysis that is entirely atomic, in the sense that it operates on each sample in isolation from all the others. Second, by “going tabular” we recast our specialized, spatial data into the most common data structure available, for which a large amount of commodity technology is built. This means many new tools can be used for analysis. So-called “big data” technologies, such as distributed systems, are much more common, robust, and tested for tabular data than for spatial surfaces. *If* we can translate our spatial challenge into a tabular challenge, we can immediately plug into technology that is more optimized and, in some cases, more reliable. Further, some analytic toolboxes common in (geographic) data science are entirely built around tabular structures. Machine learning packages such as `scikit-learn`, or some spatial analytics tools (such as most methods in `PySAL`), are designed around this data structure. Converting our surfaces into tables thus allows us to plug into a much wider suite of (potentially) efficient tools and techniques.

We will see two ways of going from surfaces to tables: one that converts every pixel into a table row, and another that aggregates pixels into predetermined polygons.

#### One pixel at a time

Technically, going from surface to table involves traversing from `xarray` to `pandas` objects. This is a well-established bridge. To illustrate it with an example, let’s revisit the population counts for Sao Paulo used earlier. We can read the surface into a `DataArray` object with the `open_rasterio()` function:

```
surface = xarray.open_rasterio("../data/ghsl/ghsl_sao_paulo.tif")
```

Transferring to a table is as simple as calling the `DataArray`’s `to_series()` method:

```
t_surface = surface.to_series()
```

The resulting object is a `pandas.Series` object, indexed on each of the dimensions of the original `DataArray`:

```
t_surface.head()
```

```
band  y           x         
1     -2822125.0  -4481875.0   -200.0
                  -4481625.0   -200.0
                  -4481375.0   -200.0
                  -4481125.0   -200.0
                  -4480875.0   -200.0
dtype: float32
```

At this point, everything we know about `pandas` and tabular data applies! For example, it might be more convenient to express it as a `DataFrame`:

```
t_surface = t_surface.reset_index().rename(columns={0: "Value"})
```

With the power of a tabular library, some queries and filter operations become much easier. For example, finding cells with more than 1,000 people can be done with the usual `query()` method^[Although, if all you want to do is this type of query, `xarray` is well equipped for this kind of task too.]:

```
t_surface.query("Value > 1000").info()
```

```
<class 'pandas.core.frame.DataFrame'>
Int64Index: 7734 entries, 3785 to 181296
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 band 7734 non-null int64
1 y 7734 non-null float64
2 x 7734 non-null float64
3 Value 7734 non-null float32
dtypes: float32(1), float64(2), int64(1)
memory usage: 271.9 KB
```
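
As the footnote suggests, the same filter can also be expressed without leaving `xarray`. A sketch for comparison:

```
# Staying in xarray: this keeps the full grid shape, replacing cells
# that fail the condition with NaN rather than dropping their rows
surface.where(surface > 1000)
```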

The table we have built has no geometries associated with it, only rows representing pixels. It takes a bit more effort, but it is possible to convert it, or a subset of it, into a fully-fledged geographic table, where each pixel includes the grid geometry it represents. For this task, we develop a simple function that takes a row from our table and the resolution of the surface, and returns its geometry:

```
def row2cell(row, res_xy):
    # Unpack the resolution of the surface in each dimension
    res_x, res_y = res_xy
    # Build the cell as a box spanning one resolution step
    # from the pixel's (x, y) coordinate
    minX = row["x"]
    maxX = row["x"] + res_x
    minY = row["y"] + res_y
    maxY = row["y"]
    poly = box(minX, minY, maxX, maxY)
    return poly
```

For example:

```
row2cell(t_surface.loc[0, :], surface.attrs["res"])
```

One of the benefits of this approach is that we do not require entirely filled surfaces and can record only the pixels where we have data. For the example above of cells with more than 1,000 people, we could create the associated geo-table as follows:

```
max_polys = t_surface.query("Value > 1000").apply(
    row2cell, res_xy=surface.attrs["res"], axis=1
)
max_polys = geopandas.GeoSeries(max_polys, crs=surface.attrs["crs"])
```

And generate a map with the same tooling that we use for any standard geo-table:

```
ax = max_polys.plot(edgecolor="red", figsize=(9, 9))
cx.add_basemap(
    ax, crs=surface.attrs["crs"], source=cx.providers.CartoDB.Voyager
)
```

Finally, once we have operated on the data as a table, we may want to return to a surface-like data structure. This involves taking the same journey in the opposite direction from how we started. The sister method of `to_series` in `xarray` is `from_series`:

```
new_da = xarray.DataArray.from_series(
    t_surface.set_index(["band", "y", "x"])["Value"]
)
new_da
```

```
<xarray.DataArray 'Value' (band: 1, y: 416, x: 468)>
array([[[-200., -200., -200., ..., -200., -200., -200.],
        [-200., -200., -200., ..., -200., -200., -200.],
        ...,
        [-200., -200., -200., ..., -200., -200., -200.]]], dtype=float32)
Coordinates:
  * band     (band) int64 1
  * y        (y) float64 -2.926e+06 -2.926e+06 ... -2.822e+06 -2.822e+06
  * x        (x) float64 -4.482e+06 -4.482e+06 ... -4.365e+06 -4.365e+06
```

#### Pixels to polygons

A second use case involves moving surfaces directly into geographic tables by aggregating pixels into pre-specified geometries. For this illustration, we will use a digital elevation model (DEM) surface containing elevation for the San Diego (US) region, and a set of Census tracts. As an example, we will investigate the average altitude of each neighborhood.

Let’s start by reading the data. First, the elevation model:

```
dem = xarray.open_rasterio("../data/nasadem/nasadem_sd.tif").sel(band=1)
dem.where(dem > 0).plot.imshow()
```

```
<matplotlib.image.AxesImage at 0x7f00fca95ac0>
```

And the neighborhood areas (tracts) from the Census:

```
sd_tracts = geopandas.read_file("../data/sandiego/sandiego_tracts.gpkg")
sd_tracts.plot()
```

```
<AxesSubplot:>
```

There are several approaches to compute the average altitude of each neighborhood. We will use `rioxarray` to clip the parts of the surface *within* a given set of geometries. Let’s start with a single polygon. For the illustration, we will use the largest tract, located on the eastern side of San Diego. We can find the ID of that polygon with:

```
largest_tract_id = sd_tracts.query(
    f"area_sqm == {sd_tracts['area_sqm'].max()}"
).index[0]
largest_tract_id
```

```
627
```
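
The same lookup can be written more directly with `pandas`’ `idxmax()` method, which returns the index label of the maximum value; this equivalent alternative is a sketch of ours rather than the book’s code:

```
# Index label of the tract with the largest area
largest_tract_id = sd_tracts["area_sqm"].idxmax()
```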

And then pull out the polygon itself for the illustration:

```
largest_tract = sd_tracts.loc[largest_tract_id, "geometry"]
```

Clipping the section of the DEM surface that falls within the polygon can be achieved with the `rioxarray` extension, which clips surfaces based on geometries:

```
dem_clip = dem.rio.clip(
    [largest_tract.__geo_interface__], crs=sd_tracts.crs
)
dem_clip.where(dem_clip > 0).plot()
```

```
<matplotlib.collections.QuadMesh at 0x7f00fc1d5d90>
```

Once we have elevation measurements for all the pixels within the tract, the average can be calculated with `mean()`:
```
dem_clip.where(dem_clip > 0).mean()
```

```
<xarray.DataArray ()>
array(585.11375946)
Coordinates:
    band         int64 1
    spatial_ref  int64 0
```

Now, to scale this up to the entire geo-table, there are several approaches, each with its own benefits and disadvantages. We opt for applying the method above to each row of the table. We define an auxiliary function that takes a row containing one of our tracts and returns its average elevation:

```
def get_mean_elevation(row):
    # Extract the geometry of the tract in GeoJSON-like form
    geom = row["geometry"].__geo_interface__
    # Clip the DEM to the tract boundaries
    section = dem.rio.clip([geom], crs=sd_tracts.crs)
    # Average the elevation of cells above sea level
    ele = float(section.where(section > 0).mean())
    return ele
```

Applied to the same tract, it returns the same average elevation:

```
get_mean_elevation(sd_tracts.loc[largest_tract_id, :])
```

```
585.1137594576915
```

This method can then be run on each polygon in our series using the `apply` method:

```
elevations = sd_tracts.head().apply(get_mean_elevation, axis=1)
elevations
```

```
0      7.144268
1     35.648492
2     53.711389
3     91.358777
4    187.311972
dtype: float64
```

This simple approach illustrates the main idea well: find the cells that pertain to a given geometry and summarize their values in some manner. This can be done with any kind of geometry. Further, this simple method plays well with `xarray` surface structures and is scalable, in that it is not too involved to run in parallel and distributed form with Dask, and it can be extended using arbitrary Python functions.

However, this approach can be quite slow on large datasets. A more efficient alternative for our example is the `rasterstats` library, a purpose-built library to construct so-called “zonal statistics” from surfaces. Here, the “zones” are the polygons and the “surface” is our `DataArray`. Generally, this library will be faster than the simpler approach used above, but may be more difficult to extend or adapt:

```
%%time
from rasterstats import zonal_stats

elevations2 = zonal_stats(
    sd_tracts.to_crs(dem.rio.crs),
    "../data/nasadem/nasadem_sd.tif",
)
elevations2 = pandas.DataFrame(elevations2)
```

```
CPU times: user 2.06 s, sys: 239 ms, total: 2.3 s
Wall time: 2.32 s
```

```
elevations2.head()
```

|   | min | max | mean | count |
|---|-----|-----|------|-------|
| 0 | -12.0 | 18.0 | 3.538397 | 3594 |
| 1 | -2.0 | 94.0 | 35.616395 | 5709 |
| 2 | -5.0 | 121.0 | 48.742630 | 10922 |
| 3 | 31.0 | 149.0 | 91.358777 | 4415 |
| 4 | -32.0 | 965.0 | 184.284941 | 701973 |

To visualize these results, we can make an elevation map:

```
f, axs = plt.subplots(1, 3, figsize=(15, 5))
# DEM, reprojected to the tracts' CRS, on the first axis
dem.where(dem > 0)\
   .rio.reproject(sd_tracts.crs)\
   .plot.imshow(ax=axs[0], add_colorbar=False)
# Tract boundaries on the second axis
sd_tracts.plot(ax=axs[1])
# Tracts colored by average elevation on the third axis
sd_tracts.assign(elevation=elevations2["mean"]).plot("elevation", ax=axs[2])
```

```
<AxesSubplot:>
```

### Tables as surfaces

The case for converting tables into surfaces is perhaps less controversial. This is an approach we can take in cases where we are interested in the *overall* distribution of objects (usually points) and we have so many of them that it is not only technically more efficient to represent them as a surface, but conceptually it is also easier to think about the points as uneven measurements from a continuous field. To illustrate this approach, we will use the dataset of Tokyo photographs we loaded above into `gt_points`.

From a purely technical perspective, for datasets with too many points, representing every point on the screen produces a seriously overcrowded plot:

```
gt_points.plot()
```

```
<AxesSubplot:>
```

In the image above, it is hard to tell anything about the density of points in the center of the image due to *overplotting*: while points *theoretically* have no width, they must have some in order for us to see them! As a result, point *markers* often plot on top of one another, obscuring the true pattern and density in dense areas. Converting the dataset from a geo-table into a surface involves laying out a grid and counting how many points fall within each cell. In one sense, this is the reverse of the operation we saw when computing zonal statistics in the previous section: instead of aggregating cells into objects, we aggregate objects into cells. Both operations, however, involve an aggregation that reduces the amount of information present in order to make the (new) data more manageable.

In Python, we can rely on the `datashader` library, which does all the computation in a very efficient way. This process involves two main steps. First, we set up the grid (or canvas, `cvs`) into which we want to aggregate points:

```
cvs = datashader.Canvas(plot_width=60, plot_height=60)
```

Then we “transfer” the points into the grid:

```
grid = cvs.points(gt_points, x="longitude", y="latitude")
```

The resulting `grid` is a standard `DataArray` object that we can then manipulate as we have seen before. When plotted below, the amount of detail that the re-sampled data allows for is much greater than when the points were visualized alone:

```
f, axs = plt.subplots(1, 2, figsize=(14, 6))
gt_points.plot(ax=axs[0])
grid.plot(ax=axs[1]);
```
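
Because photograph counts are heavily skewed, a log transform (a common visualization trick we add here as a sketch, not part of the original text) often reveals even more structure in the dense center than the raw counts do:

```
import numpy

# log(1 + count) compresses the long tail of very dense cells
numpy.log1p(grid).plot();
```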

### Networks as graphs *and* tables

In the previous chapter, we saw networks as data structures that store *connections* between objects. We also discussed how this broad definition includes many interpretations that focus on different aspects of the networks. While spatial analytics may use graphs to record the topology of a table of objects such as polygons, transport applications may treat the network representation of the street layout as a set of objects itself, in this case lines. In this final section, we show how one can flip back and forth between one representation and the other to take advantage of different aspects.

We start with the `graph` object from the previous section. Remember, this captures the street layout around Yoyogi Park in Tokyo. We have seen how, stored under this data structure, it is easy to query which node is connected to which, and which ones are at the ends of a given edge.

However, in some cases we may want to convert the graph into a structure that allows us to operate on each component of the network independently. For example, we may want to map streets, calculate segment lengths, or draw buffers around each intersection. These are all operations that do not require topological information, that are standard for geo-tables, and for which the graph structure is irrelevant. In this context, it makes sense to convert our `graph` to two geo-tables, one for intersections (the graph nodes) and one for street segments (the graph edges). In `osmnx`, we can do that with the built-in converter:

```
gt_intersections, gt_lines = osmnx.graph_to_gdfs(graph)
```

Now each of the resulting geo-tables behaves as a collection of geographic objects, because it is:

```
gt_intersections.head()
```

| osmid | y | x | street_count | highway | geometry |
|-------|---|---|--------------|---------|----------|
| 886196069 | 35.670087 | 139.694333 | 3 | NaN | POINT (139.69433 35.67009) |
| 886196073 | 35.669725 | 139.699508 | 3 | NaN | POINT (139.69951 35.66972) |
| 886196100 | 35.669442 | 139.699708 | 3 | NaN | POINT (139.69971 35.66944) |
| 886196106 | 35.670422 | 139.698564 | 4 | NaN | POINT (139.69856 35.67042) |
| 886196117 | 35.671256 | 139.697470 | 3 | NaN | POINT (139.69747 35.67126) |

```
gt_lines.info()
```

```
<class 'geopandas.geodataframe.GeoDataFrame'>
MultiIndex: 287 entries, (886196069, 1520546857, 0) to (7684088896, 3010293702, 0)
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype   
---  ------    --------------  -----   
 0   osmid     287 non-null    object  
 1   highway   287 non-null    object  
 2   oneway    287 non-null    bool    
 3   length    287 non-null    float64 
 4   geometry  287 non-null    geometry
 5   bridge    8 non-null      object  
 6   name      9 non-null      object  
 7   access    2 non-null      object  
dtypes: bool(1), float64(1), geometry(1), object(5)
memory usage: 26.9+ KB
```
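
With the edges now in a geo-table, ordinary table operations apply. For example, a quick sketch of ours that sums the `length` column (in meters, following OpenStreetMap conventions) to obtain the total length of the network:

```
# Total street length in the network, in meters
gt_lines["length"].sum()
```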

If we were in the opposite situation, where we had a set of street segments and their intersections in geo-table form, we could generate the graph representation with the `graph_from_gdfs` sister method:

```
new_graph = osmnx.graph_from_gdfs(gt_intersections, gt_lines)
```

The resulting object will behave in the same way as our original `graph`.
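
A quick sanity check, added here for illustration, is that the rebuilt graph should have the same number of nodes and edges as the original:

```
# Should match the (106, 287) counts of the original graph
len(new_graph.nodes), len(new_graph.edges)
```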

## Conclusion

In conclusion, this chapter provides an overview of the mappings between data models, presented in Chapter 2, and the data structures that are common in Python. Beyond the data structures discussed here, the Python ecosystem is vast, deep, and ever-changing. Part of this is the ease with which you (yes, you!) can create your own representations to express different aspects of a problem at hand. However, by focusing on our shared representations and the interfaces between these representations, you can generally conduct any analysis you need. By creating unique, bespoke representations, your analysis might be more efficient, but you can also inadvertently isolate it from other developers and render useful tools inoperable. Therefore, a solid understanding of the basic data structures (the `GeoDataFrame`, `DataArray`, and `Graph`) will be sufficient to support nearly any analysis you need to conduct.

## Questions

1. One way to convert from `Multi-` type geometries into many individual geometries is to use the `explode()` method of a `GeoDataFrame`. Using the `explode()` method, can you find out how many islands are in Indonesia?
2. Using `osmnx`, are you able to extract the street graph for your hometown?
3. As you have seen with the `osmnx.graph_to_gdfs()` method, it is possible to convert a graph into its constituent nodes and edges. Graphs have many other kinds of non-geographical representations. Many of these are provided in `networkx` methods that start with `to_`. How many representations of graphs are currently supported?
4. Using `networkx.to_edgelist()`, what “extra” information does `osmnx` include when building the dataframe for edges?
5. Instead of computing the average elevation for each neighborhood in San Diego, can you answer the following queries?
    - What neighborhood (or neighborhoods) has *the highest average elevation*?
    - What neighborhood (or neighborhoods) has *the highest single point*?
    - Can you find the neighborhood (or neighborhoods) with *the largest elevation change*?