Until recently, most data science books didn't have a section on geospatial data. It was considered a specialist form of research best left to GIS technicians who tended to use proprietary (and overpriced) tools like ArcGIS. This has changed significantly in the past five years, but you'll still be hard-pressed to find an introduction to the subject that strays very far from a few simple data sets (mostly of the USA) and relatively uncomplicated geospatial operations. I first began learning R back in 2013, just as open source geospatial research tools were being developed with quite a lot more energy, and geospatial data is my personal favourite data science playground, so in this book we're going to go much deeper than is usual. There are also good reasons to take things a few steps further given the particular forms of data and inquiry that religion draws us into.
Working with geospatial data is, in its most basic form, working with maps. This means that most of your data can be held in quite a simple dataframe, e.g. just a list of names or categories associated with sets of X and Y coordinates. Once you have a set of items, however, things get interesting very quickly, as you can layer data sets on top of one another. We're going to begin this chapter by developing a geolocated data set of churches in the UK. This information is readily and freely available online thanks to the UK Ordnance Survey, a quasi-governmental agency which maintains the various (now digital) maps of Britain. Lucky for us, the Ordnance Survey has an open data product that anyone can use.
Before we begin, there are some key things we should note about geospatial data. Geospatial data tends to fall into one of two kinds: points and polygons. Points can be any kind of feature: a house, a church, a pub, someone's favourite ancient oak tree, or some kind of sacred relic. Polygons tend to be associated with wider areas, and as such can be used to describe large features, e.g. an ocean, a local authority, or a mountain, or demographic units, like a Census Output Area with associated census summaries. Points are very simple data representations: just an X and a Y coordinate. Polygons are much more complex, often containing dozens or even thousands of points. To be fair, if you zoom in far enough, every point *should* really become a polygon, as even a small building has a shape and a footprint, but a point is a much simpler way of representing the information and is therefore quite popular.
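To make that distinction concrete, here is a minimal sketch using the `sf` package; the coordinates are invented purely for illustration:

```{R}
library(sf)

# A point is a single X/Y coordinate pair...
chapel <- st_point(c(-2.3590, 51.3811))

# ...while a polygon is a closed ring of coordinate pairs
# (note that the first and last pair are identical, closing the shape):
chapel_footprint <- st_polygon(list(rbind(
  c(-2.3592, 51.3810),
  c(-2.3588, 51.3810),
  c(-2.3588, 51.3813),
  c(-2.3592, 51.3813),
  c(-2.3592, 51.3810)
)))
```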
The most complex aspect of point data relates to the way coordinates are encoded, as they will always need to be associated with a coordinate reference system (CRS) which describes how they are situated with respect to the planet earth. The most common CRS is WGS84, though for our data sets we'll also come into contact with the British National Grid (BNG, also known as OSGB36), a specifically British coordinate reference system. There are many different CRS, usually tied to a specific geographical region. Bearing in mind the way that you need a CRS to understand how coordinates can be associated with specific parts of the earth, you can see how this is a bit like survey data, where you also need a "codebook" to understand what the specific response values map onto, e.g. a "1" means "strongly agree" and so on. We also saw, in a previous chapter, how some forms of data have the codebook already baked into the factor data, simplifying the process of interpreting the data. In a similar way, some types of geospatial data sets come with the CRS "baked in", while for others we'll need to define the CRS ourselves (we'll sketch what this looks like in code just after the list below). Here are some of the most common types of geospatial data files:
- CSV: "comma separated values", a plain text file in which coordinates are simply stored as ordinary columns
- Shapefile: a legacy file format, often still in use, but being replaced by others for a variety of good reasons. For more on this see <http://switchfromshapefile.org/>
- GeoPackage: one of the more recent ways of packaging up geospatial data. GeoPackages can contain a wide variety of different data layers and are easily portable.
- GeoJSON: a file format based on JSON (an acronym for JavaScript Object Notation), which is widely used in other forms of coding and is meant to be easily interchangeable across various platforms. GeoJSON augments JSON data with coordinates added in.
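To give a feel for how this works in practice, here is a hedged sketch using the `sf` package; the file path is a placeholder rather than one of the actual files we'll use below:

```{R}
library(sf)

# st_read() understands all of the formats listed above, picking an
# appropriate driver based on the file extension:
boundaries <- st_read("data/example_boundaries.gpkg")  # placeholder path

# Check which CRS (if any) came "baked in" with the file:
st_crs(boundaries)

# Reproject between coordinate reference systems, e.g. from the
# British National Grid (EPSG:27700) to WGS84 (EPSG:4326):
boundaries_wgs84 <- st_transform(boundaries, crs = 4326)
```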
A good starting point is to acquire some "administrative" data. This is a way of referring to political boundaries, whether country borders or those of a local authority or some other "administrative" unit. For our purposes, we're going to import several different types of administrative boundary which will be used at different points in our visualisations below. It's worth noting that the data we use here was prepared to support the 2011 census and makes use of the shapefile format.
[You can read more about the different UK administrative boundaries, ["here"](https://statistics.ukdataservice.ac.uk/dataset/2011-census-geography-boundaries-local-authorities).]{.aside}
Before we move on, let's plot a simple map and have a look at one of our administrative layers. We can use ggplot with a new geom, `geom_sf()`, to plot the contents of a geospatial data file containing polygons, which is loaded into R as a simple features (`sf`) object.
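As a rough sketch (assuming the boundary file has been read in with `sf::st_read()` and stored as an object called `uk`), the ggplot version needs nothing more than `geom_sf()`:

```{R}
library(ggplot2)

# ggplot picks up the geometry column of the sf object automatically:
ggplot(uk) +
  geom_sf()
```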
It's worth noting that the way you load geospatial data in R has changed quite dramatically in recent years with the move to the simple features (`sf`) class and the retirement of the older packages it replaced. Much of the documentation you will come across "out there" still makes reference to a set of functions which are no longer used and are worth avoiding.
We could go further with `ggplot`, but for this chapter we're going to primarily use a package called `tmap`, which works a lot like `ggplot` but is much better adapted for geospatial data. As you'll see, `tmap` also works by adding layers of data and visual instructions one at a time. So we might begin with `tm_shape(uk)` instead of `ggplot(uk) + geom_sf()`. Whereas `ggplot` asks us to define both the raw data and the shapes to use, `tmap` makes some assumptions about the shapes.
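In outline, and again assuming our boundaries are stored in `uk`, the contrast looks something like this:

```{R}
library(tmap)

# ggplot: declare the data and the geometry layer explicitly
# ggplot(uk) + geom_sf()

# tmap: declare the layer group with tm_shape(), then style it;
# tmap works out for itself that these are polygons
tm_shape(uk) +
  tm_borders()
```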
In the example above, shown in @figure-tmap1a, you can see we've just added a polygon with a border. We can do something similar with point data and dots, as shown in @figure-tmap1b:
From either of these basic starting points (or both), we stack on additional instructions, defining the different visual attributes or aesthetics, just like in `ggplot`. If you want to fill polygons with colour, add `tm_fill`, and if you want to adjust the default lines on your polygons, define this with `tm_borders`, like we have in @figure-tmap2 below with an alpha and line width (`lwd`) instruction. We can also add more shapes on top with an additional `tm_shape` instruction and a follow-on `tm_borders` instruction. To add a bit of flourish, you can drop on a scale bar (`tm_scalebar`) or share licensing information with prospective readers and add a figure label or title.
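A layered sketch along these lines (not the exact figure code) might look as follows; the object names assume the `uk` and `local_authorities` boundaries loaded earlier, and the credit text is simply a placeholder:

```{R}
tm_shape(uk) +
  tm_fill("lightgrey", fill_alpha = 0.5) +  # fill_alpha is the tmap v4 spelling; older versions used alpha
  tm_borders(lwd = 0.5) +
  tm_shape(local_authorities) +
  tm_borders(lwd = 0.2) +
  tm_scalebar() +
  tm_credits("Licensing information goes here") +
  tm_title("UK administrative boundaries")
```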
Our next step here will be to add all the churches to our map, but there's a problem we need to address first: there are a lot of churches in that dataset. As you may have noticed in @figure-churches above, there are so many dots that some parts of the map are just blocks of grey. Let's have a look at how things are with `tmap`:
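A sketch of that first attempt, assuming the church points are stored in `os_openmap_pow` and share a CRS with our boundaries:

```{R}
# All of the places of worship drawn as simple dots over the boundaries:
tm_shape(uk) +
  tm_borders() +
  tm_shape(os_openmap_pow) +
  tm_dots()
```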
You'll recall that in previous chapters we ran some experiments modifying scatterplots with a similar problem (i.e. the "too many dots" problem). One solution was to add a bit of "alpha" (transparency) or colour, which we can also do with `tmap`:
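The same sketch with mostly transparent dots, so that dense clusters read as darker patches:

```{R}
tm_shape(uk) +
  tm_borders() +
  tm_shape(os_openmap_pow) +
  # fill_alpha is the tmap v4 spelling; earlier versions of tmap used alpha
  tm_dots(fill_alpha = 0.1)
```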
That's about as good as we can get visualising points as dots with a dataset this large, I think. You can get a sense of just how large the dataset is with a quick length calculation on any of our columns with consistent data in them:
```{R}
length(os_openmap_pow$classification)
```
At nearly 50k points, we're going to need to find an alternative if we want to help someone visualise this data clearly. Lucky for us, we can use R to do some computation for us towards a different kind of map called a choropleth map. You've probably already seen many of these before without realising what they're called. Think of it as a kind of heatmap, like we used with our scatterplot before, except in this case the shapes being coloured in come from a set of polygons we specify. Our administrative boundary data is perfect for this kind of use.
The `sf` library has a host of tools for geospatial data analysis, including the `st_covers()` function, which identifies which points (or shapes) from one dataset are located inside the polygons of another. I'll walk you through what we've done above. First, we want to add a new column with our totals to the administrative shapes. I've used `lengths()` to fill this column with a simple count, derived from a calculation (using `st_covers`) of how many points (from `os_openmap_pow`) are inside each polygon in that `uk_countries` dataset. Sometimes it's nice to have percentages close to hand, so I've added another column for this, `churches_percent`, using the very handy `prop.table` command. We can do the same thing for any set of polygons, including our `local_authorities` data:
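Pulling those steps together, the whole recipe might look something like the sketch below; the `churches_count` column name is my own shorthand, and both layers need to share the same CRS for `st_covers()` to work:

```{R}
library(sf)

# For each polygon, st_covers() returns the indices of the church points
# that fall inside it; lengths() turns those index lists into counts:
uk_countries$churches_count <-
  lengths(st_covers(uk_countries, os_openmap_pow))

# prop.table() converts the counts into proportions of the overall total:
uk_countries$churches_percent <-
  prop.table(uk_countries$churches_count) * 100

# The same recipe works for any set of polygons:
local_authorities$churches_count <-
  lengths(st_covers(local_authorities, os_openmap_pow))
local_authorities$churches_percent <-
  prop.table(local_authorities$churches_count) * 100
```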
Now let's visualise this data using tmap, which (now that we have that new column) we can achieve with `tm_fill`, specifying the name of our new column:
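In sketch form, using the `churches_count` column name assumed above:

```{R}
# Shade each polygon by the number of churches it contains:
tm_shape(uk_countries) +
  tm_fill("churches_count") +
  tm_borders()
```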
There are some issues here with normalising data, just as we've observed in previous chapters. However, normalising geospatial data is a bit more complex. For this section, it's worth asking: should we assume that the frequency of church buildings tracks the distribution of population? Probably not, but even if we did, when should we draw that population data from, given that many of these buildings were erected more than a century ago? A bit further down, we'll explore some potential ways to think about normalising building counts using a few different examples.
[You can read more about the various customisations available using tm_fill, ["here"](https://r-tmap.github.io/tmap/reference/tm_polygons.html).]{.aside}
We can do the same for our more granular local authorities data, and this is already a bit more comprehensible, showing not just the concentration of churches in England as against Wales and Scotland, but also some specific high-population regions:
If we're looking for visual outliers, e.g. places where there are more or fewer of a feature than we might expect, we need to think carefully about the baseline we're using to set that expectation. One option is to use a polygon that is already adjusted for population, like UK Census Output Areas, which are designed to have roughly the same population (and so are larger or smaller depending on population density).
If we want to keep the same polygons, we can also normalise using population data. This will require us to import some additional data into our dataframe, as the administrative polygon shapefiles provided by the UK government don't include population figures.
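A hedged sketch of what that join might look like; the file name, the join key (`lad11cd`), and the `population` column are all placeholders standing in for whatever population table you source:

```{R}
library(dplyr)

# Hypothetical population table with one row per local authority:
pop <- read.csv("data/local_authority_population.csv")

local_authorities <- local_authorities |>
  left_join(pop, by = "lad11cd") |>
  mutate(churches_per_10k = churches_count / population * 10000)

# Shade by churches per 10,000 residents rather than raw counts:
tm_shape(local_authorities) +
  tm_fill("churches_per_10k") +
  tm_borders()
```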
But I wonder if it's more interesting to compare this type of building, e.g. a place of worship, to other kinds of buildings, like grocery stores or pubs. Let's draw in some data we can work with here:
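Purely as an illustration of the idea (the file name here is a placeholder, not a real OS product path), the same `st_covers()` recipe would give us a second building type to compare against:

```{R}
# Hypothetical point layer of pubs, read in the same way as the churches:
os_openmap_pubs <- st_read("data/os_openmap_pubs.gpkg")

local_authorities$pubs_count <-
  lengths(st_covers(local_authorities, os_openmap_pubs))

# Ratio of places of worship to pubs in each local authority:
local_authorities$churches_per_pub <-
  local_authorities$churches_count / local_authorities$pubs_count
```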
- If you'd like to do an even deeper dive into geospatial operations in R, we recommend another open access textbook, [Geocomputation with R](https://r.geocompx.org/), which has a [dedicated chapter on tmap](https://r.geocompx.org/adv-map#adv-map).