From ba66c83a060af0e3b4fa2c1ff9c30663f24956cd Mon Sep 17 00:00:00 2001 From: Jeremy Kidwell Date: Thu, 5 Oct 2023 14:42:06 +0100 Subject: [PATCH] updated ch1 --- docs/chapter_1.html | 46 +++++++++++++++++++++------ docs/chapter_2.html | 6 ++-- docs/search.json | 11 +++++-- hacking_religion/_book/chapter_1.html | 46 +++++++++++++++++++++------ hacking_religion/chapter_1.qmd | 17 +++++++++- 5 files changed, 102 insertions(+), 24 deletions(-) diff --git a/docs/chapter_1.html b/docs/chapter_1.html index edce4f2..2a5fc6f 100644 --- a/docs/chapter_1.html +++ b/docs/chapter_1.html @@ -219,7 +219,8 @@ div.csl-indent {
  • 2.4.2 GGPlot
  • 2.5 Is your chart accurate? Telling the truth in data science
  • -
  • 2.6 Multifactor Visualisation
  • +
  • 2.6 Making our script reproducible
  • +
  • 2.7 Multifactor Visualisation
  • References
  • @@ -578,7 +579,7 @@ i Use the conflicted package (<http://conflicted.r-lib.org/>) to force all
    2
    -We’ll re-order the column by size. +We’ll re-order the column by size.
    @@ -601,19 +602,19 @@ i Use the conflicted package (<http://conflicted.r-lib.org/>) to force all
    1
    -First, remove the column with region names and the totals for the regions as we want just integer data. +First, remove the column with region names and the totals for the regions as we want just integer data.
    2
    -Second calculate the totals. In this example we use the tidyverse library dplyr(), but you can also do this using base R with colsums() like this: uk_census_2021_religion_totals <- colSums(uk_census_2021_religion_totals, na.rm = TRUE). The downside with base R is that you’ll also need to convert the result into a dataframe for ggplot like this: uk_census_2021_religion_totals <- as.data.frame(uk_census_2021_religion_totals) +Second, calculate the totals. In this example we use the tidyverse package dplyr, but you can also do this using base R with colSums() like this: uk_census_2021_religion_totals <- colSums(uk_census_2021_religion_totals, na.rm = TRUE). The downside with base R is that you’ll also need to convert the result into a dataframe for ggplot like this: uk_census_2021_religion_totals <- as.data.frame(uk_census_2021_religion_totals)
    3
    -In order to visualise this data using ggplot, we need to shift this data from wide to long format. This is a quick job using gather() +In order to visualise this data using ggplot, we need to shift this data from wide to long format. This is a quick job using gather()
    4
    -Now plot it out and have a look! +Now plot it out and have a look!
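    The base R route mentioned in step 2 can be sketched on a small made-up table. This is only an illustration: `toy_census` and its figures are invented stand-ins for the census data, and the data frame is built with `data.frame()` so that the religion names land in their own column, which is a small variation on the `as.data.frame()` call above.

```r
# Toy stand-in for the census table: one row per region, one column per religion.
# These figures are invented for illustration and are NOT census values.
toy_census <- data.frame(
  christian   = c(120, 80, NA),
  muslim      = c(30, 45, 10),
  no_religion = c(90, 60, 25)
)

# colSums() returns a named numeric vector, not a data frame;
# na.rm = TRUE skips the missing value instead of returning NA
toy_totals <- colSums(toy_census, na.rm = TRUE)

# Build a long-format data frame (one row per religion) that ggplot can use
toy_totals_df <- data.frame(
  religion = names(toy_totals),
  total    = as.numeric(toy_totals)
)

print(toy_totals_df)
```

    Keeping the names in an explicit column means they can be mapped straight to the x axis, rather than being left behind as row names.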
    @@ -691,8 +692,12 @@ i Use the conflicted package (<http://conflicted.r-lib.org/>) to force all

    Change orientation of X axis labels + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

    Relabel fields Simplify y-axis labels Add percentage text to bars (or maybe save for next chapter?)

    -
    -

    2.6 Multifactor Visualisation

    +
    +

    2.6 Making our script reproducible

    +

    Let’s take a moment to review our hacker code. I’ve just spent some time addressing how we can be truthful in our data science work, but we haven’t yet said much about reproducibility.

    +
    +
    +

    2.7 Multifactor Visualisation

    One element of R data analysis that can get really interesting is working with multiple variables. Above we’ve looked at the breakdown of religious affiliation across the whole of England and Wales (Scotland operates an independent census), and by placing this data alongside a specific region, we’ve already made a basic entry into working with multiple variables, but this can get much more interesting. Adding an additional quantitative variable (also known as bivariate data) into the mix generates a lot more information, however, and we have to think about visualising it in ways that still communicate clearly in spite of the additional visual noise that is inevitable with enhanced complexity. Let’s have a look at the way that religion in England and Wales breaks down by ethnicity.

    library(nomisr)
    @@ -729,7 +734,30 @@ $ description.en <chr> "value", "percent"
    # Filter down to simplified dataset with England / Wales and percentages without totals uk_census_2011_religion_ethnicitity <- filter(uk_census_2011_religion_ethnicitity, GEOGRAPHY_NAME=="England and Wales" & C_RELPUK11_NAME != "All categories: Religion" & C_ETHPUK11_NAME != "All categories: Ethnic group") # Simplify data to only include general totals and omit subcategories -uk_census_2011_religion_ethnicitity <- uk_census_2011_religion_ethnicitity %>% filter(grepl('Total', C_ETHPUK11_NAME))
    +uk_census_2011_religion_ethnicitity <- uk_census_2011_religion_ethnicitity %>% filter(grepl('Total', C_ETHPUK11_NAME)) + +ggplot(uk_census_2011_religion_ethnicitity, aes(fill=C_ETHPUK11_NAME, x=C_RELPUK11_NAME, y=OBS_VALUE)) + geom_bar(position="dodge", stat ="identity", colour = "black") + scale_fill_brewer(palette = "Set1") + ggtitle("Religious Affiliation in the 2011 Census of England and Wales") + xlab("") + ylab("") + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
    +
    +

    +
    + +

    The trouble with using grouped bars here, as you can see, is that there are quite sharp disparities between groups, which make meaningful comparison hard. We could use logarithmic rather than linear scaling as an option, but this is hard for many general public audiences to appreciate without guidance. One alternative quick fix is to extract the data from “white” respondents, which can then be placed in a separate chart with a different scale.
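    To see what the logarithmic option looks like before we try the split, here is a minimal sketch. The `toy_counts` data frame and its numbers are invented for illustration and are not census figures; `scale_y_log10()` is the ggplot2 function that switches the y axis to a base-10 log scale.

```r
library(ggplot2)

# Hypothetical counts in which one group dwarfs the other (NOT census data)
toy_counts <- data.frame(
  religion  = rep(c("Christian", "Muslim", "No religion"), times = 2),
  ethnicity = rep(c("Group A", "Group B"), each = 3),
  count     = c(30000000, 200000, 13000000, 60000, 1800000, 40000)
)

# Same grouped bar chart as before, but with a log-scaled y axis
p_log <- ggplot(toy_counts, aes(fill = ethnicity, x = religion, y = count)) +
  geom_bar(position = "dodge", stat = "identity", colour = "black") +
  scale_y_log10() +
  xlab("") +
  ylab("Count (log scale)")

print(p_log)
```

    On a log axis each gridline step is a tenfold change, so the small bars become visible, but bar height is no longer proportional to the count. That is exactly the part that needs explaining to a general audience.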

    +
    +
    # Split the data into "White: Total" respondents and all other ethnic groups
    +uk_census_2011_religion_ethnicitity_white <- filter(uk_census_2011_religion_ethnicitity, C_ETHPUK11_NAME == "White: Total")
    +uk_census_2011_religion_ethnicitity_nonwhite <- filter(uk_census_2011_religion_ethnicitity, C_ETHPUK11_NAME != "White: Total")
    +
    +ggplot(uk_census_2011_religion_ethnicitity_nonwhite, aes(fill=C_ETHPUK11_NAME, x=C_RELPUK11_NAME, y=OBS_VALUE)) + geom_bar(position="dodge", stat ="identity", colour = "black") + scale_fill_brewer(palette = "Set1") + ggtitle("Religious Affiliation in the 2011 Census of England and Wales") + xlab("") + ylab("") + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
    +
    +

    +
    +
    +

    This still doesn’t quite render with as much visual clarity and communication as I’d like. For a better look, we can use a technique in R called “faceting” to create a series of small charts which can be viewed alongside one another.

    +
    +
    ggplot(uk_census_2011_religion_ethnicitity_nonwhite, aes(x=C_RELPUK11_NAME, y=OBS_VALUE)) + geom_bar(stat ="identity", colour = "black") + facet_wrap(~C_ETHPUK11_NAME, ncol = 2) + ggtitle("Religious Affiliation in the 2011 Census of England and Wales") + xlab("") + ylab("") + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
    +
    +

    +
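    One variation worth trying, sketched here on invented data rather than the census extract: `facet_wrap()` accepts a `scales` argument, and setting it to `"free_y"` gives each panel its own y axis, so a very large group need not flatten the others. The `toy_facets` data frame and its values are hypothetical.

```r
library(ggplot2)

# Hypothetical counts for two groups of very different size (NOT census data)
toy_facets <- data.frame(
  religion  = rep(c("Christian", "Muslim", "No religion"), times = 2),
  ethnicity = rep(c("Group A", "Group B"), each = 3),
  count     = c(30000000, 200000, 13000000, 60000, 1800000, 40000)
)

# One small chart per group; scales = "free_y" lets each panel pick its own y range
p_facets <- ggplot(toy_facets, aes(x = religion, y = count)) +
  geom_bar(stat = "identity", colour = "black") +
  facet_wrap(~ethnicity, ncol = 2, scales = "free_y") +
  xlab("") +
  ylab("") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))

print(p_facets)
```

    The trade-off is that bar heights can no longer be compared across panels by eye, so readers need to check each panel's axis labels.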