Start from a folder in which an R project has been created. This lets you work within the project and use relative paths.
In order to work from the project root, we will also make use of a package called here.
We will install it first:
install.packages("here")
library(here)
## here() starts at /Users/bvreede/Projects/fellowship/repro-r-workshop
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.1
## ✔ readr 2.1.2 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
First, we have to describe where the data is.
We can do this in two ways: a relative and an absolute way.
data_relative <- "../data/urban_population.csv"
print(data_relative)
## [1] "../data/urban_population.csv"
data_absolute <- here("data", "urban_population.csv")
print(data_absolute)
## [1] "/Users/bvreede/Projects/fellowship/repro-r-workshop/data/urban_population.csv"
We usually prefer a relative path, because it keeps the code transferable. Here, however, we are using a special tool: here() creates an absolute path when the code runs, but the code itself does not contain the absolute path’s information. So in this case, the absolute path is the one we will use!
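To see why a bare relative path can be fragile, here is a minimal sketch using only base R. The folder layout and file name (demo-project, scripts, example.csv) are made up for illustration and are not part of the workshop project; the point is only that the same relative path string resolves differently depending on the working directory.

```r
# Build a throwaway "project" in a temporary directory (hypothetical layout).
project <- file.path(tempdir(), "demo-project")
dir.create(file.path(project, "data"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(project, "scripts"), showWarnings = FALSE)
writeLines("x\n1", file.path(project, "data", "example.csv"))

relative_path <- file.path("..", "data", "example.csv")

# From the scripts folder, the relative path points at the data file ...
old_wd <- setwd(file.path(project, "scripts"))
found_from_scripts <- file.exists(relative_path)

# ... but from the project root, the same string points outside the project.
setwd(project)
found_from_root <- file.exists(relative_path)

# Restore the original working directory.
setwd(old_wd)

print(c(scripts = found_from_scripts, root = found_from_root))
```

A path built with here() avoids this, because it is always constructed from the project root rather than from the current working directory.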
Then we are going to load the data.
We will take a look at the Import Dataset option to get an idea about the function to use:
Environment pane > Import Dataset > From Text (readr).
This gives us the code we can use:
df <- read_csv(data_absolute)
## Rows: 13392 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): geo, continent, country, city_size
## dbl (4): code, year, population_in_cities, percentage_of_population
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Take a look at the different columns and check at this point that they have the right type!
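One way to make that check explicit is to state the expected column types when reading the file. The sketch below assumes readr 2.0 or later and reads a tiny made-up CSV from a string (wrapped in I()) instead of the workshop file; the column names mirror three of the real columns, but the values are invented.

```r
library(readr)

# Made-up CSV content, only for illustration.
csv_text <- "country,year,population_in_cities\nA,1950,1200\nB,2020,3400\n"

# Declaring col_types makes read_csv() complain if a column cannot be parsed
# as the stated type, and it also silences the column specification message.
example <- read_csv(
  I(csv_text),
  col_types = cols(
    country = col_character(),
    year = col_double(),
    population_in_cities = col_double()
  )
)

# Inspect the resulting type of every column.
sapply(example, class)
```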
We can do many things at this point.
For example, we can make a selection:
df_europe <- filter(df, continent == "EUROPE")
We can make a new column:
df_newcol <- mutate(df, population_in_1000 = population_in_cities / 1000)
We can even make categories:
df_categorized <- mutate(
  df,
  data_type = case_when(
    year < 2020 ~ "data",
    year >= 2020 ~ "prediction"
  )
)
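To see what case_when() produces, it can help to run the same rule on a handful of made-up years. The tibble toy below is hypothetical and stands in for df; each condition is checked in order, and the first one that is TRUE decides the label for that row.

```r
library(dplyr)

# Hypothetical toy data standing in for the workshop data frame.
toy <- tibble(year = c(1950, 2019, 2020, 2035))

toy_categorized <- mutate(
  toy,
  data_type = case_when(
    year < 2020 ~ "data",
    year >= 2020 ~ "prediction"
  )
)

print(toy_categorized)
```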
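We can summarize the data, but for this, let’s look at a neat feature: the pipe (|>). It passes the object on its left as the first argument to the function on its right, so the two lines below do the same thing: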
df_filtered1 <- df |> filter(year < 2020)
df_filtered2 <- filter(df, year < 2020)
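A quick way to convince yourself that the piped and the nested form give the same result is to compare them directly. This sketch uses a small made-up tibble (toy) rather than the workshop data, and assumes R 4.1 or later for the native pipe.

```r
library(dplyr)

# Hypothetical toy data, only for illustration.
toy <- tibble(
  year = c(1990, 2010, 2030),
  population_in_cities = c(100, 200, 300)
)

# Piped form: toy becomes the first argument of filter().
piped <- toy |> filter(year < 2020)

# Classic form: toy is passed explicitly.
nested <- filter(toy, year < 2020)

identical(piped, nested)
```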
Let’s use this trick:
df_grouped <- df |>
  group_by(city_size) |>
  summarise(population = sum(population_in_cities))
We can add things, like a separation per continent:
df_grouped <- df |>
  group_by(continent, city_size) |>
  summarise(population = sum(population_in_cities))
## `summarise()` has grouped output by 'continent'. You can override using the
## `.groups` argument.
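The message about grouped output appears because summarise() removes only the last grouping variable, so the result is still grouped by continent. If that is not what you want, the .groups argument of summarise() lets you say so explicitly. A minimal sketch on made-up data (the tibble toy is hypothetical):

```r
library(dplyr)

# Hypothetical toy data, only for illustration.
toy <- tibble(
  continent = c("A", "A", "B", "B"),
  city_size = c("small", "large", "small", "small"),
  population_in_cities = c(1, 2, 3, 4)
)

# .groups = "drop" returns an ungrouped tibble and silences the message.
toy_grouped <- toy |>
  group_by(continent, city_size) |>
  summarise(population = sum(population_in_cities), .groups = "drop")

is_grouped_df(toy_grouped)
```

Dropping the groups is a reasonable default when the summary is the end of the calculation; keep them only if a later step should still work per continent.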
Or even an additional calculation:
df_grouped <- df |>
  group_by(continent, city_size) |>
  summarise(
    population = sum(population_in_cities),
    average_percentage = mean(percentage_of_population)
  )
## `summarise()` has grouped output by 'continent'. You can override using the
## `.groups` argument.
And with the pipe structure, we can add filters to this workflow:
df_grouped <- df |>
  filter(year == 2020) |>
  group_by(continent, city_size) |>
  summarise(
    population = sum(population_in_cities),
    average_percentage = mean(percentage_of_population)
  )
## `summarise()` has grouped output by 'continent'. You can override using the
## `.groups` argument.
df_grouped |>
  ggplot(aes(x = city_size,
             y = average_percentage,
             fill = continent)) +
  geom_bar(position = position_dodge(),
           stat = "identity")
df |>
  filter(year == 1950) |>
  group_by(continent, city_size) |>
  summarise(average_percentage = mean(percentage_of_population)) |>
  mutate(city_size = fct_relevel(city_size, c("small", "medium", "large", "very large"))) |>
  ggplot(aes(x = city_size,
             y = average_percentage,
             fill = continent)) +
  geom_bar(position = position_dodge(),
           stat = "identity") +
  labs(title = "Average percentage of population in cities in 1950")
## `summarise()` has grouped output by 'continent'. You can override using the
## `.groups` argument.
df |>
  filter(year == 2020) |>
  group_by(continent, city_size) |>
  summarise(average_percentage = mean(percentage_of_population)) |>
  mutate(city_size = fct_relevel(city_size, c("small", "medium", "large", "very large"))) |>
  ggplot(aes(x = city_size,
             y = average_percentage,
             fill = continent)) +
  geom_bar(position = position_dodge(),
           stat = "identity") +
  labs(title = "Average percentage of population in cities in 2020")
## `summarise()` has grouped output by 'continent'. You can override using the
## `.groups` argument.
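The fct_relevel() step in the two plots above controls the order of the bars: without it, the categories are ordered alphabetically. A minimal sketch on a made-up factor (sizes is hypothetical) shows the effect on the factor levels:

```r
library(forcats)

# Hypothetical factor with the same four categories as city_size.
sizes <- factor(c("large", "small", "very large", "medium"))

# By default, factor levels are sorted alphabetically.
levels(sizes)

# fct_relevel() moves the named levels to the front, in the order given.
sizes_ordered <- fct_relevel(sizes, "small", "medium", "large", "very large")
levels(sizes_ordered)
```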
library(viridis)
## Loading required package: viridisLite
df |>
  mutate(city_size = fct_relevel(city_size, c("small", "medium", "large", "very large"))) |>
  ggplot(aes(x = year,
             y = percentage_of_population,
             color = city_size)) +
  scale_color_viridis(discrete = TRUE, direction = -1) +
  geom_point(alpha = 0.1) +
  geom_smooth(method = "glm") +
  facet_wrap("continent", ncol = 2) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
  labs(
    title = "Change in city-dwelling populations on different continents",
    y = "Percentage of population living in cities (per type)",
    color = "Size of cities"
  )
## `geom_smooth()` using formula 'y ~ x'
Finally, let’s save these results:
ggsave("../results/city-dwellers-over-time.png")
## Saving 7 x 5 in image
## `geom_smooth()` using formula 'y ~ x'
# or
ggsave(here("results", "city-dwellers-with-here.png"))
## Saving 7 x 5 in image
## `geom_smooth()` using formula 'y ~ x'
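By default, ggsave() saves the last plot that was displayed and takes its size from the current graphics device, which is where the "Saving 7 x 5 in image" message comes from. For a more reproducible result, you can pass the plot object and the dimensions explicitly. The sketch below uses made-up data and a temporary file; toy, toy_plot and out_file are hypothetical names, and in the workshop project the file name could instead be built with here().

```r
library(ggplot2)

# Hypothetical toy data, only for illustration.
toy <- data.frame(
  year = c(1950, 1985, 2020),
  percentage = c(10, 20, 30)
)

# Store the plot in an object instead of relying on "the last plot".
toy_plot <- ggplot(toy, aes(x = year, y = percentage)) +
  geom_point()

# A temporary file stands in for a path inside the project's results folder.
out_file <- tempfile(fileext = ".png")

# Explicit plot, size, and resolution: the output no longer depends on
# what was plotted last or on the size of the plotting window.
ggsave(out_file, plot = toy_plot, width = 7, height = 5, units = "in", dpi = 300)

file.exists(out_file)
```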