3.3 Tidyverse 2/2: reshaping data
Goal: Understand when reshaping a dataframe is necessary and how to do it using tidyverse functions gather()
and spread()
3.3.1 Why tidy data
So far we’ve used two datasets: irisz
and msleep
. Even though there is some missing data in these datasets, for the most part, they are quite neat. Notice how each line represents a single observation and each column represents a single variable. There is no redundancy in the rows or columns. Sometimes datasets will have columns with columns that mean the same thing but apply to only certain subsets of the data. Columns like this can be collapsed.
On the other hand, sometimes data will be recorded together in the same column, when they are in fact two different types of data. Imagine if conservation status in the msleep
dataset were appended to the name. If we wanted to perform an analysis based on conservation status, we would need to tidy that up by making sure it’s in a separate columns.
3.3.2 How to tidy data
The reality is that datasets don’t always come in neat format. Before and during data analysis, we will often need to “tidy” datasets. tidyverse has 4 important functions for tidying data:
gather()
: change data from a wide to long format - i.e. reduce extra columnsspread()
: change data from a long to wide format - i.e. expand a column into multiple columnsseparate()
: spearate a column into 2 or more columns based on a pattern or delimeterunite()
: unite multiple columns into 1
In lieu of re-inventing the wheel, here’s an excellent explanation and illustration of the concept of tidy data and how to apply these functions to make data tidy.
Read this excerpt from Data Science in R by Garret Grolemand https://garrettgman.github.io/tidying/
For the asynchronous portion, let’s apply these principles and functions to a new dataset. I’ll set up a few example datasets based on the original. Answer the questions starting from the appropriate example dataset (e.g. df_1
, df_2
, df_3
).
3.3.3 Data exploration with tidyverse
The dataset: New York Times COVID-19 cases
Even though we’re just using this dataset to learn about tidyverse, it’s always important to read about the dataset. Downloading data can be difficult and we don’t want to dive into that if the data is not what we want. So always read about it first! Read the introduction on Github. Based on the description of the dataset, try to answer these questions: (no need to look at the actual data, the description should be sufficient)
- Why was this data collected?
- When was the data last updated?
- What kind of data is it? Is it genetic data, imaging data, text data? Is it descriptive, numeric, categorical?
- How was the data collected? One or many sources?
- Are there imperfections in the data? Limitations? E.g. missing data, different data meanings?
We’ll read in the data straight from github. This is a lot of data (every county in the U.S.) so to make it more manageable downstream, we’ll filter the dataset to only include:
- A few states: California, Washington, Illinois, Nebraska, and Florida. Feel free to add your own state!
- The last total count of cases and deaths
## ── Attaching packages ───────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✔ ggplot2 3.3.2 ✔ purrr 0.3.4
## ✔ tibble 3.0.4 ✔ dplyr 1.0.2
## ✔ tidyr 1.1.2 ✔ stringr 1.4.0
## ✔ readr 1.4.0 ✔ forcats 0.5.0
## ── Conflicts ──────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
data_url = "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv"
favorite_states = c("California", "Washington", "Illinois", "Nebraska", "Florida")
df <- read.csv(data_url,
header = T) %>%
filter(state %in% favorite_states) %>%
mutate(date = as.Date(date,format="%Y-%m-%d")) %>%
filter(date == max(date))
Test your knowledge! Use functions we learned in the last section to answer these questions:
- What is the last date for recorded cases? Is it the same for all states/counties?
- Are all of the states listed in
favorite_states
in the dataset? - How many counties are there for each of the states?
3.3.3.1 Example datasets
Use the following three datasets to complete the coding exercises in today’s worksheet. The datasets are based on the New York Times Data, which is loaded above.
Example Dataset #1:df_1
Three columns: county, cases, deaths
df_1 <- df %>%
tidyr::gather(key = "metric", value = "count", c(cases, deaths)) %>%
arrange(state,county)
head(df_1)
## date county state fips metric count
## 1 2021-06-13 Alameda California 6001 cases 89408
## 2 2021-06-13 Alameda California 6001 deaths 1719
## 3 2021-06-13 Alpine California 6003 cases 89
## 4 2021-06-13 Alpine California 6003 deaths 0
## 5 2021-06-13 Amador California 6005 cases 3720
## 6 2021-06-13 Amador California 6005 deaths 47
Example Dataset #2:df_2
Five columns: date, fips, metric corresponding to count, count, location (county, state)
df_2 <- df %>%
tidyr::gather(key = "metric",
value = "count",
c(cases, deaths)) %>%
mutate(county_state_fips = paste0(county,",", state, ",", fips)) %>%
select(-county,-state)
head(df_2)
## date fips metric count county_state_fips
## 1 2021-06-13 6001 cases 89408 Alameda,California,6001
## 2 2021-06-13 6003 cases 89 Alpine,California,6003
## 3 2021-06-13 6005 cases 3720 Amador,California,6005
## 4 2021-06-13 6007 cases 12576 Butte,California,6007
## 5 2021-06-13 6009 cases 2195 Calaveras,California,6009
## 6 2021-06-13 6011 cases 2276 Colusa,California,6011
Example Dataset #3:df_3
A column with the county name, followed by columns for each state with the number of cases for that state/county pair.
df_3 <- df %>%
select(county,state,cases) %>%
tidyr::spread(key = "state", value = "cases")
head(df_3)
## county California Florida Illinois Nebraska Washington
## 1 Adams NA NA 8654 3181 2212
## 2 Alachua NA 25293 NA NA NA
## 3 Alameda 89408 NA NA NA NA
## 4 Alexander NA NA 473 NA NA
## 5 Alpine 89 NA NA NA NA
## 6 Amador 3720 NA NA NA NA
Use these three datasets to complete the coding exercises in today’s worksheet
Licensed Creative Commons Attribution-NonCommercial-ShareAlike 4.0