3.2 Tidyverse 1/2: functions & pipes
Goal: Understand the tidyverse framework of pipes and functions, as well as how to apply it to tidy data frame cleaning and manipulation.
3.2.1 Welcome to the tidyverse!
Tidyverse is a suite of R packages that are very useful for data science when we need to manipulate data frames and visualize data. The most important packages in tidyverse for our workshop will be dplyr and ggplot2. It builds on the base R functions and data types we’ve studied so far. It just provides a different design framework for working with data in R. The main two features of tidyverse are
- A new suite of very useful, intuitive functions
- The pipe
%>%
which will allow us to go send data from one function to the next without using intermediate steps or nesting functions together
3.2.1.1 Preview: new functions in tidyverse
Let’s say we have a dataframe, df
, which has the name and age (in years) of everyone in a class. We want to create a new variable of age in months. In base R, we would do it this way:
df$age_months <- df$age_years * 12
In tidyverse, we can use the function mutate()
to create a new variable. As with most tidyverse functions, mutate()
takes the dataset as the first argument and the operation as the second. Here the operation is to create the new variable.
mutate(df, age_months = age_years * 12)
As you will see with more complex examples, these functions can make dataset operations more readable, especially when strung together with a pipe.
3.2.1.2 Preview: the pipe in tidyverse
Recall the head()
function used in the previous section. If you have a data frame, df
, instead of peaking at the first few lines with head(df)
, you can pipe it into the head()
function like this: df %>% head()
. This might seem like more work than just calling the function around the data frame object, but when you need to run a bunch of functions on a data frame, it becomes much TIDIER!
In the previous example, we could call the function using a pipe like so:
df %>% mutate(age_months = age_years * 12)
Feeling motivated to accelerate data cleaning and analysis using the tidyverse?
3.2.2 Load tidyverse
Load the tidyverse library
If tidyverse is already installed, loading the library should not return an error message. It is ok if it says there are package conflicts - this happens when different packages that are loaded have functions with the same name (but probably do different things). Just take note of the conflicts. If you use those other packages/functions then you’ll have to specify which package the function you’re using comes from. For example if mutate()
was also in another package, to use the tidyverse version I would write tidyr::mutate()
instead of just mutate()
If loading the library returns an error message saying it’s not installed, then go to the SSRP Pre-installation guide and follow the instructions for installing tidyverse
3.2.3 How to pipe %>%
The tidyverse pipe can be used with most base R functions and with all tidyverse-specific functions, which we’ll learn more about soon. Although we’ll focus on using tidyverse pipe and functions on data frames, the pipe can also be used on vectors and even scalars.
3.2.3.0.1 Pipes with dataframes
Let’s read in the iris dataset used in the previous section. Peak at it with head()
to check that the columns look intact.
# Load the dataset and peak at it with the head() function
irisz <- read.csv("data/iris.csv",
na.strings = c("", "n/a"),
stringsAsFactors = TRUE)
head(irisz)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species Location
## 1 5.1 3.5 1.4 0.2 setosa Korea
## 2 4.9 3.0 1.4 0.2 setosa China
## 3 4.7 3.2 1.3 0.2 setosa Korea
## 4 4.6 3.1 1.5 0.2 setosa China
## 5 5.0 3.6 1.4 0.2 setosa China
## 6 5.4 NA 1.7 0.4 setosa Canada
Now let’s try this using the pipe %>%
.
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species Location
## 1 5.1 3.5 1.4 0.2 setosa Korea
## 2 4.9 3.0 1.4 0.2 setosa China
## 3 4.7 3.2 1.3 0.2 setosa Korea
## 4 4.6 3.1 1.5 0.2 setosa China
## 5 5.0 3.6 1.4 0.2 setosa China
## 6 5.4 NA 1.7 0.4 setosa Canada
Practice this with other functions we’ve used so far: summary()
, unique()
, str()
, dim()
, table()
, and even hist()
!. Do it with and without the pipe.
Pretty simple right?
Notice how the %>%
pipe is sending the output from before it to the function after it. This %>%
is similar to the |
we use on the command line to send data from one operation to the next: sort data.txt | uniq -c
(sort a data file and count the number of unique lines)
Pipes can string together functions
In the previous section, we learned how to read in a file with read.csv()
, get unique values with unique()
, and get the dimensions of a data frame with dim()
. Let’s try these operations using the pipe instead of performing them as separate operations.
## [1] 150 6
Notice how each operation is on one line so it’s easy to read the sequence of operations.
Pipes with scalars
To take the square root of a number x
and round it to y
digits, we would use the sqrt()
and round()
functions
Here’s what this looks like with and without the pipe:
## [1] 10.58
## [1] 10.58
## [1] 10.58
Pipes with vectors
We can apply the same piping method to vectors. Think about the vector that is the column Sepal.Length
in the dataframe irisz
which we have been using.
We’ll use pipes to string together the following operations starting with the vector irisz$Sepal.Length
: sort it, make the values numeric, and make a histogram. (These three functions were also introduced in the previous section.)
Hopefully by now you have a sense for how the pipe in tidyverse works and how it can work with functions we already know and can be applied to data frames, vectors, and scalars. Another big contribution tidyverse makes to the data science in R is the addition of new functions!
3.2.4 Essential tidyverse functions
Let’s start with the essential functions of tidyverse
. All of these are from the sub-package dplyr
select()
: subset columnsfilter()
: subset rows on conditionsarrange()
: sort resultsmutate()
: create new columns by using information from other columnsgroup_by()
andsummarize()
: create summary statistics on grouped datacount()
: count discrete values
Reminder: if you ever want to know more about these functions, just use the help function - ?
followed by the function of interest. For example: ?select
3.2.4.1 Selecting, deleting, reordering columns
Selecting columns
In base R, if we wanted to select certain columns (e.g. column names c1, c2, c6, c7) from a dataframe, df
, we would do something like this: df[c("c1", "c2", "c6", "c7")]
. With tidyverse, we can use the select(c1, c2, c6, c7)
function (from dplyr
).
Deleting columns
To delete a column, just add a -
before the column name inside of select()
.
Reordering columns
Select can also be used to reorder columns. Just specify the order of columns in select()
.
Selecting and deselecting in the irisz
dataset
# Let's select just the location
select(irisz, Location)
# Remove the Location column
select(irisz, -Location)
Selecting multiple columns
Let’s select both the Location and Species columns and see what unique combinations are present in the dataset
## Location Species
## 1 Korea setosa
## 2 China setosa
## 6 Canada setosa
## 8 China <NA>
## 9 Russia setosa
## 10 Japan setosa
## 46 <NA> setosa
## 51 <NA> versicolor
## 54 USA versicolor
## 61 Canada versicolor
## 101 USA virginica
## 107 <NA> virginica
## 143 <NA> <NA>
# Note that the order returned is the order specified in select()
# Originally Location came after Species, but now they are reordered.
We can also do this in with the pipe.
## Location Species
## 1 Korea setosa
## 2 China setosa
## 6 Canada setosa
## 8 China <NA>
## 9 Russia setosa
## 10 Japan setosa
## 46 <NA> setosa
## 51 <NA> versicolor
## 54 USA versicolor
## 61 Canada versicolor
## 101 USA virginica
## 107 <NA> virginica
## 143 <NA> <NA>
Note that select()
subsets the dataframe to the specified columns. The output is a data frame (not a vector), even if only one column is selected
What if we want to extract one column as a vector?
In base R, we would use the $
operation to select a single column as a vector. With tidyverse, we can use the pull()
function (from dplyr
). Notice that pull() will only accept a single column name and will return a vector
3.2.4.2 Filtering dataframes
The basics of filtering data frames in tidyverse are the filter()
function and boolean operators (>
, <
, ==
) learned in section 2.4.3.
Let’s try two examples:
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species Location
## 1 7.1 3.0 5.9 2.1 virginica USA
## 2 7.6 3.0 6.6 2.1 virginica USA
## 3 7.2 3.6 6.1 2.5 virginica USA
## 4 7.7 3.8 6.7 2.2 virginica USA
## 5 7.7 2.6 6.9 23.0 virginica USA
## 6 7.7 2.8 6.7 2.0 virginica USA
## 7 7.2 3.2 6.0 1.8 virginica USA
## 8 7.2 3.0 5.8 1.6 virginica USA
## 9 7.4 2.8 6.1 1.9 virginica USA
## 10 7.9 3.8 6.4 2.0 virginica USA
## 11 7.7 3.0 6.1 2.3 virginica USA
# Filter to get observations of setosa species from Korea
irisz %>%
filter(Species == "setosa",
Location == "Korea")
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species Location
## 1 5.1 3.50 1.4 0.2 setosa Korea
## 2 4.7 3.20 1.3 0.2 setosa Korea
## 3 4.8 3.00 1.4 0.1 setosa Korea
## 4 5.4 3.90 1.3 0.4 setosa Korea
## 5 5.1 3.50 1.4 0.3 setosa Korea
## 6 5.1 3.80 1.5 0.3 setosa Korea
## 7 5.1 3.70 1.5 0.4 setosa Korea
## 8 5.5 0.35 1.3 0.2 setosa Korea
## 9 4.4 3.00 1.3 0.2 setosa Korea
## 10 NA NA 1.6 0.6 setosa Korea
3.2.4.3 Creating and renaming columns
Creating columns in tidyverse is simple with the mutate()
function.
Let’s say we wanted to create a new column for the irisz
dataset which has the row number.
In base R, we would do something like irisz$row_number <- seq(1,nrow(irisz))
. Here, seq()
creates a sequence of numbers from 1 to the number of rows in irisz
.
In tidyverse we can do it using the mutate()
and row_number()
functions. So it doesn’t output the whole data frame, we’ll pipe it directly into head()
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species Location row_number
## 1 5.1 3.5 1.4 0.2 setosa Korea 1
## 2 4.9 3.0 1.4 0.2 setosa China 2
## 3 4.7 3.2 1.3 0.2 setosa Korea 3
## 4 4.6 3.1 1.5 0.2 setosa China 4
## 5 5.0 3.6 1.4 0.2 setosa China 5
## 6 5.4 NA 1.7 0.4 setosa Canada 6
See the row number column?
Mutate can also be used to update a column without creating a new one.
# Change sepal length to be 10x what it currently is
irisz %>%
mutate(Sepal.Length = 10*Sepal.Length ) %>%
head()
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species Location
## 1 51 3.5 1.4 0.2 setosa Korea
## 2 49 3.0 1.4 0.2 setosa China
## 3 47 3.2 1.3 0.2 setosa Korea
## 4 46 3.1 1.5 0.2 setosa China
## 5 50 3.6 1.4 0.2 setosa China
## 6 54 NA 1.7 0.4 setosa Canada
What if all the setosa measurements were actually supposed to be the virginica species?
Let’s use the ifelse()
function which accepts a conditional test, value if true, value if false.
irisz %>%
# If Species is setosa, change it to virginica, otherwise use the value in Species
mutate(Species = ifelse(Species == "setosa", "virginica", Species )) %>%
head()
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species Location
## 1 5.1 3.5 1.4 0.2 virginica Korea
## 2 4.9 3.0 1.4 0.2 virginica China
## 3 4.7 3.2 1.3 0.2 virginica Korea
## 4 4.6 3.1 1.5 0.2 virginica China
## 5 5.0 3.6 1.4 0.2 virginica China
## 6 5.4 NA 1.7 0.4 virginica Canada
Renaming columns
Sometimes, we just need to rename an existing column. Use rename()
to specify the new and old column names, rename(new_name = old_name)
. Let’s rename the location column to be “Country”. Let’s just pipe it directly into colnames()
which will give us the column names of the data frame.
## [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
## [6] "Country"
Want to change row names into a column or a column into a row name? Check out rownames_to_column()
and column_to_rownames()
.
3.2.4.4 Summarizing data
Summarizing data is very important in data analysis. It allows you to get a big picture overview of a new dataset: type, range of values, etc. These are also types of summaries to think about including when in the exploratory stages or a project: before running analyses on a dataset, create a summary of the dataset. Present this during your update meetings for your summer project and keep these summaries on hand in case you or others have questions about the dataset throughout the course of your project.
For summarizing data in tidyverse, let’s explore the functions
count()
group_by()
summarize()
The function count()
does what it sounds like - it counts the number of unique observations for the specified column(s)
Count on one variable in the data frame
## Location n
## 1 Canada 35
## 2 China 6
## 3 Japan 11
## 4 Korea 10
## 5 Russia 11
## 6 USA 63
## 7 <NA> 14
Count on more than one variable in the data frame
# Count the number of observations for each combination
# of location and species
irisz %>%
count(Location, Species)
## Location Species n
## 1 Canada setosa 11
## 2 Canada versicolor 24
## 3 China setosa 5
## 4 China <NA> 1
## 5 Japan setosa 11
## 6 Korea setosa 10
## 7 Russia setosa 11
## 8 USA versicolor 20
## 9 USA virginica 43
## 10 <NA> setosa 1
## 11 <NA> versicolor 6
## 12 <NA> virginica 6
## 13 <NA> <NA> 1
Can you think of another (perhaps longer) way to count using other functions, such as select()
, unique()
, n()
, pull()
, or table()
?
Summarizing WITHIN groups
What if we want to do more than just count the number of occurrences for a given group? group_by
takes a variable and groups the data by that variable. Anything that happens after that is only calculated within the group variable. To turn off grouping, use ungroup()
. It’s always good practice to ungroup()
after you’re done grouping. Although it doesn’t matter if you won’t be doing any more operations on the dataset.
# Here's an example that produces a similar result
# as in the count() example above
irisz %>%
group_by(Location) %>%
summarize(n = n()) %>%
ungroup()
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 7 x 2
## Location n
## <fct> <int>
## 1 Canada 35
## 2 China 6
## 3 Japan 11
## 4 Korea 10
## 5 Russia 11
## 6 USA 63
## 7 <NA> 14
Try taking away the ungroup()
and see if it changes the result.
Let’s build on this. We also want to calculate the mean()
and sd()
for a new variable called size for each location. Size is going to be length * width
of the sepal. Our process is:
- Create a new column for size using
mutate()
- Group the data by Location using
group_by()
- Summarize the mean and standard deviation of the group using
summarize()
# Since mean() and sd() require numeric inputs,
# We'll also tell mean() and sd() to ignore NAs in the calculation
irisz %>%
mutate(size = as.numeric(Sepal.Length) * as.numeric(Sepal.Width)) %>%
group_by(Location) %>%
summarize(n = n(),
mean_size = mean(Petal.Length, na.rm = T),
mean_sd = sd(Petal.Length, na.rm = T)) %>%
ungroup()
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 7 x 4
## Location n mean_size mean_sd
## <fct> <int> <dbl> <dbl>
## 1 Canada 35 3.39 1.30
## 2 China 6 1.52 0.194
## 3 Japan 11 1.43 0.205
## 4 Korea 10 1.4 0.105
## 5 Russia 11 1.48 0.218
## 6 USA 63 5.21 0.826
## 7 <NA> 14 4.51 0.990
For a further challenge, let’s clean up this table. Arrange the data so the countries with largest number of observations are at the top. Also, remove observations with no location.
3.2.4.5 Recap: Tidyverse vs Base R
To review, tidyverse is just another way to code in R with new functions and a framework based on the pipe %>%
. You can always use a combination of both base R functions and tidyverse functions.
There are some great resources on tidyverse functions. Here are a couple:
- Beginner’s tidyverse cheat sheet. The
dplyr
section includes the functions we focused on today. Theggplot2
section is what we’ll go over in the next section. - Introduction to tidyverse includes a more comprehensive overview of tidyverse, from reading in data to applying essential functions (what we did today) to more advanced topics such as reshaping data frames.
Licensed Creative Commons Attribution-NonCommercial-ShareAlike 4.0