The {tidyverse} data manipulation functions have been a boon to analysts’ productivity. This resource is built with the idea that we sometimes just need to see how functions work in action to understand them; just link the materials for each section to see the functions “in-action”. The materials here can be used as an introduction or for reference. The {tidyverse} is an open source project in R lead by Hadley Wickham; the {tidyverse} itself contains several packages designed to work together in a consistent, logical, and human-friendly fashion.

Introduction

One-stream data manipulation

function action
filter() keep rows (if true)
select() keep variables (or drop them -var)
mutate() create a new variable
case_when() is used for “recoding” variable, often used with mutate()
rename() renaming variables
arrange() order rows based on a variable
slice() *keep or drop rows based on row number

Logical operators

Especially in filtering, you are likely to use Boolean operatores ==, !=, >, <, >=, <=, which make a comparison and return TRUE or FALSE. These can also be combine with & (both conditions must be met), | (either condition may be met), or %in% (if the left hand side element is found among the right hand side elements, TRUE is returned).

Summarizing

function action
summarize() summarize the data, by groups if they have been declared
group_by() declare subsets in data
distinct() returns only rows that are unique
tally() & count() counting (by groups if group_by() applied)

Base-R functions that you are might use in summarizing include mean(), median(), sd(), IQR(), min(), max() etc.

The function summary() can also be used to request summary statistcs for an entire data set.

Shape transformation (wide <—> long)

function action
gather() or pivot_longer() from wide to long
spread() or pivot_wider() from long to wide

Bringing streams together: Joins and Binds

function action
full_join() keeps all rows
inner_join() keeps overlapping rows
left_join() keeps all left-hand side rows
right_join() keeps all right-hand side rows
anti_join() removes rows if there is a match on right-hand side
crossing() for each left-hand side row, include the entire data set of the right-hand side
bind_rows() stack datasets, finding consistent column names
bind_cols() glue datasets together side-by-side

External Resources

Data Cleaning

I personally think of data cleaning as a separate, but related topic to data manipulation - which is the central topic of the present resource. A separate data cleaning resource can be found here.

  • Regular Expressions, character string manipulation with {stringr}
  • Unique Identifiers for Countries with {countrycode}
  • Date Manipulation {lubridate}

Data Visualization with {ggplot2}

Another part of the {tidyverse} not covered here is {ggplot2} the popular data visualization package authored by Hadley Wickham and contributors.

Cool with Statistical Independence

This a resource we’ve looked at for thinking about what strength of relatshionships we might observe just by chance.