The {tidyverse} data manipulation functions have been a boon to analysts’ productivity. This resource is built on the idea that we sometimes just need to see how functions work in action to understand them; follow the link to the materials for each section to see the functions in action. The materials here can be used as an introduction or for reference. The {tidyverse} is an open source project in R led by Hadley Wickham; it contains several packages designed to work together in a consistent, logical, and human-friendly fashion.
| function | action |
|---|---|
| filter() | keep rows (if true) |
| select() | keep variables (or drop them with -var) |
| mutate() | create a new variable |
| case_when() | recode a variable; often used with mutate() |
| rename() | rename variables |
| arrange() | order rows based on a variable |
| slice() | keep or drop rows based on row number |
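The functions in the table above can be chained into a single pipeline. The following is a minimal sketch, assuming {dplyr} is installed; the `scores` data and its column names are made up for illustration.

```r
library(dplyr)

# Hypothetical toy data, invented for this example
scores <- tibble(
  name  = c("Ana", "Ben", "Cy", "Dee", "Eli"),
  age   = c(23, 35, 41, 19, 52),
  score = c(88, 72, 95, 60, 79)
)

result <- scores %>%
  filter(age >= 21) %>%          # keep rows where the condition is TRUE
  select(-age) %>%               # drop a variable with -var
  mutate(grade = case_when(      # create a new variable by recoding score
    score >= 90 ~ "A",
    score >= 75 ~ "B",
    TRUE        ~ "C"
  )) %>%
  rename(student = name) %>%     # syntax is new_name = old_name
  arrange(desc(score)) %>%       # order rows, highest score first
  slice(1:2)                     # keep the first two rows by position

result
```

Each step takes the data frame produced by the previous step, so the order of the steps matters: here `age` is dropped only after it has been used in `filter()`.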
Especially in filtering, you are likely to use the comparison operators ==, !=, >, <, >=, <=, which make a comparison and return TRUE or FALSE. These can also be combined with & (both conditions must be met) or | (either condition may be met). The %in% operator returns TRUE if the left-hand side element is found among the right-hand side elements.
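As a small illustration of these operators inside `filter()`, here is a sketch assuming {dplyr} is installed; the `pets` data is hypothetical.

```r
library(dplyr)

# Hypothetical toy data, invented for this example
pets <- tibble(
  species = c("cat", "dog", "fish", "dog", "bird"),
  weight  = c(4, 20, 0.1, 8, 0.3)
)

# A comparison returns one TRUE or FALSE per element
is_heavy <- pets$weight > 5

# & : both conditions must be met
heavy_dogs <- filter(pets, species == "dog" & weight > 10)

# | : either condition may be met
cat_or_light <- filter(pets, species == "cat" | weight < 1)

# %in% : TRUE when the left-hand element appears among the right-hand elements
furry <- filter(pets, species %in% c("cat", "dog"))
```

`filter()` keeps a row only when the whole expression evaluates to TRUE for that row.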
| function | action |
|---|---|
| summarize() | summarize the data, by groups if they have been declared |
| group_by() | declare subsets in data |
| distinct() | returns only rows that are unique |
| tally() & count() | counting (by groups if group_by() applied) |
Base-R functions that you might use in summarizing include mean(), median(), sd(), IQR(), min(), max(), etc.
The function summary() can also be used to request summary statistics for an entire data set.
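The summarizing functions above might be combined as in the following sketch, which assumes {dplyr} is installed; the `sales` data is hypothetical.

```r
library(dplyr)

# Hypothetical toy data, invented for this example
sales <- tibble(
  region = c("east", "east", "west", "west", "west"),
  amount = c(10, 20, 5, 15, 25)
)

# Declare groups, then summarize within each group using base-R functions
by_region <- sales %>%
  group_by(region) %>%
  summarize(
    mean_amount = mean(amount),
    max_amount  = max(amount),
    n           = n()
  )

# count() is a shortcut for grouping and counting rows
region_counts <- count(sales, region)

# distinct() returns only the unique rows (here, unique values of region)
unique_regions <- distinct(sales, region)

# Base-R summary() gives summary statistics for every column at once
summary(sales)
```

Without `group_by()`, `summarize()` would return a single row describing the whole data set.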
| function | action |
|---|---|
| gather() or pivot_longer() | from wide to long |
| spread() or pivot_wider() | from long to wide |
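Reshaping between wide and long layouts could look like the sketch below, which assumes {tidyr} is installed. The `wide` data and the column names `question` and `answer` are hypothetical.

```r
library(tidyr)

# Hypothetical wide data: one row per id, one column per question
wide <- data.frame(
  id = c(1, 2),
  q1 = c(5, 3),
  q2 = c(4, 2)
)

# Wide to long: the q1 and q2 columns become question/answer pairs
long <- pivot_longer(
  wide,
  cols      = c(q1, q2),
  names_to  = "question",
  values_to = "answer"
)

# Long to wide: spread the question/answer pairs back into columns
back <- pivot_wider(long, names_from = question, values_from = answer)
```

The example uses `pivot_longer()` and `pivot_wider()` rather than `gather()` and `spread()`; both pairs reshape in the same directions.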
| function | action |
|---|---|
| full_join() | keeps all rows |
| inner_join() | keeps overlapping rows |
| left_join() | keeps all left-hand side rows |
| right_join() | keeps all right-hand side rows |
| anti_join() | keeps only left-hand side rows that have no match on the right-hand side |
| crossing() | for each left-hand side row, include the entire data set of the right-hand side |
| bind_rows() | stack datasets vertically, matching columns by name |
| bind_cols() | glue datasets together side-by-side |
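To make the differences between these functions concrete, here is a sketch assuming {dplyr} and {tidyr} are installed. The `people` and `visits` data are hypothetical, and `crossing()` is shown with plain vectors rather than whole data sets to keep the example simple.

```r
library(dplyr)
library(tidyr)  # crossing() comes from tidyr

# Hypothetical toy data: ids 2 and 3 appear in both tables
people <- tibble(id = c(1, 2, 3), name = c("Ana", "Ben", "Cy"))
visits <- tibble(id = c(2, 3, 4), clinic = c("north", "south", "east"))

full  <- full_join(people, visits, by = "id")   # ids 1, 2, 3, 4
inner <- inner_join(people, visits, by = "id")  # ids 2, 3
left  <- left_join(people, visits, by = "id")   # ids 1, 2, 3
right <- right_join(people, visits, by = "id")  # ids 2, 3, 4
anti  <- anti_join(people, visits, by = "id")   # id 1 only (no match in visits)

# Every combination of the supplied values
crossed <- crossing(name = people$name, year = c(2020, 2021))

# Stack rows; columns are matched by name and gaps are filled with NA
stacked <- bind_rows(people, tibble(id = 5, city = "Oslo"))

# Glue columns side by side; both inputs need the same number of rows
glued <- bind_cols(people, tibble(city = c("Oslo", "Lima", "Kyiv")))
```

In the joins, unmatched rows that are kept get NA in the columns coming from the other table; for example, `left` has NA for `clinic` where `id` is 1.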
I personally think of data cleaning as a topic separate from, but related to, data manipulation, which is the central topic of the present resource. A separate data cleaning resource can be found here.
{stringr}, {countrycode}, {lubridate}

Another part of the {tidyverse} not covered here is {ggplot2}, the popular data visualization package authored by Hadley Wickham and contributors.
This is a resource we’ve looked at for thinking about what strength of relationships we might observe just by chance.