The {tidyverse} data manipulation functions have been a boon to analysts’ productivity. This resource is built on the idea that we sometimes just need to see functions in action to understand them; follow the link to the materials for each section to see the functions in action. The materials here can be used as an introduction or for reference. The {tidyverse} is an open source project in R led by Hadley Wickham; the {tidyverse} itself contains several packages designed to work together in a consistent, logical, and human-friendly fashion.
function | action |
---|---|
filter() | keep rows (if true) |
select() | keep variables (or drop them with -var) |
mutate() | create a new variable |
case_when() | recode the values of a variable, often used inside mutate() |
rename() | rename variables |
arrange() | order rows based on a variable |
slice() | keep or drop rows based on row number |
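
As a minimal sketch of how the functions in this table chain together, the example below runs each one on a small data frame. The `scores` data and its column names are made up for illustration and are not part of the original materials.

```r
library(dplyr)

# Hypothetical toy data, invented for this example
scores <- data.frame(
  name  = c("Ana", "Ben", "Cy", "Dee", "Eli"),
  age   = c(23, 35, 41, 29, 52),
  score = c(88, 72, 95, 60, 79),
  stringsAsFactors = FALSE
)

result <- scores %>%
  filter(score >= 70) %>%        # keep rows where the condition is TRUE
  select(-age) %>%               # drop a variable with a leading minus
  mutate(grade = case_when(      # create a new variable by recoding score
    score >= 90 ~ "A",
    score >= 80 ~ "B",
    TRUE        ~ "C"            # everything else
  )) %>%
  rename(student = name) %>%     # syntax is new_name = old_name
  arrange(desc(score)) %>%       # order rows, highest score first
  slice(1:2)                     # keep the first two rows by position

result
```

Each step receives the output of the previous one through the pipe, so the order of the steps matters: here `slice()` keeps the top two rows only because `arrange()` has already sorted them.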
Especially when filtering, you are likely to use the comparison operators ==, !=, >, <, >=, and <=, which make a comparison and return TRUE or FALSE. Comparisons can be combined with & (both conditions must be met) or | (either condition may be met). The %in% operator returns TRUE if the left-hand side element is found among the right-hand side elements.
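The operators described above might be used inside filter() roughly as follows. This is a sketch only; the `pets` data frame is hypothetical.

```r
library(dplyr)

# Hypothetical toy data, invented for this example
pets <- data.frame(
  name    = c("Rex", "Tom", "Kiwi", "Bo", "Nemo"),
  species = c("dog", "cat", "bird", "dog", "fish"),
  age     = c(3, 7, 2, 10, 1),
  stringsAsFactors = FALSE
)

# A comparison returns one TRUE or FALSE per row
is_adult <- pets$age >= 3

# & : both conditions must be met
young_dogs <- pets %>% filter(species == "dog" & age < 5)

# | : either condition may be met
cat_or_old <- pets %>% filter(species == "cat" | age > 8)

# != : keep rows where the values differ
not_dogs <- pets %>% filter(species != "dog")

# %in% : TRUE when the left-hand value appears in the right-hand vector
furry <- pets %>% filter(species %in% c("dog", "cat"))
```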
function | action |
---|---|
summarize() | summarize the data, by groups if they have been declared |
group_by() | declare subsets in data |
distinct() | returns only rows that are unique |
tally() & count() | counting (by groups if group_by() applied) |
Base R functions that you might use when summarizing include mean(), median(), sd(), IQR(), min(), and max(), among others.
The function summary() can also be used to request summary statistics for an entire data set.
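A short sketch of grouped summaries, using the functions from the table together with a few of the base R helpers. The `sales` data frame is hypothetical.

```r
library(dplyr)

# Hypothetical toy data; the last two rows are exact duplicates
sales <- data.frame(
  region = c("north", "north", "south", "south", "south", "south"),
  amount = c(10, 20, 5, 15, 25, 25),
  stringsAsFactors = FALSE
)

# Declare groups, then summarize within each group
by_region <- sales %>%
  group_by(region) %>%
  summarize(
    n_rows      = n(),
    mean_amount = mean(amount),
    max_amount  = max(amount)
  )

# count() is a shortcut for group_by() followed by tally()
counts <- sales %>% count(region)

# distinct() drops the duplicated row
unique_rows <- sales %>% distinct()

# Base R summary statistics for every column
summary(sales)
```

Without the group_by() step, summarize() would return a single row describing the whole data set.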
function | action |
---|---|
gather() or pivot_longer() | from wide to long |
spread() or pivot_wider() | from long to wide |
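
The sketch below reshapes a small table from wide to long and back again. It uses pivot_longer() and pivot_wider(), which {tidyr} now recommends over the older gather() and spread(). The `wide` data frame and the column names `quarter` and `value` are hypothetical.

```r
library(tidyr)

# Hypothetical toy data: one row per id, one column per quarter
wide <- data.frame(
  id = c(1, 2),
  q1 = c(10, 30),
  q2 = c(20, 40)
)

# Wide to long: column names go into "quarter", cell values into "value"
long <- pivot_longer(
  wide,
  cols      = c(q1, q2),
  names_to  = "quarter",
  values_to = "value"
)

# Long to wide: reverse the reshaping
wide_again <- pivot_wider(
  long,
  names_from  = quarter,
  values_from = value
)
```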
function | action |
---|---|
full_join() | keeps all rows |
inner_join() | keeps only rows with a match in both datasets |
left_join() | keeps all left-hand side rows |
right_join() | keeps all right-hand side rows |
anti_join() | keeps left-hand side rows that have no match on the right-hand side |
crossing() | pairs every left-hand side row with every right-hand side row |
bind_rows() | stack datasets, matching columns by name |
bind_cols() | glue datasets together side-by-side |
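
To see how the joins differ, the sketch below applies several of them to two small data frames that share an `id` column. Both `people` and `orders` are hypothetical.

```r
library(dplyr)

# Hypothetical toy data: ids 2 and 3 appear in both data frames
people <- data.frame(
  id   = c(1, 2, 3),
  name = c("Ana", "Ben", "Cy"),
  stringsAsFactors = FALSE
)
orders <- data.frame(
  id   = c(2, 3, 4),
  item = c("pen", "ink", "pad"),
  stringsAsFactors = FALSE
)

full  <- full_join(people, orders, by = "id")   # all ids: 1, 2, 3, 4
inner <- inner_join(people, orders, by = "id")  # only shared ids: 2, 3
left  <- left_join(people, orders, by = "id")   # ids 1, 2, 3; item is NA for id 1
right <- right_join(people, orders, by = "id")  # ids 2, 3, 4; name is NA for id 4
anti  <- anti_join(people, orders, by = "id")   # id 1, which has no match in orders

# Stack rows; columns are matched by name and gaps are filled with NA
stacked <- bind_rows(people, orders)

# Glue side by side; rows are paired by position, not by id
side <- bind_cols(people, select(orders, item))
```

Note that bind_cols() does not check keys, so it is only appropriate when both datasets are already in the same row order.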
I personally think of data cleaning as a separate but related topic to data manipulation, which is the central topic of the present resource. A separate data cleaning resource can be found here.
{stringr}
{countrycode}
{lubridate}
Another part of the {tidyverse} not covered here is {ggplot2}, the popular data visualization package authored by Hadley Wickham and contributors.
This is a resource we’ve looked at for thinking about what strength of relationships we might observe just by chance.