This section covers some of the very basics, including:
value types: numeric, character, integer, date, logical, factor
arithmetic operators: + - * / () ^
data structures: scalar, vector, matrix, data frame (with rows and columns)
objects and assignment: = <- ->
functions: mean(), median(), mode(), log10()
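
The items listed above can be sketched in a few lines of base R. This is a minimal illustration only; all object names (`x_num`, `v`, `df`, and so on) are made up for the example. Note that `mode()` in base R reports how an object is stored, not the statistical mode.

```r
# Value types
x_num  <- 3.14                              # numeric
x_int  <- 2L                                # integer
x_chr  <- "hello"                           # character
x_lgl  <- TRUE                              # logical
x_date <- as.Date("2021-03-15")             # date
x_fct  <- factor(c("low", "high", "low"))   # factor

# Arithmetic operators
arith <- (2 + 3) * 4 / 2 ^ 2                # 5

# Data structures
v  <- c(1, 2, 3, 10)                                 # vector
m  <- matrix(1:6, nrow = 2)                          # matrix with 2 rows, 3 columns
df <- data.frame(id = 1:3, score = c(90, 85, 70))    # data frame

# Objects and assignment: all three forms create an object
a = 5
b <- 5
5 -> c_val

# Functions
mean(v)       # 4
median(v)     # 2.5
log10(1000)   # 3
mode(v)       # "numeric" (the storage mode, not the most frequent value)
```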
The {tidyverse} data manipulation functions have been a boon to analysts’ productivity. The {tidyverse} is an open-source project in R led by Hadley Wickham and supported by RStudio. It contains several packages, including {dplyr} and {tidyr}, that are designed to work together in a consistent, logical, and human-friendly fashion. For most of the work that follows, you’ll need to have the tidyverse attached with library(tidyverse).
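
A minimal sketch of attaching the tidyverse at the top of a script. The commented-out install line is an assumption for a machine where the package has not been installed yet.

```r
# install.packages("tidyverse")  # run once if the package is not yet installed
library(tidyverse)               # attaches dplyr, tidyr, and the other core packages
```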
| function | action |
|---|---|
| filter() | keep rows (if true) |
| select() | keep variables (or drop them -var) |
| mutate() | create a new variable |
| case_when() | recode a variable, often used with mutate() |
| rename() | rename variables |
| arrange() | order rows based on a variable |
| slice() | keep or drop rows based on row number |
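
The functions in the table above are usually chained with the pipe. The sketch below uses a small made-up data set (`scores` and its columns are hypothetical) to show each one once.

```r
library(dplyr)

# Hypothetical example data
scores <- tibble(
  id    = 1:5,
  name  = c("Ana", "Ben", "Cy", "Dee", "Eli"),
  score = c(91, 78, 85, 62, 70),
  room  = c("A", "B", "A", "B", "A")
)

result <- scores %>%
  filter(score >= 70) %>%          # keep rows where the condition is TRUE (drops Dee)
  select(-room) %>%                # drop the room variable
  mutate(grade = case_when(        # create a new variable by recoding score
    score >= 90 ~ "high",
    score >= 80 ~ "middle",
    TRUE        ~ "low"
  )) %>%
  rename(student = name) %>%       # new name = old name
  arrange(desc(score)) %>%         # order rows from highest to lowest score
  slice(1:3)                       # keep the first three rows

result$student   # "Ana" "Cy" "Ben"
result$grade     # "high" "middle" "low"
```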
Especially in filtering, you are likely to use the comparison operators ==, !=, >, <, >=, and <=, which compare two values and return TRUE or FALSE. Comparisons can be combined with & (both conditions must be met) or | (either condition may be met). The %in% operator returns TRUE if the left-hand side element is found among the right-hand side elements.
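
As a short illustration of these operators inside filter(), here is a sketch on a made-up data set (`pets` and its columns are hypothetical).

```r
library(dplyr)

# Hypothetical example data
pets <- tibble(
  name   = c("Rex", "Tom", "Bub", "Kit"),
  animal = c("dog", "cat", "fish", "cat"),
  age    = c(3, 7, 1, 2)
)

5 != 3                        # TRUE
c(1, 5, 9) > 4                # FALSE TRUE TRUE
"cat" %in% c("cat", "dog")    # TRUE

young_cats <- filter(pets, animal == "cat" & age < 5)     # both conditions: Kit
cat_or_old <- filter(pets, animal == "cat" | age >= 3)    # either condition: Rex, Tom, Kit
furry      <- filter(pets, animal %in% c("dog", "cat"))   # membership: Rex, Tom, Kit
```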
You may be interested in summaries of variables in your data, or in summaries of those variables within different groups.
| function | action |
|---|---|
| summarize() | summarize the data, by groups if they have been declared |
| group_by() | declare subsets in data |
| distinct() | returns only rows that are unique |
| tally() & count() | counting (by groups if group_by() applied) |
| n() | return number of rows |
| across() | apply the same summary to several variables at once |
Base R functions that you might use in summarizing include mean(), median(), sd(), IQR(), min(), and max().
The function summary() can also be used to request summary statistics for an entire data set.
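
A sketch of the summarizing functions on a made-up data set (`sales` and its columns are hypothetical), combining the {dplyr} functions with the base R functions mentioned above.

```r
library(dplyr)

# Hypothetical example data
sales <- tibble(
  region = c("east", "east", "west", "west", "west"),
  units  = c(10, 20, 5, 15, 25),
  price  = c(2, 4, 3, 3, 6)
)

# One row per group: n() counts rows, mean() and max() come from base R
by_region <- sales %>%
  group_by(region) %>%
  summarize(n = n(), mean_units = mean(units), max_price = max(price))

count(sales, region)      # counts per region, like group_by() followed by tally()
distinct(sales, region)   # unique values of region

# across() applies mean() to both numeric variables at once
means <- sales %>% summarize(across(c(units, price), mean))

summary(sales)            # summary statistics for every column
```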
| function | action |
|---|---|
| pivot_longer() | from wide to long |
| pivot_wider() | from long to wide |
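
A minimal sketch of reshaping with {tidyr}, using a made-up wide table (`wide` and its columns are hypothetical) and then reversing the operation.

```r
library(tidyr)

# Hypothetical wide data: one column per year
wide <- tibble(
  country = c("A", "B"),
  `2019`  = c(10, 20),
  `2020`  = c(11, 21)
)

# Wide to long: the year columns become one "year" column and one "value" column
long <- pivot_longer(wide, cols = c(`2019`, `2020`),
                     names_to = "year", values_to = "value")

# Long to wide: back to one column per year
wide_again <- pivot_wider(long, names_from = year, values_from = value)
```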
| function | action |
|---|---|
| full_join() | keeps all rows |
| inner_join() | keeps only rows whose id is found in both datasets |
| left_join() | keeps all left-hand side rows |
| right_join() | keeps all right-hand side rows |
| anti_join() | removes rows if there is a match on right-hand side |
| crossing() | for each left-hand side row, include the entire data set of the right-hand side |
| bind_rows() | stack datasets, finding consistent column names |
| bind_cols() | glue datasets together side-by-side |
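
The differences between the functions above are easiest to see on two tiny tables that only partly overlap. In this sketch, `people` and `visits` are made-up data sets, and crossing() comes from {tidyr}.

```r
library(dplyr)
library(tidyr)   # for crossing()

# Hypothetical example data: ids 2 and 3 appear in both tables
people <- tibble(id = c(1, 2, 3), name = c("Ana", "Ben", "Cy"))
visits <- tibble(id = c(2, 3, 4), visits = c(5, 1, 7))

full  <- full_join(people, visits, by = "id")    # ids 1, 2, 3, 4
inner <- inner_join(people, visits, by = "id")   # ids 2, 3
left  <- left_join(people, visits, by = "id")    # ids 1, 2, 3
right <- right_join(people, visits, by = "id")   # ids 2, 3, 4
anti  <- anti_join(people, visits, by = "id")    # id 1 only

# Every row of people paired with every row of the second table: 3 x 2 = 6 rows
crossed <- crossing(people, tibble(wave = c("pre", "post")))

stacked <- bind_rows(people, tibble(id = 9, name = "Zed"))   # 4 rows
side    <- bind_cols(people, tibble(age = c(30, 41, 25)))    # 3 rows, 3 columns
```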
{data.table} is a data manipulation package. It’s “blazing fast” and very popular, but this section is under construction, so it is limited in its examples.
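
As a small taste of the syntax, here is a hedged sketch of the general `dt[i, j, by]` form on a made-up table (`dt` and its columns are hypothetical).

```r
library(data.table)

# Hypothetical example data
dt <- data.table(
  group = c("a", "a", "b", "b"),
  value = c(1, 3, 10, 20)
)

big <- dt[value > 2]                                    # i: filter rows
dt[, double := value * 2]                               # j: add a column by reference
avg <- dt[, .(mean_value = mean(value)), by = group]    # by: summarize within groups
```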
You can also do a lot of data manipulation without any external packages at all.
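
For comparison, here is a sketch of common manipulation steps using base R only. The data set `df` and the lookup table `labels` are made up for the example.

```r
# Hypothetical example data
df <- data.frame(
  id    = 1:4,
  group = c("a", "b", "a", "b"),
  score = c(50, 80, 70, 90)
)

high   <- df[df$score >= 70, ]          # keep rows where the condition is TRUE
slim   <- df[, c("id", "score")]        # keep selected variables
df$passed <- df$score >= 60             # create a new variable
sorted <- df[order(-df$score), ]        # order rows from highest to lowest score

# Group means: a = 60, b = 85
group_means <- aggregate(score ~ group, data = df, FUN = mean)

# Join on a shared key
labels <- data.frame(group = c("a", "b"), label = c("Alpha", "Beta"))
merged <- merge(df, labels, by = "group")
```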
What’s the workflow from reading in data to analysis-ready? It really depends! Data cleaning is hard to teach because data can be messy/untidy in a lot of different ways. The data cleaning flipbook will walk you through some real-life examples, with special focus on string manipulation, country codes and date manipulation.
{stringr}, {countrycode}, {lubridate}
You might also be interested in related tools for data visualization and statistical analysis.