The very basics

Arithmetic operations, value types, and data structures

This section covers some of the very basics, including:

value types: numeric, character, integer, date, logical, factor
arithmetic operators: + - * / () ^
data structures: scalar, vector, matrix, data frame (with rows and columns)
objects and assignment: = <- ->
functions: mean(), median(), mode(), log10()
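
The items listed above can be sketched in a few lines of base R. This is only an illustration; the object names (`heights`, `m`, `people`) and the values are made up.

```r
# Assignment: <- is the conventional operator; = and -> also work
x <- 5
y = 2
3 -> z

# Arithmetic operators
total <- (x + y) * z   # 21
power <- x^y           # 25
ratio <- x / y         # 2.5

# Value types
class(1.5)                       # "numeric"
class(2L)                        # "integer"
class("a")                       # "character"
class(TRUE)                      # "logical"
class(as.Date("2020-01-15"))     # "Date"
class(factor(c("low", "high")))  # "factor"

# Data structures
heights <- c(150, 160, 170)      # vector
m <- matrix(1:6, nrow = 2)       # matrix with 2 rows and 3 columns
people <- data.frame(            # data frame with rows and columns
  name = c("Ana", "Ben", "Cy"),
  height = heights
)

# Functions
mean(heights)    # 160
median(heights)  # 160
log10(1000)      # 3
mode(heights)    # "numeric"
```

Note that mode() reports how an object is stored, not the statistical mode of its values.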

Click on the slides below, then use the left and right arrow keys to navigate, click on the arrows, or expand to full size.

Data manipulation with the tidyverse

The {tidyverse} data manipulation functions have been a boon to analysts’ productivity. The {tidyverse} is an open source project in R led by Hadley Wickham and supported by RStudio. It contains several packages, including {dplyr} and {tidyr}, that are designed to work together in a consistent, logical, and human-friendly fashion. For most of the work that follows, you’ll need to have the tidyverse attached as follows:
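
Attaching is presumably the usual library() call, sketched here. The commented install line is an assumption about your setup and is only needed once.

```r
# install.packages("tidyverse")  # one-time install, if the package is missing
library(tidyverse)               # attaches dplyr, tidyr, and the other core packages
```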

One-stream data manipulation

function	action
filter()	keep rows (if true)
select()	keep variables (or drop them with -var)
mutate()	create a new variable
case_when()	recode a variable, often used with mutate()
rename()	rename variables
arrange()	order rows based on a variable
slice()	keep or drop rows based on row number
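
As a rough sketch, the verbs in the table can be chained into one pipeline. The `students` data frame and its columns are hypothetical toy data, not taken from the flipbook.

```r
library(dplyr)

# Hypothetical toy data
students <- data.frame(
  name  = c("Ana", "Ben", "Cy", "Di"),
  score = c(90, 72, 85, 60),
  year  = c(1, 2, 2, 1)
)

result <- students %>%
  filter(score >= 70) %>%       # keep rows where the condition is TRUE
  select(-year) %>%             # drop the year variable
  mutate(grade = case_when(     # create a new, recoded variable
    score >= 85 ~ "high",
    TRUE        ~ "pass"
  )) %>%
  rename(student = name) %>%    # new name = old name
  arrange(desc(score)) %>%      # order rows, highest score first
  slice(1:2)                    # keep the first two rows

result
```

Each step takes the data frame produced by the previous step, which is why this style is called "one-stream" manipulation.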

Logical operators

Especially in filtering, you are likely to use the comparison operators ==, !=, >, <, >=, <=, which make a comparison and return TRUE or FALSE. Comparisons can be combined with & (both conditions must be met) or | (either condition may be met). The %in% operator returns TRUE if the left-hand side element is found among the right-hand side elements.
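
A small base R illustration of these operators on a made-up vector, `ages`:

```r
ages <- c(15, 22, 37, 64)

is_22      <- ages == 22               # FALSE  TRUE FALSE FALSE
not_22     <- ages != 22               #  TRUE FALSE  TRUE  TRUE
working    <- ages >= 18 & ages < 65   # FALSE  TRUE  TRUE  TRUE
young_old  <- ages < 18 | ages > 60    #  TRUE FALSE FALSE  TRUE
in_set     <- ages %in% c(22, 64)      # FALSE  TRUE FALSE  TRUE
```

The same expressions can be placed inside filter() to keep only the rows for which they return TRUE.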

Summarizing

You may be interested in summaries of variables in your data, or in summaries of those variables within different groups.

function	action
summarize()	summarize the data, by groups if they have been declared
group_by()	declare subsets in data
distinct()	return only rows that are unique
tally() & count()	count rows (by groups if group_by() has been applied)
n()	return the number of rows
across()	apply the same summary to several variables

Base-R functions that you might use in summarizing include mean(), median(), sd(), IQR(), min(), and max().

The function summary() can also be used to request summary statistics for an entire data set.
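
A minimal sketch of the summarizing functions, using a hypothetical `scores` data frame invented for this example:

```r
library(dplyr)

# Hypothetical toy data
scores <- data.frame(
  group = c("a", "a", "b", "b", "b"),
  score = c(10, 20, 30, 40, 50),
  hours = c(1, 3, 2, 4, 6)
)

# Declare groups, then summarize within each group
by_group <- scores %>%
  group_by(group) %>%
  summarize(
    n_rows     = n(),          # number of rows in each group
    mean_score = mean(score),
    max_score  = max(score)
  )

# count() is a shortcut for group_by() followed by tally()
group_counts <- scores %>% count(group)

# across() applies the same function to several variables at once
group_means <- scores %>%
  group_by(group) %>%
  summarize(across(c(score, hours), mean))

# distinct() keeps unique rows (here, unique values of group)
unique_groups <- scores %>% distinct(group)

# Base R overview of every column
summary(scores)
```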

Shape transformation (wide <-> long)

function	action
pivot_longer()	from wide to long
pivot_wider()	from long to wide
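
The round trip between the two shapes might look like the sketch below. The `wide` data frame and its column names are made up for illustration.

```r
library(tidyr)

# Hypothetical wide data: one column per year
wide <- data.frame(
  country = c("A", "B"),
  y2019   = c(1, 3),
  y2020   = c(2, 4)
)

# Wide to long: the year columns become name/value pairs
long <- pivot_longer(
  wide,
  cols      = c(y2019, y2020),
  names_to  = "year",
  values_to = "value"
)

# Long to wide: spread the pairs back into one column per year
wide_again <- pivot_wider(long, names_from = year, values_from = value)
```

Here `long` has four rows (one per country and year), and `wide_again` has the same two rows and three columns as `wide`.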

Bringing streams together: Joins and Binds

function	action
full_join()	keeps all rows
inner_join()	keeps only rows whose id appears in both data sets
left_join()	keeps all left-hand side rows
right_join()	keeps all right-hand side rows
anti_join()	removes left-hand side rows that have a match on the right-hand side
crossing()	pairs each left-hand side row with every row of the right-hand side
bind_rows()	stack datasets, matching columns by name
bind_cols()	glue datasets together side-by-side
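
A sketch of how the joins and binds differ, using two hypothetical data frames, `left` and `right`, that share an `id` column:

```r
library(dplyr)
library(tidyr)  # crossing() comes from tidyr

# Hypothetical toy data
left  <- data.frame(id = c(1, 2, 3), x = c("a", "b", "c"))
right <- data.frame(id = c(2, 3, 4), y = c("B", "C", "D"))

full  <- full_join(left, right, by = "id")   # ids 1, 2, 3, 4
inner <- inner_join(left, right, by = "id")  # ids 2, 3
lhs   <- left_join(left, right, by = "id")   # ids 1, 2, 3
rhs   <- right_join(left, right, by = "id")  # ids 2, 3, 4
anti  <- anti_join(left, right, by = "id")   # id 1 only

# Every left row paired with every right row (3 x 3 = 9 rows);
# only y is kept from right to avoid a duplicated id column
crossed <- crossing(left, select(right, y))

# Stack rows (columns matched by name) or glue columns side by side
stacked <- bind_rows(left, data.frame(id = 5, x = "e"))
glued   <- bind_cols(left, select(right, y))
```

Note that bind_cols() matches rows by position only, so it is the caller's job to make sure the rows line up.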

Manipulation with data.table

{data.table} is a data manipulation package. It’s “blazing fast” and very popular, but this section is under construction, so its examples are limited.
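
For a first taste, here is a minimal sketch of the `dt[i, j, by]` form on a made-up table. It is not drawn from the flipbook.

```r
library(data.table)

# Hypothetical toy data
dt <- data.table(
  group = c("a", "a", "b", "b"),
  value = c(1, 2, 3, 4)
)

big     <- dt[value > 1]                                  # i: filter rows
only_v  <- dt[, .(value)]                                 # j: select columns
dt[, doubled := value * 2]                                # := adds a column by reference
by_grp  <- dt[, .(mean_value = mean(value)), by = group]  # by: summarize within groups
```

The := operator modifies `dt` in place rather than returning a copy, which is part of what makes the package fast.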

Manipulation with base R and logical indexing

You can also do a lot of data manipulation without any external packages at all.
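
A short sketch of the same kinds of steps in base R, using logical indexing on a hypothetical data frame `df`:

```r
# Hypothetical toy data
df <- data.frame(
  name  = c("Ana", "Ben", "Cy"),
  score = c(90, 72, 85)
)

high    <- df[df$score >= 80, ]   # keep rows where the logical index is TRUE
names   <- df[, "name"]           # pull out one column as a vector
df$passed <- df$score >= 75       # create a new variable
ordered <- df[order(df$score), ]  # order rows by a variable
no_first <- df[-1, ]              # drop the first row by position
```

Inside the square brackets, the part before the comma selects rows and the part after it selects columns.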

Data Cleaning Examples & Intro to String Manipulation, Country Identifiers, and Date Manipulation

What’s the workflow from reading in data to having it analysis-ready? It really depends! Data cleaning is hard to teach because data can be messy/untidy in a lot of different ways. The data cleaning flipbook will walk you through some real-life examples, with special focus on string manipulation, country codes, and date manipulation.

Regular expressions, character string manipulation with {stringr}
Unique identifiers for countries with {countrycode}
Date manipulation with {lubridate}
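
To give a flavor of the three packages, here is a small sketch on made-up inputs. The vectors `raw_names` and `raw_dates` are hypothetical, and the flipbook's own examples may differ.

```r
library(stringr)
library(countrycode)
library(lubridate)

# String manipulation: trim whitespace and standardize capitalization
raw_names   <- c(" united states", "Germany ", "FRANCE")
clean_names <- str_to_title(str_trim(raw_names))  # "United States" "Germany" "France"
starts_g    <- str_detect(clean_names, "^G")      # regular expression: starts with "G"

# Country identifiers: convert country names to ISO 3-letter codes
iso3 <- countrycode(clean_names, origin = "country.name", destination = "iso3c")

# Date manipulation: parse month/day/year strings, then extract parts
raw_dates <- c("01/15/2020", "12/31/2021")
dates     <- mdy(raw_dates)
years     <- year(dates)    # 2020 2021
months    <- month(dates)   # 1 12
```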

Data Manipulation

Gina Reynolds