This section covers some of the very basics, including:
- value types: numeric, character, integer, date, logical, factor
- arithmetic operators: +, -, /, (), ^
- data structures: scalar, vector, matrix, data frame (with rows and columns)
- objects and assignment: =, <-, ->
- functions: mean(), median(), mode(), log10()
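
The items above can be tried out in a few lines of base R. This is a minimal sketch; all object names and values are invented for illustration. Note that base R's mode() reports how an object is stored, not the most frequent value.

```r
# Value types (all values here are made up for illustration)
x_num  <- 3.14                              # numeric
x_int  <- 2L                                # integer
x_chr  <- "hello"                           # character
x_lgl  <- TRUE                              # logical
x_date <- as.Date("2024-01-15")             # date
x_fct  <- factor(c("low", "high", "low"))   # factor

# Arithmetic: parentheses control the order of operations
result <- (1 + 2)^2 / 3 - 1                 # 9 / 3 - 1 = 2

# Assignment: all three forms create an object
a = 1
b <- 2
3 -> d

# Data structures
v  <- c(2, 4, 4, 10)                        # vector
m  <- matrix(1:6, nrow = 2)                 # matrix with 2 rows, 3 columns
df <- data.frame(id = 1:3, score = c(10, 100, 1000))  # data frame

# Functions
mean(v)           # 5
median(v)         # 4
log10(df$score)   # 1 2 3
mode(v)           # "numeric" (storage mode, not the statistical mode)
```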
The {tidyverse} data manipulation functions have been a boon to analysts’ productivity. The {tidyverse} is an open source project in R led by Hadley Wickham and supported by RStudio; it contains several packages designed to work together in a consistent, logical, and human-friendly fashion, including {dplyr} and {tidyr}. For most of the work that follows, you’ll need to have the tidyverse attached with library(tidyverse).
function | action |
---|---|
filter() | keep rows (if true) |
select() | keep variables (or drop them with -var) |
mutate() | create a new variable |
case_when() | recode a variable; often used with mutate() |
rename() | rename variables |
arrange() | order rows based on a variable |
slice() | keep or drop rows based on row number |
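
As a rough illustration of how the verbs in the table chain together, here is a minimal sketch. The `patients` data, its column names, and the age bands are hypothetical, made up for this example only.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration
patients <- tibble(
  id        = 1:5,
  age       = c(34, 61, 47, 25, 70),
  weight_kg = c(70, 82, 95, 58, 77),
  height_m  = c(1.75, 1.80, 1.70, 1.60, 1.68)
)

cleaned <- patients %>%
  filter(age >= 30) %>%                      # keep rows where the condition is TRUE
  mutate(
    bmi = weight_kg / height_m^2,            # create a new variable
    age_group = case_when(                   # recode age into bands
      age < 50 ~ "30-49",
      age < 65 ~ "50-64",
      TRUE     ~ "65+"
    )
  ) %>%
  select(-weight_kg, -height_m) %>%          # drop variables with a minus sign
  rename(patient_id = id) %>%                # new_name = old_name
  arrange(desc(age)) %>%                     # order rows, oldest first
  slice(1:3)                                 # keep the first three rows

cleaned
```

Each step takes a data frame and returns a data frame, which is why the verbs can be piped one after another.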
Especially in filtering, you are likely to use the comparison operators ==, !=, >, <, >=, <=, which make a comparison and return TRUE or FALSE. These can also be combined with & (both conditions must be met) or | (either condition may be met). The %in% operator returns TRUE if the left-hand side element is found among the right-hand side elements.
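
The sketch below shows these operators inside filter(). The `visits` data and its columns are hypothetical, invented for this example.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration
visits <- tibble(
  site  = c("A", "B", "C", "A", "D"),
  year  = c(2019, 2020, 2021, 2022, 2022),
  count = c(12, 0, 7, 30, 5)
)

# & : both conditions must be met
both <- visits %>% filter(year >= 2020 & count > 0)

# | : either condition may be met
either <- visits %>% filter(site == "A" | count == 0)

# %in% : TRUE when the left-hand value is found on the right-hand side
listed <- visits %>% filter(site %in% c("A", "C"))

# != : everything except site D
not_d <- visits %>% filter(site != "D")
```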
You may be interested in summaries of variables in your data — or perhaps knowing the summaries for variables within different groups.
function | action |
---|---|
summarize() | summarize the data, by groups if they have been declared |
group_by() | declare subsets in data |
distinct() | returns only rows that are unique |
tally() & count() | counting (by groups if group_by() applied) |
n() | return number of rows |
across() | apply the same summary to several variables |
Base-R functions that you might use in summarizing include mean(), median(), sd(), IQR(), min(), max(), etc.
The function summary() can also be used to request summary statistics for an entire data set.
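
A minimal sketch of grouped summaries follows, assuming a hypothetical `scores` data set whose names and values are invented for illustration.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration
scores <- tibble(
  group = c("a", "a", "b", "b", "b"),
  x     = c(1, 3, 2, 4, 6),
  y     = c(10, 20, 30, 40, 50)
)

# One row per group: number of rows, mean and standard deviation of x
by_group <- scores %>%
  group_by(group) %>%
  summarize(n = n(), mean_x = mean(x), sd_x = sd(x))

# across() applies the same function to several variables at once
means <- scores %>%
  group_by(group) %>%
  summarize(across(c(x, y), mean))

# count() is a shortcut for group_by() followed by tally()
counts <- scores %>% count(group)

# distinct() keeps only unique rows (here, unique values of group)
unique_groups <- scores %>% distinct(group)

# Base R overview of every column
summary(scores)
```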
function | action |
---|---|
pivot_longer() | from wide to long |
pivot_wider() | from long to wide |
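
To make the two directions concrete, here is a small sketch of a round trip from wide to long and back. The `wide` data and its column names are hypothetical.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide data: one column per year
wide <- tibble(
  id         = 1:2,
  score_2021 = c(5, 7),
  score_2022 = c(6, 9)
)

# Wide to long: one row per id-year combination
long <- wide %>%
  pivot_longer(
    cols         = starts_with("score_"),
    names_to     = "year",
    names_prefix = "score_",   # strip the prefix from the new year values
    values_to    = "score"
  )

# Long to wide: back to one column per year
back <- long %>%
  pivot_wider(
    names_from   = year,
    values_from  = score,
    names_prefix = "score_"    # put the prefix back on the column names
  )
```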
function | action |
---|---|
full_join() | keeps all rows |
inner_join() | keeps only rows with an id found in both data sets |
left_join() | keeps all left-hand side rows |
right_join() | keeps all right-hand side rows |
anti_join() | keeps left-hand side rows that have no match on the right-hand side |
crossing() | for each left-hand side row, include the entire data set of the right-hand side |
bind_rows() | stack datasets, finding consistent column names |
bind_cols() | glue datasets together side-by-side |
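
The differences between the joins are easiest to see on two tiny tables that only partly overlap. The sketch below uses hypothetical `people` and `visits` data, invented for illustration.

```r
library(dplyr)
library(tidyr)   # for crossing()

# Hypothetical toy data: ids 2 and 3 appear in both tables
people <- tibble(id = c(1, 2, 3), name = c("Ana", "Ben", "Cy"))
visits <- tibble(id = c(2, 3, 4), visits = c(5, 1, 8))

full  <- full_join(people, visits, by = "id")    # ids 1, 2, 3, 4
inner <- inner_join(people, visits, by = "id")   # ids 2, 3
left  <- left_join(people, visits, by = "id")    # ids 1, 2, 3
right <- right_join(people, visits, by = "id")   # ids 2, 3, 4
anti  <- anti_join(people, visits, by = "id")    # id 1 only

# Every row of people paired with every row of a second table
crossed <- crossing(people, tibble(wave = 1:2))

# Stack rows; columns are matched by name and gaps are filled with NA
stacked <- bind_rows(people, visits)

# Glue columns side by side; rows are matched by position, not by id
side <- bind_cols(people, visits %>% select(visits))
```

Because bind_cols() pairs rows purely by position, a join is usually the safer choice when the two tables share an id.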
{data.table} is a data manipulation package. It’s “blazing fast” and very popular, but this section is under construction, so its examples are limited.
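
For a first taste of the syntax, here is a minimal sketch, assuming the {data.table} package is installed. The `dt` object and its columns are invented for illustration.

```r
library(data.table)

# Hypothetical toy data, invented for illustration
dt <- data.table(
  group = c("a", "a", "b", "b", "b"),
  x     = c(1, 3, 2, 4, 6)
)

# General form is dt[i, j, by]: filter rows, compute on columns, by group
big <- dt[x > 1]                      # keep rows where x > 1

dt[, x_sq := x^2]                     # add a column by reference (no copy)

# Grouped summary; .N is the number of rows in each group
res <- dt[, .(mean_x = mean(x), n = .N), by = group]
```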
You can also do a lot of data manipulation without any external packages at all.
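
As a rough sketch of what that looks like, the base R code below filters, selects, creates, orders, summarizes and merges using a hypothetical data frame invented for illustration.

```r
# Hypothetical toy data, invented for illustration
df <- data.frame(
  group = c("a", "a", "b", "b", "b"),
  x     = c(1, 3, 2, 4, 6),
  y     = c(10, 20, 30, 40, 50)
)

kept <- df[df$x > 1, ]                 # keep rows where x > 1
kept <- kept[, c("group", "x")]        # keep two variables
kept$x_sq <- kept$x^2                  # create a new variable
kept <- kept[order(-kept$x), ]         # order rows, largest x first

# Grouped summary: mean of x within each group
group_means <- aggregate(x ~ group, data = df, FUN = mean)

# Join on a shared key with merge()
labels <- data.frame(group = c("a", "b"), label = c("Alpha", "Beta"))
merged <- merge(df, labels, by = "group")
```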
What’s the workflow from reading in data to analysis-ready? It really depends! Data cleaning is hard to teach because data can be messy/untidy in a lot of different ways. The data cleaning flipbook will walk you through some real-life examples, with special focus on string manipulation, country codes and date manipulation.
{stringr}
{countrycode}
{lubridate}
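
A minimal sketch of the kind of cleaning these packages support is shown below. All input values are made up, and the {countrycode} step only runs if that package happens to be installed.

```r
library(stringr)
library(lubridate)

# Inconsistent spelling and stray whitespace (invented values)
raw_names   <- c("  Kenya ", "kenya", "KENYA  ")
clean_names <- str_to_title(str_squish(raw_names))   # trim, then fix case

# Pattern matching and replacement
is_site <- str_detect(c("site_01", "lab_02"), "^site")
renamed <- str_replace("site_01", "_", "-")

# Dates arriving as text in different layouts
d1 <- ymd("2021-03-05")      # year-month-day
d2 <- dmy("05/03/2021")      # day-month-year, same date as d1
yr <- year(d1)
mo <- month(d1)

# Country names to ISO3 codes, only if {countrycode} is available
if (requireNamespace("countrycode", quietly = TRUE)) {
  iso3 <- countrycode::countrycode(
    c("Kenya", "Germany"),
    origin      = "country.name",
    destination = "iso3c"
  )
}
```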
You might also be interested in related tools for data visualization and statistical analysis.