This section covers some of the basics, including:

- *value types*: numeric, character, integer, date, logical, factor
- *arithmetic operators*: `+`, `-`, `*`, `/`, `()`, `^`
- *data structures*: scalar, vector, matrix, data frame (with rows and columns)
- *objects and assignment*: `=`, `<-`, `->`
- *functions*: `mean()`, `median()`, `mode()`, `log10()`
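The basics listed above can be tried in a few lines of base R. This is only a minimal sketch; the object names (`x`, `scores`, `df`, and so on) are made up for illustration.

```r
# Assignment: `<-` is the conventional operator; `=` and `->` also work
x <- 10
y = 3
x + y -> total                 # right-hand assignment; total is 13

# Arithmetic with parentheses and exponentiation
result <- (x - y) / 7 + 2^3    # (10 - 3) / 7 + 8 = 9

# A numeric vector and some functions applied to it
scores <- c(1, 10, 100, 1000)
mean(scores)                   # 277.75
median(scores)                 # 55
log10(scores)                  # 0 1 2 3
mode(scores)                   # "numeric": mode() reports the storage type,
                               # not the statistical mode

# Other value types
name   <- "Ada"                        # character
count  <- 5L                           # integer
is_big <- x > 5                        # logical
day    <- as.Date("2024-01-15")        # date
group  <- factor(c("a", "b", "a"))     # factor with levels "a" and "b"

# A data frame has rows and columns
df <- data.frame(id = 1:3, group = group)
dim(df)                        # 3 2
```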


The `{tidyverse}` data manipulation functions have been a boon to analysts’ productivity. The `{tidyverse}` is an open source project in R led by Hadley Wickham and supported by RStudio. It contains several packages designed to work together in a consistent, logical, and human-friendly fashion, including `{dplyr}` and `{tidyr}`. For most of the work that follows, you’ll need to have the tidyverse attached with `library(tidyverse)`.

| function | action |
|---|---|
| `filter()` | keep rows (if true) |
| `select()` | keep variables (or drop them with `-var`) |
| `mutate()` | create a new variable |
| `case_when()` | recode a variable; often used with `mutate()` |
| `rename()` | rename variables |
| `arrange()` | order rows based on a variable |
| `slice()` | keep or drop rows based on row number |
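As a rough illustration of how these verbs chain together, here is a small sketch. The `people` data and its column names are hypothetical, and `{dplyr}` is assumed to be installed.

```r
library(dplyr)

# Hypothetical example data
people <- tibble(
  name   = c("Ana", "Ben", "Cy", "Dee"),
  age    = c(34, 17, 52, 25),
  height = c(165, 180, 172, 158)
)

adults <- people %>%
  filter(age >= 18) %>%               # keep rows where the condition is TRUE
  select(-height) %>%                 # drop a variable with `-`
  mutate(age_group = case_when(       # create a new variable by recoding age
    age < 30 ~ "under 30",
    age < 50 ~ "30 to 49",
    TRUE     ~ "50 plus"
  )) %>%
  rename(first_name = name) %>%       # new name on the left, old on the right
  arrange(age) %>%                    # order rows by age, ascending
  slice(1:2)                          # keep the first two rows

adults
```

With this data, `adults` should hold Dee ("under 30") and Ana ("30 to 49"), in that order.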

Especially in filtering, you are likely to use the Boolean operators `==`, `!=`, `>`, `<`, `>=`, `<=`, which make a comparison and return `TRUE` or `FALSE`. These can also be combined with `&` (both conditions must be met), `|` (either condition may be met), or `%in%` (if the left-hand side element is found *among* the right-hand side elements, `TRUE` is returned).
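A short sketch of these operators, first on a plain vector and then inside `filter()`. The `ages` vector and `visits` data are invented for the example.

```r
library(dplyr)

ages <- c(15, 22, 40)
ages >= 18                    # FALSE  TRUE  TRUE
ages >= 18 & ages < 30        # FALSE  TRUE FALSE
ages < 18 | ages > 30         #  TRUE FALSE  TRUE
"b" %in% c("a", "b", "c")     # TRUE

# Hypothetical data
visits <- tibble(
  country = c("Kenya", "Peru", "Japan", "Chile"),
  year    = c(2019, 2021, 2021, 2018)
)

# Keep rows whose country is among the listed ones AND whose year is not 2018
recent_south_america <- visits %>%
  filter(country %in% c("Peru", "Chile") & year != 2018)
```

Here only the Peru row satisfies both conditions.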

You may be interested in summaries of variables in your data, or perhaps in summaries of variables within different groups.

| function | action |
|---|---|
| `summarize()` | summarize the data, by groups if they have been declared |
| `group_by()` | declare subsets in data |
| `distinct()` | returns only rows that are unique |
| `tally()` & `count()` | counting (by groups if `group_by()` applied) |
| `n()` | return number of rows |
| `across()` | summarize a number of variables |

Base-R functions that you might use in summarizing include `mean()`, `median()`, `sd()`, `IQR()`, `min()`, `max()`, etc.

The function `summary()` can also be used to request summary statistics for an entire data set.
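The summarizing functions above might be combined as in the following sketch. The `sales` data is hypothetical, and `across()` assumes a reasonably recent `{dplyr}` (1.0.0 or later).

```r
library(dplyr)

# Hypothetical example data
sales <- tibble(
  region = c("north", "north", "south", "south", "south"),
  units  = c(10, 20, 5, 5, 20),
  price  = c(2, 4, 1, 3, 5)
)

# One row per region, with several summaries
by_region <- sales %>%
  group_by(region) %>%
  summarize(
    n_rows     = n(),            # number of rows in each group
    mean_units = mean(units),
    max_units  = max(units)
  )

# count() is a shortcut for group_by() followed by tally()
region_counts <- count(sales, region)

# distinct() keeps unique combinations of the listed variables
unique_pairs <- distinct(sales, region, units)

# across() applies the same summary to several variables at once
means <- sales %>%
  group_by(region) %>%
  summarize(across(c(units, price), mean))

# Base-R overview of every column
summary(sales)
```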

| function | action |
|---|---|
| `pivot_longer()` | from wide to long |
| `pivot_wider()` | from long to wide |
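A minimal round trip between the two shapes could look like this. The `wide` data frame and its column names are made up, and `{tidyr}` is assumed to be installed.

```r
library(tidyr)

# Hypothetical wide data: one column per year
wide <- data.frame(
  country = c("A", "B"),
  y2020   = c(1, 3),
  y2021   = c(2, 4)
)

# Wide to long: the year columns become name/value pairs
long <- pivot_longer(
  wide,
  cols      = c(y2020, y2021),
  names_to  = "year",
  values_to = "value"
)

# Long to wide: spread the pairs back out into one column per year
back <- pivot_wider(long, names_from = year, values_from = value)
```

`long` has one row per country and year (four rows here), and `back` has the same columns as `wide`.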

| function | action |
|---|---|
| `full_join()` | keeps all rows |
| `inner_join()` | keeps common id rows |
| `left_join()` | keeps all left-hand side rows |
| `right_join()` | keeps all right-hand side rows |
| `anti_join()` | removes rows if there is a match on right-hand side |
| `crossing()` | for each left-hand side row, include the entire data set of the right-hand side |
| `bind_rows()` | stack datasets, finding consistent column names |
| `bind_cols()` | glue datasets together side-by-side |
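To see how the joins differ, it can help to run them on two tiny tables. The `left` and `right` tables below are hypothetical; `crossing()` comes from `{tidyr}`, the rest from `{dplyr}`.

```r
library(dplyr)

# Hypothetical tables sharing an `id` column
left  <- tibble(id = c(1, 2, 3), x = c("a", "b", "c"))
right <- tibble(id = c(2, 3, 4), y = c("B", "C", "D"))

full  <- full_join(left, right, by = "id")    # ids 1, 2, 3, 4
inner <- inner_join(left, right, by = "id")   # ids 2, 3
lj    <- left_join(left, right, by = "id")    # ids 1, 2, 3; y is NA for id 1
rj    <- right_join(left, right, by = "id")   # ids 2, 3, 4; x is NA for id 4
anti  <- anti_join(left, right, by = "id")    # id 1 only

# Every row of `left` paired with every value of y (3 x 3 = 9 rows)
crossed <- tidyr::crossing(left, select(right, y))

# Stack rows (columns matched by name, NA where a column is absent)
stacked <- bind_rows(left, right)

# Glue columns side by side; rows are matched by position, not by id
glued <- bind_cols(left, select(right, y))
```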

`{data.table}` is a data manipulation package. It’s “blazing fast” and very popular, but this section is under construction, so its examples are limited.
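For a first taste of the syntax, here is a small sketch using the general `DT[i, j, by]` form. The data is hypothetical and `{data.table}` is assumed to be installed.

```r
library(data.table)

# Hypothetical example data
dt <- data.table(
  region = c("north", "north", "south"),
  units  = c(10, 20, 5)
)

# i: filter rows
big <- dt[units > 5]

# j with by: grouped summary
by_region <- dt[, .(mean_units = mean(units)), by = region]

# := adds a column by reference, without copying the table
dt[, double_units := units * 2]
```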

You can also do a lot of data manipulation without any external packages at all.
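As one possible illustration, the following base-R sketch filters, adds a variable, sorts, and summarizes by group. The data frames are invented for the example.

```r
# Hypothetical example data
df <- data.frame(
  name = c("Ana", "Ben", "Cy"),
  age  = c(34, 17, 52)
)

# Filter rows with a logical condition inside [ , ]
adults <- df[df$age >= 18, ]

# Create a new variable with $
adults$decade <- floor(adults$age / 10) * 10

# Order rows and keep selected columns
adults <- adults[order(adults$age), c("name", "decade")]

# Grouped summary with aggregate()
scores <- data.frame(
  group = c("a", "a", "b"),
  score = c(1, 3, 10)
)
group_means <- aggregate(score ~ group, data = scores, FUN = mean)
```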

What’s the workflow from reading in data to analysis-ready? It really depends! Data cleaning is hard to teach because data can be messy/untidy in a lot of different ways. The data cleaning flipbook will walk you through some real-life examples, with special focus on string manipulation, country codes and date manipulation.

- Regular expressions, character string manipulation with `{stringr}`
- Unique identifiers for countries with `{countrycode}`
- Date manipulation with `{lubridate}`
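A brief sketch of what each of the three packages does, on made-up messy input. The `raw` strings and the date are hypothetical, and all three packages are assumed to be installed.

```r
library(stringr)
library(countrycode)
library(lubridate)

# Hypothetical messy country names
raw <- c("  kenya ", "PERU", "Japan (2021)")

# stringr: detect digits, remove a "(year)" suffix, tidy whitespace and case
has_digit <- str_detect(raw, "\\d")                  # FALSE FALSE TRUE
no_year   <- str_remove(raw, "\\s*\\(\\d{4}\\)")
clean     <- str_to_title(str_squish(no_year))       # "Kenya" "Peru" "Japan"

# countrycode: convert country names to ISO 3-letter codes
iso3 <- countrycode(clean, origin = "country.name", destination = "iso3c")

# lubridate: parse a day/month/year string, then extract parts and shift it
d     <- dmy("15/01/2024")
yr    <- year(d)                                     # 2024
later <- d + days(30)                                # 2024-02-14
```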

You might also be interested in related tools for data visualization and statistical analysis.