data cleaning flipbook

class: left, top, inverse
background-image: url(broom.jpg)
background-size: cover

# .large[Data Cleaning]
### Walk-throughs <br>with {flipbookr}<br>and {xaringan}

#### .right[Gina Reynolds<br>Photo Credit: Zoya Loonohod]

# Introduction

Data cleaning is the pits.  Or is it?  The cleaning tools of the `tidyverse` and other packages like `janitor` might help such tasks become less terrible -- and more fun! Let's look at a couple of examples.

---

# Getting started with R and RStudio

The tools demonstrated in this book are implemented in the statistical software, R which is opensource and freely available.  RStudio, an integrated development environment (IDE), is a nice environment for working in R.  The following links may help you get R and RStudio up and running.

- [Install Windows](https://www.youtube.com/watch?v=aCRMhAWmtpw)
- [Install Mac](https://www.youtube.com/watch?v=GLLZhc_5enQ)

---

# Acknowledgements

I'm grateful to the tidyverse team for their tireless work on developing and maintaining this wonderful toolkit and to RStudio for supporting the development.  Emi Tanaka (@statsgen) and Garrick Aden-Buie (@grrrck) helped by writing code to made the flipbook style possible. I'm *very* grateful to them for their work on this as well as to Yihui Xie and others for their work on the Xaringan package --- the extraordinary platform for creating the slides for this book.  I'm thankful too to my students who refresh my perspective on teaching and learning new tools.  Finally, thanks to David who is supportive of data cleaning and of me.

---

# Preliminaries

This book covers some examples of getting data into R and cleaning it.  There are about 5 steps involved as follows:

- Step 0: Identifying data of interest
- Step 1: Saving this data to your raw_data folder
- Step 2: Scanning the data into R and assigning an object like *the_data_raw*
- Step 3: Cleaning the data and assigning to an object like *the_data*
- Step 4: Saving out the clean data in a data_products folder
- Step 5: Loading your clean data - ready for analysis!

This workflow assumes that you are using a project structure.  So let's make sure that you have that set up.

---

# Open your project in RStudio

It's helpful to use RStudio projects. You should have a final_project created already.  In RStudio, you can navigate to it from the right hand corner icon.  In the example, here, I'm currently in a project called `flipbooks` but I need to navigate to the project in the list called `final_project`. You should navigate there too!

![](figures/navigate_to_project.png)

---

# Project structure

![](figures/project_file_structure.png)

Steps 1-4 (maybe just 2-4) will happen in the 01_cleaning_data.Rmd, and Step 5 will happen in 02_analysis.Rmd.

---

# Project structure

Another way to show this is with a directory tree as shown here:

```
final_project
├── 01_data_cleaning.Rmd   <- work files
├── 02_analysis.Rmd        <- work files
├── final_project.Rproj    <- you declared your project
├── raw_data        <- Folder
├── data_products   <- Folder
├── figures         <- Folder
├── literature      <- Folder
```

---

# Example 1: Fisheries data from Wikipedia

This book will cover a few examples.  The first is about fisheries around the world.  A clean version of this data is used in a lab that we might do in the future about conflict over resources; the lab is adapted from a lab originally written by Angela Zoss and Eric Monson.

---

## Step 0: find data!

The date for this first example comes from the second table on the Wikipedia page [Fishing Industry by Country](https://en.wikipedia.org/wiki/Fishing_industry_by_country).

---

## Step 1: get data

The first step saving your file of interest  to your raw data folder.  You can do this manually or use *download.file()* with a web address, as we do here.

```r
download.file(url = "https://en.wikipedia.org/wiki/Fishing_industry_by_country", 
              destfile = "raw_data/wikipedia_fisheries.html")
```

---

## Step 2:  scan in data and create an object *like `fisheries_2015_raw`*

Once you have saved your raw data, you need to scan/read the data into R.  There are a variety of functions to do this depending on your data type.  In this example, we are wanting to read in a table embedded in an html file.  We can read in the data with the htmltab function from the htmltab package.

---

```r
# using the htmltab package to scan html file
*htmltab::htmltab(doc = "raw_data/wikipedia_fisheries.html",
# the second table object is what we want
*                which = 2)
```
]
 
.panel2-read_in_fisheries-auto[

```
                               Country    Capture Aquaculture       Total
  2                              China 17,800,000  63,700,000  81,500,000
  3                          Indonesia  6,584,419  16,600,000  23,200,000
  4                              India  5,082,332   5,703,002  10,800,000
  5                            Vietnam  2,785,940   3,634,531   6,420,471
  6                      United States  4,931,017     444,369   5,375,386
  7                             Russia  4,773,413     173,840   4,947,253
  8                              Japan  3,275,263   1,067,994   4,343,257
  9                        Philippines  2,027,992   2,200,914   4,228,906
  10                              Peru  3,811,802     100,187   3,911,989
  11                        Bangladesh  1,674,770   2,203,554   3,878,324
  12                            Norway  2,203,360   1,326,216   3,529,576
  13                       South Korea  1,395,951   1,859,220   3,255,171
  14                           Myanmar  2,072,390   1,017,644   3,090,034
  15                             Chile  1,829,238   1,050,117   2,879,355
  16                          Thailand  1,530,583     962,571   2,493,154
  17                          Malaysia  1,584,371     407,887   1,992,258
  18                            Mexico  1,524,467     221,328   1,745,795
  19                             Egypt    335,614   1,370,660   1,706,274
  20                           Morocco  1,454,105       1,142   1,455,247
  21                            Brazil    705,000     581,230   1,286,230
  22                             Spain    915,137     283,831   1,198,968
  23                           Ecuador    715,495     451,090   1,166,585
  24                           Iceland  1,085,176      15,129   1,100,305
  25                              Iran    695,407     398,129   1,093,536
  26                            Canada    874,727     200,765   1,075,492
  27                           Nigeria    734,731     306,727   1,041,458
  28                    United Kingdom    702,405     194,492     896,897
  29                          Cambodia    629,950     172,500     802,450
  30                       North Korea    209,000     554,100     763,100
  31                         Argentina    755,226       3,673     758,900
  32                            France    561,173     166,640     727,813
  33                           Denmark    670,344      36,337     706,681
  34                          Pakistan    513,156     156,430     669,586
  35                     Faroe Islands    568,435      83,300     651,735
  36                      South Africa    622,070       7,994     630,064
  37                        Mauritania    609,754           0     609,754
  38                            Turkey    335,326     250,331     585,657
  39                         Sri Lanka    521,637      30,974     552,611
  40                       New Zealand    424,791     109,016     533,807
  41                            Uganda    389,244     118,051     507,295
  42                           Namibia    503,878         591     504,469
  43                            Angola    486,490         655     487,145
  44                           Senegal    474,162       2,079     476,241
  45                       Netherlands    370,274      62,940     433,214
  46                          Tanzania    370,966      12,547     383,514
  47                             Ghana    327,457      52,480     379,937
  48                             Italy    198,130     157,109     355,239
  49                  Papua New Guinea    309,245       6,200     315,445
  50                           Germany    271,185      41,721     312,906
  51                         Venezuela    284,175      25,998     310,173
  52                        Mozambique    299,591       1,180     300,771
  53                           Ireland    259,845      40,244     300,089
  54                              Oman    279,606         103     279,709
  55                         Greenland    273,175           0     273,175
  56                         Australia    174,629      96,847     271,477
  57                            Poland    218,115      38,300     256,415
  58  Democratic Republic of the Congo    237,372       3,161     240,533
  59                          Cameroon    233,190       2,315     235,505
  60                            Sweden    208,783      15,747     224,530
  61                           Finland    192,065      14,412     206,477
  62                      Sierra Leone    202,100          75     202,175
  63                            Greece     76,362     123,410     199,772
  64                          Portugal    186,950       9,787     196,737
  65                             Kenya    171,391      15,360     186,751
  66                          Colombia     86,344      96,970     183,314
  67                              Laos     70,915     109,835     180,750
  68                          Kiribati    172,822       3,652     176,474
  69                        Madagascar    142,333      25,998     168,331
  70                            Malawi    152,852       7,646     160,498
  71                             Yemen    154,450           0     154,450
  72                            Panama    144,450       8,808     153,258
  73                         Hong Kong    142,775       4,258     147,033
  74                           Tunisia    115,064      16,165     131,229
  75                          Maldives    129,191           0     129,191
  76                            Guinea    128,000         250     128,250
  77                        Seychelles    127,128           0     127,128
  78                            Latvia    114,806         788     115,594
  79                            Zambia     83,918      30,285     114,203
  80                         Lithuania    106,945       4,393     111,338
  81                              Chad    110,000          94     110,094
  82                      Saudi Arabia     68,082      39,920     108,002
  83                              Mali    102,486       4,194     106,680
  84                           Ukraine     75,743      21,425      97,168
  85                           Algeria     95,000       1,361      96,361
  86                            Belize     91,432         953      92,385
  87    Federated States of Micronesia     88,397           0      88,397
  88                           Croatia     72,312      15,805      88,116
  89             Republic of the Congo     86,748         177      86,925
  90                   Solomon Islands     66,445      10,582      77,027
  91                           Estonia     75,931         868      76,799
  92              United Arab Emirates     73,000       1,241      74,241
  93                       Ivory Coast     67,500       4,701      72,201
  94                             Nepal     21,500      49,043      70,543
  95                         Nicaragua     45,500      22,530      68,030
  96             Saint Kitts and Nevis     65,734           1      65,735
  97                        Uzbekistan     27,267      38,055      65,322
  98                  Marshall Islands     64,795           5      64,800
  99                          Honduras     10,600      53,100      63,700
  100                      El Salvador     54,084       7,956      62,040
  101                           Gambia     58,261           5      58,296
  102                             Iraq     28,000      28,835      56,835
  103                            Benin     49,806       3,080      52,886
  104                             Cuba     23,574      29,185      52,759
  105                          Uruguay     51,500          70      51,570
  106                         Suriname     47,013         102      47,115
  107                         Ethiopia     45,500          95      45,595
  108                             Fiji     44,663         754      45,417
  109                        Guatemala     19,011      26,268      45,279
  110                          Vanuatu     44,002          16      44,018
  111                       Kazakhstan     41,335       1,878      43,213
  112                           Guyana     42,142         337      42,478
  113                            Sudan     33,002       4,500      37,502
  114                       Costa Rica     14,750      22,421      37,171
  115                          Curaçao     35,534           0      35,534
  116                      South Sudan     35,000          20      35,020
  117                            Niger     34,592         300      34,892
  118                             Togo     31,891          98      31,989
  119                            Gabon     31,000          45      31,045
  120                          Georgia     30,078         670      30,748
  121                            Libya     30,002          10      30,012
  122                          Somalia     30,000           0      30,000
  123         Central African Republic     29,000         190      29,190
  124                          Belgium     26,970          44      27,014
  125                           Rwanda     25,013         580      26,593
  126                         Zimbabwe     15,711      10,085      25,796
  127                         Paraguay     17,000       8,500      25,500
  128                          Romania     12,728      12,574      25,302
  129                   Czech Republic      3,507      20,952      24,459
  130                         Bulgaria      8,614      15,762      24,376
  131                          Burundi     21,805       1,330      23,135
  132 Saint Vincent and the Grenadines     23,077           0      23,077
  133                     Burkina Faso     22,070         470      22,540
  134                          Hungary      5,048      16,248      21,296
  135                           Israel      1,758      18,914      20,672
  136                          Armenia      3,758      16,381      20,139
  137                       Cape Verde     19,900           0      19,900
  138                        Mauritius     18,062       1,021      19,083
  139                            Haiti     16,510       1,220      17,730
  140                          Jamaica     16,800         620      17,420
  141               Dominican Republic     14,640       2,285      16,925
  142                          Comoros     16,407           0      16,407
  143                 French Polynesia     13,754       1,343      15,097
  144                     Turkmenistan     15,000          30      15,030
  145                          Bahrain     15,000           6      15,006
  146                          Liberia     14,700          40      14,740
  147                            Qatar     14,516          10      14,526
  148                           Brunei     13,292         948      14,240
  149              Trinidad and Tobago     13,027          11      13,038
  150                          Moldova         50      16,011      12,061
  151                          Belarus        686      11,199      11,885
  152            São Tomé and Príncipe     11,750           0      11,750
  153                          Bahamas     11,625           8      11,633
  154                          Bolivia      7,000       3,000      10,000
  155                           Serbia      2,067       6,878       8,945
  156                          Albania      7,886         950       8,836
  157                            Samoa      8,801          10       8,811
  158                            Malta      2,420       6,073       8,493
  159                           Cyprus      1,507       6,625       8,132
  160                Equatorial Guinea      8,000          15       8,015
  161                           Tuvalu      7,684           3       7,687
  162                        Singapore      1,234       6,112       7,346
  163                      Isle of Man      7,040           0       7,040
  164                            Syria      4,500       3,500       7,000
  165                    Guinea-Bissau      6,700           0       6,700
  166                           Kuwait      5,493         196       5,689
  167                    New Caledonia      3,815       1,587       5,402
  168                          Lebanon      4,291       1,015       5,306
  169           Bosnia and Herzegovina        305       4,564       4,869
  170                      Timor-Leste      3,200       1,560       4,760
  171             Jersey and  Guernsey      2,985       1,499       4,483
  172                          Eritrea      4,300           0       4,300
  173                         Slovakia      1,866       2,169       4,035
  174                          Austria        350       3,483       3,833
  175                        Palestine      3,306         280       3,586
  176                      Switzerland      1,851       1,733       3,584
  177                   American Samoa      3,047          20       3,067
  178              Antigua and Barbuda      3,000          10       3,010
  179         Turks and Caicos Islands      2,780           0       2,780
  180                          Grenada      2,550           0       2,550
  181                       Montenegro      1,595         929       2,524
  182                         Djibouti      2,220           0       2,220
  183                      Afghanistan      1,000       1,200       2,200
  184                         Slovenia        311       1,844       2,154
  185                      Saint Lucia      2,097          32       2,129
  186                       Kyrgyzstan         89       1,931       2,020
  187                      Puerto Rico      1,901          20       1,921
  188                         Barbados      1,735          26       1,761
  189                           Jordan        873         885       1,758
  190                            Tonga      1,697           3       1,700
  191                       Tajikistan      1,100         450       1,550
  192                             Guam      1,391         110       1,501
  193                            Macao      1,500           0       1,500
  194                       Azerbaijan        676         640       1,316
  195                  North Macedonia        306         986       1,292
  196           British Virgin Islands      1,200           0       1,200
  197                          Lesotho         52       1,050       1,102
  198         Northern Mariana Islands        950          42         992
  199                            Palau        818          23         841
  200                         Dominica        770           6         776
  201                US Virgin Islands        551           8         559
  202                            Nauru        530           0         530
  203                          Bermuda        410           0         410
  204                     Sint Maarten        253           0         253
  205                         Eswatini         65         100         165
  206                           Bhutan          7         150         157
  207                            Aruba        142           0         142
  208                   Cayman Islands        125           0         125
  209                           France         90           0          90
  210                         Botswana         38          15          53
  211                         Mongolia         15           0          15
  212                           Monaco          1           0           1
  213                        Gibraltar          1           0           1
  214                          Andorra          0           0           0
  215                    Liechtenstein          0           0           0
  216                       Luxembourg          0           0           0
  217                       San Marino          0           0           0
  218                     Total (2015) 93,736,944 101,084,799 199,741,129
```
]

---
count: false

```r
# using the htmltab package to scan html file
htmltab::htmltab(doc = "raw_data/wikipedia_fisheries.html",
# the second table object is what we want
                 which = 2) ->
# assign this to an object with the _raw suffix
*fisheries_2015_raw
```
]
 
.panel2-read_in_fisheries-auto[

]

---

## Step 3:  Load tidyverse, Clean data and create an cleaned object *like fisheries_2015*

We'll do most of the manipulation with dplyr and tidyr which are subpackages of the tidyverse.

---

```r
*fisheries_2015_raw
```
]
 
.panel2-clean_wikipedia_fisheries-auto[

---
count: false

```r
fisheries_2015_raw %>%
* as_tibble()   # show data in "tidyverse" way, as_tibble
```
]
 
.panel2-clean_wikipedia_fisheries-auto[

```
  # A tibble: 217 x 4
     Country       Capture    Aquaculture Total     
     <chr>         <chr>      <chr>       <chr>     
   1 China         17,800,000 63,700,000  81,500,000
   2 Indonesia     6,584,419  16,600,000  23,200,000
   3 India         5,082,332  5,703,002   10,800,000
   4 Vietnam       2,785,940  3,634,531   6,420,471 
   5 United States 4,931,017  444,369     5,375,386 
   6 Russia        4,773,413  173,840     4,947,253 
   7 Japan         3,275,263  1,067,994   4,343,257 
   8 Philippines   2,027,992  2,200,914   4,228,906 
   9 Peru          3,811,802  100,187     3,911,989 
  10 Bangladesh    1,674,770  2,203,554   3,878,324 
  # … with 207 more rows
```
]

---
count: false

```
  # A tibble: 217 x 4
     country       capture    aquaculture total     
     <chr>         <chr>      <chr>       <chr>     
   1 China         17,800,000 63,700,000  81,500,000
   2 Indonesia     6,584,419  16,600,000  23,200,000
   3 India         5,082,332  5,703,002   10,800,000
   4 Vietnam       2,785,940  3,634,531   6,420,471 
   5 United States 4,931,017  444,369     5,375,386 
   6 Russia        4,773,413  173,840     4,947,253 
   7 Japan         3,275,263  1,067,994   4,343,257 
   8 Philippines   2,027,992  2,200,914   4,228,906 
   9 Peru          3,811,802  100,187     3,911,989 
  10 Bangladesh    1,674,770  2,203,554   3,878,324 
  # … with 207 more rows
```
]

---
count: false

```
  # A tibble: 217 x 4
     country        capture aquaculture total     
     <chr>            <dbl> <chr>       <chr>     
   1 China         17800000 63,700,000  81,500,000
   2 Indonesia      6584419 16,600,000  23,200,000
   3 India          5082332 5,703,002   10,800,000
   4 Vietnam        2785940 3,634,531   6,420,471 
   5 United States  4931017 444,369     5,375,386 
   6 Russia         4773413 173,840     4,947,253 
   7 Japan          3275263 1,067,994   4,343,257 
   8 Philippines    2027992 2,200,914   4,228,906 
   9 Peru           3811802 100,187     3,911,989 
  10 Bangladesh     1674770 2,203,554   3,878,324 
  # … with 207 more rows
```
]

---
count: false

```
  # A tibble: 217 x 4
     country        capture aquaculture total     
     <chr>            <dbl>       <dbl> <chr>     
   1 China         17800000    63700000 81,500,000
   2 Indonesia      6584419    16600000 23,200,000
   3 India          5082332     5703002 10,800,000
   4 Vietnam        2785940     3634531 6,420,471 
   5 United States  4931017      444369 5,375,386 
   6 Russia         4773413      173840 4,947,253 
   7 Japan          3275263     1067994 4,343,257 
   8 Philippines    2027992     2200914 4,228,906 
   9 Peru           3811802      100187 3,911,989 
  10 Bangladesh     1674770     2203554 3,878,324 
  # … with 207 more rows
```
]

---
count: false

```
  # A tibble: 217 x 4
     country        capture aquaculture    total
     <chr>            <dbl>       <dbl>    <dbl>
   1 China         17800000    63700000 81500000
   2 Indonesia      6584419    16600000 23200000
   3 India          5082332     5703002 10800000
   4 Vietnam        2785940     3634531  6420471
   5 United States  4931017      444369  5375386
   6 Russia         4773413      173840  4947253
   7 Japan          3275263     1067994  4343257
   8 Philippines    2027992     2200914  4228906
   9 Peru           3811802      100187  3911989
  10 Bangladesh     1674770     2203554  3878324
  # … with 207 more rows
```
]

---
count: false

---
count: false

```r
fisheries_2015_raw %>%
  as_tibble() %>%  # show data in "tidyverse" way, as_tibble
# you may need to install the janitor package for this next step
# if you have not already done so
  janitor::clean_names(case = "snake") %>%  # snake is the default too
# converting columns to numeric 'dbl', which were character 'cha'
  mutate(capture =
           readr::parse_number(x = capture)) %>%
  mutate(aquaculture =
           readr::parse_number(x = aquaculture)) %>%
  mutate(total =
           readr::parse_number(x = total)) %>%
# removing white spaces before and after a string
  mutate(country = str_trim(string = country)) %>%
# creating region variable
# using countrycode function from countrycode package
# if you haven't done so,
* mutate(region =
*          countrycode::countrycode(
*            sourcevar = country,
*            origin = "country.name",
*            dest = "region"))
```
]
 
.panel2-clean_wikipedia_fisheries-auto[

```
  # A tibble: 217 x 5
     country        capture aquaculture    total region            
     <chr>            <dbl>       <dbl>    <dbl> <chr>             
   1 China         17800000    63700000 81500000 Eastern Asia      
   2 Indonesia      6584419    16600000 23200000 South-Eastern Asia
   3 India          5082332     5703002 10800000 Southern Asia     
   4 Vietnam        2785940     3634531  6420471 South-Eastern Asia
   5 United States  4931017      444369  5375386 Northern America  
   6 Russia         4773413      173840  4947253 Eastern Europe    
   7 Japan          3275263     1067994  4343257 Eastern Asia      
   8 Philippines    2027992     2200914  4228906 South-Eastern Asia
   9 Peru           3811802      100187  3911989 South America     
  10 Bangladesh     1674770     2203554  3878324 Southern Asia     
  # … with 207 more rows
```
]

---
count: false

```r
fisheries_2015_raw %>%
  as_tibble() %>%  # show data in "tidyverse" way, as_tibble
# you may need to install the janitor package for this next step
# if you have not already done so
  janitor::clean_names(case = "snake") %>%  # snake is the default too
# converting columns to numeric 'dbl', which were character 'cha'
  mutate(capture =
           readr::parse_number(x = capture)) %>%
  mutate(aquaculture =
           readr::parse_number(x = aquaculture)) %>%
  mutate(total =
           readr::parse_number(x = total)) %>%
# removing white spaces before and after a string
  mutate(country = str_trim(string = country)) %>%
# creating region variable
# using countrycode function from countrycode package
# if you haven't done so,
  mutate(region =
           countrycode::countrycode(
             sourcevar = country,
             origin = "country.name",
             dest = "region")) %>%
# country codes are useful for merging data
* mutate(iso3c =
*          countrycode::countrycode(
*            sourcevar = country,
*            origin = "country.name",
*            dest = "iso3c"))
```
]
 
.panel2-clean_wikipedia_fisheries-auto[

```
  # A tibble: 217 x 6
     country        capture aquaculture    total region             iso3c
     <chr>            <dbl>       <dbl>    <dbl> <chr>              <chr>
   1 China         17800000    63700000 81500000 Eastern Asia       CHN  
   2 Indonesia      6584419    16600000 23200000 South-Eastern Asia IDN  
   3 India          5082332     5703002 10800000 Southern Asia      IND  
   4 Vietnam        2785940     3634531  6420471 South-Eastern Asia VNM  
   5 United States  4931017      444369  5375386 Northern America   USA  
   6 Russia         4773413      173840  4947253 Eastern Europe     RUS  
   7 Japan          3275263     1067994  4343257 Eastern Asia       JPN  
   8 Philippines    2027992     2200914  4228906 South-Eastern Asia PHL  
   9 Peru           3811802      100187  3911989 South America      PER  
  10 Bangladesh     1674770     2203554  3878324 Southern Asia      BGD  
  # … with 207 more rows
```
]

---
count: false

```r
fisheries_2015_raw %>%
  as_tibble() %>%  # show data in "tidyverse" way, as_tibble
# you may need to install the janitor package for this next step
# if you have not already done so
  janitor::clean_names(case = "snake") %>%  # snake is the default too
# converting columns to numeric 'dbl', which were character 'cha'
  mutate(capture =
           readr::parse_number(x = capture)) %>%
  mutate(aquaculture =
           readr::parse_number(x = aquaculture)) %>%
  mutate(total =
           readr::parse_number(x = total)) %>%
# removing white spaces before and after a string
  mutate(country = str_trim(string = country)) %>%
# creating region variable
# using countrycode function from countrycode package
# if you haven't done so,
  mutate(region =
           countrycode::countrycode(
             sourcevar = country,
             origin = "country.name",
             dest = "region")) %>%
# country codes are useful for merging data
  mutate(iso3c =
           countrycode::countrycode(
             sourcevar = country,
             origin = "country.name",
             dest = "iso3c")) %>%
# one row is an aggregation - deleting this
* filter(country != "Total (2015)")
```
]
 
.panel2-clean_wikipedia_fisheries-auto[

```
  # A tibble: 216 x 6
     country        capture aquaculture    total region             iso3c
     <chr>            <dbl>       <dbl>    <dbl> <chr>              <chr>
   1 China         17800000    63700000 81500000 Eastern Asia       CHN  
   2 Indonesia      6584419    16600000 23200000 South-Eastern Asia IDN  
   3 India          5082332     5703002 10800000 Southern Asia      IND  
   4 Vietnam        2785940     3634531  6420471 South-Eastern Asia VNM  
   5 United States  4931017      444369  5375386 Northern America   USA  
   6 Russia         4773413      173840  4947253 Eastern Europe     RUS  
   7 Japan          3275263     1067994  4343257 Eastern Asia       JPN  
   8 Philippines    2027992     2200914  4228906 South-Eastern Asia PHL  
   9 Peru           3811802      100187  3911989 South America      PER  
  10 Bangladesh     1674770     2203554  3878324 Southern Asia      BGD  
  # … with 206 more rows
```
]

---
count: false

```r
fisheries_2015_raw %>%
  as_tibble() %>%  # show data in "tidyverse" way, as_tibble
# you may need to install the janitor package for this next step
# if you have not already done so
  janitor::clean_names(case = "snake") %>%  # snake is the default too
# converting columns to numeric 'dbl', which were character 'cha'
  mutate(capture =
           readr::parse_number(x = capture)) %>%
  mutate(aquaculture =
           readr::parse_number(x = aquaculture)) %>%
  mutate(total =
           readr::parse_number(x = total)) %>%
# removing white spaces before and after a string
  mutate(country = str_trim(string = country)) %>%
# creating region variable
# using countrycode function from countrycode package
# if you haven't done so,
  mutate(region =
           countrycode::countrycode(
             sourcevar = country,
             origin = "country.name",
             dest = "region")) %>%
# country codes are useful for merging data
  mutate(iso3c =
           countrycode::countrycode(
             sourcevar = country,
             origin = "country.name",
             dest = "iso3c")) %>%
# one row is an aggregation - deleting this
  filter(country != "Total (2015)") ->
*fisheries_2015
```
]
 
.panel2-clean_wikipedia_fisheries-auto[

]

---

## Step 4: Saving out clean data:

Now you can save out your work in your data_products folder as follows:

```r
save(fisheries_2015, file = "data_products/fisheries_2015.Rdata")
```

This will typically be the last step in your 01_data_cleaning.Rmd file.

---

## Step 5: Time for Analysis!!! Yay!

The first code chunk in 02_analysis.Rmd should look like this.  This makes the tidyverse and the cleaned data available at the start of your analysis.

```r
library(tidyverse) # for wrangling (dplyr and tidyr) and plotting (ggplot)
load("data_products/fisheries_2015.Rdata")
```

---

## Project overview

Now your'll have some files in raw_data and data_products folders:

```r
final_project
├── 01_data_cleaning.Rmd   <- work files
├── 02_analysis.Rmd        <- work files
├── final_project.Rproj    <- you declared your project
└── raw_data        <- Folder
    └── wikipedia_fisheries.html  <- New! html file, containing table
├── data_products   <- Folder
│   └── fisheries_2015.Rdata      <- New! cleaned data
├── figures         <- Folder
├── literature      <- Folder
```

---

# Example 2:

## Step 0: find data!

You might be interested in % of land area in a country that is forest use.  This is an indicator available at worldbank.org.

---

## Step 1: get data

This time, I just downloaded the data from the website, saving it to my *raw_data* folder.

---

## Step 2: scan in the data and create `wb_percent_forest_raw`

### Step 2a: Use RStudio importer tool

Navigate to your raw_data folder in the "files" pane. Then click on data and then choose "Import Dataset"

![](figures/import_dataset.png)

---

### Step 2b: Make adjustments if needed

This is what I see initially.

---

### Step 2b: Make adjustments if needed

There is 'meta data' in the first three rows.  The dataset doesn't really get start until row 4.  Therefore, in the **import options** section, I'll adjust the option for **Skip** to *3*.  I then click **update** in the righthand corner.

---

### Step 2c: Copy everything **except** View() code from **Code Preview**

Copy everything *but* View() code from **Code Preview** and paste this into a code chunk in the file `01_data_cleaning.Rmd`.

![](figures/copy_code_from_import_tool.png)

---

### Step 2c: Change default name

You'll paste code that looks like this:

```r
library(readxl)
API_AG_LND_FRST_ZS_DS2_en_excel_v2_248616 <- read_excel("raw_data/API_AG.LND.FRST.ZS_DS2_en_excel_v2_248616.xls", 
    skip = 3)
```

But this is a bit of a long object name that is automatically generated, so you might modify this to something simpler, with the '_raw' suffix.

```r
library(readxl)
wb_percent_forest_raw <- read_excel("raw_data/API_AG.LND.FRST.ZS_DS2_en_excel_v2_248616.xls", 
    skip = 3)
```

Execute the code!

---

## Step 3: Load tidyverse and clean the data

Now, starting with the object you created by scanning in the data and assigning it to the object `wb_percent_forest_raw`, you are ready to perform cleaning steps.

---

```r
*wb_percent_forest_raw
```
]
 
.panel2-wb_cleaning-auto[

```
  # A tibble: 264 x 63
     `Country Name` `Country Code` `Indicator Name` `Indicator Code` `1960` `1961`
     <chr>          <chr>          <chr>            <chr>            <lgl>  <lgl> 
   1 Aruba          ABW            Forest area (% … AG.LND.FRST.ZS   NA     NA    
   2 Afghanistan    AFG            Forest area (% … AG.LND.FRST.ZS   NA     NA    
   3 Angola         AGO            Forest area (% … AG.LND.FRST.ZS   NA     NA    
   4 Albania        ALB            Forest area (% … AG.LND.FRST.ZS   NA     NA    
   5 Andorra        AND            Forest area (% … AG.LND.FRST.ZS   NA     NA    
   6 Arab World     ARB            Forest area (% … AG.LND.FRST.ZS   NA     NA    
   7 United Arab E… ARE            Forest area (% … AG.LND.FRST.ZS   NA     NA    
   8 Argentina      ARG            Forest area (% … AG.LND.FRST.ZS   NA     NA    
   9 Armenia        ARM            Forest area (% … AG.LND.FRST.ZS   NA     NA    
  10 American Samoa ASM            Forest area (% … AG.LND.FRST.ZS   NA     NA    
  # … with 254 more rows, and 57 more variables: `1962` <lgl>, `1963` <lgl>,
  #   `1964` <lgl>, `1965` <lgl>, `1966` <lgl>, `1967` <lgl>, `1968` <lgl>,
  #   `1969` <lgl>, `1970` <lgl>, `1971` <lgl>, `1972` <lgl>, `1973` <lgl>,
  #   `1974` <lgl>, `1975` <lgl>, `1976` <lgl>, `1977` <lgl>, `1978` <lgl>,
  #   `1979` <lgl>, `1980` <lgl>, `1981` <lgl>, `1982` <lgl>, `1983` <lgl>,
  #   `1984` <lgl>, `1985` <lgl>, `1986` <lgl>, `1987` <lgl>, `1988` <lgl>,
  #   `1989` <lgl>, `1990` <dbl>, `1991` <dbl>, `1992` <dbl>, `1993` <dbl>,
  #   `1994` <dbl>, `1995` <dbl>, `1996` <dbl>, `1997` <dbl>, `1998` <dbl>,
  #   `1999` <dbl>, `2000` <dbl>, `2001` <dbl>, `2002` <dbl>, `2003` <dbl>,
  #   `2004` <dbl>, `2005` <dbl>, `2006` <dbl>, `2007` <dbl>, `2008` <dbl>,
  #   `2009` <dbl>, `2010` <dbl>, `2011` <dbl>, `2012` <dbl>, `2013` <dbl>,
  #   `2014` <dbl>, `2015` <dbl>, `2016` <dbl>, `2017` <lgl>, `2018` <lgl>
```
]

---
count: false

```r
wb_percent_forest_raw %>%
# change names to snake case
* janitor::clean_names()
```
]
 
.panel2-wb_cleaning-auto[

```
  # A tibble: 264 x 63
     country_name country_code indicator_name indicator_code x1960 x1961 x1962
     <chr>        <chr>        <chr>          <chr>          <lgl> <lgl> <lgl>
   1 Aruba        ABW          Forest area (… AG.LND.FRST.ZS NA    NA    NA   
   2 Afghanistan  AFG          Forest area (… AG.LND.FRST.ZS NA    NA    NA   
   3 Angola       AGO          Forest area (… AG.LND.FRST.ZS NA    NA    NA   
   4 Albania      ALB          Forest area (… AG.LND.FRST.ZS NA    NA    NA   
   5 Andorra      AND          Forest area (… AG.LND.FRST.ZS NA    NA    NA   
   6 Arab World   ARB          Forest area (… AG.LND.FRST.ZS NA    NA    NA   
   7 United Arab… ARE          Forest area (… AG.LND.FRST.ZS NA    NA    NA   
   8 Argentina    ARG          Forest area (… AG.LND.FRST.ZS NA    NA    NA   
   9 Armenia      ARM          Forest area (… AG.LND.FRST.ZS NA    NA    NA   
  10 American Sa… ASM          Forest area (… AG.LND.FRST.ZS NA    NA    NA   
  # … with 254 more rows, and 56 more variables: x1963 <lgl>, x1964 <lgl>,
  #   x1965 <lgl>, x1966 <lgl>, x1967 <lgl>, x1968 <lgl>, x1969 <lgl>,
  #   x1970 <lgl>, x1971 <lgl>, x1972 <lgl>, x1973 <lgl>, x1974 <lgl>,
  #   x1975 <lgl>, x1976 <lgl>, x1977 <lgl>, x1978 <lgl>, x1979 <lgl>,
  #   x1980 <lgl>, x1981 <lgl>, x1982 <lgl>, x1983 <lgl>, x1984 <lgl>,
  #   x1985 <lgl>, x1986 <lgl>, x1987 <lgl>, x1988 <lgl>, x1989 <lgl>,
  #   x1990 <dbl>, x1991 <dbl>, x1992 <dbl>, x1993 <dbl>, x1994 <dbl>,
  #   x1995 <dbl>, x1996 <dbl>, x1997 <dbl>, x1998 <dbl>, x1999 <dbl>,
  #   x2000 <dbl>, x2001 <dbl>, x2002 <dbl>, x2003 <dbl>, x2004 <dbl>,
  #   x2005 <dbl>, x2006 <dbl>, x2007 <dbl>, x2008 <dbl>, x2009 <dbl>,
  #   x2010 <dbl>, x2011 <dbl>, x2012 <dbl>, x2013 <dbl>, x2014 <dbl>,
  #   x2015 <dbl>, x2016 <dbl>, x2017 <lgl>, x2018 <lgl>
```
]

---
count: false

```r
wb_percent_forest_raw %>%
# change names to snake case
  janitor::clean_names() %>%
# get rid of columns that are not variables
# indicator name and indicator code
* select(-indicator_name, -indicator_code)
```
]
 
.panel2-wb_cleaning-auto[

```
  # A tibble: 264 x 61
     country_name country_code x1960 x1961 x1962 x1963 x1964 x1965 x1966 x1967
     <chr>        <chr>        <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl>
   1 Aruba        ABW          NA    NA    NA    NA    NA    NA    NA    NA   
   2 Afghanistan  AFG          NA    NA    NA    NA    NA    NA    NA    NA   
   3 Angola       AGO          NA    NA    NA    NA    NA    NA    NA    NA   
   4 Albania      ALB          NA    NA    NA    NA    NA    NA    NA    NA   
   5 Andorra      AND          NA    NA    NA    NA    NA    NA    NA    NA   
   6 Arab World   ARB          NA    NA    NA    NA    NA    NA    NA    NA   
   7 United Arab… ARE          NA    NA    NA    NA    NA    NA    NA    NA   
   8 Argentina    ARG          NA    NA    NA    NA    NA    NA    NA    NA   
   9 Armenia      ARM          NA    NA    NA    NA    NA    NA    NA    NA   
  10 American Sa… ASM          NA    NA    NA    NA    NA    NA    NA    NA   
  # … with 254 more rows, and 51 more variables: x1968 <lgl>, x1969 <lgl>,
  #   x1970 <lgl>, x1971 <lgl>, x1972 <lgl>, x1973 <lgl>, x1974 <lgl>,
  #   x1975 <lgl>, x1976 <lgl>, x1977 <lgl>, x1978 <lgl>, x1979 <lgl>,
  #   x1980 <lgl>, x1981 <lgl>, x1982 <lgl>, x1983 <lgl>, x1984 <lgl>,
  #   x1985 <lgl>, x1986 <lgl>, x1987 <lgl>, x1988 <lgl>, x1989 <lgl>,
  #   x1990 <dbl>, x1991 <dbl>, x1992 <dbl>, x1993 <dbl>, x1994 <dbl>,
  #   x1995 <dbl>, x1996 <dbl>, x1997 <dbl>, x1998 <dbl>, x1999 <dbl>,
  #   x2000 <dbl>, x2001 <dbl>, x2002 <dbl>, x2003 <dbl>, x2004 <dbl>,
  #   x2005 <dbl>, x2006 <dbl>, x2007 <dbl>, x2008 <dbl>, x2009 <dbl>,
  #   x2010 <dbl>, x2011 <dbl>, x2012 <dbl>, x2013 <dbl>, x2014 <dbl>,
  #   x2015 <dbl>, x2016 <dbl>, x2017 <lgl>, x2018 <lgl>
```
]

---
count: false

```
  # A tibble: 264 x 8
     country_name         country_code x2000 x2001 x2002 x2003 x2004 x2005
     <chr>                <chr>        <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
   1 Aruba                ABW           2.33  2.33  2.33  2.33  2.33  2.33
   2 Afghanistan          AFG           2.07  2.07  2.07  2.07  2.07  2.07
   3 Angola               AGO          47.9  47.8  47.7  47.6  47.5  47.4 
   4 Albania              ALB          28.1  28.2  28.3  28.4  28.5  28.6 
   5 Andorra              AND          34.0  34.0  34.0  34.0  34.0  34.0 
   6 Arab World           ARB           1.68  1.68  1.68  1.69  1.69  1.69
   7 United Arab Emirates ARE           4.36  4.37  4.38  4.38  4.39  4.39
   8 Argentina            ARG          11.6  11.5  11.4  11.3  11.2  11.0 
   9 Armenia              ARM          11.7  11.7  11.7  11.7  11.7  11.7 
  10 American Samoa       ASM          90.2  90.1  89.9  89.8  89.5  89.4 
  # … with 254 more rows
```
]

---
count: false

```
  # A tibble: 1,584 x 4
     country_name         country_code year  percent_forest
     <chr>                <chr>        <chr>          <dbl>
   1 Aruba                ABW          x2000           2.33
   2 Afghanistan          AFG          x2000           2.07
   3 Angola               AGO          x2000          47.9 
   4 Albania              ALB          x2000          28.1 
   5 Andorra              AND          x2000          34.0 
   6 Arab World           ARB          x2000           1.68
   7 United Arab Emirates ARE          x2000           4.36
   8 Argentina            ARG          x2000          11.6 
   9 Armenia              ARM          x2000          11.7 
  10 American Samoa       ASM          x2000          90.2 
  # … with 1,574 more rows
```
]

---
count: false

```
  # A tibble: 1,584 x 4
     country_name         country_code year  percent_forest
     <chr>                <chr>        <chr>          <dbl>
   1 Aruba                ABW          2000            2.33
   2 Afghanistan          AFG          2000            2.07
   3 Angola               AGO          2000           47.9 
   4 Albania              ALB          2000           28.1 
   5 Andorra              AND          2000           34.0 
   6 Arab World           ARB          2000            1.68
   7 United Arab Emirates ARE          2000            4.36
   8 Argentina            ARG          2000           11.6 
   9 Armenia              ARM          2000           11.7 
  10 American Samoa       ASM          2000           90.2 
  # … with 1,574 more rows
```
]

---
count: false

```
  # A tibble: 1,584 x 4
     country_name         country_code  year percent_forest
     <chr>                <chr>        <dbl>          <dbl>
   1 Aruba                ABW           2000           2.33
   2 Afghanistan          AFG           2000           2.07
   3 Angola               AGO           2000          47.9 
   4 Albania              ALB           2000          28.1 
   5 Andorra              AND           2000          34.0 
   6 Arab World           ARB           2000           1.68
   7 United Arab Emirates ARE           2000           4.36
   8 Argentina            ARG           2000          11.6 
   9 Armenia              ARM           2000          11.7 
  10 American Samoa       ASM           2000          90.2 
  # … with 1,574 more rows
```
]

---
count: false

```r
wb_percent_forest_raw %>%
# change names to snake case
  janitor::clean_names() %>%
# get rid of columns that are not variables
# indicator name and indicator code
  select(-indicator_name, -indicator_code) %>%
# suppose you are interested only in 2000 to 2005
  select(country_name, country_code, x2000:x2005) %>%
# data is "wide", but to use for analysis
# we transform it to "long"
  gather(key = "year",
         value = "percent_forest",
         x2000:x2005) %>%
  mutate(year = str_remove(year, "x")) %>%
  mutate(year = as.numeric(year)) %>%
# reordering data if you like
* arrange(-percent_forest)
```
]
 
.panel2-wb_cleaning-auto[

```
  # A tibble: 1,584 x 4
     country_name          country_code  year percent_forest
     <chr>                 <chr>        <dbl>          <dbl>
   1 Suriname              SUR           2000           98.7
   2 Suriname              SUR           2001           98.6
   3 Suriname              SUR           2002           98.6
   4 Suriname              SUR           2003           98.6
   5 Suriname              SUR           2004           98.6
   6 Suriname              SUR           2005           98.5
   7 Micronesia, Fed. Sts. FSM           2005           91.4
   8 Micronesia, Fed. Sts. FSM           2004           91.4
   9 Micronesia, Fed. Sts. FSM           2003           91.3
  10 Micronesia, Fed. Sts. FSM           2002           91.3
  # … with 1,574 more rows
```
]

---
count: false

```r
wb_percent_forest_raw %>%
# change names to snake case
  janitor::clean_names() %>%
# get rid of columns that are not variables
# indicator name and indicator code
  select(-indicator_name, -indicator_code) %>%
# suppose you are interested only in 2000 to 2005
  select(country_name, country_code, x2000:x2005) %>%
# data is "wide", but to use for analysis
# we transform it to "long"
  gather(key = "year",
         value = "percent_forest",
         x2000:x2005) %>%
  mutate(year = str_remove(year, "x")) %>%
  mutate(year = as.numeric(year)) %>%
# reordering data if you like
  arrange(-percent_forest) %>%
# rename variable
* rename(wb_country_code = country_code)
```
]
 
.panel2-wb_cleaning-auto[

```
  # A tibble: 1,584 x 4
     country_name          wb_country_code  year percent_forest
     <chr>                 <chr>           <dbl>          <dbl>
   1 Suriname              SUR              2000           98.7
   2 Suriname              SUR              2001           98.6
   3 Suriname              SUR              2002           98.6
   4 Suriname              SUR              2003           98.6
   5 Suriname              SUR              2004           98.6
   6 Suriname              SUR              2005           98.5
   7 Micronesia, Fed. Sts. FSM              2005           91.4
   8 Micronesia, Fed. Sts. FSM              2004           91.4
   9 Micronesia, Fed. Sts. FSM              2003           91.3
  10 Micronesia, Fed. Sts. FSM              2002           91.3
  # … with 1,574 more rows
```
]

---
count: false

```r
wb_percent_forest_raw %>%
# change names to snake case
  janitor::clean_names() %>%
# get rid of columns that are not variables
# indicator name and indicator code
  select(-indicator_name, -indicator_code) %>%
# suppose you are interested only in 2000 to 2005
  select(country_name, country_code, x2000:x2005) %>%
# data is "wide", but to use for analysis
# we transform it to "long"
  gather(key = "year",
         value = "percent_forest",
         x2000:x2005) %>%
  mutate(year = str_remove(year, "x")) %>%
  mutate(year = as.numeric(year)) %>%
# reordering data if you like
  arrange(-percent_forest) %>%
# rename variable
  rename(wb_country_code = country_code) %>%
* mutate(continent =
*          countrycode::countrycode(
*            sourcevar = wb_country_code,
*            origin = "wb",
*            destination = "continent"
*          ))
```
]
 
.panel2-wb_cleaning-auto[

```
  # A tibble: 1,584 x 5
     country_name          wb_country_code  year percent_forest continent
     <chr>                 <chr>           <dbl>          <dbl> <chr>    
   1 Suriname              SUR              2000           98.7 Americas 
   2 Suriname              SUR              2001           98.6 Americas 
   3 Suriname              SUR              2002           98.6 Americas 
   4 Suriname              SUR              2003           98.6 Americas 
   5 Suriname              SUR              2004           98.6 Americas 
   6 Suriname              SUR              2005           98.5 Americas 
   7 Micronesia, Fed. Sts. FSM              2005           91.4 Oceania  
   8 Micronesia, Fed. Sts. FSM              2004           91.4 Oceania  
   9 Micronesia, Fed. Sts. FSM              2003           91.3 Oceania  
  10 Micronesia, Fed. Sts. FSM              2002           91.3 Oceania  
  # … with 1,574 more rows
```
]

---
count: false

```r
wb_percent_forest_raw %>%
# change names to snake case
  janitor::clean_names() %>%
# get rid of columns that are not variables
# indicator name and indicator code
  select(-indicator_name, -indicator_code) %>%
# suppose you are interested only in 2000 to 2005
  select(country_name, country_code, x2000:x2005) %>%
# data is "wide", but to use for analysis
# we transform it to "long"
  gather(key = "year",
         value = "percent_forest",
         x2000:x2005) %>%
  mutate(year = str_remove(year, "x")) %>%
  mutate(year = as.numeric(year)) %>%
# reordering data if you like
  arrange(-percent_forest) %>%
# rename variable
  rename(wb_country_code = country_code) %>%
  mutate(continent =
           countrycode::countrycode(
             sourcevar = wb_country_code,
             origin = "wb",
             destination = "continent"
           )) ->
*wb_percent_forest
```
]
 
.panel2-wb_cleaning-auto[

]

---

## Step 4: Save Clean Data

```r
save(wb_percent_forest, file = "data_products/wb_percent_forest.Rdata")
```

---

## Step 5: Time for Analysis!

The first code chunk in 02_analysis.Rmd should look like this.  This makes the tidyverse and the cleaned data available at the start of your analysis.

```r
library(tidyverse)
load(file = "data_products/wb_percent_forest.Rdata")
```

---

# Deeper Dives: country codes, dates, string manipulation

The examples above should give you some intuition for the look and feel of some real data cleaning pipelines.  But they of course are not very comprehensive. In this section, let's focus manipulation using a few packages.

- countrycode - translates between one name or id for a country to another
- lubridate - working with dates
- stringr - character manipulation

---

## Country Code

Countries can have many names.  Consider the United States.  You can call it the US, the USA, the United States of America and so on.  The package countrycode can be helpful in translating between different ways to refer to a specific country. Let's see how you might use it.

---

```r
*library(gapminder)
```
]
 
.panel2-countrycode-auto[

]

---
count: false

```r
library(gapminder)
*library(countrycode)
```
]
 
.panel2-countrycode-auto[

]

---
count: false

```r
library(gapminder)
library(countrycode)
*gapminder
```
]
 
.panel2-countrycode-auto[

```
  # A tibble: 1,704 x 6
     country     continent  year lifeExp      pop gdpPercap
     <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
   1 Afghanistan Asia       1952    28.8  8425333      779.
   2 Afghanistan Asia       1957    30.3  9240934      821.
   3 Afghanistan Asia       1962    32.0 10267083      853.
   4 Afghanistan Asia       1967    34.0 11537966      836.
   5 Afghanistan Asia       1972    36.1 13079460      740.
   6 Afghanistan Asia       1977    38.4 14880372      786.
   7 Afghanistan Asia       1982    39.9 12881816      978.
   8 Afghanistan Asia       1987    40.8 13867957      852.
   9 Afghanistan Asia       1992    41.7 16317921      649.
  10 Afghanistan Asia       1997    41.8 22227415      635.
  # … with 1,694 more rows
```
]

---
count: false

```r
library(gapminder)
library(countrycode)
gapminder %>%
* distinct(country)
```
]
 
.panel2-countrycode-auto[

```
  # A tibble: 142 x 1
     country    
     <fct>      
   1 Afghanistan
   2 Albania    
   3 Algeria    
   4 Angola     
   5 Argentina  
   6 Australia  
   7 Austria    
   8 Bahrain    
   9 Bangladesh 
  10 Belgium    
  # … with 132 more rows
```
]

---
count: false

```r
library(gapminder)
library(countrycode)
gapminder %>%
  distinct(country) %>%
* mutate(iso3c = countrycode(
*   sourcevar = country,
*   origin = "country.name",
*   destination = "iso3c"
*          ))
```
]
 
.panel2-countrycode-auto[

```
  # A tibble: 142 x 2
     country     iso3c
     <fct>       <chr>
   1 Afghanistan AFG  
   2 Albania     ALB  
   3 Algeria     DZA  
   4 Angola      AGO  
   5 Argentina   ARG  
   6 Australia   AUS  
   7 Austria     AUT  
   8 Bahrain     BHR  
   9 Bangladesh  BGD  
  10 Belgium     BEL  
  # … with 132 more rows
```
]

---
count: false

```
  # A tibble: 142 x 3
     country     iso3c cowc 
     <fct>       <chr> <chr>
   1 Afghanistan AFG   AFG  
   2 Albania     ALB   ALB  
   3 Algeria     DZA   ALG  
   4 Angola      AGO   ANG  
   5 Argentina   ARG   ARG  
   6 Australia   AUS   AUL  
   7 Austria     AUT   AUS  
   8 Bahrain     BHR   BAH  
   9 Bangladesh  BGD   BNG  
  10 Belgium     BEL   BEL  
  # … with 132 more rows
```
]

---
count: false

```r
library(gapminder)
library(countrycode)
gapminder %>%
  distinct(country) %>%
  mutate(iso3c = countrycode(
    sourcevar = country,
    origin = "country.name",
    destination = "iso3c"
           )) %>%
  mutate(cowc = countrycode(
    s = country,
    origin = "country.name",
    d = "cowc")
    ) %>%
* mutate(cown = countrycode(
*   s = cowc,
*   origin = "cowc",
*   d = "cown")
*   )
```
]
 
.panel2-countrycode-auto[

```
  # A tibble: 142 x 4
     country     iso3c cowc   cown
     <fct>       <chr> <chr> <int>
   1 Afghanistan AFG   AFG     700
   2 Albania     ALB   ALB     339
   3 Algeria     DZA   ALG     615
   4 Angola      AGO   ANG     540
   5 Argentina   ARG   ARG     160
   6 Australia   AUS   AUL     900
   7 Austria     AUT   AUS     305
   8 Bahrain     BHR   BAH     692
   9 Bangladesh  BGD   BNG     771
  10 Belgium     BEL   BEL     211
  # … with 132 more rows
```
]

---
count: false

```r
library(gapminder)
library(countrycode)
gapminder %>%
  distinct(country) %>%
  mutate(iso3c = countrycode(
    sourcevar = country,
    origin = "country.name",
    destination = "iso3c"
           )) %>%
  mutate(cowc = countrycode(
    s = country,
    origin = "country.name",
    d = "cowc")
    ) %>%
  mutate(cown = countrycode(
    s = cowc,
    origin = "cowc",
    d = "cown")
    ) %>%
* mutate(eu28 = countrycode(
*   s = country,
*   origin = "country.name",
*   d = "eu28")
*   )
```
]
 
.panel2-countrycode-auto[

```
  # A tibble: 142 x 5
     country     iso3c cowc   cown eu28 
     <fct>       <chr> <chr> <int> <chr>
   1 Afghanistan AFG   AFG     700 <NA> 
   2 Albania     ALB   ALB     339 <NA> 
   3 Algeria     DZA   ALG     615 <NA> 
   4 Angola      AGO   ANG     540 <NA> 
   5 Argentina   ARG   ARG     160 <NA> 
   6 Australia   AUS   AUL     900 <NA> 
   7 Austria     AUT   AUS     305 EU   
   8 Bahrain     BHR   BAH     692 <NA> 
   9 Bangladesh  BGD   BNG     771 <NA> 
  10 Belgium     BEL   BEL     211 EU   
  # … with 132 more rows
```
]

---
count: false

```r
library(gapminder)
library(countrycode)
gapminder %>%
  distinct(country) %>%
  mutate(iso3c = countrycode(
    sourcevar = country,
    origin = "country.name",
    destination = "iso3c"
           )) %>%
  mutate(cowc = countrycode(
    s = country,
    origin = "country.name",
    d = "cowc")
    ) %>%
  mutate(cown = countrycode(
    s = cowc,
    origin = "cowc",
    d = "cown")
    ) %>%
  mutate(eu28 = countrycode(
    s = country,
    origin = "country.name",
    d = "eu28")
    ) %>%
* mutate(eu28 = ifelse(
*   is.na(eu28), "Not EU", "EU")
*   )
```
]
 
.panel2-countrycode-auto[

```
  # A tibble: 142 x 5
     country     iso3c cowc   cown eu28  
     <fct>       <chr> <chr> <int> <chr> 
   1 Afghanistan AFG   AFG     700 Not EU
   2 Albania     ALB   ALB     339 Not EU
   3 Algeria     DZA   ALG     615 Not EU
   4 Angola      AGO   ANG     540 Not EU
   5 Argentina   ARG   ARG     160 Not EU
   6 Australia   AUS   AUL     900 Not EU
   7 Austria     AUT   AUS     305 EU    
   8 Bahrain     BHR   BAH     692 Not EU
   9 Bangladesh  BGD   BNG     771 Not EU
  10 Belgium     BEL   BEL     211 EU    
  # … with 132 more rows
```
]

---

## Lubridate

Dates are a special class of data, and there are many ways to express dates.  Let's have a look at how to use the lubridate package to "move around" between dates. We'll look at a very small dataset of four US holidays.

```r
holidays_2019
```

```
  # A tibble: 4 x 2
    date       holiday               
    <chr>      <chr>                 
  1 January 1  New Year's Day        
  2 January 21 Martin Luther King Day
  3 January 24 Belly Laugh Day       
  4 February 2 Groundhog Day
```

---

```r
*library(lubridate)
```
]
 
.panel2-lubridate-auto[

]

---
count: false

```r
library(lubridate)
*holidays_2019
```
]
 
.panel2-lubridate-auto[

```
  # A tibble: 4 x 2
    date       holiday               
    <chr>      <chr>                 
  1 January 1  New Year's Day        
  2 January 21 Martin Luther King Day
  3 January 24 Belly Laugh Day       
  4 February 2 Groundhog Day
```
]

---
count: false

```r
library(lubridate)
holidays_2019 %>%
* mutate(date = paste0(date, ", 2019"))
```
]
 
.panel2-lubridate-auto[

```
  # A tibble: 4 x 2
    date             holiday               
    <chr>            <chr>                 
  1 January 1, 2019  New Year's Day        
  2 January 21, 2019 Martin Luther King Day
  3 January 24, 2019 Belly Laugh Day       
  4 February 2, 2019 Groundhog Day
```
]

---
count: false

```r
library(lubridate)
holidays_2019 %>%
  mutate(date = paste0(date, ", 2019")) %>%
* mutate(date = mdy(date))
```
]
 
.panel2-lubridate-auto[

```
  # A tibble: 4 x 2
    date       holiday               
    <date>     <chr>                 
  1 2019-01-01 New Year's Day        
  2 2019-01-21 Martin Luther King Day
  3 2019-01-24 Belly Laugh Day       
  4 2019-02-02 Groundhog Day
```
]

---
count: false

```r
library(lubridate)
holidays_2019 %>%
  mutate(date = paste0(date, ", 2019")) %>%
  mutate(date = mdy(date)) %>%
* mutate(day = day(date))
```
]
 
.panel2-lubridate-auto[

```
  # A tibble: 4 x 3
    date       holiday                  day
    <date>     <chr>                  <int>
  1 2019-01-01 New Year's Day             1
  2 2019-01-21 Martin Luther King Day    21
  3 2019-01-24 Belly Laugh Day           24
  4 2019-02-02 Groundhog Day              2
```
]

---
count: false

```r
library(lubridate)
holidays_2019 %>%
  mutate(date = paste0(date, ", 2019")) %>%
  mutate(date = mdy(date)) %>%
  mutate(day = day(date)) %>%
* mutate(month = month(date))
```
]
 
.panel2-lubridate-auto[

```
  # A tibble: 4 x 4
    date       holiday                  day month
    <date>     <chr>                  <int> <dbl>
  1 2019-01-01 New Year's Day             1     1
  2 2019-01-21 Martin Luther King Day    21     1
  3 2019-01-24 Belly Laugh Day           24     1
  4 2019-02-02 Groundhog Day              2     2
```
]

---
count: false

```
  # A tibble: 4 x 5
    date       holiday                  day month  year
    <date>     <chr>                  <int> <dbl> <dbl>
  1 2019-01-01 New Year's Day             1     1  2019
  2 2019-01-21 Martin Luther King Day    21     1  2019
  3 2019-01-24 Belly Laugh Day           24     1  2019
  4 2019-02-02 Groundhog Day              2     2  2019
```
]

---
count: false

```
  # A tibble: 4 x 6
    date       holiday                  day month  year day_of_week
    <date>     <chr>                  <int> <dbl> <dbl> <chr>      
  1 2019-01-01 New Year's Day             1     1  2019 Tuesday    
  2 2019-01-21 Martin Luther King Day    21     1  2019 Monday     
  3 2019-01-24 Belly Laugh Day           24     1  2019 Thursday   
  4 2019-02-02 Groundhog Day              2     2  2019 Saturday
```
]

---
count: false

```
  # A tibble: 4 x 4
    date         day month  year
    <date>     <int> <dbl> <dbl>
  1 2019-01-01     1     1  2019
  2 2019-01-21    21     1  2019
  3 2019-01-24    24     1  2019
  4 2019-02-02     2     2  2019
```
]

---
count: false

```r
library(lubridate)
holidays_2019 %>%
  mutate(date = paste0(date, ", 2019")) %>%
  mutate(date = mdy(date)) %>%
  mutate(day = day(date)) %>%
  mutate(month = month(date)) %>%
  mutate(year = year(date)) %>%
  mutate(day_of_week = weekdays(date)) %>%
# making space for more manipulation
  select(-holiday, -day_of_week) %>%
* mutate(euro_ordering = paste(day, month, year, sep = "/"))
```
]
 
.panel2-lubridate-auto[

```
  # A tibble: 4 x 5
    date         day month  year euro_ordering
    <date>     <int> <dbl> <dbl> <chr>        
  1 2019-01-01     1     1  2019 1/1/2019     
  2 2019-01-21    21     1  2019 21/1/2019    
  3 2019-01-24    24     1  2019 24/1/2019    
  4 2019-02-02     2     2  2019 2/2/2019
```
]

---
count: false

```r
library(lubridate)
holidays_2019 %>%
  mutate(date = paste0(date, ", 2019")) %>%
  mutate(date = mdy(date)) %>%
  mutate(day = day(date)) %>%
  mutate(month = month(date)) %>%
  mutate(year = year(date)) %>%
  mutate(day_of_week = weekdays(date)) %>%
# making space for more manipulation
  select(-holiday, -day_of_week) %>%
  mutate(euro_ordering = paste(day, month, year, sep = "/")) %>%
* mutate(date_0 = dmy(euro_ordering))
```
]
 
.panel2-lubridate-auto[

```
  # A tibble: 4 x 6
    date         day month  year euro_ordering date_0    
    <date>     <int> <dbl> <dbl> <chr>         <date>    
  1 2019-01-01     1     1  2019 1/1/2019      2019-01-01
  2 2019-01-21    21     1  2019 21/1/2019     2019-01-21
  3 2019-01-24    24     1  2019 24/1/2019     2019-01-24
  4 2019-02-02     2     2  2019 2/2/2019      2019-02-02
```
]

---
count: false

```r
library(lubridate)
holidays_2019 %>%
  mutate(date = paste0(date, ", 2019")) %>%
  mutate(date = mdy(date)) %>%
  mutate(day = day(date)) %>%
  mutate(month = month(date)) %>%
  mutate(year = year(date)) %>%
  mutate(day_of_week = weekdays(date)) %>%
# making space for more manipulation
  select(-holiday, -day_of_week) %>%
  mutate(euro_ordering = paste(day, month, year, sep = "/")) %>%
  mutate(date_0 = dmy(euro_ordering)) %>%
* mutate(since_jan_1 = date - ymd("2019-01-01"))
```
]
 
.panel2-lubridate-auto[

```
  # A tibble: 4 x 7
    date         day month  year euro_ordering date_0     since_jan_1
    <date>     <int> <dbl> <dbl> <chr>         <date>     <drtn>     
  1 2019-01-01     1     1  2019 1/1/2019      2019-01-01  0 days    
  2 2019-01-21    21     1  2019 21/1/2019     2019-01-21 20 days    
  3 2019-01-24    24     1  2019 24/1/2019     2019-01-24 23 days    
  4 2019-02-02     2     2  2019 2/2/2019      2019-02-02 32 days
```
]

---

## String (character) manipulation

Character manipulation is also referred to as "regular expressions", or "Regex".  The package stringr seeks to provide a unified framework for string manipulation. Let's get a taste, using the holidays_2019 dataset again.  (The section is not particularly inspired - sorry!)

---

```r
library(stringr)
holidays_2019 %>%
  mutate(var_1 =
           str_detect(holiday, "u")) %>%
  mutate(var_1 =
           tolower(holiday)) %>%
  mutate(var_1 =
*          str_replace(holiday, "u", "REPLACED"))
```
]
 
.panel2-stringr-rotate[

```
  # A tibble: 4 x 3
    date       holiday                var_1                        
    <chr>      <chr>                  <chr>                        
  1 January 1  New Year's Day         New Year's Day               
  2 January 21 Martin Luther King Day Martin LREPLACEDther King Day
  3 January 24 Belly Laugh Day        Belly LaREPLACEDgh Day       
  4 February 2 Groundhog Day          GroREPLACEDndhog Day
```
]

---
count: false

```r
library(stringr)
holidays_2019 %>%
  mutate(var_1 =
           str_detect(holiday, "u")) %>%
  mutate(var_1 =
           tolower(holiday)) %>%
  mutate(var_1 =
*          str_replace(holiday, "^.", "REPLACED")) #ROTATE
```
]
 
.panel2-stringr-rotate[

```
  # A tibble: 4 x 3
    date       holiday                var_1                        
    <chr>      <chr>                  <chr>                        
  1 January 1  New Year's Day         REPLACEDew Year's Day        
  2 January 21 Martin Luther King Day REPLACEDartin Luther King Day
  3 January 24 Belly Laugh Day        REPLACEDelly Laugh Day       
  4 February 2 Groundhog Day          REPLACEDroundhog Day
```
]

---
count: false

```r
library(stringr)
holidays_2019 %>%
  mutate(var_1 =
           str_detect(holiday, "u")) %>%
  mutate(var_1 =
           tolower(holiday)) %>%
  mutate(var_1 =
*          str_replace(holiday, ".$", "REPLACED")) #ROTATE
```
]
 
.panel2-stringr-rotate[

```
  # A tibble: 4 x 3
    date       holiday                var_1                        
    <chr>      <chr>                  <chr>                        
  1 January 1  New Year's Day         New Year's DaREPLACED        
  2 January 21 Martin Luther King Day Martin Luther King DaREPLACED
  3 January 24 Belly Laugh Day        Belly Laugh DaREPLACED       
  4 February 2 Groundhog Day          Groundhog DaREPLACED
```
]

---
count: false

```r
library(stringr)
holidays_2019 %>%
  mutate(var_1 =
           str_detect(holiday, "u")) %>%
  mutate(var_1 =
           tolower(holiday)) %>%
  mutate(var_1 =
*          str_count(holiday, "a")) #ROTATE
```
]
 
.panel2-stringr-rotate[

```
  # A tibble: 4 x 3
    date       holiday                var_1
    <chr>      <chr>                  <int>
  1 January 1  New Year's Day             2
  2 January 21 Martin Luther King Day     2
  3 January 24 Belly Laugh Day            2
  4 February 2 Groundhog Day              1
```
]

---
count: false

```r
library(stringr)
holidays_2019 %>%
  mutate(var_1 =
           str_detect(holiday, "u")) %>%
  mutate(var_1 =
           tolower(holiday)) %>%
  mutate(var_1 =
*          str_extract(holiday, ".....Day")) #ROTATE
```
]
 
.panel2-stringr-rotate[

```
  # A tibble: 4 x 3
    date       holiday                var_1   
    <chr>      <chr>                  <chr>   
  1 January 1  New Year's Day         ar's Day
  2 January 21 Martin Luther King Day King Day
  3 January 24 Belly Laugh Day        augh Day
  4 February 2 Groundhog Day          dhog Day
```
]

---
count: false

```r
library(stringr)
holidays_2019 %>%
  mutate(var_1 =
           str_detect(holiday, "u")) %>%
  mutate(var_1 =
           tolower(holiday)) %>%
  mutate(var_1 =
*          str_extract(holiday, "[[:PUNCT:]]")) #ROTATE
```
]
 
.panel2-stringr-rotate[

```
  # A tibble: 4 x 3
    date       holiday                var_1
    <chr>      <chr>                  <chr>
  1 January 1  New Year's Day         '    
  2 January 21 Martin Luther King Day <NA> 
  3 January 24 Belly Laugh Day        <NA> 
  4 February 2 Groundhog Day          <NA>
```
]

---

# Key packages used

Sam Firke (2019). janitor: Simple Tools for Examining and
  Cleaning Dirty Data. R package version 1.2.0.
  https://CRAN.R-project.org/package=janitor

Hadley Wickham (2017). tidyverse: Easily Install and Load
  the 'Tidyverse'. R package version 1.2.1.
  https://CRAN.R-project.org/package=tidyverse

Arel-Bundock et al., (2018). countrycode: An R package to
  convert country names and country codes. Journal of Open
  Source Software, 3(28), 848,
  https://doi.org/10.21105/joss.00848

Christian Rubba (2016). htmltab: Assemble Data Frames from
  HTML Tables. R package version 0.7.1.
  https://CRAN.R-project.org/package=htmltab

---

I get the UN Security Council data from the UN's website "Countries Elected Members of the Security Council".

---

Sam Firke (2019). janitor: Simple Tools for Examining and
  Cleaning Dirty Data. R package version 1.2.0.
  https://CRAN.R-project.org/package=janitor

Hadley Wickham (2017). tidyverse: Easily Install and Load
  the 'Tidyverse'. R package version 1.2.1.
  https://CRAN.R-project.org/package=tidyverse

Arel-Bundock et al., (2018). countrycode: An R package to
  convert country names and country codes. Journal of Open
  Source Software, 3(28), 848,
  https://doi.org/10.21105/joss.00848

Christian Rubba (2016). htmltab: Assemble Data Frames from
  HTML Tables. R package version 0.7.1.
  https://CRAN.R-project.org/package=htmltab

Garrett Grolemund, Hadley Wickham (2011). Dates
  and Times Made Easy with lubridate. Journal of
  Statistical Software, 40(3), 1-25. URL
  http://www.jstatsoft.org/v40/i03/.

@media print {
  .has-continuation {
    display: block;
  }
}

</style>