class: left, top, inverse background-image: url(broom.jpg) background-size: cover # .large[Data Cleaning] ### Walk-throughs <br>with {flipbookr}<br>and {xaringan} <br> <br> <br> <br> <br> <br> <br> <br> <br> <br> #### .right[Gina Reynolds<br>Photo Credit: Zoya Loonohod] # Introduction Data cleaning is the pits. Or is it? The cleaning tools of the `tidyverse` and other packages like `janitor` might help such tasks become less terrible -- and more fun! Let's look at a couple of examples. --- # Getting started with R and RStudio The tools demonstrated in this book are implemented in the statistical software, R which is opensource and freely available. RStudio, an integrated development environment (IDE), is a nice environment for working in R. The following links may help you get R and RStudio up and running. - [Install Windows](https://www.youtube.com/watch?v=aCRMhAWmtpw) - [Install Mac](https://www.youtube.com/watch?v=GLLZhc_5enQ) --- # Acknowledgements I'm grateful to the tidyverse team for their tireless work on developing and maintaining this wonderful toolkit and to RStudio for supporting the development. Emi Tanaka (@statsgen) and Garrick Aden-Buie (@grrrck) helped by writing code to made the flipbook style possible. I'm *very* grateful to them for their work on this as well as to Yihui Xie and others for their work on the Xaringan package --- the extraordinary platform for creating the slides for this book. I'm thankful too to my students who refresh my perspective on teaching and learning new tools. Finally, thanks to David who is supportive of data cleaning and of me. --- # Preliminaries This book covers some examples of getting data into R and cleaning it. There are about 5 steps involved as follows: - Step 0: Identifying data of interest - Step 1: Saving this data to your raw_data folder - Step 2: Scanning the data into R and assigning an object like *the_data_raw* - Step 3: Cleaning the data and assigning to an object like *the_data* - Step 4: Saving out the clean data in a data_products folder - Step 5: Loading your clean data - ready for analysis! This workflow assumes that you are using a project structure. So let's make sure that you have that set up. --- # Open your project in RStudio It's helpful to use RStudio projects. You should have a final_project created already. In RStudio, you can navigate to it from the right hand corner icon. In the example, here, I'm currently in a project called `flipbooks` but I need to navigate to the project in the list called `final_project`. You should navigate there too! ![](figures/navigate_to_project.png) --- # Project structure ![](figures/project_file_structure.png) Steps 1-4 (maybe just 2-4) will happen in the 01_cleaning_data.Rmd, and Step 5 will happen in 02_analysis.Rmd. --- # Project structure Another way to show this is with a directory tree as shown here: ``` final_project ├── 01_data_cleaning.Rmd <- work files ├── 02_analysis.Rmd <- work files ├── final_project.Rproj <- you declared your project ├── raw_data <- Folder ├── data_products <- Folder ├── figures <- Folder ├── literature <- Folder ``` --- # Example 1: Fisheries data from Wikipedia This book will cover a few examples. The first is about fisheries around the world. A clean version of this data is used in a lab that we might do in the future about conflict over resources; the lab is adapted from a lab originally written by Angela Zoss and Eric Monson. --- ## Step 0: find data! The date for this first example comes from the second table on the Wikipedia page [Fishing Industry by Country](https://en.wikipedia.org/wiki/Fishing_industry_by_country). <iframe src="https://en.wikipedia.org/wiki/Fishing_industry_by_country" width="100%" height="400px"></iframe> --- ## Step 1: get data The first step saving your file of interest to your raw data folder. You can do this manually or use *download.file()* with a web address, as we do here. ```r download.file(url = "https://en.wikipedia.org/wiki/Fishing_industry_by_country", destfile = "raw_data/wikipedia_fisheries.html") ``` --- ## Step 2: scan in data and create an object *like `fisheries_2015_raw`* Once you have saved your raw data, you need to scan/read the data into R. There are a variety of functions to do this depending on your data type. In this example, we are wanting to read in a table embedded in an html file. We can read in the data with the htmltab function from the htmltab package. --- count: false .panel1-read_in_fisheries-auto[ ```r # using the htmltab package to scan html file *htmltab::htmltab(doc = "raw_data/wikipedia_fisheries.html", # the second table object is what we want * which = 2) ``` ] .panel2-read_in_fisheries-auto[ ``` Country Capture Aquaculture Total 2 China 17,800,000 63,700,000 81,500,000 3 Indonesia 6,584,419 16,600,000 23,200,000 4 India 5,082,332 5,703,002 10,800,000 5 Vietnam 2,785,940 3,634,531 6,420,471 6 United States 4,931,017 444,369 5,375,386 7 Russia 4,773,413 173,840 4,947,253 8 Japan 3,275,263 1,067,994 4,343,257 9 Philippines 2,027,992 2,200,914 4,228,906 10 Peru 3,811,802 100,187 3,911,989 11 Bangladesh 1,674,770 2,203,554 3,878,324 12 Norway 2,203,360 1,326,216 3,529,576 13 South Korea 1,395,951 1,859,220 3,255,171 14 Myanmar 2,072,390 1,017,644 3,090,034 15 Chile 1,829,238 1,050,117 2,879,355 16 Thailand 1,530,583 962,571 2,493,154 17 Malaysia 1,584,371 407,887 1,992,258 18 Mexico 1,524,467 221,328 1,745,795 19 Egypt 335,614 1,370,660 1,706,274 20 Morocco 1,454,105 1,142 1,455,247 21 Brazil 705,000 581,230 1,286,230 22 Spain 915,137 283,831 1,198,968 23 Ecuador 715,495 451,090 1,166,585 24 Iceland 1,085,176 15,129 1,100,305 25 Iran 695,407 398,129 1,093,536 26 Canada 874,727 200,765 1,075,492 27 Nigeria 734,731 306,727 1,041,458 28 United Kingdom 702,405 194,492 896,897 29 Cambodia 629,950 172,500 802,450 30 North Korea 209,000 554,100 763,100 31 Argentina 755,226 3,673 758,900 32 France 561,173 166,640 727,813 33 Denmark 670,344 36,337 706,681 34 Pakistan 513,156 156,430 669,586 35 Faroe Islands 568,435 83,300 651,735 36 South Africa 622,070 7,994 630,064 37 Mauritania 609,754 0 609,754 38 Turkey 335,326 250,331 585,657 39 Sri Lanka 521,637 30,974 552,611 40 New Zealand 424,791 109,016 533,807 41 Uganda 389,244 118,051 507,295 42 Namibia 503,878 591 504,469 43 Angola 486,490 655 487,145 44 Senegal 474,162 2,079 476,241 45 Netherlands 370,274 62,940 433,214 46 Tanzania 370,966 12,547 383,514 47 Ghana 327,457 52,480 379,937 48 Italy 198,130 157,109 355,239 49 Papua New Guinea 309,245 6,200 315,445 50 Germany 271,185 41,721 312,906 51 Venezuela 284,175 25,998 310,173 52 Mozambique 299,591 1,180 300,771 53 Ireland 259,845 40,244 300,089 54 Oman 279,606 103 279,709 55 Greenland 273,175 0 273,175 56 Australia 174,629 96,847 271,477 57 Poland 218,115 38,300 256,415 58 Democratic Republic of the Congo 237,372 3,161 240,533 59 Cameroon 233,190 2,315 235,505 60 Sweden 208,783 15,747 224,530 61 Finland 192,065 14,412 206,477 62 Sierra Leone 202,100 75 202,175 63 Greece 76,362 123,410 199,772 64 Portugal 186,950 9,787 196,737 65 Kenya 171,391 15,360 186,751 66 Colombia 86,344 96,970 183,314 67 Laos 70,915 109,835 180,750 68 Kiribati 172,822 3,652 176,474 69 Madagascar 142,333 25,998 168,331 70 Malawi 152,852 7,646 160,498 71 Yemen 154,450 0 154,450 72 Panama 144,450 8,808 153,258 73 Hong Kong 142,775 4,258 147,033 74 Tunisia 115,064 16,165 131,229 75 Maldives 129,191 0 129,191 76 Guinea 128,000 250 128,250 77 Seychelles 127,128 0 127,128 78 Latvia 114,806 788 115,594 79 Zambia 83,918 30,285 114,203 80 Lithuania 106,945 4,393 111,338 81 Chad 110,000 94 110,094 82 Saudi Arabia 68,082 39,920 108,002 83 Mali 102,486 4,194 106,680 84 Ukraine 75,743 21,425 97,168 85 Algeria 95,000 1,361 96,361 86 Belize 91,432 953 92,385 87 Federated States of Micronesia 88,397 0 88,397 88 Croatia 72,312 15,805 88,116 89 Republic of the Congo 86,748 177 86,925 90 Solomon Islands 66,445 10,582 77,027 91 Estonia 75,931 868 76,799 92 United Arab Emirates 73,000 1,241 74,241 93 Ivory Coast 67,500 4,701 72,201 94 Nepal 21,500 49,043 70,543 95 Nicaragua 45,500 22,530 68,030 96 Saint Kitts and Nevis 65,734 1 65,735 97 Uzbekistan 27,267 38,055 65,322 98 Marshall Islands 64,795 5 64,800 99 Honduras 10,600 53,100 63,700 100 El Salvador 54,084 7,956 62,040 101 Gambia 58,261 5 58,296 102 Iraq 28,000 28,835 56,835 103 Benin 49,806 3,080 52,886 104 Cuba 23,574 29,185 52,759 105 Uruguay 51,500 70 51,570 106 Suriname 47,013 102 47,115 107 Ethiopia 45,500 95 45,595 108 Fiji 44,663 754 45,417 109 Guatemala 19,011 26,268 45,279 110 Vanuatu 44,002 16 44,018 111 Kazakhstan 41,335 1,878 43,213 112 Guyana 42,142 337 42,478 113 Sudan 33,002 4,500 37,502 114 Costa Rica 14,750 22,421 37,171 115 Curaçao 35,534 0 35,534 116 South Sudan 35,000 20 35,020 117 Niger 34,592 300 34,892 118 Togo 31,891 98 31,989 119 Gabon 31,000 45 31,045 120 Georgia 30,078 670 30,748 121 Libya 30,002 10 30,012 122 Somalia 30,000 0 30,000 123 Central African Republic 29,000 190 29,190 124 Belgium 26,970 44 27,014 125 Rwanda 25,013 580 26,593 126 Zimbabwe 15,711 10,085 25,796 127 Paraguay 17,000 8,500 25,500 128 Romania 12,728 12,574 25,302 129 Czech Republic 3,507 20,952 24,459 130 Bulgaria 8,614 15,762 24,376 131 Burundi 21,805 1,330 23,135 132 Saint Vincent and the Grenadines 23,077 0 23,077 133 Burkina Faso 22,070 470 22,540 134 Hungary 5,048 16,248 21,296 135 Israel 1,758 18,914 20,672 136 Armenia 3,758 16,381 20,139 137 Cape Verde 19,900 0 19,900 138 Mauritius 18,062 1,021 19,083 139 Haiti 16,510 1,220 17,730 140 Jamaica 16,800 620 17,420 141 Dominican Republic 14,640 2,285 16,925 142 Comoros 16,407 0 16,407 143 French Polynesia 13,754 1,343 15,097 144 Turkmenistan 15,000 30 15,030 145 Bahrain 15,000 6 15,006 146 Liberia 14,700 40 14,740 147 Qatar 14,516 10 14,526 148 Brunei 13,292 948 14,240 149 Trinidad and Tobago 13,027 11 13,038 150 Moldova 50 16,011 12,061 151 Belarus 686 11,199 11,885 152 São Tomé and Príncipe 11,750 0 11,750 153 Bahamas 11,625 8 11,633 154 Bolivia 7,000 3,000 10,000 155 Serbia 2,067 6,878 8,945 156 Albania 7,886 950 8,836 157 Samoa 8,801 10 8,811 158 Malta 2,420 6,073 8,493 159 Cyprus 1,507 6,625 8,132 160 Equatorial Guinea 8,000 15 8,015 161 Tuvalu 7,684 3 7,687 162 Singapore 1,234 6,112 7,346 163 Isle of Man 7,040 0 7,040 164 Syria 4,500 3,500 7,000 165 Guinea-Bissau 6,700 0 6,700 166 Kuwait 5,493 196 5,689 167 New Caledonia 3,815 1,587 5,402 168 Lebanon 4,291 1,015 5,306 169 Bosnia and Herzegovina 305 4,564 4,869 170 Timor-Leste 3,200 1,560 4,760 171 Jersey and Guernsey 2,985 1,499 4,483 172 Eritrea 4,300 0 4,300 173 Slovakia 1,866 2,169 4,035 174 Austria 350 3,483 3,833 175 Palestine 3,306 280 3,586 176 Switzerland 1,851 1,733 3,584 177 American Samoa 3,047 20 3,067 178 Antigua and Barbuda 3,000 10 3,010 179 Turks and Caicos Islands 2,780 0 2,780 180 Grenada 2,550 0 2,550 181 Montenegro 1,595 929 2,524 182 Djibouti 2,220 0 2,220 183 Afghanistan 1,000 1,200 2,200 184 Slovenia 311 1,844 2,154 185 Saint Lucia 2,097 32 2,129 186 Kyrgyzstan 89 1,931 2,020 187 Puerto Rico 1,901 20 1,921 188 Barbados 1,735 26 1,761 189 Jordan 873 885 1,758 190 Tonga 1,697 3 1,700 191 Tajikistan 1,100 450 1,550 192 Guam 1,391 110 1,501 193 Macao 1,500 0 1,500 194 Azerbaijan 676 640 1,316 195 North Macedonia 306 986 1,292 196 British Virgin Islands 1,200 0 1,200 197 Lesotho 52 1,050 1,102 198 Northern Mariana Islands 950 42 992 199 Palau 818 23 841 200 Dominica 770 6 776 201 US Virgin Islands 551 8 559 202 Nauru 530 0 530 203 Bermuda 410 0 410 204 Sint Maarten 253 0 253 205 Eswatini 65 100 165 206 Bhutan 7 150 157 207 Aruba 142 0 142 208 Cayman Islands 125 0 125 209 France 90 0 90 210 Botswana 38 15 53 211 Mongolia 15 0 15 212 Monaco 1 0 1 213 Gibraltar 1 0 1 214 Andorra 0 0 0 215 Liechtenstein 0 0 0 216 Luxembourg 0 0 0 217 San Marino 0 0 0 218 Total (2015) 93,736,944 101,084,799 199,741,129 ``` ] --- count: false .panel1-read_in_fisheries-auto[ ```r # using the htmltab package to scan html file htmltab::htmltab(doc = "raw_data/wikipedia_fisheries.html", # the second table object is what we want which = 2) -> # assign this to an object with the _raw suffix *fisheries_2015_raw ``` ] .panel2-read_in_fisheries-auto[ ] <style> .panel1-read_in_fisheries-auto { color: black; width: 38.6060606060606%; hight: 32%; float: left; padding-left: 1%; font-size: 80% } .panel2-read_in_fisheries-auto { color: black; width: 59.3939393939394%; hight: 32%; float: left; padding-left: 1%; font-size: 80% } .panel3-read_in_fisheries-auto { color: black; width: NA%; hight: 33%; float: left; padding-left: 1%; font-size: 80% } </style> --- ## Step 3: Load tidyverse, Clean data and create an cleaned object *like fisheries_2015* We'll do most of the manipulation with dplyr and tidyr which are subpackages of the tidyverse. --- count: false .panel1-clean_wikipedia_fisheries-auto[ ```r *fisheries_2015_raw ``` ] .panel2-clean_wikipedia_fisheries-auto[ ``` Country Capture Aquaculture Total 2 China 17,800,000 63,700,000 81,500,000 3 Indonesia 6,584,419 16,600,000 23,200,000 4 India 5,082,332 5,703,002 10,800,000 5 Vietnam 2,785,940 3,634,531 6,420,471 6 United States 4,931,017 444,369 5,375,386 7 Russia 4,773,413 173,840 4,947,253 8 Japan 3,275,263 1,067,994 4,343,257 9 Philippines 2,027,992 2,200,914 4,228,906 10 Peru 3,811,802 100,187 3,911,989 11 Bangladesh 1,674,770 2,203,554 3,878,324 12 Norway 2,203,360 1,326,216 3,529,576 13 South Korea 1,395,951 1,859,220 3,255,171 14 Myanmar 2,072,390 1,017,644 3,090,034 15 Chile 1,829,238 1,050,117 2,879,355 16 Thailand 1,530,583 962,571 2,493,154 17 Malaysia 1,584,371 407,887 1,992,258 18 Mexico 1,524,467 221,328 1,745,795 19 Egypt 335,614 1,370,660 1,706,274 20 Morocco 1,454,105 1,142 1,455,247 21 Brazil 705,000 581,230 1,286,230 22 Spain 915,137 283,831 1,198,968 23 Ecuador 715,495 451,090 1,166,585 24 Iceland 1,085,176 15,129 1,100,305 25 Iran 695,407 398,129 1,093,536 26 Canada 874,727 200,765 1,075,492 27 Nigeria 734,731 306,727 1,041,458 28 United Kingdom 702,405 194,492 896,897 29 Cambodia 629,950 172,500 802,450 30 North Korea 209,000 554,100 763,100 31 Argentina 755,226 3,673 758,900 32 France 561,173 166,640 727,813 33 Denmark 670,344 36,337 706,681 34 Pakistan 513,156 156,430 669,586 35 Faroe Islands 568,435 83,300 651,735 36 South Africa 622,070 7,994 630,064 37 Mauritania 609,754 0 609,754 38 Turkey 335,326 250,331 585,657 39 Sri Lanka 521,637 30,974 552,611 40 New Zealand 424,791 109,016 533,807 41 Uganda 389,244 118,051 507,295 42 Namibia 503,878 591 504,469 43 Angola 486,490 655 487,145 44 Senegal 474,162 2,079 476,241 45 Netherlands 370,274 62,940 433,214 46 Tanzania 370,966 12,547 383,514 47 Ghana 327,457 52,480 379,937 48 Italy 198,130 157,109 355,239 49 Papua New Guinea 309,245 6,200 315,445 50 Germany 271,185 41,721 312,906 51 Venezuela 284,175 25,998 310,173 52 Mozambique 299,591 1,180 300,771 53 Ireland 259,845 40,244 300,089 54 Oman 279,606 103 279,709 55 Greenland 273,175 0 273,175 56 Australia 174,629 96,847 271,477 57 Poland 218,115 38,300 256,415 58 Democratic Republic of the Congo 237,372 3,161 240,533 59 Cameroon 233,190 2,315 235,505 60 Sweden 208,783 15,747 224,530 61 Finland 192,065 14,412 206,477 62 Sierra Leone 202,100 75 202,175 63 Greece 76,362 123,410 199,772 64 Portugal 186,950 9,787 196,737 65 Kenya 171,391 15,360 186,751 66 Colombia 86,344 96,970 183,314 67 Laos 70,915 109,835 180,750 68 Kiribati 172,822 3,652 176,474 69 Madagascar 142,333 25,998 168,331 70 Malawi 152,852 7,646 160,498 71 Yemen 154,450 0 154,450 72 Panama 144,450 8,808 153,258 73 Hong Kong 142,775 4,258 147,033 74 Tunisia 115,064 16,165 131,229 75 Maldives 129,191 0 129,191 76 Guinea 128,000 250 128,250 77 Seychelles 127,128 0 127,128 78 Latvia 114,806 788 115,594 79 Zambia 83,918 30,285 114,203 80 Lithuania 106,945 4,393 111,338 81 Chad 110,000 94 110,094 82 Saudi Arabia 68,082 39,920 108,002 83 Mali 102,486 4,194 106,680 84 Ukraine 75,743 21,425 97,168 85 Algeria 95,000 1,361 96,361 86 Belize 91,432 953 92,385 87 Federated States of Micronesia 88,397 0 88,397 88 Croatia 72,312 15,805 88,116 89 Republic of the Congo 86,748 177 86,925 90 Solomon Islands 66,445 10,582 77,027 91 Estonia 75,931 868 76,799 92 United Arab Emirates 73,000 1,241 74,241 93 Ivory Coast 67,500 4,701 72,201 94 Nepal 21,500 49,043 70,543 95 Nicaragua 45,500 22,530 68,030 96 Saint Kitts and Nevis 65,734 1 65,735 97 Uzbekistan 27,267 38,055 65,322 98 Marshall Islands 64,795 5 64,800 99 Honduras 10,600 53,100 63,700 100 El Salvador 54,084 7,956 62,040 101 Gambia 58,261 5 58,296 102 Iraq 28,000 28,835 56,835 103 Benin 49,806 3,080 52,886 104 Cuba 23,574 29,185 52,759 105 Uruguay 51,500 70 51,570 106 Suriname 47,013 102 47,115 107 Ethiopia 45,500 95 45,595 108 Fiji 44,663 754 45,417 109 Guatemala 19,011 26,268 45,279 110 Vanuatu 44,002 16 44,018 111 Kazakhstan 41,335 1,878 43,213 112 Guyana 42,142 337 42,478 113 Sudan 33,002 4,500 37,502 114 Costa Rica 14,750 22,421 37,171 115 Curaçao 35,534 0 35,534 116 South Sudan 35,000 20 35,020 117 Niger 34,592 300 34,892 118 Togo 31,891 98 31,989 119 Gabon 31,000 45 31,045 120 Georgia 30,078 670 30,748 121 Libya 30,002 10 30,012 122 Somalia 30,000 0 30,000 123 Central African Republic 29,000 190 29,190 124 Belgium 26,970 44 27,014 125 Rwanda 25,013 580 26,593 126 Zimbabwe 15,711 10,085 25,796 127 Paraguay 17,000 8,500 25,500 128 Romania 12,728 12,574 25,302 129 Czech Republic 3,507 20,952 24,459 130 Bulgaria 8,614 15,762 24,376 131 Burundi 21,805 1,330 23,135 132 Saint Vincent and the Grenadines 23,077 0 23,077 133 Burkina Faso 22,070 470 22,540 134 Hungary 5,048 16,248 21,296 135 Israel 1,758 18,914 20,672 136 Armenia 3,758 16,381 20,139 137 Cape Verde 19,900 0 19,900 138 Mauritius 18,062 1,021 19,083 139 Haiti 16,510 1,220 17,730 140 Jamaica 16,800 620 17,420 141 Dominican Republic 14,640 2,285 16,925 142 Comoros 16,407 0 16,407 143 French Polynesia 13,754 1,343 15,097 144 Turkmenistan 15,000 30 15,030 145 Bahrain 15,000 6 15,006 146 Liberia 14,700 40 14,740 147 Qatar 14,516 10 14,526 148 Brunei 13,292 948 14,240 149 Trinidad and Tobago 13,027 11 13,038 150 Moldova 50 16,011 12,061 151 Belarus 686 11,199 11,885 152 São Tomé and Príncipe 11,750 0 11,750 153 Bahamas 11,625 8 11,633 154 Bolivia 7,000 3,000 10,000 155 Serbia 2,067 6,878 8,945 156 Albania 7,886 950 8,836 157 Samoa 8,801 10 8,811 158 Malta 2,420 6,073 8,493 159 Cyprus 1,507 6,625 8,132 160 Equatorial Guinea 8,000 15 8,015 161 Tuvalu 7,684 3 7,687 162 Singapore 1,234 6,112 7,346 163 Isle of Man 7,040 0 7,040 164 Syria 4,500 3,500 7,000 165 Guinea-Bissau 6,700 0 6,700 166 Kuwait 5,493 196 5,689 167 New Caledonia 3,815 1,587 5,402 168 Lebanon 4,291 1,015 5,306 169 Bosnia and Herzegovina 305 4,564 4,869 170 Timor-Leste 3,200 1,560 4,760 171 Jersey and Guernsey 2,985 1,499 4,483 172 Eritrea 4,300 0 4,300 173 Slovakia 1,866 2,169 4,035 174 Austria 350 3,483 3,833 175 Palestine 3,306 280 3,586 176 Switzerland 1,851 1,733 3,584 177 American Samoa 3,047 20 3,067 178 Antigua and Barbuda 3,000 10 3,010 179 Turks and Caicos Islands 2,780 0 2,780 180 Grenada 2,550 0 2,550 181 Montenegro 1,595 929 2,524 182 Djibouti 2,220 0 2,220 183 Afghanistan 1,000 1,200 2,200 184 Slovenia 311 1,844 2,154 185 Saint Lucia 2,097 32 2,129 186 Kyrgyzstan 89 1,931 2,020 187 Puerto Rico 1,901 20 1,921 188 Barbados 1,735 26 1,761 189 Jordan 873 885 1,758 190 Tonga 1,697 3 1,700 191 Tajikistan 1,100 450 1,550 192 Guam 1,391 110 1,501 193 Macao 1,500 0 1,500 194 Azerbaijan 676 640 1,316 195 North Macedonia 306 986 1,292 196 British Virgin Islands 1,200 0 1,200 197 Lesotho 52 1,050 1,102 198 Northern Mariana Islands 950 42 992 199 Palau 818 23 841 200 Dominica 770 6 776 201 US Virgin Islands 551 8 559 202 Nauru 530 0 530 203 Bermuda 410 0 410 204 Sint Maarten 253 0 253 205 Eswatini 65 100 165 206 Bhutan 7 150 157 207 Aruba 142 0 142 208 Cayman Islands 125 0 125 209 France 90 0 90 210 Botswana 38 15 53 211 Mongolia 15 0 15 212 Monaco 1 0 1 213 Gibraltar 1 0 1 214 Andorra 0 0 0 215 Liechtenstein 0 0 0 216 Luxembourg 0 0 0 217 San Marino 0 0 0 218 Total (2015) 93,736,944 101,084,799 199,741,129 ``` ] --- count: false .panel1-clean_wikipedia_fisheries-auto[ ```r fisheries_2015_raw %>% * as_tibble() # show data in "tidyverse" way, as_tibble ``` ] .panel2-clean_wikipedia_fisheries-auto[ ``` # A tibble: 217 x 4 Country Capture Aquaculture Total <chr> <chr> <chr> <chr> 1 China 17,800,000 63,700,000 81,500,000 2 Indonesia 6,584,419 16,600,000 23,200,000 3 India 5,082,332 5,703,002 10,800,000 4 Vietnam 2,785,940 3,634,531 6,420,471 5 United States 4,931,017 444,369 5,375,386 6 Russia 4,773,413 173,840 4,947,253 7 Japan 3,275,263 1,067,994 4,343,257 8 Philippines 2,027,992 2,200,914 4,228,906 9 Peru 3,811,802 100,187 3,911,989 10 Bangladesh 1,674,770 2,203,554 3,878,324 # … with 207 more rows ``` ] --- count: false .panel1-clean_wikipedia_fisheries-auto[ ```r fisheries_2015_raw %>% as_tibble() %>% # show data in "tidyverse" way, as_tibble # you may need to install the janitor package for this next step # if you have not already done so * janitor::clean_names(case = "snake") # snake is the default too ``` ] .panel2-clean_wikipedia_fisheries-auto[ ``` # A tibble: 217 x 4 country capture aquaculture total <chr> <chr> <chr> <chr> 1 China 17,800,000 63,700,000 81,500,000 2 Indonesia 6,584,419 16,600,000 23,200,000 3 India 5,082,332 5,703,002 10,800,000 4 Vietnam 2,785,940 3,634,531 6,420,471 5 United States 4,931,017 444,369 5,375,386 6 Russia 4,773,413 173,840 4,947,253 7 Japan 3,275,263 1,067,994 4,343,257 8 Philippines 2,027,992 2,200,914 4,228,906 9 Peru 3,811,802 100,187 3,911,989 10 Bangladesh 1,674,770 2,203,554 3,878,324 # … with 207 more rows ``` ] --- count: false .panel1-clean_wikipedia_fisheries-auto[ ```r fisheries_2015_raw %>% as_tibble() %>% # show data in "tidyverse" way, as_tibble # you may need to install the janitor package for this next step # if you have not already done so janitor::clean_names(case = "snake") %>% # snake is the default too # converting columns to numeric 'dbl', which were character 'cha' * mutate(capture = * readr::parse_number(x = capture)) ``` ] .panel2-clean_wikipedia_fisheries-auto[ ``` # A tibble: 217 x 4 country capture aquaculture total <chr> <dbl> <chr> <chr> 1 China 17800000 63,700,000 81,500,000 2 Indonesia 6584419 16,600,000 23,200,000 3 India 5082332 5,703,002 10,800,000 4 Vietnam 2785940 3,634,531 6,420,471 5 United States 4931017 444,369 5,375,386 6 Russia 4773413 173,840 4,947,253 7 Japan 3275263 1,067,994 4,343,257 8 Philippines 2027992 2,200,914 4,228,906 9 Peru 3811802 100,187 3,911,989 10 Bangladesh 1674770 2,203,554 3,878,324 # … with 207 more rows ``` ] --- count: false .panel1-clean_wikipedia_fisheries-auto[ ```r fisheries_2015_raw %>% as_tibble() %>% # show data in "tidyverse" way, as_tibble # you may need to install the janitor package for this next step # if you have not already done so janitor::clean_names(case = "snake") %>% # snake is the default too # converting columns to numeric 'dbl', which were character 'cha' mutate(capture = readr::parse_number(x = capture)) %>% * mutate(aquaculture = * readr::parse_number(x = aquaculture)) ``` ] .panel2-clean_wikipedia_fisheries-auto[ ``` # A tibble: 217 x 4 country capture aquaculture total <chr> <dbl> <dbl> <chr> 1 China 17800000 63700000 81,500,000 2 Indonesia 6584419 16600000 23,200,000 3 India 5082332 5703002 10,800,000 4 Vietnam 2785940 3634531 6,420,471 5 United States 4931017 444369 5,375,386 6 Russia 4773413 173840 4,947,253 7 Japan 3275263 1067994 4,343,257 8 Philippines 2027992 2200914 4,228,906 9 Peru 3811802 100187 3,911,989 10 Bangladesh 1674770 2203554 3,878,324 # … with 207 more rows ``` ] --- count: false .panel1-clean_wikipedia_fisheries-auto[ ```r fisheries_2015_raw %>% as_tibble() %>% # show data in "tidyverse" way, as_tibble # you may need to install the janitor package for this next step # if you have not already done so janitor::clean_names(case = "snake") %>% # snake is the default too # converting columns to numeric 'dbl', which were character 'cha' mutate(capture = readr::parse_number(x = capture)) %>% mutate(aquaculture = readr::parse_number(x = aquaculture)) %>% * mutate(total = * readr::parse_number(x = total)) ``` ] .panel2-clean_wikipedia_fisheries-auto[ ``` # A tibble: 217 x 4 country capture aquaculture total <chr> <dbl> <dbl> <dbl> 1 China 17800000 63700000 81500000 2 Indonesia 6584419 16600000 23200000 3 India 5082332 5703002 10800000 4 Vietnam 2785940 3634531 6420471 5 United States 4931017 444369 5375386 6 Russia 4773413 173840 4947253 7 Japan 3275263 1067994 4343257 8 Philippines 2027992 2200914 4228906 9 Peru 3811802 100187 3911989 10 Bangladesh 1674770 2203554 3878324 # … with 207 more rows ``` ] --- count: false .panel1-clean_wikipedia_fisheries-auto[ ```r fisheries_2015_raw %>% as_tibble() %>% # show data in "tidyverse" way, as_tibble # you may need to install the janitor package for this next step # if you have not already done so janitor::clean_names(case = "snake") %>% # snake is the default too # converting columns to numeric 'dbl', which were character 'cha' mutate(capture = readr::parse_number(x = capture)) %>% mutate(aquaculture = readr::parse_number(x = aquaculture)) %>% mutate(total = readr::parse_number(x = total)) %>% # removing white spaces before and after a string * mutate(country = str_trim(string = country)) ``` ] .panel2-clean_wikipedia_fisheries-auto[ ``` # A tibble: 217 x 4 country capture aquaculture total <chr> <dbl> <dbl> <dbl> 1 China 17800000 63700000 81500000 2 Indonesia 6584419 16600000 23200000 3 India 5082332 5703002 10800000 4 Vietnam 2785940 3634531 6420471 5 United States 4931017 444369 5375386 6 Russia 4773413 173840 4947253 7 Japan 3275263 1067994 4343257 8 Philippines 2027992 2200914 4228906 9 Peru 3811802 100187 3911989 10 Bangladesh 1674770 2203554 3878324 # … with 207 more rows ``` ] --- count: false .panel1-clean_wikipedia_fisheries-auto[ ```r fisheries_2015_raw %>% as_tibble() %>% # show data in "tidyverse" way, as_tibble # you may need to install the janitor package for this next step # if you have not already done so janitor::clean_names(case = "snake") %>% # snake is the default too # converting columns to numeric 'dbl', which were character 'cha' mutate(capture = readr::parse_number(x = capture)) %>% mutate(aquaculture = readr::parse_number(x = aquaculture)) %>% mutate(total = readr::parse_number(x = total)) %>% # removing white spaces before and after a string mutate(country = str_trim(string = country)) %>% # creating region variable # using countrycode function from countrycode package # if you haven't done so, * mutate(region = * countrycode::countrycode( * sourcevar = country, * origin = "country.name", * dest = "region")) ``` ] .panel2-clean_wikipedia_fisheries-auto[ ``` # A tibble: 217 x 5 country capture aquaculture total region <chr> <dbl> <dbl> <dbl> <chr> 1 China 17800000 63700000 81500000 Eastern Asia 2 Indonesia 6584419 16600000 23200000 South-Eastern Asia 3 India 5082332 5703002 10800000 Southern Asia 4 Vietnam 2785940 3634531 6420471 South-Eastern Asia 5 United States 4931017 444369 5375386 Northern America 6 Russia 4773413 173840 4947253 Eastern Europe 7 Japan 3275263 1067994 4343257 Eastern Asia 8 Philippines 2027992 2200914 4228906 South-Eastern Asia 9 Peru 3811802 100187 3911989 South America 10 Bangladesh 1674770 2203554 3878324 Southern Asia # … with 207 more rows ``` ] --- count: false .panel1-clean_wikipedia_fisheries-auto[ ```r fisheries_2015_raw %>% as_tibble() %>% # show data in "tidyverse" way, as_tibble # you may need to install the janitor package for this next step # if you have not already done so janitor::clean_names(case = "snake") %>% # snake is the default too # converting columns to numeric 'dbl', which were character 'cha' mutate(capture = readr::parse_number(x = capture)) %>% mutate(aquaculture = readr::parse_number(x = aquaculture)) %>% mutate(total = readr::parse_number(x = total)) %>% # removing white spaces before and after a string mutate(country = str_trim(string = country)) %>% # creating region variable # using countrycode function from countrycode package # if you haven't done so, mutate(region = countrycode::countrycode( sourcevar = country, origin = "country.name", dest = "region")) %>% # country codes are useful for merging data * mutate(iso3c = * countrycode::countrycode( * sourcevar = country, * origin = "country.name", * dest = "iso3c")) ``` ] .panel2-clean_wikipedia_fisheries-auto[ ``` # A tibble: 217 x 6 country capture aquaculture total region iso3c <chr> <dbl> <dbl> <dbl> <chr> <chr> 1 China 17800000 63700000 81500000 Eastern Asia CHN 2 Indonesia 6584419 16600000 23200000 South-Eastern Asia IDN 3 India 5082332 5703002 10800000 Southern Asia IND 4 Vietnam 2785940 3634531 6420471 South-Eastern Asia VNM 5 United States 4931017 444369 5375386 Northern America USA 6 Russia 4773413 173840 4947253 Eastern Europe RUS 7 Japan 3275263 1067994 4343257 Eastern Asia JPN 8 Philippines 2027992 2200914 4228906 South-Eastern Asia PHL 9 Peru 3811802 100187 3911989 South America PER 10 Bangladesh 1674770 2203554 3878324 Southern Asia BGD # … with 207 more rows ``` ] --- count: false .panel1-clean_wikipedia_fisheries-auto[ ```r fisheries_2015_raw %>% as_tibble() %>% # show data in "tidyverse" way, as_tibble # you may need to install the janitor package for this next step # if you have not already done so janitor::clean_names(case = "snake") %>% # snake is the default too # converting columns to numeric 'dbl', which were character 'cha' mutate(capture = readr::parse_number(x = capture)) %>% mutate(aquaculture = readr::parse_number(x = aquaculture)) %>% mutate(total = readr::parse_number(x = total)) %>% # removing white spaces before and after a string mutate(country = str_trim(string = country)) %>% # creating region variable # using countrycode function from countrycode package # if you haven't done so, mutate(region = countrycode::countrycode( sourcevar = country, origin = "country.name", dest = "region")) %>% # country codes are useful for merging data mutate(iso3c = countrycode::countrycode( sourcevar = country, origin = "country.name", dest = "iso3c")) %>% # one row is an aggregation - deleting this * filter(country != "Total (2015)") ``` ] .panel2-clean_wikipedia_fisheries-auto[ ``` # A tibble: 216 x 6 country capture aquaculture total region iso3c <chr> <dbl> <dbl> <dbl> <chr> <chr> 1 China 17800000 63700000 81500000 Eastern Asia CHN 2 Indonesia 6584419 16600000 23200000 South-Eastern Asia IDN 3 India 5082332 5703002 10800000 Southern Asia IND 4 Vietnam 2785940 3634531 6420471 South-Eastern Asia VNM 5 United States 4931017 444369 5375386 Northern America USA 6 Russia 4773413 173840 4947253 Eastern Europe RUS 7 Japan 3275263 1067994 4343257 Eastern Asia JPN 8 Philippines 2027992 2200914 4228906 South-Eastern Asia PHL 9 Peru 3811802 100187 3911989 South America PER 10 Bangladesh 1674770 2203554 3878324 Southern Asia BGD # … with 206 more rows ``` ] --- count: false .panel1-clean_wikipedia_fisheries-auto[ ```r fisheries_2015_raw %>% as_tibble() %>% # show data in "tidyverse" way, as_tibble # you may need to install the janitor package for this next step # if you have not already done so janitor::clean_names(case = "snake") %>% # snake is the default too # converting columns to numeric 'dbl', which were character 'cha' mutate(capture = readr::parse_number(x = capture)) %>% mutate(aquaculture = readr::parse_number(x = aquaculture)) %>% mutate(total = readr::parse_number(x = total)) %>% # removing white spaces before and after a string mutate(country = str_trim(string = country)) %>% # creating region variable # using countrycode function from countrycode package # if you haven't done so, mutate(region = countrycode::countrycode( sourcevar = country, origin = "country.name", dest = "region")) %>% # country codes are useful for merging data mutate(iso3c = countrycode::countrycode( sourcevar = country, origin = "country.name", dest = "iso3c")) %>% # one row is an aggregation - deleting this filter(country != "Total (2015)") -> *fisheries_2015 ``` ] .panel2-clean_wikipedia_fisheries-auto[ ] <style> .panel1-clean_wikipedia_fisheries-auto { color: black; width: 38.6060606060606%; hight: 32%; float: left; padding-left: 1%; font-size: 80% } .panel2-clean_wikipedia_fisheries-auto { color: black; width: 59.3939393939394%; hight: 32%; float: left; padding-left: 1%; font-size: 80% } .panel3-clean_wikipedia_fisheries-auto { color: black; width: NA%; hight: 33%; float: left; padding-left: 1%; font-size: 80% } </style> --- ## Step 4: Saving out clean data: Now you can save out your work in your data_products folder as follows: ```r save(fisheries_2015, file = "data_products/fisheries_2015.Rdata") ``` This will typically be the last step in your 01_data_cleaning.Rmd file. --- ## Step 5: Time for Analysis!!! Yay! The first code chunk in 02_analysis.Rmd should look like this. This makes the tidyverse and the cleaned data available at the start of your analysis. ```r library(tidyverse) # for wrangling (dplyr and tidyr) and plotting (ggplot) load("data_products/fisheries_2015.Rdata") ``` --- ## Project overview Now your'll have some files in raw_data and data_products folders: ```r final_project ├── 01_data_cleaning.Rmd <- work files ├── 02_analysis.Rmd <- work files ├── final_project.Rproj <- you declared your project └── raw_data <- Folder └── wikipedia_fisheries.html <- New! html file, containing table ├── data_products <- Folder │ └── fisheries_2015.Rdata <- New! cleaned data ├── figures <- Folder ├── literature <- Folder ``` --- # Example 2: ## Step 0: find data! You might be interested in % of land area in a country that is forest use. This is an indicator available at worldbank.org. <img src="figures/world_bank_url.png" width="90%" /> --- ## Step 1: get data This time, I just downloaded the data from the website, saving it to my *raw_data* folder. --- ## Step 2: scan in the data and create `wb_percent_forest_raw` ### Step 2a: Use RStudio importer tool Navigate to your raw_data folder in the "files" pane. Then click on data and then choose "Import Dataset" ![](figures/import_dataset.png) --- ### Step 2b: Make adjustments if needed This is what I see initially. <img src="figures/world_bank_forest_area_importer.png" width="90%" /> --- ### Step 2b: Make adjustments if needed There is 'meta data' in the first three rows. The dataset doesn't really get start until row 4. Therefore, in the **import options** section, I'll adjust the option for **Skip** to *3*. I then click **update** in the righthand corner. <img src="figures/adjust_import.png" width="90%" /> --- ### Step 2c: Copy everything **except** View() code from **Code Preview** Copy everything *but* View() code from **Code Preview** and paste this into a code chunk in the file `01_data_cleaning.Rmd`. ![](figures/copy_code_from_import_tool.png) --- ### Step 2c: Change default name You'll paste code that looks like this: ```r library(readxl) API_AG_LND_FRST_ZS_DS2_en_excel_v2_248616 <- read_excel("raw_data/API_AG.LND.FRST.ZS_DS2_en_excel_v2_248616.xls", skip = 3) ``` But this is a bit of a long object name that is automatically generated, so you might modify this to something simpler, with the '_raw' suffix. ```r library(readxl) wb_percent_forest_raw <- read_excel("raw_data/API_AG.LND.FRST.ZS_DS2_en_excel_v2_248616.xls", skip = 3) ``` Execute the code! --- ## Step 3: Load tidyverse and clean the data Now, starting with the object you created by scanning in the data and assigning it to the object `wb_percent_forest_raw`, you are ready to perform cleaning steps. --- count: false .panel1-wb_cleaning-auto[ ```r *wb_percent_forest_raw ``` ] .panel2-wb_cleaning-auto[ ``` # A tibble: 264 x 63 `Country Name` `Country Code` `Indicator Name` `Indicator Code` `1960` `1961` <chr> <chr> <chr> <chr> <lgl> <lgl> 1 Aruba ABW Forest area (% … AG.LND.FRST.ZS NA NA 2 Afghanistan AFG Forest area (% … AG.LND.FRST.ZS NA NA 3 Angola AGO Forest area (% … AG.LND.FRST.ZS NA NA 4 Albania ALB Forest area (% … AG.LND.FRST.ZS NA NA 5 Andorra AND Forest area (% … AG.LND.FRST.ZS NA NA 6 Arab World ARB Forest area (% … AG.LND.FRST.ZS NA NA 7 United Arab E… ARE Forest area (% … AG.LND.FRST.ZS NA NA 8 Argentina ARG Forest area (% … AG.LND.FRST.ZS NA NA 9 Armenia ARM Forest area (% … AG.LND.FRST.ZS NA NA 10 American Samoa ASM Forest area (% … AG.LND.FRST.ZS NA NA # … with 254 more rows, and 57 more variables: `1962` <lgl>, `1963` <lgl>, # `1964` <lgl>, `1965` <lgl>, `1966` <lgl>, `1967` <lgl>, `1968` <lgl>, # `1969` <lgl>, `1970` <lgl>, `1971` <lgl>, `1972` <lgl>, `1973` <lgl>, # `1974` <lgl>, `1975` <lgl>, `1976` <lgl>, `1977` <lgl>, `1978` <lgl>, # `1979` <lgl>, `1980` <lgl>, `1981` <lgl>, `1982` <lgl>, `1983` <lgl>, # `1984` <lgl>, `1985` <lgl>, `1986` <lgl>, `1987` <lgl>, `1988` <lgl>, # `1989` <lgl>, `1990` <dbl>, `1991` <dbl>, `1992` <dbl>, `1993` <dbl>, # `1994` <dbl>, `1995` <dbl>, `1996` <dbl>, `1997` <dbl>, `1998` <dbl>, # `1999` <dbl>, `2000` <dbl>, `2001` <dbl>, `2002` <dbl>, `2003` <dbl>, # `2004` <dbl>, `2005` <dbl>, `2006` <dbl>, `2007` <dbl>, `2008` <dbl>, # `2009` <dbl>, `2010` <dbl>, `2011` <dbl>, `2012` <dbl>, `2013` <dbl>, # `2014` <dbl>, `2015` <dbl>, `2016` <dbl>, `2017` <lgl>, `2018` <lgl> ``` ] --- count: false .panel1-wb_cleaning-auto[ ```r wb_percent_forest_raw %>% # change names to snake case * janitor::clean_names() ``` ] .panel2-wb_cleaning-auto[ ``` # A tibble: 264 x 63 country_name country_code indicator_name indicator_code x1960 x1961 x1962 <chr> <chr> <chr> <chr> <lgl> <lgl> <lgl> 1 Aruba ABW Forest area (… AG.LND.FRST.ZS NA NA NA 2 Afghanistan AFG Forest area (… AG.LND.FRST.ZS NA NA NA 3 Angola AGO Forest area (… AG.LND.FRST.ZS NA NA NA 4 Albania ALB Forest area (… AG.LND.FRST.ZS NA NA NA 5 Andorra AND Forest area (… AG.LND.FRST.ZS NA NA NA 6 Arab World ARB Forest area (… AG.LND.FRST.ZS NA NA NA 7 United Arab… ARE Forest area (… AG.LND.FRST.ZS NA NA NA 8 Argentina ARG Forest area (… AG.LND.FRST.ZS NA NA NA 9 Armenia ARM Forest area (… AG.LND.FRST.ZS NA NA NA 10 American Sa… ASM Forest area (… AG.LND.FRST.ZS NA NA NA # … with 254 more rows, and 56 more variables: x1963 <lgl>, x1964 <lgl>, # x1965 <lgl>, x1966 <lgl>, x1967 <lgl>, x1968 <lgl>, x1969 <lgl>, # x1970 <lgl>, x1971 <lgl>, x1972 <lgl>, x1973 <lgl>, x1974 <lgl>, # x1975 <lgl>, x1976 <lgl>, x1977 <lgl>, x1978 <lgl>, x1979 <lgl>, # x1980 <lgl>, x1981 <lgl>, x1982 <lgl>, x1983 <lgl>, x1984 <lgl>, # x1985 <lgl>, x1986 <lgl>, x1987 <lgl>, x1988 <lgl>, x1989 <lgl>, # x1990 <dbl>, x1991 <dbl>, x1992 <dbl>, x1993 <dbl>, x1994 <dbl>, # x1995 <dbl>, x1996 <dbl>, x1997 <dbl>, x1998 <dbl>, x1999 <dbl>, # x2000 <dbl>, x2001 <dbl>, x2002 <dbl>, x2003 <dbl>, x2004 <dbl>, # x2005 <dbl>, x2006 <dbl>, x2007 <dbl>, x2008 <dbl>, x2009 <dbl>, # x2010 <dbl>, x2011 <dbl>, x2012 <dbl>, x2013 <dbl>, x2014 <dbl>, # x2015 <dbl>, x2016 <dbl>, x2017 <lgl>, x2018 <lgl> ``` ] --- count: false .panel1-wb_cleaning-auto[ ```r wb_percent_forest_raw %>% # change names to snake case janitor::clean_names() %>% # get rid of columns that are not variables # indicator name and indicator code * select(-indicator_name, -indicator_code) ``` ] .panel2-wb_cleaning-auto[ ``` # A tibble: 264 x 61 country_name country_code x1960 x1961 x1962 x1963 x1964 x1965 x1966 x1967 <chr> <chr> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> 1 Aruba ABW NA NA NA NA NA NA NA NA 2 Afghanistan AFG NA NA NA NA NA NA NA NA 3 Angola AGO NA NA NA NA NA NA NA NA 4 Albania ALB NA NA NA NA NA NA NA NA 5 Andorra AND NA NA NA NA NA NA NA NA 6 Arab World ARB NA NA NA NA NA NA NA NA 7 United Arab… ARE NA NA NA NA NA NA NA NA 8 Argentina ARG NA NA NA NA NA NA NA NA 9 Armenia ARM NA NA NA NA NA NA NA NA 10 American Sa… ASM NA NA NA NA NA NA NA NA # … with 254 more rows, and 51 more variables: x1968 <lgl>, x1969 <lgl>, # x1970 <lgl>, x1971 <lgl>, x1972 <lgl>, x1973 <lgl>, x1974 <lgl>, # x1975 <lgl>, x1976 <lgl>, x1977 <lgl>, x1978 <lgl>, x1979 <lgl>, # x1980 <lgl>, x1981 <lgl>, x1982 <lgl>, x1983 <lgl>, x1984 <lgl>, # x1985 <lgl>, x1986 <lgl>, x1987 <lgl>, x1988 <lgl>, x1989 <lgl>, # x1990 <dbl>, x1991 <dbl>, x1992 <dbl>, x1993 <dbl>, x1994 <dbl>, # x1995 <dbl>, x1996 <dbl>, x1997 <dbl>, x1998 <dbl>, x1999 <dbl>, # x2000 <dbl>, x2001 <dbl>, x2002 <dbl>, x2003 <dbl>, x2004 <dbl>, # x2005 <dbl>, x2006 <dbl>, x2007 <dbl>, x2008 <dbl>, x2009 <dbl>, # x2010 <dbl>, x2011 <dbl>, x2012 <dbl>, x2013 <dbl>, x2014 <dbl>, # x2015 <dbl>, x2016 <dbl>, x2017 <lgl>, x2018 <lgl> ``` ] --- count: false .panel1-wb_cleaning-auto[ ```r wb_percent_forest_raw %>% # change names to snake case janitor::clean_names() %>% # get rid of columns that are not variables # indicator name and indicator code select(-indicator_name, -indicator_code) %>% # suppose you are interested only in 2000 to 2005 * select(country_name, country_code, x2000:x2005) ``` ] .panel2-wb_cleaning-auto[ ``` # A tibble: 264 x 8 country_name country_code x2000 x2001 x2002 x2003 x2004 x2005 <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> 1 Aruba ABW 2.33 2.33 2.33 2.33 2.33 2.33 2 Afghanistan AFG 2.07 2.07 2.07 2.07 2.07 2.07 3 Angola AGO 47.9 47.8 47.7 47.6 47.5 47.4 4 Albania ALB 28.1 28.2 28.3 28.4 28.5 28.6 5 Andorra AND 34.0 34.0 34.0 34.0 34.0 34.0 6 Arab World ARB 1.68 1.68 1.68 1.69 1.69 1.69 7 United Arab Emirates ARE 4.36 4.37 4.38 4.38 4.39 4.39 8 Argentina ARG 11.6 11.5 11.4 11.3 11.2 11.0 9 Armenia ARM 11.7 11.7 11.7 11.7 11.7 11.7 10 American Samoa ASM 90.2 90.1 89.9 89.8 89.5 89.4 # … with 254 more rows ``` ] --- count: false .panel1-wb_cleaning-auto[ ```r wb_percent_forest_raw %>% # change names to snake case janitor::clean_names() %>% # get rid of columns that are not variables # indicator name and indicator code select(-indicator_name, -indicator_code) %>% # suppose you are interested only in 2000 to 2005 select(country_name, country_code, x2000:x2005) %>% # data is "wide", but to use for analysis # we transform it to "long" * gather(key = "year", * value = "percent_forest", * x2000:x2005) ``` ] .panel2-wb_cleaning-auto[ ``` # A tibble: 1,584 x 4 country_name country_code year percent_forest <chr> <chr> <chr> <dbl> 1 Aruba ABW x2000 2.33 2 Afghanistan AFG x2000 2.07 3 Angola AGO x2000 47.9 4 Albania ALB x2000 28.1 5 Andorra AND x2000 34.0 6 Arab World ARB x2000 1.68 7 United Arab Emirates ARE x2000 4.36 8 Argentina ARG x2000 11.6 9 Armenia ARM x2000 11.7 10 American Samoa ASM x2000 90.2 # … with 1,574 more rows ``` ] --- count: false .panel1-wb_cleaning-auto[ ```r wb_percent_forest_raw %>% # change names to snake case janitor::clean_names() %>% # get rid of columns that are not variables # indicator name and indicator code select(-indicator_name, -indicator_code) %>% # suppose you are interested only in 2000 to 2005 select(country_name, country_code, x2000:x2005) %>% # data is "wide", but to use for analysis # we transform it to "long" gather(key = "year", value = "percent_forest", x2000:x2005) %>% * mutate(year = str_remove(year, "x")) ``` ] .panel2-wb_cleaning-auto[ ``` # A tibble: 1,584 x 4 country_name country_code year percent_forest <chr> <chr> <chr> <dbl> 1 Aruba ABW 2000 2.33 2 Afghanistan AFG 2000 2.07 3 Angola AGO 2000 47.9 4 Albania ALB 2000 28.1 5 Andorra AND 2000 34.0 6 Arab World ARB 2000 1.68 7 United Arab Emirates ARE 2000 4.36 8 Argentina ARG 2000 11.6 9 Armenia ARM 2000 11.7 10 American Samoa ASM 2000 90.2 # … with 1,574 more rows ``` ] --- count: false .panel1-wb_cleaning-auto[ ```r wb_percent_forest_raw %>% # change names to snake case janitor::clean_names() %>% # get rid of columns that are not variables # indicator name and indicator code select(-indicator_name, -indicator_code) %>% # suppose you are interested only in 2000 to 2005 select(country_name, country_code, x2000:x2005) %>% # data is "wide", but to use for analysis # we transform it to "long" gather(key = "year", value = "percent_forest", x2000:x2005) %>% mutate(year = str_remove(year, "x")) %>% * mutate(year = as.numeric(year)) ``` ] .panel2-wb_cleaning-auto[ ``` # A tibble: 1,584 x 4 country_name country_code year percent_forest <chr> <chr> <dbl> <dbl> 1 Aruba ABW 2000 2.33 2 Afghanistan AFG 2000 2.07 3 Angola AGO 2000 47.9 4 Albania ALB 2000 28.1 5 Andorra AND 2000 34.0 6 Arab World ARB 2000 1.68 7 United Arab Emirates ARE 2000 4.36 8 Argentina ARG 2000 11.6 9 Armenia ARM 2000 11.7 10 American Samoa ASM 2000 90.2 # … with 1,574 more rows ``` ] --- count: false .panel1-wb_cleaning-auto[ ```r wb_percent_forest_raw %>% # change names to snake case janitor::clean_names() %>% # get rid of columns that are not variables # indicator name and indicator code select(-indicator_name, -indicator_code) %>% # suppose you are interested only in 2000 to 2005 select(country_name, country_code, x2000:x2005) %>% # data is "wide", but to use for analysis # we transform it to "long" gather(key = "year", value = "percent_forest", x2000:x2005) %>% mutate(year = str_remove(year, "x")) %>% mutate(year = as.numeric(year)) %>% # reordering data if you like * arrange(-percent_forest) ``` ] .panel2-wb_cleaning-auto[ ``` # A tibble: 1,584 x 4 country_name country_code year percent_forest <chr> <chr> <dbl> <dbl> 1 Suriname SUR 2000 98.7 2 Suriname SUR 2001 98.6 3 Suriname SUR 2002 98.6 4 Suriname SUR 2003 98.6 5 Suriname SUR 2004 98.6 6 Suriname SUR 2005 98.5 7 Micronesia, Fed. Sts. FSM 2005 91.4 8 Micronesia, Fed. Sts. FSM 2004 91.4 9 Micronesia, Fed. Sts. FSM 2003 91.3 10 Micronesia, Fed. Sts. FSM 2002 91.3 # … with 1,574 more rows ``` ] --- count: false .panel1-wb_cleaning-auto[ ```r wb_percent_forest_raw %>% # change names to snake case janitor::clean_names() %>% # get rid of columns that are not variables # indicator name and indicator code select(-indicator_name, -indicator_code) %>% # suppose you are interested only in 2000 to 2005 select(country_name, country_code, x2000:x2005) %>% # data is "wide", but to use for analysis # we transform it to "long" gather(key = "year", value = "percent_forest", x2000:x2005) %>% mutate(year = str_remove(year, "x")) %>% mutate(year = as.numeric(year)) %>% # reordering data if you like arrange(-percent_forest) %>% # rename variable * rename(wb_country_code = country_code) ``` ] .panel2-wb_cleaning-auto[ ``` # A tibble: 1,584 x 4 country_name wb_country_code year percent_forest <chr> <chr> <dbl> <dbl> 1 Suriname SUR 2000 98.7 2 Suriname SUR 2001 98.6 3 Suriname SUR 2002 98.6 4 Suriname SUR 2003 98.6 5 Suriname SUR 2004 98.6 6 Suriname SUR 2005 98.5 7 Micronesia, Fed. Sts. FSM 2005 91.4 8 Micronesia, Fed. Sts. FSM 2004 91.4 9 Micronesia, Fed. Sts. FSM 2003 91.3 10 Micronesia, Fed. Sts. FSM 2002 91.3 # … with 1,574 more rows ``` ] --- count: false .panel1-wb_cleaning-auto[ ```r wb_percent_forest_raw %>% # change names to snake case janitor::clean_names() %>% # get rid of columns that are not variables # indicator name and indicator code select(-indicator_name, -indicator_code) %>% # suppose you are interested only in 2000 to 2005 select(country_name, country_code, x2000:x2005) %>% # data is "wide", but to use for analysis # we transform it to "long" gather(key = "year", value = "percent_forest", x2000:x2005) %>% mutate(year = str_remove(year, "x")) %>% mutate(year = as.numeric(year)) %>% # reordering data if you like arrange(-percent_forest) %>% # rename variable rename(wb_country_code = country_code) %>% * mutate(continent = * countrycode::countrycode( * sourcevar = wb_country_code, * origin = "wb", * destination = "continent" * )) ``` ] .panel2-wb_cleaning-auto[ ``` # A tibble: 1,584 x 5 country_name wb_country_code year percent_forest continent <chr> <chr> <dbl> <dbl> <chr> 1 Suriname SUR 2000 98.7 Americas 2 Suriname SUR 2001 98.6 Americas 3 Suriname SUR 2002 98.6 Americas 4 Suriname SUR 2003 98.6 Americas 5 Suriname SUR 2004 98.6 Americas 6 Suriname SUR 2005 98.5 Americas 7 Micronesia, Fed. Sts. FSM 2005 91.4 Oceania 8 Micronesia, Fed. Sts. FSM 2004 91.4 Oceania 9 Micronesia, Fed. Sts. FSM 2003 91.3 Oceania 10 Micronesia, Fed. Sts. FSM 2002 91.3 Oceania # … with 1,574 more rows ``` ] --- count: false .panel1-wb_cleaning-auto[ ```r wb_percent_forest_raw %>% # change names to snake case janitor::clean_names() %>% # get rid of columns that are not variables # indicator name and indicator code select(-indicator_name, -indicator_code) %>% # suppose you are interested only in 2000 to 2005 select(country_name, country_code, x2000:x2005) %>% # data is "wide", but to use for analysis # we transform it to "long" gather(key = "year", value = "percent_forest", x2000:x2005) %>% mutate(year = str_remove(year, "x")) %>% mutate(year = as.numeric(year)) %>% # reordering data if you like arrange(-percent_forest) %>% # rename variable rename(wb_country_code = country_code) %>% mutate(continent = countrycode::countrycode( sourcevar = wb_country_code, origin = "wb", destination = "continent" )) -> *wb_percent_forest ``` ] .panel2-wb_cleaning-auto[ ] <style> .panel1-wb_cleaning-auto { color: black; width: 38.6060606060606%; hight: 32%; float: left; padding-left: 1%; font-size: 80% } .panel2-wb_cleaning-auto { color: black; width: 59.3939393939394%; hight: 32%; float: left; padding-left: 1%; font-size: 80% } .panel3-wb_cleaning-auto { color: black; width: NA%; hight: 33%; float: left; padding-left: 1%; font-size: 80% } </style> --- ## Step 4: Save Clean Data ```r save(wb_percent_forest, file = "data_products/wb_percent_forest.Rdata") ``` --- ## Step 5: Time for Analysis! The first code chunk in 02_analysis.Rmd should look like this. This makes the tidyverse and the cleaned data available at the start of your analysis. ```r library(tidyverse) load(file = "data_products/wb_percent_forest.Rdata") ``` --- # Deeper Dives: country codes, dates, string manipulation The examples above should give you some intuition for the look and feel of some real data cleaning pipelines. But they of course are not very comprehensive. In this section, let's focus manipulation using a few packages. - countrycode - translates between one name or id for a country to another - lubridate - working with dates - stringr - character manipulation --- ## Country Code Countries can have many names. Consider the United States. You can call it the US, the USA, the United States of America and so on. The package countrycode can be helpful in translating between different ways to refer to a specific country. Let's see how you might use it. --- count: false .panel1-countrycode-auto[ ```r *library(gapminder) ``` ] .panel2-countrycode-auto[ ] --- count: false .panel1-countrycode-auto[ ```r library(gapminder) *library(countrycode) ``` ] .panel2-countrycode-auto[ ] --- count: false .panel1-countrycode-auto[ ```r library(gapminder) library(countrycode) *gapminder ``` ] .panel2-countrycode-auto[ ``` # A tibble: 1,704 x 6 country continent year lifeExp pop gdpPercap <fct> <fct> <int> <dbl> <int> <dbl> 1 Afghanistan Asia 1952 28.8 8425333 779. 2 Afghanistan Asia 1957 30.3 9240934 821. 3 Afghanistan Asia 1962 32.0 10267083 853. 4 Afghanistan Asia 1967 34.0 11537966 836. 5 Afghanistan Asia 1972 36.1 13079460 740. 6 Afghanistan Asia 1977 38.4 14880372 786. 7 Afghanistan Asia 1982 39.9 12881816 978. 8 Afghanistan Asia 1987 40.8 13867957 852. 9 Afghanistan Asia 1992 41.7 16317921 649. 10 Afghanistan Asia 1997 41.8 22227415 635. # … with 1,694 more rows ``` ] --- count: false .panel1-countrycode-auto[ ```r library(gapminder) library(countrycode) gapminder %>% * distinct(country) ``` ] .panel2-countrycode-auto[ ``` # A tibble: 142 x 1 country <fct> 1 Afghanistan 2 Albania 3 Algeria 4 Angola 5 Argentina 6 Australia 7 Austria 8 Bahrain 9 Bangladesh 10 Belgium # … with 132 more rows ``` ] --- count: false .panel1-countrycode-auto[ ```r library(gapminder) library(countrycode) gapminder %>% distinct(country) %>% * mutate(iso3c = countrycode( * sourcevar = country, * origin = "country.name", * destination = "iso3c" * )) ``` ] .panel2-countrycode-auto[ ``` # A tibble: 142 x 2 country iso3c <fct> <chr> 1 Afghanistan AFG 2 Albania ALB 3 Algeria DZA 4 Angola AGO 5 Argentina ARG 6 Australia AUS 7 Austria AUT 8 Bahrain BHR 9 Bangladesh BGD 10 Belgium BEL # … with 132 more rows ``` ] --- count: false .panel1-countrycode-auto[ ```r library(gapminder) library(countrycode) gapminder %>% distinct(country) %>% mutate(iso3c = countrycode( sourcevar = country, origin = "country.name", destination = "iso3c" )) %>% * mutate(cowc = countrycode( * s = country, * origin = "country.name", * d = "cowc") * ) ``` ] .panel2-countrycode-auto[ ``` # A tibble: 142 x 3 country iso3c cowc <fct> <chr> <chr> 1 Afghanistan AFG AFG 2 Albania ALB ALB 3 Algeria DZA ALG 4 Angola AGO ANG 5 Argentina ARG ARG 6 Australia AUS AUL 7 Austria AUT AUS 8 Bahrain BHR BAH 9 Bangladesh BGD BNG 10 Belgium BEL BEL # … with 132 more rows ``` ] --- count: false .panel1-countrycode-auto[ ```r library(gapminder) library(countrycode) gapminder %>% distinct(country) %>% mutate(iso3c = countrycode( sourcevar = country, origin = "country.name", destination = "iso3c" )) %>% mutate(cowc = countrycode( s = country, origin = "country.name", d = "cowc") ) %>% * mutate(cown = countrycode( * s = cowc, * origin = "cowc", * d = "cown") * ) ``` ] .panel2-countrycode-auto[ ``` # A tibble: 142 x 4 country iso3c cowc cown <fct> <chr> <chr> <int> 1 Afghanistan AFG AFG 700 2 Albania ALB ALB 339 3 Algeria DZA ALG 615 4 Angola AGO ANG 540 5 Argentina ARG ARG 160 6 Australia AUS AUL 900 7 Austria AUT AUS 305 8 Bahrain BHR BAH 692 9 Bangladesh BGD BNG 771 10 Belgium BEL BEL 211 # … with 132 more rows ``` ] --- count: false .panel1-countrycode-auto[ ```r library(gapminder) library(countrycode) gapminder %>% distinct(country) %>% mutate(iso3c = countrycode( sourcevar = country, origin = "country.name", destination = "iso3c" )) %>% mutate(cowc = countrycode( s = country, origin = "country.name", d = "cowc") ) %>% mutate(cown = countrycode( s = cowc, origin = "cowc", d = "cown") ) %>% * mutate(eu28 = countrycode( * s = country, * origin = "country.name", * d = "eu28") * ) ``` ] .panel2-countrycode-auto[ ``` # A tibble: 142 x 5 country iso3c cowc cown eu28 <fct> <chr> <chr> <int> <chr> 1 Afghanistan AFG AFG 700 <NA> 2 Albania ALB ALB 339 <NA> 3 Algeria DZA ALG 615 <NA> 4 Angola AGO ANG 540 <NA> 5 Argentina ARG ARG 160 <NA> 6 Australia AUS AUL 900 <NA> 7 Austria AUT AUS 305 EU 8 Bahrain BHR BAH 692 <NA> 9 Bangladesh BGD BNG 771 <NA> 10 Belgium BEL BEL 211 EU # … with 132 more rows ``` ] --- count: false .panel1-countrycode-auto[ ```r library(gapminder) library(countrycode) gapminder %>% distinct(country) %>% mutate(iso3c = countrycode( sourcevar = country, origin = "country.name", destination = "iso3c" )) %>% mutate(cowc = countrycode( s = country, origin = "country.name", d = "cowc") ) %>% mutate(cown = countrycode( s = cowc, origin = "cowc", d = "cown") ) %>% mutate(eu28 = countrycode( s = country, origin = "country.name", d = "eu28") ) %>% * mutate(eu28 = ifelse( * is.na(eu28), "Not EU", "EU") * ) ``` ] .panel2-countrycode-auto[ ``` # A tibble: 142 x 5 country iso3c cowc cown eu28 <fct> <chr> <chr> <int> <chr> 1 Afghanistan AFG AFG 700 Not EU 2 Albania ALB ALB 339 Not EU 3 Algeria DZA ALG 615 Not EU 4 Angola AGO ANG 540 Not EU 5 Argentina ARG ARG 160 Not EU 6 Australia AUS AUL 900 Not EU 7 Austria AUT AUS 305 EU 8 Bahrain BHR BAH 692 Not EU 9 Bangladesh BGD BNG 771 Not EU 10 Belgium BEL BEL 211 EU # … with 132 more rows ``` ] <style> .panel1-countrycode-auto { color: black; width: 38.6060606060606%; hight: 32%; float: left; padding-left: 1%; font-size: 80% } .panel2-countrycode-auto { color: black; width: 59.3939393939394%; hight: 32%; float: left; padding-left: 1%; font-size: 80% } .panel3-countrycode-auto { color: black; width: NA%; hight: 33%; float: left; padding-left: 1%; font-size: 80% } </style> --- ## Lubridate Dates are a special class of data, and there are many ways to express dates. Let's have a look at how to use the lubridate package to "move around" between dates. We'll look at a very small dataset of four US holidays. ```r holidays_2019 ``` ``` # A tibble: 4 x 2 date holiday <chr> <chr> 1 January 1 New Year's Day 2 January 21 Martin Luther King Day 3 January 24 Belly Laugh Day 4 February 2 Groundhog Day ``` --- count: false .panel1-lubridate-auto[ ```r *library(lubridate) ``` ] .panel2-lubridate-auto[ ] --- count: false .panel1-lubridate-auto[ ```r library(lubridate) *holidays_2019 ``` ] .panel2-lubridate-auto[ ``` # A tibble: 4 x 2 date holiday <chr> <chr> 1 January 1 New Year's Day 2 January 21 Martin Luther King Day 3 January 24 Belly Laugh Day 4 February 2 Groundhog Day ``` ] --- count: false .panel1-lubridate-auto[ ```r library(lubridate) holidays_2019 %>% * mutate(date = paste0(date, ", 2019")) ``` ] .panel2-lubridate-auto[ ``` # A tibble: 4 x 2 date holiday <chr> <chr> 1 January 1, 2019 New Year's Day 2 January 21, 2019 Martin Luther King Day 3 January 24, 2019 Belly Laugh Day 4 February 2, 2019 Groundhog Day ``` ] --- count: false .panel1-lubridate-auto[ ```r library(lubridate) holidays_2019 %>% mutate(date = paste0(date, ", 2019")) %>% * mutate(date = mdy(date)) ``` ] .panel2-lubridate-auto[ ``` # A tibble: 4 x 2 date holiday <date> <chr> 1 2019-01-01 New Year's Day 2 2019-01-21 Martin Luther King Day 3 2019-01-24 Belly Laugh Day 4 2019-02-02 Groundhog Day ``` ] --- count: false .panel1-lubridate-auto[ ```r library(lubridate) holidays_2019 %>% mutate(date = paste0(date, ", 2019")) %>% mutate(date = mdy(date)) %>% * mutate(day = day(date)) ``` ] .panel2-lubridate-auto[ ``` # A tibble: 4 x 3 date holiday day <date> <chr> <int> 1 2019-01-01 New Year's Day 1 2 2019-01-21 Martin Luther King Day 21 3 2019-01-24 Belly Laugh Day 24 4 2019-02-02 Groundhog Day 2 ``` ] --- count: false .panel1-lubridate-auto[ ```r library(lubridate) holidays_2019 %>% mutate(date = paste0(date, ", 2019")) %>% mutate(date = mdy(date)) %>% mutate(day = day(date)) %>% * mutate(month = month(date)) ``` ] .panel2-lubridate-auto[ ``` # A tibble: 4 x 4 date holiday day month <date> <chr> <int> <dbl> 1 2019-01-01 New Year's Day 1 1 2 2019-01-21 Martin Luther King Day 21 1 3 2019-01-24 Belly Laugh Day 24 1 4 2019-02-02 Groundhog Day 2 2 ``` ] --- count: false .panel1-lubridate-auto[ ```r library(lubridate) holidays_2019 %>% mutate(date = paste0(date, ", 2019")) %>% mutate(date = mdy(date)) %>% mutate(day = day(date)) %>% mutate(month = month(date)) %>% * mutate(year = year(date)) ``` ] .panel2-lubridate-auto[ ``` # A tibble: 4 x 5 date holiday day month year <date> <chr> <int> <dbl> <dbl> 1 2019-01-01 New Year's Day 1 1 2019 2 2019-01-21 Martin Luther King Day 21 1 2019 3 2019-01-24 Belly Laugh Day 24 1 2019 4 2019-02-02 Groundhog Day 2 2 2019 ``` ] --- count: false .panel1-lubridate-auto[ ```r library(lubridate) holidays_2019 %>% mutate(date = paste0(date, ", 2019")) %>% mutate(date = mdy(date)) %>% mutate(day = day(date)) %>% mutate(month = month(date)) %>% mutate(year = year(date)) %>% * mutate(day_of_week = weekdays(date)) ``` ] .panel2-lubridate-auto[ ``` # A tibble: 4 x 6 date holiday day month year day_of_week <date> <chr> <int> <dbl> <dbl> <chr> 1 2019-01-01 New Year's Day 1 1 2019 Tuesday 2 2019-01-21 Martin Luther King Day 21 1 2019 Monday 3 2019-01-24 Belly Laugh Day 24 1 2019 Thursday 4 2019-02-02 Groundhog Day 2 2 2019 Saturday ``` ] --- count: false .panel1-lubridate-auto[ ```r library(lubridate) holidays_2019 %>% mutate(date = paste0(date, ", 2019")) %>% mutate(date = mdy(date)) %>% mutate(day = day(date)) %>% mutate(month = month(date)) %>% mutate(year = year(date)) %>% mutate(day_of_week = weekdays(date)) %>% # making space for more manipulation * select(-holiday, -day_of_week) ``` ] .panel2-lubridate-auto[ ``` # A tibble: 4 x 4 date day month year <date> <int> <dbl> <dbl> 1 2019-01-01 1 1 2019 2 2019-01-21 21 1 2019 3 2019-01-24 24 1 2019 4 2019-02-02 2 2 2019 ``` ] --- count: false .panel1-lubridate-auto[ ```r library(lubridate) holidays_2019 %>% mutate(date = paste0(date, ", 2019")) %>% mutate(date = mdy(date)) %>% mutate(day = day(date)) %>% mutate(month = month(date)) %>% mutate(year = year(date)) %>% mutate(day_of_week = weekdays(date)) %>% # making space for more manipulation select(-holiday, -day_of_week) %>% * mutate(euro_ordering = paste(day, month, year, sep = "/")) ``` ] .panel2-lubridate-auto[ ``` # A tibble: 4 x 5 date day month year euro_ordering <date> <int> <dbl> <dbl> <chr> 1 2019-01-01 1 1 2019 1/1/2019 2 2019-01-21 21 1 2019 21/1/2019 3 2019-01-24 24 1 2019 24/1/2019 4 2019-02-02 2 2 2019 2/2/2019 ``` ] --- count: false .panel1-lubridate-auto[ ```r library(lubridate) holidays_2019 %>% mutate(date = paste0(date, ", 2019")) %>% mutate(date = mdy(date)) %>% mutate(day = day(date)) %>% mutate(month = month(date)) %>% mutate(year = year(date)) %>% mutate(day_of_week = weekdays(date)) %>% # making space for more manipulation select(-holiday, -day_of_week) %>% mutate(euro_ordering = paste(day, month, year, sep = "/")) %>% * mutate(date_0 = dmy(euro_ordering)) ``` ] .panel2-lubridate-auto[ ``` # A tibble: 4 x 6 date day month year euro_ordering date_0 <date> <int> <dbl> <dbl> <chr> <date> 1 2019-01-01 1 1 2019 1/1/2019 2019-01-01 2 2019-01-21 21 1 2019 21/1/2019 2019-01-21 3 2019-01-24 24 1 2019 24/1/2019 2019-01-24 4 2019-02-02 2 2 2019 2/2/2019 2019-02-02 ``` ] --- count: false .panel1-lubridate-auto[ ```r library(lubridate) holidays_2019 %>% mutate(date = paste0(date, ", 2019")) %>% mutate(date = mdy(date)) %>% mutate(day = day(date)) %>% mutate(month = month(date)) %>% mutate(year = year(date)) %>% mutate(day_of_week = weekdays(date)) %>% # making space for more manipulation select(-holiday, -day_of_week) %>% mutate(euro_ordering = paste(day, month, year, sep = "/")) %>% mutate(date_0 = dmy(euro_ordering)) %>% * mutate(since_jan_1 = date - ymd("2019-01-01")) ``` ] .panel2-lubridate-auto[ ``` # A tibble: 4 x 7 date day month year euro_ordering date_0 since_jan_1 <date> <int> <dbl> <dbl> <chr> <date> <drtn> 1 2019-01-01 1 1 2019 1/1/2019 2019-01-01 0 days 2 2019-01-21 21 1 2019 21/1/2019 2019-01-21 20 days 3 2019-01-24 24 1 2019 24/1/2019 2019-01-24 23 days 4 2019-02-02 2 2 2019 2/2/2019 2019-02-02 32 days ``` ] <style> .panel1-lubridate-auto { color: black; width: 38.6060606060606%; hight: 32%; float: left; padding-left: 1%; font-size: 80% } .panel2-lubridate-auto { color: black; width: 59.3939393939394%; hight: 32%; float: left; padding-left: 1%; font-size: 80% } .panel3-lubridate-auto { color: black; width: NA%; hight: 33%; float: left; padding-left: 1%; font-size: 80% } </style> --- ## String (character) manipulation Character manipulation is also referred to as "regular expressions", or "Regex". The package stringr seeks to provide a unified framework for string manipulation. Let's get a taste, using the holidays_2019 dataset again. (The section is not particularly inspired - sorry!) --- count: false .panel1-stringr-rotate[ ```r library(stringr) holidays_2019 %>% mutate(var_1 = str_detect(holiday, "u")) %>% mutate(var_1 = tolower(holiday)) %>% mutate(var_1 = * str_replace(holiday, "u", "REPLACED")) ``` ] .panel2-stringr-rotate[ ``` # A tibble: 4 x 3 date holiday var_1 <chr> <chr> <chr> 1 January 1 New Year's Day New Year's Day 2 January 21 Martin Luther King Day Martin LREPLACEDther King Day 3 January 24 Belly Laugh Day Belly LaREPLACEDgh Day 4 February 2 Groundhog Day GroREPLACEDndhog Day ``` ] --- count: false .panel1-stringr-rotate[ ```r library(stringr) holidays_2019 %>% mutate(var_1 = str_detect(holiday, "u")) %>% mutate(var_1 = tolower(holiday)) %>% mutate(var_1 = * str_replace(holiday, "^.", "REPLACED")) #ROTATE ``` ] .panel2-stringr-rotate[ ``` # A tibble: 4 x 3 date holiday var_1 <chr> <chr> <chr> 1 January 1 New Year's Day REPLACEDew Year's Day 2 January 21 Martin Luther King Day REPLACEDartin Luther King Day 3 January 24 Belly Laugh Day REPLACEDelly Laugh Day 4 February 2 Groundhog Day REPLACEDroundhog Day ``` ] --- count: false .panel1-stringr-rotate[ ```r library(stringr) holidays_2019 %>% mutate(var_1 = str_detect(holiday, "u")) %>% mutate(var_1 = tolower(holiday)) %>% mutate(var_1 = * str_replace(holiday, ".$", "REPLACED")) #ROTATE ``` ] .panel2-stringr-rotate[ ``` # A tibble: 4 x 3 date holiday var_1 <chr> <chr> <chr> 1 January 1 New Year's Day New Year's DaREPLACED 2 January 21 Martin Luther King Day Martin Luther King DaREPLACED 3 January 24 Belly Laugh Day Belly Laugh DaREPLACED 4 February 2 Groundhog Day Groundhog DaREPLACED ``` ] --- count: false .panel1-stringr-rotate[ ```r library(stringr) holidays_2019 %>% mutate(var_1 = str_detect(holiday, "u")) %>% mutate(var_1 = tolower(holiday)) %>% mutate(var_1 = * str_count(holiday, "a")) #ROTATE ``` ] .panel2-stringr-rotate[ ``` # A tibble: 4 x 3 date holiday var_1 <chr> <chr> <int> 1 January 1 New Year's Day 2 2 January 21 Martin Luther King Day 2 3 January 24 Belly Laugh Day 2 4 February 2 Groundhog Day 1 ``` ] --- count: false .panel1-stringr-rotate[ ```r library(stringr) holidays_2019 %>% mutate(var_1 = str_detect(holiday, "u")) %>% mutate(var_1 = tolower(holiday)) %>% mutate(var_1 = * str_extract(holiday, ".....Day")) #ROTATE ``` ] .panel2-stringr-rotate[ ``` # A tibble: 4 x 3 date holiday var_1 <chr> <chr> <chr> 1 January 1 New Year's Day ar's Day 2 January 21 Martin Luther King Day King Day 3 January 24 Belly Laugh Day augh Day 4 February 2 Groundhog Day dhog Day ``` ] --- count: false .panel1-stringr-rotate[ ```r library(stringr) holidays_2019 %>% mutate(var_1 = str_detect(holiday, "u")) %>% mutate(var_1 = tolower(holiday)) %>% mutate(var_1 = * str_extract(holiday, "[[:PUNCT:]]")) #ROTATE ``` ] .panel2-stringr-rotate[ ``` # A tibble: 4 x 3 date holiday var_1 <chr> <chr> <chr> 1 January 1 New Year's Day ' 2 January 21 Martin Luther King Day <NA> 3 January 24 Belly Laugh Day <NA> 4 February 2 Groundhog Day <NA> ``` ] <style> .panel1-stringr-rotate { color: black; width: 38.6060606060606%; hight: 32%; float: left; padding-left: 1%; font-size: 80% } .panel2-stringr-rotate { color: black; width: 59.3939393939394%; hight: 32%; float: left; padding-left: 1%; font-size: 80% } .panel3-stringr-rotate { color: black; width: NA%; hight: 33%; float: left; padding-left: 1%; font-size: 80% } </style> --- # Key packages used Sam Firke (2019). janitor: Simple Tools for Examining and Cleaning Dirty Data. R package version 1.2.0. https://CRAN.R-project.org/package=janitor Hadley Wickham (2017). tidyverse: Easily Install and Load the 'Tidyverse'. R package version 1.2.1. https://CRAN.R-project.org/package=tidyverse Arel-Bundock et al., (2018). countrycode: An R package to convert country names and country codes. Journal of Open Source Software, 3(28), 848, https://doi.org/10.21105/joss.00848 Christian Rubba (2016). htmltab: Assemble Data Frames from HTML Tables. R package version 0.7.1. https://CRAN.R-project.org/package=htmltab --- I get the UN Security Council data from the UN's website "Countries Elected Members of the Security Council". --- --- Sam Firke (2019). janitor: Simple Tools for Examining and Cleaning Dirty Data. R package version 1.2.0. https://CRAN.R-project.org/package=janitor Hadley Wickham (2017). tidyverse: Easily Install and Load the 'Tidyverse'. R package version 1.2.1. https://CRAN.R-project.org/package=tidyverse Arel-Bundock et al., (2018). countrycode: An R package to convert country names and country codes. Journal of Open Source Software, 3(28), 848, https://doi.org/10.21105/joss.00848 Christian Rubba (2016). htmltab: Assemble Data Frames from HTML Tables. R package version 0.7.1. https://CRAN.R-project.org/package=htmltab Garrett Grolemund, Hadley Wickham (2011). Dates and Times Made Easy with lubridate. Journal of Statistical Software, 40(3), 1-25. URL http://www.jstatsoft.org/v40/i03/. <style type="text/css"> .remark-code{line-height: 1.5; font-size: 65%} @media print { .has-continuation { display: block; } } </style>