class: center, middle, inverse, title-slide # Basic Bivariate Statistical Tests ## Last Updated, May 19, 2020 ### Gina Reynolds, January 2020 --- In this resource, we will address several questions: - How do we visualize the relationship between two variables? -- - How can we quantify the strength of the relationship between two variables? -- - How likely is it that such a strong relationship or stronger resulted from chance? (i.e. what is the p-value?) -- - Given this, should I consider the relationship too likely to have arisen just by chance? Is the p-value < `\(\alpha\)`? Given this, should I reject the null or not? --- # Visualization - use ggplot2 - mostly one variable on x and one on y (but not always, other options too) --- # Quantifying strength of relationships -- - Correlation Coefficient (-1, 0, 1) -- - Difference-in-means and t-statistic -- - Chi-Squared --- # Statistical inference How likely is it that such a strong relationship as the one I observe (or stronger) is just the product of chance. -- If the null is true (often statistical independence), then how likely is it to observe stuch a strong relationship. --- # Basic Statistical tests - Correlation test -- - ttest -- - chi squared test --- ```r library(gapminder) library(tidyverse) ``` --- class: split-40 count: false .column[.content[ ```r *set.seed(2019) ``` ]] .column[.content[ ]] --- class: split-40 count: false .column[.content[ ```r set.seed(2019) *gapminder ``` ]] .column[.content[ ``` # A tibble: 1,704 x 6 country continent year lifeExp pop gdpPercap <fct> <fct> <int> <dbl> <int> <dbl> 1 Afghanistan Asia 1952 28.8 8425333 779. 2 Afghanistan Asia 1957 30.3 9240934 821. 3 Afghanistan Asia 1962 32.0 10267083 853. 4 Afghanistan Asia 1967 34.0 11537966 836. 5 Afghanistan Asia 1972 36.1 13079460 740. 6 Afghanistan Asia 1977 38.4 14880372 786. 7 Afghanistan Asia 1982 39.9 12881816 978. 8 Afghanistan Asia 1987 40.8 13867957 852. 9 Afghanistan Asia 1992 41.7 16317921 649. 10 Afghanistan Asia 1997 41.8 22227415 635. # … with 1,694 more rows ``` ]] --- class: split-40 count: false .column[.content[ ```r set.seed(2019) gapminder %>% * filter(year == 2007) # ``` ]] .column[.content[ ``` # A tibble: 142 x 6 country continent year lifeExp pop gdpPercap <fct> <fct> <int> <dbl> <int> <dbl> 1 Afghanistan Asia 2007 43.8 31889923 975. 2 Albania Europe 2007 76.4 3600523 5937. 3 Algeria Africa 2007 72.3 33333216 6223. 4 Angola Africa 2007 42.7 12420476 4797. 5 Argentina Americas 2007 75.3 40301927 12779. 6 Australia Oceania 2007 81.2 20434176 34435. 7 Austria Europe 2007 79.8 8199783 36126. 8 Bahrain Asia 2007 75.6 708573 29796. 9 Bangladesh Asia 2007 64.1 150448339 1391. 10 Belgium Europe 2007 79.4 10392226 33693. # … with 132 more rows ``` ]] --- class: split-40 count: false .column[.content[ ```r set.seed(2019) gapminder %>% filter(year == 2007) %>% # * filter(continent != "Oceania") ``` ]] .column[.content[ ``` # A tibble: 140 x 6 country continent year lifeExp pop gdpPercap <fct> <fct> <int> <dbl> <int> <dbl> 1 Afghanistan Asia 2007 43.8 31889923 975. 2 Albania Europe 2007 76.4 3600523 5937. 3 Algeria Africa 2007 72.3 33333216 6223. 4 Angola Africa 2007 42.7 12420476 4797. 5 Argentina Americas 2007 75.3 40301927 12779. 6 Austria Europe 2007 79.8 8199783 36126. 7 Bahrain Asia 2007 75.6 708573 29796. 8 Bangladesh Asia 2007 64.1 150448339 1391. 9 Belgium Europe 2007 79.4 10392226 33693. 10 Benin Africa 2007 56.7 8078314 1441. # … with 130 more rows ``` ]] --- class: split-40 count: false .column[.content[ ```r set.seed(2019) gapminder %>% filter(year == 2007) %>% # filter(continent != "Oceania") %>% * mutate(continent = factor(continent)) ``` ]] .column[.content[ ``` # A tibble: 140 x 6 country continent year lifeExp pop gdpPercap <fct> <fct> <int> <dbl> <int> <dbl> 1 Afghanistan Asia 2007 43.8 31889923 975. 2 Albania Europe 2007 76.4 3600523 5937. 3 Algeria Africa 2007 72.3 33333216 6223. 4 Angola Africa 2007 42.7 12420476 4797. 5 Argentina Americas 2007 75.3 40301927 12779. 6 Austria Europe 2007 79.8 8199783 36126. 7 Bahrain Asia 2007 75.6 708573 29796. 8 Bangladesh Asia 2007 64.1 150448339 1391. 9 Belgium Europe 2007 79.4 10392226 33693. 10 Benin Africa 2007 56.7 8078314 1441. # … with 130 more rows ``` ]] --- class: split-40 count: false .column[.content[ ```r set.seed(2019) gapminder %>% filter(year == 2007) %>% # filter(continent != "Oceania") %>% mutate(continent = factor(continent)) %>% * mutate(high_income = case_when( * gdpPercap >= 15000 ~ "high income", * gdpPercap < 15000 ~ "not high income") * ) ``` ]] .column[.content[ ``` # A tibble: 140 x 7 country continent year lifeExp pop gdpPercap high_income <fct> <fct> <int> <dbl> <int> <dbl> <chr> 1 Afghanistan Asia 2007 43.8 31889923 975. not high income 2 Albania Europe 2007 76.4 3600523 5937. not high income 3 Algeria Africa 2007 72.3 33333216 6223. not high income 4 Angola Africa 2007 42.7 12420476 4797. not high income 5 Argentina Americas 2007 75.3 40301927 12779. not high income 6 Austria Europe 2007 79.8 8199783 36126. high income 7 Bahrain Asia 2007 75.6 708573 29796. high income 8 Bangladesh Asia 2007 64.1 150448339 1391. not high income 9 Belgium Europe 2007 79.4 10392226 33693. high income 10 Benin Africa 2007 56.7 8078314 1441. not high income # … with 130 more rows ``` ]] --- class: split-40 count: false .column[.content[ ```r set.seed(2019) gapminder %>% filter(year == 2007) %>% # filter(continent != "Oceania") %>% mutate(continent = factor(continent)) %>% mutate(high_income = case_when( gdpPercap >= 15000 ~ "high income", gdpPercap < 15000 ~ "not high income") ) %>% * mutate(asia = case_when( * continent == "Asia" ~ "Asia", * continent != "Asia" ~ "not Asia") * ) ``` ]] .column[.content[ ``` # A tibble: 140 x 8 country continent year lifeExp pop gdpPercap high_income asia <fct> <fct> <int> <dbl> <int> <dbl> <chr> <chr> 1 Afghanistan Asia 2007 43.8 3.19e7 975. not high inco… Asia 2 Albania Europe 2007 76.4 3.60e6 5937. not high inco… not As… 3 Algeria Africa 2007 72.3 3.33e7 6223. not high inco… not As… 4 Angola Africa 2007 42.7 1.24e7 4797. not high inco… not As… 5 Argentina Americas 2007 75.3 4.03e7 12779. not high inco… not As… 6 Austria Europe 2007 79.8 8.20e6 36126. high income not As… 7 Bahrain Asia 2007 75.6 7.09e5 29796. high income Asia 8 Bangladesh Asia 2007 64.1 1.50e8 1391. not high inco… Asia 9 Belgium Europe 2007 79.4 1.04e7 33693. high income not As… 10 Benin Africa 2007 56.7 8.08e6 1441. not high inco… not As… # … with 130 more rows ``` ]] --- class: split-40 count: false .column[.content[ ```r set.seed(2019) gapminder %>% filter(year == 2007) %>% # filter(continent != "Oceania") %>% mutate(continent = factor(continent)) %>% mutate(high_income = case_when( gdpPercap >= 15000 ~ "high income", gdpPercap < 15000 ~ "not high income") ) %>% mutate(asia = case_when( continent == "Asia" ~ "Asia", continent != "Asia" ~ "not Asia") ) %>% * mutate(gdppercap_log10 = * log10(gdpPercap)) ``` ]] .column[.content[ ``` # A tibble: 140 x 9 country continent year lifeExp pop gdpPercap high_income asia <fct> <fct> <int> <dbl> <int> <dbl> <chr> <chr> 1 Afghan… Asia 2007 43.8 3.19e7 975. not high i… Asia 2 Albania Europe 2007 76.4 3.60e6 5937. not high i… not … 3 Algeria Africa 2007 72.3 3.33e7 6223. not high i… not … 4 Angola Africa 2007 42.7 1.24e7 4797. not high i… not … 5 Argent… Americas 2007 75.3 4.03e7 12779. not high i… not … 6 Austria Europe 2007 79.8 8.20e6 36126. high income not … 7 Bahrain Asia 2007 75.6 7.09e5 29796. high income Asia 8 Bangla… Asia 2007 64.1 1.50e8 1391. not high i… Asia 9 Belgium Europe 2007 79.4 1.04e7 33693. high income not … 10 Benin Africa 2007 56.7 8.08e6 1441. not high i… not … # … with 130 more rows, and 1 more variable: gdppercap_log10 <dbl> ``` ]] --- class: split-40 count: false .column[.content[ ```r set.seed(2019) gapminder %>% filter(year == 2007) %>% # filter(continent != "Oceania") %>% mutate(continent = factor(continent)) %>% mutate(high_income = case_when( gdpPercap >= 15000 ~ "high income", gdpPercap < 15000 ~ "not high income") ) %>% mutate(asia = case_when( continent == "Asia" ~ "Asia", continent != "Asia" ~ "not Asia") ) %>% mutate(gdppercap_log10 = log10(gdpPercap)) %>% * sample_n(50, replace = F) ``` ]] .column[.content[ ``` # A tibble: 50 x 9 country continent year lifeExp pop gdpPercap high_income asia <fct> <fct> <int> <dbl> <int> <dbl> <chr> <chr> 1 Colomb… Americas 2007 72.9 4.42e7 7007. not high i… not … 2 Tanzan… Africa 2007 52.5 3.81e7 1107. not high i… not … 3 Korea,… Asia 2007 67.3 2.33e7 1593. not high i… Asia 4 Burundi Africa 2007 49.6 8.39e6 430. not high i… not … 5 Sudan Africa 2007 58.6 4.23e7 2602. not high i… not … 6 Iraq Asia 2007 59.5 2.75e7 4471. not high i… Asia 7 Camero… Africa 2007 50.4 1.77e7 2042. not high i… not … 8 Comoros Africa 2007 65.2 7.11e5 986. not high i… not … 9 Slovak… Europe 2007 74.7 5.45e6 18678. high income not … 10 Lebanon Asia 2007 72.0 3.92e6 10461. not high i… Asia # … with 40 more rows, and 1 more variable: gdppercap_log10 <dbl> ``` ]] --- class: split-40 count: false .column[.content[ ```r set.seed(2019) gapminder %>% filter(year == 2007) %>% # filter(continent != "Oceania") %>% mutate(continent = factor(continent)) %>% mutate(high_income = case_when( gdpPercap >= 15000 ~ "high income", gdpPercap < 15000 ~ "not high income") ) %>% mutate(asia = case_when( continent == "Asia" ~ "Asia", continent != "Asia" ~ "not Asia") ) %>% mutate(gdppercap_log10 = log10(gdpPercap)) %>% sample_n(50, replace = F) -> *gm_2007_sample_50 ``` ]] .column[.content[ ]] --- class: inverse, center, middle # The relationship between two continuous variables Does information about the value of "x" tell me information about the probable value of "y"? Is the relationship strong enough to say that it is unlikely to have arisen from chance? --- class: split-40 count: false .column[.content[ ```r *gm_2007_sample_50 ``` ]] .column[.content[ ``` # A tibble: 50 x 9 country continent year lifeExp pop gdpPercap high_income asia <fct> <fct> <int> <dbl> <int> <dbl> <chr> <chr> 1 Colomb… Americas 2007 72.9 4.42e7 7007. not high i… not … 2 Tanzan… Africa 2007 52.5 3.81e7 1107. not high i… not … 3 Korea,… Asia 2007 67.3 2.33e7 1593. not high i… Asia 4 Burundi Africa 2007 49.6 8.39e6 430. not high i… not … 5 Sudan Africa 2007 58.6 4.23e7 2602. not high i… not … 6 Iraq Asia 2007 59.5 2.75e7 4471. not high i… Asia 7 Camero… Africa 2007 50.4 1.77e7 2042. not high i… not … 8 Comoros Africa 2007 65.2 7.11e5 986. not high i… not … 9 Slovak… Europe 2007 74.7 5.45e6 18678. high income not … 10 Lebanon Asia 2007 72.0 3.92e6 10461. not high i… Asia # … with 40 more rows, and 1 more variable: gdppercap_log10 <dbl> ``` ]] --- class: split-40 count: false .column[.content[ ```r gm_2007_sample_50 %>% * ggplot() ``` ]] .column[.content[ <img src="basic_bivariate_statistical_tests_files/figure-html/plot_1_auto_2_output-1.png" width="432" /> ]] --- class: split-40 count: false .column[.content[ ```r gm_2007_sample_50 %>% ggplot() + * aes(x = gdppercap_log10) ``` ]] .column[.content[ <img src="basic_bivariate_statistical_tests_files/figure-html/plot_1_auto_3_output-1.png" width="432" /> ]] --- class: split-40 count: false .column[.content[ ```r gm_2007_sample_50 %>% ggplot() + aes(x = gdppercap_log10) + * aes(y = lifeExp) ``` ]] .column[.content[ <img src="basic_bivariate_statistical_tests_files/figure-html/plot_1_auto_4_output-1.png" width="432" /> ]] --- class: split-40 count: false .column[.content[ ```r gm_2007_sample_50 %>% ggplot() + aes(x = gdppercap_log10) + aes(y = lifeExp) + * geom_point(alpha = .5, * size = 5, * color = "magenta") ``` ]] .column[.content[ <img src="basic_bivariate_statistical_tests_files/figure-html/plot_1_auto_5_output-1.png" width="432" /> ]] --- class: split-40 count: false .column[.content[ ```r gm_2007_sample_50 %>% ggplot() + aes(x = gdppercap_log10) + aes(y = lifeExp) + geom_point(alpha = .5, size = 5, color = "magenta") + * theme_minimal(base_size = 15) ``` ]] .column[.content[ <img src="basic_bivariate_statistical_tests_files/figure-html/plot_1_auto_6_output-1.png" width="432" /> ]] --- ### Correlation test Eyeballing it, there certainly seems to be a relationship between x and y --- knowing the value of x would seem to inform you of the general likely value of y. But let's do the statistical test too. --- Now, the statistical test. Note: The normalized version of covariance is Pearson's correlation coefficient. `$$cov(x,y) = \frac{\sum_{i=1}^n (x_i-\overline{x})(y_i-\overline{y})}{n}$$` --- <div class="figure"> <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/b/be/Karl_Pearson%3B_Sir_Francis_Galton.jpg/800px-Karl_Pearson%3B_Sir_Francis_Galton.jpg" alt="Pearson and Galton - via wikipedia" width="45%" /> <p class="caption">Pearson and Galton - via wikipedia</p> </div> --- class: split-40 count: false .column[.content[ ```r *cor.test( * x = gm_2007_sample_50$gdppercap_log10, * y = gm_2007_sample_50$lifeExp * ) ``` ]] .column[.content[ ``` Pearson's product-moment correlation data: gm_2007_sample_50$gdppercap_log10 and gm_2007_sample_50$lifeExp t = 14.437, df = 48, p-value < 2.2e-16 alternative hypothesis: true correlation is not equal to 0 95 percent confidence interval: 0.8320019 0.9432120 sample estimates: cor 0.9015609 ``` ]] --- class: split-40 count: false .column[.content[ ```r cor.test( x = gm_2007_sample_50$gdppercap_log10, y = gm_2007_sample_50$lifeExp ) # using a new pipe from the magrittr package *library(magrittr) ``` ]] .column[.content[ ``` Pearson's product-moment correlation data: gm_2007_sample_50$gdppercap_log10 and gm_2007_sample_50$lifeExp t = 14.437, df = 48, p-value < 2.2e-16 alternative hypothesis: true correlation is not equal to 0 95 percent confidence interval: 0.8320019 0.9432120 sample estimates: cor 0.9015609 ``` ]] --- class: split-40 count: false .column[.content[ ```r cor.test( x = gm_2007_sample_50$gdppercap_log10, y = gm_2007_sample_50$lifeExp ) # using a new pipe from the magrittr package library(magrittr) *gm_2007_sample_50 # state what data vars come from ``` ]] .column[.content[ ``` Pearson's product-moment correlation data: gm_2007_sample_50$gdppercap_log10 and gm_2007_sample_50$lifeExp t = 14.437, df = 48, p-value < 2.2e-16 alternative hypothesis: true correlation is not equal to 0 95 percent confidence interval: 0.8320019 0.9432120 sample estimates: cor 0.9015609 ``` ``` # A tibble: 50 x 9 country continent year lifeExp pop gdpPercap high_income asia <fct> <fct> <int> <dbl> <int> <dbl> <chr> <chr> 1 Colomb… Americas 2007 72.9 4.42e7 7007. not high i… not … 2 Tanzan… Africa 2007 52.5 3.81e7 1107. not high i… not … 3 Korea,… Asia 2007 67.3 2.33e7 1593. not high i… Asia 4 Burundi Africa 2007 49.6 8.39e6 430. not high i… not … 5 Sudan Africa 2007 58.6 4.23e7 2602. not high i… not … 6 Iraq Asia 2007 59.5 2.75e7 4471. not high i… Asia 7 Camero… Africa 2007 50.4 1.77e7 2042. not high i… not … 8 Comoros Africa 2007 65.2 7.11e5 986. not high i… not … 9 Slovak… Europe 2007 74.7 5.45e6 18678. high income not … 10 Lebanon Asia 2007 72.0 3.92e6 10461. not high i… Asia # … with 40 more rows, and 1 more variable: gdppercap_log10 <dbl> ``` ]] --- class: split-40 count: false .column[.content[ ```r cor.test( x = gm_2007_sample_50$gdppercap_log10, y = gm_2007_sample_50$lifeExp ) # using a new pipe from the magrittr package library(magrittr) gm_2007_sample_50 %$% # state what data vars come from * cor.test(x = gdppercap_log10, * y = lifeExp) ``` ]] .column[.content[ ``` Pearson's product-moment correlation data: gm_2007_sample_50$gdppercap_log10 and gm_2007_sample_50$lifeExp t = 14.437, df = 48, p-value < 2.2e-16 alternative hypothesis: true correlation is not equal to 0 95 percent confidence interval: 0.8320019 0.9432120 sample estimates: cor 0.9015609 ``` ``` Pearson's product-moment correlation data: gdppercap_log10 and lifeExp t = 14.437, df = 48, p-value < 2.2e-16 alternative hypothesis: true correlation is not equal to 0 95 percent confidence interval: 0.8320019 0.9432120 sample estimates: cor 0.9015609 ``` ]] --- class: inverse, center, middle # The relationship between a dichotomous (two groups) and continuous variable --- class: split-40 count: false .column[.content[ ```r *ggplot(gm_2007_sample_50) ``` ]] .column[.content[ <img src="basic_bivariate_statistical_tests_files/figure-html/t_plot_auto_1_output-1.png" width="432" /> ]] --- class: split-40 count: false .column[.content[ ```r ggplot(gm_2007_sample_50) + * aes(y = lifeExp) ``` ]] .column[.content[ <img src="basic_bivariate_statistical_tests_files/figure-html/t_plot_auto_2_output-1.png" width="432" /> ]] --- class: split-40 count: false .column[.content[ ```r ggplot(gm_2007_sample_50) + aes(y = lifeExp) + * aes(x = asia) ``` ]] .column[.content[ <img src="basic_bivariate_statistical_tests_files/figure-html/t_plot_auto_3_output-1.png" width="432" /> ]] --- class: split-40 count: false .column[.content[ ```r ggplot(gm_2007_sample_50) + aes(y = lifeExp) + aes(x = asia) + * aes(fill = asia) ``` ]] .column[.content[ <img src="basic_bivariate_statistical_tests_files/figure-html/t_plot_auto_4_output-1.png" width="432" /> ]] --- class: split-40 count: false .column[.content[ ```r ggplot(gm_2007_sample_50) + aes(y = lifeExp) + aes(x = asia) + aes(fill = asia) + * geom_boxplot() ``` ]] .column[.content[ <img src="basic_bivariate_statistical_tests_files/figure-html/t_plot_auto_5_output-1.png" width="432" /> ]] --- class: split-40 count: false .column[.content[ ```r ggplot(gm_2007_sample_50) + aes(y = lifeExp) + aes(x = asia) + aes(fill = asia) + geom_boxplot() + * geom_jitter(height = 0, * width = .2, * alpha = .3) ``` ]] .column[.content[ <img src="basic_bivariate_statistical_tests_files/figure-html/t_plot_auto_6_output-1.png" width="432" /> ]] --- class: split-40 count: false .column[.content[ ```r ggplot(gm_2007_sample_50) + aes(y = lifeExp) + aes(x = asia) + aes(fill = asia) + geom_boxplot() + geom_jitter(height = 0, width = .2, alpha = .3) + * geom_point(data = . %>% * group_by(asia) %>% * summarize(mean_life_exp = mean(lifeExp)), * aes(y = mean_life_exp), * color = "goldenrod", * size = 5) ``` ]] .column[.content[ <img src="basic_bivariate_statistical_tests_files/figure-html/t_plot_auto_7_output-1.png" width="432" /> ]] --- # An alternative --- class: split-40 count: false .column[.content[ ```r *ggplot(gm_2007_sample_50) ``` ]] .column[.content[ <img src="basic_bivariate_statistical_tests_files/figure-html/hist_2_auto_1_output-1.png" width="432" /> ]] --- class: split-40 count: false .column[.content[ ```r ggplot(gm_2007_sample_50) + * aes(x = lifeExp) ``` ]] .column[.content[ <img src="basic_bivariate_statistical_tests_files/figure-html/hist_2_auto_2_output-1.png" width="432" /> ]] --- class: split-40 count: false .column[.content[ ```r ggplot(gm_2007_sample_50) + aes(x = lifeExp) + * aes(fill = asia) ``` ]] .column[.content[ <img src="basic_bivariate_statistical_tests_files/figure-html/hist_2_auto_3_output-1.png" width="432" /> ]] --- class: split-40 count: false .column[.content[ ```r ggplot(gm_2007_sample_50) + aes(x = lifeExp) + aes(fill = asia) + * aes(col = asia) ``` ]] .column[.content[ <img src="basic_bivariate_statistical_tests_files/figure-html/hist_2_auto_4_output-1.png" width="432" /> ]] --- class: split-40 count: false .column[.content[ ```r ggplot(gm_2007_sample_50) + aes(x = lifeExp) + aes(fill = asia) + aes(col = asia) + * geom_histogram(alpha = .2) ``` ]] .column[.content[ <img src="basic_bivariate_statistical_tests_files/figure-html/hist_2_auto_5_output-1.png" width="432" /> ]] --- class: split-40 count: false .column[.content[ ```r ggplot(gm_2007_sample_50) + aes(x = lifeExp) + aes(fill = asia) + aes(col = asia) + geom_histogram(alpha = .2) + * facet_wrap(~ asia, nrow = 2) ``` ]] .column[.content[ <img src="basic_bivariate_statistical_tests_files/figure-html/hist_2_auto_6_output-1.png" width="432" /> ]] --- class: split-40 count: false .column[.content[ ```r ggplot(gm_2007_sample_50) + aes(x = lifeExp) + aes(fill = asia) + aes(col = asia) + geom_histogram(alpha = .2) + facet_wrap(~ asia, nrow = 2) + * geom_rug() ``` ]] .column[.content[ <img src="basic_bivariate_statistical_tests_files/figure-html/hist_2_auto_7_output-1.png" width="432" /> ]] --- class: split-40 count: false .column[.content[ ```r ggplot(gm_2007_sample_50) + aes(x = lifeExp) + aes(fill = asia) + aes(col = asia) + geom_histogram(alpha = .2) + facet_wrap(~ asia, nrow = 2) + geom_rug() + * geom_vline(data = . %>% * group_by(asia) %>% * summarize(mean_life_exp = mean(lifeExp)), * aes(xintercept = mean_life_exp), * linetype = "dashed") ``` ]] .column[.content[ <img src="basic_bivariate_statistical_tests_files/figure-html/hist_2_auto_8_output-1.png" width="432" /> ]] --- ### "student's t-test" <div class="figure"> <img src="https://upload.wikimedia.org/wikipedia/commons/4/42/William_Sealy_Gosset.jpg" alt="William_Sealy_Gosset - via wikipedia" width="35%" /><img src="https://upload.wikimedia.org/wikipedia/commons/6/65/William_Gosset_plaque_in_Guinness_storehouse_tour%2C_Ireland.jpg" alt="William_Sealy_Gosset - via wikipedia" width="35%" /> <p class="caption">William_Sealy_Gosset - via wikipedia</p> </div> --- ### The t-test compares group means, or a sample mean from an expected mean. Are they different enough from each other to be considered statistically different? --- $$ t = \frac{(\bar x_1 - \bar x_2) - (\mu_1 - \mu_2) }{\sqrt(s_p^2/n_1 +s_p^2/n_2)} $$ -- $$ t = \frac{(\bar x_1 - \bar x_2)}{\sqrt(s_p^2/n_1 +s_p^2/n_2)} $$ --- class: split-40 count: false .column[.content[ ```r *gm_2007_sample_50 ``` ]] .column[.content[ ``` # A tibble: 50 x 9 country continent year lifeExp pop gdpPercap high_income asia <fct> <fct> <int> <dbl> <int> <dbl> <chr> <chr> 1 Colomb… Americas 2007 72.9 4.42e7 7007. not high i… not … 2 Tanzan… Africa 2007 52.5 3.81e7 1107. not high i… not … 3 Korea,… Asia 2007 67.3 2.33e7 1593. not high i… Asia 4 Burundi Africa 2007 49.6 8.39e6 430. not high i… not … 5 Sudan Africa 2007 58.6 4.23e7 2602. not high i… not … 6 Iraq Asia 2007 59.5 2.75e7 4471. not high i… Asia 7 Camero… Africa 2007 50.4 1.77e7 2042. not high i… not … 8 Comoros Africa 2007 65.2 7.11e5 986. not high i… not … 9 Slovak… Europe 2007 74.7 5.45e6 18678. high income not … 10 Lebanon Asia 2007 72.0 3.92e6 10461. not high i… Asia # … with 40 more rows, and 1 more variable: gdppercap_log10 <dbl> ``` ]] --- class: split-40 count: false .column[.content[ ```r gm_2007_sample_50 %>% * t.test(formula = lifeExp ~ asia, * data = .) ``` ]] .column[.content[ ``` Welch Two Sample t-test data: lifeExp by asia t = 0.99834, df = 30.407, p-value = 0.326 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: -3.46270 10.09291 sample estimates: mean in group Asia mean in group not Asia 70.11513 66.80003 ``` ]] --- class: inverse, center, middle # Relationship between multinomial variable (many groups) and continuous variable --- class: split-40 count: false .column[.content[ ```r *ggplot(gm_2007_sample_50) ``` ]] .column[.content[ <img src="basic_bivariate_statistical_tests_files/figure-html/multi_box_auto_1_output-1.png" width="432" /> ]] --- class: split-40 count: false .column[.content[ ```r ggplot(gm_2007_sample_50) + * aes(y = lifeExp) ``` ]] .column[.content[ <img src="basic_bivariate_statistical_tests_files/figure-html/multi_box_auto_2_output-1.png" width="432" /> ]] --- class: split-40 count: false .column[.content[ ```r ggplot(gm_2007_sample_50) + aes(y = lifeExp) + * aes(x = continent) ``` ]] .column[.content[ <img src="basic_bivariate_statistical_tests_files/figure-html/multi_box_auto_3_output-1.png" width="432" /> ]] --- class: split-40 count: false .column[.content[ ```r ggplot(gm_2007_sample_50) + aes(y = lifeExp) + aes(x = continent) + * aes(fill = continent) ``` ]] .column[.content[ <img src="basic_bivariate_statistical_tests_files/figure-html/multi_box_auto_4_output-1.png" width="432" /> ]] --- class: split-40 count: false .column[.content[ ```r ggplot(gm_2007_sample_50) + aes(y = lifeExp) + aes(x = continent) + aes(fill = continent) + * geom_boxplot() ``` ]] .column[.content[ <img src="basic_bivariate_statistical_tests_files/figure-html/multi_box_auto_5_output-1.png" width="432" /> ]] --- class: split-40 count: false .column[.content[ ```r ggplot(gm_2007_sample_50) + aes(y = lifeExp) + aes(x = continent) + aes(fill = continent) + geom_boxplot() + * geom_jitter(height = 0, * width = .2, * alpha = .3) ``` ]] .column[.content[ <img src="basic_bivariate_statistical_tests_files/figure-html/multi_box_auto_6_output-1.png" width="432" /> ]] --- class: split-40 count: false .column[.content[ ```r ggplot(gm_2007_sample_50) + aes(y = lifeExp) + aes(x = continent) + aes(fill = continent) + geom_boxplot() + geom_jitter(height = 0, width = .2, alpha = .3) + * theme_minimal() ``` ]] .column[.content[ <img src="basic_bivariate_statistical_tests_files/figure-html/multi_box_auto_7_output-1.png" width="432" /> ]] --- class: split-40 count: false .column[.content[ ```r ggplot(gm_2007_sample_50) + aes(y = lifeExp) + aes(x = continent) + aes(fill = continent) + geom_boxplot() + geom_jitter(height = 0, width = .2, alpha = .3) + theme_minimal() + * theme(legend.position = "none") ``` ]] .column[.content[ <img src="basic_bivariate_statistical_tests_files/figure-html/multi_box_auto_8_output-1.png" width="432" /> ]] --- class: split-40 count: false .column[.content[ ```r ggplot(gm_2007_sample_50) + aes(y = lifeExp) + aes(x = continent) + aes(fill = continent) + geom_boxplot() + geom_jitter(height = 0, width = .2, alpha = .3) + theme_minimal() + theme(legend.position = "none") + * geom_vline(data = . %>% * group_by(asia) %>% * summarize(mean_life_exp = mean(lifeExp)), * aes(xintercept = mean_life_exp), * linetype = "dashed") ``` ]] .column[.content[ <img src="basic_bivariate_statistical_tests_files/figure-html/multi_box_auto_9_output-1.png" width="432" /> ]] --- ## "Analysis of Variance" (Anova) test statistic = difference in means. Are there any difference in means for the groups that are strong enough to say that they are statistically different. --- class: split-40 count: false .column[.content[ ```r # the model itself *gm_2007_sample_50 ``` ]] .column[.content[ ``` # A tibble: 50 x 9 country continent year lifeExp pop gdpPercap high_income asia <fct> <fct> <int> <dbl> <int> <dbl> <chr> <chr> 1 Colomb… Americas 2007 72.9 4.42e7 7007. not high i… not … 2 Tanzan… Africa 2007 52.5 3.81e7 1107. not high i… not … 3 Korea,… Asia 2007 67.3 2.33e7 1593. not high i… Asia 4 Burundi Africa 2007 49.6 8.39e6 430. not high i… not … 5 Sudan Africa 2007 58.6 4.23e7 2602. not high i… not … 6 Iraq Asia 2007 59.5 2.75e7 4471. not high i… Asia 7 Camero… Africa 2007 50.4 1.77e7 2042. not high i… not … 8 Comoros Africa 2007 65.2 7.11e5 986. not high i… not … 9 Slovak… Europe 2007 74.7 5.45e6 18678. high income not … 10 Lebanon Asia 2007 72.0 3.92e6 10461. not high i… Asia # … with 40 more rows, and 1 more variable: gdppercap_log10 <dbl> ``` ]] --- class: split-40 count: false .column[.content[ ```r # the model itself gm_2007_sample_50 %>% *aov(formula = lifeExp ~ continent, * data = .) ``` ]] .column[.content[ ``` Call: aov(formula = lifeExp ~ continent, data = .) Terms: continent Residuals Sum of Squares 3115.130 3234.028 Deg. of Freedom 3 46 Residual standard error: 8.384805 Estimated effects may be unbalanced ``` ]] --- class: split-40 count: false .column[.content[ ```r # the model itself gm_2007_sample_50 %>% aov(formula = lifeExp ~ continent, data = .) -> *a ``` ]] .column[.content[ ]] --- class: split-40 count: false .column[.content[ ```r # the model itself gm_2007_sample_50 %>% aov(formula = lifeExp ~ continent, data = .) -> a # the summary of the analysis of variance *summary(a) ``` ]] .column[.content[ ``` Df Sum Sq Mean Sq F value Pr(>F) continent 3 3115 1038.4 14.77 7.18e-07 *** Residuals 46 3234 70.3 --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ``` ]] --- ## Which are the statisically different groups? Post hoc testing is done to identify which groups are statistically different. ```r TukeyHSD(a) ``` ``` Tukey multiple comparisons of means 95% family-wise confidence level Fit: aov(formula = lifeExp ~ continent, data = .) $continent diff lwr upr p adj Americas-Africa 17.643000 6.344685 28.941315 0.0007645 Asia-Africa 12.304133 4.490624 20.117642 0.0006850 Europe-Africa 18.866750 10.537521 27.195979 0.0000015 Asia-Americas -5.338867 -16.880184 6.202451 0.6094127 Europe-Americas 1.223750 -10.672768 13.120268 0.9926979 Europe-Asia 6.562617 -2.093372 15.218605 0.1952925 ``` --- class: inverse, center, middle # Relationships between categorical variables First we consider dichotomous variables, and then more categories. --- class: inverse, center, middle ## Relationship between between dichotomous variables (two categories for each variable) - Binary (male, female; high income, low income) - Indicator, "Dummy variable" (on drug, not on drug; recovered, not recovered; democracy, not democracy; militarized interstate dispute, no dispute) --- ### contingency table Creating a contingency table with base R is pretty straightforward. We are using the dollar sign to point to a column in the dataframe gm_2007_sample_50. --- class: split-40 count: false .column[.content[ ```r *library(magrittr) ``` ]] .column[.content[ ]] --- class: split-40 count: false .column[.content[ ```r library(magrittr) *gm_2007_sample_50 ``` ]] .column[.content[ ``` # A tibble: 50 x 9 country continent year lifeExp pop gdpPercap high_income asia <fct> <fct> <int> <dbl> <int> <dbl> <chr> <chr> 1 Colomb… Americas 2007 72.9 4.42e7 7007. not high i… not … 2 Tanzan… Africa 2007 52.5 3.81e7 1107. not high i… not … 3 Korea,… Asia 2007 67.3 2.33e7 1593. not high i… Asia 4 Burundi Africa 2007 49.6 8.39e6 430. not high i… not … 5 Sudan Africa 2007 58.6 4.23e7 2602. not high i… not … 6 Iraq Asia 2007 59.5 2.75e7 4471. not high i… Asia 7 Camero… Africa 2007 50.4 1.77e7 2042. not high i… not … 8 Comoros Africa 2007 65.2 7.11e5 986. not high i… not … 9 Slovak… Europe 2007 74.7 5.45e6 18678. high income not … 10 Lebanon Asia 2007 72.0 3.92e6 10461. not high i… Asia # … with 40 more rows, and 1 more variable: gdppercap_log10 <dbl> ``` ]] --- class: split-40 count: false .column[.content[ ```r library(magrittr) gm_2007_sample_50 %$% *table(x = asia, * y = high_income) ``` ]] .column[.content[ ``` y x high income not high income Asia 6 9 not Asia 7 28 ``` ]] --- class: split-40 count: false .column[.content[ ```r library(magrittr) gm_2007_sample_50 %$% table(x = asia, y = high_income) %>% * plot() ``` ]] .column[.content[ <img src="basic_bivariate_statistical_tests_files/figure-html/contingency_auto_4_output-1.png" width="432" /> ]] --- ### Contingency table with the tidyverse You can do this in the tidyverse too. With a few more steps. --- class: split-40 count: false .column[.content[ ```r *gm_2007_sample_50 ``` ]] .column[.content[ ``` # A tibble: 50 x 9 country continent year lifeExp pop gdpPercap high_income asia <fct> <fct> <int> <dbl> <int> <dbl> <chr> <chr> 1 Colomb… Americas 2007 72.9 4.42e7 7007. not high i… not … 2 Tanzan… Africa 2007 52.5 3.81e7 1107. not high i… not … 3 Korea,… Asia 2007 67.3 2.33e7 1593. not high i… Asia 4 Burundi Africa 2007 49.6 8.39e6 430. not high i… not … 5 Sudan Africa 2007 58.6 4.23e7 2602. not high i… not … 6 Iraq Asia 2007 59.5 2.75e7 4471. not high i… Asia 7 Camero… Africa 2007 50.4 1.77e7 2042. not high i… not … 8 Comoros Africa 2007 65.2 7.11e5 986. not high i… not … 9 Slovak… Europe 2007 74.7 5.45e6 18678. high income not … 10 Lebanon Asia 2007 72.0 3.92e6 10461. not high i… Asia # … with 40 more rows, and 1 more variable: gdppercap_log10 <dbl> ``` ]] --- class: split-40 count: false .column[.content[ ```r gm_2007_sample_50 %>% * group_by(asia, high_income) ``` ]] .column[.content[ ``` # A tibble: 50 x 9 # Groups: asia, high_income [4] country continent year lifeExp pop gdpPercap high_income asia <fct> <fct> <int> <dbl> <int> <dbl> <chr> <chr> 1 Colomb… Americas 2007 72.9 4.42e7 7007. not high i… not … 2 Tanzan… Africa 2007 52.5 3.81e7 1107. not high i… not … 3 Korea,… Asia 2007 67.3 2.33e7 1593. not high i… Asia 4 Burundi Africa 2007 49.6 8.39e6 430. not high i… not … 5 Sudan Africa 2007 58.6 4.23e7 2602. not high i… not … 6 Iraq Asia 2007 59.5 2.75e7 4471. not high i… Asia 7 Camero… Africa 2007 50.4 1.77e7 2042. not high i… not … 8 Comoros Africa 2007 65.2 7.11e5 986. not high i… not … 9 Slovak… Europe 2007 74.7 5.45e6 18678. high income not … 10 Lebanon Asia 2007 72.0 3.92e6 10461. not high i… Asia # … with 40 more rows, and 1 more variable: gdppercap_log10 <dbl> ``` ]] --- class: split-40 count: false .column[.content[ ```r gm_2007_sample_50 %>% group_by(asia, high_income) %>% * count() ``` ]] .column[.content[ ``` # A tibble: 4 x 3 # Groups: asia, high_income [4] asia high_income n <chr> <chr> <int> 1 Asia high income 6 2 Asia not high income 9 3 not Asia high income 7 4 not Asia not high income 28 ``` ]] --- class: split-40 count: false .column[.content[ ```r gm_2007_sample_50 %>% group_by(asia, high_income) %>% count() %>% * spread(high_income, n) ``` ]] .column[.content[ ``` # A tibble: 2 x 3 # Groups: asia [2] asia `high income` `not high income` <chr> <int> <int> 1 Asia 6 9 2 not Asia 7 28 ``` ]] --- class: split-40 count: false .column[.content[ ```r *gm_2007_sample_50 ``` ]] .column[.content[ ``` # A tibble: 50 x 9 country continent year lifeExp pop gdpPercap high_income asia <fct> <fct> <int> <dbl> <int> <dbl> <chr> <chr> 1 Colomb… Americas 2007 72.9 4.42e7 7007. not high i… not … 2 Tanzan… Africa 2007 52.5 3.81e7 1107. not high i… not … 3 Korea,… Asia 2007 67.3 2.33e7 1593. not high i… Asia 4 Burundi Africa 2007 49.6 8.39e6 430. not high i… not … 5 Sudan Africa 2007 58.6 4.23e7 2602. not high i… not … 6 Iraq Asia 2007 59.5 2.75e7 4471. not high i… Asia 7 Camero… Africa 2007 50.4 1.77e7 2042. not high i… not … 8 Comoros Africa 2007 65.2 7.11e5 986. not high i… not … 9 Slovak… Europe 2007 74.7 5.45e6 18678. high income not … 10 Lebanon Asia 2007 72.0 3.92e6 10461. not high i… Asia # … with 40 more rows, and 1 more variable: gdppercap_log10 <dbl> ``` ]] --- class: split-40 count: false .column[.content[ ```r gm_2007_sample_50 %>% *ggplot() ``` ]] .column[.content[ <img src="basic_bivariate_statistical_tests_files/figure-html/jitter_auto_2_output-1.png" width="432" /> ]] --- class: split-40 count: false .column[.content[ ```r gm_2007_sample_50 %>% ggplot() + * aes(x = high_income) ``` ]] .column[.content[ <img src="basic_bivariate_statistical_tests_files/figure-html/jitter_auto_3_output-1.png" width="432" /> ]] --- class: split-40 count: false .column[.content[ ```r gm_2007_sample_50 %>% ggplot() + aes(x = high_income) + * aes(y = asia) ``` ]] .column[.content[ <img src="basic_bivariate_statistical_tests_files/figure-html/jitter_auto_4_output-1.png" width="432" /> ]] --- class: split-40 count: false .column[.content[ ```r gm_2007_sample_50 %>% ggplot() + aes(x = high_income) + aes(y = asia) + * geom_jitter(width = .3, height = .3) ``` ]] .column[.content[ <img src="basic_bivariate_statistical_tests_files/figure-html/jitter_auto_5_output-1.png" width="432" /> ]] --- class: split-40 count: false .column[.content[ ```r gm_2007_sample_50 %>% ggplot() + aes(x = high_income) + aes(y = asia) + geom_jitter(width = .3, height = .3) + * geom_count(color = "blue", * alpha = .3) ``` ]] .column[.content[ <img src="basic_bivariate_statistical_tests_files/figure-html/jitter_auto_6_output-1.png" width="432" /> ]] --- class: split-40 count: false .column[.content[ ```r gm_2007_sample_50 %>% ggplot() + aes(x = high_income) + aes(y = asia) + geom_jitter(width = .3, height = .3) + geom_count(color = "blue", alpha = .3) + * scale_size_continuous(range = c(5, 25), * breaks = 1:5 * 5 ) ``` ]] .column[.content[ <img src="basic_bivariate_statistical_tests_files/figure-html/jitter_auto_7_output-1.png" width="432" /> ]] --- class: split-40 count: false .column[.content[ ```r *ggplot(gm_2007_sample_50) ``` ]] .column[.content[ <img src="basic_bivariate_statistical_tests_files/figure-html/news_auto_1_output-1.png" width="432" /> ]] --- class: split-40 count: false .column[.content[ ```r ggplot(gm_2007_sample_50) + * aes(x = asia) ``` ]] .column[.content[ <img src="basic_bivariate_statistical_tests_files/figure-html/news_auto_2_output-1.png" width="432" /> ]] --- class: split-40 count: false .column[.content[ ```r ggplot(gm_2007_sample_50) + aes(x = asia) + * aes(fill = high_income) ``` ]] .column[.content[ <img src="basic_bivariate_statistical_tests_files/figure-html/news_auto_3_output-1.png" width="432" /> ]] --- class: split-40 count: false .column[.content[ ```r ggplot(gm_2007_sample_50) + aes(x = asia) + aes(fill = high_income) + * geom_bar(position = "dodge") ``` ]] .column[.content[ <img src="basic_bivariate_statistical_tests_files/figure-html/news_auto_4_output-1.png" width="432" /> ]] --- ### Chi-squared test $$ \tilde{\chi}^2=\sum_{k=1}^{n} \frac{(O_k - E_k)^2}{E_k} $$ --- ### mosaic plot A mosaic plot represents the proportions of each cell. Usually the major divide (the split that cuts across the whole square) is the "explanatory" variable. test statistic: Chi-squared statistic --- class: split-40 count: false .column[.content[ ```r *library(magrittr) ``` ]] .column[.content[ ]] --- class: split-40 count: false .column[.content[ ```r library(magrittr) *gm_2007_sample_50 ``` ]] .column[.content[ ``` # A tibble: 50 x 9 country continent year lifeExp pop gdpPercap high_income asia <fct> <fct> <int> <dbl> <int> <dbl> <chr> <chr> 1 Colomb… Americas 2007 72.9 4.42e7 7007. not high i… not … 2 Tanzan… Africa 2007 52.5 3.81e7 1107. not high i… not … 3 Korea,… Asia 2007 67.3 2.33e7 1593. not high i… Asia 4 Burundi Africa 2007 49.6 8.39e6 430. not high i… not … 5 Sudan Africa 2007 58.6 4.23e7 2602. not high i… not … 6 Iraq Asia 2007 59.5 2.75e7 4471. not high i… Asia 7 Camero… Africa 2007 50.4 1.77e7 2042. not high i… not … 8 Comoros Africa 2007 65.2 7.11e5 986. not high i… not … 9 Slovak… Europe 2007 74.7 5.45e6 18678. high income not … 10 Lebanon Asia 2007 72.0 3.92e6 10461. not high i… Asia # … with 40 more rows, and 1 more variable: gdppercap_log10 <dbl> ``` ]] --- class: split-40 count: false .column[.content[ ```r library(magrittr) gm_2007_sample_50 %$% * chisq.test(x = high_income, * y = asia) ``` ]] .column[.content[ ``` Pearson's Chi-squared test with Yates' continuity correction data: high_income and asia X-squared = 1.2672, df = 1, p-value = 0.2603 ``` ]] --- class: split-40 count: false .column[.content[ ```r library(magrittr) gm_2007_sample_50 %$% chisq.test(x = high_income, y = asia) -> *chi_sq_income_asia ``` ]] .column[.content[ ]] --- class: split-40 count: false .column[.content[ ```r library(magrittr) gm_2007_sample_50 %$% chisq.test(x = high_income, y = asia) -> chi_sq_income_asia *chi_sq_income_asia$observed ``` ]] .column[.content[ ``` asia high_income Asia not Asia high income 6 7 not high income 9 28 ``` ]] --- class: split-40 count: false .column[.content[ ```r library(magrittr) gm_2007_sample_50 %$% chisq.test(x = high_income, y = asia) -> chi_sq_income_asia chi_sq_income_asia$observed %>% * vcd::mosaic() ``` ]] .column[.content[ <img src="basic_bivariate_statistical_tests_files/figure-html/chi_two_by_two_auto_6_output-1.png" width="432" /> ]] --- class: split-40 count: false .column[.content[ ```r *chi_sq_income_asia$expected # expectation under null ``` ]] .column[.content[ ``` asia high_income Asia not Asia high income 3.9 9.1 not high income 11.1 25.9 ``` ]] --- class: split-40 count: false .column[.content[ ```r chi_sq_income_asia$expected %>% # expectation under null * vcd::mosaic() ``` ]] .column[.content[ <img src="basic_bivariate_statistical_tests_files/figure-html/mosaic_two_by_two_auto_2_output-1.png" width="432" /> ]] --- class: split-40 count: false .column[.content[ ```r *library(magrittr) ``` ]] .column[.content[ ]] --- class: split-40 count: false .column[.content[ ```r library(magrittr) *gm_2007_sample_50 ``` ]] .column[.content[ ``` # A tibble: 50 x 9 country continent year lifeExp pop gdpPercap high_income asia <fct> <fct> <int> <dbl> <int> <dbl> <chr> <chr> 1 Colomb… Americas 2007 72.9 4.42e7 7007. not high i… not … 2 Tanzan… Africa 2007 52.5 3.81e7 1107. not high i… not … 3 Korea,… Asia 2007 67.3 2.33e7 1593. not high i… Asia 4 Burundi Africa 2007 49.6 8.39e6 430. not high i… not … 5 Sudan Africa 2007 58.6 4.23e7 2602. not high i… not … 6 Iraq Asia 2007 59.5 2.75e7 4471. not high i… Asia 7 Camero… Africa 2007 50.4 1.77e7 2042. not high i… not … 8 Comoros Africa 2007 65.2 7.11e5 986. not high i… not … 9 Slovak… Europe 2007 74.7 5.45e6 18678. high income not … 10 Lebanon Asia 2007 72.0 3.92e6 10461. not high i… Asia # … with 40 more rows, and 1 more variable: gdppercap_log10 <dbl> ``` ]] --- class: split-40 count: false .column[.content[ ```r library(magrittr) gm_2007_sample_50 %$% * chisq.test(x = high_income, * y = asia, * correct = F) ``` ]] .column[.content[ ``` Pearson's Chi-squared test data: high_income and asia X-squared = 2.183, df = 1, p-value = 0.1395 ``` ]] --- class: split-40 count: false .column[.content[ ```r library(magrittr) gm_2007_sample_50 %$% chisq.test(x = high_income, y = asia, correct = F) *chi_sq_income_asia$expected ``` ]] .column[.content[ ``` Pearson's Chi-squared test data: high_income and asia X-squared = 2.183, df = 1, p-value = 0.1395 ``` ``` asia high_income Asia not Asia high income 3.9 9.1 not high income 11.1 25.9 ``` ]] --- class: split-40 count: false .column[.content[ ```r library(magrittr) gm_2007_sample_50 %$% chisq.test(x = high_income, y = asia, correct = F) chi_sq_income_asia$expected %>% * `-`(chi_sq_income_asia$observed) ``` ]] .column[.content[ ``` Pearson's Chi-squared test data: high_income and asia X-squared = 2.183, df = 1, p-value = 0.1395 ``` ``` asia high_income Asia not Asia high income -2.1 2.1 not high income 2.1 -2.1 ``` ]] --- class: split-40 count: false .column[.content[ ```r library(magrittr) gm_2007_sample_50 %$% chisq.test(x = high_income, y = asia, correct = F) chi_sq_income_asia$expected %>% `-`(chi_sq_income_asia$observed) %>% * `^`(2) ``` ]] .column[.content[ ``` Pearson's Chi-squared test data: high_income and asia X-squared = 2.183, df = 1, p-value = 0.1395 ``` ``` asia high_income Asia not Asia high income 4.41 4.41 not high income 4.41 4.41 ``` ]] --- class: split-40 count: false .column[.content[ ```r library(magrittr) gm_2007_sample_50 %$% chisq.test(x = high_income, y = asia, correct = F) chi_sq_income_asia$expected %>% `-`(chi_sq_income_asia$observed) %>% `^`(2) %>% * `/`(chi_sq_income_asia$expected) ``` ]] .column[.content[ ``` Pearson's Chi-squared test data: high_income and asia X-squared = 2.183, df = 1, p-value = 0.1395 ``` ``` asia high_income Asia not Asia high income 1.1307692 0.4846154 not high income 0.3972973 0.1702703 ``` ]] --- class: split-40 count: false .column[.content[ ```r library(magrittr) gm_2007_sample_50 %$% chisq.test(x = high_income, y = asia, correct = F) chi_sq_income_asia$expected %>% `-`(chi_sq_income_asia$observed) %>% `^`(2) %>% `/`(chi_sq_income_asia$expected) %>% * sum() ``` ]] .column[.content[ ``` Pearson's Chi-squared test data: high_income and asia X-squared = 2.183, df = 1, p-value = 0.1395 ``` ``` [1] 2.182952 ``` ]] --- ![](https://upload.wikimedia.org/wikipedia/commons/3/35/Chi-square_pdf.svg)<!-- --> --- class: inverse, center, middle ## Relationships between categorical data: more than two categories test statistic: Chi-squared statistic --- class: split-40 count: false .column[.content[ ```r *library(magrittr) ``` ]] .column[.content[ ]] --- class: split-40 count: false .column[.content[ ```r library(magrittr) *gm_2007_sample_50 ``` ]] .column[.content[ ``` # A tibble: 50 x 9 country continent year lifeExp pop gdpPercap high_income asia <fct> <fct> <int> <dbl> <int> <dbl> <chr> <chr> 1 Colomb… Americas 2007 72.9 4.42e7 7007. not high i… not … 2 Tanzan… Africa 2007 52.5 3.81e7 1107. not high i… not … 3 Korea,… Asia 2007 67.3 2.33e7 1593. not high i… Asia 4 Burundi Africa 2007 49.6 8.39e6 430. not high i… not … 5 Sudan Africa 2007 58.6 4.23e7 2602. not high i… not … 6 Iraq Asia 2007 59.5 2.75e7 4471. not high i… Asia 7 Camero… Africa 2007 50.4 1.77e7 2042. not high i… not … 8 Comoros Africa 2007 65.2 7.11e5 986. not high i… not … 9 Slovak… Europe 2007 74.7 5.45e6 18678. high income not … 10 Lebanon Asia 2007 72.0 3.92e6 10461. not high i… Asia # … with 40 more rows, and 1 more variable: gdppercap_log10 <dbl> ``` ]] --- class: split-40 count: false .column[.content[ ```r library(magrittr) gm_2007_sample_50 %$% *chisq.test(x = continent, * y = high_income) ``` ]] .column[.content[ ``` Pearson's Chi-squared test data: continent and high_income X-squared = 16.13, df = 3, p-value = 0.001067 ``` ]] --- class: split-40 count: false .column[.content[ ```r library(magrittr) gm_2007_sample_50 %$% chisq.test(x = continent, y = high_income) -> *my_chi_sqr ``` ]] .column[.content[ ]] --- class: split-40 count: false .column[.content[ ```r library(magrittr) gm_2007_sample_50 %$% chisq.test(x = continent, y = high_income) -> my_chi_sqr *my_chi_sqr$observed ``` ]] .column[.content[ ``` high_income continent high income not high income Africa 0 18 Americas 0 5 Asia 6 9 Europe 7 5 ``` ]] --- class: split-40 count: false .column[.content[ ```r library(magrittr) gm_2007_sample_50 %$% chisq.test(x = continent, y = high_income) -> my_chi_sqr my_chi_sqr$observed %>% * vcd::mosaic() ``` ]] .column[.content[ <img src="basic_bivariate_statistical_tests_files/figure-html/chi_mult_cat_auto_6_output-1.png" width="432" /> ]] --- class: split-40 count: false .column[.content[ ```r *my_chi_sqr$expected # expectation under null ``` ]] .column[.content[ ``` high_income continent high income not high income Africa 4.68 13.32 Americas 1.30 3.70 Asia 3.90 11.10 Europe 3.12 8.88 ``` ]] --- class: split-40 count: false .column[.content[ ```r my_chi_sqr$expected %>% # expectation under null * vcd::mosaic() ``` ]] .column[.content[ <img src="basic_bivariate_statistical_tests_files/figure-html/chi_mult_cat_plot_auto_2_output-1.png" width="432" /> ]] --- # Warnings --- ## Statistics don't always perform well <div class="figure"> <img src="https://upload.wikimedia.org/wikipedia/commons/e/ec/Anscombe%27s_quartet_3.svg" alt="Anscombe's quartet - via wikipedia" width="80%" /> <p class="caption">Anscombe's quartet - via wikipedia</p> </div> --- <div class="figure"> <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/d/d4/Correlation_examples2.svg/1920px-Correlation_examples2.svg.png" alt="correlation coefficients - via wikipedia" width="80%" /> <p class="caption">correlation coefficients - via wikipedia</p> </div> --- There are many more statistics not covered here, that address different situations, where assumptions for the t.test, correlation test, and chi-squared test might not be met. But these are a starting point. ### *There's a Stat for That!: What to Do & When to Do it* Bruce Frey --- # Linear Modeling: Ordinary least squares --- class: split-40 count: false .column[.content[ ```r *ggplot(data = gm_2007_sample_50) ``` ]] .column[.content[ <img src="basic_bivariate_statistical_tests_files/figure-html/picture_lm_auto_1_output-1.png" width="432" /> ]] --- class: split-40 count: false .column[.content[ ```r ggplot(data = gm_2007_sample_50) + * aes(x = gdppercap_log10) ``` ]] .column[.content[ <img src="basic_bivariate_statistical_tests_files/figure-html/picture_lm_auto_2_output-1.png" width="432" /> ]] --- class: split-40 count: false .column[.content[ ```r ggplot(data = gm_2007_sample_50) + aes(x = gdppercap_log10) + * aes(y = lifeExp) ``` ]] .column[.content[ <img src="basic_bivariate_statistical_tests_files/figure-html/picture_lm_auto_3_output-1.png" width="432" /> ]] --- class: split-40 count: false .column[.content[ ```r ggplot(data = gm_2007_sample_50) + aes(x = gdppercap_log10) + aes(y = lifeExp) + * geom_point() ``` ]] .column[.content[ <img src="basic_bivariate_statistical_tests_files/figure-html/picture_lm_auto_4_output-1.png" width="432" /> ]] --- class: split-40 count: false .column[.content[ ```r ggplot(data = gm_2007_sample_50) + aes(x = gdppercap_log10) + aes(y = lifeExp) + geom_point() + * geom_smooth(method = lm) ``` ]] .column[.content[ <img src="basic_bivariate_statistical_tests_files/figure-html/picture_lm_auto_5_output-1.png" width="432" /> ]] --- # Modeling logic - x - explanatory variable - right-hand variable - predictor - independent variable - y - outcome - left-hand variable - dependent variable - variable to be explained --- class: inverse, center, middle # Ordinary least squares modeling - minimizes the sum of the squared residuals - residuals = predicted value (value predicted by linear model) - observed value --- # OLS model estimation and summary --- class: split-40 count: false .column[.content[ ```r *gm_2007_sample_50 ``` ]] .column[.content[ ``` # A tibble: 50 x 9 country continent year lifeExp pop gdpPercap high_income asia <fct> <fct> <int> <dbl> <int> <dbl> <chr> <chr> 1 Colomb… Americas 2007 72.9 4.42e7 7007. not high i… not … 2 Tanzan… Africa 2007 52.5 3.81e7 1107. not high i… not … 3 Korea,… Asia 2007 67.3 2.33e7 1593. not high i… Asia 4 Burundi Africa 2007 49.6 8.39e6 430. not high i… not … 5 Sudan Africa 2007 58.6 4.23e7 2602. not high i… not … 6 Iraq Asia 2007 59.5 2.75e7 4471. not high i… Asia 7 Camero… Africa 2007 50.4 1.77e7 2042. not high i… not … 8 Comoros Africa 2007 65.2 7.11e5 986. not high i… not … 9 Slovak… Europe 2007 74.7 5.45e6 18678. high income not … 10 Lebanon Asia 2007 72.0 3.92e6 10461. not high i… Asia # … with 40 more rows, and 1 more variable: gdppercap_log10 <dbl> ``` ]] --- class: split-40 count: false .column[.content[ ```r gm_2007_sample_50 %>% * lm(formula = lifeExp ~ gdppercap_log10, * data = .) ``` ]] .column[.content[ ``` Call: lm(formula = lifeExp ~ gdppercap_log10, data = .) Coefficients: (Intercept) gdppercap_log10 4.999 16.930 ``` ]] --- class: split-40 count: false .column[.content[ ```r gm_2007_sample_50 %>% lm(formula = lifeExp ~ gdppercap_log10, data = .) %>% * summary() ``` ]] .column[.content[ ``` Call: lm(formula = lifeExp ~ gdppercap_log10, data = .) Residuals: Min 1Q Median 3Q Max -12.2805 -2.5701 0.1678 3.4702 9.4659 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 4.999 4.406 1.135 0.262 gdppercap_log10 16.930 1.173 14.437 <2e-16 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 4.976 on 48 degrees of freedom Multiple R-squared: 0.8128, Adjusted R-squared: 0.8089 F-statistic: 208.4 on 1 and 48 DF, p-value: < 2.2e-16 ``` ]] --- ```r gm_2007_sample_50 %>% lm(formula = lifeExp ~ gdppercap_log10, data = .) ``` ``` Call: lm(formula = lifeExp ~ gdppercap_log10, data = .) Coefficients: (Intercept) gdppercap_log10 4.999 16.930 ``` $$ LifeExp = 16.93*log10GDP + 4.999 + \epsilon $$ <style type="text/css"> .remark-code{line-height: 1.5; font-size: 75%} </style>