Basic Bivariate Statistical Tests

# Basic Bivariate Statistical Tests
## Last Updated, May 19, 2020
### Gina Reynolds, January 2020

---

In this resource, we will address several questions:

- How do we visualize the relationship between two variables?

- How can we quantify the strength of the relationship between two variables?

- How likely is it that such a strong relationship or stronger resulted from chance? (i.e. what is the p-value?)

- Given this, should I consider the relationship too likely to have arisen just by chance? Is the p-value < `$\alpha$`? Given this, should I reject the null or not?

---

# Visualization

- use ggplot2
- mostly one variable on x and one on y (but not always, other options too)

---

# Quantifying strength of relationships

- Correlation Coefficient (-1, 0, 1)

- Difference-in-means and t-statistic

- Chi-Squared

---

# Statistical inference

How likely is it that such a strong relationship as the one I observe (or stronger) is just the product of chance.

If the null is true (often statistical independence), then how likely is it to observe stuch a strong relationship.

---

# Basic Statistical tests

- Correlation test

- ttest

- chi squared test

---

```r
library(gapminder)
library(tidyverse)
```

---

```r
*set.seed(2019)
```
]]
.column[.content[

]]
---
class: split-40
count: false

```r
set.seed(2019)  
*gapminder
```
]]
.column[.content[

```
# A tibble: 1,704 x 6
   country     continent  year lifeExp      pop gdpPercap
   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
 1 Afghanistan Asia       1952    28.8  8425333      779.
 2 Afghanistan Asia       1957    30.3  9240934      821.
 3 Afghanistan Asia       1962    32.0 10267083      853.
 4 Afghanistan Asia       1967    34.0 11537966      836.
 5 Afghanistan Asia       1972    36.1 13079460      740.
 6 Afghanistan Asia       1977    38.4 14880372      786.
 7 Afghanistan Asia       1982    39.9 12881816      978.
 8 Afghanistan Asia       1987    40.8 13867957      852.
 9 Afghanistan Asia       1992    41.7 16317921      649.
10 Afghanistan Asia       1997    41.8 22227415      635.
# … with 1,694 more rows
```
]]
---
class: split-40
count: false

```r
set.seed(2019)  
gapminder %>%  
* filter(year == 2007)   #
```
]]
.column[.content[

```
# A tibble: 142 x 6
   country     continent  year lifeExp       pop gdpPercap
   <fct>       <fct>     <int>   <dbl>     <int>     <dbl>
 1 Afghanistan Asia       2007    43.8  31889923      975.
 2 Albania     Europe     2007    76.4   3600523     5937.
 3 Algeria     Africa     2007    72.3  33333216     6223.
 4 Angola      Africa     2007    42.7  12420476     4797.
 5 Argentina   Americas   2007    75.3  40301927    12779.
 6 Australia   Oceania    2007    81.2  20434176    34435.
 7 Austria     Europe     2007    79.8   8199783    36126.
 8 Bahrain     Asia       2007    75.6    708573    29796.
 9 Bangladesh  Asia       2007    64.1 150448339     1391.
10 Belgium     Europe     2007    79.4  10392226    33693.
# … with 132 more rows
```
]]
---
class: split-40
count: false

```r
set.seed(2019)  
gapminder %>%  
  filter(year == 2007) %>%  #
* filter(continent != "Oceania")
```
]]
.column[.content[

```
# A tibble: 140 x 6
   country     continent  year lifeExp       pop gdpPercap
   <fct>       <fct>     <int>   <dbl>     <int>     <dbl>
 1 Afghanistan Asia       2007    43.8  31889923      975.
 2 Albania     Europe     2007    76.4   3600523     5937.
 3 Algeria     Africa     2007    72.3  33333216     6223.
 4 Angola      Africa     2007    42.7  12420476     4797.
 5 Argentina   Americas   2007    75.3  40301927    12779.
 6 Austria     Europe     2007    79.8   8199783    36126.
 7 Bahrain     Asia       2007    75.6    708573    29796.
 8 Bangladesh  Asia       2007    64.1 150448339     1391.
 9 Belgium     Europe     2007    79.4  10392226    33693.
10 Benin       Africa     2007    56.7   8078314     1441.
# … with 130 more rows
```
]]
---
class: split-40
count: false

```r
set.seed(2019)  
gapminder %>%  
  filter(year == 2007) %>%  #
  filter(continent != "Oceania") %>%  
* mutate(continent = factor(continent))
```
]]
.column[.content[

```r
set.seed(2019)  
gapminder %>%  
  filter(year == 2007) %>%  #
  filter(continent != "Oceania") %>%  
  mutate(continent = factor(continent)) %>%  
* mutate(high_income = case_when(
*   gdpPercap >= 15000 ~ "high income",
*   gdpPercap < 15000 ~ "not high income")
*   )
```
]]
.column[.content[

```
# A tibble: 140 x 7
   country     continent  year lifeExp       pop gdpPercap high_income    
   <fct>       <fct>     <int>   <dbl>     <int>     <dbl> <chr>          
 1 Afghanistan Asia       2007    43.8  31889923      975. not high income
 2 Albania     Europe     2007    76.4   3600523     5937. not high income
 3 Algeria     Africa     2007    72.3  33333216     6223. not high income
 4 Angola      Africa     2007    42.7  12420476     4797. not high income
 5 Argentina   Americas   2007    75.3  40301927    12779. not high income
 6 Austria     Europe     2007    79.8   8199783    36126. high income    
 7 Bahrain     Asia       2007    75.6    708573    29796. high income    
 8 Bangladesh  Asia       2007    64.1 150448339     1391. not high income
 9 Belgium     Europe     2007    79.4  10392226    33693. high income    
10 Benin       Africa     2007    56.7   8078314     1441. not high income
# … with 130 more rows
```
]]
---
class: split-40
count: false

```r
set.seed(2019)  
gapminder %>%  
  filter(year == 2007) %>%  #
  filter(continent != "Oceania") %>%  
  mutate(continent = factor(continent)) %>%  
  mutate(high_income = case_when(  
    gdpPercap >= 15000 ~ "high income",  
    gdpPercap < 15000 ~ "not high income")  
    ) %>%  
* mutate(asia = case_when(
*   continent == "Asia" ~ "Asia",
*   continent != "Asia" ~ "not Asia")
*        )
```
]]
.column[.content[

```
# A tibble: 140 x 8
   country     continent  year lifeExp      pop gdpPercap high_income    asia   
   <fct>       <fct>     <int>   <dbl>    <int>     <dbl> <chr>          <chr>  
 1 Afghanistan Asia       2007    43.8   3.19e7      975. not high inco… Asia   
 2 Albania     Europe     2007    76.4   3.60e6     5937. not high inco… not As…
 3 Algeria     Africa     2007    72.3   3.33e7     6223. not high inco… not As…
 4 Angola      Africa     2007    42.7   1.24e7     4797. not high inco… not As…
 5 Argentina   Americas   2007    75.3   4.03e7    12779. not high inco… not As…
 6 Austria     Europe     2007    79.8   8.20e6    36126. high income    not As…
 7 Bahrain     Asia       2007    75.6   7.09e5    29796. high income    Asia   
 8 Bangladesh  Asia       2007    64.1   1.50e8     1391. not high inco… Asia   
 9 Belgium     Europe     2007    79.4   1.04e7    33693. high income    not As…
10 Benin       Africa     2007    56.7   8.08e6     1441. not high inco… not As…
# … with 130 more rows
```
]]
---
class: split-40
count: false

```r
set.seed(2019)  
gapminder %>%  
  filter(year == 2007) %>%  #
  filter(continent != "Oceania") %>%  
  mutate(continent = factor(continent)) %>%  
  mutate(high_income = case_when(  
    gdpPercap >= 15000 ~ "high income",  
    gdpPercap < 15000 ~ "not high income")  
    ) %>%  
  mutate(asia = case_when(  
    continent == "Asia" ~ "Asia",  
    continent != "Asia" ~ "not Asia")  
         ) %>%  
* mutate(gdppercap_log10 =
*          log10(gdpPercap))
```
]]
.column[.content[

```
# A tibble: 140 x 9
   country continent  year lifeExp    pop gdpPercap high_income asia 
   <fct>   <fct>     <int>   <dbl>  <int>     <dbl> <chr>       <chr>
 1 Afghan… Asia       2007    43.8 3.19e7      975. not high i… Asia 
 2 Albania Europe     2007    76.4 3.60e6     5937. not high i… not …
 3 Algeria Africa     2007    72.3 3.33e7     6223. not high i… not …
 4 Angola  Africa     2007    42.7 1.24e7     4797. not high i… not …
 5 Argent… Americas   2007    75.3 4.03e7    12779. not high i… not …
 6 Austria Europe     2007    79.8 8.20e6    36126. high income not …
 7 Bahrain Asia       2007    75.6 7.09e5    29796. high income Asia 
 8 Bangla… Asia       2007    64.1 1.50e8     1391. not high i… Asia 
 9 Belgium Europe     2007    79.4 1.04e7    33693. high income not …
10 Benin   Africa     2007    56.7 8.08e6     1441. not high i… not …
# … with 130 more rows, and 1 more variable: gdppercap_log10 <dbl>
```
]]
---
class: split-40
count: false

```r
set.seed(2019)  
gapminder %>%  
  filter(year == 2007) %>%  #
  filter(continent != "Oceania") %>%  
  mutate(continent = factor(continent)) %>%  
  mutate(high_income = case_when(  
    gdpPercap >= 15000 ~ "high income",  
    gdpPercap < 15000 ~ "not high income")  
    ) %>%  
  mutate(asia = case_when(  
    continent == "Asia" ~ "Asia",  
    continent != "Asia" ~ "not Asia")  
         ) %>%  
  mutate(gdppercap_log10 =  
           log10(gdpPercap)) %>%  
* sample_n(50, replace = F)
```
]]
.column[.content[

```
# A tibble: 50 x 9
   country continent  year lifeExp    pop gdpPercap high_income asia 
   <fct>   <fct>     <int>   <dbl>  <int>     <dbl> <chr>       <chr>
 1 Colomb… Americas   2007    72.9 4.42e7     7007. not high i… not …
 2 Tanzan… Africa     2007    52.5 3.81e7     1107. not high i… not …
 3 Korea,… Asia       2007    67.3 2.33e7     1593. not high i… Asia 
 4 Burundi Africa     2007    49.6 8.39e6      430. not high i… not …
 5 Sudan   Africa     2007    58.6 4.23e7     2602. not high i… not …
 6 Iraq    Asia       2007    59.5 2.75e7     4471. not high i… Asia 
 7 Camero… Africa     2007    50.4 1.77e7     2042. not high i… not …
 8 Comoros Africa     2007    65.2 7.11e5      986. not high i… not …
 9 Slovak… Europe     2007    74.7 5.45e6    18678. high income not …
10 Lebanon Asia       2007    72.0 3.92e6    10461. not high i… Asia 
# … with 40 more rows, and 1 more variable: gdppercap_log10 <dbl>
```
]]
---
class: split-40
count: false

```r
set.seed(2019)  
gapminder %>%  
  filter(year == 2007) %>%  #
  filter(continent != "Oceania") %>%  
  mutate(continent = factor(continent)) %>%  
  mutate(high_income = case_when(  
    gdpPercap >= 15000 ~ "high income",  
    gdpPercap < 15000 ~ "not high income")  
    ) %>%  
  mutate(asia = case_when(  
    continent == "Asia" ~ "Asia",  
    continent != "Asia" ~ "not Asia")  
         ) %>%  
  mutate(gdppercap_log10 =  
           log10(gdpPercap)) %>%  
  sample_n(50, replace = F) ->  
*gm_2007_sample_50
```
]]
.column[.content[

]]

---
class: inverse, center, middle

# The relationship between two continuous variables

Does information about the value of "x" tell me information about the probable value of "y"?  Is the relationship strong enough to say that it is unlikely to have arisen from chance?

---

```r
*gm_2007_sample_50
```
]]
.column[.content[

```r
gm_2007_sample_50 %>%  
* ggplot()
```
]]
.column[.content[
<img src="basic_bivariate_statistical_tests_files/figure-html/plot_1_auto_2_output-1.png" width="432" />
]]
---
class: split-40
count: false

```r
gm_2007_sample_50 %>%  
  ggplot() +  
* aes(x = gdppercap_log10)
```
]]
.column[.content[
<img src="basic_bivariate_statistical_tests_files/figure-html/plot_1_auto_3_output-1.png" width="432" />
]]
---
class: split-40
count: false

```r
gm_2007_sample_50 %>%  
  ggplot() +  
  aes(x = gdppercap_log10) +  
* aes(y = lifeExp)
```
]]
.column[.content[
<img src="basic_bivariate_statistical_tests_files/figure-html/plot_1_auto_4_output-1.png" width="432" />
]]
---
class: split-40
count: false

```r
gm_2007_sample_50 %>%  
  ggplot() +  
  aes(x = gdppercap_log10) +  
  aes(y = lifeExp) +  
* geom_point(alpha = .5,
*            size = 5,
*            color = "magenta")
```
]]
.column[.content[
<img src="basic_bivariate_statistical_tests_files/figure-html/plot_1_auto_5_output-1.png" width="432" />
]]
---
class: split-40
count: false

```r
gm_2007_sample_50 %>%  
  ggplot() +  
  aes(x = gdppercap_log10) +  
  aes(y = lifeExp) +  
  geom_point(alpha = .5,  
             size = 5,  
             color = "magenta") +  
* theme_minimal(base_size = 15)
```
]]
.column[.content[
<img src="basic_bivariate_statistical_tests_files/figure-html/plot_1_auto_6_output-1.png" width="432" />
]]

---

### Correlation test

Eyeballing it, there certainly seems to be a relationship between x and y --- knowing the value of x would seem to inform you of the general likely value of y.  But let's do the statistical test too.

---

Now, the statistical test.

Note: The normalized version of covariance is Pearson's correlation coefficient.

`$$cov(x,y) = \frac{\sum_{i=1}^n (x_i-\overline{x})(y_i-\overline{y})}{n}$$`

---

<div class="figure">
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/b/be/Karl_Pearson%3B_Sir_Francis_Galton.jpg/800px-Karl_Pearson%3B_Sir_Francis_Galton.jpg" alt="Pearson and Galton - via wikipedia" width="45%" />
<p class="caption">Pearson and Galton - via wikipedia</p>
</div>

---

```r
*cor.test(
* x = gm_2007_sample_50$gdppercap_log10,
* y = gm_2007_sample_50$lifeExp
* )
```
]]
.column[.content[

```

Pearson's product-moment correlation

data:  gm_2007_sample_50$gdppercap_log10 and gm_2007_sample_50$lifeExp
t = 14.437, df = 48, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.8320019 0.9432120
sample estimates:
      cor 
0.9015609 
```
]]
---
class: split-40
count: false

```r
cor.test(  
  x = gm_2007_sample_50$gdppercap_log10,  
  y = gm_2007_sample_50$lifeExp  
  )

# using a new pipe from the magrittr package
*library(magrittr)
```
]]
.column[.content[

```

Pearson's product-moment correlation

```r
cor.test(  
  x = gm_2007_sample_50$gdppercap_log10,  
  y = gm_2007_sample_50$lifeExp  
  )

# using a new pipe from the magrittr package
library(magrittr)  
*gm_2007_sample_50   # state what data vars come from
```
]]
.column[.content[

```

Pearson's product-moment correlation

```r
cor.test(  
  x = gm_2007_sample_50$gdppercap_log10,  
  y = gm_2007_sample_50$lifeExp  
  )

# using a new pipe from the magrittr package
library(magrittr)  
gm_2007_sample_50 %$%  # state what data vars come from
* cor.test(x = gdppercap_log10,
*          y = lifeExp)
```
]]
.column[.content[

```

Pearson's product-moment correlation

```

Pearson's product-moment correlation

data:  gdppercap_log10 and lifeExp
t = 14.437, df = 48, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.8320019 0.9432120
sample estimates:
      cor 
0.9015609 
```
]]

---
class: inverse, center, middle

# The relationship between a dichotomous (two groups) and continuous variable

---

```r
*ggplot(gm_2007_sample_50)
```
]]
.column[.content[
<img src="basic_bivariate_statistical_tests_files/figure-html/t_plot_auto_1_output-1.png" width="432" />
]]
---
class: split-40
count: false

```r
ggplot(gm_2007_sample_50) +  
* aes(y = lifeExp)
```
]]
.column[.content[
<img src="basic_bivariate_statistical_tests_files/figure-html/t_plot_auto_2_output-1.png" width="432" />
]]
---
class: split-40
count: false

```r
ggplot(gm_2007_sample_50) +  
  aes(y = lifeExp) +  
* aes(x = asia)
```
]]
.column[.content[
<img src="basic_bivariate_statistical_tests_files/figure-html/t_plot_auto_3_output-1.png" width="432" />
]]
---
class: split-40
count: false

```r
ggplot(gm_2007_sample_50) +  
  aes(y = lifeExp) +  
  aes(x = asia) +  
* aes(fill = asia)
```
]]
.column[.content[
<img src="basic_bivariate_statistical_tests_files/figure-html/t_plot_auto_4_output-1.png" width="432" />
]]
---
class: split-40
count: false

```r
ggplot(gm_2007_sample_50) +  
  aes(y = lifeExp) +  
  aes(x = asia) +  
  aes(fill = asia) +  
* geom_boxplot()
```
]]
.column[.content[
<img src="basic_bivariate_statistical_tests_files/figure-html/t_plot_auto_5_output-1.png" width="432" />
]]
---
class: split-40
count: false

```r
ggplot(gm_2007_sample_50) +  
  aes(y = lifeExp) +  
  aes(x = asia) +  
  aes(fill = asia) +  
  geom_boxplot() +  
* geom_jitter(height = 0,
*             width = .2,
*             alpha = .3)
```
]]
.column[.content[
<img src="basic_bivariate_statistical_tests_files/figure-html/t_plot_auto_6_output-1.png" width="432" />
]]
---
class: split-40
count: false

```r
ggplot(gm_2007_sample_50) +  
  aes(y = lifeExp) +  
  aes(x = asia) +  
  aes(fill = asia) +  
  geom_boxplot() +  
  geom_jitter(height = 0,  
              width = .2,  
              alpha = .3) +  
* geom_point(data  = . %>%
*              group_by(asia) %>%
*              summarize(mean_life_exp = mean(lifeExp)),
*            aes(y = mean_life_exp),
*            color = "goldenrod",
*            size = 5)
```
]]
.column[.content[
<img src="basic_bivariate_statistical_tests_files/figure-html/t_plot_auto_7_output-1.png" width="432" />
]]

---

# An alternative

---

```r
*ggplot(gm_2007_sample_50)
```
]]
.column[.content[
<img src="basic_bivariate_statistical_tests_files/figure-html/hist_2_auto_1_output-1.png" width="432" />
]]
---
class: split-40
count: false

```r
ggplot(gm_2007_sample_50) +  
* aes(x = lifeExp)
```
]]
.column[.content[
<img src="basic_bivariate_statistical_tests_files/figure-html/hist_2_auto_2_output-1.png" width="432" />
]]
---
class: split-40
count: false

```r
ggplot(gm_2007_sample_50) +  
  aes(x = lifeExp) +  
* aes(fill = asia)
```
]]
.column[.content[
<img src="basic_bivariate_statistical_tests_files/figure-html/hist_2_auto_3_output-1.png" width="432" />
]]
---
class: split-40
count: false

```r
ggplot(gm_2007_sample_50) +  
  aes(x = lifeExp) +  
  aes(fill = asia) +  
* aes(col = asia)
```
]]
.column[.content[
<img src="basic_bivariate_statistical_tests_files/figure-html/hist_2_auto_4_output-1.png" width="432" />
]]
---
class: split-40
count: false

```r
ggplot(gm_2007_sample_50) +  
  aes(x = lifeExp) +  
  aes(fill = asia) +  
  aes(col = asia) +  
* geom_histogram(alpha = .2)
```
]]
.column[.content[
<img src="basic_bivariate_statistical_tests_files/figure-html/hist_2_auto_5_output-1.png" width="432" />
]]
---
class: split-40
count: false

```r
ggplot(gm_2007_sample_50) +  
  aes(x = lifeExp) +  
  aes(fill = asia) +  
  aes(col = asia) +  
  geom_histogram(alpha = .2) +  
* facet_wrap(~ asia, nrow = 2)
```
]]
.column[.content[
<img src="basic_bivariate_statistical_tests_files/figure-html/hist_2_auto_6_output-1.png" width="432" />
]]
---
class: split-40
count: false

```r
ggplot(gm_2007_sample_50) +  
  aes(x = lifeExp) +  
  aes(fill = asia) +  
  aes(col = asia) +  
  geom_histogram(alpha = .2) +  
  facet_wrap(~ asia, nrow = 2) +  
* geom_rug()
```
]]
.column[.content[
<img src="basic_bivariate_statistical_tests_files/figure-html/hist_2_auto_7_output-1.png" width="432" />
]]
---
class: split-40
count: false

```r
ggplot(gm_2007_sample_50) +  
  aes(x = lifeExp) +  
  aes(fill = asia) +  
  aes(col = asia) +  
  geom_histogram(alpha = .2) +  
  facet_wrap(~ asia, nrow = 2) +  
  geom_rug() +  
* geom_vline(data  = . %>%
*              group_by(asia) %>%
*              summarize(mean_life_exp = mean(lifeExp)),
*            aes(xintercept = mean_life_exp),
*            linetype = "dashed")
```
]]
.column[.content[
<img src="basic_bivariate_statistical_tests_files/figure-html/hist_2_auto_8_output-1.png" width="432" />
]]

---

### "student's t-test"

<div class="figure">
<img src="https://upload.wikimedia.org/wikipedia/commons/4/42/William_Sealy_Gosset.jpg" alt="William_Sealy_Gosset - via wikipedia" width="35%" /><img src="https://upload.wikimedia.org/wikipedia/commons/6/65/William_Gosset_plaque_in_Guinness_storehouse_tour%2C_Ireland.jpg" alt="William_Sealy_Gosset - via wikipedia" width="35%" />
<p class="caption">William_Sealy_Gosset - via wikipedia</p>
</div>

---

### The t-test compares group means, or a sample mean from an expected mean.  Are they different enough from each other to be considered statistically different?

---

$$ t = \frac{(\bar x_1 - \bar x_2) - (\mu_1 - \mu_2) }{\sqrt(s_p^2/n_1 +s_p^2/n_2)} $$

$$ t = \frac{(\bar x_1 - \bar x_2)}{\sqrt(s_p^2/n_1 +s_p^2/n_2)} $$

---

```r
*gm_2007_sample_50
```
]]
.column[.content[

```r
gm_2007_sample_50 %>%  
* t.test(formula = lifeExp ~ asia,
*        data = .)
```
]]
.column[.content[

```

Welch Two Sample t-test

data:  lifeExp by asia
t = 0.99834, df = 30.407, p-value = 0.326
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -3.46270 10.09291
sample estimates:
    mean in group Asia mean in group not Asia 
              70.11513               66.80003 
```
]]

---
class: inverse, center, middle

# Relationship between multinomial variable (many groups) and continuous variable

---

```r
*ggplot(gm_2007_sample_50)
```
]]
.column[.content[
<img src="basic_bivariate_statistical_tests_files/figure-html/multi_box_auto_1_output-1.png" width="432" />
]]
---
class: split-40
count: false

```r
ggplot(gm_2007_sample_50) +  
* aes(y = lifeExp)
```
]]
.column[.content[
<img src="basic_bivariate_statistical_tests_files/figure-html/multi_box_auto_2_output-1.png" width="432" />
]]
---
class: split-40
count: false

```r
ggplot(gm_2007_sample_50) +  
  aes(y = lifeExp) +  
* aes(x = continent)
```
]]
.column[.content[
<img src="basic_bivariate_statistical_tests_files/figure-html/multi_box_auto_3_output-1.png" width="432" />
]]
---
class: split-40
count: false

```r
ggplot(gm_2007_sample_50) +  
  aes(y = lifeExp) +  
  aes(x = continent) +  
* aes(fill = continent)
```
]]
.column[.content[
<img src="basic_bivariate_statistical_tests_files/figure-html/multi_box_auto_4_output-1.png" width="432" />
]]
---
class: split-40
count: false

```r
ggplot(gm_2007_sample_50) +  
  aes(y = lifeExp) +  
  aes(x = continent) +  
  aes(fill = continent) +  
* geom_boxplot()
```
]]
.column[.content[
<img src="basic_bivariate_statistical_tests_files/figure-html/multi_box_auto_5_output-1.png" width="432" />
]]
---
class: split-40
count: false

```r
ggplot(gm_2007_sample_50) +  
  aes(y = lifeExp) +  
  aes(x = continent) +  
  aes(fill = continent) +  
  geom_boxplot() +  
* geom_jitter(height = 0,
*             width = .2,
*             alpha = .3)
```
]]
.column[.content[
<img src="basic_bivariate_statistical_tests_files/figure-html/multi_box_auto_6_output-1.png" width="432" />
]]
---
class: split-40
count: false

```r
ggplot(gm_2007_sample_50) +  
  aes(y = lifeExp) +  
  aes(x = continent) +  
  aes(fill = continent) +  
  geom_boxplot() +  
  geom_jitter(height = 0,  
              width = .2,  
              alpha = .3) +  
* theme_minimal()
```
]]
.column[.content[
<img src="basic_bivariate_statistical_tests_files/figure-html/multi_box_auto_7_output-1.png" width="432" />
]]
---
class: split-40
count: false

```r
ggplot(gm_2007_sample_50) +  
  aes(y = lifeExp) +  
  aes(x = continent) +  
  aes(fill = continent) +  
  geom_boxplot() +  
  geom_jitter(height = 0,  
              width = .2,  
              alpha = .3) +  
  theme_minimal() +  
* theme(legend.position = "none")
```
]]
.column[.content[
<img src="basic_bivariate_statistical_tests_files/figure-html/multi_box_auto_8_output-1.png" width="432" />
]]
---
class: split-40
count: false

```r
ggplot(gm_2007_sample_50) +  
  aes(y = lifeExp) +  
  aes(x = continent) +  
  aes(fill = continent) +  
  geom_boxplot() +  
  geom_jitter(height = 0,  
              width = .2,  
              alpha = .3) +  
  theme_minimal() +  
  theme(legend.position = "none") +  
* geom_vline(data  = . %>%
*              group_by(asia) %>%
*              summarize(mean_life_exp = mean(lifeExp)),
*            aes(xintercept = mean_life_exp),
*            linetype = "dashed")
```
]]
.column[.content[
<img src="basic_bivariate_statistical_tests_files/figure-html/multi_box_auto_9_output-1.png" width="432" />
]]

---

## "Analysis of Variance" (Anova)

test statistic = difference in means.

Are there any difference in means for the groups that are strong enough to say that they are statistically different.

---

```r
# the model itself
*gm_2007_sample_50
```
]]
.column[.content[

```r
# the model itself
gm_2007_sample_50 %>%  
*aov(formula = lifeExp ~ continent,
*   data = .)
```
]]
.column[.content[

```
Call:
   aov(formula = lifeExp ~ continent, data = .)

Terms:
                continent Residuals
Sum of Squares   3115.130  3234.028
Deg. of Freedom         3        46

Residual standard error: 8.384805
Estimated effects may be unbalanced
```
]]
---
class: split-40
count: false

```r
# the model itself
gm_2007_sample_50 %>%  
aov(formula = lifeExp ~ continent,  
    data = .) ->  
*a 
```
]]
.column[.content[

]]
---
class: split-40
count: false

```r
# the model itself
gm_2007_sample_50 %>%  
aov(formula = lifeExp ~ continent,  
    data = .) ->  
a  
# the summary of the analysis of variance
*summary(a)
```
]]
.column[.content[

```
            Df Sum Sq Mean Sq F value   Pr(>F)    
continent    3   3115  1038.4   14.77 7.18e-07 ***
Residuals   46   3234    70.3                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```
]]

---

## Which are the statisically different groups?

Post hoc testing is done to identify which groups are statistically different.

```r
TukeyHSD(a)
```

```
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = lifeExp ~ continent, data = .)

$continent
                     diff        lwr       upr     p adj
Americas-Africa 17.643000   6.344685 28.941315 0.0007645
Asia-Africa     12.304133   4.490624 20.117642 0.0006850
Europe-Africa   18.866750  10.537521 27.195979 0.0000015
Asia-Americas   -5.338867 -16.880184  6.202451 0.6094127
Europe-Americas  1.223750 -10.672768 13.120268 0.9926979
Europe-Asia      6.562617  -2.093372 15.218605 0.1952925
```

---
class: inverse, center, middle

# Relationships between categorical variables

First we consider dichotomous variables, and then more categories.

---
class: inverse, center, middle
## Relationship between between dichotomous variables (two categories for each variable)

- Binary (male, female; high income, low income)
- Indicator, "Dummy variable" 
(on drug, not on drug; recovered, not recovered; democracy, not democracy; militarized interstate dispute, no dispute)

---

### contingency table

Creating a contingency table with base R is pretty straightforward. We are using the dollar sign to point to a column in the dataframe gm_2007_sample_50.

---

```r
*library(magrittr)
```
]]
.column[.content[

]]
---
class: split-40
count: false

```r
library(magrittr)  
*gm_2007_sample_50
```
]]
.column[.content[

```r
library(magrittr)  
gm_2007_sample_50 %$%  
*table(x = asia,
*     y = high_income)
```
]]
.column[.content[

```
          y
x          high income not high income
  Asia               6               9
  not Asia           7              28
```
]]
---
class: split-40
count: false

```r
library(magrittr)  
gm_2007_sample_50 %$%  
table(x = asia,  
      y = high_income) %>%  
* plot()
```
]]
.column[.content[
<img src="basic_bivariate_statistical_tests_files/figure-html/contingency_auto_4_output-1.png" width="432" />
]]

---
### Contingency table with the tidyverse

You can do this in the tidyverse too.  With a few more steps.

---

```r
*gm_2007_sample_50
```
]]
.column[.content[

```r
gm_2007_sample_50 %>%  
* group_by(asia, high_income)
```
]]
.column[.content[

```
# A tibble: 50 x 9
# Groups:   asia, high_income [4]
   country continent  year lifeExp    pop gdpPercap high_income asia 
   <fct>   <fct>     <int>   <dbl>  <int>     <dbl> <chr>       <chr>
 1 Colomb… Americas   2007    72.9 4.42e7     7007. not high i… not …
 2 Tanzan… Africa     2007    52.5 3.81e7     1107. not high i… not …
 3 Korea,… Asia       2007    67.3 2.33e7     1593. not high i… Asia 
 4 Burundi Africa     2007    49.6 8.39e6      430. not high i… not …
 5 Sudan   Africa     2007    58.6 4.23e7     2602. not high i… not …
 6 Iraq    Asia       2007    59.5 2.75e7     4471. not high i… Asia 
 7 Camero… Africa     2007    50.4 1.77e7     2042. not high i… not …
 8 Comoros Africa     2007    65.2 7.11e5      986. not high i… not …
 9 Slovak… Europe     2007    74.7 5.45e6    18678. high income not …
10 Lebanon Asia       2007    72.0 3.92e6    10461. not high i… Asia 
# … with 40 more rows, and 1 more variable: gdppercap_log10 <dbl>
```
]]
---
class: split-40
count: false

```r
gm_2007_sample_50 %>%  
  group_by(asia, high_income) %>%  
* count()
```
]]
.column[.content[

```
# A tibble: 4 x 3
# Groups:   asia, high_income [4]
  asia     high_income         n
  <chr>    <chr>           <int>
1 Asia     high income         6
2 Asia     not high income     9
3 not Asia high income         7
4 not Asia not high income    28
```
]]
---
class: split-40
count: false

```r
gm_2007_sample_50 %>%  
  group_by(asia, high_income) %>%  
  count() %>%  
* spread(high_income, n)
```
]]
.column[.content[

```
# A tibble: 2 x 3
# Groups:   asia [2]
  asia     `high income` `not high income`
  <chr>            <int>             <int>
1 Asia                 6                 9
2 not Asia             7                28
```
]]

---

```r
*gm_2007_sample_50
```
]]
.column[.content[

```r
gm_2007_sample_50 %>%  
*ggplot()
```
]]
.column[.content[
<img src="basic_bivariate_statistical_tests_files/figure-html/jitter_auto_2_output-1.png" width="432" />
]]
---
class: split-40
count: false

```r
gm_2007_sample_50 %>%  
ggplot() +  
* aes(x = high_income)
```
]]
.column[.content[
<img src="basic_bivariate_statistical_tests_files/figure-html/jitter_auto_3_output-1.png" width="432" />
]]
---
class: split-40
count: false

```r
gm_2007_sample_50 %>%  
ggplot() +  
  aes(x = high_income) +  
* aes(y = asia)
```
]]
.column[.content[
<img src="basic_bivariate_statistical_tests_files/figure-html/jitter_auto_4_output-1.png" width="432" />
]]
---
class: split-40
count: false

```r
gm_2007_sample_50 %>%  
ggplot() +  
  aes(x = high_income) +  
  aes(y = asia) +  
* geom_jitter(width = .3, height = .3)
```
]]
.column[.content[
<img src="basic_bivariate_statistical_tests_files/figure-html/jitter_auto_5_output-1.png" width="432" />
]]
---
class: split-40
count: false

```r
gm_2007_sample_50 %>%  
ggplot() +  
  aes(x = high_income) +  
  aes(y = asia) +  
  geom_jitter(width = .3, height = .3) +  
* geom_count(color = "blue",
*            alpha = .3)
```
]]
.column[.content[
<img src="basic_bivariate_statistical_tests_files/figure-html/jitter_auto_6_output-1.png" width="432" />
]]
---
class: split-40
count: false

```r
gm_2007_sample_50 %>%  
ggplot() +  
  aes(x = high_income) +  
  aes(y = asia) +  
  geom_jitter(width = .3, height = .3) +  
  geom_count(color = "blue",  
             alpha = .3) +  
* scale_size_continuous(range = c(5, 25),
*                       breaks = 1:5 * 5 )
```
]]
.column[.content[
<img src="basic_bivariate_statistical_tests_files/figure-html/jitter_auto_7_output-1.png" width="432" />
]]

---

```r
*ggplot(gm_2007_sample_50)
```
]]
.column[.content[
<img src="basic_bivariate_statistical_tests_files/figure-html/news_auto_1_output-1.png" width="432" />
]]
---
class: split-40
count: false

```r
ggplot(gm_2007_sample_50) +  
* aes(x = asia)
```
]]
.column[.content[
<img src="basic_bivariate_statistical_tests_files/figure-html/news_auto_2_output-1.png" width="432" />
]]
---
class: split-40
count: false

```r
ggplot(gm_2007_sample_50) +  
  aes(x = asia) +  
* aes(fill = high_income)
```
]]
.column[.content[
<img src="basic_bivariate_statistical_tests_files/figure-html/news_auto_3_output-1.png" width="432" />
]]
---
class: split-40
count: false

```r
ggplot(gm_2007_sample_50) +  
  aes(x = asia) +  
  aes(fill = high_income) +  
* geom_bar(position = "dodge")
```
]]
.column[.content[
<img src="basic_bivariate_statistical_tests_files/figure-html/news_auto_4_output-1.png" width="432" />
]]

---

### Chi-squared test

$$ \tilde{\chi}^2=\sum_{k=1}^{n} \frac{(O_k - E_k)^2}{E_k} $$

---

### mosaic plot

A mosaic plot represents the proportions of each cell.  Usually the major divide (the split that cuts across the whole square) is the "explanatory" variable.

test statistic: Chi-squared statistic

---

```r
*library(magrittr)
```
]]
.column[.content[

]]
---
class: split-40
count: false

```r
library(magrittr)  
*gm_2007_sample_50
```
]]
.column[.content[

```r
library(magrittr)  
gm_2007_sample_50 %$%  
* chisq.test(x = high_income,
*            y = asia)
```
]]
.column[.content[

```

Pearson's Chi-squared test with Yates' continuity correction

data:  high_income and asia
X-squared = 1.2672, df = 1, p-value = 0.2603
```
]]
---
class: split-40
count: false

```r
library(magrittr)  
gm_2007_sample_50 %$%  
  chisq.test(x = high_income,  
             y = asia) ->  
*chi_sq_income_asia
```
]]
.column[.content[

]]
---
class: split-40
count: false

```r
library(magrittr)  
gm_2007_sample_50 %$%  
  chisq.test(x = high_income,  
             y = asia) ->  
chi_sq_income_asia  
*chi_sq_income_asia$observed
```
]]
.column[.content[

```
                 asia
high_income       Asia not Asia
  high income        6        7
  not high income    9       28
```
]]
---
class: split-40
count: false

```r
library(magrittr)  
gm_2007_sample_50 %$%  
  chisq.test(x = high_income,  
             y = asia) ->  
chi_sq_income_asia  
chi_sq_income_asia$observed %>%  
* vcd::mosaic()
```
]]
.column[.content[
<img src="basic_bivariate_statistical_tests_files/figure-html/chi_two_by_two_auto_6_output-1.png" width="432" />
]]

---

```r
*chi_sq_income_asia$expected   # expectation under null
```
]]
.column[.content[

```
                 asia
high_income       Asia not Asia
  high income      3.9      9.1
  not high income 11.1     25.9
```
]]
---
class: split-40
count: false

```r
chi_sq_income_asia$expected %>%  # expectation under null
* vcd::mosaic()
```
]]
.column[.content[
<img src="basic_bivariate_statistical_tests_files/figure-html/mosaic_two_by_two_auto_2_output-1.png" width="432" />
]]

---

```r
*library(magrittr)
```
]]
.column[.content[

]]
---
class: split-40
count: false

```r
library(magrittr)  
*gm_2007_sample_50
```
]]
.column[.content[

```r
library(magrittr)  
gm_2007_sample_50 %$%  
* chisq.test(x = high_income,
*            y = asia,
*            correct = F)
```
]]
.column[.content[

```

Pearson's Chi-squared test

data:  high_income and asia
X-squared = 2.183, df = 1, p-value = 0.1395
```
]]
---
class: split-40
count: false

```r
library(magrittr)  
gm_2007_sample_50 %$%  
  chisq.test(x = high_income,  
             y = asia,  
             correct = F)

*chi_sq_income_asia$expected
```
]]
.column[.content[

```

Pearson's Chi-squared test

data:  high_income and asia
X-squared = 2.183, df = 1, p-value = 0.1395
```

```
                 asia
high_income       Asia not Asia
  high income      3.9      9.1
  not high income 11.1     25.9
```
]]
---
class: split-40
count: false

```r
library(magrittr)  
gm_2007_sample_50 %$%  
  chisq.test(x = high_income,  
             y = asia,  
             correct = F)

chi_sq_income_asia$expected %>%  
* `-`(chi_sq_income_asia$observed)
```
]]
.column[.content[

```

Pearson's Chi-squared test

data:  high_income and asia
X-squared = 2.183, df = 1, p-value = 0.1395
```

```
                 asia
high_income       Asia not Asia
  high income     -2.1      2.1
  not high income  2.1     -2.1
```
]]
---
class: split-40
count: false

```r
library(magrittr)  
gm_2007_sample_50 %$%  
  chisq.test(x = high_income,  
             y = asia,  
             correct = F)

chi_sq_income_asia$expected %>%  
  `-`(chi_sq_income_asia$observed) %>%  
* `^`(2)
```
]]
.column[.content[

```

Pearson's Chi-squared test

data:  high_income and asia
X-squared = 2.183, df = 1, p-value = 0.1395
```

```
                 asia
high_income       Asia not Asia
  high income     4.41     4.41
  not high income 4.41     4.41
```
]]
---
class: split-40
count: false

```r
library(magrittr)  
gm_2007_sample_50 %$%  
  chisq.test(x = high_income,  
             y = asia,  
             correct = F)

chi_sq_income_asia$expected %>%  
  `-`(chi_sq_income_asia$observed) %>%  
  `^`(2) %>%  
* `/`(chi_sq_income_asia$expected)
```
]]
.column[.content[

```

Pearson's Chi-squared test

data:  high_income and asia
X-squared = 2.183, df = 1, p-value = 0.1395
```

```
                 asia
high_income            Asia  not Asia
  high income     1.1307692 0.4846154
  not high income 0.3972973 0.1702703
```
]]
---
class: split-40
count: false

```r
library(magrittr)  
gm_2007_sample_50 %$%  
  chisq.test(x = high_income,  
             y = asia,  
             correct = F)

chi_sq_income_asia$expected %>%  
  `-`(chi_sq_income_asia$observed) %>%  
  `^`(2) %>%  
  `/`(chi_sq_income_asia$expected) %>%  
* sum()
```
]]
.column[.content[

```

Pearson's Chi-squared test

data:  high_income and asia
X-squared = 2.183, df = 1, p-value = 0.1395
```

```
[1] 2.182952
```
]]

---

![](https://upload.wikimedia.org/wikipedia/commons/3/35/Chi-square_pdf.svg)

---
class: inverse, center, middle
## Relationships between categorical data: more than two categories

test statistic: Chi-squared statistic

---

```r
*library(magrittr)
```
]]
.column[.content[

]]
---
class: split-40
count: false

```r
library(magrittr)  
*gm_2007_sample_50
```
]]
.column[.content[

```r
library(magrittr)  
gm_2007_sample_50 %$%  
*chisq.test(x = continent,
*          y = high_income)
```
]]
.column[.content[

```

Pearson's Chi-squared test

data:  continent and high_income
X-squared = 16.13, df = 3, p-value = 0.001067
```
]]
---
class: split-40
count: false

```r
library(magrittr)  
gm_2007_sample_50 %$%  
chisq.test(x = continent,  
           y = high_income) ->  
*my_chi_sqr
```
]]
.column[.content[

]]
---
class: split-40
count: false

```r
library(magrittr)  
gm_2007_sample_50 %$%  
chisq.test(x = continent,  
           y = high_income) ->  
my_chi_sqr  
*my_chi_sqr$observed
```
]]
.column[.content[

```
          high_income
continent  high income not high income
  Africa             0              18
  Americas           0               5
  Asia               6               9
  Europe             7               5
```
]]
---
class: split-40
count: false

```r
library(magrittr)  
gm_2007_sample_50 %$%  
chisq.test(x = continent,  
           y = high_income) ->  
my_chi_sqr  
my_chi_sqr$observed %>%  
* vcd::mosaic()
```
]]
.column[.content[
<img src="basic_bivariate_statistical_tests_files/figure-html/chi_mult_cat_auto_6_output-1.png" width="432" />
]]

---

```r
*my_chi_sqr$expected   # expectation under null
```
]]
.column[.content[

```
          high_income
continent  high income not high income
  Africa          4.68           13.32
  Americas        1.30            3.70
  Asia            3.90           11.10
  Europe          3.12            8.88
```
]]
---
class: split-40
count: false

```r
my_chi_sqr$expected %>%  # expectation under null
* vcd::mosaic()
```
]]
.column[.content[
<img src="basic_bivariate_statistical_tests_files/figure-html/chi_mult_cat_plot_auto_2_output-1.png" width="432" />
]]

---

# Warnings

---

## Statistics don't always perform well

<div class="figure">
<img src="https://upload.wikimedia.org/wikipedia/commons/e/ec/Anscombe%27s_quartet_3.svg" alt="Anscombe's quartet - via wikipedia" width="80%" />
<p class="caption">Anscombe's quartet - via wikipedia</p>
</div>

---

<div class="figure">
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/d/d4/Correlation_examples2.svg/1920px-Correlation_examples2.svg.png" alt="correlation coefficients - via wikipedia" width="80%" />
<p class="caption">correlation coefficients - via wikipedia</p>
</div>

---

There are many more statistics not covered here, that address different situations, where assumptions for the t.test, correlation test, and chi-squared test might not be met. But these are a starting point.

### *There's a Stat for That!: What to Do & When to Do it* Bruce Frey

---

# Linear Modeling: Ordinary least squares

---

```r
*ggplot(data = gm_2007_sample_50)
```
]]
.column[.content[
<img src="basic_bivariate_statistical_tests_files/figure-html/picture_lm_auto_1_output-1.png" width="432" />
]]
---
class: split-40
count: false

```r
ggplot(data = gm_2007_sample_50) +  
* aes(x = gdppercap_log10)
```
]]
.column[.content[
<img src="basic_bivariate_statistical_tests_files/figure-html/picture_lm_auto_2_output-1.png" width="432" />
]]
---
class: split-40
count: false

```r
ggplot(data = gm_2007_sample_50) +  
  aes(x = gdppercap_log10) +  
* aes(y = lifeExp)
```
]]
.column[.content[
<img src="basic_bivariate_statistical_tests_files/figure-html/picture_lm_auto_3_output-1.png" width="432" />
]]
---
class: split-40
count: false

```r
ggplot(data = gm_2007_sample_50) +  
  aes(x = gdppercap_log10) +  
  aes(y = lifeExp) +  
* geom_point()
```
]]
.column[.content[
<img src="basic_bivariate_statistical_tests_files/figure-html/picture_lm_auto_4_output-1.png" width="432" />
]]
---
class: split-40
count: false

```r
ggplot(data = gm_2007_sample_50) +  
  aes(x = gdppercap_log10) +  
  aes(y = lifeExp) +  
  geom_point() +  
* geom_smooth(method = lm)
```
]]
.column[.content[
<img src="basic_bivariate_statistical_tests_files/figure-html/picture_lm_auto_5_output-1.png" width="432" />
]]

---

# Modeling logic

- x
    
    - explanatory variable
    - right-hand variable
    - predictor
    - independent variable

- y

- outcome
    - left-hand variable
    - dependent variable
    - variable to be explained

---
class: inverse, center, middle
# Ordinary least squares modeling

- minimizes the sum of the squared residuals

- residuals = predicted value (value predicted by linear model) - observed value

---

# OLS model estimation and summary

---

```r
*gm_2007_sample_50
```
]]
.column[.content[

```r
gm_2007_sample_50 %>%  
* lm(formula = lifeExp ~ gdppercap_log10,
*    data = .)
```
]]
.column[.content[

```

Call:
lm(formula = lifeExp ~ gdppercap_log10, data = .)

Coefficients:
    (Intercept)  gdppercap_log10  
          4.999           16.930  
```
]]
---
class: split-40
count: false

```r
gm_2007_sample_50 %>%  
  lm(formula = lifeExp ~ gdppercap_log10,  
     data = .) %>%  
* summary()
```
]]
.column[.content[

```

Call:
lm(formula = lifeExp ~ gdppercap_log10, data = .)

Residuals:
     Min       1Q   Median       3Q      Max 
-12.2805  -2.5701   0.1678   3.4702   9.4659

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)        4.999      4.406   1.135    0.262    
gdppercap_log10   16.930      1.173  14.437   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.976 on 48 degrees of freedom
Multiple R-squared:  0.8128,	Adjusted R-squared:  0.8089 
F-statistic: 208.4 on 1 and 48 DF,  p-value: < 2.2e-16
```
]]

---

```r
gm_2007_sample_50 %>% 
  lm(formula = lifeExp ~ gdppercap_log10, 
     data = .)
```

```

Call:
lm(formula = lifeExp ~ gdppercap_log10, data = .)

Coefficients:
    (Intercept)  gdppercap_log10  
          4.999           16.930  
```
$$ LifeExp = 16.93*log10GDP + 4.999 + \epsilon $$