--- title: "Comparing two variances" output: html_document: code_folding: hide css: http://www.bradthiessen.com/batlab3.css df_print: tibble fig_height: 3.9 fig_width: 6.3 highlight: pygments theme: spacelab toc: yes toc_float: yes html_notebook: code_folding: hide css: http://www.bradthiessen.com/batlab3.css fig_height: 3.9 fig_width: 6.3 highlight: pygments theme: spacelab toc: yes --- ```{r 'global options', echo=FALSE, message=FALSE, results='hide'} knitr::opts_chunk$set( comment = "# ", collapse = TRUE, fig.height = 3.9, fig.width = 6.3 ) ``` ```{r 'prereqs', message=FALSE} # Load the tidyverse and mosaic packages library(tidyverse) library(mosaic) ``` ***** # Scenario: Grading essays If you gave a student's essay to multiple teachers, would they score it [consistently](https://en.wikipedia.org/wiki/Inter-rater_reliability)? Can consistency improve through training?

![](essay.png)

50 teachers score the same essay. 28 received prior training, while the other 22 received no training.
1. What will we calculate to determine if scores from the trained teachers are more consistent?

![](Transparent.gif)

***** ## Data Exploration Click `code` to see how the data were entered --> ```{r 'data-simulation'} # Simulate n=28 scores with a mean=5, sd=1.8. # Force all scores to be between 1-10 # Round to nearest whole number. set.seed(123) n1 <- 28 m <- 5 s1 <- 1.8 L <- 1 U <- 10 trained <- round(qnorm(runif(n1, pnorm(L, mean=m, sd=s1), pnorm(U, mean=m, sd=s1)), mean=m, sd=s1),0) # Simulate n=22 scores with the same conditions # except make the standard deviation larger (around 3.7) set.seed(123) n2 <- 22 s2 <- 3.7 L <- 1 U <- 10 untrained <- round(qnorm(runif(n2, pnorm(L, mean=m, sd=s2), pnorm(U, mean=m, sd=s2)), mean=m, sd=s2),0) # Combine these scores into a single data frame essay <- tibble( group = c( rep("trained", 28), rep("not-trained", 22) ), scores = c(trained, untrained)) # Get summary statistics essay %>% group_by(group) %>% summarize(n = n(), mean=mean(scores), sd=sd(scores), var=var(scores)) ``` | Group | n | mean | sd | var | |-------------|---:|-----:|-----:|------:| | Not-trained | 22 | 5.95 | 2.44 | 5.950 | | Trained | 28 | 5.57 | 1.73 | 2.995 |
### {.tabset .tabset-fade} #### Dotplot ```{r 'dotplot', message=FALSE, fig.height = 3.9, fig.width = 4.5} essay %>% ggplot(aes(x = group, y = scores)) + geom_dotplot(binaxis = "y", binpositions="all", stackdir = "center", fill = "steelblue", color = "white") + scale_y_continuous(breaks=seq(1, 10, 10), minor_breaks=NULL) + theme(axis.title.x = element_text(size = 12, color = "#777777")) + theme(axis.text.x = element_text(size = 12)) + theme(axis.title.y = element_text(size = 12, color="#777777")) + theme(axis.text.y = element_text(size = 12)) + labs( title = "Scores by group" ) ``` #### Boxplot ```{r 'density', message=FALSE, fig.height = 3.9, fig.width = 4.5} essay %>% ggplot(aes(y = scores, x = group)) + geom_boxplot(fill = "white", color = "black", alpha = 0.6) + geom_dotplot(binaxis = "y", stackdir = "center", fill = "steelblue", color = "white", alpha = 0.8) + scale_y_continuous(breaks=seq(1, 10, 10), minor_breaks=NULL) + theme(axis.title.x = element_text(size = 12, color = "#777777")) + theme(axis.text.x = element_text(size = 12)) + theme(axis.title.y = element_text(size = 12, color="#777777")) + theme(axis.text.y = element_text(size = 12)) + labs( title = "Scores by group" ) ``` ### . 2. Based on these visualizations, what can you conclude about the impact of training on score consistency?

![](Transparent.gif)

***** ## Test statistic We use $\overline{X}_{1}-\overline{X}_{2}$ to estimate the difference between two means. How do we estimate the difference in variances? Consider the following:
**Scenario A**: A hospital calibrates and then repeatedly tests two blood pressure monitors. Measurements from the inexpensive armband-pump had a variance of **36 units**. The expensive automated monitor had a variance of **9 units**.
**Scenario B**: Samples of upholstery fabric from two suppliers have similar *average* durability ratings (75,000 [double-rubs](https://www.greenhousefabrics.com/blog/what-are-double-rubs)) but different variances. Samples from one supplier had a variance of **22,000,000** double-rubs, while the variance from the other supplier was **20,000,000** double-rubs.
3. Why would we be interested in comparing the variances within each scenario? Which scenario represents the larger discrepancy in variances. Explain.

![](Transparent.gif)

4. What test statistic should we use to compare the variances?

![](Transparent.gif)

We'll define our **test statistic** as: `test.stat` = $\frac{s_{1}^{2}}{s_{2}^{2}}=\frac{5.950}{2.995}=1.987$ ```{r 'calculate-test-stat'} test_stat <- var( essay$scores[essay$group=="not-trained"] ) / var( essay$scores[essay$group=="trained"] ) test_stat ```
How likely were we to obtain a ratio of 1.987 from our sample data if there were no differences in *population* variances for the groups? To estimate this, we can use randomization- and theory-based methods to determine the **samping distribution of the ratio of variances**. | Framework: | Goal: NHST | Goal: Estimate effect | |-------------------------|:---------------------------------|:------------------------------| | **Randomization-based** | Randomized test of ratios | Bootstrap confidence interval | | **Theory-based** | F-, Levene, Brown-Forsythe tests | Confidence interval |
***** ## Randomized test of ratios 5. The first two teachers in the *trained* group gave scores of 4 and 6. Under our *randomization null hypothesis*, what scores would these teachers have given if they had been randomly assigned to the *untrained* group?

![](Transparent.gif)

To estimate the sampling distribution of our test statistic, we will: - Randomly shuffle our 50 scores between the trained (n=28) and untrained (n=22) groups. - Calculate our test statistic: $\frac{s_{1}^{2}}{s_{2}^{2}}$ - Repeat many times and plot the distribution of randomized test statistics The **code** to the right shows how I did this in R: ```{r 'randomized-ratios', message=FALSE} # There are many ways to calculate the ratio of variances # with 10,000 random shuffles of the group assignment. # The sample() function randomly samples (shuffles) the data # # <------------------------------> # <--- Option #1: Really fast ---> # <------------------------------> # # To conduct the 10,000 replications... # random_ratios <- replicate(10000, { # all <- sample(essay$scores) # shuffled.train <- all[1:28] # shuffled.untrain <- all[29:50] # return(var(shuffled.untrain) / var(shuffled.train)) # }) # # Here's a function to shuffle and calculate the variance ratio # It's slow, though. # random_ratios <- function (scores, groups) # { # shuffledata <- sample(scores) # n1 <- count(groups) # N <- length(groups) # shuffle1 <- shuffledata[1:n1] # shuffle2 <- shuffledata[(n1+1):N] # return(var(shuffle1) / var(shuffle2)) # } # random_ratios <- replicate(10000, rvr(essay$scores, essay$group)) # # # <------------------------------> # <------ Option #2: dplyr ------> # <------------------------------> # # random_ratios <- Do(10000) * essay %>% # mutate(shuffled = sample(scores)) %>% # summarize(var.ratio = var(shuffled[group=="not-trained"]) / # var(shuffled[group=="trained"]) ) # # or # random_ratios <- Do(1000) * essay %>% # mutate(shuffled = sample(scores)) %>% # group_by(group) %>% # summarize(variances = var(shuffled)) %>% # mutate(ratio = lag(variances) / variances) %>% # select(ratio) %>% # filter(ratio != "NA") # # <------------------------------> # <----- Option #3: slowest -----> # <------------------------------> random_ratios <- Do(10000) * var.test(scores ~ shuffle(group), data=essay)$estimate # # That stored the results in a variable (column) called "ratio.of.variances" ```
Now we plot the randomized variance ratios: ```{r 'plot-randomized-ratios'} # Let's plot the distribution of those randomized ratios of variances # # Here's the most straight-forward way to plot this # histogram(~ratio.of.variances, data=random_ratios, # xlim=c(0,4), # Set the x-axis range # width=.1, # Set the width of each bar # xlab="Possible mean differences assuming null hypothesis is true") # ladd(panel.abline(v=test_stat)) # Add vertical line at test statistic # # I'll use ggplot2. Replace geom_density with geom_histogram for a histogram ggplot(data = random_ratios, aes(x = ratio.of.variances)) + geom_density(fill="lightblue", color="white", alpha = 0.8) + annotate("segment", x = 1.987, xend = 1.987, y = 0, yend = .8, color = "red") + annotate("text", x = 1.987, y = .85, label = "observed 1.987", color = "red") + annotate("segment", x = 1/1.987, xend = 1/1.987, y = 0, yend = .8, color = "red") + annotate("text", x = 1/1.987, y = .85, label = "observed 1/1.987", color = "red") + labs( title = "Randomized ratios of variances", x = "ratios of variances" ) + scale_x_continuous(breaks=seq(0, 4, 1), minor_breaks=NULL) + theme( axis.text.x = element_text(size = 11, color="grey10"), legend.position = "none", panel.grid.major.y = element_line(colour = "white"), panel.grid.major.x = element_line(colour = "white", size=.15), panel.grid.minor = element_blank(), panel.background = element_rect(fill = "grey93") ) ```
```{r 'estimate-p-value-from-randomization'} # Count the proportion of randomized ratios > our test statistic p <- prop(~ (ratio.of.variances >= test_stat), data=random_ratios) # We were just as likely to calculate our ratio as # var(untrained) / var(trained) # as we were to calculate the reciprocal. We should estimate # the p-value for the ratio of our test statistic. p_reciprocal <- prop(~ (ratio.of.variances <= 1/test_stat), data=random_ratios) # Sum these values pvalue <- p + p_reciprocal pvalue ``` 6. The p-value is estimated to be `r round(pvalue,3)`. How was this estimated (in the above code)? Interpret this p-value.

![](Transparent.gif)

***** ## Bootstrap confidence interval 7. Explain how we can use bootstrap methods to construct a 95% confidence interval for the ratio in variances.

![](Transparent.gif)

Expand the code to see how I did this in R. ```{r 'bootstrap-ci'} boot_ratios <- Do(10000) * var.test(scores ~ group, data=resample(essay))$estimate # Capture the endpoints lower <- quantile(boot_ratios$ratio.of.variances, 0.025) upper <- quantile(boot_ratios$ratio.of.variances, 0.975) # Put the endpoints into the plot # Density plot ggplot(data = boot_ratios, aes(x = ratio.of.variances)) + geom_density(fill="lightblue", color="white", alpha = 0.8) + labs( title = "Bootstrap distribution of variance ratios", x = "ratio of group variances (from resampled data)" ) + scale_x_continuous(breaks=seq(0, 6, 1), minor_breaks=NULL) + theme( axis.text.x = element_text(size = 11, color="grey10"), legend.position = "none", panel.grid.major.y = element_line(colour = "white"), panel.grid.major.x = element_line(colour = "white", size=.15), panel.grid.minor = element_blank(), panel.background = element_rect(fill = "grey93") ) + annotate("text", x = lower, y = .05, label = round(lower,3)) + annotate("text", x = upper, y = .05, label = round(upper,3)) + annotate("text", x = mean(boot_ratios$ratio.of.variances), y = .055, label = paste("95%")) + annotate("segment", x = lower, xend = upper, y = 0.03, yend = .03, color = "red") ```
8. Look at the randomized and bootstrap distributions of our ratio of variances. Describe the shape of these distributions.

![](Transparent.gif)

***** ## Theory-based methods In this example, the randomization-based methods resulted in a sampling distribution with a *positive skew*. Will the distribution of sample variance ratios **always** have a positive skew? To investigate, let's run some simulations. Let's start with two populations that follow normal distributions with equal variances: ```{r 'normal-distributions', fig.keep="last"} # I'll use lattice graphics plotDist('norm', mean=60, sd=10, lw=3, xlim = c(0, 200), main="Normal distributions with equal variances") plotDist('norm', mean=140, sd=10, lw=3, col="red", add=TRUE) # Here's the ggplot2 syntax # ggplot(data.frame(x = c(0,200)), aes(x = x)) + # stat_function(fun = dnorm, args = list(mean = 60, sd = 10), color="blue", size=2) + # stat_function(fun = dnorm, args = list(mean = 140, sd = 10), color="red", size=2) + # labs( # title = "Normal distributions with equal variances") ```
8. Suppose we take a random sample of 10 observations from each normal distribution. We then calculate the variance of each sample and take the ratio of those variances. What value would you expect

If we repeated this process many times, describe the distribution of values you would expect for those variance ratios.

Would your expectation for that distribution change if we sampled 100 observations per group?

![](Transparent.gif)

Here are the distributions of variance ratios with sample sizes of 10 and 100. Do they match your expectations? ```{r 'simulated-variance-ratios-from-normal-curves'} # Take random samples and calculate ratio of variances # Results are stored in a variable called "result" ratios10 <- Do(5000) * ( var(rnorm(10, mean=60, sd=10)) / var(rnorm(10, mean=140, sd=10)) ) ratios100 <- Do(5000) * ( var(rnorm(100, mean=60, sd=10)) / var(rnorm(100, mean=140, sd=10)) ) # Plot the distributions ggplot(data = ratios100, aes(x = result)) + geom_density(fill="lightgreen", color="forestgreen", alpha = 0.2) + geom_density(data = ratios10, aes(x=result), fill="purple", color="purple", alpha = 0.1) + annotate("text", x = 2.5, y = .25, label = "n=10", color="purple") + annotate("text", x = 1.8, y = 1.5, label = "n=100", color="forestgreen") + coord_cartesian(seq(0,10,1)) + scale_x_continuous(breaks=seq(0, 10, 1), minor_breaks=NULL) ```

![](Transparent.gif)

This time, let's repeat the simulation with normal distributions that have unequal variances. ```{r 'normal-distributions-unequal-variances', fig.keep="last"} # I'll use lattice graphics plotDist('norm', mean=60, sd=20, lw=3, xlim = c(0, 200), ylim = c(0, .042), main="Normal distributions with unequal variances (400 vs 100)") plotDist('norm', mean=140, sd=10, lw=3, col="red", add=TRUE) ``` 9. How do the unequal population variances affect the distributions of variance ratios (shown below)? Why is this? ```{r 'simulated-variance-ratios-from-normal-curves-with-unequal-variances'} # Take random samples and calculate ratio of variances # Results are stored in a variable called "result" ratios10 <- Do(5000) * ( var(rnorm(10, mean=60, sd=20)) / var(rnorm(10, mean=140, sd=10)) ) ratios100 <- Do(5000) * ( var(rnorm(100, mean=60, sd=20)) / var(rnorm(100, mean=140, sd=10)) ) # Plot the distributions ggplot(data = ratios100, aes(x = result)) + geom_density(fill="lightgreen", color="forestgreen", alpha = 0.2) + geom_density(data = ratios10, aes(x=result), fill="purple", color="purple", alpha = 0.1) + annotate("text", x = 2, y = .2, label = "n=10", color="purple") + annotate("text", x = 2.8, y = .4, label = "n=100", color="forestgreen") + coord_cartesian(seq(0,14,1)) + scale_x_continuous(breaks=seq(0, 14, 2), minor_breaks=NULL) ```

![](Transparent.gif)

10. If we repeatedly take samples of size n from two populations with equal variances and calculate the ratio of variances, the distribution will have a __________ skew and will be centered at __________. If the populations have **unequal variances**, the distribution will have a __________ skew and will be centered at __________.

![](Transparent.gif)

Let's see if we can predict what will happen if we start with distributions that are **not** normally distributed. Let's start with two non-central *t-distributions with 3 degrees-of-freedom*: ```{r 't-distributions', fig.keep="last"} plotDist('t', df=3, ncp=1, lw=3, xlim = c(-5, 5), ylim = c(0,.5), main="Skewed distributions with equal variances") plotDist('t', df=3, ncp=-1, lw=3, col="red", add=TRUE) ``` 11. The sampling distribution of variance ratios is shifted a bit left (and has a heavier tail) than the distribution we'd expect from sampling from normal distributions. Why would this be? (You do not yet know where the expected distribution -- drawn in black -- comes from.) ```{r 'simulated-variance-ratios-from-t-distributions'} # Take random samples and calculate ratio of variances ratios10 <- Do(5000) * ( var(rt(10, df=3, ncp=1)) / var(rt(10, df=3, ncp=-1)) ) # Plot the distribution ggplot(data = ratios10, aes(x = result)) + geom_density(fill="purple", color="purple", alpha = 0.2) + stat_function(fun = df, args = list(df1=9, df2=9)) + annotate("text", x = -.5, y = .25, label = "n=10", color="purple") + annotate("text", x = 1.8, y = .5, label = "expectation", color="black") + scale_x_continuous(limits = c(-1,8), breaks=seq(0, 8, 1), minor_breaks=NULL) ```

![](Transparent.gif)

***** ## F-distribution If we repeatedly sample $n_1$ and $n_2$ observations from two populations and calculate $\frac{s_{1}^{2}}{s_{2}^{2}}$ for each pair of samples, then: - $\frac{s_{1}^{2}}{s_{2}^{2}}\sim{F_{n_2-1}^{n_1-1}}$ If: - The populations follow normal distributions - The population distributions have equal variances
If we want to compare two sample variances, we can simply take their ratio and compare it to an F-distribution with the appropriate set of degrees-of-freedom. If our sample variance ratio is unlikely to have come from that F-distribution, then either: - The population variances are not equal, or - The populations do not follow normal distributions.

You can use the [Statkey applet](http://www.lock5stat.com/StatKey/theoretical_distribution/theoretical_distribution.html#F) to visualize F-distributions with varying degrees of freedom in the numerator and denominator.

***** ## F-test ***** 12. Expand the code and interpret the output of the F-test. What are the degrees-of-freedom for the F-distribution? ```{r 'f-test-plot'} # Q-Q Plot to check for normality # What does this plot show? qqnorm(essay$scores, ylab = "Essay scores"); qqline(essay$scores) # Recall our observed test_stat = 1.987 # Here's a function to plot the F-distribution # with 21 and 27 degrees-of-freedom # and display the p-value. xpf((1.987), df1=21, df2=27) ```
```{r 'f-test'} # Here's the built-in F-test # Note that it also produces a confidence interval var.test(scores ~ group, data = essay, alternative="greater") ```

![](Transparent.gif)

- If you want to see a derivation of the F-distribution (touching such topics as confidence intervals for variances, along with the t- and chi-square distributions), you can read through [pages 4-17 of an old activity I wrote long ago](http://bradthiessen.com/html5/stats/m301/activity2.pdf). - A table of [F critical values](http://bradthiessen.com/html5/stats/m301/ftable.pdf) shows an old-school (and terrible) method I had to use to conduct an F-test. We'll discuss issues with NHST throughout the semester. For now, look at the table and see if you can find a *typical* critical value that will allow you to quickly (and informally) compare variances with virtually no effort at all.

![](Transparent.gif)

***** ## Normality assumption ***** Recall this F-test only works under the condition that our population data follow normal distributions. Other tests to compare variances (e.g., [Bartlett's test](https://en.wikipedia.org/wiki/Bartlett%27s_test) and [Hartley's Fmax test](https://en.wikipedia.org/wiki/Hartley%27s_test) are also sensitive to violations of normality. Alternative tests (such as [Levene's test](https://en.wikipedia.org/wiki/Levene%27s_test) or the [Brown-Forsyth test](https://en.wikipedia.org/wiki/Brown–Forsythe_test)) are more robust against violations of the normality assumption. To understand these tests, though, we need to first learn about *analysis of variance* (which is the topic of the next activity.) Expand the code to see Levene's test and the Brown-Forsythe test in action: ```{r 'other-tests-car-package', message=FALSE} # Functions for both tests are contained in the car package # install.packages("car", dependencies=TRUE) library(car) # Levene Test leveneTest(scores ~ group, data = essay, center = mean) # Brown-Forsythe Test (uses medians) leveneTest(scores ~ group, data = essay, center = median) ```

![](Transparent.gif)

***** # Your turn 13. In 2014, 219 first-year students and 219 sophomores from St. Ambrose responded to a survey question asking them to report the **number of hours they study per week**. I hypothesize that the responses from first-year students (who take similar General Education courses) will hve less variation than the responses from sophomores (who take more varied courses in their chosen majors).

(a) This data has been loaded in the `studyhours` dataframe (with variables `class` (Freshmen or Sophomores) and `hours` (spent studying per week).

(b) Conduct a randomization-based test to see if the data support my hypothesis. Write out your null hypothesis, the value of your test statistic, and the estimated p-value. Plot the distribution of randomized variance ratios.

(c) Use bootstrap methods to construct an 88% confidence interval for the ratio of variances.

(d) Use the `var.test` function to conduct an F-test to compare group variances. Make sure you evaluate the assumption(s) necessary to conduct this test.

(e) Based on all of this, what are you willing to conclude regarding the variances in hours studying for first-year and sophomore students? ```{r} studyhours <- read.csv("http://www.bradthiessen.com/html5/data/studyhours.csv") head(studyhours) ```

![](Transparent.gif)

14. Complete [assignment 2](http://bradthiessen.com/html5/stats/m301/assign2.pdf)?

![](Transparent.gif)

***** # Publishing Your Turn solutions - Download the [Your Turn Solutions template](http://bradthiessen.com/html5/stats/m301/2/YTT2.Rmd) - RStudio will automatically open it if it downloads as a *.Rmd* file. - If it downloads as a *.txt* file, then you can: - open the file in a text editor and copy its contents - open RStudio and create a new **R Markdown** file (html file) - delete everything and paste the contents of the template - Type your solutions into the [Your Turn Solutions template](http://bradthiessen.com/html5/stats/m301/2/YTT2.Rmd) - I've tried to show you *where* to type each of your solutions/answers - You can run the code as you type it. - When you've answered every question, click the **Knit HTML** button located at the top of RStudio: Drawing

- RStudio will begin working to create a .html file of your solutions. - It may take a few minutes to compile everything (if you've conducted any simulations) - While the file is compiling, you may see some red text in the console. - When the file is finished, it will open automatically (or it will give you an error message). - If the file does not open automatically, you'll see an error message in the console. - Once you're satisfied with your solutions, email that html file to [thiessenbradleya@sau.edu](mailto:thiessenbradleya@sau.edu) or give me a printed copy. - If you run into any problems, [email me](mailto:thiessenbradleya@sau.edu) or work with other students in the class.

**Sources** This document is released under a [Creative Commons Attribution-ShareAlike 3.0 Unported](http://creativecommons.org/licenses/by-sa/3.0) license.