Scenario: Grading Essays

If multiple teachers scored the same essay, would the scores be consistent? Could we improve consistency by training teachers to score essays?

Drawing

To investigate this, you find 50 teachers who volunteer to score the same essay. Before scoring, you randomly assign 28 of the teachers to receive training on how to score essays. The other 22 teachers receive no training.

Our question of interest is: Does training increase consistency of essay scores? In other words, Do the scores from the trained teachers have a lower variance than the scores from the untrained teachers?

To answer this question, we’ve decided to compare the variances from both groups.


Data

Let’s simulate a dataset for this scenario:

# Set seed for random number generator
set.seed(3141)

# Generate 28 trained and 22 untrained teachers
group <- c(rep("trained", 28), rep("untrained", 22))

# Generate essay scores for each group from a normal distribution
# Trained scores ~ normal(mean = 4.5, std. dev = 2.1)
# Untrained scores ~ normal(mean = 7.3, std. dev = 4.2)
# I round scores to the nearest integer (0 decimal places)
score <- c(round(rnorm(28, 4.5, 2.1),0), round(rnorm(22, 7.3, 4.2),0))

# Combine the group and score variables into the same data frame
essay <- data.frame(group, score)
# Take a look at the first 6 rows of our data
head(essay)
##     group score
## 1 trained     6
## 2 trained     1
## 3 trained     5
## 4 trained     6
## 5 trained     6
## 6 trained     9


Here’s a summary of our simulated data: Drawing


In this study, we’re modeling essay scores as a function of the group (trained vs. untrained): score ~ group.

Let’s first visualize our data:

# Side-by-side densityplots
densityplot(~score | group, data=essay, lw=4, bw=2, col="steelblue", layout=c(1,2), cex=1.1)

# Strip plot
stripplot(score ~ group, data = essay, jitter.data = TRUE, xlab = "treatment group",
          ylab = "essay score", cex=1.3, col="steelblue", aspect="1", pch=20)

  1. From these visualizations, what do you think we’ll conclude about the variances of the two groups?







Let’s take a look at the summary statistics:

favstats(score ~ group, data=essay)
##       group min   Q1 median   Q3 max     mean       sd  n missing
## 1   trained   1 3.75      5 7.25   9 5.214286 2.233701 28       0
## 2 untrained   0 5.00      6 7.75  16 6.318182 3.400089 22       0

You’ll notice the means differ, but we don’t care about the means. We want to compare variances.

If we square the standard deviations, we get the variances of each group:

var(score ~ group, data=essay)
##   trained untrained 
##  4.989418 11.560606

Model

When we compare means, our test statistic is simply \(\overline{X}_{1}-\overline{X}_{2}\). What test statistic do we use to compare variances?

To gain an intuition, consider the following two scenarios:


Scenario A: A hospital must select between two blood pressure monitors. After careful calibration, they test each monitor 20 times. The first monitor, an inexpensive armband-pump device, had a variance in measurements of 36 units. The second monitor, an expensive automated device, had a variance of 9 units.


Scenario B: Upholstery fabric from one supplier has an average durability rating of 74,000 DR with a variance of 22,000,000 DR. Fabric from another supplier has a similar durability rating with a variance of 20,000,000 DR.


  1. Explain why we would be interested in comparing variances in each scenario. Then, determine which scenario represents the larger discrepancy between group variances. Explain your answer.










  1. What test statistic should we use to compare variances?








Let’s define our test statistic as the ratio of variances from our two groups:

test.stat = \(\frac{s_{1}^{2}}{s_{2}^{2}}=\frac{11.5606}{4.9894}=2.317\)

test.stat <- var(essay$score[group=="untrained"]) / var(essay$score[group=="trained"])
test.stat
## [1] 2.317025


In our sample data, the variance of essay scores from the untrained group is more than twice as big as the variance from the trained group. Does this give us evidence that the population variances from these two groups differ or is it possible to get these results even if the population variances are equal?

To determine the likelihood of observing a ratio of 2.317 or greater if the population variances were equal, we can use randomization- or theory-based methods to estimate the sampling distribution of our ratio of variances.



Randomization-based framework

Randomization-based null hypothesis significance testing

In our study, teachers were randomly assigned to receive training or not.

  1. Write out our randomization null hypothesis in this scenario.






  1. The first teacher in this study, assigned to the “training” group, scored the essay a 6. Under our randomization null hypothesis, what score would this teacher have given if he or she had been randomly assigned to the “untrained” group?





Using the actual data from this study, our test statistic is:   \(\frac{s_{1}^{2}}{s_{2}^{2}}=2.317\)

We want to know the distribution of test statistics we’d get if we could go back in time and repeatedly assign our subjects at random to the two groups. To do this, we’ll:

• Randomly shuffle the groups assigned to the 50 subjects in this study. Each time, we’ll randomly assign 28 teachers to be in the “trained” group.

• Calculate our test statistic, \(\frac{s_{1}^{2}}{s_{2}^{2}}\)

• Repeat


I’m not aware of any built-in function to do this in R, so I created a (slow) function called varianceratio.

# Function to calculate the ratio of variances between two groups
varianceratio <- function (x, ..., data = parent.frame(), only.2 = TRUE) 
    {
      v <- var(x, ..., data = data)
      res <- v/lag(v)  # This calculates the 2nd variance / 1st variance
      res[2]
}

# Test this function on our dataset
varianceratio(score~group, data=essay)
## untrained 
##  2.317025

It looks like my function is calculating the correct ratio, so let’s calculate the ratio for 10,000 randomizations of our data:

# Ratio of variances from 10,000 randomizations of our data
ratios <- do(10000) * varianceratio(score ~ shuffle(group), data=essay)

Now let’s visualize our randomization distribution of the variance ratio and estimate our p-value.

histogram(~ ratios, data=ratios, groups=ratios >= test.stat,
          xlim=c(0,10), width=.25,
          xlab="Ratio of sample variances assuming equal population variances")
ladd(panel.abline(v=test.stat))

prop(~ (ratios >= test.stat), data=ratios)
##   TRUE 
## 0.0934

Notice this randomization distribution is positively skewed; it’s definitely not a normal distribution.

  1. Interpret this p-value and state your conclusion.









Bootstrap confidence interval

  1. Based on the p-value we just estimated, you should know something about the confidence interval we’re about to construct. Will this confidence interval contain the value 1.0? Explain how you’re able to answer this question.








Let’s go ahead and construct our bootstrap confidence interval.

# Generate 10,000 bootstrap samples and calculate our variance ratio for each
bootratios <- do(10000) * varianceratio(score ~ group, data=resample(essay))

# Create a plot of the bootstrap estimates of our test statistic
densityplot(~ratios, data=bootratios, plot.points = FALSE, col="steelblue", lwd=4)

# Find the 2.5 and 97.5 percentiles
bootstrapCI <- confint(bootratios, level = 0.95, method = "quantile")
bootstrapCI
##        name     lower    upper level     method estimate
## 1 untrained 0.6904839 5.064112  0.95 percentile 2.317025

Notice this bootstrap distribution is also positively skewed; it’s not a normal distribution.

  1. Verify your answer to the previous question. Also, explain how this bootstrap method was able to generate a confidence interval for our ratio of variances.









Theory-based framework

Sampling distribution of a ratio of variances

Simulation approach

From our randomization and bootstrap distributions, it appears as though the distribution of the ratio of variances (from two independent samples) is positively skewed. Let’s run some more simulations to see if we continue to get positively skewed distributions.

First, let’s generate two normal populations. The populations will follow normal distributions with identical variances (100) but different means.

# Plot a normal distribution with
# mean = 100, variance = 100
plotDist('norm', mean=100, sd=10, lw=4, xlim = c(60, 200), main="Normal distributions with equal variances")

# Plot a normal distribution with
# mean = 150, variance = 100
plotDist('norm', mean=150, sd=10, lw=4, add=TRUE, col="red")

Suppose we take a single sample from each of those normal distributions and calculate the variance

  1. Suppose we take a sample of size n from the first normal distribution and calculate the variance. We then do the same thing with the second normal distribution. If we take the ratio of those variances, what value would you expect for that ratio?






  1. Now suppose we repeatedly take samples from each distribution and calculate the ratio of variances each time. If we do this 10,000 times, what distribution would you expect for the ratios of variances? Why?









Let’s simulate this process to check your answer to the previous question. We’ll start with a sample size of n=10.

# Set sample size
n <- 10

# 10,000 times, we want to:
#  take a sample of size n from each distribution
#    and calculate the ratio of variances
set.seed(3141)
ratios <- do(10000) * var(rnorm(n, mean = 100, sd = 10)) / var(rnorm(n, 150, 10))

# Plot the distribution of the ratios of variances
densityPlot(~var, data=ratios, 
            xlab=paste("n =", n, "; mean =", round(mean(ratios$var),3),
                       "; std. dev =", round(sd(ratios$var),3)))

  1. Suppose we take a larger sample, say n=200. How would that impact the distribution of the ratios of variances?








I won’t show the code I used to get this plot, but I simply changed n <- 200 and ran the same simulation. Here’s the distribution of the ratios of variances.



It looks like the distribution is a little more symmetric, but it’s still not a normal distribution. We can check this with a Q-Q plot:

# Q-Q Plot
qqnorm(ratios2$var, ylab = "Ratio of variances"); qqline(ratios2$var)


So it looks like the distribution of variance ratios is positively-skewed if we sample from two normal populations that have the same variance. The skew lessens when n gets larger.

But what happens if we sample from two normal populations that have different variances? Let’s run a simulation.

# Plot our two populations
# One with a variance of 900
plotDist('norm', mean=100, sd=30, lw=4, xlim = c(0, 200), main="Normal distributions with different variances", ylim = c(0, .045))

# One with a variance of 100
plotDist('norm', mean=150, sd=10, lw=4, xlim = c(0, 200), add=TRUE, col="red")

  1. Suppose we repeatedly take samples from each distribution and calculate the ratio of variances each time. If we do this 10,000 times, what distribution would you expect for the ratios of variances? Why?









Let’s run the simulation to see what distribution we get. We’ll start with a sample size of n=10 from each population:

# Set sample size
n <- 10

# 10,000 times, we want to:
#  take a sample of size n from each distribution
#    and calculate the ratio of variances
set.seed(3141)
ratios3 <- do(10000) * var(rnorm(n, mean = 100, sd = 30)) / var(rnorm(n, 150, 10))

# Plot the distribution of the ratios of variances
densityPlot(~var, data=ratios3, 
            xlab=paste("n =", n, "; mean =", round(mean(ratios3$var),3),
                       "; std. dev =", round(sd(ratios3$var),3)))

  1. The distribution is still positively-skewed, but how does it differ from the distribution of variance ratios when the populations had equal variances?









Let’s see if anything changes with a bigger sample size of n=200:


  1. Fill-in-the-blanks: If we repeatedly take samples of size n from two populations with equal variances and calculate the ratio of variances, the distribution will have a __________ skew and will be centered at __________. If we repeatedly take samples of size n from populations with unequal variances, the distribution will have a __________ skew and will be centered at __________.









This time, let’s start with non-normal populations. I’ll make one of them positively-skewed and the other negatively-skewed.

# Generate two datasets
# a ~ beta(parameters=10,1)
# b ~ beta(parameters=1,10)
plotDist('beta', shape1=10, shape2=1, lw=4, xlim = c(0, 1), main="Skewed distributions with equal variances")
plotDist('beta', shape1=1, shape2=10, lw=4, xlim = c(0, 1), add=TRUE, col="red")

  1. Because the population variances are nearly identical, we expect the distribution of the variance ratios to be near _____.





Now let’s simulate 10,000 samples of size n=10 from each distribution. We’ll calculate the ratio of variances from each pair of samples and sketch the distribution of those variance ratios.

# Set sample size
n <- 10

# 10,000 times, we want to:
#  take a sample of size n from each distribution
#    and calculate the ratio of variances
set.seed(3141)
ratios5 <- do(10000) * var(rbeta(n, shape1=10, shape2 = 1)) / var(rbeta(n, shape1=1, shape2=10))

# Plot the distribution of the ratios of variances
densityPlot(~var, data=ratios5, 
            xlab=paste("n =", n, "; mean =", round(mean(ratios5$var),3),
                       "; std. dev =", round(sd(ratios5$var),3)))

  1. The distribution is positively-skewed (which we expected) but something’s not quite right. Identify what’s unusual with this distribution.









F-distribution

Let’s make the following assumptions:

• We have two normally-distributed populations

• The population distributions have equal variances

If we were to repeatedly sample n observations from each of these populations and calculate \(\frac{s_{1}^{2}}{s_{2}^{2}}\) from each sample, the distribution of those variance ratios would follow an F-distribution with \(n_{1}-1\) and \(n_{2}-1\) degrees of freedom.


Try constructing different F-distributions (with different degrees of freedom) using this Statkey F-distribution applet.


This means that if we want to compare two sample variances, we simply take their ratio and compare it to the appropriate F-distribution. If our sample variance ratio is unlikely to have come from the F-distribution, then we can either conclude:

• The population variances are not equal

or

• The population distributions are not normally distributed.


F-test

Let’s use the theory-based F-distribution to test if our essay score variances are equal.

  1. Identify the assumptions (or conditions) necessary to use the F-distribution to compare our sample variances. Remember to include the null hypothesis as a condition.









Remember our test statistic in this scenario was calculated to be:

\(\frac{s_{1}^{2}}{s_{2}^{2}}=\frac{11.5606}{4.9894}=2.317\sim F_{28-1}^{22-1}\)

If our assumptions are true, this test statistic comes from an F-distribution with 21 and 27 degrees of freedom. Let’s take a look at this distribution.

# Plot F distribution with 27, 21 degrees of freedom
plotDist("f", df1=21, df2=27, lw=5, col="steelblue")


Let’s see how likely it was for our test statistic of 2.317 to come from this distribution:

# Plot F distribution and label the test stat of 2.317
xpf(c(2.317), df1=21, df2=27)

## [1] 0.9795299


When we calculated our test statistic, we chose to put the larger variance on top. We could have just as easily put the smaller variance on top:

\(\frac{s_{1}^{2}}{s_{2}^{2}}=\frac{4.9894}{11.5606}=0.4316\sim F_{22-1}^{28-1}\)

If we had done this, our p-value would be:

# Plot F distribution and label the test stat of 2.317
xpf(c(0.4316), df1=27, df2=21)

## [1] 0.02047216

Notice it’s the same p-value (just located on the other side of the distribution). Because each of these scenarios were equally likely (we had to choose one variance to be in the numerator), we should add these p-values to get our overall p-value of: p = 0.04.


R has a built-in F-test for variances. To use this test, you use the function var.test:

# Run the F-test to compare two variances
var.test(score ~ group, data=essay)
## 
##  F test to compare two variances
## 
## data:  score by group
## F = 0.43159, num df = 27, denom df = 21, p-value = 0.04094
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.1848461 0.9653508
## sample estimates:
## ratio of variances 
##          0.4315879
  1. From this F-test, what conclusions can we make?









Note: If you’d like more information about where the F-distribution comes from, see pages 4-17 of the derivation of F-distribution activity.



  1. Notice the p-values (and possibly conclusions) from the randomization-based and theory-based methods differ. Which framework is more defensible in this scenario? Why would the different frameworks give us different results?










Your turn

  1. In 2013, 219 freshmen at St. Ambrose responded to a survey asking how many hours they study per week. In 2014, those same 219 students (as sophomores) responded to the same question. Load this studyhours dataset and examine the first several rows of data with the head(studyhours) command.

  2. Now that you have the studyhours dataset loaded, create a densityplot to examine the distribution of hours for each class (Freshmen and Sophomores). Then, calculate the variance in hours for each class. Finally, use the varianceratio() function to calculate the ratio of variances (and store it as test.stat).

  3. Suppose we believe that freshmen take similar classes and, therefore, tend to study the same amount. Sophomores, on the other hand, begin to take courses in their chosen majors and, therefore, have a larger variance in the amount of time they study each week. Run a randomization-based test to test the hypothesis that class (freshmen vs. sophomore) has no impact on hours studying per week. Display the distribution of your randomized variance ratio and report a p-value.

  4. Use bootstrap methods to construct a 95% confidence interval for the ratio of variances. Based on the p-value you just reported, you should be able to predict whether the interval will contain the value 1.0.

  5. The observed variance ratio in this example is approximately 1.397. It comes from taking the ratio of variances from two groups that each have a sample size of n=219. If the population distributions of each group have equal variances – and if our data come from populations with normal distributions – our observed variance ratio comes from an F-distribution. Sketch this F-distribution (with the correct degrees of freedom in the numerator and denominator).

  6. Use var.test() to conduct an F-test to test if the population variances are equal.

  7. Considering everything you just did, what’s your conclusion regarding the variance in hours studying for freshmen and sophomores?

  8. Complete assignment 2




Publishing your solutions

When you’re ready to try the Your Turn exercises: