Scenario: Ambiguous Prose

In 1972, 57 high school students were randomly assigned to one of three groups:

19 students were shown a picture before reading a passage
19 students were shown a picture after reading a passage
19 students were not shown the picture at all

The students were asked to read the following passage and remember details from it:

Passage

If the balloons popped, the sound wouldn’t be able to carry since everything would be too far away from the correct floor. A closed window would also prevent the sound from carrying, since most buildings tend to be well insulated. Since the whole operation depends on a steady flow of electricity, a break in the middle of the wire would also cause problems. Of course, the fellow could shout, but the human voice is not loud enough to carry that far. An additional problem is that a string could break on the instrument. Then there could be no accompaniment to the message. It is clear that the best situation would involve less distance. Then there would be fewer potential problems. With face to face contact, the least number of things could go wrong.

Students were asked to:

Rate how easy it was to comprehend the passage on a 1-7 scale (1 = very difficult; 7 = very easy)
Recall the passage by writing out as many of the ideas as you remember.

Does seeing the picture influence a student’s ability to comprehend and/or recall the passage?

Picture

Data Exploration

For now, let’s focus on the comprehension scores. The code shows how the data were imported –>

# We'll download the data from my website
ambiguous <- read.csv("http://www.bradthiessen.com/html5/data/ambiguousprose.csv")

# Look at the first several rows to get a sense of the data
head(ambiguous)
#    # A tibble: 6 × 2
#      Condition Comprehension
#    *    <fctr>         <int>
#    1     After             6
#    2     After             5
#    3     After             4
#    4     After             2
#    5     After             1
#    6     After             3

# Get summary statistics
ambiguous %>%
  group_by(Condition) %>%
  summarize(n = n(),
            mean = mean(Comprehension),
            sd = sd(Comprehension))
#    # A tibble: 3 × 4
#      Condition     n     mean       sd
#         <fctr> <int>    <dbl>    <dbl>
#    1     After    19 3.210526 1.397575
#    2    Before    19 4.947368 1.311220
#    3      None    19 3.368421 1.256562

ambiguous %>%
  summarize(n = n(),
            mean = mean(Comprehension),
            sd = sd(Comprehension))
#    # A tibble: 1 × 3
#          n     mean       sd
#      <int>    <dbl>    <dbl>
#    1    57 3.842105 1.521154

Condition	n	mean	sd
After	19	3.211	1.398
Before	19	4.947	1.311
None	19	3.368	1.257
Combined	N = 57	M = 3.842	s = 1.521

What do M and s represent? From these summary statistics, do you think we’ll conclude the picture influenced student comprehension?

Dotplot


ambiguous %>%
  ggplot(aes(x = Condition, y = Comprehension)) +
  geom_dotplot(aes(fill=Condition), binaxis = "y", binpositions="all", color = "white") +
  scale_y_continuous(breaks=seq(0, 7, 1), minor_breaks=NULL) +
  theme(axis.title.x = element_text(size = 12, color = "#777777")) +
  theme(axis.text.x = element_text(size = 12)) +
  theme(axis.title.y = element_text(size = 12, color="#777777")) +
  theme(axis.text.y = element_text(size = 12)) +
  theme(legend.position="none") +
  labs(
    title = "Comprehension by Condition"
  ) +
  coord_flip(xlim = NULL, ylim = NULL, expand = TRUE)

Boxplot


ambiguous %>%
  ggplot(aes(x = Condition, y = Comprehension)) +
  geom_dotplot(aes(fill=Condition), binaxis = "y", binpositions="all", color = "white", alpha=0.5) +
  geom_boxplot(fill = "white", color = "black", alpha = 0.6) +
  scale_y_continuous(breaks=seq(0, 7, 1), minor_breaks=NULL) +
  theme(axis.title.x = element_text(size = 12, color = "#777777")) +
  theme(axis.text.x = element_text(size = 12)) +
  theme(axis.title.y = element_text(size = 12, color="#777777")) +
  theme(axis.text.y = element_text(size = 12)) +
  theme(legend.position="none") +
  labs(
    title = "Comprehension by Condition"
  ) +
  coord_flip(xlim = NULL, ylim = NULL, expand = TRUE)

Means with standard errors


ambiguous %>%
  ggplot(aes(x = Condition, y = Comprehension)) +
  stat_summary(fun.y = mean, geom = "point") + 
  stat_summary(fun.y = mean, geom = "line", aes(group = 1),
               colour = "Blue", linetype = "dashed") +
  stat_summary(fun.data = mean_cl_boot, geom = "errorbar", width = 0.2) +
  labs(x = "Treatment", y = "Mean Comprehension Score")

Hypotheses

Write out null and alternative hypotheses to compare the means of all three groups. How could we determine the likelihood of this null hypothesis?

Idea #1: multiple t-tests

If an independent samples t-test compares two group means, could we use multiple t-tests to compare all possible pairs of our three group means?

How many t-tests would we need to test all possible pairs of three group means? How many t-tests would we need if we had G groups?

Suppose we conduct 3 t-tests. If we set \(\alpha =0.05\) for each test, how likely would we be to make at least one alpha error in our 3 tests? In other words, what’s \(P(\mathrm{at\ least\ one\ }\alpha\ \mathrm{error\ in\ 3\ tests})\)?

Suppose we conduct t-tests to compare all possible pairs of means from G groups. If we have a Type I error rate of \(\alpha\) for each test, what is the probability we make at least one alpha error across all our tests?

Problems with multiple t-tests

Explain why we should not conduct multiple t-tests to compare pairs of group means from a single dataset.

Despite this problem, let’s conduct multiple t-tests using the pairwise.t.test() command to run all possible t-tests for a dataset.

# Wait!!!!!!!!!!
# Why would R have a pairwise.t.test function if we're NOT
# supposed to conduct multiple t-tests?  Good question.
# We'll come back to this when we learn about post hoc tests.

# To use this function, we need to identify the variables within
# the dataframe using the $ identifier.
# p.adjust.method="none" just means I don't want R to modify the p-value
# In the next lesson, we'll see why we might want to modify the p-value
pairwise.t.test(ambiguous$Comprehension, ambiguous$Condition, p.adjust.method="none")
#    
#       Pairwise comparisons using t tests with pooled SD 
#    
#    data:  ambiguous$Comprehension and ambiguous$Condition 
#    
#           After   Before 
#    Before 0.00017 -      
#    None   0.71444 0.00054
#    
#    P value adjustment method: none

The output states the t-tests used a “pooled SD.” What does that mean?

What could we conclude from the p-values listed in the output? Why are we uncertain about our conclusions? Is there anything we could do to reduce this uncertainty?

Idea #2: get MAD (or SAD)

If running multiple t-tests is problematic, could we conduct a single test of our three group means using randomization-based methods?

Sure! If we can identify a single test statistic that compares all three group means, we can use randomization- or theory-based methods.

What could we conclude from the p-values listed in the output? Why are we uncertain about our conclusions? Is there anything we could do to reduce this uncertainty?

Test statistic

To compare two means, we use: \(\overline{X}_{1}-\overline{X}_{2}\). If two means are similar, the test statistic is near zero. As the means differ by a greater amount, the value of the test statistic increases.

Propose a single test statistic to measure the difference among 3 group means. What happens to the value of that test statistic as the group means get closer together or further apart?

Using the means of our three groups:

\(\overline{X}_{after}=\overline{X}_{a}=3.211\)
\(\overline{X}_{before}=\overline{X}_{b}=4.947\)
\(\overline{X}_{none}=\overline{X}_{n}=3.368\)

We can calculate:

\(SAD = |\overline{X}_{a}-\overline{X}_{b}|+|\overline{X}_{a}-\overline{X}_{n}|+|\overline{X}_{b}-\overline{X}_{n}|=3.474\)

\(MAD = \frac{|\overline{X}_{a}-\overline{X}_{b}|+|\overline{X}_{a}-\overline{X}_{n}|+|\overline{X}_{b}-\overline{X}_{n}|}{3}=1.158\)

Expand the code to see how to calculate this in R –>

SAD( mean(Comprehension ~ Condition, data=ambiguous) )
#    [1] 3.473684
MAD( mean(Comprehension ~ Condition, data=ambiguous) )
#    [1] 1.157895

test_stat <- SAD( mean(Comprehension ~ Condition, data=ambiguous) )

We’ll store SAD = 3.4736842 as our test_stat.

Randomization-based SAD test

Assume your null hypothesis is true. In our sample data, the first subject in the after group rated the passage a 6 in comprehension. If we went back in time and randomly assigned this subject to the none group, what rating would the subject give to the passage?

How many ways could 57 subjects be assigned to 3 groups (with 19 in each group)? Expand the code to see this calculation.

# Eliminate scientific notation
options(scipen=999)

# Calculate number of randomizations (group ID does not matter)
ways <- ( choose(n=57, k=19) * choose(38, 19) * choose(19, 19) ) / factorial(3)

# Put comma separators into the number
format(ways, big.mark=",", scientific=FALSE)
#    [1] "3,752,394,405,341,099,206,901,760"

We’re never going to list all those randomizations, but we can get a random, representative sample of them. For each randomization, we’ll calculate SAD. We’ll then plot our distribution of randomized SADs and determine the likelihood that our test statistic came from this distribution.

# Randomize our data 10,000 times (shuffling condition)
SADrand <- Do(10000) * SAD( mean(Comprehension ~ shuffle(Condition), data=ambiguous) )

# Here's one way to do this in dplyr
# SADrand <- Do(10000) * ambiguous %>%
#   mutate(shuffled = sample(Comprehension)) %>%
#   group_by(Condition) %>%
#   summarize(shuffled_means = mean(shuffled)) %>%
#   mutate(shuffled_SAD = SAD(shuffled_means)) %>%
#   filter(Condition == "After") %>%   # Eliminates triplicate results
#   select(shuffled_SAD)

# Plot the distribution

# Here's the most straight-forward way to plot this
# histogram(~SAD, data=SADrand,
#           xlim=c(0,5),
#           width=.2,
#           xlab="Possible mean differences assuming null hypothesis is true")
# ladd(panel.abline(v=test_stat))   # Add vertical line at test statistic

# Using ggplot2 with a density plot
ggplot(data = SADrand, aes(x = SAD)) +
  geom_density(fill="lightblue", color="white", alpha = 0.8) +
  annotate("segment", x = test_stat, xend = test_stat, y = 0, yend = .4, color = "red") +
  annotate("text", x = test_stat, y = .45, label = "observed SAD = 3.474", color = "red") +
  labs(
      title = "Randomized SAD",
      x = "SAD"
      ) +
  scale_x_continuous(limits = c(0,5), breaks=seq(0, 5, 1), minor_breaks=NULL) +
    theme(
    axis.text.x = element_text(size = 11, color="grey10"),
    legend.position = "none",
    panel.grid.major.y = element_line(colour = "white"),
    panel.grid.major.x = element_line(colour = "white", size=.15),
    panel.grid.minor = element_blank(),
    panel.background = element_rect(fill = "grey93")
  )



# Estimate p-value
pvalue <- prop(SADrand$SAD >= test_stat)
pvalue
#      TRUE 
#    0.0008

When we compared two means, the randomized distribution of mean differences was centered near zero with a symmetric, unimodal appearance.

When we compared two variances, the randomized distribution of variance ratios was centered near 1.0 with a positive skew.

Explain why the randomized distribution of SAD is not centered at zero or 1.0. Explain why it has a positive skew.

Explain, one last time, the process of randomization, what the plot represents, and how the p-value was estimated. Then, interpret the p-value of 0.0008.

Problems with randomized SAD (or MAD) test

Three scenarios are plotted in the dotplots displayed below. Scenario B (the middle column) represents the actual data from this study. Data for scenarios A and C were simulated with (roughly) the same means as the actual data. This means that in all three scenarios, the value of SAD would be the same.

Which scenario (A, B, C) provides the strongest evidence against the null hypothesis? In other words, in which scenario does the condition group matter the most?

Identify a limitation of SAD as a test statistic to compare group means.

# A = Simulate data with same means, smaller standard deviations
dataA <- tibble(
  Condition = as.factor(c(rep("After", 19), rep("Before", 19), rep("None",19))),
  Comprehension = c(rnorm(19,3.211,0.5), rnorm(19,4.947,0.5), rnorm(19,3.368,0.5)),
  dataset = c(rep("A", 57))
)

# B = actual data from study
dataB <- ambiguous
dataB$dataset = c(rep("B", 57))

  
# C = simulate data with same means, larger standard deviations
dataC <- tibble(
  Condition = as.factor(c(rep("After", 19), rep("Before", 19), rep("None",19))),
  Comprehension = c(rnorm(19,3.211,3), rnorm(19,4.947,3), rnorm(19,3.368,3)),
  dataset = c(rep("C", 57))
)

# merge into a single data frame
data_sim <- bind_rows(dataA, dataB, dataC)

# Plot with dataset as facet
data_sim %>%
  ggplot(aes(x = Condition, y = Comprehension)) +
  geom_dotplot(aes(fill=Condition), binwidth=.5, binaxis = "y", binpositions="all", color = "white", alpha=0.9) +
  scale_y_continuous(limits = c(-5, 9), breaks=c(3, 5), minor_breaks=NULL) +
  stat_summary(fun.y = mean, geom = "point", size=3, aes(group=1)) +
  theme(axis.title.x = element_text(size = 12, color = "#777777")) +
  theme(axis.text.x = element_text(size = 12)) +
  theme(axis.title.y = element_text(size = 12, color="#777777")) +
  theme(axis.text.y = element_text(size = 12)) +
  theme(legend.position="none") +
  coord_flip(xlim = NULL, ylim = NULL, expand = FALSE) +
  facet_grid(. ~ dataset)
#    Warning: Removed 2 rows containing non-finite values (stat_bindot).
#    Warning: Removed 2 rows containing non-finite values (stat_summary).

To address the limitations of SAD, our test statistic should consider the variability within the groups along with the variability among the group means.

Assuming we can do this, how will we compare the variance among groups to the variance within groups? What distribution will this follow?

Idea #3: ANOVA (using the mean square ratio)

Our test statistic, then, will be: \(F=\frac{\textrm{variance between group means}}{\textrm{variance within groups}}\)

We’ll call this process analysis of variance or ANOVA. Despite its name, realize the goal is to compare means (by analyzing the variance between and within the groups).

What conditions are necessary for a ratio of variances to follow an F-distribution? This is a drawback to ANOVA – we’ll need to make sure some conditions are met.

Below, I’ve sketched hypothetical population distributions for each group in our study. List the assumptions I made in creating the plot on the left and the plot on the right.


# Null hypothesis is true
ggplot(data.frame(x = c(0, 8.5)), aes(x)) +
  stat_function(fun = dnorm, args = list(mean = 3.6, sd = 1.25), color="red", size=1) +
  stat_function(fun = dnorm, args = list(mean = 3.7, sd = 1.25), color="skyblue", size=1) +
  stat_function(fun = dnorm, args = list(mean = 3.8, sd = 1.25), color="267", size=1) +
  geom_dotplot(data = ambiguous, aes(x = Comprehension, y = 1, fill=Condition, color=Condition), 
               binwidth=.1, binaxis="x", stackgroups=TRUE, binpositions="all") +
  scale_y_continuous(limits = c(0, .4), breaks=NULL, minor_breaks=NULL) +
  scale_x_continuous(limits = c(-.5, 9), breaks=seq(1,7,1), minor_breaks=NULL) +
  theme(axis.title.x = element_text(size = 12, color = "#777777")) +
  theme(axis.text.x = element_text(size = 12)) +
  theme(legend.position="none")  +
    labs(
      title = "Null Hypothesis",
      x = "comprehension score",
      y="")

# Alternative hypothesis is true
ggplot(data.frame(x = c(0, 8.5)), aes(x)) +
  stat_function(fun = dnorm, args = list(mean = 2, sd = .6), color="red", size=1) +
  stat_function(fun = dnorm, args = list(mean = 3.7, sd = .6), color="skyblue", size=1) +
  stat_function(fun = dnorm, args = list(mean = 5.4, sd = .6), color="267", size=1) +
  geom_dotplot(data = ambiguous, aes(x = Comprehension, y = 1, fill=Condition, color=Condition), 
               binwidth=.1, binaxis="x", stackgroups=TRUE, binpositions="all") +
  scale_y_continuous(limits = c(0, .75), breaks=NULL, minor_breaks=NULL) +
  scale_x_continuous(limits = c(-.5, 9), breaks=seq(1,7,1), minor_breaks=NULL) +
  theme(axis.title.x = element_text(size = 12, color = "#777777")) +
  theme(axis.text.x = element_text(size = 12)) +
  theme(legend.position="none")  +
    labs(
      title = "Alternative Hypothesis",
      x = "comprehension score",
      y="") +
    annotate("text", x = 2.0, y = .7, label = "after", color = "red") +
    annotate("text", x = 3.7, y = .7, label = "none", color = "steelblue") +
    annotate("text", x = 5.4, y = .7, label = "before", color = "267")

Model

Let’s construct a model for this ambiguous prose study. Think about 3 subjects in this study:

One randomly assigned to the after group who rated the passage a 6
One assigned to the before group who rated the passage a 3
One assigned to the none group who rated the passage a 1

Pick one of those individuals and list at least three reasons why that person assigned that comprehension rating to the passage.

Our model is:

\(y_{ig}=\mu +\alpha _{g}+\epsilon_{ig}=\mu+(\mu _{g}-\mu )+(y_{ig}-\mu _{g})\).

Identify the components of this model and explain why the model must be true. What assumptions will we make regarding the error term?

With ANOVA, we partition the total variation in our data (the differences among comprehension ratings from all subjects in the study) into two components:

Treatment or between-groups = variation in comprehension due to the photo
Error or within-groups = variation in comprehension among subjects in the same group

We can (roughly) visualize these sources of variation as:


# Null hypothesis is true
ggplot(data.frame(x = c(0, 8.5)), aes(x)) +
  stat_function(fun = dnorm, args = list(mean = 3.6, sd = 1.25), color="red", size=1, alpha=0.5) +
  stat_function(fun = dnorm, args = list(mean = 3.7, sd = 1.25), color="skyblue", size=1, alpha=0.5) +
  stat_function(fun = dnorm, args = list(mean = 3.8, sd = 1.25), color="267", size=1, alpha=0.5) +
  scale_y_continuous(limits = c(0, .4), breaks=NULL, minor_breaks=NULL) +
  scale_x_continuous(limits = c(-.5, 9), breaks=seq(1,7,1), minor_breaks=NULL) +
  theme(axis.title.x = element_text(size = 12, color = "#777777")) +
  theme(axis.text.x = element_text(size = 12)) +
  theme(legend.position="none")  +
    labs(
      title = "Null Hypothesis",
      x = "comprehension score",
      y="") +
    annotate("segment", x = 3.6, xend = 3.8, y = .333, yend = .333, color = "black", size=1.5) +
    annotate("text", x = 3.7, y = .35, label = "between groups variation", color = "black") +
    annotate("segment", x = 2.1, xend = 5.3, y = .12, yend = .12, color = "black", size=1.5) +
    annotate("text", x = 3.7, y = .14, label = "within groups variation", color = "black") +
    annotate("segment", x = 0, xend = 8, y = .0, yend = .0, color = "black", size=1.5) +
    annotate("text", x = 3.7, y = .02, label = "total variation", color = "black") 


# Alternative hypothesis is true
ggplot(data.frame(x = c(0, 8.5)), aes(x)) +
  stat_function(fun = dnorm, args = list(mean = 2, sd = .6), color="red", size=1, alpha=0.5) +
  stat_function(fun = dnorm, args = list(mean = 3.7, sd = .6), color="skyblue", size=1, alpha=0.5) +
  stat_function(fun = dnorm, args = list(mean = 5.4, sd = .6), color="267", size=1, alpha=0.5) +
  scale_y_continuous(limits = c(0, .75), breaks=NULL, minor_breaks=NULL) +
  scale_x_continuous(limits = c(-.5, 9), breaks=seq(1,7,1), minor_breaks=NULL) +
  theme(axis.title.x = element_text(size = 12, color = "#777777")) +
  theme(axis.text.x = element_text(size = 12)) +
  theme(legend.position="none")  +
    labs(
      title = "Alternative Hypothesis",
      x = "comprehension score",
      y="") +
    annotate("segment", x = 2, xend = 5.4, y = .69, yend = .69, color = "black", size=1.5) +
    annotate("text", x = 3.7, y = .73, label = "between groups variation", color = "black") +
    annotate("segment", x = 3, xend = 4.4, y = .3, yend = .3, color = "black", size=1.5) +
    annotate("text", x = 3.7, y = .33, label = "within", color = "black") +
    annotate("segment", x = 0, xend = 8, y = .0, yend = .0, color = "black", size=1.5) +
    annotate("text", x = 3.7, y = .02, label = "total variation", color = "black")

We’re going to compare the between groups variation to the within groups variation. If that ratio is large, what does that say about our null hypothesis? What does it say if the ratio is relatively small? How are you defining large and small?

Partitioning (calculating) the sources of variation

Recall the formula for the unbiased estimate of the population variance:

\(s^{2}=\frac{\sum_{i=1}^{n}(x_{i}-\overline{X})^{2}}{n-1}=\frac{\textrm{sum of squared deviations from the mean}}{\textrm{degrees of freedom}}=\frac{SS}{df}=\textrm{mean square}\)

What do SS and MS represent? What are degrees of freedom?

Total variation = MStotal

Fill-in-the-blanks to derive the formula to calculate the total variation in comprehension scores. Then, considering what MStotal represents, rewrite the numerator (SStotal). Finally, calculate MStotal for this ambiguous prose dataset and explain what it represents.

\(\mathbf{\textrm{MS}_{\textrm{Total}}=\frac{\textrm{SS}_{\textrm{total}}}{\textrm{df}_{\textrm{total}}}=\frac{\sum_{i=1}^{n}(\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ -\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ )^{2}}{(\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ -\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ )}}=\)

Calculation:

SStotal <- sum( ( ambiguous$Comprehension - mean(ambiguous$Comprehension) )^2 )
SStotal
#    [1] 129.5789

dftotal <- length(ambiguous$Comprehension) - 1
dftotal
#    [1] 56

MStotal <- SStotal / dftotal
MStotal
#    [1] 2.31391

# Notice MStotal is just the variance of all the data
var(ambiguous$Comprehension)
#    [1] 2.31391

Within-group variation = MSerror

SStotal = 129.579. Let’s partition that into the within- and between-groups source.

MSerror represents the variation within each group (variation in comprehension that is not due to seeing the photo).

Fill-in-the-blanks to derive formulas for \(SS_{error}\), \(df_{error}\), and \(MS_{error}\). Explain what MSerror represents and write a simpler formula to calculate SSerror. Finally, calculate MSerror for our data.

\(\mathbf{\textrm{MS}_{\textrm{Error}}=\frac{\textrm{SS}_{\textrm{error}}}{\textrm{df}_{\textrm{error}}}=\frac{\sum(\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ -\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ )^{2}}{(\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ -\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ )}}=\)

Calculation:

SSerror <- sum( (19 - 1) * var(Comprehension ~ Condition, data=ambiguous) )
SSerror
#    [1] 94.52632

dferror <- length(ambiguous$Comprehension) - 3
dferror
#    [1] 54

MSerror <- SSerror / dferror
MSerror
#    [1] 1.750487

Between-group variation = MSphoto

If we know SStotal = 129.579 and SSerror = 94.526, how can we easily calculate SSphoto?

Fill-in-the-blanks to derive formulas for \(SS_{photo}\), \(df_{photo}\), and \(MS_{photo}\). Explain what they represent and calculate MSphoto for our data.

\(\mathbf{\textrm{MS}_{\textrm{A}}=\mathbf{\textrm{MS}_{\textrm{treatment}}}=\frac{\textrm{SS}_{\textrm{A}}}{\textrm{df}_{\textrm{A}}}=\frac{\sum\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ -\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ )^{2}}{(\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ -\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ )}}\)

Calculation:

# This is not at all the easiest way to calculate this...
SSphoto <- as.numeric(
  ambiguous %>%
  group_by(Condition) %>%
  summarize(squaredmeandiff = (mean(Comprehension)-mean(ambiguous$Comprehension))^2) %>%
  mutate(ssphoto = 19 * sum(squaredmeandiff)) %>%
  filter(Condition=="After") %>%
  select(ssphoto)
)

SSphoto
#    [1] 35.05263

# Check to see if it's the same as SStotal - SSerror
SStotal - SSerror
#    [1] 35.05263

dfphoto <- n_distinct(ambiguous$Condition) - 1
dfphoto
#    [1] 2

MSphoto <- SSphoto / dfphoto
MSphoto
#    [1] 17.52632

ANOVA summary of calculations

For our data:

Source	SS	df	MS	MSR
Treatment	35.053	2	17.526	?
Error	94.526	54	1.75	?
Total	129.579	56	2.314	?

Will \(\mathbf{\textrm{SS}_{\textrm{total}}}\) always equal \(\mathbf{\textrm{SS}_{\textrm{A}}}+\mathbf{\textrm{SS}_{\textrm{E}}}\)? Explain what is happening at each step of the following derivation:

\(\mathbf{\textrm{SS}_{\textrm{total}}}=\sum \sum (x_{i}-M)^{2}=\sum \sum \left [ \left ( x_{i}- \overline{X}_{a}\right )-\left ( \overline{X}_{a}-M \right ) \right ]^{2}=\sum \sum \left [ \left ( \overline{X}_{a}- M\right )-\left ( x_{i}-\overline{X}_{a} \right ) \right ]^{2}=\)

\(\mathbf{\textrm{SS}_{\textrm{total}}}=\sum \sum\left ( \overline{X}_{a}- M\right )^{2}+\sum \sum\left ( x_{i}-\overline{X}_{a} \right )^{2}+2\sum \sum\left ( \overline{X}_{a}- M\right )\left ( x_{i}-\overline{X}_{a} \right )\)

Since \(\sum \left ( x_{i}-\overline{X}_{a} \right )=0\)

\(\mathbf{\textrm{SS}_{\textrm{total}}}=\sum n_{a}\left ( \overline{X}_{a}- M\right )^{2}+\sum \sum\left ( x_{i}-\overline{X}_{a} \right )^{2}=\)

\(\mathbf{\textrm{SS}_{\textrm{total}}}=\mathbf{\textrm{SS}_{\textrm{A}}}+\mathbf{\textrm{SS}_{\textrm{E}}}\)

One more look at MSerror

If mean square is a fancy term for variance, MSerror must represent the variance within a single group.

If we have 3 groups in our dataset, how can MSE represent the variance of a single group? Which group do we “choose” when we calculate MSE?

If we assume our data come from groups with equal population variances, we don’t have to choose a group. Instead, we can calculate a pooled variance of all our groups. In this sense, MSE represents the average variance of each of our groups.

The concept of an average, or pooled, variance should sound familiar. We’ve used Spooled to conduct independent samples t-tests (assuming equal variances). If we take the formula for Spooled and extend it to 3 groups, we have:

\(s_{\textrm{pooled}}^{2}=\frac{\left ( n_{1}-1 \right )s_{1}^{2}+\left ( n_{2}-1 \right )s_{2}^{2}+\left ( n_{3}-1 \right )s_{3}^{2}}{\left ( n_{1}-1 \right )+\left ( n_{2}-1 \right )+\left ( n_{3}-1 \right )}\)

\(s_{\textrm{pooled}}^{2}=\frac{\left ( n_{1}-1 \right )\frac{\sum\left ( x_{i1}-\overline{X}_{1} \right )^{2}}{n_{1}-1} + \left ( n_{2}-1 \right )\frac{\sum\left ( x_{i2}-\overline{X}_{2} \right )^{2}}{n_{2}-1} + \left ( n_{3}-1 \right )\frac{\sum\left ( x_{i3}-\overline{X}_{3} \right )^{2}}{n_{3}-1}}{\left ( n_{1}-1 \right )+\left ( n_{2}-1 \right )+\left ( n_{3}-1 \right )}\)

\(s_{\textrm{pooled}}^{2}=\frac{\sum \left ( x_{i1}-\overline{X}_{1} \right )^{2} + \sum \left ( x_{i2}-\overline{X}_{2} \right )^{2} + \sum \left ( x_{i3}-\overline{X}_{3} \right )^{2}}{\left ( n_{1}-1 \right )+\left ( n_{2}-1 \right )+\left ( n_{3}-1 \right )}\)

\(s_{\textrm{pooled}}^{2}=\frac{\sum \left ( x_{ia}-\overline{X}_{a} \right )^{2}}{N-a}=\textrm{MS}_{\textrm{E}}\)

Under a true null hypothesis, what’s the expected value of MSerror? Does that expectation change if the null hypothesis were false?

Under a true null hypothesis, what’s the expected value of MSphoto? Does that expectation change if the null hypothesis were false?

This is the logic behind ANOVA. Under a true null hypothesis, both MSphoto and MSerror represent unbiased estimates of the variance within a group. When the null hypothesis is false, MSphoto becomes larger.

How will we compare MSphoto to MSerror? What sampling distribution will we use? How many degrees of freedom will we have? What’s the expected value of our test statistic if the null hypothesis were true?

ANOVA summary table

Complete the following ANOVA summary table and estimate the p-value. We can estimate the p-value directly in R or use the Statkey F-distribution applet.

Source	SS	df	MS	MSR
Treatment	35.053	2	17.53	MSR =
Error	94.526	54	1.75	p =
Total	129.579	56	2.31	\(\eta^2=\)

A quick look at an F-distribution table will give you a sense of what values of the MSR will yield low p-values.

What does that p-value tell us? Does it inform us about the magnitude of the differences among the group means?

ANOVA in R

Conducting ANOVA in R is simple. Expand the code to see –>

# Our data frame is "ambiguous"
# Our model is:  Comprehension ~ Condition
# We use aov() to conduct the analysis of variance
# I'll store the results in a list called "model"
model <- aov( Comprehension ~ Condition, data = ambiguous )

# The default output isn't too helpful
# It only shows SS and df, with a residual standard error
model
#    Call:
#       aov(formula = Comprehension ~ Condition, data = ambiguous)
#    
#    Terms:
#                    Condition Residuals
#    Sum of Squares   35.05263  94.52632
#    Deg. of Freedom         2        54
#    
#    Residual standard error: 1.32306
#    Estimated effects may be unbalanced

# We can use the anova() function to produce more useful output
anova(model)
#    # A tibble: 2 × 5
#         Df `Sum Sq` `Mean Sq` `F value`     `Pr(>F)`
#    * <int>    <dbl>     <dbl>     <dbl>        <dbl>
#    1     2 35.05263 17.526316  10.01225 0.0002002136
#    2    54 94.52632  1.750487        NA           NA

# Verify that these values match our calculations

# The aov() function produces much more output than we're seeing
# We can glance at some of this output with the broom package
# The following code will install the broom package (if necessary)
list.of.packages <- c("broom")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages)

library(broom)
tidy(model)
#    # A tibble: 2 × 6
#           term    df    sumsq    meansq statistic      p.value
#          <chr> <dbl>    <dbl>     <dbl>     <dbl>        <dbl>
#    1 Condition     2 35.05263 17.526316  10.01225 0.0002002136
#    2 Residuals    54 94.52632  1.750487        NA           NA
glance(model)
#    # A tibble: 1 × 11
#      r.squared adj.r.squared   sigma statistic      p.value    df    logLik
#          <dbl>         <dbl>   <dbl>     <dbl>        <dbl> <int>     <dbl>
#    1 0.2705118     0.2434937 1.32306  10.01225 0.0002002136     3 -95.29557
#    # ... with 4 more variables: AIC <dbl>, BIC <dbl>, deviance <dbl>,
#    #   df.residual <int>

# By the end of this course, you'll understand what all those
# terms mean: r.squared, logLik, AIC, deviance

Effect size: \(\eta^2\) and \(\omega^2\)

We can define an effect size (eta-squared) as: \(\eta ^{2}=\frac{\textrm{SS}_\textrm{A}}{\textrm{SS}_\textrm{T}}=\frac{35.05}{129.58}=0.27\)

Interpret eta-squared in this scenario.

Eta-squared is based entirely on sums of squares from the sample and no adjustment is made to estimate the effect size for the population. As we’ll learn later in this course, our models typically overfit our sample data.

As an adjusted effect size, we can calculate omega-squared: \(\omega ^{2}=\frac{\textrm{SS}_\textrm{A}-\left (\textrm{df}_{\textrm{A}} \right )\left ( \textrm{MS}_\textrm{E} \right )}{\textrm{SS}_\textrm{T}+\textrm{MS}_\textrm{E}}\)

For our ambiguous prose data: \(\omega ^{2}=\frac{35.05- \left (2 \right )\left ( 1.75 \right )}{129.58+1.75}=0.25\)

Why is \(\omega^2 < \eta^2\)? Based on the p-value and effect size estimates, what conclusions are you willing to make?

Evaluating ANOVA conditions (assumptions)

List the conditions necessary to get valid results from an ANOVA. Identify why we need each condition.

To assess homogeneity of variances, we can use:

Hartley’s F-max test (sensitive to violations of normality)
Bartlett’s test (sensitive to violations of normality)
Levene’s test (robust against violations of normality)
Brown-Forsythe test (robust against violations of normality)

F-max test

The Fmax test, which should only be applied if data for each group are normally distributed, is something you can quickly estimate in your head. To do so, you calculate the following statistic and compare it to the Fmax distribution.

\(F_{max}=\frac{s^2_\mathrm{{biggest }}}{s^2_\mathrm{{smallest }}}=\frac{1.398^2}{1.257^2}=1.237\sim F_{max}\)

Levene’s test

Levene’s test, which is more appropriate if your data do not follow a normal distribution, is calculated by first converting every score to:

\(z_{ai}=\left |x_{ai}-\overline{X}_a \right |\)

The test statistic is calculated as:

\(W=\frac{\left ( N-a \right )}{a-1}\frac{\sum n_{a}\left ( \overline{z}_{a}-\overline{\overline{z}} \, \right )^2}{\sum \sum \left ( z_{ia}- \overline{z}_{a}\right )^2}\sim{F}^{a-1}_{N-a}\)

Expand the code to see how to conduct Levene’s test in R –>

# We need the 'car' package
list.of.packages <- c("car")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages)
library(car)

# Levene Test
leveneTest(Comprehension ~ Condition, data = ambiguous, center = mean)
#    # A tibble: 2 × 3
#         Df  `F value`  `Pr(>F)`
#    * <int>      <dbl>     <dbl>
#    1     2 0.08474171 0.9188715
#    2    54         NA        NA

# Brown-Forsythe Test (uses medians instead of means)
leveneTest(Comprehension ~ Condition, data = ambiguous, center = median)
#    # A tibble: 2 × 3
#         Df  `F value`  `Pr(>F)`
#    * <int>      <dbl>     <dbl>
#    1     2 0.02432432 0.9759798
#    2    54         NA        NA

Based on the output from Levene’s test, can we conclude our group variances are equal? Explain.

Evaluating normality assumption

There are several ways to test for normality:

Histomancy: The art of divining likelihood functions from empirical histograms. This sorcery is used, for example, when testing for normality before deciding whether or not to use a non-parametric procedure. Histomancy is a false god, because even perfectly good Gaussian varriables may not look Gaussian when displayed as a histogram. (McElreath, p. 282)
Q-Q plots (which have the same issues as looking at histograms)
Back-of-the-envelope test: Convert the maximum and minimum scores to z-scores. You shouldn’t have extreme z-scores in small datasets (\(z>4\) should only occur with at least 15,000 observations in the group)
Shapiro-Wilk test (seems to be the most powerful)
Kolmogorov-Smirnov test
Székely and Rizzo’s energy test

These methods (except for the energy test) are easy to understand. Expand the code to see how to conduct the Shapiro-Wilk test in R.

# The function is called 'shapiro.test`.  It would test for normality
# of all the data in our data frame (ignoring the groups)
shapiro.test(ambiguous$Comprehension)
#    
#       Shapiro-Wilk normality test
#    
#    data:  ambiguous$Comprehension
#    W = 0.94333, p-value = 0.009909

# If we want to apply (tapply) the shapiro.test to all our groups, we use:
tapply(ambiguous$Comprehension, ambiguous$Condition, shapiro.test)
#    $After
#    
#       Shapiro-Wilk normality test
#    
#    data:  X[[i]]
#    W = 0.94309, p-value = 0.2995
#    
#    
#    $Before
#    
#       Shapiro-Wilk normality test
#    
#    data:  X[[i]]
#    W = 0.90423, p-value = 0.05805
#    
#    
#    $None
#    
#       Shapiro-Wilk normality test
#    
#    data:  X[[i]]
#    W = 0.94886, p-value = 0.3779

# We could then look at a Q-Q plot for that "before" group
# qqnorm(ambiguous$Comprehension[ambiguous$Condition=="Before"], ylab = "Essay scores"); qqline(ambiguous$Comprehension[ambiguous$Condition=="Before"])

Dealing with violations of conditions

What can we do if the conditions aren’t satisfied (or if the assumptions aren’t reasonable)? We’ll deal with this issue all semester. Some options include:

Transform your data to force it to better approximate a normal distribution (perhaps by using logarithms or Box-Cox transformation).
Use a nonparametric test (randomization-based methods or a test like the Kruskal-Wallis test)
Use robust ANOVA procedures (which we’ll investigate later in the course)
Use Bayesian methods (which we’ll investigate a bit in this course and in-detail in STAT 305)

Recall from the first lesson that if we want to conduct a t-test without a normality assumption, we could use the Welch-Satterthwaite approximation. We can also do this in ANOVA.

Expand the code to see how –>

# Notice the degrees of freedom differ from our "regular" ANOVA
oneway.test(Comprehension ~ Condition, data=ambiguous, var.equal=FALSE)
#    
#       One-way analysis of means (not assuming equal variances)
#    
#    data:  Comprehension and Condition
#    F = 9.8552, num df = 2.000, denom df = 35.933, p-value = 0.0003871

Randomized mean-square-ratios

We’ve seen how to conduct randomization-based tests of the SAD statistic (which do not account for variance within each group). We’ve also seen how to conduct an ANOVA (which requires conditions of normality and equal variances). Can we get the best of both worlds?

Let’s see if we can conduct a randomization-based test using the F-statistic (mean square ratio) as our test statistic. Expand the code to see how this is done.

# We'll store our observed mean square ratio of 10.012 as MSRobserved
# I could have used:  MSRobserved <- 10.012
MSRobserved <- tidy(aov(Comprehension~Condition, data=ambiguous))$statistic[1]

# 10,000 replications, shuffling the Condition each time
MSRrandomized <- Do(10000) * tidy(aov(Comprehension ~ shuffle(Condition), data=ambiguous))$statistic[1]

# Plot results
ggplot(data = MSRrandomized, aes(x = result)) +
  geom_density(fill="lightblue", color="white", alpha = 0.8) +
  annotate("segment", x = MSRobserved, xend = MSRobserved, y = 0, yend = .2, color = "red") +
  annotate("text", x = MSRobserved, y = .25, label = "observed MSR = 10.012", color = "red") +
  labs(
      title = "Randomized mean square ratios",
      x = "mean square ratios"
      ) +
  scale_x_continuous(limits = c(0,12), breaks=seq(0, 12, 2), minor_breaks=NULL) +
    theme(
    axis.text.x = element_text(size = 11, color="grey10"),
    legend.position = "none",
    panel.grid.major.y = element_line(colour = "white"),
    panel.grid.major.x = element_line(colour = "white", size=.15),
    panel.grid.minor = element_blank(),
    panel.background = element_rect(fill = "grey93")
  )


# Estimate p-value
pvalue <- prop(MSRrandomized$result >= MSRobserved)
pvalue
#      TRUE 
#    0.0002

ANOVA vs. t-test

Let’s (finally) finish things up by looking at the relationship between the F-statistic (mean square ratio) in an ANOVA and the t-statistic in a t-test:

\(t_{\textrm{n}_1+\textrm{n}_2-2}=\frac{\overline{X}_{1}-\overline{X}_{2}}{\sqrt{\frac{1}{\textrm{n}_1}+\frac{1}{\textrm{n}_2}}\sqrt{\frac{\left ( \textrm{n}_1-1 \right )s_{1}^{2}+\left ( \textrm{n}_2-1 \right )s_{2}^{2}}{\textrm{n}_1+\textrm{n}_2-2}}}=\frac{\overline{X}_{1}-\overline{X}_{2}}{\sqrt{\frac{1}{\textrm{n}_1}+\frac{1}{\textrm{n}_2}}\sqrt{s_{pooled}^{2}}}\)

\(t_{{\textrm{n}_1+\textrm{n}_2-2}}^{2}=\frac{\left (\overline{X}_{1}-\overline{X}_{2} \right )^{2}}{\left ( \frac{1}{n_{1}}+ \frac{1}{n_{2}}\right )}s_{pooled}^{2}=\frac{\left ( n_{1}+n_{2} \right )\left (\overline{X}_{1}-\overline{X}_{2} \right )^{2}}{s_{pooled}^{2}}=\frac{n_{1}\left (\overline{X}_{1}-\overline{X}_{2} \right )^{2}+n_{2}\left (\overline{X}_{1}-\overline{X}_{2} \right )^{2}}{{s_{pooled}^{2}}}\)

\(t_{{\textrm{n}_1+\textrm{n}_2-2}}^{2}=\frac{\sum n_{a}\left (\overline{X}_{a}-M \right )^{2}/1}{\textrm{MS}_{\textrm{E}}}=\frac{\textrm{SS}_{\textrm{A}}/\textrm{df}_{\textrm{A}}}{\textrm{MS}_{\textrm{E}}}=\frac{\textrm{MS}_{\textrm{A}}}{\textrm{MS}_{\textrm{E}}}=F\)

Explain the consequences of what was just derived. Also, verify this result with the following output.

# Recreate the spider dataset from lesson #1
spider <- tibble(
  group = c( rep("photo", 12), rep("real", 12) ),
  anxiety = c(30, 35, 45, 40, 50, 35, 55, 25, 30, 45, 40, 50, 
              40, 35, 50, 55, 65, 55, 50, 35, 30, 50, 60, 39))

# Run a t-test
t.test(anxiety ~ group, data=spider, alternative = c("less"), var.equal = TRUE, conf.level = 0.95)
#    
#       Two Sample t-test
#    
#    data:  anxiety by group
#    t = -1.6813, df = 22, p-value = 0.05342
#    alternative hypothesis: true difference in means is less than 0
#    95 percent confidence interval:
#          -Inf 0.1490421
#    sample estimates:
#    mean in group photo  mean in group real 
#                     40                  47
# Store the value of the t-statistic (-1.68)
tstat <- t.test(anxiety ~ group, data=spider, alternative = c("less"), var.equal = TRUE, conf.level = 0.95)$statistic

# Run an ANOVA on the same data
anova(aov(anxiety ~ group, data=spider))
#    # A tibble: 2 × 5
#         Df `Sum Sq` `Mean Sq` `F value`  `Pr(>F)`
#    * <int>    <dbl>     <dbl>     <dbl>     <dbl>
#    1     1      294       294  2.826923 0.1068392
#    2    22     2288       104        NA        NA
# Notice the F-statistic = 2.8269
# Is that the same as the tstat squared?  Let's see...
tstat^2
#           t 
#    2.826923

Condition	n	mean	sd
After	19	3.211	1.398
Before	19	4.947	1.311
None	19	3.368	1.257
Combined	N = 57	M = 3.842	s = 1.521

aov( anova(Comprehension ~ Condition, data = ambiguous) )

Comparing 2+ means: ANOVA

Scenario: Ambiguous Prose

Passage

Picture

Data Exploration

Dotplot

Boxplot

Means with standard errors

Hypotheses

Idea #1: multiple t-tests

Problems with multiple t-tests

Idea #2: get MAD (or SAD)

Test statistic

Randomization-based SAD test

Problems with randomized SAD (or MAD) test

Idea #3: ANOVA (using the mean square ratio)

Model

Partitioning (calculating) the sources of variation

Total variation = MStotal

Within-group variation = MSerror

Between-group variation = MSphoto

ANOVA summary of calculations

One more look at MSerror

ANOVA summary table

ANOVA in R

Effect size: \(\eta^2\) and \(\omega^2\)

Evaluating ANOVA conditions (assumptions)

F-max test

Levene’s test

Evaluating normality assumption

Dealing with violations of conditions

Randomized mean-square-ratios

ANOVA vs. t-test

Your turn

Publishing Your Turn solutions