Scenario: Job interviews

Does having a disability influence how others perceive your performance in a job interview? If so, is the effect positive or negative?




To study this, researchers prepared five videotaped job interviews using the same two male actors for each. A set script was designed to reflect an interview with an applicant of average qualifications. The tapes differed only in that the applicant appeared with a different handicap:

     • In one, he appeared in a wheelchair

     • In a second, he appeared on crutches

     • In a third, his hearing was impaired

     • In a fourth, he appeared to have one leg amputated

     • In a fifth, he appeared to have no disability


Seventy undergraduate students from an American university were randomly assigned to view the tapes, 14 to each tape. After viewing the tape, each subject rated the qualifications of the applicant on a 0-10 point scale.
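As an aside, here's a minimal sketch of how this sort of random assignment could be carried out in R (the tape labels and seed are ours, just for illustration):

# Randomly assign 70 subjects to the five tapes (14 per tape)
set.seed(1)   # arbitrary seed so the assignment is reproducible
tapes <- c("Amputee", "Crutches", "Hearing", "none", "Wheelchair")
assignment <- sample(rep(tapes, each = 14))
table(assignment)   # verify 14 subjects were assigned to each tape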


Our primary questions are:

     • Is having a disability associated with higher or lower interview scores?

     • Do permanent disabilities (amputee, hearing, wheelchair) receive higher or lower scores than the temporary disability (crutches)?



Data, assumptions, ANOVA

Here’s a summary of the interview scores given by the 70 undergraduate students:


[Summary of the 70 interview scores by disability group]


Let’s download this data, visualize it, and test some assumptions for an ANOVA.

# Download data from my website and store it as "interviews"
interviews <- read.csv("http://www.bradthiessen.com/html5/data/disability.csv")

# See the first several rows of data to get a sense of the contents
head(interviews)
##   disability score
## 1       none   1.9
## 2       none   2.5
## 3       none   3.0
## 4       none   3.6
## 5       none   4.1
## 6       none   4.2
# Stripplot (stripplot() and bwplot() come from the lattice package)
library(lattice)
stripplot(score ~ disability, data=interviews, jitter.data = TRUE, 
          xlab = "Disability", cex=1.3, col="steelblue", aspect="1", pch=20)

# Boxplot
bwplot(score ~ disability, data=interviews, col="steelblue", xlab="Group", ylab="scores",
       par.settings=list(box.rectangle=list(lwd=3), box.umbrella=list(lwd=3)))

# Error bars showing mean and standard errors
# I need to load the ggplot2 library to generate this plot
library(ggplot2)
ggplot(interviews, aes(disability, score)) + 
  stat_summary(fun.y = mean, geom = "point") + 
  stat_summary(fun.y = mean, geom = "line", aes(group = 1),
               colour = "Blue", linetype = "dashed") +
  stat_summary(fun.data = mean_cl_boot, geom = "errorbar", width = 0.2) +
  labs(x = "Disability", y = "Interview score")

# Run Shapiro-Wilk test for normality
# I run the test for each disability and then print all p-values
a <- round(shapiro.test(interviews$score[interviews$disability=="Amputee"])$p.value,3)
c <- round(shapiro.test(interviews$score[interviews$disability=="Crutches"])$p.value,3)
h <- round(shapiro.test(interviews$score[interviews$disability=="Hearing"])$p.value,3)
n <- round(shapiro.test(interviews$score[interviews$disability=="none"])$p.value,3)
w <- round(shapiro.test(interviews$score[interviews$disability=="Wheelchair"])$p.value,3)
paste("Amputee: p =", a, ". Crutch: p =", c, ". Hear: p =", h, ". none: p =", n, ". W-chair: p =", w)
## [1] "Amputee: p = 0.805 . Crutch: p = 0.548 . Hear: p = 0.932 . none: p = 0.971 . W-chair: p = 0.425"
# Levene's test for equal variances (leveneTest() comes from the car package)
library(car)
leveneTest(score ~ disability, data=interviews)
## Levene's Test for Homogeneity of Variance (center = median)
##       Df F value Pr(>F)
## group  4  0.1974 0.9389
##       65
  1. Evaluate the assumptions necessary to conduct an ANOVA on this dataset.










Let’s run the ANOVA:

# Run the analysis of variance (aov) model
model <- aov(score ~ disability, data=interviews)

# Show the output in an ANOVA summary table
anova(model)
## Analysis of Variance Table
## 
## Response: score
##            Df  Sum Sq Mean Sq F value  Pr(>F)  
## disability  4  30.521  7.6304  2.8616 0.03013 *
## Residuals  65 173.321  2.6665                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


  1. From this, what can we conclude? Calculate and interpret an effect size.










Even though the p-value indicates it’s unlikely we would have observed data like these if the group population means were equal, the ANOVA does not allow us to identify which group means differ from one another.

Since the questions of interest in this study are related to comparisons among specific group means, we need to derive some methods to do just that. We’ll classify these post-hoc methods as:

     • Pairwise comparisons

     • Flexible comparisons (contrasts)



Pairwise comparisons

Since we found a statistically significant result with our ANOVA, why don’t we just compare each pair of group means using t-tests?

  1. Calculate the probability of making at least one Type I error if we conducted t-tests on all pairs of group means in this study (using \(\alpha =0.05\) for each t-test).










  1. If we wanted the overall Type I error rate for all pairwise t-tests combined to be \(\alpha =0.05\), what could we do?










Bonferroni correction

This is the idea behind the Bonferroni correction. If we want to maintain an overall Type I error rate, we need to divide \(\alpha\) for each test by the number of tests we want to conduct:


Bonferroni correction:   \(\alpha_{\textrm{each test}}=\frac{\alpha_{\textrm{overall}}}{\textrm{\# of tests}}\).


So, for example, if we wanted to conduct 10 t-tests and maintain an overall Type I error rate of 0.05, we would need to use \(\alpha_{\textrm{each test}}=0.005\).

Instead of adjusting the Type I error rate for each test, we could correct the p-values after running the t-tests. To do this, we would use:


\(\left (\textrm{p-value from each test} \right )\left ( \textrm{\# of tests to be conducted} \right )\)
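Base R’s p.adjust() function applies this multiplication for us (capping the corrected p-values at 1). Here’s a quick sketch with three made-up p-values:

# Three hypothetical p-values from three planned t-tests
p <- c(0.010, 0.040, 0.300)

# Multiply each p-value by the number of tests (capping at 1)
pmin(1, p * length(p))
## [1] 0.03 0.12 0.90

# p.adjust() applies the same Bonferroni correction
p.adjust(p, method = "bonferroni")
## [1] 0.03 0.12 0.90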


Let’s see the Bonferroni correction in action. Remember, the p-value from the ANOVA we conducted earlier implies at least one of our group means differs from the others. (This isn’t exactly true; it actually tells us that at least one linear combination of our group means differs significantly from zero.)

We’ll have R run t-tests on all pairs of group means. First, we’ll have R run the tests with no adjustment to the p-values. Then, we’ll adjust the p-values with the Bonferroni correction.


# Pairwise t-tests with no adjustment
with(interviews, pairwise.t.test(score, disability, p.adjust.method="none"))
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  score and disability 
## 
##            Amputee Crutches Hearing none  
## Crutches   0.0184  -        -       -     
## Hearing    0.5418  0.0035   -       -     
## none       0.4477  0.1028   0.1732  -     
## Wheelchair 0.1433  0.3520   0.0401  0.4756
## 
## P value adjustment method: none
# Pairwise t-tests with Bonferroni correction on the p-values
# Verify the p-values increase by a factor of 10 (in this example)
with(interviews, pairwise.t.test(score, disability, p.adjust.method="bonferroni"))
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  score and disability 
## 
##            Amputee Crutches Hearing none 
## Crutches   0.184   -        -       -    
## Hearing    1.000   0.035    -       -    
## none       1.000   1.000    1.000   -    
## Wheelchair 1.000   1.000    0.401   1.000
## 
## P value adjustment method: bonferroni


  1. From this output (using the Bonferroni correction), what conclusions can we make? Which disability group means differ from others?










  1. Remember, the Bonferroni correction increases the magnitude of our p-values (by lowering the Type I error rate). This makes it less likely to reject a null hypothesis and conclude a pair of group means differs from one another. What does that do to the power of our statistical tests? Identify a drawback with the Bonferroni correction.











  1. Should we only conduct a post-hoc test (like pairwise t-tests with the Bonferroni correction) when our overall ANOVA produces a statistically significant result? Explain.











Because Bonferroni’s correction is conservative (and can struggle to find mean differences that actually exist), several other post-hoc strategies have been developed.


Holm’s method

One alternative strategy that’s relatively easy to explain is Holm’s method. You first conduct all pairwise t-tests to obtain a p-value from each test. Then, you rank order those p-values from largest to smallest. For our data, the ranked order of our (uncorrected) p-values is:

  • 1: Hearing vs Amputee (p=0.5418)
  • (I’ll skip 2 through 5)
  • 6: Wheelchair vs Amputee (p=0.1433)
  • 7: None vs Crutches (p=0.1028)
  • 8: Hearing vs Wheelchair (p=0.0401)
  • 9: Crutches vs Amputee (p=0.0184)
  • 10: Crutches vs Hearing (p=0.0035)

We then correct each of those p-values by multiplying the p-value by its rank:

  • 1: Hearing vs Amputee (p=0.5418 multiplied by 1 = 0.542; Holm then raises this to 1.000 so the adjusted p-values stay in the same order as the unadjusted p-values)
  • (I’ll skip 2 through 5)
  • 6: Wheelchair vs Amputee (p=0.1433 multiplied by 6 = 0.860)
  • 7: None vs Crutches (p=0.1028 multiplied by 7 = 0.719)
  • 8: Hearing vs Wheelchair (p=0.0401 multiplied by 8 = 0.321)
  • 9: Crutches vs Amputee (p=0.0184 multiplied by 9 = 0.165)
  • 10: Crutches vs Hearing (p=0.0035 multiplied by 10 = 0.035)

Notice that for the smallest p-value, this method is exactly the same as the Bonferroni correction. In subsequent comparisons, however, the p-value is corrected only for the remaining number of comparisons. This makes it more likely to find a difference between a pair of group means.
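In fact, we can reproduce these corrected values by applying p.adjust() directly to the ten uncorrected p-values from the earlier output. (p.adjust() also handles the ordering detail mentioned above: it raises adjusted values where needed so they never fall out of order, which is why the largest p-value ends up at 1.000.) Because the inputs below are already rounded to four decimals, the last digit can differ slightly from the table that follows:

# The ten uncorrected p-values, in the order: Crutches-Amputee,
# Hearing-Amputee, none-Amputee, Wheelchair-Amputee, Hearing-Crutches,
# none-Crutches, Wheelchair-Crutches, none-Hearing, Wheelchair-Hearing,
# Wheelchair-none
p <- c(0.0184, 0.5418, 0.4477, 0.1433, 0.0035,
       0.1028, 0.3520, 0.1732, 0.0401, 0.4756)

# Holm's correction
round(p.adjust(p, method = "holm"), 3)
##  [1] 0.166 1.000 1.000 0.860 0.035 0.720 1.000 0.866 0.321 1.000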


Let’s have R calculate the pairwise tests using Holm’s method:

# Holm's method
with(interviews, pairwise.t.test(score, disability, p.adjust.method="holm"))
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  score and disability 
## 
##            Amputee Crutches Hearing none 
## Crutches   0.165   -        -       -    
## Hearing    1.000   0.035    -       -    
## none       1.000   0.719    0.866   -    
## Wheelchair 0.860   1.000    0.321   1.000
## 
## P value adjustment method: holm


Other post-hoc methods

There are many more post-hoc methods (e.g., the Benjamini-Hochberg false discovery rate procedure, Tukey’s Honestly Significant Difference, Dunnett’s method) that we simply don’t have time to investigate. It is important, though, that you get a sense for which methods perform best.
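Several of these corrections are built directly into R’s p.adjust() function (and therefore into pairwise.t.test()); Tukey’s HSD and Dunnett’s method are separate procedures, and Tukey’s is shown below. You can list the built-in corrections with:

# Adjustment methods available in base R
p.adjust.methods
## [1] "holm"       "hochberg"   "hommel"     "bonferroni" "BH"        
## [6] "BY"         "fdr"        "none"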

As you learned in a previous statistics class, the Type I error rate is linked to the statistical power of a test: if we decrease the probability of making a Type I error, the power declines (and vice versa). Because of this, it’s important that post-hoc methods control the overall Type I error rate without sacrificing too much power.

Bonferroni’s correction and Tukey’s HSD control the Type I error rate well, but they lack statistical power. Bonferroni’s correction has more power when testing a small number of comparisons, while Tukey’s HSD performs better with a larger number of comparisons. Holm’s method is more powerful than both Bonferroni and Tukey (and Hochberg’s method is more powerful still).

Each of these methods, however, performs poorly if the population group variances differ and the sample sizes are unequal. If you have unequal sample sizes and unequal variances, you should probably use a robust post-hoc comparison method; one option is sketched below.
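One simple option in base R is to drop the pooled-variance assumption and run Welch-type pairwise t-tests; here’s a sketch (not the only choice; robust methods based on trimmed means are another route):

# Pairwise t-tests without a pooled SD (Welch tests), Holm-corrected
with(interviews, pairwise.t.test(score, disability,
                                 pool.sd = FALSE, p.adjust.method = "holm"))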


In case you’re interested, here’s the code to conduct a Tukey’s HSD:

# Conduct the post-hoc test
TukeyHSD(model, conf.level=.95)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = score ~ disability, data = interviews)
## 
## $disability
##                           diff        lwr        upr     p adj
## Crutches-Amputee     1.4928571 -0.2388756  3.2245899 0.1232819
## Hearing-Amputee     -0.3785714 -2.1103042  1.3531613 0.9724743
## none-Amputee         0.4714286 -1.2603042  2.2031613 0.9399911
## Wheelchair-Amputee   0.9142857 -0.8174470  2.6460185 0.5781165
## Hearing-Crutches    -1.8714286 -3.6031613 -0.1396958 0.0277842
## none-Crutches       -1.0214286 -2.7531613  0.7103042 0.4686233
## Wheelchair-Crutches -0.5785714 -2.3103042  1.1531613 0.8812293
## none-Hearing         0.8500000 -0.8817328  2.5817328 0.6442517
## Wheelchair-Hearing   1.2928571 -0.4388756  3.0245899 0.2348141
## Wheelchair-none      0.4428571 -1.2888756  2.1745899 0.9517374
# See a plot of the confidence intervals for each pairwise mean difference
plot(TukeyHSD(model))



Scheffe method for flexible comparisons (contrasts)

The previous methods allow us to compare pairs of means. Sometimes, as in this scenario, we’ll want more flexibility. For example, think about the key questions in this study: Is having a disability (of any kind) associated with higher or lower interview scores than having no disability? Do the permanent disabilities receive higher or lower scores than the temporary one?

To address these questions, we’ll need to compare linear combinations of group means.


Let’s focus on that first contrast (no disability vs. the 4 disability groups). To compare this linear combination of means, we can use the Scheffe method.


With the Scheffe method, we first need to calculate our contrast of interest by choosing weights in our linear combination:

Contrast (linear combination of means):   \(\psi =\sum_{a=1}^{5}c_{a}\mu _{a}=c_{1}\mu _{1}+c_{2}\mu _{2}+c_{3}\mu _{3}+c_{4}\mu _{4}+c_{5}\mu _{5}, \ \ \ \textrm{where} \ \sum c_{a}=0\)


  1. Calculate an estimate of the contrast to compare the none group mean with a combination of the other 4 group means. Under a true null hypothesis, the expected value of this contrast would equal what?










Our contrast isn’t zero, but we wouldn’t expect any contrast calculated from real data to equal zero. Even if the population values of our group means were identical, we’d expect our sample group means to differ and our sample contrast to be nonzero.

So our task is to determine the likelihood of observing a contrast as or more extreme than the one we calculated (assuming the null hypothesis is true). To do this, we need to derive the sampling distribution of our contrast.
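One way to see this sampling distribution directly, before deriving it, is a quick simulation: shuffle the disability labels (which forces the null hypothesis to be true) and recompute the contrast many times. A minimal sketch, using the contrast weights introduced later in this section:

# Approximate the null sampling distribution of the contrast by
# repeatedly shuffling the group labels and recomputing psi
set.seed(42)   # arbitrary seed for reproducibility
null_psi <- replicate(5000, {
  shuffled <- sample(interviews$disability)
  m <- tapply(interviews$score, shuffled, mean)
  # weights in alphabetical group order: Amputee, Crutches, Hearing, none, Wheelchair
  sum(c(-1/4, -1/4, -1/4, 1, -1/4) * m)
})
hist(null_psi, xlab = "Contrast value under a true null hypothesis")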

Recall the general form of a t-test. We may be able to use the t-distribution to determine the likelihood of our contrast:

      \(t=\frac{\textrm{observed}-\textrm{hypothesized}}{\textrm{standard error}}=\frac{\hat{\psi }-0}{\textrm{SE}_{\psi }}\)


If this works, we only need to derive the standard error of a contrast. Then, we’d know the distribution of our sample contrast and we could estimate its likelihood.

To derive the standard error of a contrast, we need to remember two properties of variances: the variance of a sum of independent variables is the sum of their individual variances, \(\textrm{var}\left ( X+Y \right )=\textrm{var}\left ( X \right )+\textrm{var}\left ( Y \right )\), and multiplying a variable by a constant multiplies its variance by the square of that constant, \(\textrm{var}\left ( cX \right )=c^{2}\,\textrm{var}\left ( X \right )\).


With these two facts, we can calculate the variance of our contrast:

      \(\textrm{var}\left ( \hat{\psi} \right )=\textrm{var}\left ( c_{1}\overline{X}_{1} \right )+\textrm{var}\left ( c_{2}\overline{X}_{2} \right )+\textrm{var}\left ( c_{3}\overline{X}_{3} \right )+\textrm{var}\left ( c_{4}\overline{X}_{4} \right )+\textrm{var}\left ( c_{5}\overline{X}_{5} \right )\)

      \(\textrm{var}\left ( \hat{\psi} \right )=c_{1}^{2}\textrm{var}\left ( \overline{X}_{1} \right )+c_{2}^{2}\textrm{var}\left ( \overline{X}_{2} \right )+c_{3}^{2}\textrm{var}\left ( \overline{X}_{3} \right )+c_{4}^{2}\textrm{var}\left ( \overline{X}_{4} \right )+c_{5}^{2}\textrm{var}\left ( \overline{X}_{5} \right )\)


We now need one other fact about the variance of a mean:   \(\textrm{var}\left ( \overline{X} \right )=\frac{\sigma _{x}^{2}}{n}\)


      \(\textrm{var}\left ( \hat{\psi} \right )=c_{1}^{2}\frac{\sigma _{1}^{2}}{n_{1}}+c_{2}^{2}\frac{\sigma _{2}^{2}}{n_{2}}+c_{3}^{2}\frac{\sigma _{3}^{2}}{n_{3}}+c_{4}^{2}\frac{\sigma _{4}^{2}}{n_{4}}+c_{5}^{2}\frac{\sigma _{5}^{2}}{n_{5}}\)


If we’re going to assume the group variances are equal (which we already assumed in our ANOVA), we have:


      \(\textrm{var}\left ( \hat{\psi} \right )=\sum \frac{c_{a}^{2}}{n_{a}}\sigma _{x}^{2}\)


Finally, we know \(\textrm{MS}_{\textrm{E}}\) is our best estimate of the (pooled) common variance \(\sigma _{x}^{2}\) of our groups, so we can substitute:


      \(\textrm{var}\left ( \hat{\psi} \right )=\textrm{MS}_{\textrm{E}}\sum \frac{c_{a}^{2}}{n_{a}}\)


That’s the variance of our contrast; its square root is the standard error. We can now substitute this into our t-statistic:


      \(t=\frac{\textrm{observed}-\textrm{hypothesized}}{\textrm{standard error}}=\frac{\hat{\psi }-0}{\textrm{SE}_{\hat{\psi }}}=\frac{\hat{\psi }-0}{\sqrt{\textrm{MS}_{\textrm{E}}\sum \frac{c_{a}^{2}}{n_{a}}}}\)

Because we chose this contrast after seeing the data, the Scheffe method compares this statistic to \(\sqrt{\left ( a-1 \right )F_{\alpha ;\ a-1,\ N-a}}\) (rather than to a critical value from the t-distribution), which controls the Type I error rate across all possible contrasts.


  1. Use what was just derived to estimate the p-value of our contrast of interest. What conclusions can we make?












Now let’s try to calculate this Scheffe statistic in R. Remember, we’re comparing the mean of the “no disability” group to the combination of the 4 disability group means:

# Create vector of contrast coefficients
# (groups are in alphabetical order: Amputee, Crutches, Hearing, none, Wheelchair)
coeff <- c(-1/4,-1/4,-1/4,1,-1/4)

# Store group means as mm
# (this formula interface for mean() comes from the mosaic package)
library(mosaic)
mm <- mean(score ~ disability, data=interviews)

# Store group sample sizes as nn
nn <- tapply(interviews$score, interviews$disability, length)

# Calculate contrast (psi)
psi <- sum(coeff*mm)

# Calculate standard error of contrast
# First, run the ANOVA to get the mean squares
omnianova <- anova(aov(score ~ disability, data=interviews))
# Then, store mean square error as MSE
MSE <- omnianova$`Mean Sq`[2]
# Finally, calculate the SE of our psi
SEpsi <- sqrt(MSE) * sqrt( sum((coeff^2)/nn) )

# Calculate our test statistic (psi / std. error)
PSIoverSE <- psi / SEpsi

# Find critical value
# Choose alpha
alpha = 0.05
# sqrt(a-1) x sqrt(F)
cv <- sqrt(omnianova$Df[1]) * sqrt(qf(1-alpha, omnianova$Df[1], omnianova$Df[2]))
# Paste results
paste("observed test.stat =", round(PSIoverSE,4), ". Critical value =", round(cv,4))
## [1] "observed test.stat = -0.0732 . Critical value = 3.1705"


Multiplying every weight by the same constant rescales both \(\hat{\psi }\) and its standard error, so we get the same result with another scaling of the same contrast:


# Same contrast as before, with every weight multiplied by 4
coeff <- c(-1,-1,-1,4,-1)
mm <- mean(score ~ disability, data=interviews)
nn <- tapply(interviews$score, interviews$disability, length)
psi <- sum(coeff*mm)
omnianova <- anova(aov(score ~ disability, data=interviews))
MSE <- omnianova$`Mean Sq`[2]
SEpsi <- sqrt(MSE) * sqrt( sum((coeff^2)/nn) )
PSIoverSE <- psi / SEpsi
alpha = 0.05
cv <- sqrt(omnianova$Df[1]) * sqrt(qf(1-alpha, omnianova$Df[1], omnianova$Df[2]))
paste("observed test.stat =", round(PSIoverSE,4), ". Critical value =", round(cv,4))
## [1] "observed test.stat = -0.0732 . Critical value = 3.1705"


Because the observed test statistic is less than the critical value, we’d conclude that we do not have evidence that the mean of the no disability group differs from the mean of the combination of the disability groups.
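We can also convert this comparison to a p-value: squaring the Scheffe statistic and dividing by (a − 1) puts it on an F scale, so we can use the F distribution directly. Here’s a sketch using the objects created above (for this contrast the p-value is essentially 1, matching the conclusion):

# Convert the observed statistic to the F scale: F = t^2 / (a - 1)
F_obs <- PSIoverSE^2 / omnianova$Df[1]

# Scheffe-adjusted p-value for this contrast
pf(F_obs, omnianova$Df[1], omnianova$Df[2], lower.tail = FALSE)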


  1. Write out the contrast you would use to compare the groups without a permanent disability (none and crutches) to the groups with a permanent disability (amputee, hearing, and wheelchair).










Let’s use the Scheffe method to compare this linear combination of group means:


# Weights, in alphabetical group order:
# Amputee = -2, Crutches = 3, Hearing = -2, none = 3, Wheelchair = -2
coeff <- c(-2,3,-2,3,-2)
mm <- mean(score ~ disability, data=interviews)
nn <- tapply(interviews$score, interviews$disability, length)
psi <- sum(coeff*mm)
omnianova <- anova(aov(score ~ disability, data=interviews))
MSE <- omnianova$`Mean Sq`[2]
SEpsi <- sqrt(MSE) * sqrt( sum((coeff^2)/nn) )
PSIoverSE <- psi / SEpsi
alpha = 0.05
cv <- sqrt(omnianova$Df[1]) * sqrt(qf(1-alpha, omnianova$Df[1], omnianova$Df[2]))
paste("observed test.stat =", round(PSIoverSE,4), ". Critical value =", round(cv,4))
## [1] "observed test.stat = 2.017 . Critical value = 3.1705"


  1. From this, what conclusions can you make?












Your turn

  1. A 2003 study investigated the effect of background music on the amount of money diners spend in restaurants. A British restaurant alternated among 3 types of background music – None (silence), Pop music, and Classical music – over the course of 18 days. Each type of music was played for 6 nights, with the order of the music being randomly determined.

    The data from N = 393 diners have been loaded in a data.frame called music. The variables are MusicType and Amount (the amount of money spent per person).

    (a) The variances in amount spent for each type of music range from 5.269 (for Classical music) to 11.445 (for None). That means the largest variance of a treatment is only a little more than twice as big as the smallest variance. When I conduct a Levene’s Test for equal variances (as displayed below), the p-value is 0.002. Explain why the p-value for this test is so low for this dataset if the sample variances only differ by a factor of 2.

    (b) Conduct an ANOVA to compare the average amount spent for the 3 types of music.

    (c) Use the Bonferroni correction to test all pairs of group means.

    (d) Use Holm’s method to compare all pairs of group means.

    (e) Write out all conclusions you can make about the effect of background music on the amount of money diners spend.

    (f) Use Scheffe’s method to compare no music (None) against the combination of Classical and Pop music.

    (g) Write a conclusion.




# Load data
music <- read.csv("http://www.bradthiessen.com/html5/data/music.csv")

# Means plot with standard error bars
ggplot(music, aes(MusicType, Amount)) + 
  stat_summary(fun.y = mean, geom = "point") + 
  stat_summary(fun.y = mean, geom = "line", aes(group = 1),
               colour = "Blue", linetype = "dashed") +
  stat_summary(fun.data = mean_cl_boot, geom = "errorbar", width = 0.2) +
  labs(x = "Music Type", y = "Money Spent")

# Levene's test for equal variances
leveneTest(Amount ~ MusicType, data=music)
## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value   Pr(>F)   
## group   2  6.2569 0.002115 **
##       390                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1




Publishing your solutions

When you’re ready to try the Your Turn exercises:


Sources:

Study: Cesare, Tannenbaum, Dalessio (1990). Interviewers’ Decisions Related to Applicant Handicap Type and Rater Empathy. Human Performance, 3(3)

This document is released under a Creative Commons Attribution-ShareAlike 3.0 Unported license.