Lab #7: Comparing two groups with NHST


Remember to download the report template for this lab and open it in RStudio. You can download the template by clicking this link: http://bradthiessen.com/html5/stats/m300/lab7report.Rmd


Theory-based NHST (t-test)


Doctors and obesity

Let’s begin with the example we went through in class: Do doctors spend less time with obese patients? We’ll attempt to answer this question via null hypothesis significance testing (NHST). We’ll conduct a t-test first; then we’ll use randomization-based methods.

Let’s load the data and take a look at its structure:

doctors <- read.csv(url("http://bradthiessen.com/html5/stats/m300/doctors.csv"))
str(doctors)
## 'data.frame':    71 obs. of  2 variables:
##  $ weight: Factor w/ 2 levels "average","obese": 1 1 1 1 1 1 1 1 1 1 ...
##  $ time  : int  15 15 45 40 45 20 40 30 40 30 ...

We want to test the difference in time between average and obese patients. Let’s visualize the distribution of the time variable by weight:

# Side-by-side dotplots
dotPlot( ~time | weight, data = doctors, 
          width=1,            #Width of each "bar" = 1
          layout = c(1,2),    #Plot 1 column, 2 rows
          xlab = "Minutes spent with patient")

# Side-by-side density plots
densityplot(~time | weight, data=doctors, 
            lw=4,          # Increase width of line
            bw=7,          # Increase bandwidth
            layout=c(1,2), # Plot 1 col. and 2 rows
            xlab = "Minutes with patient")

# Boxplots
bwplot(~time | weight, data=doctors, 
   ylab="Patient weight", xlab="Minutes with patient", 
   layout=(c(1,2)),
   par.settings = list(box.rectangle = list(col = "steelblue", lwd=3),
                             box.umbrella=list(lwd=3, col="steelblue")))

Each plot suggests that doctors report they would spend less time, on average, with a patient who is obese.

Using the favstats() command will give us summary statistics:

# Summary statistics (time as a function of weight)
favstats(time ~ weight, data=doctors)
##    weight min Q1 median Q3 max     mean       sd  n missing
## 1 average  15 25     30 40  50 31.36364 9.864134 33       0
## 2   obese   5 20     25 30  60 24.73684 9.652571 38       0

The difference in means (obese - average) equals:

24.737 - 31.364 = -6.627

We can get this difference with the diffmean() command:

diffmean(time ~ weight, data=doctors)
##  diffmean 
## -6.626794
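Note the sign convention: diffmean() subtracts the mean of the first factor level (average) from the mean of the second (obese), which is why the result is negative. As a quick manual check (assuming the doctors data is loaded as above):

```r
# diffmean() returns mean(second level) - mean(first level):
# here, mean time with obese patients minus mean time with
# average-weight patients
mean(doctors$time[doctors$weight == "obese"]) -
  mean(doctors$time[doctors$weight == "average"])
# [1] -6.626794
```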


Assumptions

We’re going to conduct an independent samples t-test to determine if the mean time spent with obese patients is less than the mean time spent with non-obese patients. Before we do this, let’s investigate a couple of the assumptions (conditions): normality and homogeneity of variances.

To check for normality, we can look at the density plots we constructed earlier. They didn’t look too bad, outside of the outlier in the obese group. We can also check for normality via Q-Q plots:

qqnorm(doctors$time[doctors$weight=="obese"], ylab = "Minutes with patients", main="Obese Patients"); qqline(doctors$time[doctors$weight=="obese"])   # Obese patients

qqnorm(doctors$time[doctors$weight=="average"], ylab = "Minutes with patients", main="Non-obese Patients"); qqline(doctors$time[doctors$weight=="average"])   # Non-obese patients

We could also run a Shapiro-Wilk normality test. This tests the hypothesis that the data follow a normal distribution. Let’s run this test for both the obese and non-obese groups:

shapiro.test(doctors$time[doctors$weight=="obese"])
## 
##  Shapiro-Wilk normality test
## 
## data:  doctors$time[doctors$weight == "obese"]
## W = 0.83843, p-value = 7.024e-05
shapiro.test(doctors$time[doctors$weight=="average"])
## 
##  Shapiro-Wilk normality test
## 
## data:  doctors$time[doctors$weight == "average"]
## W = 0.90958, p-value = 0.009531

The low p-values tell us there’s only a small chance we would have gotten data like ours if it had, in fact, come from normal distributions — so the normality assumption is questionable here.

To check for equal variances, we can run the F-max test we discussed in class:

varNonobese <- var(doctors$time[doctors$weight=="average"])
varObese <- var(doctors$time[doctors$weight=="obese"])
Fmax <- varNonobese / varObese
Fmax
## [1] 1.044316

Since this Fmax value is close to 1.00, we have evidence to support our decision to assume equal variances. We could also test equal variances with Bartlett’s test:

bartlett.test(time ~ weight, data=doctors)
## 
##  Bartlett test of homogeneity of variances
## 
## data:  time by weight
## Bartlett's K-squared = 0.015916, df = 1, p-value = 0.8996

The p-value of 0.8996 tells us there’s a good chance of getting this data if, in fact, the groups have equal variances.


t-test

To conduct the t-test, we can use the t.test() command. The arguments include:

t.test(y ~ x, data, alternative = c("two.sided", "less", "greater"), var.equal = FALSE, conf.level = 0.95)

In this scenario…

  • y ~ x = time ~ weight
  • data = doctors
  • alternative = "greater" (we’ll test if non-obese > obese)
  • var.equal = TRUE (we’ll try FALSE later)
  • conf.level = 0.95 (t.test() takes a confidence level rather than a significance level; 0.95 corresponds to the 0.05 significance level in the activity)

Let’s run the command:

t.test(time ~ weight, data=doctors,
       var.equal=TRUE, alternative="greater")
## 
##  Two Sample t-test
## 
## data:  time by weight
## t = 2.856, df = 69, p-value = 0.002831
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  2.758328      Inf
## sample estimates:
## mean in group average   mean in group obese 
##              31.36364              24.73684

The output shows our observed t-statistic (2.856) along with its degrees of freedom (69). The small p-value (0.0028) provides evidence that doctors report spending more time, on average, with non-obese patients.

If we want to get the corresponding confidence interval (like we did in class), we can find a two-sided 90% CI by adding $conf.int to the end of our command:

# Notice the conf.level is 0.90
# Also notice that I did not specify the alternative
t.test(time ~ weight, var.equal=TRUE, conf.level=0.90, data=doctors)$conf.int
## [1]  2.758328 10.495260
## attr(,"conf.level")
## [1] 0.9

Since both the lower and upper bounds of our confidence interval are positive, we have reason to believe the difference is positive (and not zero).


Effect size

In activity #19 (which we may not have yet completed in class), we discuss effect sizes. Specifically, we calculate and interpret Cohen’s d. We can do this manually in R using:

mean1 <- mean(doctors$time[doctors$weight=="average"])
mean2 <- mean(doctors$time[doctors$weight=="obese"])
var1 <- var(doctors$time[doctors$weight=="average"])
var2 <- var(doctors$time[doctors$weight=="obese"])
spooled <- sqrt( (var1 + var2) / 2 )

effectsize <- (mean1 - mean2) / spooled
effectsize
## [1] 0.6790496

That effect size, d = 0.679, indicates we have a medium-size effect.
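The spooled formula above uses a simple average of the two variances, which is reasonable here because the group sizes (33 and 38) are similar. A common alternative weights each variance by its degrees of freedom. A sketch using the n and sd values reported by favstats() earlier:

```r
# Cohen's d with a df-weighted pooled SD (summary stats from favstats)
n1 <- 33; sd1 <- 9.864134   # average-weight group
n2 <- 38; sd2 <- 9.652571   # obese group
sp <- sqrt(((n1 - 1)*sd1^2 + (n2 - 1)*sd2^2) / (n1 + n2 - 2))
(31.36364 - 24.73684) / sp  # roughly 0.68, close to our 0.679
```

With near-equal group sizes and near-equal variances, the two versions of the pooled SD barely differ.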


  1. A 1999 study investigated the relationship between breastfeeding and intelligence. The researchers assessed 322 four-year-old children on the McCarthy Scales of Children’s Abilities and calculated an overall measure of intelligence (the General Cognitive Index, GCI) for each child. Of the 322 children, 237 were breastfed.

    The data have been loaded as bfeed with the variables:
    Feeding: whether each child was “Breastfed” or “NotBreastfed”
    GCI: a score of intelligence.

    a) Create a couple visualizations of the data comparing GCI scores for the two groups of children.

    b) Calculate the mean and standard deviation of the GCI scores for each group.

    c) Investigate the normality and equal variance assumptions (using plots or tests).

    d) Conduct an independent samples t-test to compare mean GCI scores for children who were and were not breastfed.

    e) Construct a 90% confidence interval for the mean difference in GCI scores.

    f) Calculate an effect size.


Welch-Satterthwaite Method

If we aren’t willing to make an equal-variance assumption, we can use the Welch-Satterthwaite method. In R, we use the same t.test() command but we include the argument var.equal=FALSE:

t.test(time ~ weight, data=doctors, var.equal=FALSE, alternative="greater")
## 
##  Welch Two Sample t-test
## 
## data:  time by weight
## t = 2.8516, df = 67.174, p-value = 0.002887
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  2.750899      Inf
## sample estimates:
## mean in group average   mean in group obese 
##              31.36364              24.73684

As you can see, the p-value didn’t change much. This is because our group variances were so similar – the assumption of equal variances wasn’t much of a stretch. Notice, though, the degrees of freedom changed from 69 to 67.174.
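Where did the 67.174 degrees of freedom come from? The Welch–Satterthwaite approximation computes them from the group sizes and variances. A sketch using the favstats() values from earlier:

```r
# Welch-Satterthwaite degrees of freedom (values from favstats)
n1 <- 33; s1 <- 9.864134   # average-weight group
n2 <- 38; s2 <- 9.652571   # obese group
v1 <- s1^2 / n1            # squared standard error, group 1
v2 <- s2^2 / n2            # squared standard error, group 2
(v1 + v2)^2 / (v1^2/(n1 - 1) + v2^2/(n2 - 1))  # approximately 67.17
```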


  2. Using the same bfeed data, run an independent-samples t-test without the equal variance assumption.



Randomization-based methods


Comparing 2 group means

We can also compare time spent with obese and non-obese patients using randomization-based methods. These methods don’t require the conditions of normality or homogeneity of variances.

Our first step is to identify our test statistic of interest. The test statistic should be a single value that measures what we want to know. In this case, we want to know the difference between the means of our two groups.

As we did earlier, we can calculate this test statistic with the diffmean() command:

test.stat <- diffmean(time ~ weight, data = doctors)

Remember that for this dataset, the difference in means is: test.stat = -6.63.

Now that we have our test statistic, let’s randomize our data. Our null hypothesis is that the weight of a patient has no effect on the time doctors would spend with that patient. If this is the case, we can shuffle the weight of our patients with no effect on the time spent with them.

For example, the first doctor in this study was randomly assigned a non-obese “patient.” This doctor reported that he or she would spend 15 minutes with the patient. Now suppose we go back in time, run this study again, and randomly assign the doctor to an obese patient. Under our null hypothesis, this doctor would still report spending 15 minutes with the patient.

With this logic, we can randomly shuffle our data so that we randomly assign non-obese patients to 33 doctors and obese patients to 38 doctors (just like was done in the actual study). Once we shuffle our data like this, we calculate our test statistic.

We repeat this process 10,000 times — randomly shuffling our data and calculating our test statistic. This gives us a nice, large, representative sample of test statistic values if our null hypothesis were true. From this, we can estimate the likelihood of our actual test statistic (-6.63).

Let’s see how this works using the shuffle() command. We’ll shuffle our data twice and calculate the test statistic for each randomization:

set.seed(3141)  #Set the random number seed
diffmean(time ~ shuffle(weight), data=doctors)
## diffmean 
##  -6.3437

Take a look at the syntax and try to make sense of it. It might help to start at the right and work your way left.

diffmean(time ~ shuffle(weight), data=doctors)

  • Our dataframe is called doctors
  • We’re shuffling the weight of our patients with shuffle(weight)
  • time is modeled as a function of the shuffled weight
  • We’re calculating the difference in means with diffmean()

The output shows our test statistic in this first randomization equals -6.34. That’s fairly similar to our actual test statistic (-6.63). Let’s try a second randomization:

set.seed(3142)  #Set the random number seed
diffmean(time ~ shuffle(weight), data=doctors)
## diffmean 
## 1.016746

In this randomization, the test statistic equals +1.02. That means that in this randomization, doctors actually reported spending 1 more minute (on average) with obese patients.

Let’s get the values of our test statistic under 10,000 randomizations:

randomized <- do(10000) * diffmean(time ~ shuffle(weight), data=doctors)

Let’s plot the values of our test statistic for all 10,000 randomizations:

histogram(~ diffmean, data=randomized, groups=diffmean <= test.stat,
          xlim=c(-12,12), width=1,
          xlab="Mean differences assuming null hypothesis is true")
ladd(panel.abline(v=test.stat))

You can already see that our actual test statistic value of -6.63 is an unusual value (if our null hypothesis were true). Let’s see how unlikely our observed test statistic would be:

# We want the proportion of our 10,000 values
# in which the mean difference is less than
# or equal to our actual test.stat of -6.63
prop(~ (diffmean <= test.stat), data=randomized)
##   TRUE 
## 0.0034

That’s our p-value: p = 0.0034. Notice how close it is to the p-value from our t-test (p = 0.0028).
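Note that this is a one-sided p-value, matching our one-sided alternative. If we wanted a two-sided p-value from the same 10,000 randomizations, one common approach (a sketch, using the randomized and test.stat objects from above) is to count shuffles at least as extreme in either direction:

```r
# Two-sided randomization p-value: proportion of shuffled
# differences at least as far from zero as the observed -6.63
prop(~ (abs(diffmean) >= abs(test.stat)), data=randomized)
```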


  3. Use randomization-based methods to compare the means of the Breastfed and NotBreastfed groups in the bfeed dataset.

    a) Calculate the test statistic of interest and store it as test.stat

    b) Generate at least 10,000 randomizations of the data.

    c) Plot the distribution of test statistics generated from the randomizations.

    d) Estimate the p-value.


Randomization-based test of medians

We can extend our randomization-based methods to other test statistics, such as the difference between the group medians. In our actual dataset, the difference in medians (average − obese) is 5 minutes:

median(time ~ weight, data=doctors)
## average   obese 
##      30      25

Let’s randomize our data and get 10,000 values of this test statistic:

# Calculate medians for randomized weight groups
randmedians <- do(10000) * median(time ~ shuffle(weight), data=doctors)

# Calculate difference between group medians for each
# of the 10,000 randomizations
randmedians$diffmedians <- randmedians$average - randmedians$obese

# Plot a histogram and highlight differences > 5 minutes
histogram(~ diffmedians, data=randmedians, groups=diffmedians >= 5,
          xlim=c(-12,12), width=1,
          xlab="Median differences assuming null hypothesis is true")
ladd(panel.abline(v=5))

# Estimate the p-value
prop(~ (diffmedians >= 5), data=randmedians)
##   TRUE 
## 0.0277

There’s our p-value of p = 0.0277.


  4. Use randomization-based methods to test the difference in medians in the bfeed dataset.



Randomization-based test to compare 3+ group means


In class, we looked at the smiles dataset. This dataset contains leniency scores for each of 4 different types of smiles:

smiles <- read.csv("http://www.bradthiessen.com/html5/data/smiles2.csv")
str(smiles)
## 'data.frame':    136 obs. of  2 variables:
##  $ expression: Factor w/ 4 levels "felt","feltfalse",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ leniency  : num  7 3 6 4.5 3.5 4 3 3 3.5 4.5 ...
favstats(leniency ~ expression, data=smiles)
##   expression min    Q1 median    Q3 max     mean       sd  n missing
## 1       felt 2.5 3.500   4.75 5.875   9 4.911765 1.680866 34       0
## 2  feltfalse 2.5 3.625   5.50 6.500   9 5.367647 1.827023 34       0
## 3  miserable 2.5 4.000   4.75 5.500   8 4.911765 1.453682 34       0
## 4    neutral 2.0 3.000   4.00 4.875   8 4.117647 1.522850 34       0

One test statistic that could be used to measure the difference among these 4 groups is the sum of absolute deviations or SAD.

SAD = |mean1 - mean2| + |mean1 - mean3| + |mean1 - mean4| + |mean2 - mean3| + |mean2 - mean4| + |mean3 - mean4|

With this dataset, SAD is calculated to be 3.75:

test.stat <- SAD(mean(leniency ~ expression, data = smiles))
test.stat
## [1] 3.75
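To see what SAD() is doing, we can recompute it by hand from the four group means reported by favstats() above:

```r
# Recompute SAD from the four group means
means <- c(felt = 4.911765, feltfalse = 5.367647,
           miserable = 4.911765, neutral = 4.117647)
# combn() generates every pair of means; diff() takes each
# pairwise difference; we sum the absolute values
sum(abs(combn(means, 2, diff)))
# [1] 3.75
```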

We can randomly shuffle the smile types and calculate SAD for each of 10,000 randomizations:

randomSAD <- do(10000) * SAD(mean(leniency ~ shuffle(expression), data = smiles))

From this, we can visualize our randomization-based distribution and estimate a p-value:

# Distribution of SAD values
histogram(~ result, xlim=c(0,6), groups=result >= test.stat, data=randomSAD, width=.2)
ladd(panel.abline(v=test.stat))

# Estimate p-value
prop(~ (result >= test.stat), data=randomSAD)
##  TRUE 
## 0.022


  5. Do college graduates earn more money than non-graduates? How do earnings compare among individuals at different levels of educational attainment? To investigate, you’ll take a look at the earn dataframe. It contains the following variables for 250 individuals:
    Degree: the highest level of education achieved by each individual (Some college, Associate degree, Bachelor degree, Master degree, or Doctorate)
    Earnings: income of each individual (in tens of thousands of dollars)

    a) Calculate the SAD to compare mean earnings for the 5 groups in the dataset

    b) Generate at least 10,000 randomizations and calculate the SAD for each

    c) Plot the distribution of those SAD values.

    d) Estimate the p-value



Code for Bayesian Estimation


I won’t run it in this lab, but here’s the code I would use to estimate the difference in group means via Bayesian estimation:

# Get times for obese patients in a vector called "obese"
obese <- doctors$time[doctors$weight == "obese"]
# Get times for non-obese patients in "average"
average <- doctors$time[doctors$weight == "average"]
# Load the BEST package
library(BEST)
# Run the MCMC chains
BESTout <- BESTmcmc(average, obese)
# Get output
BESTout
# Plot posterior distributions
plot(BESTout)
plot(BESTout, "sd")
plotPostPred(BESTout)
pairs(BESTout)



Generating (publishing) your lab report

When you’ve finished typing all your answers to the exercises, you’re ready to publish your lab report. To do this, look at the top of your source panel (the upper-left pane) and locate the Knit HTML button.

Click that button and RStudio should begin working to create an .html file of your lab report. It may take a minute or two to compile everything, but the report should open automatically when finished.

Once the report opens, quickly look through it to see all your completed exercises. Assuming everything looks good, send that lab7report.html file to me via email or print it out and hand it in to me.


You’ve done it! Congratulations on finishing your 7th (and final) lab!

Feel free to browse around the websites for R and RStudio if you’re interested in learning more, or find more labs for practice at http://openintro.org.

This lab, released under a Creative Commons Attribution-ShareAlike 3.0 Unported license, is a derivative of templates originally developed by Mark Hansen (UCLA Statistics) and adapted by Andrew Bray (Mt. Holyoke) & Mine Çetinkaya-Rundel (Duke).