Lab #6: Bootstrap and Theory-based Confidence Intervals


Remember to download the report template for this lab and open it in RStudio. You can download the template by clicking this link: http://bradthiessen.com/html5/stats/m300/lab6report.Rmd


Theory-based confidence intervals in R


IQ scores

IQ scores are normally distributed with a mean of 100 and a standard deviation of 16.

plotDist("norm", params=list(mean=100, sd=16), col="steelblue", lw=5)

If we wanted to calculate probabilities under this normal distribution, we could use the xpnorm() command:

To get P(X < 72), we use xpnorm(72, mean = 100, sd = 16)

To calculate P(72 < X < 110), we use xpnorm(c(72,110), mean = 100, sd = 16)

xpnorm(72, mean = 100, sd = 16)        # P(X < 72)
## 
## If X ~ N(100,16), then 
## 
##  P(X <= 72) = P(Z <= -1.75) = 0.0401
##  P(X >  72) = P(Z >  -1.75) = 0.9599

## [1] 0.04005916
xpnorm(c(72,110), mean = 100, sd = 16) # P(72 < X < 110)
## 
## If X ~ N(100,16), then 
## 
##  P(X <= 72) = P(Z <= -1.75) = 0.0401
##      P(X <= 110) = P(Z <= 0.625) = 0.734
##  P(X >  72) = P(Z >  -1.75) = 0.9599
##      P(X >  110) = P(Z >  0.625) = 0.266

## [1] 0.04005916 0.73401447
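The xpnorm() command comes from the mosaic package; it’s an annotated wrapper around base R’s pnorm(). Notice that xpnorm(c(72,110), ...) reports the two cumulative probabilities, so to finish calculating P(72 < X < 110) we subtract them. If you prefer plain base R, the same probabilities look like this:

pnorm(72, mean = 100, sd = 16)                                    # P(X < 72)
pnorm(110, mean = 100, sd = 16) - pnorm(72, mean = 100, sd = 16)  # P(72 < X < 110)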


For now, forget that you know anything about the distribution of IQ scores. Suppose you don’t know the shape, mean, or standard deviation of the distribution of IQ scores.

If you’re interested in this unknown population distribution, you might take a random sample of adults and measure their IQ scores. Let’s take a random sample of n = 16 adults from the “unknown” population. We’ll use set.seed() to ensure we get the same sample each time we run this lab:

n <- 16         # Set size of our sample
set.seed(3141)  # Set random number seed

# Take sample of size n from our population distribution
samp <- rnorm(n, mean = 100, sd = 16)   

We can take a quick look at our sample data by plotting a dotplot:

# cex sets the size of the dots
dotPlot(samp, cex=.25, xlim=c(30,170))

With such a small sample size, this dotplot doesn’t really tell us what shape to expect for the population distribution. Let’s take a look at a Q-Q plot:

qqnorm(samp, ylab = "n = 16"); qqline(samp, ylab = "n = 16")

Based on this Q-Q plot, suppose we’re willing to believe the population distribution is normal (which it is in this case).

Without any explanation, let me show you the syntax to construct a 95% confidence interval in R. This confidence interval is theory-based – it uses the formula we derived in class:

(sample mean) +/- (t * SE) where SE = sd/sqrt(n)

confint(t.test(samp, conf.level = 0.95))
## mean of x     lower     upper     level 
## 104.11760  95.39791 112.83728   0.95000

We told R to get a confidence interval (confint) based on the t-distribution (t.test). Our sample data was called samp and we wanted a 95% interval (conf.level=0.95).

As you can see from the output, our confidence interval is: (95.398, 112.837).

Let’s verify this result with our formula. To use our formula, we need to know the following:

  • sample mean
  • sample standard deviation
  • sample size
  • t-score for a 95% CI with df = n-1
  • standard error

We already have our sample of data stored in samp (and we already told R to store the sample size as n), so let’s get the other components for our formula:

sampleMean <- mean(samp)
sampleSD <- sd(samp)
t <- qt(p = 0.975, df = n-1)
SE <- sampleSD / sqrt(n)

The syntax for calculating the sampleMean, sampleSD, and SE should be easy enough to understand. To get the t-score, I needed to tell R which percentile to cut off and the degrees of freedom.

We learned in class that for a CI for a single mean, degrees of freedom is equal to n-1. That means in this case, we have 15 degrees of freedom.

Since I want a 95% interval, I need to cut off 2.5% in each tail of the distribution. That means I want the t-scores at the 0.025 and 0.975 percentiles. Both p=0.025 and p=0.975 give the same t-score, except one is negative and one is positive. I chose the positive one, as displayed in the following plot:

xqt(0.975, df=n-1)

## [1] 2.13145
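If you want to verify that the 0.025 and 0.975 percentiles give mirror-image t-scores, you can ask R for both:

qt(0.025, df = n - 1)   # the negative cut-off, -2.13145
qt(0.975, df = n - 1)   # the positive cut-off,  2.13145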

Let’s summarize the values of all the components we need in our formula:

  • sample mean = sampleMean = 104.12
  • sample std. dev = sampleSD = 16.3639
  • sample size = n = 16
  • t-score (95% CI, df=n-1) = qt(p = 0.975, df = n-1) = 2.131
  • standard error = sampleSD / sqrt(n) = 4.091

From these components, we can calculate:

(sample mean) +/- (t * SE) where SE = sd/sqrt(n)

Let’s calculate the lower and upper bounds of our CI with this formula:

lower <- sampleMean - (t * SE)
upper <- sampleMean + (t * SE)

# This will format the output in a single line
CI <- paste("95% CI: (", round(lower,5), ",", round(upper,5), ")")
CI
## [1] "95% CI: ( 95.39791 , 112.83728 )"

This confidence interval matches the output from the confint(t.test(samp, conf.level = 0.95)) command.


Notice that the true population mean of 100 (that we pretended we didn’t know) does fall inside our confidence interval. If we were to repeat this process many times (constructing confidence intervals for random samples of size n=16), we’d find that approximately 95% of those intervals would capture our population mean.

To see this, we can conduct a simulation. Let’s simulate the construction of one hundred 95% confidence intervals from samples of size n=16:

set.seed(3140)   # Set the random number seed
CIsim(n=16, samples=100, rdist=rnorm, args = list(mean=100, sd=16), estimand=100)
## Did the interval cover?
##   No  Yes 
## 0.05 0.95

Don’t worry about the syntax for that command. Instead, take a look at that plot. The horizontal line across the middle of the plot represents the population mean of 100. Each vertical line represents a confidence interval calculated from a random sample of 16 observations.

As you can see, 5 of the 100 intervals constructed do not capture the population mean. That means 95% of our intervals do capture the population mean.


  1. The 1974 Motor Trend Car Road Test dataset has been loaded into R. The variable mtcars$mpg contains fuel efficiency data (measured in miles per gallon) for a sample of 32 cars. Use the confint(t.test()) command to construct a 95% confidence interval for the population mean fuel efficiency for cars in 1974.

  2. This time, use the confint(t.test()) command to construct a 90% confidence interval. Explain why this interval is wider or more narrow than the 95% confidence interval.


Let’s see what would happen if we started with a population that does not follow a normal distribution. Suppose the population follows an exponential distribution that looks like this:

plotDist("exp", col="steelblue", lw=5)

Let’s simulate one hundred 95% confidence intervals for the mean and see how many capture the population mean:

set.seed(3141)
CIsim(n=16, samples=100, rdist=rexp, estimand=1)
## Did the interval cover?
##   No  Yes 
## 0.11 0.89

  1. When our population follows an exponential distribution, only 89% of our 100 simulated confidence intervals captured the population mean. Explain why.

  2. Suppose we change our sample size to n=1000. If we repeatedly calculate 95% confidence intervals from a population that follows an exponential distribution, do you expect approximately 95% of them to capture the population mean? Explain. Then, run this simulation with the code that’s provided.



Bootstrap confidence intervals in R


IQ scores

This time, let’s use bootstrap methods to construct a 95% confidence interval for IQ scores. Remember, we’re pretending we don’t know IQ scores are normally distributed with a mean of 100 and a standard deviation of 16.

Instead, we only know about our sample of 16 observations that had sampleMean = 104.12 and sampleSD = 16.3639.

To take a bootstrap sample from our data, we select 16 observations with replacement from our sample. That means some observations may be chosen multiple times, while others aren’t chosen at all. We’ll use the resample() command to generate a bootstrap sample.
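Under the hood, resample() (from the mosaic package) is simply sampling with replacement. A base-R sketch of the same idea:

# Equivalent to resample(samp): draw 16 values from samp, with replacement
sample(samp, size = length(samp), replace = TRUE)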

To verify this command works, let’s look at all 16 observations in our sample. We’ll sort them from lowest to highest:

sort(samp)
##  [1]  72.25090  83.59202  86.72965  91.13124  92.89035  96.76862 100.34194
##  [8] 103.75708 106.23904 111.04474 111.75398 112.04855 116.36972 124.37817
## [15] 125.92229 130.66328

You can see all 16 IQ scores, from the lowest (72.25090) to the highest (130.66328).

Now, let’s take a single bootstrap sample, sort its values from lowest to highest, and display the results:

set.seed(3141)
sort(resample(samp))
##  [1]  83.59202  83.59202  86.72965  91.13124  92.89035  92.89035  92.89035
##  [8]  92.89035 103.75708 103.75708 111.04474 111.04474 116.36972 124.37817
## [15] 125.92229 125.92229

From that output, you can see the IQ score of 72.25090 was not chosen in this bootstrap sample. The next value (83.59202) was chosen twice. Looking across this bootstrap sample, you can see the IQ score 92.89035 was chosen 4 times!

We could easily calculate the mean of this bootstrap sample with the command: mean(resample(samp)).
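Each time we run that command, it draws a fresh bootstrap sample, so the mean changes from run to run (no seed is set here, so your values will differ):

mean(resample(samp))   # one bootstrap mean
mean(resample(samp))   # another bootstrap sample, another mean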

To construct a confidence interval, we’ll want to calculate the means of many bootstrap samples. Let’s generate 10,000 bootstrap samples and calculate the mean of each one. We’ll store those means in a dataframe named bootsamples:

set.seed(3141)
bootsamples <- do(10000) * mean(resample(samp))
head(bootsamples)
##     result
## 1 103.1490
## 2 107.9057
## 3 106.3548
## 4 104.3306
## 5 102.5325
## 6 100.8986

The output shows the means of the first 6 bootstrap samples (ranging from 100.8986 to 107.9057). To visualize all 10,000 means of our bootstrap samples, we can generate a histogram (or, in this case, a density plot):

densityplot(~result, data=bootsamples, plot.points = FALSE, col="steelblue", lwd=4)

To get a 95% confidence interval, we simply need to find the 2.5th and 97.5th percentiles of this distribution. In other words, we need to find the bootstrap sample means that cut off 2.5% in each tail. To do this, we can use the confint() command:

confint(bootsamples, level = 0.95, method = "quantile")
##     name    lower    upper level   method
## 1 result 96.28272 111.8013  0.95 quantile
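The quantile method simply reads off percentiles of the bootstrap distribution, so we could also pull (essentially) the same bounds directly:

# Percentile bounds taken straight from the 10,000 bootstrap means
quantile(bootsamples$result, probs = c(0.025, 0.975))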

Our bootstrap methods yielded a 95% confidence interval of:

(96.283, 111.801)

The theory-based methods we used earlier resulted in this 95% CI:

(95.398, 112.837)

They’re similar to one another because our sample came from a population with a normal distribution.

  1. Use bootstrap methods to construct a 95% confidence interval for the fuel efficiency of 1974 cars (the mtcars$mpg variable). Then, calculate the center and width of the theory-based and bootstrap confidence intervals.

  2. Recall from the previous lab that the average income of employed Iowa adults in 2013 followed a positively-skewed distribution with a mean of $36,052. I’ve loaded a random sample of 30 observations and stored it in a vector named incomes. Using this sample data, construct a 95% confidence interval for the population mean using both theory-based and bootstrap methods. Briefly explain which method (theory-based or bootstrap) you think is more appropriate in this scenario.


Bootstrap CI for other population parameters

One advantage of the bootstrap method is that we can extend it to get confidence intervals for other parameters. For example, to construct a 95% confidence interval for the population median IQ, we could use:

bootmedians <- do(10000) * median(resample(samp))
densityplot(~result, data=bootmedians, plot.points = FALSE, col="steelblue", lwd=4)

confint(bootmedians, level = 0.95, method = "quantile")
##     name    lower    upper level   method
## 1 result 92.89035 114.0619  0.95 quantile

This interval does, in fact, capture our population median IQ of 100.

Let’s try to construct a confidence interval for the population standard deviation (which we know is equal to 16):

bootsd <- do(10000) * sd(resample(samp))
densityplot(~result, data=bootsd, plot.points = FALSE, col="steelblue", lwd=4)

confint(bootsd, level = 0.95, method = "quantile")
##     name    lower    upper level   method
## 1 result 11.05526 20.08566  0.95 quantile
  1. Using the sample data in the incomes vector, construct 95% confidence intervals for the population median and standard deviation.



Scenario #3: Sleep time


Because it makes an important point about interpreting confidence intervals, I wanted to replicate the simple data collection example we did in class. Let’s construct a 90% confidence interval for the average number of hours slept by St. Ambrose students last night.

Here are the hours slept by a random sample of 20 (fictitious) St. Ambrose students:

sleep <-c(6, 6.5, 6.5, 7, 7, 4.5, 11, 8, 8, 6, 5, 9, 8, 7, 7, 8, 6, 7, 8, 10)
dotPlot(sleep, cex=.4, xlim=c(4,12))

qqnorm(samp, ylab = "n = 20"); qqline(samp, ylab = "n = 20")

From these 20 observations, we could calculate:

  • mean(sleep) = 7.28
  • sd(sleep) = 1.56

We could get a theory-based confidence interval with the confint(t.test()) command:

confint(t.test(sleep, conf.level = 0.90))
## mean of x     lower     upper     level 
##  7.275000  6.671838  7.878162  0.900000

We could also use bootstrap methods to get the CI:

bootsleep <- do(10000) * mean(resample(sleep))
confint(bootsleep, level = 0.90, method = "quantile")
##     name lower upper level   method
## 1 result 6.725 7.875   0.9 quantile

Rounded to a single decimal place, both methods yielded the same confidence interval:

90% CI for mean number of hours slept last night: (6.7, 7.9)

To get at a common misconception students have about confidence intervals, let me ask a question:

What proportion of our 20 (fictitious) students slept between 6.7 and 7.9 hours?

Because 6.7 and 7.9 are the bounds of our 90% confidence interval, many students think 90% of our data is between those two values. Let’s check to see if this is correct. Let’s find the proportion of our 20 observations that are between 6.7 and 7.9:

prop(~(sleep >= 6.7 & sleep <= 7.9))
## TRUE 
## 0.25

Only 25% of our data fell within our confidence interval! Why is this the case? Remember, confidence intervals aren’t statements about individual observations; rather, they’re designed to capture population parameters (in this case, the population mean).
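If we wanted an interval that covers most of the individual observations, that would be a different calculation entirely. For example, the middle 90% of our observed sleep times comes from the quantiles of the data, not from a confidence interval:

# Interval containing the middle 90% of the 20 observed sleep times
quantile(sleep, probs = c(0.05, 0.95))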

  1. Find the proportion of incomes (out of the 30 in your dataset) that fall within your confidence interval.



Confidence Interval for a Proportion


In class, we derived a formula for constructing confidence intervals for proportions. We then learned our formula didn’t really yield good confidence intervals, so we briefly mentioned alternatives like the Clopper-Pearson and Wilson +4 methods. Let’s see how to construct these CIs in R.

First, we need a dataset…


In 2012, WIN-Gallup International published results from a survey that asked, among other things, the question: “Irrespective of whether you attend a place of worship or not, would you say you are a religious person, not a religious person or a convinced atheist?”

We can load the data from this survey with the command:

load(url("http://www.openintro.org/stat/data/atheism.RData"))

The atheism dataframe includes responses from a total of 88,032 individuals across 57 countries in the years 2005 and 2012. Let’s take a look at a random sample of this data:

sample(atheism, 10, orig.ids=FALSE)
##              nationality    response year
## 5374             Belgium non-atheist 2012
## 39494               Peru non-atheist 2012
## 68075      United States non-atheist 2005
## 18686            Germany non-atheist 2012
## 79044           Pakistan non-atheist 2005
## 65057 Russian Federation non-atheist 2005
## 8806            Bulgaria non-atheist 2012
## 62124              Spain non-atheist 2005
## 50163      United States non-atheist 2012
## 67141            Moldova non-atheist 2005

This dataset is big, so let’s look at a subset of the data. Let’s look at the results from the U.S. in the year 2012. To get this subset of data, let’s use the subset() command:

us12 <- subset(atheism, nationality == "United States" & year == "2012")

A table will summarize the responses to the question about religious beliefs:

table(us12$response)
## 
##     atheist non-atheist 
##          50         952
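We can turn that table into a sample proportion with the prop() command we used earlier (or simply divide by hand):

prop(~(response == "atheist"), data = us12)   # proportion identifying as atheist
50 / (50 + 952)                               # same calculation by hand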

Out of 1002 total respondents, 50 (or 4.99%) identified themselves as atheists. Since this is a random sample of American adults, let’s construct a 95% confidence interval for the proportion of all American adults who identify as atheists.

With the mosaic package that we’ve been using in our labs, the command to construct a confidence interval for a proportion is confint(prop.test(x, n, conf.level)).

In this scenario, x = 50 = number of atheists and n = 1002 = total number of respondents.

confint(prop.test(50, 1002, conf.level=0.95))
##          p      lower      upper      level 
## 0.04990020 0.03761982 0.06574456 0.95000000

This confidence interval includes a continuity correction. If we want the interval without that correction, we can tell R:

confint(prop.test(50, 1002, conf.level=0.95, correct=FALSE))
##          p      lower      upper      level 
## 0.04990020 0.03805375 0.06518465 0.95000000

With such a large sample size, the correction didn’t matter too much.
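If you’d like to see the “+4” adjustment mentioned at the start of this section worked out by hand, here’s a minimal sketch. The idea of the standard plus-four calculation is to pretend we observed two extra successes and two extra failures, then apply the usual normal-theory formula:

# Plus-four 95% interval for the proportion of atheists, computed by hand
x <- 50                  # number of atheists in the sample
n.resp <- 1002           # total number of respondents
p.plus4 <- (x + 2) / (n.resp + 4)                         # adjusted sample proportion
se.plus4 <- sqrt(p.plus4 * (1 - p.plus4) / (n.resp + 4))  # adjusted standard error
p.plus4 + c(-1, 1) * qnorm(0.975) * se.plus4              # lower and upper bounds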

If we want a Clopper-Pearson interval, we can tell R to run a binomial test (binom.test) on our data. We also need to tell R that we consider atheist to be a “win.”

binom.test(us12$response == "atheist")$conf.int
## [1] 0.03726042 0.06526103
## attr(,"conf.level")
## [1] 0.95
## attr(,"method")
## [1] "Score"
  1. Pick another country and year (2005 or 2012) and construct a 95% confidence interval for the proportion of atheists in the population.



Generating (publishing) your lab report

When you’ve finished typing all your answers to the exercises, you’re ready to publish your lab report. To do this, look at the top of your source panel (the upper-left pane) and locate the Knit HTML button.

Click that button and RStudio should begin working to create an .html file of your lab report. It may take a minute or two to compile everything, but the report should open automatically when finished.

Once the report opens, quickly look through it to see all your completed exercises. Assuming everything looks good, send that lab6report.html file to me via email or print it out and hand it in to me.


You’ve done it! Congratulations on finishing your 6th lab!

Feel free to browse around the websites for R and RStudio if you’re interested in learning more, or find more labs for practice at http://openintro.org.

This lab, released under a Creative Commons Attribution-ShareAlike 3.0 Unported license, is a derivative of templates originally developed by Mark Hansen (UCLA Statistics) and adapted by Andrew Bray (Mt. Holyoke) & Mine Çetinkaya-Rundel (Duke).