#### Loading the Mosaic Package

Let’s load the Mosaic package:

require(mosaic)

#### Examples from Activity #18

##### 1) Let’s assume ACT scores follow a normal distribution μ = 21 and σ = 6. If we select someone at random who has taken the ACT, what’s the probability of choosing someone who scored between 9.24 and 32.76?
# Calculate the probability
xpnorm(32.76, mean = 21, sd = 6, verbose=FALSE, plot=FALSE) - xpnorm(9.24, mean = 21, sd = 6, verbose=FALSE, plot=FALSE)
## [1] 0.95
## Graph the shaded normal distribution
# First, plot the distribution (plotdist didn't work, but I don't know why)
curve(dnorm(x,21,6), xlim=c(5, 37), col="steelblue", lwd=3)
# Then set coordinates for a shaded polygon
cord.x <- c(9.24,seq(9.24,32.76,0.01),32.76)
cord.y <- c(0,dnorm(seq(9.24, 32.76, 0.01), mean=21, sd=6),0)
# Graph the shaded region
polygon(cord.x,cord.y,col='skyblue')

From this, we see the probability is almost exactly 0.95.

##### 2) Suppose I randomly select someone and record that person’s ACT score (x). I then create an interval around that score by adding and subtracting 11.76 points. What’s the probability that the interval I create captures μ?

Let’s try this. Let’s randomly sample one person (and repeat this process 10,000 times). We’ll add and subtract 11.76 and see how many capture the population mean of 21.

## Sample 1 score from this normal distribution 10,000 times
# Create ACT score distribution
ACTscores<-rnorm(n=100000, mean=21, sd=6)
# Sample one score 10,000 times
sampleACT<- do(10000) * sample(ACTscores, 1)
## Loading required package: parallel
# Subtract and add 11.76 from each observation sampled
sampleACT$lower <- sampleACT$result-11.76
sampleACT$upper <- sampleACT$result+11.76
# Determine if the interval captures mu
sampleACT$captured <- sampleACT$lower<21 & sampleACT$upper>21 Let’s take a look at what we’ve generated: head(sampleACT) ## result lower upper captured ## 1 13.81 2.053 25.57 TRUE ## 2 26.09 14.333 37.85 TRUE ## 3 15.04 3.280 26.80 TRUE ## 4 23.60 11.845 35.36 TRUE ## 5 15.88 4.120 27.64 TRUE ## 6 16.94 5.183 28.70 TRUE The first column shows the ACT scores we sampled (obviously, the scores aren’t real ACT scores). The next two columns show the scores +/- 11.76. The last column shows TRUE if 21 (the population mean) is between the lower and upper bounds. From this, we can calculate the proportion of ACT score intervals (score +/- 11.76) that capture mu (21): prop(~captured, data=sampleACT) ## TRUE ## 0.9539 We replicated our answer of 95%. ##### 5) I am going to take a sample of n=16 individuals and calculate a sample average. From this, I will create an interval by adding and subtracting 2.94 points (1.96 standard errors) from that sample average. Before I actually calculate this interval, what’s the probability that this interval will capture μ? First, let’s calculate the (theoretical) probability: ## Graph the shaded normal distribution # First, plot the distribution (plotdist didn't work, but I don't know why) curve(dnorm(x,21,1.5), xlim=c(15, 26), col="steelblue", lwd=3) # Then set coordinates for a shaded polygon cord.x <- c((21-2.94),seq((21-2.94),(21+2.94),0.01),(21+2.94)) cord.y <- c(0,dnorm(seq((21-2.94), (21+2.94), 0.01), mean=21, sd=1.5),0) # Graph the shaded region polygon(cord.x,cord.y,col='skyblue') # Calculate the probability xpnorm((21+2.94), mean = 21, sd = 1.5, verbose=FALSE, plot=FALSE) - xpnorm((21-2.94), mean = 21, sd = 1.5, verbose=FALSE, plot=FALSE) ## [1] 0.95 Once again, we calculate 95%. In any normal curve, 95% of the observations are within 1.96 standard deviations of the mean. Now, let’s get simulated results. First, let’s sample 16 ACT scores 10,000 times and calculate an average for each sample. ## Sample 16 score from this normal distribution 10,000 times # Sample 16 scores, calculate an average, and repeat 10,000 times sampleACT16<- do(10000) * mean(sample(ACTscores, 16)) # Subtract and add 2.94 from each observation sampled sampleACT16$lower <- sampleACT16$result-2.94 sampleACT16$upper <- sampleACT16$result+2.94 # Determine if the interval captures mu sampleACT16$captured <- sampleACT16$lower<21 & sampleACT16$upper>21

Once again, the last column will tell us whether each sample captured the true population mean or not:

head(sampleACT16)
##   result lower upper captured
## 1  22.51 19.57 25.45     TRUE
## 2  18.83 15.89 21.77     TRUE
## 3  22.52 19.58 25.46     TRUE
## 4  22.19 19.25 25.13     TRUE
## 5  22.20 19.26 25.14     TRUE
## 6  22.22 19.28 25.16     TRUE

Now we can see what proportion lie within 21 +/- 2.94

Let’s take a look at what we’ve generated:

prop(~captured, data=sampleACT16)
##   TRUE
## 0.9497

Yep, it’s 95%.

We know 95% of observations fall within +/-1.96 standard deviations of the mean, so there’s a 95% chance this score comes within $29,400 (= 1.96 * 15000) of the mean. Therefore, if we add and subtract 29,400 from this observation, we’ll have a 95% chance of capturing mu. So the confidence interval, assuming normality, would be: 40000 +/- 29400 = ($10,600 , $69,400). We can use R to speed up these calculations. (Note: There’s a much easier way to do this when we do NOT know the population standard deviation). # Enter values for the observed sample mean, population std. dev., sample size, and confidence level observedMean = 40000 sigma=15000 n=1 confidence = 0.95 # This calculates the standard error alpha = (1-confidence)/2 stdError <- sigma/sqrt(n) # This calculates the confidence interval limits: lower<- observedMean - qnorm(1-alpha)*(stdError) upper<- observedMean + qnorm(1-alpha)*(stdError) paste("CI: (", lower, " , ", upper, ")") ## [1] "CI: ( 10600.5402318992 , 69399.4597681008 )" So our 95% confidence interval (based on 1 observation) is approximately ($10,600, $69400). That matches what we know it should be (assuming student debt load is normally distributed). Let’s calculate a confidence interval for question B: ##### Suppose we sample 100 students who graduates with an average debt of$25,000. Create a 95% confidence interval for μ.

Let’s use the code we used in the previous example:

# Enter values for the observed sample mean, population std. dev., sample size, and confidence level
observedMean = 25000
sigma=15000
n=100
confidence = 0.95
# This calculates the standard error
alpha = (1-confidence)/2
stdError <- sigma/sqrt(n)
# This calculates the confidence interval limits:
lower<- observedMean - qnorm(1-alpha)*(stdError)
upper<- observedMean + qnorm(1-alpha)*(stdError)
paste("CI: (", lower, " , ", upper, ")")
## [1] "CI: ( 22060.0540231899  ,  27939.9459768101 )"

Our CI is now ($22060,$27940).

Finally, let’s answer question C:

##### Suppose we sample 10,000 students who graduates with an average debt of $21,000. Create a 95% confidence interval for μ. # Enter values for the observed sample mean, population std. dev., sample size, and confidence level observedMean = 21000 sigma=15000 n=10000 confidence = 0.95 # This calculates the standard error alpha = (1-confidence)/2 stdError <- sigma/sqrt(n) # This calculates the confidence interval limits: lower<- observedMean - qnorm(1-alpha)*(stdError) upper<- observedMean + qnorm(1-alpha)*(stdError) paste("CI: (", lower, " , ", upper, ")") ## [1] "CI: ( 20706.005402319 , 21293.994597681 )" With such a large sample size, our confidence interval is fairly narrow: ($20,700 , \$21,294). In fact, the width of that interval is:

upper-lower
## [1] 588

If we wanted to be more confident, we’d have to expand that interval. As an example, here’s a 99% confidence interval for mu:

# Enter values for the observed sample mean, population std. dev., sample size, and confidence level
observedMean = 21000
sigma=15000
n=10000
confidence = 0.99
# This calculates the standard error
alpha = (1-confidence)/2
stdError <- sigma/sqrt(n)
# This calculates the confidence interval limits:
lower<- observedMean - qnorm(1-alpha)*(stdError)
upper<- observedMean + qnorm(1-alpha)*(stdError)
paste("CI: (", lower, " , ", upper, ")")
## [1] "CI: ( 20613.6256044677  ,  21386.3743955323 )"

The width of this interval is:

upper-lower
## [1] 772.7

##### 16) Construct a 95% confidence interval for the average number of hours slept by St. Ambrose students last night.

So far, we’ve assumed that we knew the population standard deviation in all our examples. As I explained in class, that’s not a very reasonable assumption in many cases. If we don’t know the population standard deviation, we can substitute the sample standard deviation if we use a t-distribution to construct our confidence intervals.

Let’s look at 3 examples of t-distributions.

plotDist("t", params=list(df=2000), col="lightblue", lwd=4)
plotDist("t", params=list(df=10), col="blue", lwd=4, add=TRUE)
plotDist("t", params=list(df=2), col="darkblue", lwd=4, add=TRUE)
plotDist("norm", col="red", lwd=1, add=TRUE)

The dark blue plot is a t-distribution with 2 degrees of freedom. The medium blue plot has 10 degrees of freedom. The light blue plot (which you can’t really see) has 2000 degrees of freedom and is almost exactly the same as the red plot (which is the standard normal distribution).

As you can see, as degrees of freedom increase, the t-distribution better approximates a normal distribution.

It’s May 30, 2014, and I don’t have any students right now, so I can’t get a real dataset of sleep times. I’ll make up some data for 20 hypothetical students:

sleep <-c(6, 6.5, 6.5, 7, 7, 4.5, 11, 8, 8, 6, 5, 9, 8, 7, 7, 8, 6, 7, 8, 10)
favstats(sleep)
##  min    Q1 median Q3 max  mean   sd  n missing
##  4.5 6.375      7  8  11 7.275 1.56 20       0

As you can see from our sample, * The sample size is 20 * The sample mean is 7.275 * The sample standard deviation is 1.56

Let’s let R construct a 90% confidence interval:

confint(t.test(~sleep, conf.level=0.90), data=sleep)
## mean of x     lower     upper     level
##     7.275     6.672     7.878     0.900

The interval is (6.6718, 7.8782). One question statistics teachers love to ask is: Does this mean 90% of students in this class slept between 6.6718 and 7.8782 hours?

To answer that question, let’s have the computer tell us what proportion of our 20 students slept between 6.6718 and 7.8782 hours:

prop(~(sleep >= 6.6718 & sleep <= 7.8782))
## TRUE
## 0.25

Only 25% of our observations fell in our confidence interval. That’s because our confidence interval deals with the population mean hours slept; not our individual sample data.

To get a 95% confidence interval, we’d use:

confint(t.test(~sleep, conf.level=0.95), data=sleep)
## mean of x     lower     upper     level
##     7.275     6.545     8.005     0.950

##### 17) We know the population average IQ score is 100. Suppose you find n=25 individuals who have reported seeing a UFO. You administer an IQ test to these individuals and find their average IQ is 101.5 with a standard deviation of 10. Construct a 90% confidence interval for the average IQ of all UFO observers.

Let’s quickly construct the 90% confidence interval. I’ll make up a dataset that has a mean of 101.5 and a standard deviation of 10:

# Simulate the dataset
UFO <- c(85.84339, 84.1224, 84.93813, 87.95693, 87.97534, 93.77714, 96.53963, 97.49079, 97.83092, 99.51939, 99.93461, 99.99142, 100.7256, 102.1713, 105.6651, 105.9388, 107.4627, 108.4752, 109.5886, 109.6127, 113.0563, 113.7874, 113.7894, 115.4635, 115.8434)
favstats(UFO)
##    min    Q1 median    Q3   max  mean sd  n missing
##  84.12 96.54  100.7 109.6 115.8 101.5 10 25       0
histogram(~UFO, width=4)

So you can see the dataset does have the appropriate mean and standard deviation. Now let’s construct a 90% confidence interval:

confint(t.test(~UFO, conf.level=0.90), data=UFO)
## mean of x     lower     upper     level
##    101.50     98.08    104.92      0.90

Our interval is (98.078, 104.922). Since the average population IQ is 100, we have no evidence that UFO observers have a higher IQ than average.

#### Bootstrap confidence intervals

So far, our confidence intervals have been constructed with some assumptions. We’ve either assumed the population distribution is approximately normal or our sample size is large enough for the CLT to kick-in.

What do we do if we’re unsure of those assumptions?

Another way to construct a confidence interval is by using bootstrap methods. Before we begin, let’s look at a dataset from Assignment 18 (question #5):

crashes <- read.csv(url("http://bradthiessen.com/html5/stats/m300/failures.csv"))

This dataset contains 136 times between successive computer crashes. Let’s take a look at the data:

histogram(~v1, data=crashes, width=250)

favstats(~v1, data=crashes)
##  min   Q1 median    Q3  max  mean   sd   n missing
##    0 80.5    272 831.8 6150 649.1 1036 136       0

As you can see, the data are heavily skewed to the right (the mean is much greater than the median). Suppose we wanted to construct a 95% confidence interval for the average time between crashes.

We could assume the sample is large enough for the CLT to kick-in, but I’m worried about how much that data is skewed.

For now, let’s go ahead and make that assumption.

confint(t.test(~crashes, conf.level=0.90), data=crashes)
## mean of x     lower     upper     level
##     649.1     502.0     796.3       0.9

Our confidence interval is (502, 796) and has a width of 294.

Rather than making the normality assumption, we could construct a bootstrap confidence interval. To do so, we do the following:

• We treat our sample of 136 observations as if it they are the entire population of interest. We randomly sample n=136 observations WITH REPLACEMENT from our population of size n=136. Since we’re sampling with replacement, the same observation could be chosen more than once in our bootstrap sample.

• Let’s repeat this process 10,000 times, so we’ll end up with 10,000 random bootstrap samples of size n=136

• For each bootstrap sample, let’s calculate the sample mean

• Now that we have 10,000 means, let’s find the means that represent the 5th and 95th percentiles. These percentiles would represent our 90% confidence interval bounds.

Let’s walk through this process by first taking a look at sampling with replacement. Suppose we have the numbers 1, 2, 3, 4, 5. Let’s sample 5 of these numbers with replacement:

simple = c(1, 2, 3, 4, 5)
resample(simple)
## [1] 2 3 4 5 4

As you can (probably) see, some numbers were chosen multiple times. This is because we replaced the numbers after we chose each one.

So let’s take 10,000 bootstrap samples from our dataset. For each sample, we’ll calculate the mean:

meanCrashes <- do(10000) * mean(~v1, data = resample(crashes))

Let’s see what that distribution looks like:

densityplot(~result, data=meanCrashes, plot.points = FALSE, col="darkblue", lwd=4)
plotDist("norm", mean(meanCrashes), sd(meanCrashes), col="red", add=TRUE)
plotDist("t", mean(meanCrashes), sd(meanCrashes), col="red", add=TRUE)

That dark blue distribution is the distribution of our bootstrap means. The red distribution is a normal curve. They’re close enough that we probably could have made our normality assumption.

Let’s get the confidence interval from our bootstrap samples:

confint(meanCrashes, level = 0.9, method = "quantile")
##     name lower upper level   method
## 1 result 512.3 799.1   0.9 quantile

Bootstrap Confidence Interval (511, 796) with a width of 285 and centered at 653.5

** Confidence Interval with Normality Assumption** (502, 796) with a width of 294 and centered at 649

#### Bayesian Estimation

One last estimate of a 90% confidence interval for our crashes dataset. I won’t explain this method here (I will in either STAT 301 or STAT 305) or even show the output.

# Load the Bayesian package
require(BEST)
# Extract the data to a vector
crashvector <- crashes[,1]
# Run an analysis - it took my laptop a little more than 2 minutes
BESTout1 <- BESTmcmc(crashvector)
# Get the output
BESTout1
plot(BESTout1)