Loading the Mosaic Package

Let’s load the Mosaic package:

require(mosaic)



Examples from Activity #16

1) Describe the distribution of sample means I would get by repeatedly sampling 2 babies. How would this compare to the distribution I would get by repeatedly sampling 100 babies?

I put the baby weight dataset on my website, so let’s download that data:

babyweights <- read.csv(url("http://bradthiessen.com/html5/stats/m300/babyweights.csv"))
head(babyweights)
##   mother birthorder birthwt age child
## 1     80          1    3175  18     1
## 2     80          2    3572  21     2
## 3     80          3    3317  24     3
## 4     80          4    4281  26     4
## 5     80          5    3827  28     5
## 6     84          1    2892  14     6

You can see the dataset contains:
mother = an ID number for each mother
birthorder = child number for each mother (1 = first born child)
birthwt = weight of baby, in grams
ch = age of mother when baby was born
child = an ID number for each child

Let’s plot the birthweights:

histogram(~birthwt, data = babyweights, col="grey", xlim=c(-1, 6001), xlab = "Baby weights, in grams", width=130)

plot of chunk unnamed-chunk-4

Now let’s get sampling distributions for the mean birthweight using sample sizes of 2, 16, and 100 (to match the activity):

# Create the estimated sampling distributions (10,000 samples) for each sample size
meanweight2<- do(10000) * mean(sample(babyweights$birthwt, 2))
## Loading required package: parallel
meanweight16<- do(10000) * mean(sample(babyweights$birthwt, 16))
meanweight100<- do(10000) * mean(sample(babyweights$birthwt, 100))

# Create histograms to display sampling distributions
histogram(~result, data = meanweight2, col="grey", xlim=c(1000, 5000), xlab = "n=2", width=100)

plot of chunk unnamed-chunk-5

plotDist("norm", params=list(mean=mean(babyweights$birthwt), sd=(sd(babyweights$birthwt)/(2)^.5)), col="red", lwd=1, add=TRUE)

plot of chunk unnamed-chunk-5

histogram(~result, data = meanweight16, col="grey", xlim=c(1000, 5000), xlab = "n=16", width=25)

plot of chunk unnamed-chunk-5

plotDist("norm", params=list(mean=mean(babyweights$birthwt), sd=(sd(babyweights$birthwt)/(16)^.5)), col="red", lwd=1, add=TRUE)

plot of chunk unnamed-chunk-5

histogram(~result, data = meanweight100, col="grey", xlim=c(1000, 5000), xlab = "n=100", width=5)

plot of chunk unnamed-chunk-5

plotDist("norm", params=list(mean=mean(babyweights$birthwt), sd=(sd(babyweights$birthwt)/(100)^.5)), col="red", lwd=1, add=TRUE)

plot of chunk unnamed-chunk-5

# Change axis
histogram(~result, data = meanweight100, col="grey", xlim=c(2900, 3400), xlab = "n=100", width=5)

plot of chunk unnamed-chunk-5

plotDist("norm", params=list(mean=mean(babyweights$birthwt), sd=(sd(babyweights$birthwt)/(100)^.5)), col="red", lwd=1, add=TRUE)

plot of chunk unnamed-chunk-5

# Q-Q plot to investigate normality
qqnorm(meanweight2$result, ylab = "Results"); qqline(meanweight2$result, ylab = "Results")

plot of chunk unnamed-chunk-5

qqnorm(meanweight16$result, ylab = "Results"); qqline(meanweight16$result, ylab = "Results")

plot of chunk unnamed-chunk-5

qqnorm(meanweight100$result, ylab = "Results"); qqline(meanweight100$result, ylab = "Results")

plot of chunk unnamed-chunk-5

We can now estimate the means and standard deviations of these distributions:

paste("(n=2) Mean: ", mean(meanweight2))
## [1] "(n=2) Mean:  3159.53205"
paste("(n=2) SD: ", sd(meanweight2))
## [1] "(n=2) SD:  403.106275087227"
paste("(n=16) Mean: ", mean(meanweight16))
## [1] "(n=16) Mean:  3156.25655"
paste("(n=16) SD: ", sd(meanweight16))
## [1] "(n=16) SD:  142.768550499571"
paste("(n=100) Mean: ", mean(meanweight100))
## [1] "(n=100) Mean:  3156.901646"
paste("(n=100) SD: ", sd(meanweight100))
## [1] "(n=100) SD:  55.8722978675263"


3) Do our results hold for a population that does not have a normal distribution?

This question deals with high school GPAs from St. Ambrose students in 2012, so let’s download that data:

hsgpa <- read.csv(url("http://bradthiessen.com/html5/stats/m300/sauhsgpa.csv"))
histogram(~hsgpa, data = hsgpa, col="grey", xlim=c(-.1, 4.1), xlab = "High school GPAs", width=.08)