Lab #5: Sampling Distributions


Remember to download the report template for this lab and open it in RStudio. You can download the template by clicking this link: http://bradthiessen.com/html5/stats/m300/lab5report.Rmd


Sampling Distributions


Salaries

In 2005, the U.S. Census Bureau introduced the American Community Survey (ACS). The ACS, sent to approximately 250,000 households monthly (or 3 million per year), gathers information about a variety of factors including demographic, economic, housing, and social data.

In this lab, we’re going to work with the responses from 26,134 Iowans to the 2013 ACS. Let’s download the dataset (from my website) and look at the header:

acs <- read.csv("http://bradthiessen.com/html5/stats/m300/acs13.csv")
head(acs)
##     Status WageSalary TotalIncome
## 1 Employed          0      100000
## 2 Employed          0      110000
## 3 Employed          0       80000
## 4 Employed          0       56760
## 5 Employed          0       30000
## 6 Employed          0       45500

The full dataset has many more variables, but I limited our investigation to three:

  • Status = employment status of each respondent
  • WageSalary = Salary and wages for each respondent
  • TotalIncome = Total income for each respondent

Let’s look at the structure of our data with the str() command:

str(acs)
## 'data.frame':    26134 obs. of  3 variables:
##  $ Status     : Factor w/ 6 levels "Child","Employed",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ WageSalary : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ TotalIncome: int  100000 110000 80000 56760 30000 45500 75000 48000 12000 7000 ...

The str() command shows our dataset has 26,134 observations and 3 variables. The WageSalary and TotalIncome variables are stored as integers (as we should expect). The Status variable is a factor variable with 6 different levels. We know the first two levels are “child” and “employed,” but what are the other levels?

To find out, we can either look at that variable’s attributes or construct a simple table:

# The $ symbol tells R to use the Status variable in the acs dataset
# dataset$variable
attributes(acs$Status)  
## $levels
## [1] "Child"                    "Employed"                
## [3] "Employed but not at work" "Military at work"        
## [5] "Not in labor force"       "Unemployed"              
## 
## $class
## [1] "factor"
table(acs$Status)   
## 
##                    Child                 Employed Employed but not at work 
##                      406                    15286                      305 
##         Military at work       Not in labor force               Unemployed 
##                       17                     9418                      702

You can see survey respondents are classified into one of 6 Status categories:

  • Child (under the age of 16)
  • Employed
  • Employed but not at work
  • Military at work
  • Not in labor force
  • Unemployed

We’re going to look at the distribution of incomes, so let’s only look at individuals who are employed. How can we do this? How can we pull out the responses from only the employed respondents?

There are a few different ways to do this:

Option #1: subset()

The subset() command, built into R, allows you to specify which data you’d like to keep. In this scenario, we want to keep all the data (in the acs dataset) for respondents with Status == "Employed".

We’ll store our subset of data as acsEmp with the command:

acsEmp <- subset(acs, Status == "Employed")

Let’s construct a quick table of our new dataset to make sure it only has employed respondents:

table(acsEmp$Status)
## 
##                    Child                 Employed Employed but not at work 
##                        0                    15286                        0 
##         Military at work       Not in labor force               Unemployed 
##                        0                        0                        0

Yep, it looks like it worked.

  1. A dataset named ames has been loaded for you. It’s described in the lab report template. Create a table of the Street variable to show how many houses are on paved roads and how many are on gravel roads. Then, use the subset() command to select only the houses on gravel roads. Construct another table to verify you’ve selected the correct subset of data.


Option #2: []

A faster (though more cryptic) way to subset your data is to use square brackets []. The general form of this syntax is:

dataframe[ (rows to keep), (columns to keep) ]

In this scenario, we want to keep all the columns (all 3 variables). To do this, we can just leave the (columns to keep) blank. Our syntax is now:

dataframe[ (rows to keep), ]

The rows (observations) we want to keep are those in which Status == "Employed".

We can feed this condition into the general form of our syntax:

dataframe[ dataframe$variable == "value", ]

to obtain the command

acsEmp2 <- acs[acs$Status == "Employed", ]

Notice I stored the results in a dataframe called acsEmp2. We can get a table of the status variable within this dataframe to verify we correctly subsetted our data:

acsEmp2 <- acs[acs$Status == "Employed", ]
table(acsEmp2$Status)
## 
##                    Child                 Employed Employed but not at work 
##                        0                    15286                        0 
##         Military at work       Not in labor force               Unemployed 
##                        0                        0                        0

Once again, our subset of data contains 15,286 employed respondents.
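
Both approaches give the same result. As a quick sanity check, here’s a minimal sketch on a small made-up data frame (the toy names below are invented for illustration) showing that subset() and square brackets keep the same rows:

```r
# A tiny made-up data frame (invented for illustration)
toy <- data.frame(status = c("Employed", "Child", "Employed", "Unemployed"),
                  wage   = c(41000, 0, 28500, 5200))

# Option 1: subset()
empA <- subset(toy, status == "Employed")

# Option 2: square brackets [rows to keep, columns to keep]
empB <- toy[toy$status == "Employed", ]

# Both keep the same two rows
identical(empA$wage, empB$wage)
## [1] TRUE
```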

  1. Use square brackets [] to select only houses on gravel roads.


Option #3: dplyr

Another popular way to select a subset of data is to use the dplyr package developed by Hadley Wickham.

The dplyr package is designed to help you manipulate data quickly and efficiently. To this end, dplyr allows you to use 5 verbs on your data:

  • select(): focus on a subset of variables
  • filter(): focus on a subset of rows
  • mutate(): add new columns
  • summarize(): reduce each group to a smaller number of summary statistics
  • arrange(): re-order the rows

It also lets you use pipes %>% to string together verbs to perform more complicated manipulations.
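
To see what piping buys you before installing anything, here’s a minimal sketch using base R’s native |> pipe (available in R 4.1+), which behaves like %>% for simple calls: each pipe passes the result on the left into the function on the right.

```r
# Nested calls read inside-out...
round(mean(c(2, 4, 6, 8, 9)), 1)
## [1] 5.8

# ...while a pipe reads left-to-right, one step at a time
c(2, 4, 6, 8, 9) |> mean() |> round(1)
## [1] 5.8
```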

Let’s see how these verbs can be used on our simple dataset. First, you’ll need to install the dplyr package with the command install.packages("dplyr"). I’ve already installed this package, so I’m going to load it now:

library(dplyr)

Our acs dataset has two variables related to income: WageSalary and TotalIncome. Suppose we only want to analyze the WageSalary variable. We can use the select() verb to focus on that variable (and our Status variable). I’ll store the new dataframe as acsSalary:

acsSalary <- acs %>%
             select(Status, WageSalary)

If we look at the header of this new dataframe, we see it has only the two variables we selected:

head(acsSalary)
##     Status WageSalary
## 1 Employed          0
## 2 Employed          0
## 3 Employed          0
## 4 Employed          0
## 5 Employed          0
## 6 Employed          0


Let’s now use the filter verb to do what we want – filter our data to keep only the cases in which Status == "Employed". We’ll chain this together with the select verb to keep only our Status and WageSalary variables:

acsEmp3 <- acs %>%
           select(Status, WageSalary) %>%
           filter(Status == "Employed")
head(acsEmp3)
##     Status WageSalary
## 1 Employed          0
## 2 Employed          0
## 3 Employed          0
## 4 Employed          0
## 5 Employed          0
## 6 Employed          0

  1. Use the filter verb in the dplyr package to select only houses on gravel roads.


Let’s take one more look at the dplyr package by chaining together a series of those verbs. Suppose, with our original acs dataframe, we want to find the mean and standard deviation of wages for adults by employment status. Furthermore (just to make this more difficult), suppose we want these values in cents (rather than dollars).

To do this, we’d use:

acs %>%                                 # Take our acs dataframe
  select(Status, WageSalary) %>%        # Select variables of interest
  group_by(Status) %>%                  # Group data by employment status
  filter(Status != "Child") %>%         # Filter out children
  mutate(cents = WageSalary * 100) %>%  # Create new "cents" variable
  summarize(mean = mean(cents), sd = sd(cents)) # Calculate mean & sd
## Source: local data frame [5 x 3]
## 
##                     Status      mean      sd
## 1                 Employed 3605196.5 3877046
## 2 Employed but not at work 2442606.6 2975496
## 3         Military at work 6829411.8 7355505
## 4       Not in labor force  144407.2 1036304
## 5               Unemployed  826317.7 1388675

With the %>% pipes chaining together these verbs, the code is pretty easy to understand by reading from left-to-right.

  1. Use the dplyr package to (a) select the Street, Year.Built, and SalePrice variables; (b) group the data by Street type; (c) filter out houses sold for less than 100,000 (dollars); (d) mutate a new variable named PricePerFoot that measures price per square foot (of above-ground living area); (e) summarize the mean PricePerFoot.



Simulating a sampling distribution

We now have a dataset consisting of the salaries of 15,286 employed adults in Iowa in 2013. Let’s pretend as though this is our population distribution.

As you can see below, the distribution has a positive skew with a mean of $36,051.96 and a standard deviation of $38,770.46.

# Don't worry about the syntax here
# I set lots of options to create this density plot
densityplot(~WageSalary, data=acsEmp, 
            xlab=list(paste("mean = ", round(mean(acsEmp$WageSalary),2), 
                       "       std dev = ", round(sd(acsEmp$WageSalary),4)), 
                      cex=1.5),
            main="Distribution of Iowa Salaries in 2013", plot.points=FALSE, 
            from=-5000, to=350000, lw=5, 
            col="steelblue", bw=10000)

Suppose we didn’t know the average income of employed adults in Iowa in 2013 was $36,051.96. Suppose we were going to estimate it by surveying a sample of employed adults in Iowa.

Let’s start by assuming we take a random sample of n = 5. Let’s take one random sample and see what average we get.

mean( sample(acsEmp$WageSalary, 5, replace=TRUE))
## [1] 43160

From this single sample, we’d say our best estimate of the average income of Iowa adults is $43,160.

If we took another random sample of size n = 5, we’d get a different estimate:

mean( sample(acsEmp$WageSalary, 5, replace=TRUE))
## [1] 18200

From this sample, our best estimate of the average income of Iowa adults would be $18,200.

Because we know the population mean ($36,052), we can evaluate our sample estimates. The first estimate ($43,160) wasn’t too far off; the second estimate ($18,200) greatly underestimated the population mean.

But if we didn’t know the population mean, we wouldn’t be able to evaluate these sample estimates. If they came from random samples, they would be equally valid estimates.

We might, then, be interested in knowing how “good” an estimate we would get if we took a single sample of size n=5. In other words, if we repeatedly took samples of size n=5, what would the distribution of those sample means look like?

In class, we learned how to predict this sampling distribution (the distribution of sample means). If we repeatedly take samples of size n=5 from the (skewed) population distribution and calculate the mean of each of those samples, then:

  • The mean of our sample means would equal the population mean
  • The standard deviation of our sample means would equal the population standard deviation divided by the square root of our sample size (5)

On top of that, if the Central Limit Theorem holds:

  • Our sampling distribution will approximate a normal distribution.
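
These predictions are easy to check numerically. Here’s a minimal sketch using base R (replicate() stands in for mosaic’s do(), and a right-skewed exponential distribution stands in for the salary data):

```r
set.seed(1)

# A right-skewed "population": exponential with mean 10 (and sd 10)
pop <- rexp(100000, rate = 1/10)

# Draw 10,000 samples of size n = 5; record each sample mean
n <- 5
means <- replicate(10000, mean(sample(pop, n, replace = TRUE)))

# Theory: the mean of the sample means should be near the population mean
mean(means)

# Theory: the sd of the sample means should be near sd(pop) / sqrt(n)
sd(means)
sd(pop) / sqrt(n)
```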


Let’s simulate the sampling distribution by taking 10,000 samples of size n=5. Will the CLT hold?

# We'll start by calculating means for 10,000 samples of size n=5
# (the do() function comes from the mosaic package, loaded via library(mosaic))
samples5 <- do(10000) * mean(sample(acsEmp$WageSalary, 5, replace=TRUE))

With a skewed population distribution and a small sample size, we can predict that the CLT will not hold and our sampling distribution will not approximate a normal distribution. Let’s create a histogram of our 10,000 sample means to see:

histogram(~result, data = samples5, col="grey", border="white", 
          xlim=c(0, 200000), xlab = "n=5", width=2000)

Yep, that’s not a normal curve. Theory tells us that we should expect:

  • The mean of our sample means to equal $36,052
  • The std. dev. of our sample means to equal $38,770.46 / sqrt(5) ≈ $17,338.68

These results hold as the number of samples we take approaches infinity. We only took 10,000 samples, so our results won’t be perfect. Let’s see what we got:

mean(samples5)  #This should be close to 36052
## [1] 36300.88
sd(samples5)    #This should be close to 17339
## [1] 17623.9

Those aren’t too far off.

Our sampling distribution certainly doesn’t look normal, but let’s take a closer look. First, let’s see how it compares to a perfect normal distribution:

# Plot the histogram again
histogram(~result, data = samples5, col="grey", border="white", 
          xlim=c(0, 200000), xlab = "n=5", width=2000)

# Add a normal distribution on top of the histogram
plotDist("norm", params=list(mean=mean(acsEmp$WageSalary), sd=(sd(acsEmp$WageSalary)/(5)^.5)), col="red", lwd=3, add=TRUE)

Ok, that makes it pretty clear that our sampling distribution doesn’t approximate a normal distribution. We can also look at a Q-Q plot:

qqnorm(samples5$result, ylab = "n = 5"); qqline(samples5$result)

This, again, shows our sampling distribution doesn’t approximate a normal distribution.
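
If you’d rather not judge a Q-Q plot by eye, one rough numeric summary is the correlation between the plot’s two axes: values very close to 1 suggest the points hug the line. Here’s a minimal sketch (using simulated data rather than our samples5 results; qqnorm() with plot.it = FALSE returns the plot’s coordinates without drawing anything):

```r
set.seed(42)
normal <- rnorm(1000)   # normal by construction
skewed <- rexp(1000)    # clearly right-skewed

# Correlation between theoretical and sample quantiles
qq_cor <- function(x) {
  pts <- qqnorm(x, plot.it = FALSE)
  cor(pts$x, pts$y)
}

qq_cor(normal)   # very close to 1
qq_cor(skewed)   # noticeably lower
```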

Would the CLT hold if we take a larger sample?


The magic number (30)?

As many textbooks claim, the CLT holds if our sample size is “large enough.” The rough guideline is that large enough = 30. Let’s see how well that works in this example.

Let’s take 10,000 samples of size n=30, calculate the mean of each sample, and then plot the sampling distribution. We’ll then compare the sampling distribution to a normal distribution.

# We'll start by calculating means for 10,000 samples of size n=30
samples30 <- do(10000) * mean(sample(acsEmp$WageSalary, 30, replace=TRUE))

# Get the mean of our sample means
paste("Mean =", round(mean(samples30), 2), " (compared to the theoretical value of 36052)")
## [1] "Mean = 36045.87  (compared to the theoretical value of 36052)"
# Get the standard deviation of our sample means
paste("SD =", round(sd(samples30), 2), 
      " (compared to the theoretical value of ", round(sd(acsEmp$WageSalary)/sqrt(30),2))
## [1] "SD = 6984.56  (compared to the theoretical value of  7078.49"
# Plot a histogram of our sample means
histogram(~result, data = samples30, col="grey", border="white", 
          xlim=c(0, 100000), xlab = "n=30", width=1000)

#Superimpose a normal distribution on top of the histogram
plotDist("norm", params=list(mean=mean(acsEmp$WageSalary), sd=(sd(acsEmp$WageSalary)/(30)^.5)), col="red", lwd=3, add=TRUE)

Let’s look at a Q-Q plot to compare our sampling distribution with a normal distribution:

qqnorm(samples30$result, ylab = "n = 30"); qqline(samples30$result)

  1. Look at the sampling distribution we obtained by repeatedly taking samples of size n=30. Does it look as though the CLT held? Explain.


Samples of size 1,000

Let’s quickly see the sampling distribution of 10,000 means from samples of size n = 1000. That’s a sample size big enough for the CLT to hold… right?

samples1000 <- do(10000) * mean(sample(acsEmp$WageSalary, 1000, replace=TRUE))

# Get the mean of our sample means
paste("Mean =", round(mean(samples1000), 2), " (compared to the theoretical value of 36052)")
## [1] "Mean = 36056.47  (compared to the theoretical value of 36052)"
# Get the standard deviation of our sample means
paste("SD =", round(sd(samples1000), 2), 
      " (compared to the theoretical value of ", round(sd(acsEmp$WageSalary)/sqrt(1000),2))
## [1] "SD = 1243.4  (compared to the theoretical value of  1226.03"
# Plot a histogram of our sample means
histogram(~result, data = samples1000, col="grey", border="white", 
          xlim=c(30000, 43000), xlab = "n=1000", width=300)

#Superimpose a normal distribution on top of the histogram
plotDist("norm", params=list(mean=mean(acsEmp$WageSalary), sd=(sd(acsEmp$WageSalary)/(1000)^.5)), col="red", lwd=3, add=TRUE)

qqnorm(samples1000$result, ylab = "n = 1000"); qqline(samples1000$result)

It looks like this sampling distribution does approximate a normal distribution.


No place like home


Housing prices in Ames, IA

The prices of all 2,390 homes sold in Ames, Iowa between 2006 and 2010 have been loaded in a vector named price.

  1. Calculate the mean and standard deviation of this price variable.

  2. Construct a histogram of the price variable.

  3. Take 10,000 samples of size n = 50 from this price distribution. Calculate the mean of each sample. Then, generate a plot of the sampling distribution (of those 10,000 means).

  4. Use a graphical method to determine if your sampling distribution approximates a normal distribution.

  5. Calculate the mean and standard deviation of your sampling distribution.



Generating (publishing) your lab report

When you’ve finished typing all your answers to the exercises, you’re ready to publish your lab report. To do this, look at the top of your source panel (the upper-left pane) and locate the Knit HTML button.

Click that button and RStudio should begin working to create an .html file of your lab report. It may take a minute or two to compile everything, but the report should open automatically when finished.

Once the report opens, quickly look through it to see all your completed exercises. Assuming everything looks good, send that lab5report.html file to me via email or print it out and hand it in to me.


You’ve done it! Congratulations on finishing your 5th lab!

Feel free to browse around the websites for R and RStudio if you’re interested in learning more, or find more labs for practice at http://openintro.org.

This lab, released under a Creative Commons Attribution-ShareAlike 3.0 Unported license, is a derivative of templates originally developed by Mark Hansen (UCLA Statistics) and adapted by Andrew Bray (Mt. Holyoke) & Mine Çetinkaya-Rundel (Duke).