Lab #5: Sampling Distributions

Remember to download the report template for this lab and open it in RStudio. You can download the template by clicking this link:

Sampling Distributions


In 2005, the U.S. Census Bureau introduced the American Community Survey (ACS). The ACS, sent to approximately 250,000 households monthly (or 3 million per year), gathers information about a variety of factors including demographic, economic, housing, and social data.

In this lab, we’re going to work with the responses from 26,134 Iowans to the 2013 ACS. Let’s download the dataset (from my website) and look at the header:

acs <- read.csv("")
##     Status WageSalary TotalIncome
## 1 Employed          0      100000
## 2 Employed          0      110000
## 3 Employed          0       80000
## 4 Employed          0       56760
## 5 Employed          0       30000
## 6 Employed          0       45500

The full dataset has many more variables, but I limited our investigation to three:

  • Status = employment status of each respondent
  • WageSalary = Salary and wages for each respondent
  • TotalIncome = Total income for each respondent

Let’s look at the structure of our data with the str() command:

## 'data.frame':    26134 obs. of  3 variables:
##  $ Status     : Factor w/ 6 levels "Child","Employed",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ WageSalary : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ TotalIncome: int  100000 110000 80000 56760 30000 45500 75000 48000 12000 7000 ...

The str() command shows our dataset has 26,134 observations and 3 variables. The WageSalary and TotalIncome variables are stored as integers (as we should expect). The Status variable is a factor variable with 6 different levels. We know the first two levels are “child” and “employed,” but what are the other levels?

To find out, we can either look at that variable’s attributes or construct a simple table:

# The $ symbol tells R to use the Status variable in the acs dataset
# dataset$variable
## $levels
## [1] "Child"                    "Employed"                
## [3] "Employed but not at work" "Military at work"        
## [5] "Not in labor force"       "Unemployed"              
## $class
## [1] "factor"
##                    Child                 Employed Employed but not at work 
##                      406                    15286                      305 
##         Military at work       Not in labor force               Unemployed 
##                       17                     9418                      702

You can see survey respondents are classified into one of 6 Status categories:

  • Child (under the age of 16)
  • Employed
  • Employed but not at work
  • Military at work
  • Not in labor force
  • Unemployed

We’re going to look at the distribution of incomes, so let’s only look at individuals who are employed. How can we do this? How can we pull out the responses from only the employed respondents?

There are a few different ways to do this:

Option #1:subset

The subset() command, built into R, allows you to specify which data you’d like to keep. In this scenario, we want to keep all the data (in the acs dataset) for respondents with Status == "Employed".

We’ll store our subset of data as acsEmp with the command:

acsEmp <- subset(acs, Status == "Employed")

acsEmp <- subset(acs, Status == "Employed")

Let’s construct a quick table of our new dataset to make sure it only has employed respondents:

##                    Child                 Employed Employed but not at work 
##                        0                    15286                        0 
##         Military at work       Not in labor force               Unemployed 
##                        0                        0                        0

Yep, it looks like it worked.

  1. A dataset names ames has been loaded for you. It’s described in the lab report template. Create a table of the Street variable to show how many houses are on paved and gravel roads. Then, use the subset() command to select only houses on gravel roads. Construct another table to verify you’ve selected the correct subset of data.

Option #2: []

A faster, better, yet more confusing way to subset your data is by using square brackets []. The general form of this syntax is:

dataframe[ (rows to keep), (columns to keep) ]

In this scenario, we want to keep all the columns (all 3 variables). To do this, we can just leave the (columns to keep) blank. Our syntax is now:

dataframe[ (rows to keep), ]

The rows (observations) we want to keep are those in which Status == "Employed". In other words, our condition is that Status == "Employed".

We can feed this condition into the general form of our syntax:

dataframe[ dataframe$variable (condition) ]

to obtain the command

acsEmp2 <- acs[acs$Status == "Employed", ]

Notice I stored the results in a dataframe called acsEmp2. We can get a table of the status variable within this dataframe to verify we correctly subsetted our data:

acsEmp2 <- acs[acs$Status == "Employed", ]
##                    Child                 Employed Employed but not at work 
##                        0                    15286                        0 
##         Military at work       Not in labor force               Unemployed 
##                        0                        0                        0

Once again, our subset of data contains 15,286 employed respondents.

  1. Use square brackets [] to select only houses on gravel roads.

Option #3: dplyr

Another popular way to select a subset of data is to use the dplyr package developed by Hadley Wickham.

The dplyr package is designed to help you manipulate data quickly and efficiently. To this end, dplyr allows you to use 5 verbs on your data:

  • select(): focus on a subset of variables
  • filter(): focus on a subset of rows
  • mutate(): add new columns
  • summarize(): reduce each group to a smaller number of summary statistics
  • arrange(): re-order the rows

It also lets you use pipes %>% to string together verbs to perform more complicated manipulations.

Let’s see how these verbs can be used on our simple dataset. First, we better install the dplyr package using the command: install.packages("dplyr"). I’ve already installed this package, so I’m going to load it now:


Our acs dataset has two variables related to income: WageSalary and TotalIncome. Suppose we only want to analyze the WageSalary variable. We can use the select() verb to focus on that variable (and our Status variable). I’ll store the new dataframe as acsSalary:

acsSalary <- acs %>%
             select(Status, WageSalary)

If we look at the header of this new dataframe, we see it has only the two variables we selected:

##     Status WageSalary
## 1 Employed          0
## 2 Employed          0
## 3 Employed          0
## 4 Employed          0
## 5 Employed          0
## 6 Employed          0

Let’s now use the filter verb to do what we want – filter our data to keep only the cases in which Status == "Employed". We’ll chain this together with the select verb to keep only our Status and WageSalary variables:

acsEmp3 <- acs %>%
           select(Status, WageSalary) %>%
           filter(Status == "Employed")
##     Status WageSalary
## 1 Employed          0
## 2 Employed          0
## 3 Employed          0
## 4 Employed          0
## 5 Employed          0
## 6 Employed          0
  1. Use the filter verb in the dplyr package to select only houses on gravel roads.

Let’s take one more look at the dplyr package by chaining together a series of those verbs. Suppose, with our original acs dataframe, we want to find the mean and median total income for adults by employment status. Furthermore (just to make this more difficult), suppose we want these values in cents (rather than dollars).

To do this, we’d use:

acs %>%                                 # Take our acs dataframe
  select(Status, WageSalary) %>%        # Select variables of interest
  group_by(Status) %>%                  # Group data by employment status
  filter(Status != "Child") %>%         # Filter out children
  mutate(cents = WageSalary * 100) %>%  # Create new "cents" variable
  summarize(mean = mean(cents), sd = sd(cents)) # Calculate mean & sd
## Source: local data frame [5 x 3]
##                     Status      mean      sd
## 1                 Employed 3605196.5 3877046
## 2 Employed but not at work 2442606.6 2975496
## 3         Military at work 6829411.8 7355505
## 4       Not in labor force  144407.2 1036304
## 5               Unemployed  826317.7 1388675

With the %>% pipes chaining together these verbs, the code is pretty easy to understand by reading from left-to-right.

  1. Use the dplyr package to (a) select the Street, Year.Built, and SalePrice variables; (b) group the data by Street type; (c) filter out houses sold for less than 100,000 (dollars); (d) mutate a new variable named PricePerFoot that measures price per square foot (of above-ground living area); (e) summarize the mean PricePerFoot.

Simulating a sampling distribution

We now have a dataset consisting of the salaries of 15286 employed adults in Iowa in 2013. Let’s pretend as though this is our population distribution.

As you can see below, the distribution has a positive skew with a mean of $36,051.96` and a standard deviation of $38770.46.

# Don't worry about the syntax here
# I set lots of options to create this density plot
densityplot(~WageSalary, data=acsEmp, 
            xlab=list(paste("mean = ", round(mean(acsEmp$WageSalary),2), 
                       "       std dev = ", round(sd(acsEmp$WageSalary),4)), 
            main="Distribution of Iowa Salaries in 2013", plot.points=FALSE, 
            from=-5000, to=350000, lw=5, 
            col="steelblue", bw=10000)