Lab #2: Counting, Probability, Permutations


Remember to download the report template for this lab and open it in RStudio. You can download the template by clicking this link: http://bradthiessen.com/html5/stats/m300/lab2report.Rmd


Generating data and random samples


Money for math majors

Recall the scenario from class in which a professor teaches a statistics class with 5 engineering majors and 3 math majors. The professor, through a process he claims is random, chooses to give money to 2 math majors. We were interested in the likelihood of the professor choosing 2 math majors if he did, in fact, choose students randomly.

In class, we solved this problem with a few different methods:

  • Listing the entire sample space
  • Using combinations
  • Using the hypergeometric distribution

Let’s learn how to apply these methods in R (and estimate the likelihood via simulation).


Simulation and random sampling

Let’s begin by generating the dataset for this scenario. We need to tell R to create a vector containing 5 engineering majors and 3 math majors.

Since the identifications of the individuals students aren’t important, we only need to generate a dataset of the students’ majors. If we let E = engineering and M = math, our dataset will look like:

E E E E E M M M

To input this dataset into R, we’ll use the c() command. You can think of c() as representing combine(). To construct our dataset, we could use:

students <- c("E", "E", "E", "E", "E", "M", "M", "M")

Notice that since our data consist of characters (letters), we need to put each value in quotation marks.

While this method works, it can be tedious. Is there any way I can input the data without having to type each individual student’s major?

Yes, as long as we learn the replicate function in R: rep(). This function takes two arguments: rep(x, times), where

  • x = a value that we want to replicate
  • times = the number of times we want to replicate that value

To construct our dataset, we can combine the c() and rep() functions, like this:

students <- c( rep("Eng.", 5), rep("Math", 3) )
students
## [1] "Eng." "Eng." "Eng." "Eng." "Eng." "Math" "Math" "Math"

The output shows that our dataset, students, was constructed by replicating “Eng.” 5 times and “Math” 3 times.

  1. In your lab report, construct a dataset named students that contains 12 engineering majors, 5 math majors, and 3 other majors.

Now that we have our dataset, let’s simulate the process of choosing 2 students at random. To do this, we can use the sample() command. Typing help(sample) into the console shows me this command allows the following arguments:

sample(x, size, replace = FALSE), where

  • x = a vector of elements from which to choose
  • size = the number of items to choose
  • replace = should sampling be with replacement?

Let’s assume the statistics professor was not going to choose the same student twice. In this case, we want to choose 2 students without replacement. To do this, we use:

sample(students, size=2, replace=FALSE)
## [1] "Eng." "Math"

The output shows the two students who were chosen at random.

  1. Randomly sample 3 students from your students dataset without replacement.

Just as we did in the first lab, we’ll want to replicate this sampling process many times with the do() function. Let’s get 10,000 replications of sampling 2 students without replacement:

sampled_students <- do(10000) * sample(students, size=2, replace=FALSE)

Now that we’ve simulated this process 10,000 times, let’s summarize the results in a table.

table(sampled_students)
##       V2
## V1     Eng. Math
##   Eng. 3671 2619
##   Math 2650 1060

The output gives us a 2x2 table showing the number of times the simulated professor chose:

  • two engineering majors (the top-left cell)
  • two math majors (the bottom-right cell)
  • one math major and one engineering major (the remaining off-diagonal cells)

We can tell R to give us a more nicely-formatted table if we use the tally() command. To use tally(), we need to identify which variable(s) we want to tally.

When we simulated the selection of two students, we stored the results in a dataframe called sampled_students. Let’s look at the first several rows of this dataframe to identify our variables:

head(sampled_students)
##     V1   V2
## 1 Math Eng.
## 2 Math Eng.
## 3 Eng. Math
## 4 Eng. Eng.
## 5 Eng. Eng.
## 6 Math Eng.

This output shows our two variables are named V1 (representing the first student selected) and V2 (the second student selected). Now that we know the names of our variables, we can use the tally() command:

# Tally our variables, V1 and V2, and include the margin totals
tally(~V1 & V2, data=sampled_students, margins=TRUE)
##        V2
## V1       Eng.  Math Total
##   Eng.   3671  2619  6290
##   Math   2650  1060  3710
##   Total  6321  3679 10000

Finally, if we’re interested in the likelihood of choosing 2 math majors at random, we can format our table to display proportions (instead of frequencies). To do this, I use:

# Tally our variables, V1 and V2, and include the margin totals
tally(~V1 & V2, data=sampled_students, margins=TRUE, format="proportion")
##        V2
## V1        Eng.   Math  Total
##   Eng.  0.3671 0.2619 0.6290
##   Math  0.2650 0.1060 0.3710
##   Total 0.6321 0.3679 1.0000

That value in the bottom-right cell represents the proportion of times our simulated professor chose two math majors.

  1. Run 10,000 replications of sampling 3 students from your students dataset without replacement. Then, construct a summary of your results with the table or tally commands. Finally, estimate the likelihood of choosing 3 math majors from your simulation.


Using combinations

We’ve used a simulation to estimate the likelihood of choosing 2 math majors at random. This time, let’s calculate the likelihood using combinations.

As we saw in class, we calculate this likelihood by calculating:

P(choose 2 math majors) / P(choose 2 students)

Since we’re sampling without replacement and order does not matter, we use combinations to calculate both the numerator and denominator.

To calculate combinations in R, we use the choose() command. In this command, we type the number of objects and then the number we want to choose:

# Choose 2 out of 3 math majors
choose(3,2)
## [1] 3
# Choose 2 out of 8 total students
choose(8,2)
## [1] 28

We can combine these in a single command to calculate the likelihood of choosing 2 math majors at random:

choose(3,2) / choose(8,2)
## [1] 0.1071429

That answer is close to the likelihood we estimated with our simulation.

  1. Use the choose() command to calculate the probability of choosing 3 math majors (out of the 12 engineering, 5 math, and 3 other majors in your fictitious class). Verify that this answer is similar to the likelihood you estimated in exercise #3.


Using the hypergeometric distribution

Finally, let’s verify our answer using the hypergeometric distribution. We’ll discuss the hypergeometric distribution in-depth in unit #2. For now, let’s see how to calculate probabilities with this distribution in R.

The syntax for the hypergeometric distribution is:

dhyper(q, m, n, k), where

  • q = the number of TYPE 1 objects chosen
  • m = the total number of TYPE 1 objects
  • n = the number of TYPE 2 objects
  • k = the total number of objects (of both types) chosen

If we let math majors be our TYPE 1 objects, our scenario would need:

  • q = 2 (we want to choose 2 math majors)
  • m = 3 (total number of math majors)
  • n = 5 (total number of engineering majors)
  • k = 2 (we want to choose 2 students in total)

Let’s calculate this probability:

dhyper(2, 3, 5, 2)
## [1] 0.1071429

This answer (0.1071429) matches the answer we calculated via combinations.

  1. Use the dhyper() command to calculate the probability of choosing 3 math majors (out of the 12 engineering, 5 math, and 3 other majors in your fictitious class). Although it seems as though you have 3 different types of objects, you really only have math majors and non-math majors.



Bank supervisors

Recall the scenario in class where 48 bank supervisors were asked if they’d promote an applicant to a managerial position. 24 of those supervisors were randomly chosen and told the applicant was male. The other 24 supervisors were told the applicant was female.

The results of the study were summarized in a table:

Observed Promoted Not promoted Total
Male App 21 3 24
Female App 14 10 24
Total 35 13 48

If the gender of the applicants had no affect on the promotion decisions, we’d expect a similar number of male and female applicants to be promoted. Instead, the results of this study indicate males were much more likely to be promoted than females.

Our question of interest is this: Assuming gender does not influence promotion decisions, how likely were we to observe 21 or more promoted males in this study?

We’ll estimate this likelihood via simulation and the hypergeometric distribution.

Simulation approach

The code for this simulation is going to be a bit complicated. Two factors make it complicated: (1) I want to make the code as readable as possible for you, and (2) I’m not a very skilled coder.

Let’s start by constructing our dataset. I need to create a dataset that has 21 promoted males, 3 unpromoted males, 14 promoted females, and 10 unpromoted females. To do this, I’m going to create two variables: gender and promoted.

For the gender variable, I need to create 24 males and 24 females using:

gender <- c( rep("Male",24), rep("Female", 24) )

Remember, this will create a list of the word “Male” repeated 24 times, followed by “Female” repeated 24 times. I want 21 of those 24 males to be promoted and 3 to not be promoted. I then want 14 promoted females and 10 unpromoted females. To do this, I’ll use:

promoted = c( rep("Yes", 21), rep("No", 3), rep("Yes", 14), rep("No", 10))

I’ll then combine the two variables (gender and promoted) into a single dataframe I’ll call promotions.

Let’s put this together in a single command:

promotions <- data.frame(
              gender = c( rep("Male",24), rep("Female", 24) ), 
              promoted = c( rep("Yes", 21), rep("No", 3), rep("Yes", 14), rep("No", 10) )
              )

I could have written that across a single line, but I’m hoping the spacing helps you interpret the code.

Let’s look at a tally (or table) of this data to make sure it matches the actual results from this experiment:

tally(promoted~gender, data=promotions)
##         gender
## promoted Female Male
##      No      10    3
##      Yes     14   21

The rows and columns are in a different (alphabetical) order, but this does match the results from the study.


To simulate the results of this study, we’re going to assume that gender had no affect on promotion decisions. If that’s true, the gender listed on the application (male or female) had no impact on the supervisors decision to promote the applicant.

Imagine one supervisor in this study who was randomly assigned a male applicant and decided to promote that male applicant. Now imagine we go back in time and, through random assignment, this supervisor was now assigned a female applicant. Because we’re assuming that gender does not influence promotion decisions, we assume that this supervisor would still choose to promote the applicant (even though the supervisor now believes the applicant is female).

This means that in order to simulate the results of this experiment, we simply need to shuffle the gender of the applicants assigned to the supervisors. To randomly shuffle the gender variable, we need to add shuffle() to the previous command. I’m also going to add a format="data.frame" argument to store the results of this shuffling process as a dataframe:

shuffled_gender <- tally(promoted ~ shuffle(gender), data=promotions, format="data.frame")
shuffled_gender
##   promoted shuffle.gender. Freq
## 1       No          Female    7
## 2      Yes          Female   17
## 3       No            Male    6
## 4      Yes            Male   18

As you can see, randomly shuffling the data yielded different results than the original experiment. While we still have 35 promotions, there’s now a different number of promoted males.

Notice that the number of promoted males is stored in the 4th row and 3rd column of our shuffled_gender dataframe. Using square brackets, I can tell R to give me only this single value:

shuffled_gender[4, 3]
## [1] 18

Yep, that’s the number of promoted males we obtained.

As we’ve done several times now, we’re going to replicate our simulation many times. We’re going to:

  • Assume gender does not influence promotion decisions
  • Randomly assign male or female applicants to our supervisors (shuffle the gender variable)
  • Record the number of promoted males
  • See how likely we were to get 21 or more promoted males

Here’s the code (with some comments to help explain) to run 10,000 replications of this simulation:

shuffled_gender <- do(10000) *     # Replicate 10,000 times
     tally(                        # Create a table
       promoted ~ shuffle(gender), # of promotions and (shuffled) gender.
     data=promotions,              # The data is stored in "promotions"
     format="data.frame")[4,3]     # Access the number of promoted males

Now we can examine the distribution of promoted males we got from our simulation:

# Tally the results
tally(~result, data=shuffled_gender)
## 
##   12   13   14   15   16   17   18   19   20   21   22   23 
##    1   33  210  688 1631 2424 2391 1606  751  222   39    4
# Create a histogram of the results
histogram(~result,                  # Plot the results
          data=shuffled_gender,     # Stored in shuffled_gender 
          type="count",             # Show frequencies on the y-axis
          width=1,                  # Make the bars have width = 1
          xlab = "Promoted males",  # Label the x-axis
          v = 21,                   # Draw a vertical line at X=21
          groups = (result >= 21) ) # Color all bars >= 21