Lab #2: Counting, Probability, Permutations


Remember to download the report template for this lab and open it in RStudio. You can download the template by clicking this link: http://bradthiessen.com/html5/stats/m300/lab2report.Rmd


Generating data and random samples


Money for math majors

Recall the scenario from class in which a professor teaches a statistics class with 5 engineering majors and 3 math majors. The professor, through a process he claims is random, chooses to give money to 2 math majors. We were interested in the likelihood of the professor choosing 2 math majors if he did, in fact, choose students randomly.

In class, we solved this problem with a few different methods:

  • Listing the entire sample space
  • Using combinations
  • Using the hypergeometric distribution

Let’s learn how to apply these methods in R (and estimate the likelihood via simulation).


Simulation and random sampling

Let’s begin by generating the dataset for this scenario. We need to tell R to create a vector containing 5 engineering majors and 3 math majors.

Since the identifications of the individuals students aren’t important, we only need to generate a dataset of the students’ majors. If we let E = engineering and M = math, our dataset will look like:

E E E E E M M M

To input this dataset into R, we’ll use the c() command. You can think of c() as representing combine(). To construct our dataset, we could use:

students <- c("E", "E", "E", "E", "E", "M", "M", "M")

Notice that since our data consist of characters (letters), we need to put each value in quotation marks.

While this method works, it can be tedious. Is there any way I can input the data without having to type each individual student’s major?

Yes, as long as we learn the replicate function in R: rep(). This function takes two arguments: rep(x, times), where

  • x = a value that we want to replicate
  • times = the number of times we want to replicate that value

To construct our dataset, we can combine the c() and rep() functions, like this:

students <- c( rep("Eng.", 5), rep("Math", 3) )
students
## [1] "Eng." "Eng." "Eng." "Eng." "Eng." "Math" "Math" "Math"

The output shows that our dataset, students, was constructed by replicating “Eng.” 5 times and “Math” 3 times.

  1. In your lab report, construct a dataset named students that contains 12 engineering majors, 5 math majors, and 3 other majors.

Now that we have our dataset, let’s simulate the process of choosing 2 students at random. To do this, we can use the sample() command. Typing help(sample) into the console shows me this command allows the following arguments:

sample(x, size, replace = FALSE), where

  • x = a vector of elements from which to choose
  • size = the number of items to choose
  • replace = should sampling be with replacement?

Let’s assume the statistics professor was not going to choose the same student twice. In this case, we want to choose 2 students without replacement. To do this, we use:

sample(students, size=2, replace=FALSE)
## [1] "Eng." "Math"

The output shows the two students who were chosen at random.

  1. Randomly sample 3 students from your students dataset without replacement.

Just as we did in the first lab, we’ll want to replicate this sampling process many times with the do() function. Let’s get 10,000 replications of sampling 2 students without replacement:

sampled_students <- do(10000) * sample(students, size=2, replace=FALSE)

Now that we’ve simulated this process 10,000 times, let’s summarize the results in a table.

table(sampled_students)
##       V2
## V1     Eng. Math
##   Eng. 3671 2619
##   Math 2650 1060

The output gives us a 2x2 table showing the number of times the simulated professor chose:

  • two engineering majors (the top-left cell)
  • two math majors (the bottom-right cell)
  • one math major and one engineering major (the remaining off-diagonal cells)

We can tell R to give us a more nicely-formatted table if we use the tally() command. To use tally(), we need to identify which variable(s) we want to tally.

When we simulated the selection of two students, we stored the results in a dataframe called sampled_students. Let’s look at the first several rows of this dataframe to identify our variables:

head(sampled_students)
##     V1   V2
## 1 Math Eng.
## 2 Math Eng.
## 3 Eng. Math
## 4 Eng. Eng.
## 5 Eng. Eng.
## 6 Math Eng.

This output shows our two variables are named V1 (representing the first student selected) and V2 (the second student selected). Now that we know the names of our variables, we can use the tally() command:

# Tally our variables, V1 and V2, and include the margin totals
tally(~V1 & V2, data=sampled_students, margins=TRUE)
##        V2
## V1       Eng.  Math Total
##   Eng.   3671  2619  6290
##   Math   2650  1060  3710
##   Total  6321  3679 10000

Finally, if we’re interested in the likelihood of choosing 2 math majors at random, we can format our table to display proportions (instead of frequencies). To do this, I use:

# Tally our variables, V1 and V2, and include the margin totals
tally(~V1 & V2, data=sampled_students, margins=TRUE, format="proportion")
##        V2
## V1        Eng.   Math  Total
##   Eng.  0.3671 0.2619 0.6290
##   Math  0.2650 0.1060 0.3710
##   Total 0.6321 0.3679 1.0000

That value in the bottom-right cell represents the proportion of times our simulated professor chose two math majors.

  1. Run 10,000 replications of sampling 3 students from your students dataset without replacement. Then, construct a summary of your results with the table or tally commands. Finally, estimate the likelihood of choosing 3 math majors from your simulation.


Using combinations

We’ve used a simulation to estimate the likelihood of choosing 2 math majors at random. This time, let’s calculate the likelihood using combinations.

As we saw in class, we calculate this likelihood by calculating:

P(choose 2 math majors) / P(choose 2 students)

Since we’re sampling without replacement and order does not matter, we use combinations to calculate both the numerator and denominator.

To calculate combinations in R, we use the choose() command. In this command, we type the number of objects and then the number we want to choose:

# Choose 2 out of 3 math majors
choose(3,2)
## [1] 3
# Choose 2 out of 8 total students
choose(8,2)
## [1] 28

We can combine these in a single command to calculate the likelihood of choosing 2 math majors at random:

choose(3,2) / choose(8,2)
## [1] 0.1071429

That answer is close to the likelihood we estimated with our simulation.

  1. Use the choose() command to calculate the probability of choosing 3 math majors (out of the 12 engineering, 5 math, and 3 other majors in your fictitious class). Verify that this answer is similar to the likelihood you estimated in exercise #3.


Using the hypergeometric distribution

Finally, let’s verify our answer using the hypergeometric distribution. We’ll discuss the hypergeometric distribution in-depth in unit #2. For now, let’s see how to calculate probabilities with this distribution in R.

The syntax for the hypergeometric distribution is:

dhyper(q, m, n, k), where

  • q = the number of TYPE 1 objects chosen
  • m = the total number of TYPE 1 objects
  • n = the number of TYPE 2 objects
  • k = the total number of objects (of both types) chosen

If we let math majors be our TYPE 1 objects, our scenario would need:

  • q = 2 (we want to choose 2 math majors)
  • m = 3 (total number of math majors)
  • n = 5 (total number of engineering majors)
  • k = 2 (we want to choose 2 students in total)

Let’s calculate this probability:

dhyper(2, 3, 5, 2)
## [1] 0.1071429

This answer (0.1071429) matches the answer we calculated via combinations.

  1. Use the dhyper() command to calculate the probability of choosing 3 math majors (out of the 12 engineering, 5 math, and 3 other majors in your fictitious class). Although it seems as though you have 3 different types of objects, you really only have math majors and non-math majors.



Bank supervisors

Recall the scenario in class where 48 bank supervisors were asked if they’d promote an applicant to a managerial position. 24 of those supervisors were randomly chosen and told the applicant was male. The other 24 supervisors were told the applicant was female.

The results of the study were summarized in a table:

Observed Promoted Not promoted Total
Male App 21 3 24
Female App 14 10 24
Total 35 13 48

If the gender of the applicants had no affect on the promotion decisions, we’d expect a similar number of male and female applicants to be promoted. Instead, the results of this study indicate males were much more likely to be promoted than females.

Our question of interest is this: Assuming gender does not influence promotion decisions, how likely were we to observe 21 or more promoted males in this study?

We’ll estimate this likelihood via simulation and the hypergeometric distribution.

Simulation approach

The code for this simulation is going to be a bit complicated. Two factors make it complicated: (1) I want to make the code as readable as possible for you, and (2) I’m not a very skilled coder.

Let’s start by constructing our dataset. I need to create a dataset that has 21 promoted males, 3 unpromoted males, 14 promoted females, and 10 unpromoted females. To do this, I’m going to create two variables: gender and promoted.

For the gender variable, I need to create 24 males and 24 females using:

gender <- c( rep("Male",24), rep("Female", 24) )

Remember, this will create a list of the word “Male” repeated 24 times, followed by “Female” repeated 24 times. I want 21 of those 24 males to be promoted and 3 to not be promoted. I then want 14 promoted females and 10 unpromoted females. To do this, I’ll use:

promoted = c( rep("Yes", 21), rep("No", 3), rep("Yes", 14), rep("No", 10))

I’ll then combine the two variables (gender and promoted) into a single dataframe I’ll call promotions.

Let’s put this together in a single command:

promotions <- data.frame(
              gender = c( rep("Male",24), rep("Female", 24) ), 
              promoted = c( rep("Yes", 21), rep("No", 3), rep("Yes", 14), rep("No", 10) )
              )

I could have written that across a single line, but I’m hoping the spacing helps you interpret the code.

Let’s look at a tally (or table) of this data to make sure it matches the actual results from this experiment:

tally(promoted~gender, data=promotions)
##         gender
## promoted Female Male
##      No      10    3
##      Yes     14   21

The rows and columns are in a different (alphabetical) order, but this does match the results from the study.


To simulate the results of this study, we’re going to assume that gender had no affect on promotion decisions. If that’s true, the gender listed on the application (male or female) had no impact on the supervisors decision to promote the applicant.

Imagine one supervisor in this study who was randomly assigned a male applicant and decided to promote that male applicant. Now imagine we go back in time and, through random assignment, this supervisor was now assigned a female applicant. Because we’re assuming that gender does not influence promotion decisions, we assume that this supervisor would still choose to promote the applicant (even though the supervisor now believes the applicant is female).

This means that in order to simulate the results of this experiment, we simply need to shuffle the gender of the applicants assigned to the supervisors. To randomly shuffle the gender variable, we need to add shuffle() to the previous command. I’m also going to add a format="data.frame" argument to store the results of this shuffling process as a dataframe:

shuffled_gender <- tally(promoted ~ shuffle(gender), data=promotions, format="data.frame")
shuffled_gender
##   promoted shuffle.gender. Freq
## 1       No          Female    7
## 2      Yes          Female   17
## 3       No            Male    6
## 4      Yes            Male   18

As you can see, randomly shuffling the data yielded different results than the original experiment. While we still have 35 promotions, there’s now a different number of promoted males.

Notice that the number of promoted males is stored in the 4th row and 3rd column of our shuffled_gender dataframe. Using square brackets, I can tell R to give me only this single value:

shuffled_gender[4, 3]
## [1] 18

Yep, that’s the number of promoted males we obtained.

As we’ve done several times now, we’re going to replicate our simulation many times. We’re going to:

  • Assume gender does not influence promotion decisions
  • Randomly assign male or female applicants to our supervisors (shuffle the gender variable)
  • Record the number of promoted males
  • See how likely we were to get 21 or more promoted males

Here’s the code (with some comments to help explain) to run 10,000 replications of this simulation:

shuffled_gender <- do(10000) *     # Replicate 10,000 times
     tally(                        # Create a table
       promoted ~ shuffle(gender), # of promotions and (shuffled) gender.
     data=promotions,              # The data is stored in "promotions"
     format="data.frame")[4,3]     # Access the number of promoted males

Now we can examine the distribution of promoted males we got from our simulation:

# Tally the results
tally(~result, data=shuffled_gender)
## 
##   12   13   14   15   16   17   18   19   20   21   22   23 
##    1   33  210  688 1631 2424 2391 1606  751  222   39    4
# Create a histogram of the results
histogram(~result,                  # Plot the results
          data=shuffled_gender,     # Stored in shuffled_gender 
          type="count",             # Show frequencies on the y-axis
          width=1,                  # Make the bars have width = 1
          xlab = "Promoted males",  # Label the x-axis
          v = 21,                   # Draw a vertical line at X=21
          groups = (result >= 21) ) # Color all bars >= 21

From this, it doesn’t look as though it was likely to get 21 or more promoted males if gender did not influence promotion decisions. To estimate this likelihood, we can use our prop() command:

prop(~result>=21, data=shuffled_gender, format="prop")
##   TRUE 
## 0.0265

That’s our p-value – the likelihood of observing 21+ promoted males if gender does not influence promotion decisions. Because this likelihood is low, we have evidence that gender may, in fact, influence promotion decisions (at least for these bank supervisors).

  1. In assignment #3, you were asked to analyze data regarding Nurse Gilbert. Out of the 257 shifts Nurse Gilbert worked, 40 deaths occurred. Out of the 1384 shifts she did not work, 34 deaths occurred. Use a simulation to estimate the likelihood of observing 40 or more deaths on her shifts if, in fact, Nurse Gilbert had no influence on those deaths.


Bank Supervisors - Hypergeometric Distribution

We can use the hypergeometric distribution to estimate the likelihood of observing 21 or more promoted males. To do this, we need to think of our experiment as a process in which we choose a certain number of objects from two groups.

One way to do this would be to think of our data as:

  • q = 21 (we want to choose 21 males for promotion)
  • m = 24 (total number of males)
  • n = 24 (total number of females)
  • k = 35 (we want to choose 35 total applications for promotion)

The phyper() command tells R to calculate the probability of choosing q or fewer objects. In this case, it would calculate the probability of choosing 21 or fewer promoted males. We’re interested in the likelihood of choosing 21 or more promoted males, so we can use the complement rule to calculate 1 - P(20 or fewer):

1 - phyper(20, 24, 24, 35)
## [1] 0.02449571

That’s our likelihood: 0.0244957. It’s similar to what we estimated with our simulation methods.

Another way of characterizing our experiment under the hypergeometric distribution would be to consider:

  • q = 21 (we want to choose 21 promotions to be males)
  • m = 35 (total number of promotions)
  • n = 13 (total number of non-promotions)
  • k = 24 (we want to choose 24 applications in total to be males)
1-phyper(21, 35, 13, 24)
## [1] 0.003920274

Notice that this gives us exactly the same likelihood.

  1. Use the hypergeometric distribution to estimate the likelihood of observing 40 or more deaths on Nurse Gilbert’s shifts. From this, what conclusions can you make?



Permutation Tests


Need for speed

In class, we analyzed a simple (fictitious) dataset to see if a new drug made people run faster. Out of 8 randomly chosen subjects, 4 were randomly assigned to take a drug. All 8 subjects then ran a race. The subjects who took the drug finished in 1st, 2nd, 4th, and 5th place.

Let’s input this data:

race <- data.frame(
              group = c( rep("Drug",4), rep("Placebo", 4) ), 
              finish = c(1, 2, 4, 5, 3, 6, 7, 8) )

# Get a tally of the data
tally(group ~ finish, data=race)
##          finish
## group     1 2 3 4 5 6 7 8
##   Drug    1 1 0 1 1 0 0 0
##   Placebo 0 0 1 0 0 1 1 1

In class, we decided that one good way of measuring the relative performance of each group was to sum their ranks.

So, for the drug group, the sum is 1 + 2 + 4 + 5 = 12. The placebo group has a sum of 3 + 6 + 7 + 8 =24. Lower numbers are better, so our data suggests the drug is associated with better race results.

We can get these sums in R with the sum() command:

sum(finish ~ group, data = race)
##    Drug Placebo 
##      12      24

Let’s look a little more in-depth at the arguments in this command. Specifically, let’s look at the part in bold:

sum(**finish ~ group**, data=race)

The y ~ x argument is a sort of functional notation in the mosaic package. In this case, we’re telling R that we want to sum the finish variable as a function of the group variable (i.e., sum the race ranks for each group separately).

Assuming the drug has no effect on an individual’s speed, what’s the likelihood we would have obtained a sum of 12 or less in our study?

To estimate this likelihood, we’re going to shuffle the group variable. Since we’re assuming the drug doesn’t matter, we’re free to randomly assign subjects to the drug and placebo groups (all the while assuming the results of the race wouldn’t change).

Let’s run 10,000 replications of this shuffling simulation:

race_simulation <- do(10000) * sum(finish ~ shuffle(group), data=race)

Now, let’s visualize the results and estimate the p-value:

histogram(~Drug,                    # Plot the sums of the drug group
          data=race_simulation,     # Stored in race_simulation 
          type="count",             # Show frequencies on the y-axis
          width=1,                  # Make the bars have width = 1
          xlab = "Sum of Ranks in Drug Group",      # Label x-axis
          v = 12,                   # Draw a vertical line at X=12
          groups = (Drug <= 12) )   # Color all bars <= 21

prop(~ Drug<=12, data=race_simulation)
##   TRUE 
## 0.0573

In class, we began listing all 70 possible outcomes of this study. We stopped once we had found the 4 most extreme outcomes and calculated our p-value to be 0.0571429.

In this simulation approach, we didn’t bother figuring out how many outcomes were possible. In many scenarios (with datasets bigger than n=8 subjects), it wouldn’t be feasible to list out all possible outcomes. Instead, we rely on the computer to generate a large representative sample of possible outcomes.

As you can see, the p-value estimated with this simulation approach was very similar to the “exact” p-value we calculated in class.

  1. Use simulation methods (shuffling) to replicate the free-throw example we completed at the end of activity #3. Follow the Need for speed example you just read in this lab. You’ll need to generate a dataset, calculate sums, run 10,000 replications of a shuffling simulation, visualize the results, and estimate the p-value. You already know you should get a p-value very close to 0.30.




Generating (pubishing) your lab report

When you’ve finished typing all your answers to the exercises, you’re ready to publish your lab report. To do this, look at the top of your source panel (the upper-left pane) and locate the Knit HTML button: Drawing

Click that button and RStudio should begin working to create an .html file of your lab report. It may take a minute or two to compile everything, but the report should open automatically when finished.

Once the report opens, quickly look through it to see all your completed exercises. Assuming everything looks good, send that lab1report.html file to me via email or print it out and hand it in to me.


You’ve done it! Congratulations on finishing the second lab!

Feel free to browse around the websites for R and RStudio if you’re interested in learning more, or find more labs for practice at http://openintro.org.

This lab, released under a Creative Commons Attribution-ShareAlike 3.0 Unported license, is a derivative of the labs developed by by Andrew Bray (Mt. Holyoke) & Mine Çetinkaya-Rundel (Duke).