Remember to download the **report template** for this lab and open it in RStudio. You can download the template by clicking this link: http://bradthiessen.com/html5/stats/m300/lab2report.Rmd

Recall the scenario from class in which a professor teaches a statistics class with 5 engineering majors and 3 math majors. The professor, through a process he claims is random, chooses to give money to 2 math majors. We were interested in the **likelihood of the professor choosing 2 math majors** if he did, in fact, choose students randomly.

In class, we solved this problem with a few different methods:

- Listing the entire sample space
- Using combinations
- Using the hypergeometric distribution

Let’s learn how to apply these methods in R (and estimate the likelihood via simulation).

Let’s begin by generating the dataset for this scenario. We need to tell R to create a vector containing 5 engineering majors and 3 math majors.

Since the identifications of the individuals students aren’t important, we only need to generate a dataset of the students’ majors. If we let `E = engineering`

and `M = math`

, our dataset will look like:

E E E E E M M M

To input this dataset into R, we’ll use the `c()`

command. You can think of `c()`

as representing `combine()`

. To construct our dataset, we could use:

`students <- c("E", "E", "E", "E", "E", "M", "M", "M")`

Notice that since our data consist of **characters** (letters), we need to put each value in quotation marks.

While this method works, it can be tedious. Is there any way I can input the data without having to type each individual student’s major?

Yes, as long as we learn the **replicate** function in R: `rep()`

. This function takes two arguments: `rep(x, times)`

, where

- x = a value that we want to replicate
- times = the number of times we want to replicate that value

To construct our dataset, we can combine the `c()`

and `rep()`

functions, like this:

```
students <- c( rep("Eng.", 5), rep("Math", 3) )
students
```

`## [1] "Eng." "Eng." "Eng." "Eng." "Eng." "Math" "Math" "Math"`

The output shows that our dataset, **students**, was constructed by replicating “Eng.” 5 times and “Math” 3 times.

- In your lab report, construct a dataset named
**students**that contains**12 engineering**majors,**5 math**majors, and**3 other**majors.

Now that we have our dataset, let’s simulate the process of choosing 2 students at random. To do this, we can use the `sample()`

command. Typing `help(sample)`

into the console shows me this command allows the following arguments:

`sample(x, size, replace = FALSE)`

, where

- x = a vector of elements from which to choose
- size = the number of items to choose
- replace = should sampling be with replacement?

Let’s assume the statistics professor was not going to choose the same student twice. In this case, we want to choose 2 students **without** replacement. To do this, we use:

`sample(students, size=2, replace=FALSE)`

`## [1] "Eng." "Math"`

The output shows the two students who were chosen at random.

- Randomly sample 3 students from your
**students**dataset**without replacement.**

Just as we did in the first lab, we’ll want to replicate this sampling process many times with the `do()`

function. Let’s get 10,000 replications of sampling 2 students without replacement:

`sampled_students <- do(10000) * sample(students, size=2, replace=FALSE)`

Now that we’ve simulated this process 10,000 times, let’s summarize the results in a table.

`table(sampled_students)`

```
## V2
## V1 Eng. Math
## Eng. 3671 2619
## Math 2650 1060
```

The output gives us a 2x2 table showing the number of times the simulated professor chose:

- two engineering majors (the top-left cell)
- two math majors (the bottom-right cell)
- one math major and one engineering major (the remaining off-diagonal cells)

We can tell R to give us a more nicely-formatted table if we use the `tally()`

command. To use `tally()`

, we need to identify which variable(s) we want to tally.

When we simulated the selection of two students, we stored the results in a dataframe called `sampled_students`

. Let’s look at the first several rows of this dataframe to identify our variables:

`head(sampled_students)`

```
## V1 V2
## 1 Math Eng.
## 2 Math Eng.
## 3 Eng. Math
## 4 Eng. Eng.
## 5 Eng. Eng.
## 6 Math Eng.
```

This output shows our two variables are named `V1`

(representing the first student selected) and `V2`

(the second student selected). Now that we know the names of our variables, we can use the `tally()`

command:

```
# Tally our variables, V1 and V2, and include the margin totals
tally(~V1 & V2, data=sampled_students, margins=TRUE)
```

```
## V2
## V1 Eng. Math Total
## Eng. 3671 2619 6290
## Math 2650 1060 3710
## Total 6321 3679 10000
```

Finally, if we’re interested in the likelihood of choosing 2 math majors at random, we can format our table to display proportions (instead of frequencies). To do this, I use:

```
# Tally our variables, V1 and V2, and include the margin totals
tally(~V1 & V2, data=sampled_students, margins=TRUE, format="proportion")
```

```
## V2
## V1 Eng. Math Total
## Eng. 0.3671 0.2619 0.6290
## Math 0.2650 0.1060 0.3710
## Total 0.6321 0.3679 1.0000
```

That value in the bottom-right cell represents the proportion of times our simulated professor chose two math majors.

- Run 10,000 replications of sampling 3 students from your
**students**dataset**without replacement.**Then, construct a summary of your results with the`table`

or`tally`

commands. Finally, estimate the likelihood of choosing 3 math majors from your simulation.

We’ve used a simulation to estimate the likelihood of choosing 2 math majors at random. This time, let’s calculate the likelihood using combinations.

As we saw in class, we calculate this likelihood by calculating:

`P(choose 2 math majors) / P(choose 2 students)`

Since we’re sampling without replacement and order does **not** matter, we use **combinations** to calculate both the numerator and denominator.

To calculate combinations in R, we use the `choose()`

command. In this command, we type the number of objects and then the number we want to choose:

```
# Choose 2 out of 3 math majors
choose(3,2)
```

`## [1] 3`

```
# Choose 2 out of 8 total students
choose(8,2)
```

`## [1] 28`

We can combine these in a single command to calculate the likelihood of choosing 2 math majors at random:

`choose(3,2) / choose(8,2)`

`## [1] 0.1071429`

That answer is close to the likelihood we estimated with our simulation.

- Use the
`choose()`

command to calculate the probability of choosing 3 math majors (out of the 12 engineering, 5 math, and 3 other majors in your fictitious class). Verify that this answer is similar to the likelihood you estimated in exercise #3.

Finally, let’s verify our answer using the hypergeometric distribution. We’ll discuss the hypergeometric distribution in-depth in unit #2. For now, let’s see how to calculate probabilities with this distribution in R.

The syntax for the hypergeometric distribution is:

`dhyper(q, m, n, k)`

, where

- q = the number of
**TYPE 1**objects chosen - m = the total number of
**TYPE 1**objects - n = the number of
**TYPE 2**objects - k = the total number of objects (of both types) chosen

If we let **math majors** be our **TYPE 1** objects, our scenario would need:

- q = 2 (we want to choose 2 math majors)
- m = 3 (total number of math majors)
- n = 5 (total number of engineering majors)
- k = 2 (we want to choose 2 students in total)

Let’s calculate this probability:

`dhyper(2, 3, 5, 2)`

`## [1] 0.1071429`

This answer (0.1071429) matches the answer we calculated via combinations.

- Use the
`dhyper()`

command to calculate the probability of choosing 3 math majors (out of the 12 engineering, 5 math, and 3 other majors in your fictitious class). Although it seems as though you have 3 different types of objects, you really only have math majors and non-math majors.

Recall the scenario in class where 48 bank supervisors were asked if they’d promote an applicant to a managerial position. 24 of those supervisors were randomly chosen and told the applicant was male. The other 24 supervisors were told the applicant was female.

The results of the study were summarized in a table:

Observed | Promoted | Not promoted | Total |
---|---|---|---|

Male App | 21 | 3 | 24 |

Female App | 14 | 10 | 24 |

Total | 35 | 13 | 48 |

If the gender of the applicants had no affect on the promotion decisions, we’d expect a similar number of male and female applicants to be promoted. Instead, the results of this study indicate males were much more likely to be promoted than females.

Our question of interest is this: **Assuming gender does not influence promotion decisions, how likely were we to observe 21 or more promoted males in this study?**

We’ll estimate this likelihood via simulation and the hypergeometric distribution.

The code for this simulation is going to be a bit complicated. Two factors make it complicated: (1) I want to make the code as readable as possible for you, and (2) I’m not a very skilled coder.

Let’s start by constructing our dataset. I need to create a dataset that has 21 promoted males, 3 unpromoted males, 14 promoted females, and 10 unpromoted females. To do this, I’m going to create two variables: `gender`

and `promoted`

.

For the `gender`

variable, I need to create 24 males and 24 females using:

`gender <- c( rep("Male",24), rep("Female", 24) )`

Remember, this will create a list of the word “Male” repeated 24 times, followed by “Female” repeated 24 times. I want 21 of those 24 males to be promoted and 3 to not be promoted. I then want 14 promoted females and 10 unpromoted females. To do this, I’ll use:

`promoted = c( rep("Yes", 21), rep("No", 3), rep("Yes", 14), rep("No", 10))`

I’ll then combine the two variables (gender and promoted) into a single dataframe I’ll call `promotions`

.

Let’s put this together in a single command:

```
promotions <- data.frame(
gender = c( rep("Male",24), rep("Female", 24) ),
promoted = c( rep("Yes", 21), rep("No", 3), rep("Yes", 14), rep("No", 10) )
)
```

I could have written that across a single line, but I’m hoping the spacing helps you interpret the code.

Let’s look at a tally (or table) of this data to make sure it matches the actual results from this experiment:

`tally(promoted~gender, data=promotions)`

```
## gender
## promoted Female Male
## No 10 3
## Yes 14 21
```

The rows and columns are in a different (alphabetical) order, but this does match the results from the study.

To simulate the results of this study, we’re going to assume that **gender had no affect on promotion decisions**. If that’s true, the gender listed on the application (male or female) had no impact on the supervisors decision to promote the applicant.

Imagine one supervisor in this study who was randomly assigned a male applicant and decided to promote that male applicant. Now imagine we go back in time and, through random assignment, this supervisor was now assigned a female applicant. Because we’re assuming that gender does not influence promotion decisions, we assume that this supervisor would still choose to promote the applicant (even though the supervisor now believes the applicant is female).

This means that in order to simulate the results of this experiment, we simply need to shuffle the gender of the applicants assigned to the supervisors. To randomly shuffle the gender variable, we need to add `shuffle()`

to the previous command. I’m also going to add a `format="data.frame"`

argument to store the results of this shuffling process as a dataframe:

```
shuffled_gender <- tally(promoted ~ shuffle(gender), data=promotions, format="data.frame")
shuffled_gender
```

```
## promoted shuffle.gender. Freq
## 1 No Female 7
## 2 Yes Female 17
## 3 No Male 6
## 4 Yes Male 18
```

As you can see, randomly shuffling the data yielded different results than the original experiment. While we still have 35 promotions, there’s now a different number of promoted males.

Notice that the number of promoted males is stored in the 4th row and 3rd column of our `shuffled_gender`

dataframe. Using square brackets, I can tell R to give me **only** this single value:

`shuffled_gender[4, 3]`

`## [1] 18`

Yep, that’s the number of promoted males we obtained.

As we’ve done several times now, we’re going to replicate our simulation many times. We’re going to:

- Assume gender does not influence promotion decisions
- Randomly assign male or female applicants to our supervisors (shuffle the gender variable)
- Record the number of promoted males
- See how likely we were to get 21 or more promoted males

Here’s the code (with some comments to help explain) to run 10,000 replications of this simulation:

```
shuffled_gender <- do(10000) * # Replicate 10,000 times
tally( # Create a table
promoted ~ shuffle(gender), # of promotions and (shuffled) gender.
data=promotions, # The data is stored in "promotions"
format="data.frame")[4,3] # Access the number of promoted males
```

Now we can examine the distribution of promoted males we got from our simulation:

```
# Tally the results
tally(~result, data=shuffled_gender)
```

```
##
## 12 13 14 15 16 17 18 19 20 21 22 23
## 1 33 210 688 1631 2424 2391 1606 751 222 39 4
```

```
# Create a histogram of the results
histogram(~result, # Plot the results
data=shuffled_gender, # Stored in shuffled_gender
type="count", # Show frequencies on the y-axis
width=1, # Make the bars have width = 1
xlab = "Promoted males", # Label the x-axis
v = 21, # Draw a vertical line at X=21
groups = (result >= 21) ) # Color all bars >= 21
```