Activity #3: R Syntax

Examples from Activity #3

4. Calculate the likelihood of the professor choosing 2 engineering majors at random.

We’ll first do this as a simulation (like the last example in the syntax for Activity #2):

# Let's create 5 engineering majors (all named "E") and 3 math majors (all named "M").
Students <-c ("E","E","E","E","E","M","M","M")
# To choose 2 students, we sample 2 of these letters.  We'll do this 30,000 times
Money <- do(30000) * sample(Students, 2, replace=FALSE)

## Loading required package: parallel

# Note that this creates two columns
#   V1 = first student, V2 = second student
# We can then find the proportion of those samples that have two M's
prop( ~(V1=="M" & V2=="M"), data=Money)

##   TRUE 
## 0.1061

That final value is our estimated probability. We can also use the hypergeometric distribution to estimate this probability:

Recall that the function is: phyper(q, m, n, k, lower.tail = TRUE, log.p = FALSE)

where:
q = number of type 1 objects chosen
m = number of type 1 objects
n = number of type 2 objects
k = number of objects selected

We have 3 math majors (Type 1; m=3) and 5 engineering majors (Type 2; n=5). We select 2 students at random (k=2) and we want to know the probability of getting 2 math majors (q = 2).

We want to calculate P(M=2). I only know how to calculate P(M≤2), so I’ll need to take > P(M≤2) - P(M≤1).

Our syntax would be:

phyper(2, 3, 5, 2, lower.tail = TRUE, log.p = FALSE) - phyper(1, 3, 5, 2, lower.tail = TRUE, log.p = FALSE)

## [1] 0.1071

6. Calculate the likelihood of the professor choosing 1 engineering major and 1 math major.

Simulation:

# We'll use the simulation we created in the previous question.
# I'll count the proportion of times we got
# E,M  or M,E
prop( ~(V1=="M" & V2=="E" | V1=="E" & V2=="M"), data=Money)

##  TRUE 
## 0.534

Hypergeometric:

# I'll take P(M≤1)-P(M≤0) to find P(M=1)
phyper(1, 3, 5, 2, lower.tail = TRUE, log.p = FALSE) - phyper(0, 3, 5, 2, lower.tail = TRUE, log.p = FALSE)

## [1] 0.5357

7. Suppose the class had 50 engineering majors and 30 math majors. Further, suppose the professor claimed to have randomly chosen 20 math majors to receive money. What’s the probability of this happening if the professor chooses students at random?

I’ll skip straight to the hypergeometric distribution:

# I'll take P(M≤20)-P(M≤19) to find P(M=20)
phyper(20, 30, 50, 20, lower.tail = TRUE, log.p = FALSE) - phyper(19, 30, 50, 20, lower.tail = TRUE, log.p = FALSE)

## [1] 8.499e-12

8. Calculate the probability that you answer all 4 randomly chosen questions correctly.

We can use the hypergeometric distribution here, too. We’ll let:
Questions you know = m = 80
Questions you don't know = n = 20
Questions we select = k = 4

To find P(all 4 correct) = P(x=4) = P(x≤4)-P(x≤3), we use:

phyper(4, 80, 20, 4, lower.tail = TRUE, log.p = FALSE) - phyper(3, 80, 20, 4, lower.tail = TRUE, log.p = FALSE)

## [1] 0.4033

If I wanted to reduce that probability, I could grade more questions. Suppose I grade 6 questions (instead of 4). The probability of getting all 6 questions correct would be:

phyper(6, 80, 20, 6, lower.tail = TRUE, log.p = FALSE) - phyper(5, 80, 20, 6, lower.tail = TRUE, log.p = FALSE)

## [1] 0.2521

14. What’s the probability you return 0, 1, 2, 3, or 4 babies to the correct mothers?

If we wanted to conduct a simulation of this random babies experiment, we’d use the exact same syntax used in the previous activity (where we randomly shuffled the letters ABCD).

17. What’s the probability that all four students choose the same tire?

We need to simulate 4 students each choosing a tire. Let’s call the tires A, B, C, and D.

Tires <-c ("A","B","C","D")

To simulate the 4 students each choosing a tire, we’ll sample 4 “tires” with replacement (since multiple students can select the same tire). We’ll do this 100,000 times:

TiresChosen <- do(100000) * sample(Tires, 4, replace=TRUE) 
# This stores the first student's choice as V1, the second student's choice as V2, etc.
head(TiresChosen)

##   V1 V2 V3 V4
## 1  A  C  B  A
## 2  D  C  D  A
## 3  D  B  A  C
## 4  A  B  D  C
## 5  B  A  C  A
## 6  D  B  B  D

To estimate the probability of all 4 students choosing the same tire, we simply find the proportion of our 100,000 trials that yielded the same letter for all 4 students:

prop( ~(V1==V2 & V1==V3 & V1==V4), data=TiresChosen)

##    TRUE 
## 0.01577

That estimate should be close to the answer we got in class (.015625).

18-19) Here’s our dataset:

Observed	Yawn seed	No seed	Total
Yawned	10	4	14
Did not Yawn	24	12	36
Total	34	16	50

As explained in the activity, we can argue that the 14 subjects who yawned would have yawned regardless of getting a yawn seed or not. Under our null hypothesis, the yawn seed had no effect on whether subjects yawned or not.

With this logic, we can treat our 50 subjects as consisting of 14 yawners and 36 non-yawners. For simplicity, I’ll create a vector of subjects where 1 = Yawner and 0 = Non-Yawner.

Yawners = c(rep(1,14),rep(0,36))
Yawners

##  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [36] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

There you can see our 14 1’s (Yawners) and 36 0’s.

To simulate the random assignment of our 50 subjects into the yawn seed groups, we can randomly sample 34 subjects (without replacement) to represent the Yawn Seed Group. We’ll repeat this random sampling 25,000 times:

YawnSeed <- do(25000) * sample(Yawners, 34, replace=FALSE) 
# Note that this creates 34 columns
#   V1 = first subject, V2 = second subject, etc.

Now we want to count the number of Yawners (1’s) that were randomly assigned to the Yawn Seed group:

# Creates a new column, sum, that adds all the 1s from the 34 sampled values
YawnSeed$sum <- rowSums(subset(YawnSeed, select=V1:V34))

Our actual (observed) data had 10 Yawners randomly assigned to the Yawn Seed group. To estimate the p-value, we need to estimate the likelihood of observing something as or more extreme than that. So, in this scenario, we need to find the proportion of our 25,000 samples that have 10 or more yawners in the Yawn Seed group:

# Creates a histogram of the number of yawners in all 25,000 simulations
histogram( ~sum, width=1, main="Yawners in the Yawn Seed Group", v=9.5, col="Steelblue", data=YawnSeed)

plot of chunk unnamed-chunk-16

# Estimates the p-value
prop( ~(sum>=10), data=YawnSeed)

##   TRUE 
## 0.5117

20) Calculate an exact p-value.

We can use the hypergeometric distribution once more:
Yawners = m = 14
Non-Yawners = n = 36
Subjects selected for Yawn Seed group = k = 34

To find P(10 or more yawners in our sample of 34) = P(x≥10) = 1-P(x≤9), we use:

1-phyper(9, 14, 36, 34, lower.tail = TRUE, log.p = FALSE)

## [1] 0.5128

21-28) Does the drug increase subjects’ running speed?

We first need to create the dataset:

group=c(rep("drug",4), rep("placebo", 4))
place=c(1, 2, 4, 5, 3, 6, 7, 8)
Race = data.frame(group=group,place=place)
Race

##     group place
## 1    drug     1
## 2    drug     2
## 3    drug     4
## 4    drug     5
## 5 placebo     3
## 6 placebo     6
## 7 placebo     7
## 8 placebo     8

In class, we used the sum of the ranks as our test statistic. Let’s find that sum for each group:

sum(place ~ group, data = Race)

##    drug placebo 
##      12      24

To estimate the p-value, we wanted to know the likelihood of getting a sum less than 12 if we randomly select 4 of these subjects to take the drug (which we assume has no effect on the results of the race).

That’s the same as simply shuffling the group for each subject. We’ll take all 8 of our subjects and mix-up all their group assignments (drug or placebo). Here are two examples of this shuffling:

RaceShuffle1 = data.frame(group=shuffle(group),place=place)
RaceShuffle2 = data.frame(group=shuffle(group),place=place)
RaceShuffle1

##     group place
## 1 placebo     1
## 2    drug     2
## 3    drug     4
## 4 placebo     5
## 5    drug     3
## 6 placebo     6
## 7 placebo     7
## 8    drug     8

RaceShuffle2

##     group place
## 1    drug     1
## 2 placebo     2
## 3    drug     4
## 4 placebo     5
## 5    drug     3
## 6    drug     6
## 7 placebo     7
## 8 placebo     8

As you can see, the individual subjects may get reassigned to either group in each shuffle. It doesn’t impact their performance in the race, though. The subject who finished first was going to finish first regardless of the drug – we’re assuming the drug does not matter.

Let’s run 30,000 of these shuffles and see how many times we get a sum of 12 or lower for the placebo group:

RaceShuffles <- do(30000) * sum(place ~ shuffle(group), data=Race)
histogram( ~drug, width=1, main="Sum of Ranks in Placebo Group", v=12.5, col="Steelblue", data=RaceShuffles)

plot of chunk unnamed-chunk-21

# Estimates the p-value
prop( ~(drug<=12), data=RaceShuffles)

##   TRUE 
## 0.0565

29-32) Free-throw example.

We first need to create the dataset:

style=c(rep("U",3), rep("T", 3))
ftmade=c(10, 14, 8, 6, 12, 9)
FreeThrows = data.frame(style=style,ftmade=ftmade)
FreeThrows

##   style ftmade
## 1     U     10
## 2     U     14
## 3     U      8
## 4     T      6
## 5     T     12
## 6     T      9

In class, we used the sum of the ranks as our test statistic. Let’s find that sum for the underhanded and traditional styles:

sum(ftmade ~ style, data = FreeThrows)

##  T  U 
## 27 32

To estimate the p-value, we wanted to know the likelihood of getting a sum greater than 32 if we randomly select 3 of these subjects to shoot underhanded (which we assume has no effect on the results of the race).

That’s the same as simply shuffling the group for each subject. We’ll take all 6 of our subjects and mix-up all their group assignments (underhanded or traditional). Here are two examples of this shuffling:

Let’s run 25,000 of these shuffles and see how many times we get a sum of 32 or higher for the underhand group:

FTShuffles <- do(25000) * sum(ftmade ~ shuffle(style), data=FreeThrows)
histogram( ~U, width=1, main="Sum of Ranks in Underhanded Group", v=31.5, col="Steelblue", data=FTShuffles)

plot of chunk unnamed-chunk-24

# Estimates the p-value
prop( ~(U>=32), data=FTShuffles)

##   TRUE 
## 0.2997

That’s our p-value.