Loading the Mosaic Package

Let’s load the Mosaic package:

require(mosaic)



Examples from Activity #2

5. Karl Pearson, a famous statistician, once tossed a coin 24,000 times and observed 12,012 heads. Using his results, we’d estimate the probability of tossing heads to be 0.5005.
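As a quick check, we can compute Pearson's estimate directly in R:

12012/24000   # Pearson's observed proportion of heads: 0.5005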

Here’s the code to create a graph showing the running proportion of heads as the number of flips grows. We’ll start with N = 1,000 flips (you can increase N to see the longer-run behavior):

# First, specify how many coin tosses we want.  We'll start with 1000
N <- 1000
# Flip a coin N times and store the results in **coin**
coin <- do(N) * rflip()
# Add a column: *sum* = cumulative number of heads
coin <- data.frame(coin,
                   sum = cumsum(coin$prop))
# Add a column: *toss* = cumulative number of coin tosses
coin <- data.frame(coin,
                   toss = 1:N)
# Add a column: *runprop* = running proportion of heads
coin <- data.frame(coin,
                   runprop = coin$sum/coin$toss)

Let’s look at the first several rows of our data frame, coin, to see all the columns we added:

head(coin)
##   n heads tails prop sum toss runprop
## 1 1     0     1    0   0    1  0.0000
## 2 1     1     0    1   1    2  0.5000
## 3 1     1     0    1   2    3  0.6667
## 4 1     0     1    0   2    4  0.5000
## 5 1     1     0    1   3    5  0.6000
## 6 1     1     0    1   4    6  0.6667

From this, you can see the final column contains our running proportion of heads at each toss. Let’s graph these results:

# Graph the results
xyplot(runprop~toss, data=coin,  #Plot runprop as a function of toss
       col="steelblue", alpha=0.7, type="o",  #Sets the color & transparency
       xlim=c(1,N), ylim=c(0.0,1.0),   #Sets limits for axes
       xlab="Toss", ylab="Proportion of Heads",   #Labels axes
       main=paste("Final Proportion of Heads = ", coin$runprop[N]))  #Creates title
# Add a horizontal line at 0.50
plotFun(0.5~toss, add=TRUE, col="black", lwd=2, lty=3)

[Figure: running proportion of heads vs. toss number, with a dotted reference line at 0.50]


5c) If we roll two dice and calculate their sum, what’s the probability that the sum is 7? To estimate this probability, I had a computer simulate 5,000 rolls of two dice.

Rolling a die 5,000 times is like choosing 5,000 numbers between 1 and 6 (with replacement, of course). We can use the sample() function:

roll1 <- sample(1:6, size=5000, replace=TRUE)  #Rolls one die 5000 times
roll2 <- sample(1:6, size=5000, replace=TRUE)  #Rolls another die 5000 times

We can now find the sum of the two dice in each of those 5,000 rolls:

sum <- roll1+roll2  #Finds the sum of the two dice
tally( ~sum)  #Tallies the sum from each of the 5,000 rolls
## 
##   2   3   4   5   6   7   8   9  10  11  12 
## 128 296 415 522 701 791 716 569 409 299 154
histogram( ~ sum, width=1, main="Sums from 5,000 Rolls of Two Dice")  #Creates a histogram

[Figure: histogram of sums from 5,000 rolls of two dice]

Finally, let’s count the number (and proportion) of those sums that are equal to 7:

tally( ~(sum==7))  #Tallies the sums that are equal to 7
## 
##  TRUE FALSE 
##   791  4209
prop( ~(sum==7))  #Proportion of sums that are equal to 7
##   TRUE 
## 0.1582

That proportion is our estimate of the probability of rolling a sum of 7.
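For comparison, the theoretical probability is 6/36, since 6 of the 36 equally-likely outcomes of two dice sum to 7:

6/36   # Theoretical probability of a sum of 7 (about 0.1667)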


7) Again, suppose we toss a coin 3 times, but we’re only interested in the number of heads we observe. List the sample space. Is each outcome in the sample space equally likely to occur? What’s the probability that we observe at least 2 heads?

We’ll simulate tossing 3 coins 10,000 times:

ThreeCoins <- do(10000) * rflip(3)  #Flips 3 coins 10,000 times
histogram( ~ heads, data=ThreeCoins, v=1.5, width=1)  #Graphs outcomes, with a vertical line at 1.5

[Figure: histogram of the number of heads in 3 coin tosses]

tally( ~heads, data=ThreeCoins)  #Tallies outcomes
## 
##    0    1    2    3 
## 1212 3772 3775 1241
prop( ~(heads >= 2), data=ThreeCoins)  #Counts proportion with at least 2 heads
##   TRUE 
## 0.5016
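This simulated proportion is close to the exact answer: a fair coin makes all 8 three-toss sequences equally likely, and 4 of them contain at least 2 heads, so the probability is 4/8 = 0.5. We can confirm this with the binomial distribution:

sum(dbinom(2:3, size=3, prob=0.5))   # Exact P(at least 2 heads) = 0.5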


11) Suppose you forget to study and you randomly guess on a 10-question true/false quiz. What’s the probability that you get a perfect score? What’s the probability you get at least one question correct?

Let’s simulate 25,000 students randomly guessing on this true/false quiz. With random guessing, each question is like flipping a coin (heads = correct answer).

Here’s the estimated probability of getting at least one question correct:

Quiz <- do(25000) * rflip(10)  #Flips 10 coins 25,000 times
histogram( ~ heads, data=Quiz, width=1, col="SteelBlue")  #Graphs outcomes

[Figure: histogram of the number of correct answers (heads) out of 10 questions]

tally( ~heads, data=Quiz)  #Tallies outcomes
## 
##    0    1    2    3    4    5    6    7    8    9   10 
##   17  252 1138 2960 5033 6122 5151 2978 1088  236   25
prop( ~(heads >=1), data=Quiz)  #Counts proportion with at least 1 correct (head)
##   TRUE 
## 0.9993

Here’s the estimated probability of getting a perfect score:

prop( ~(heads ==10), data=Quiz)  #Counts proportion with 10 heads (correct answers)
##  TRUE 
## 0.001
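These simulation estimates line up with the exact probabilities, which we can compute directly (each question is answered correctly with probability 0.5, independently of the others):

0.5^10       # Exact probability of a perfect score (about 0.001)
1 - 0.5^10   # Exact probability of at least one correct answer (about 0.999)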


14) My two older brothers and I all have the same initials, “BAT.” If my parents chose a name at random from 104 commonly used (English) boy names beginning with the letter “B” and 157 names beginning with “A,” what’s the probability I would have gotten the name Bradley Adam?

There is a 1/104 = 0.0096 chance that they would have chosen “Bradley” at random. That’s like flipping a coin that has a 0.0096 chance of coming up heads. Likewise, choosing a middle name of Adam is like flipping a coin that has a 1/157 chance of coming up heads:

FirstName <- do(50000) * rflip(1, prob=1/104)  #Flips an unfair coin 50,000 times
MiddleName <- do(50000) * rflip(1, prob=1/157)  #Flips another unfair coin 50,000 times

Remember that, for the computer, HEADS=1 and TAILS=0. We can add the results from the two coins on each of the 50,000 trials. If the sum is 2, it means both coins came up heads (and the correct name was chosen):

FullName<- FirstName$heads + MiddleName$heads #Sums the coins
tally(~FullName)  #Shows the number of trials with 0, 1, and 2 heads
## 
##     0     1     2 
## 49259   736     5
prop( ~(FullName ==2))  #Counts proportion with 2 heads (correct name)
##  TRUE 
## 1e-04
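For comparison, the exact probability is simply the product of the two chances, since the first and middle names are chosen independently:

(1/104) * (1/157)   # Exact probability of "Bradley Adam" (about 0.00006)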


15) In how many ways could we arrange the letters ABCD? In this calculation, are we assuming that we are sampling with or without replacement?

If we sample without replacement, we’re simply shuffling the letters ABCD. Let’s have the computer do this 25,000 times:

# Create a vector with the characters A, B, C, D
ABCD <- c("A","B","C","D")
# Shuffle the letters 25,000 times and store the results in "Words"
Words <- do(25000) * shuffle(ABCD, replace=FALSE)
# This stores the first letter as V1, the second letter as V2, etc.

If we wanted to estimate the probability of getting the letters “ABCD” in order, we would use the following:

prop( ~(V1=="A" & V2=="B" & V3=="C" & V4=="D"), data=Words)
##    TRUE 
## 0.04176
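That estimate is close to the theoretical value: only 1 of the 24 equally-likely orderings spells ABCD, so the probability is 1/24:

1/factorial(4)   # Theoretical probability of the ordering ABCD (about 0.0417)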

If we want a list of all the different 4-letter “words,” we’d need to tally them:

# Combine the V1-V4 variables into a single 4-character string
Words2<- paste(Words$V1,Words$V2,Words$V3,Words$V4, sep="") 
# Tally the number of different words
tally(~ Words2)
## 
## ABCD ABDC ACBD ACDB ADBC ADCB BACD BADC BCAD BCDA BDAC BDCA CABD CADB CBAD 
## 1044 1012 1038 1023 1038 1085 1067 1071 1021 1155 1015  958 1045  994 1049 
## CBDA CDAB CDBA DABC DACB DBAC DBCA DCAB DCBA 
## 1090 1043 1026 1007  995 1052  990 1080 1102
# Make the table easier to view as columns
cbind(duration.freq = table(Words2))
##      duration.freq
## ABCD          1044
## ABDC          1012
## ACBD          1038
## ACDB          1023
## ADBC          1038
## ADCB          1085
## BACD          1067
## BADC          1071
## BCAD          1021
## BCDA          1155
## BDAC          1015
## BDCA           958
## CABD          1045
## CADB           994
## CBAD          1049
## CBDA          1090
## CDAB          1043
## CDBA          1026
## DABC          1007
## DACB           995
## DBAC          1052
## DBCA           990
## DCAB          1080
## DCBA          1102

From that table, we can count 24 different outcomes.


16-25) Now might be a good time to demonstrate how to calculate combinations and permutations in R.

Let’s start with permutations. We can create a function in R that calculates the number of permutations given n = number of objects and x = number of objects selected:

perm = function(n, x) {
  # Number of ordered arrangements of x objects chosen from n:  n! / (n-x)!
  return(factorial(n) / factorial(n-x))
}

We can then call that function. As an example, suppose we want to replicate our answer to #15 (arranging the letters ABCD). We would set n=4 and x=4:

perm(n=4,x=4)
## [1] 24

To replicate #20 (choosing president, VP, and secretary from 6 people), we would set n=6 and x=3:

perm(n=6,x=3)
## [1] 120

Combinations are just as simple. First, we create a function with inputs n and x:

comb = function(n, x) {
  # Number of unordered selections of x objects chosen from n:  n! / (x!(n-x)!)
  return(factorial(n) / (factorial(x) * factorial(n-x)))
}

To replicate #23 (5-card hands from 52 cards), we would set n=52 and x=5:

comb(n=52,x=5)
## [1] 2598960
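As a quick check, base R already has a built-in choose() function for binomial coefficients, and it agrees with our comb() function:

choose(52, 5)   # Built-in equivalent of comb(n=52, x=5); returns 2598960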

To replicate #24 (dividing 8 subjects into two equal-sized groups), we would set n=8 and x=4:

comb(n=8,x=4)
## [1] 70


31) Suppose we have 10 IE majors and 10 other majors in this class. I need to choose 4 students at random to fail this course (thus, satisfying my ego). In how many ways could I choose 4 students out of 20? In how many ways could I choose 4 students out of the 10 IE majors? Using the results from these two questions, what is the probability that I randomly select 4 IE majors to fail?

We can use the comb function that we just wrote. First, let’s calculate the number of ways to choose 4 students out of 20:

comb(n=20,x=4)
## [1] 4845

Next, let’s calculate the number of ways to choose 4 students from 10 IE majors:

comb(n=10,x=4)
## [1] 210

Finally, we divide the two results to get our probability:

comb(n=10,x=4)/comb(n=20,x=4)
## [1] 0.04334

As we’ll learn in Activity #8, we could use the hypergeometric distribution to get this probability:

The function is: phyper(q, m, n, k, lower.tail = TRUE, log.p = FALSE)

where:
q = number of type 1 objects chosen (phyper returns the probability of q or fewer)
m = number of type 1 objects in the population
n = number of type 2 objects in the population
k = number of objects selected

To get the probability of choosing exactly 4 IE majors, we subtract P(X ≤ 3) from P(X ≤ 4):

phyper(4, 10, 10, 4, lower.tail = TRUE, log.p = FALSE)-phyper(3, 10, 10, 4, lower.tail = TRUE, log.p = FALSE)
## [1] 0.04334
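Equivalently, dhyper() gives the probability of exactly q type 1 objects, so no subtraction is needed:

dhyper(4, 10, 10, 4)   # P(exactly 4 IE majors among the 4 chosen), about 0.0433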

One final way to estimate this probability would be via simulation. In this scenario, we have 20 students. Let’s create 10 engineering majors (who all happen to have the name “E”) and 10 other majors (who all have the name “X”).

Students <- c("E","E","E","E","E","E","E","E","E","E","X","X","X","X","X","X","X","X","X","X")
Students
##  [1] "E" "E" "E" "E" "E" "E" "E" "E" "E" "E" "X" "X" "X" "X" "X" "X" "X"
## [18] "X" "X" "X"

To choose 4 students, we simply sample 4 of these letters. We’ll do this 100,000 times:

Failures <- do(100000) * sample(Students, 4, replace=FALSE) 
# This stores the first student as V1, the second student as V2, etc.
head(Failures)
##   V1 V2 V3 V4
## 1  E  X  X  X
## 2  X  E  E  X
## 3  X  X  E  E
## 4  E  X  X  X
## 5  X  E  X  X
## 6  E  X  X  X

If we wanted to estimate the probability of choosing 4 engineering majors, we’d simply find the proportion of our 100,000 trials that yielded “E, E, E, E.”

prop( ~(V1=="E" & V2=="E" & V3=="E" & V4=="E"), data=Failures)
##    TRUE 
## 0.04229
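
This simulated proportion (0.04229) is close to the exact probability of 0.04334 that we calculated above with the comb() function and the hypergeometric distribution.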