Installing and using R, RStudio

To complete this lab, you’ll need to download and install two free applications:

R: a statistical programming language
RStudio: an interface for R

By the end of the semester, you’ll have some basic fluency in R and the ability to analyze data in RStudio. Until then, I don’t expect that you’ll understand all (or even most) of the code you’ll see in these labs. If you have questions, meet with me or email me.

I encourage you to work with other students on these labs and explore beyond what each lab requires. A willingness to experiment will help you develop skills in R (skills that are in high demand!).

Let’s get started by opening RStudio. If you’re using a lab computer, both programs should already be installed. Open RStudio and skip down to the RStudio Interface section of this lab.

If you’re using your own computer, you’ll need to download and install both programs from the following websites:

R: www.cran.r-project.org. Select the version you want:
- Mac OS X (Download the R-3.3.3.pkg file (or whatever version is there).
- Windows (Download the latest version - R 3.3.3 as of April 2017)
- Linux
RStudio: https://www.rstudio.com/products/rstudio/download/. Download the appropriate installer (at the bottom of the page).

Open the files you downloaded to install both R and RStudio. Then, open RStudio.

If you have any problems installing or opening RStudio, let me know so I can help.

RStudio Interface

When you open RStudio, you should get a screen that looks like this (only not so full of text).

Drawing

That screenshot shows I’m using RStudio to write this lab. As you can see, the interface has 4 panels:

The source pane is in the upper-left quadrant. This is where you’ll type and publish your lab reports (by typing in commands and then clicking the knit button).
The workspace pane in the upper-right quadrant. It will show datasets and functions you load into R. Since you haven’t loaded anything yet, it tells you the environment is empty. If you click the History tab, you will see a list of commands you’ve previously entered.
The files, plots, packages, help pane in the lower-right quadrant. Plots you generate can show up in this pane. You can also use this pane to install and load packages or read help files.
The console pane is in the lower-left quadrant. This is where the action happens. When you launch RStudio, the console will tell you information about the version of R you are using. At the bottom, you’ll see the prompt (>) where you can type commands to get immediate results.

Let’s type some very basic commands to the right of the prompt to see how R works. First, try typing:

2 + 2

When you push return (or enter), you should see the following output:

## [1] 4

In addition to the correct answer, the [1] indicates you’re looking at the first line of output. As you’ll soon see, R is much more than a calculator, but it can be useful in calculating expressions such as: \(\sin \sqrt[5]{\frac{2\pi e^{3}}{\sqrt{7}}}\)

sin( ( (2 * pi * exp(3)) / sqrt(7) )^(1/5) )

## [1] 0.8279101

Before we officially begin the content portion of this lab, let’s look at one more useful feature of R: assignment.

If you type the symbols less-than and hyphen, you type something that looks like an arrow pointing to the left: <-. You can use this arrow to assign values to variables.

For example, you can create a variable x and assign it the value 3 by typing this into the console:

x <- 3

You could assign the value 4 to the variable banana with something like this:

banana <- 2 + 2

Notice that you get no output from these commands. That’s because you haven’t told R to give you any output. If you wanted to see the values of these variables, you can simply type their names:

## [1] 3

banana

## [1] 4

You could also perform functions on these variables, like adding them together:

x + banana

## [1] 7

Trust me – you’ll eventually get used to thinking of <- as an arrow assigning a value to a variable.

Installing helpful packages

R has thousands of packages (collections of functions, data, and code) that can simplify your work. Two helpful packages we’ll use throughout this course are the tidyverse and mosaic packages.

To install these packages on your computer, you can type and run the following commands into the console prompt:

install.packages("tidyverse", dependencies = TRUE)
install.packages("mosaic", dependencies = TRUE)

When you run these commands, the console will fill-up with status messages. As long as you don’t see any error messages, it means the packages were installed correctly.

Another way to install packages would be to click the Packages tab in the lower-right quadrant of RStudio. This will show you a list of packages currently installed. At the top of this list, you’ll find an Install button. Click this button, type the name of the package want to install (tidyverse or mosaic), and click install.

Generating lab reports

The document you’re reading right now is the lab. It will contain helpful information and commands you’ll use to write a lab report.

How do you write a lab report? Thankfully, you don’t need to open up a word processor and copy-and-paste a bunch of stuff. Instead, you’ll generate your entire lab report within RStudio.

To do this, you’ll need to download the lab report template and open it in RStudio. You can download the template by clicking this link: http://bradthiessen.com/html5/stats/m300/lab1report.Rmd

Download this file and open it in RStudio. It should open in the upper-left panel of your screen. The first few lines of the template will look something like this:

# ---
# title: "Your Turn #1"
# output:
#   html_document:
#     css: http://www.bradthiessen.com/batlab3.css

You can ignore this text; it just formats the lab report. Just below this formatting information, you’ll find:

# **Date:** [Enter today's date]
#
# **Author:** [Enter name of person typing this report here]
#
# **Lab Partner(s):** [Enter names of other students who worked on this lab.]

This is where you should type today’s date and the names of the students who helped write this lab report. Just type over the information between the brackets.

Now you’re ready to begin completing the exercises in this lab (which will begin in just a bit). You’ll read the exercise in this lab file, then scroll down in your lab report template until you find where you’re supposed to type your answers.

For example, suppose the first exercise of this lab was…

Create a variable named “my_feelings” and assign to it your feelings about statistics.

To answer this in your lab report, you’d scroll down in the template until you find:

Drawing

You’d then type-in a command like:

Drawing

That’s all there is to it! Be sure to save your work often. At the end of this lab, you’ll learn how to publish your lab in a nicely formatted .html file.

These labs should provide you with enough support to complete everything.
If you ever have any questions, let me know.

Simulation to estimate probabilities

Flipping coins with nflip()

In activity #1, we simulated dolphins choosing the correct button 15 out of 16 times. We wanted to estimate the likelihood of choosing 15 (or more) correct buttons if, in fact, the dolphins could not communicate.

We simulated this with an actual coin in class, then with an applet, and finally with R. Let’s see if you can run this simulation on your own in R.

To do this, you need to learn how to tell R to flip a coin. There are several ways to do this, but let’s start with the rflip() command from the mosaic package we loaded in this lab. To use this command, we can simply type:

rflip()

## 
## Flipping 1 coin [ Prob(Heads) = 0.5 ] ...
## 
## T
## 
## Number of Heads: 0 [Proportion Heads: 0]

As you can see, this command yields the results from flipping a single coin. It shows a result of either H (for heads) or T (for tails). Like a real coin, the results can differ each time we run this command. If you repeatedly type rflip() into the console, you’d see that sometimes you’d get heads and sometimes you’d get tails.

In our simulation, we need to flip 16 coins. Can the rflip() function do this? One way to find out would be to look at the help file for this command. To get help, we could type:

help("rflip")

If you type that command into the console, the help file would load in the lower-right panel of RStudio. In that help file, you’d see a description of the rflip() command along with its usage:

rflip(n = 1, prob = 0.5, quiet = FALSE, verbose = ~quiet)

This tells us we can enter in some arguments into the command. The help file presents the following list of arguments for the rflip() command:

n = the number of coins to toss
prob = probability of heads on each toss

To toss 16 coins, then, we simply use the command rflip(n=16) or, more succinctly, rflip(16). If we really wanted to be sure we were flipping a coin with a 50% chance of landing on heads, we’d use:

rflip(n=16, prob=0.5)

## 
## Flipping 16 coins [ Prob(Heads) = 0.5 ] ...
## 
## H H T T H H H T H H T T H T T H
## 
## Number of Heads: 9 [Proportion Heads: 0.5625]

This shows us the results of all 16 coins along with the overall number and proportion of heads obtained.

Use the rflip() command to flip 20 fair coins. Then, use nflip() to flip 20 coins. Notice the output is simplified - it only shows the number of heads. Use either rflip() or nflip() to flip 10 unfair coins (in which the probability of heads = 0.80 for each coin).

Remember, you’re going to complete this exercise by typing code into your report template.

Replication with Do() *

Flipping 16 coins simulates the dolphin experiment one time and shows us one possible result from the experiment. We’d like to know which results are likely if we were to repeat this experiment many times. To do this, we need to replicate our simulation.

With the mosaic package, replication is simple! We simply use the Do() command.

Remember, to flip 16 coins one time, we use nflip(16). If we wanted to repeat this process twice, we’d use:

Do(2) * nflip(n=16)

## # A tibble: 2 × 1
##   nflip
##   <dbl>
## 1     8
## 2     6

The output shows the number of heads, number of tails, and proportion of heads we obtained in both of our replications.

Use Do() * nflip() to flip 20 fair coins 5 times. Then, flip 10 unfair coins (in which the probability of heads = 0.80 for each coin) 7 times.

We want to repeat this process many times, so we’ll conduct 10,000 replications. To do this, we would use Do(10000) * nflip(n=16). Before we do that, think about what would happen. Our screen would fill with 10,000 lines of output showing the results of each replication!

We don’t need to see the results of all 10,000 replications; we only need R to remember the results of all 10,000 replications. To do this, we can assign our simulation to a variable. Remember the <- arrow that assigns values to variables? Let’s use it:

coinflips <- Do(10000) * nflip(n=16)

As you can see, we don’t get any output from this command. That’s because R has stored all the results in a variable named coinflips. We can look at the first several rows of output (stored in the coinflips variable) with the command head().

head(coinflips)

## # A tibble: 6 × 1
##   nflip
## * <dbl>
## 1     9
## 2     8
## 3     9
## 4     7
## 5     8
## 6     6

I can also look at the top-right panel of RStudio and see that a data frame named coinflips has been created. If I click the little arrow to the left of that variable, I can see some information about the values it contains.

Drawing

Flip 20 fair coins 1,000 times and store the results as coins. In other words, assign the results of 1,000 replications of 20 coin flips to the variable coins. Then, look at the first several rows of the results using the head() command. Finally, look at the top-right panel of RStudio to verify the variable coins has been created.

Histogram of simulation results

We’ve flipped 16 coins 10,000 times and stored the results as coinflips. Instead of looking row-by-row at the results, let’s visualize all the results of all 10,000 replications.

To do this, we can construct a histogram. In R, we can generate a histogram with the command histogram(). The histogram() command allows many arguments, including:

histogram(~x, data, xlab, type, width, groups, col)

The only arguments we really need are:

x = the variable (values) we want to plot
data = the data frame containing our variable (values)

Let’s construct a simple histogram of the number of heads in each of our 10,000 replications:

histogram(~nflip, data=coinflips)

Notice that I needed to type a tilde ~ before specifying the variable I wanted to plot (heads). We’ll learn why we need this ~ symbol in a later lab.

The data frame coins contains the results of 1,000 replications of 20 coin flips. Construct a histogram of the number of heads in all 1,000 replications.

Now we can use additional arguments to modify this histogram. For example, I don’t really like the color scheme of that histogram. If I want to change the color, I can use the col() argument:

histogram(~nflip, data=coinflips, col="grey", border="white")

Change the colors of the bars and borders in your histogram. Try any colors you’d like. I’d suggest using “lightgreen” and “forestgreen.” If you want to see a list of possible colors, you can visit [this website] (http://research.stowers-institute.org/efg/R/Color/Chart/)

If you look at the y-axis of these histograms, you’ll notice it represents density. We haven’t yet discussed density, so we might want to change the axis to represent the number of times each result was obtained. To do this, we can change the type of histogram:

histogram(~nflip, data=coinflips, col="grey", border="white", type="count")

You can see the y-axis now shows the number of times each result was obtained. One other aspect of the histogram I want to set is the width of the bars. Depending on the width I set, this histogram could look very different. Since I know the number of heads must be a whole number (1 through 16), I’m going to set the width of each bar equal to 1.

histogram(~nflip, data=coinflips, col="grey", border="white", type="count", width=1)

Finally, let’s clarify the label of our x-axis:

histogram(~nflip, data=coinflips, col="grey", border="white", type="count", width=1,
          xlab = "Number of heads from 16 coin flips")

Change your histogram to have counts represented on the y-axis, bar widths of 1, and a descriptive label on the x-axis.

Estimating our likelihood of interest

We’ve flipped 16 coins 10,000 times, stored the results as coinflips, and plotted the results in a histogram. Now, let’s estimate our likelihood of interest. Remember, we’re interested in the likelihood of obtaining 15 or more heads out of 16 coin flips.

We can visualize this likelihood with our histogram. I’m going to add one last argument: ‘v’:

histogram(~nflip, data=coinflips, type="count", width=1,
          xlab = "Number of heads from 16 coin flips",
          v = 15)

The v = 15 argument draws a vertical line at X = 15.

Add a vertical line to your histogram at X = 12.

The histogram allows us to see that the likelihood of interest is small, but it doesn’t tell us how small. What we want to do is estimate the proportion of our 10,000 replications that resulted in 15 or more heads. To do this, we can use the prop() command.

prop(~nflip >= 15, data=coinflips)

##  TRUE 
## 4e-04

This counts the proportion of our results (~nflip) that are greater than or equal to 15 (noting that our data is stored in coinflips).

Remember this value represents the likelihood of the dolphins choosing 15 or more correct switches if they cannot communicate. Since this p-value is extremely low, we have evidence that this belief that dolphins cannot communicate is incorrect. From this, we might conclude that dolphins can, in fact, communicate.

If we were to go through this process again with a different set of 10,000 replications, we’d get a slightly different estimated likelihood. Each time, though, we’d find the estimated likelihood would be very close to 0.00026. How do I know this? We’ll find out by the end of this unit.

Tea Time

In the 1920s, Dr. Muriel Bristol claimed to be able to tell – by taste alone – whether milk was added to tea or tea was added to milk.

A statistician, R. A. Fisher, proposed to test that claim. Fisher gave Dr. Bristol 8 cups of tea (4 with milk added to the tea; 4 with tea added to the milk) in random order. Dr. Bristol was asked to identify which variety of tea was in each of the 8 cups.

Dr. Bristol was able to correctly identify all 8 cups of tea.

Simulate this experiment with 10,000 replications of coin flips. Then, create a histogram of the results and estimate the p-value from this experiment.

Explain what this p-value represents. What assumptions were you making when you simulated this experiment?

Psychic

The simulations we’ve conducted have all used fair coins. Let’s conduct a simulation where the probability of heads should not be 0.50.

Statistician Jessica Utts has analyzed studies investigating psychic functioning. In one study, a “sender” concentrated on an image while a “receiver” in another room tried to determine which image was being “sent.” The receiver was given 4 images to choose from (the actual image the sender is concentrating on, plus 3 other images).

The probability of heads in our simulation should represent the probability of the receiver picking the correct image if the sender has no psychic ability.

In this study, subjects chose the correct image in 106 of 329 attempts (a 32.2% success rate).

Simulate this experiment with 10,000 replications. Estimate the p-value.

Explain what this p-value represents. What assumptions were you making when you simulated this experiment?

Answer: Replace this text with your answer.

Hot Hand

Basketball players who make several shots in a row might be described as having a hot hand. Whether the “hot hand” exists or not has generated controversy (as you can see if you Google “hot hand basketball”).

Let’s investigate the performance of Kobe Bryant in the 2009 NBA finals against Orlando Magic. In this series, many believe Kobe appeared to have a hot hand. Let’s load data from these games and look at some of it.

# This first line of code downloads the dataset from a website
load(url("http://www.openintro.org/stat/data/kobe.RData"))
head(kobe[c(4,5,6)])

## # A tibble: 6 × 3
##     time                                             description basket
## * <fctr>                                                  <fctr>  <chr>
## 1   9:47                 Kobe Bryant makes 4-foot two point shot      H
## 2   9:07                               Kobe Bryant misses jumper      M
## 3   8:11                        Kobe Bryant misses 7-foot jumper      M
## 4   7:41 Kobe Bryant makes 16-foot jumper (Derek Fisher assists)      H
## 5   7:03                         Kobe Bryant makes driving layup      H
## 6   6:01                               Kobe Bryant misses jumper      M

Each row of this data frame records a shot taken by Kobe. In the basket column, H represents a shot Kobe hit (made), while M represents a missed shot. Let’s calculate Kobe’s overall shooting percentage from this data:

# Calculate the proportion of H in the basket column (in the kobe data frame)
# Store the resulting proportion as "p"
p <- prop(~basket, data = kobe)

# Print the results in a statement (rounding p to two decimal places)
paste("Kobe's shooting percentage =", round(p,2))

## [1] "Kobe's shooting percentage = 0.44"

Kobe’s first nine shot attempts yielded the following: H M | M | H H M | M | M | M.

It’s difficult to judge the existence of a hot hand simply by looking at the string of hits and misses. Instead, let’s consider the belief that hot hand shooters tend to go on shooting streaks. Let’s define the length of a shooting streak to be the number of consecutive baskets made until a miss occurs.

Within the nine shot attempts listed above, there are 6 streaks (separated by the “|” symbol). The lengths of these streaks are: 1, 0, 2, 0, 0, 0.

A function called calc_streak(), was loaded with the data. We can use this function to calculate the lengths of each shooting streak. Once we have these streak lengths, we can look at their distribution.

# In the following line of code, the $ symbol tells R that
# we want the variable **basket** in the **kobe** dataset.
kobe_streak <- calc_streak(kobe$basket)

# Let's look at the lengths of each streak
kobe_streak

##  [1] 1 0 2 0 0 0 3 2 0 3 0 1 3 0 0 0 0 0 1 1 0 4 1 0 1 0 1 0 1 2 0 1 2 1 0
## [36] 0 1 0 0 0 1 1 0 1 0 2 0 0 0 3 0 1 0 1 2 1 0 1 0 0 1 3 3 1 1 0 0 0 0 0
## [71] 1 1 0 0 0 1

# Now, let's create a table summarizing the lengths of the streaks
table(kobe_streak)

## kobe_streak
##  0  1  2  3  4 
## 39 24  6  6  1

# Create a histogram of these results
histogram(~kobe_streak, width=1, col="lightblue", border="white", type="count", 
          xlab="Streak Length", ylab="Number of streaks", xlim=c(-1, 7))

From all of this, we can see the longest shooting streak for Kobe was 4. What can we compare this to?

If Kobe did not have a hot hand, we might assume his shot attempts are independent of one another. Under an assumption of independence, making one shot would not increase the likelihood of Kobe making the next shot. Using Kobe’s estimated 44% shooting percentage, we could write this assumption as:

No hot hand: P(making one shot) = 0.44 P(making next shot given he made the first shot) = 0.44

If, on the other hand, Kobe did have a hot hand, the probability of making a second shot might increase.

Hot hand: P(making one shot) = 0.44 P(making next shot given he made the first shot) = 0.60

To see if Kobe had a hot hand, we can compare his shooting streaks to a simulated player who is guaranteed to have independent shots: a coin.

To conduct this simulation, we will need:

• P(heads) = 0.44 (to represent Kobe’s shooting percentage)

• Run 133 replications of single coin flips (equal to the number of shots in this data frame)

Simulate this experiment with 10,000 replications. Estimate the p-value.

# The code has been provided
bball <- Do(133) * nflip(n = 1, prob = 0.44)

# Calculate the length of each streak and store as "sim_streak"
# Don't worry about understanding this code
sim_streak <- diff(which(c(0, bball$nflip, 0)==0)) - 1

# Histogram
histogram(~sim_streak, width=1, col="lightblue", border="white", type="count", 
          xlab="Streak Length for simulated shooter", ylab="Number of streaks", xlim=c(-1, 7))

Compare the streak lengths of the simulated shooter to those of Kobe Bryant. From this comparison, do you have evidence that Kobe had a hot hand? Briefly explain.

Answer: Replace this text with your answer.

Generating (pubishing) your lab report

When you’ve finished typing all your answers to the exercises, you’re ready to publish your lab report. To do this, look at the top of your source panel (the upper-left pane) and locate the Knit button: Drawing

Click that button and RStudio should begin working to create an .html file of your lab report. It may take a minute or two to compile everything, but the report should open automatically when finished.

Once the report opens, quickly look through it to see all your completed exercises. Assuming everything looks good, send that lab1report.html file to me via email or print it out and hand it in to me.

You’ve done it! Congratulations on finishing your first lab!

This lab, released under a Creative Commons Attribution-ShareAlike 3.0 Unported license, is a derivative of OpenIntro Lab: Probability, originally developed by Mark Hansen (UCLA Statistics) and adapted by Andrew Bray (Mt. Holyoke) & Mine Çetinkaya-Rundel (Duke).