---
title: "Correlation"
output:
  html_document:
    code_folding: hide
    css: http://www.bradthiessen.com/batlab3.css
    df_print: tibble
    fig_height: 3.9
    fig_width: 6.3
    highlight: pygments
    theme: spacelab
    toc: yes
    toc_float: yes
  html_notebook:
    code_folding: hide
    css: http://www.bradthiessen.com/batlab3.css
    fig_height: 3.9
    fig_width: 6.3
    highlight: pygments
    theme: spacelab
    toc: yes
---

```{r 'global options', echo=FALSE, message=FALSE, results='hide'}
knitr::opts_chunk$set(
  comment = "# ",
  collapse = TRUE,
  fig.height = 3.9,
  fig.width = 6.3
)
```

```{r 'prereqs', message=FALSE}
# Install and load all necessary packages
list.of.packages <- c("tidyverse", "mosaic", "energy", "ggExtra")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages)
library(tidyverse)
library(mosaic)
library(energy)
library(ggExtra)
```

*****

# Scenario: Villains wear black

Do NFL teams with more malevolent-looking uniforms play more aggressively than other teams?

![](uniforms.jpg)
[In 1988](http://faculty.smu.edu/chrisl/courses/psyc5351/articles/blackuniforms.pdf) 25 volunteers were paid $2 each to rate the **perceived malevolence** of each NFL team's uniform and [logo](http://www.sportslogos.net/teams/list_by_league/7). The total yards each team was penalized that year were then recorded as z-scores (to represent **standardized penalty yards** for each team). Here's a sample of the [data](http://www.bradthiessen.com/html5/data/nflm.csv): ![](data.png)
***** ## Covariance 1. Describe the relationship between these two variables. Are larger values of *malevolence* associated with larger values of *penalty yards*? When one variable deviates from its mean, does the other variable deviate from its mean in a similar way?

![](Transparent.gif)

Recall the concept of **variance**: $s_{x}^{2}=\frac{\sum \left ( x_{i}-\bar{X} \right )^{2}}{n-1}=\frac{\sum \left ( x_{i}-\bar{X} \right )\left ( x_{i}-\bar{X} \right )}{n-1}=s_{xx}$
We can visualize the *covariance* of our data in a couple ways: ##### {.tabset .tabset-fade} ###### diagram
![](cov.png)
###### scatterplot

```{r 'nfl-sample-scatterplot', message=FALSE, fig.height = 3, fig.width = 4}
nfl_sample <- tibble(
  malevolence = c(5.1, 4.68, 4, 3.9, 3.38, 2.8),
  penalty = c(1.19, 0.29, -0.73, -0.81, 0.04, -1.6))

ggplot(nfl_sample, aes(x = malevolence, y=penalty)) +
  geom_point(color="steelblue", size=3)
```

#### .

To calculate the **covariance**, we can modify the formula for variance:

$\textrm{cov}_{xy}=s_{xy}=\frac{\sum \left ( x_{i}-\bar{X} \right )\left ( y_{i}-\bar{Y} \right )}{n-1}=\frac{(5.1-3.977)(1.19-(-0.27))+...+(2.8-3.977)(-1.60-(-0.27))}{6-1}=0.6889$.

```{r 'calculate-covariance'}
# This calculates the variance-covariance matrix
cov(nfl_sample)
```
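To see what `cov()` is doing, the formula can be applied term by term. This is a minimal sketch in base R; the names `dev_x`, `dev_y`, and `manual_cov` are only illustrative.

```{r 'covariance-by-hand'}
# Same six teams as nfl_sample
malevolence <- c(5.1, 4.68, 4, 3.9, 3.38, 2.8)
penalty     <- c(1.19, 0.29, -0.73, -0.81, 0.04, -1.6)

# Deviations of each value from its own mean
dev_x <- malevolence - mean(malevolence)
dev_y <- penalty - mean(penalty)

# Sum of the cross-products, divided by n - 1
manual_cov <- sum(dev_x * dev_y) / (length(malevolence) - 1)
manual_cov

# Should agree with the built-in function
all.equal(manual_cov, cov(malevolence, penalty))
```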
2. Under what conditions would the covariance be negative? positive? zero?

![](Transparent.gif)

3. Suppose we multiplied every value of X and Y by 10. What would happen to the covariance? What are the lowest and highest possible values for a covariance?

```{r 'calculate-covariance-with-bigger-data'}
# Multiply everything by 10
ten_times <- nfl_sample * 10
cov(ten_times)
```

![](Transparent.gif)

That's one limitation of using covariance as a measure of the strength of the relationship between two variables: **it depends on the scale of measurement used**. For example, the penalty yards variable in our data is reported as z-scores. If we were to use (unstandardized) penalty yards, the covariance would increase to 4.822. If we used penalty **feet** (instead of yards), the covariance would increase again, to 14.467. These examples show that larger values of covariance do not necessarily indicate stronger relationships between variables. Because of this, we shouldn't directly compare covariances.
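The scale dependence follows directly from the formula: multiplying X by a constant a and Y by a constant b multiplies every cross-product, and therefore the covariance, by ab. Here is a quick check in base R using the six-team sample. The constants are arbitrary choices for illustration.

```{r 'covariance-rescaling'}
malevolence <- c(5.1, 4.68, 4, 3.9, 3.38, 2.8)
penalty     <- c(1.19, 0.29, -0.73, -0.81, 0.04, -1.6)

base_cov <- cov(malevolence, penalty)

# Rescaling one variable by 3 (as in yards -> feet) triples the covariance
cov(malevolence, 3 * penalty) / base_cov

# Rescaling both variables by 10 multiplies the covariance by 10 * 10
cov(10 * malevolence, 10 * penalty) / base_cov
```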
*****

### Correlation: standardized covariance

To address this scale dependence, we need to use a unit of measurement *(e.g., meters, pounds, decibels)* into which any scale of measurement *(e.g., penalty distance, uniform malevolence)* could be converted. If we use the standard deviation as that standard unit of measurement, we can standardize the covariance by:

- Standardizing our variables and calculating the covariance: $\frac{\sum \left ( z_{xi}-\bar{Z_{x}} \right )\left ( z_{yi}-\bar{Z_{y}} \right )}{n-1}=\frac{\sum z_{x}z_{y}}{n-1}=r_{xy}$

or

- Calculating the covariance and standardizing it: $\frac{\sum \left ( x_{i}-\bar{X} \right )\left ( y_{i}-\bar{Y} \right )}{(n-1)s_{x}s_{y}}=r_{xy}$
This is called `Pearson's product-moment correlation coefficient` (or *r*).
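Both routes give the same number. The sketch below checks this on the six-team sample in base R; `r_from_z` and `r_from_cov` are illustrative names.

```{r 'correlation-two-ways'}
malevolence <- c(5.1, 4.68, 4, 3.9, 3.38, 2.8)
penalty     <- c(1.19, 0.29, -0.73, -0.81, 0.04, -1.6)
n <- length(malevolence)

# Route 1: standardize each variable, then take the covariance of the z-scores
z_x <- (malevolence - mean(malevolence)) / sd(malevolence)
z_y <- (penalty - mean(penalty)) / sd(penalty)
r_from_z <- sum(z_x * z_y) / (n - 1)

# Route 2: take the covariance, then divide by both standard deviations
r_from_cov <- cov(malevolence, penalty) / (sd(malevolence) * sd(penalty))

c(r_from_z = r_from_z, r_from_cov = r_from_cov)
```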
4. What are the smallest and largest values of r?

![](Transparent.gif)

***** ### Calculating the correlation coefficient ![](cor.png) 5. Expand the code to see how to calculate the correlation coefficient in R. Interpret this correlation. ```{r 'calculate-r'} # Using cor() cor(nfl_sample$malevolence, nfl_sample$penalty) # Using cor() within dplyr nfl_sample %>% summarize(correlation = cor(malevolence, penalty)) ```

![](Transparent.gif)

Let's now calculate the correlation for our full dataset:

```{r 'full-nfl-data'}
# Load data
nfl <- read.csv("http://www.bradthiessen.com/html5/data/NFLuniforms.csv")

# Rename variables
names(nfl) <- c("team", "malevolence", "penalty")

# Plot and annotate with correlation
ggplot(nfl, aes(x = malevolence, y=penalty)) +
  geom_point(color="steelblue", size=3) +
  annotate("text", x=4.5, y=-1.2,
           label=paste("r = ", round(cor(nfl$malevolence, nfl$penalty),4)))
```

*****

### Characteristics of the Pearson product-moment correlation

#### Scale invariance

The magnitude of the correlation does not change under any linear transformation of the variables. Expand the code to see an example.

```{r 'scale-invariance'}
# Let's convert malevolence to centi-malevolence (1 unit = 100 centi-units)
nfl$centi_malevolence <- nfl$malevolence * 100

# The correlation does not change
cor(nfl$centi_malevolence, nfl$penalty)
```
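More generally, for constants a and b (with b not zero), the correlation between a + bX and Y equals r when b is positive and -r when b is negative, so only the sign can change. A small check on the six-team sample, with arbitrary constants chosen for illustration:

```{r 'linear-transformation-check'}
malevolence <- c(5.1, 4.68, 4, 3.9, 3.38, 2.8)
penalty     <- c(1.19, 0.29, -0.73, -0.81, 0.04, -1.6)

r <- cor(malevolence, penalty)

# Positive slope: the correlation is unchanged
r_positive_slope <- cor(32 + 1.8 * malevolence, penalty)

# Negative slope: same magnitude, opposite sign
r_negative_slope <- cor(10 - 2 * malevolence, penalty)

c(original = r, positive_slope = r_positive_slope, negative_slope = r_negative_slope)
```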
#### Strength of *linear* relationship Pearson's r only measures the strength of a *linear* relationship. If we want to measure the strength of nonlinear relationships, we'll need another statistic. Here are some values of r for various scatterplots: ![](scattercor.png)
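As a concrete case, here is a perfect but nonlinear relationship for which Pearson's r is zero. The data are made up for illustration.

```{r 'nonlinear-zero-r'}
# y is completely determined by x, but the relationship is a parabola
x <- -10:10
y <- x^2

# x is symmetric around zero, so the positive and negative
# cross-products cancel and r is zero (up to rounding error)
r_parabola <- cor(x, y)
r_parabola
```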
#### Impact of outliers 6. Below, I've highlighted the two most extreme observations in our data. We know the correlation between malevolence and penalty yards for **all** the data is $r_{xy}=0.43$. What would happen to the value of this correlation if those two highlighted values were removed? ```{r 'calculate-r-with-outliers-removed'} # Add a variable to highlight the two outliers nfl$outlier <- c(1, rep(0,26), 1) # Create scatterplot with the outliers highlighted ggplot(data=nfl, aes(x = malevolence, y = penalty, color=outlier)) + geom_point(size=3) + theme(legend.position="none") # Calculate correlation without highlighted observations nfl %>% filter(outlier==0) %>% summarize(correlation = cor(malevolence, penalty)) ```

![](Transparent.gif)

#### Impact of range restriction Let's simulate a large dataset with a strong, positive correlation: ```{r 'simulate-data'} # Simulate data sim_data <- tibble( x = c(1:1000), y = 2*x + rnorm(1000, 3, 300)) # Plot ggplot(data = sim_data, aes(x, y)) + geom_point(color="steelblue", size=2, alpha=0.5) + annotate("text", x = 125, y = 2222, label=paste("r = ",round(cor(sim_data$x, sim_data$y),3))) + geom_vline(xintercept=900, color="red") ```
7. Look at that simulated data. Suppose we restrict the range to include only data in which $x>900$ (the data to the right of the red line). What will happen to the magnitude of the correlation? Why?

```{r 'calculate-r-with-range-restriction'}
# Calculate correlation when only considering x>900
sim_data %>%
  filter(x>900) %>%
  summarize(correlation = cor(x, y))
```

![](Transparent.gif)

#### Correlation does not imply causation Everyone thinks they know *correlation is not causation*. A correlation does not imply a causal relationship, and a *lack of correlation* does not mean there is *no* relationship between two variables. There are many factors that contribute to this. We've already seen that outliers and range restriction can influence the magnitude (and even sign) of Pearson's r. Likewise, if two variables have a nonlinear relationship, Pearson's r will misrepresent that relationship. Yet another reason is the **third-variable problem**. 8. The number of heart attacks in a given month is positively correlated with ice cream sales. Explain why this does not provide evidence that ice cream causes heart attacks.

![](Transparent.gif)

A [2004 article](http://ije.oxfordjournals.org/content/33/3/464.long) questioned research into the relationship between hormone replacement therapy (HRT) and coronary heart disease (CHD). Numerous studies had shown negative correlations between HRT and CHD, leading doctors to propose that HRT was protective against CHD. Randomized controlled trials showed, in fact, that HRT caused a small but significant **increase** in the risk of CHD. Re-analysis of the data from the original studies showed that women undertaking HRT were more likely to be from *higher socioeconomic groups*. These women, therefore, were more likely to have better-than-average diets and exercise regimens. The **third variable** linking hormone replacement therapy to coronary heart disease was **socioeconomic status**.
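The third-variable problem is easy to reproduce with simulated data. In this sketch (all values are simulated, and `z` is a stand-in for any lurking variable such as socioeconomic status), `x` and `y` never influence each other, yet they are strongly correlated because both depend on `z`.

```{r 'third-variable-simulation'}
set.seed(1)
n <- 1000

# z plays the role of the lurking third variable
z <- rnorm(n)

# x and y each depend on z, but not on each other
x <- z + rnorm(n, sd = 0.5)
y <- z + rnorm(n, sd = 0.5)

# x and y are strongly correlated...
r_xy <- cor(x, y)

# ...but the correlation nearly vanishes once z is accounted for
# (correlate what is left of x and y after regressing each on z)
r_xy_given_z <- cor(resid(lm(x ~ z)), resid(lm(y ~ z)))

c(r_xy = r_xy, r_xy_given_z = r_xy_given_z)
```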

9. In 2002, a letter in Diabetes Care detailed the calculation behind the correlation of **r = 0.54** between **diabetes rates** and **pollution levels** across all 50 states.

How would you interpret this correlation?
The author tried to be careful in concluding, *"... the correlation between air emissions and the prevalence of diabetes does not prove a cause-and-effect relationship; the significance of the relationship demands attention."*

In response, Mark Nicolich questioned if the relationship even demanded attention. Using the same diabetes data, Nicolich calculated the correlation between **diabetes rates** and the **alphabetized rank of each state** to be **r = 0.49**. He also found the correlation between **diabetes rates** and the **latitude of each state's capital** to be **r = -0.54**.

![](Transparent.gif)

10. See if you can think of a scenario which would result in the following scatterplot. What is to be learned from this? ```{r 'simulated-test-scores'} test <- tibble( x = c(rnorm(100, 6, 2), rnorm(100, 10, 2), rnorm(100, 14, 2)), y = c(rnorm(100, 6, 2), rnorm(100, 10, 2), rnorm(100, 14, 2)), group = as.factor(c(rep(1,100), rep(2,100), rep(3,100))) ) # Calculate overall correlation overall <- test %>% summarize(r = cor(x, y)) # Calculate correlation for each subgroup subs <- test %>% group_by(group) %>% summarize(r = cor(x, y)) # Plot and annotate with correlations ggplot(data = test, aes(x, y, color = group)) + geom_point(size = 2, alpha=0.7) + geom_smooth(method="lm", se=FALSE) + theme(legend.position="none") + annotate("text", x = 5, y = 18, label=paste("overall r =", round(overall, 2)), color="black") + annotate("text", x = 2, y = 4, label = paste("r =", round(subs[1,2], 2)), color="red") + annotate("text", x = 16, y = 9, label = paste("r =", round(subs[2,2], 2)), color="forestgreen") + annotate("text", x = 18.8, y = 16, label = paste("r = ", round(subs[3,2], 2)), color = "blue") ```

![](Transparent.gif)

If you'd like to explore other examples in which correlation is mistaken for causation, check out: [Spurious correlations](http://www.tylervigen.com/spurious-correlations) [Correlation or Causation](http://jfmueller.faculty.noctrl.edu/100/correlation_or_causation.htm)

***** ## Nonzero correlations? 11. What correlation would you expect to find between two sets of random values? Expand the code to see --> ```{r 'random-correlations'} set.seed(3141) # Generate random values for X and Y random_xy <- tibble( x = rnorm(100, 0, 1), y = rnorm(100, 0, 1) ) cor(random_xy$x, random_xy$y) ggplot(data = random_xy, aes(x, y)) + geom_point(size=3, color="steelblue") + annotate("text", x = 0, y = 2.8, label=paste("overall r =", round(cor(random_xy$x, random_xy$y), 3)), color="black") ```

![](Transparent.gif)

The correlation of any two variables in any sample of data will almost never be exactly zero, so how can we have any confidence that a correlation is "real"?
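To get a feel for how far from zero a correlation can land by chance alone, we can repeat the experiment many times. This sketch (the sample size and number of repetitions are arbitrary) draws independent normal values for x and y, so the true correlation is zero.

```{r 'null-correlation-spread'}
set.seed(2)

# 2000 samples of n = 100 pairs in which x and y are independent
null_r <- replicate(2000, cor(rnorm(100), rnorm(100)))

# How many of the sample correlations are exactly zero?
sum(null_r == 0)

# They typically miss zero by about 1 / sqrt(n - 1), roughly 0.1 here
sd(null_r)
```

With smaller samples the spread is wider, which is why a moderate correlation from a small dataset needs a formal test before we take it seriously.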
### Randomization-based test 12. A subset of our data is displayed below. What would the (null) randomization hypothesis be in this scenario? Under that hypothesis, how likely were the Raiders (with a 5.1 malevolence rating) to have 1.19, 0.48, or 0.27 standardized penalty yards?

Explain how we will use randomization-based methods to test our randomization hypothesis.

![](sample.png)

![](Transparent.gif)

Let's take a look at some randomizations of our data: ```{r 'function-to-plot-multiple-plots', echo=FALSE} # Multiple plot function multiplot <- function(..., plotlist=NULL, file, cols=1, layout=NULL) { library(grid) # Make a list from the ... arguments and plotlist plots <- c(list(...), plotlist) numPlots = length(plots) # If layout is NULL, then use 'cols' to determine layout if (is.null(layout)) { # Make the panel # ncol: Number of columns of plots # nrow: Number of rows needed, calculated from # of cols layout <- matrix(seq(1, cols * ceiling(numPlots/cols)), ncol = cols, nrow = ceiling(numPlots/cols)) } if (numPlots==1) { print(plots[[1]]) } else { # Set up the page grid.newpage() pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout)))) # Make each plot, in the correct location for (i in 1:numPlots) { # Get the i,j matrix positions of the regions that contain this subplot matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE)) print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row, layout.pos.col = matchidx$col)) } } } ``` ```{r 'randomized-correlations'} # Store our sample correlation as test_stat test_stat <- cor(nfl$malevolence, nfl$penalty) # Randomize our data p1 <- nfl %>% ggplot(., aes(x = malevolence, y = penalty)) + geom_point(color="steelblue", size=2) + geom_smooth(method="lm", se=FALSE, color="red", size=.2) + annotate("text", x = 4, y = 1, label = "actual data", color="red", size=3) p2 <- nfl %>% mutate(shuffled_penalty = sample(penalty)) %>% ggplot(., aes(x = malevolence, y = shuffled_penalty)) + geom_point(color="steelblue", size=2) + geom_smooth(method="lm", se=FALSE, color="red", size=.2) + annotate("text", x = 4, y = 1, label = "randomization #1", color="red", size=3) p3 <- nfl %>% mutate(shuffled_penalty = sample(penalty)) %>% ggplot(., aes(x = malevolence, y = shuffled_penalty)) + geom_point(color="steelblue", size=2) + geom_smooth(method="lm", se=FALSE, color="red", size=.2) + annotate("text", x = 4, y = 1, 
label = "randomization #2", color="red", size=3) p4 <- nfl %>% mutate(shuffled_penalty = sample(penalty)) %>% ggplot(., aes(x = malevolence, y = shuffled_penalty)) + geom_point(color="steelblue", size=2) + geom_smooth(method="lm", se=FALSE, color="red", size=.2) + annotate("text", x = 4, y = 1, label = "randomization #3", color="red", size=3) p5 <- nfl %>% mutate(shuffled_penalty = sample(penalty)) %>% ggplot(., aes(x = malevolence, y = shuffled_penalty)) + geom_point(color="steelblue", size=2) + geom_smooth(method="lm", se=FALSE, color="red", size=.2) + annotate("text", x = 4, y = 1, label = "randomization #4", color="red", size=3) p6 <- nfl %>% mutate(shuffled_penalty = sample(penalty)) %>% ggplot(., aes(x = malevolence, y = shuffled_penalty)) + geom_point(color="steelblue", size=2) + geom_smooth(method="lm", se=FALSE, color="red", size=.2) + annotate("text", x = 4, y = 1, label = "randomization #5", color="red", size=3) # This uses a "multiplot" function that was loaded silently multiplot(p1, p4, p2, p5, p3, p6, cols=3) ```
Each randomization of data yields a non-zero correlation. Let's run 10,000 randomizations and plot all 10,000 correlations. ```{r 'randomization-correlation'} # Using cor() shuffled_r <- Do(10000) * cor(malevolence ~ shuffle(penalty), data=nfl) # Using dplyr # shuffled_rs <- Do(10000) * nfl %>% # mutate(shuffled = sample(penalty)) %>% # summarize(cor = cor(shuffled, nfl$malevolence)) # Calculate likelihood of test statistic pvalue <- prop(shuffled_r$cor >= test_stat) pvalue # Plot simple histogram # histogram(~cor, data=shuffled_r, # width=.04, # xlab="Possible correlations assuming null hypothesis is true") # ladd(panel.abline(v=test_stat)) # Add vertical line at test statistic # Plot ggplot(data = shuffled_r, aes(x = cor)) + geom_histogram(binwidth = 0.04, fill="lightblue", color="white", alpha=0.8) + annotate("segment", x = test_stat, xend = test_stat, y = 0, yend = 600, color = "red") + annotate("text", x = test_stat, y = 630, label = "observed r = 0.43", color = "red") + annotate("text", x = test_stat+.11, y = 170, label = paste("p =",round(pvalue,4)), color = "red") + labs( title = "Randomized Correlations", x = "r" ) + scale_x_continuous(limits = c(-.7,.7), breaks=seq(-.6, .6, .3), minor_breaks=NULL) + theme( axis.text.x = element_text(size = 11, color="grey10"), legend.position = "none", panel.grid.major.y = element_line(colour = "white", size=.25), panel.grid.major.x = element_line(colour = "white", size=.15), panel.grid.minor = element_blank(), panel.background = element_rect(fill = "grey93") ) ```
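The `Do()` and `shuffle()` functions come from the mosaic package. The same logic can be sketched in base R. Here it is applied to the six-team sample from earlier rather than the full dataset, so the p-value will not match the one above.

```{r 'randomization-base-r'}
set.seed(3)
malevolence <- c(5.1, 4.68, 4, 3.9, 3.38, 2.8)
penalty     <- c(1.19, 0.29, -0.73, -0.81, 0.04, -1.6)

observed_r <- cor(malevolence, penalty)

# Under the null hypothesis, every pairing of penalties with uniforms is
# equally likely, so shuffle the penalties and recompute r many times
shuffled_r <- replicate(10000, cor(malevolence, sample(penalty)))

# One-sided p-value: proportion of shuffles at least as large as the observed r
p_value <- mean(shuffled_r >= observed_r)
p_value
```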
### Bootstrap confidence interval 13. Interpret the following confidence interval and explain how it was constructed. ```{r 'bootstrap-ci'} boot <- Do(10000) * cor( malevolence ~ penalty, data=resample(nfl) ) bootstrapCI <- confint(boot, level = 0.95, method = "quantile") lower <- as.numeric(bootstrapCI[2]) # Store lower CI bound upper <- as.numeric(bootstrapCI[3]) # Store upper CI bound # Density plot ggplot(data = boot, aes(x = cor)) + geom_density(fill="lightblue", color="white", alpha = 0.8) + labs( title = "Bootstrap distribution of correlations", x = "bootstrap correlations" ) + scale_x_continuous(breaks=seq(-0.8, 0.8, 0.2), minor_breaks=NULL) + theme( axis.text.x = element_text(size = 11, color="grey10"), legend.position = "none", panel.grid.major.y = element_line(colour = "white"), panel.grid.major.x = element_line(colour = "white", size=.15), panel.grid.minor = element_blank(), panel.background = element_rect(fill = "grey93") ) + annotate("text", x = lower, y = .2, label = round(lower,3)) + annotate("text", x = upper, y = .2, label = round(upper,3)) + annotate("text", x = median(boot$cor), y = .25, label = paste("95%")) + annotate("segment", x = lower, xend = upper, y = 0.1, yend = .1, color = "red") ```
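For comparison, here is a sketch of a percentile bootstrap in base R, without the mosaic helpers. The data are hypothetical stand-ins (28 simulated "teams", matching the 1 + 26 + 1 rows implied by the outlier code earlier), not the real NFL values.

```{r 'bootstrap-base-r'}
set.seed(4)

# Hypothetical data standing in for 28 teams
teams <- data.frame(x = rnorm(28))
teams$y <- 0.5 * teams$x + rnorm(28)

# Resample rows (whole teams) with replacement and recompute r each time
boot_r <- replicate(10000, {
  rows <- sample(nrow(teams), replace = TRUE)
  cor(teams$x[rows], teams$y[rows])
})

# Percentile interval: the middle 95% of the bootstrap correlations
ci <- quantile(boot_r, probs = c(0.025, 0.975))
ci
```

Rows are resampled as whole units so that each team's x and y values stay paired; resampling the two columns separately would destroy the relationship we are trying to estimate.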

![](Transparent.gif)

### t-test for correlation If we assume our data are sampled from populations with normal distributions, we can use a t-test. In the next lesson, we'll derive the following [test statistic for the correlation coefficient](https://onlinecourses.science.psu.edu/stat414/node/254): $t_{n-2}=\frac{r_{xy}-0}{\textrm{SE}_{r_{xy}}}=\frac{r_{xy}}{\sqrt{\frac{1-r_{xy}^{2}}{n-2}}}=\frac{r_{xy}\sqrt{n-2}}{\sqrt{1-r_{xy}^{2}}}=2.427\ \ (p=.011)$
We can run this test in R using the `cor.test()` function: ```{r} # t-test for correlation coefficient cor.test(nfl$malevolence, nfl$penalty, alternative = c("greater"), method=c("pearson")) ```
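The test statistic can also be computed directly from the formula. This sketch plugs in the rounded correlation r = 0.43 and assumes n = 28 teams (inferred from the 1 + 26 + 1 rows in the outlier code), so the result can differ slightly in the third decimal place from the value reported above.

```{r 't-statistic-by-hand'}
r <- 0.43   # rounded correlation reported earlier
n <- 28     # assumed number of teams

# t = r * sqrt(n - 2) / sqrt(1 - r^2)
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
t_stat

# One-sided p-value from a t-distribution with n - 2 degrees of freedom
p_value <- pt(t_stat, df = n - 2, lower.tail = FALSE)
p_value
```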
14. Is there anything (worthwhile) we can conclude from this t-test?

![](Transparent.gif)

***** ## Other types of correlations ### Nonparametric correlations (correlations of ranks) #### Spearman's rho **Spearman's rho** uses the same formula as Pearson's r, except the data are **converted to ranks**.
15. Using data from 8 MBA graduates, calculate Spearman's rho to estimate the strength of the relationship between scores on the GMAT (a test they took prior to entering graduate school) and their grade point average in the MBA program. To do this, first convert the scores into ranks.
![](spearman.png)
```{r} # Enter the raw data mba <- tibble( gmat = c(710, 610, 640, 580, 545, 560, 610, 530), gpa = c(4, 4, 3.9, 3.8, 3.7, 3.6, 3.5, 3.5) ) # Calculate Pearson's r and Spearman's rho mba %>% summarize(r = cor(gmat, gpa), rho = cor(gmat, gpa, method="spearman")) ```
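To confirm that Spearman's rho really is Pearson's r applied to ranks, we can rank the data ourselves. In this sketch, `rank()` assigns tied values the average of the ranks they would occupy.

```{r 'spearman-by-ranks'}
gmat <- c(710, 610, 640, 580, 545, 560, 610, 530)
gpa  <- c(4, 4, 3.9, 3.8, 3.7, 3.6, 3.5, 3.5)

# Convert the scores to ranks (ties share an average rank)
gmat_rank <- rank(gmat)
gpa_rank  <- rank(gpa)

# Pearson's r on the ranks...
rho_by_hand <- cor(gmat_rank, gpa_rank)

# ...compared with the built-in Spearman option
rho_builtin <- cor(gmat, gpa, method = "spearman")

c(rho_by_hand = rho_by_hand, rho_builtin = rho_builtin)
```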
#### Kendall's tau **Kendall's Tau** is calculated as: $\tau =\frac{\textrm{(number of concordant pairs)}-\textrm{(number of discordant pairs)}}{\textrm{(number of concordant pairs)}+\textrm{(number of discordant pairs)}}$
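Counting concordant and discordant pairs is tedious by hand but short in code. This sketch uses the six-team NFL sample, which has no ties; with ties, `cor(..., method = "kendall")` applies a correction and will not match this simple formula exactly.

```{r 'kendall-by-hand'}
malevolence <- c(5.1, 4.68, 4, 3.9, 3.38, 2.8)
penalty     <- c(1.19, 0.29, -0.73, -0.81, 0.04, -1.6)

# Every pair of teams (i, j) with i < j
pairs <- combn(length(malevolence), 2)
i <- pairs[1, ]
j <- pairs[2, ]

# A pair is concordant when both variables order the two teams the same way
direction  <- sign(malevolence[i] - malevolence[j]) * sign(penalty[i] - penalty[j])
concordant <- sum(direction > 0)
discordant <- sum(direction < 0)

tau <- (concordant - discordant) / (concordant + discordant)
c(concordant = concordant, discordant = discordant, tau = tau)
```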
16. A new worker is assigned to a machine that manufactures bolts. Each day, a sample of bolts is examined and the percent defective is recorded. Do the following data indicate a significant improvement over time for that worker? Calculate Kendall's tau by first calculating the number of concordant and discordant observations below each value.
![](kendall.png)
### Correlations for categorical data

We've calculated correlation coefficients for continuous variables, but correlation coefficients can also be calculated for categorical data. We won't have time to investigate these in class, but they are all very easy to learn:

- [Phi coefficient](https://en.wikipedia.org/wiki/Phi_coefficient) for 2x2 contingency tables
- [Cramér's V](https://en.wikipedia.org/wiki/Cramér%27s_V) for two nominal variables
- [Biserial correlation](http://www.andrews.edu/~calkins/math/edrm611/edrm13.htm#BIS) for one continuous and one artificially dichotomous (ordinal) variable
- [Point-biserial correlation](https://en.wikipedia.org/wiki/Point-biserial_correlation_coefficient) for one continuous and one (artificial or natural) dichotomous variable
- [Polychoric correlation](https://en.wikipedia.org/wiki/Polychoric_correlation) for two ordinal variables with underlying continuous distributions

### Correlations for nonlinear relationships #### Distance correlation The [distance correlation](http://personal.bgsu.edu/~mrizzo/energy/AOAS312.pdf) is a modern measure of the relationship between two variables, even when the relationship is *nonlinear*. Earlier in this lesson, you saw a figure with values of Pearson's r for various scatterplots. Compare that to the following figure showing values of the distance correlation:
![](distance.png)
Recall Pearson's r is a standardized *covariance*. The numerator of the covariance is the sum of the products of two distances (distance from the mean of X and distance from the mean of Y) over all points: $\frac{\sum (x_{i}-\bar{X})(y_{i}-\bar{Y})}{n-1}$. The covariance is maximized when all data points are arranged along a straight line.

The numerator of the **distance covariance** is similar to that of the covariance. The difference is that the distances are between varying *data points*, not between *a data point and the mean*. The distance covariance is defined by the sum of the products of the two distances over all pairs of points. The distance covariance is maximized when the data are arranged along a straight line *locally* (when the data overall represent a chain in any shape).

Let's calculate the distance correlation for an extremely small dataset. Our dataset will be:

`X = 1, 2, 3`

`Y = 3, 1, 5`

Let's first calculate Pearson's r and Spearman's rho for these data:

```{r}
# Input data
x <- c(1, 2, 3)
y <- c(3, 1, 5)

# Pearson's r
cor(x, y, method="pearson")

# Spearman's rho
cor(x, y, method="spearman")
```
Now, let's calculate the distance covariance:

- First, we calculate all Euclidean distances between pairs of observations for X and then for Y.

    $a_{x}=\begin{bmatrix}0 &1 &2 \\ 1&0 &1 \\ 2 &1 &0 \end{bmatrix} \textrm{ and } b_{y}=\begin{bmatrix}0 &2 &2 \\ 2&0 &4 \\ 2 &4 &0 \end{bmatrix}$

    The off-diagonal entries are the distances between observations (1 and 2), (1 and 3), and (2 and 3).

- Next, *double-center* each distance matrix by taking each value and (1) subtracting its row mean, (2) subtracting its column mean, and (3) adding the grand mean. This gives us:

    $A_{x}=\begin{bmatrix} -1.11 &0.22 &0.89 \\ 0.22&-0.44 &0.22 \\ 0.89 &0.22 &-1.11\end{bmatrix} \textrm{ and } B_{y}=\begin{bmatrix} -0.89 &0.44 &0.44 \\ 0.44&-2.22 &1.78 \\ 0.44 &1.78 &-2.22\end{bmatrix}$

- Now, multiply each value in the A matrix by its corresponding value in the B matrix (an element-wise product, not a matrix product):

    $A \circ B=\begin{bmatrix}0.9877 &0.0988 &0.3951 \\ 0.0988&0.9877 &0.3951 \\ 0.3951 &0.3951 &2.4691\end{bmatrix}$

- Calculate the average of all the values in this matrix:

    $\textrm{dCov}_{xy}^{2}=0.69136$

- Finally, take the square root of that value:

    $\textrm{dCov}_{xy}=\sqrt{0.69136}=0.8315$
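The steps above can be sketched in a few lines of base R. The helper name `double_center` is illustrative and not part of any package.

```{r 'distance-covariance-by-hand'}
x <- c(1, 2, 3)
y <- c(3, 1, 5)

# Step 1: pairwise distance matrices
a <- as.matrix(dist(x))
b <- as.matrix(dist(y))

# Step 2: subtract row means and column means, then add back the grand mean
double_center <- function(d) {
  sweep(sweep(d, 1, rowMeans(d)), 2, colMeans(d)) + mean(d)
}
A <- double_center(a)
B <- double_center(b)

# Steps 3-5: average the element-wise products, then take the square root
dcov_xy <- sqrt(mean(A * B))
dcov_xy
```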
To convert that distance *covariance* to a distance *correlation*, we standardize by dividing by the square root of the product of the *distance variances* of X and Y. Thankfully, we can let R do all these calculations for us:

```{r}
# We loaded the energy package earlier
# library(energy)

# Calculate all the components of a distance correlation
# Verify the covariance is 0.8315
DCOR(x, y)

# Here's a direct way to calculate the distance correlation
dcor(x, y)
```
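If you'd like to see the standardization spelled out, here is a sketch in base R. It assumes the distance variance of a variable is its distance covariance with itself, which is my understanding of what the energy package reports; `double_center` and `dcor_by_hand` are illustrative names.

```{r 'distance-correlation-by-hand'}
# Double-centered distance matrix for a single variable
double_center <- function(v) {
  d <- as.matrix(dist(v))
  sweep(sweep(d, 1, rowMeans(d)), 2, colMeans(d)) + mean(d)
}

dcor_by_hand <- function(x, y) {
  A <- double_center(x)
  B <- double_center(y)
  dcov   <- sqrt(mean(A * B))  # distance covariance of x and y
  dvar_x <- sqrt(mean(A * A))  # distance variance of x
  dvar_y <- sqrt(mean(B * B))  # distance variance of y
  dcov / sqrt(dvar_x * dvar_y)
}

dcor_xy <- dcor_by_hand(c(1, 2, 3), c(3, 1, 5))
dcor_xy
```

The result should agree with the output of `dcor(x, y)` for the same data.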
*****

## Comparing correlation coefficients

Let's look at some correlation coefficients for various simulated datasets.

```{r, echo=FALSE, message=FALSE, warning=FALSE}
# Scatterplot titled with Pearson's r, Spearman's rho, and the distance correlation (D)
plot_correlations <- function(df) {
  ggplot(data = df, aes(x = x, y = y)) +
    geom_point(alpha = .3, color = "steelblue", fill = "lightblue", shape = 20, size = 5) +
    theme_grey() +
    ggtitle(paste("r =", round(cor(df$x, df$y), 2),
                  "; rho =", round(cor(df$x, df$y, method = "spearman"), 2),
                  "; D =", round(dcor(df$x, df$y, index = 1.0), 2)))
}

# Positive linear relationship
p1 <- plot_correlations(tibble(x = c(1:1000), y = 2*x + 3 + rnorm(1000, 0, 300)))

# Negative linear relationship with much more noise
p2 <- plot_correlations(tibble(x = c(1:1000), y = -2*x + 3 + rnorm(1000, 0, 3000)))

# Quadratic relationship
p3 <- plot_correlations(tibble(x = c(-500:500), y = 2*x^2 + 3 + rnorm(1001, 0, 50000)))

# Sinusoidal relationship (x has 501 values, so we need 501 noise values)
p4 <- plot_correlations(tibble(x = c((0:500)/20), y = sin(x) + rnorm(501, 0, .1)))

# Points on a circle
angle <- as.numeric(c(1:1000)) - 500
p5 <- plot_correlations(data.frame(x = sin(angle), y = cos(angle)))

p1
p2
p3
p4
p5
```
***** # Your turn 19. The data.frame `midwest` contains demographic information about 437 counties in the Midwest. The list of variables is displayed below.

**(a)** Calculate Pearson's r to measure the correlation between `percollege` (the percent of people in the county with a college education) and `percbelowpoverty` (the percent of people below the poverty line).

**(b)** Calculate Spearman's rho for the same two variables.

**(c)** Calculate Kendall's tau for those two variables.

**(d)** Calculate a distance correlation between those variables.

**(e)** Conduct a randomization-based test of Pearson's r and state your conclusions.

**(f)** Construct a bootstrap confidence interval for Pearson's r.

**(g)** Conduct a t-test for Pearson's r.

```{r}
# Load data
data(midwest)

# Display variable names and types
str(midwest)

# Display scatterplot
ggplot(data=midwest, aes(x = percollege, y = percbelowpoverty)) +
  geom_point(alpha=.5, color="steelblue", fill="lightblue", shape=20, size=5) +
  theme_grey()
```

20. Calculate Kendall's tau for the NFL malevolence data we've used in this lesson. Then, conduct a randomization-based test for Kendall's tau. State your conclusion.

*****

# Publishing Your Turn solutions

- Download the [Your Turn Solutions template](http://www.bradthiessen.com/html5/stats/m301/yourturn9.Rmd)
    - RStudio will automatically open it if it downloads as a *.Rmd* file.
    - If it downloads as a *.txt* file, then you can:
        - open the file in a text editor and copy its contents
        - open RStudio and create a new **R Markdown** file (html file)
        - delete everything and paste the contents of the template
- Type your solutions into the [Your Turn Solutions template](http://www.bradthiessen.com/html5/stats/m301/yourturn9.Rmd)
    - I've tried to show you *where* to type each of your solutions/answers
    - You can run the code as you type it.
- When you've answered every question, click the **Knit HTML** button located at the top of RStudio.
    - RStudio will begin working to create a .html file of your solutions.
    - It may take a few minutes to compile everything (if you've conducted any simulations)
    - While the file is compiling, you may see some red text in the console.
    - When the file is finished, it will open automatically (or it will give you an error message).
    - If the file does not open automatically, you'll see an error message in the console.
- Once you're satisfied with your solutions, email that html file to [thiessenbradleya@sau.edu](mailto:thiessenbradleya@sau.edu) or give me a printed copy.
- If you run into any problems, [email me](mailto:thiessenbradleya@sau.edu) or work with other students in the class.

**License** This document is released under a [Creative Commons Attribution-ShareAlike 3.0 Unported](http://creativecommons.org/licenses/by-sa/3.0) license.