--- title: "Your Turn #1" author: "Comparing two means" output: html_document: code_folding: show css: http://www.bradthiessen.com/batlab3.css df_print: tibble fig_height: 3.9 fig_width: 6.3 highlight: pygments theme: spacelab toc: no --- ***** **Author(s):** [Enter names of students working on these solutions] ***** ```{r 'prereqs', message=FALSE, echo=FALSE} # Above, type the names of the students completing this assignment. # You can scroll past the rest of this shaded section (code chunk). # It only loads packages, data, and functions you need in this assignment. # The following code installs the tidyverse and mosaic packages # if they are not already installed. list.of.packages <- c("tidyverse", "mosaic") new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])] if(length(new.packages)) install.packages(new.packages) # This will load those packages library(tidyverse) library(mosaic) # This will load the datasets and functions we'll use in this assignment # Spider data frame: spider <- tibble( group = c( rep("photo", 12), rep("real", 12) ), anxiety = c(30, 35, 45, 40, 50, 35, 55, 25, 30, 45, 40, 50, 40, 35, 50, 55, 65, 55, 50, 35, 30, 50, 60, 39)) # Function to calculate cohen's d: cohens_d <- function (x, ..., data = parent.frame(), only.2 = TRUE) { m <- mean_(x, ..., data = data) # means v1 <- var(x, ..., data = data)[1] # variance of group 1 v2 <- var(x, ..., data = data)[2] # variance of group 2 nms <- names(m) # names of groups nn <- aggregate(x, ..., data = data, length) # sample sizes n1 <- nn[1,2]-1 # n1 - 1 n2 <- nn[2,2]-1 # n2 - 1 md <- abs(diff(m)) # difference in means sp <- (n1 * v1) + (n2 * v2) # spooled step 1 sp <- sp / (n1 + n2) # spooled step 2 sp <- sqrt(sp) # spooled step 3 cd <- md/sp # cohen's d if (length(nms) > 2 && only.2) { stop("You have more than two groups") } paste("Cohen's d = ", round(cd,3)) } # Function to calculate D effect size D_effect <- function (x, ..., data = parent.frame(), only.2 = TRUE) { m <- mean_(x, ..., data = data) # means v1 <- var(x, ..., data = data)[1] # variance of group 1 v2 <- var(x, ..., data = data)[2] # variance of group 2 nms <- names(m) # names of groups nn <- aggregate(x, ..., data = data, length) # sample sizes n1 <- nn[1,2]-1 # n1 - 1 n2 <- nn[2,2]-1 # n2 - 1 md <- abs(diff(m)) # difference in means sp <- (n1 * v1) + (n2 * v2) # spooled step 1 sp <- sp / (n1 + n2) # spooled step 2 sp <- sqrt(sp) # spooled step 3 DD <- md / (sp * sqrt(2)) PD <- pnorm(DD, mean=0, sd=1, lower.tail=TRUE) # Calculate prob. if (length(nms) > 2 && only.2) { stop("You have more than two groups") } paste("D = ", round(PD,3)) } # Function to calculate Cohen's d without the text # Cohen's d cohens_d_2 <- function (x, ..., data = parent.frame(), only.2 = TRUE) { m <- mean_(x, ..., data = data) # means v1 <- var(x, ..., data = data)[1] # variance of group 1 v2 <- var(x, ..., data = data)[2] # variance of group 2 nms <- names(m) # names of groups nn <- aggregate(x, ..., data = data, length) # sample sizes n1 <- nn[1,2]-1 # n1 - 1 n2 <- nn[2,2]-1 # n2 - 1 md <- abs(diff(m)) # difference in means sp <- (n1 * v1) + (n2 * v2) # spooled step 1 sp <- sp / (n1 + n2) # spooled step 2 sp <- sqrt(sp) # spooled step 3 cd <- md/sp # cohen's d if (length(nms) > 2 && only.2) { stop("You have more than two groups") } cd } # Here's a function to calculate the difference in group medians diffmedian <- function (x, ..., data = parent.frame(), only.2 = TRUE) { m <- median(x, ..., data = data) res <- m-lag(m) # This calculates the median minus the other median res[2] } ``` ***** 25. A [1999 study](http://www.ncbi.nlm.nih.gov/pubmed/10224215) investigated the relationship between breastfeeding and intelligence. Researchers calculated an intelligence score for 322 four-year old children (237 of whom were breastfed) using the [McCarthy Scales of Children's Abilities](https://en.wikipedia.org/wiki/McCarthy_Scales_of_Children%27s_Abilities).

The data have been loaded in a data.frame named `bfeed` with variables:
   `group`: identifies whether each child was **breastfed** or **not**
   `score`: each child's score on the intelligence test.


a) Plot the data to visualize the intelligence scores for the two groups of children.

b) Calculate and interpret one effect size for the difference in mean intelligence scores between the two groups.

c) Conduct an independent samples t-test to compare the groups. State your hypotheses and the assumptions you're making. Estimate and interpret the p-value.

d) Calculate and interpret a **90**% confidence interval for the difference in means between the two groups.

Some summary statistics for each group are provided below. ```{r 'Q25-a-b'} # The data is loaded as bfeed bfeed <- read.csv("http://www.bradthiessen.com/html5/stats/m300/bfeed.csv") # Rename variables and groups library(forcats) bfeed <- bfeed %>% rename(group = Feeding, score = GCI) %>% mutate(group = fct_recode(group, "breastfed" = "Breastfed", "not" = "NotBreastfed")) # Report means and standard deviations bfeed %>% group_by(group) %>% summarize(n = n(), mean = mean(score), sd = sd(score)) # a) Plot the data to visualize the intelligence scores # You can modify the code I used at the beginning of the activity # to create a dotplot or boxplot. You'll need to change the name # of the data frame (from "spider" to "bfeed") and the names of the # variables (from "anxiety" and "group" to "score" and "group"). # You don't need any of the lines of code that start with: theme(. # Type your code right below this line. # b) Calculate either cohen's d or D. You can use the functions # "cohens_d" or "D_effect." Type your code right below this line. ``` ```{r 'Q25-b', eval=FALSE} # In this shaded box, type your interpretation of the effect size you # just calculated. The effect size represents... ``` ```{r 'Q25-c'} # c) Conduct a t-test using code similar to this: # t.test(YYY ~ XXX, data = DDD, alternative = "greater") ``` ```{r 'Q25-c-2', eval=FALSE} # Write your hypotheses here: Null hypothesis: Alternative hypothesis: Assumptions I am making: Interpretation of the p-value: ``` ```{r 'Q25-d'} # d) Make sure you construct a 90% interval (not 95%): ``` ```{r 'Q25-d-2', eval=FALSE} This confidence interval represents... ```

26. The `simdata` data frame contains simulated data from a fictitious study on weightloss. Subjects were randomly assigned to one of two `diet` groups (*Diet A* or *Diet B*). The number of pounds each subject lost (`weightloss`) was then recorded.

Summary statistics, two effect sizes, and the results of an independent samples t-test are reported below. From all of this, are you willing to conclude Diet A results in greater weightloss than Diet B? What caused the p-value and effect sizes to be so low? ```{r 'sim-data', message=FALSE} # You'll type your answers in the next shaded box # Simulate the data set.seed(1234) # Set random number generator to replicate this data simdata <- tibble( diet = c(rep("A", 20000), rep("B", 20000)), weightloss = c(rnorm(20000, 2, 20), rnorm(20000, 1, 20)) ) # Summary statistics simdata %>% group_by(diet) %>% summarize(n = n(), mean = mean(weightloss), sd = sd(weightloss)) # Effect sizes cohens_d(weightloss ~ diet, data = simdata) D_effect(weightloss ~ diet, data = simdata) # t-test t.test(weightloss ~ diet, data = simdata, alternative="greater") ``` ```{r 'Q26', eval=FALSE} # Write your conclusion here # Explain what caused both the p-value and effect sizes to be low. ```

27. Let's go back to our `spider` data frame. Use bootstrap methods to construct a 95% confidence interval for **Cohen's d** (the effect size of the difference in anxiety levels). To do this, you'll want to use the `cohens_d_2` function I wrote (instead of the `diffmeans` function we used to construct a confidence interval for the mean difference in the activity). Plot the distribution of bootstrap estimates and report the confidence interval limits. From this confidence interval, what can you conclude about the effect size? ```{r 'cohens-d-bootstrap', message=FALSE} # You'll calculate the effect size using # cohens_d_2(anxiety ~ group, data = spider) # You'll want to calculate the effect size # as you repeatedly sample your data with replacement # Your code will look something like this: # bootsamples <- Do(###) * cohens_d_2(anxiety ~ group, data = spider) # But you'll need to make sure you are *resampling* your data # To plot those bootstrap estimates of the effect size, you can simply use: # histogram(~real, data=bootsamples) # To calculate the confidence interval, you'll use something like: # confint(bootsamples, level = 0.95, method = "quantile") ```

28. Let's try something new with our `spider` data frame. Use randomization-based NHST methods to compare the **medians** of the real spider vs. photo groups. Plot the randomized sampling distribution and estimate the p-value. To calculate the difference in medians, you can use a `diffmedian()` function I wrote for this lesson. ```{r 'medians-example'} # To calculate the difference in medians for our spider data # you can use: diffmedian(anxiety ~ group, data=spider) ```

29. Suppose our spider data frame didn't contain anxiety levels for two different groups of subjects. Instead, suppose 12 subjects first entered the room with the real spider for 5 minutes. After recording each subject's anxiety level, suppose these same subjects then spent 5 minutes in the room with the photo of the spider. In this *repeated measures* study, the anxiety levels of the same 12 subjects were measured twice.

Conduct a test to determine if the real spider elicited more anxiety than the photo for these 12 subjects. ```{r 'dependent-test', message=FALSE} # We no longer have two independent groups. We needed that independence # assumption to calculate either SEpooled or the Satterthwaite approximation. # What can we do with two dependent groups (the same group measured twice)? # Let's separate our data into the two repeated measures # First, let's get anxiety scores related to the real spider real <- spider %>% filter(group == "real") %>% transmute(real = anxiety) # Now, we'll get anxiety scores for those same people with the photo photo <- spider %>% filter(group == "photo") %>% transmute(photo = anxiety) # We can combine these into a single data frame spider2 <- data.frame(real, photo) # Let's look at this data spider2 # This shows that instead of having 24 independent anxiety scores # we have a pair of anxiety scores for each of 12 subjects. # The first subject had an anxiety of 40 with the real spider # and 30 with the photo. # Now, let's calculate the difference in anxiety scores for each subject spider2 <- spider2 %>% mutate(difference = real - photo) # We can see this difference column now... spider2 # Under a true null hypothesis, we would expect this difference to be zero # Let's see what our sample mean difference equals mean(spider2$difference) # How likely were we to get an average difference score of 7 # if the null hypothesis were true? Let's use a theory-based # null hypothesis significance test # Run the following code: # t_test(anxiety ~ group, data=spider, paired=TRUE, alternative="less") ```
![](Transparent.gif)
You're done!