--- title: "Analyzing Categorical Data" author: "Lesson 8" output: html_document: css: http://www.bradthiessen.com/batlab2.css highlight: pygments theme: spacelab toc: yes toc_depth: 5 fig_width: 5.6 fig_height: 4 --- ```{r message=FALSE, echo=FALSE} # Load the mosaic package library(mosaic) # Load ggplot2 library(ggplot2) date <- c(rep("Oct. 9, 1992", 2), rep("Oct. 14, 1992", 4), rep("Oct. 19, 1992", 17)) amount <- c(1927.46, 27902.31, 86241.90, 72117.46, 81321.75, 97473.96, 93249.11, 89658.17, 87776.89, 92105.83, 79949.16, 87602.93, 96879.27, 91806.47, 84991.67, 90831.83, 93776.67, 88336.72, 94639.69, 83709.28, 96412.21, 88432.86, 71552.15) arizona <- data.frame(date, amount) ``` ***** ## Scenario: Accounting Fraud? Drawing
In 1993, Wayne James Nelson was accused of defrauding the state of Arizona out of $2 million by diverting funds to a bogus vendor. Nelson claimed he diverted random payments to the bogus vendor to demonstrate the absence of safeguards in a new computer system. The prosecutors argued that the diverted funds were *not* randomly selected payments and, in fact, Nelson created the payments in attempt to make them *appear* random. Here are the amounts of the 23 checks Nelson diverted to the bogus vendor: ```{r} arizona ``` If you look closely, you'll notice:    • The amounts start small and then increase. This may be expected in the case of fraud.    • Many amounts are just below $100,000 (which may be the limit at which transactions receive greater scrutiny)
If we look at the **distribution of the first digits** of each of those amounts, we find: ```{r} # Extract first digit of each amount and store as "firstdigit" arizona$firstdigit <- as.numeric(substr(arizona$amount, 1, 1)) # Get table of first digits tally(~firstdigit, margins=TRUE, data=arizona) # Bar chart of first digits histogram(~firstdigit, data=arizona, col="lightblue", border="white", xlab="First Digit", width=1) ```
**Does the distribution of these first digits give us evidence that the amounts were *not* randomly selected?** If so, it may support the accusation that Nelson created these amounts in an attempt to defraud the state.
### What do you expect? What *should* the distribution of leading digits be in a set of financial transactions? Without giving it much thought, we might expect a uniform distribution: ```{r, echo=FALSE} # Create uniform distribution of digits from 1-9 unif <- c(1, 2, 3, 4, 5, 6, 7, 8, 9) # Plot histogram histogram(~unif, col="lightblue", border="white", width=1, xlab="Uniform Distribution") ```
It turns out, however, that the distribution of leading digits of financial records looks like this: ```{r, echo=FALSE} d <- c(1, 2, 3, 4, 5, 6, 7, 8, 9) benford <- log10(1 + 1/d) benford <- c(rep(1, 301), rep(2, 176), rep(3, 125), rep(4, 97), rep(5, 79), rep(6, 67), rep(7, 58), rep(8, 51), rep(9, 46)) histogram(~benford, col="lightblue", border="white", width=1, xlab="Benford's Distribution") ```
#### Benford's Law This distribution, generated from [Benford's Law](http://testingbenfordslaw.com/distance-of-stars-from-earth-in-light-years), applies to first digits of numeric data that have a wide distribution and are generated by a *natural* process (as opposed to being generated by a human). Some examples of data that follow Benford's Law include: stock market volumes, populations of cities, distances to stars in the universe, house addresses, and numbers of followers on Twitter. What is Benford's Law? It originates back in the nineteenth century, when Simon Newcombe visited the library. Like most mathematicians of the time, Newcombe relied on reference books like *Arithmetica Logarithmica* to help with calculations. The book provided page after page of logarithms calculated to 10 decimal places of every number up to 100,000.
Drawing
As Newcombe used this book, he noticed the pages at the beginning of the book were dirtier than the pages towards the end. In fact, he noticed the pages with numbers beginning with 1 were the dirtiest, followed by pages beginning with the number 2, and so on. The cleanest pages were for numbers beginning with 9. Newcombe, realizing the dirt came from people **using** some pages more than others, argued: >"That the ten digits do not occur with equal frequency must be evident to anyone making much use of logarithmic tables, and noticing how much faster the first pages wear out than the last ones. The first significant figure is oftener 1 than any other digit, and the frequency diminishes up to 9." Newcombe published his findings in an 1881 article, describing a law for the probability that *any* number begins with a particular digit: **Probability of a number beginning with digit *d* is:**    $\textrm{log}_{10}\left ( d+1 \right )-\textrm{log}_{10}\left ( d \right )=\textrm{log}_{10}\left ( 1+\frac{1}{d} \right )$
##### Why does it work (and why isn't it called *Newcombe's Law*)? Imagine you have a [pencil](http://www.datagenetics.com/blog/march52012/index.html) that is *one unit* long. It doesn't matter what *one unit* means to you -- it could be centimeters, inches, cubits, or anything else. Now imagine that pencil slowly and continuously growing in length. For a long time, the length will be **1.x** units long. In fact, it will need to *double* in length (a 100% change) before the leading digit changes from 1 to 2. Once the leading digit is 2, however, it only needs to change in length by 50% to get a leading digit of 3. Once at 3, a change of 33.3% will bring the leading digit to a 4. The following figure shows that as we move along a logarithmic scale, there's a shorter distance between each subsequent digit. In fact, once we get to a leading digit of 9, we only need an 11% increase to go back to a leading digit of 1. Drawing
In the figure displayed below, the amount of "time" the leading digit is a 1 has been shaded red. You can see the shaded area represents roughly 30% of the entire figure: Drawing
If each leading digit is given its own color, we can compare the relative widths of each shaded region: Drawing
It turns out that the width of each colored segment is proportional to   $\textrm{log}_{10}\left ( d+1 \right )-\textrm{log}_{10}\left ( d \right )=\textrm{log}_{10}\left ( 1+\frac{1}{d} \right )$ Drawing
These probabilities tell us what to **expect** if we examine the first digits of a series of financial records.
Unfortunately for Newcombe, his article went largely unnoticed. Fifty-seven years later, a physicist named Frank Benford made the *exact same discovery* as Newcome through the *exact same observation* of dirtier pages at the beginning of logarithm books, and derived the *exact same formula*. Benford, though, had the resources to test his theory empirically. Benford spent years gathering 20,229 observations from a diverse range of data sets: baseball statistics, areas of rivers, atomic weights of elements–even arbitrary numbers plucked from Reader’s Digest articles. This mix of all kinds of different datasets happened to be a [near-perfect fit](http://alfre.dk/how-to-commit-tax-fraud/) with his theory. With these additional findings, Benford's article quickly gained popularity and the distribution of first-digits became known as **Benford's Law**.

***** ### Is he guilty? Now that we know what we should **expect** for the distribution of leading digits, we simply need to compare it to what we **observed**. ```{r, echo=FALSE, message=FALSE, warnings=FALSE} # Convert benford to data.frame for ggplot2 benforddata <- data.frame(benford) # Plot histogram of observed data and line for expected plotty <- ggplot() + geom_histogram(data=arizona, aes(x = firstdigit, y=..density..), binwidth=1, alpha=.5, color="white", fill="steelblue", size=.5) + geom_freqpoly(data=benforddata, aes(x = benford, y=..density..), alpha=1, color="red", linetype=1, size=2, binwidth=1) + xlim(1, 10) + labs(title ="Observed (blue) vs Expected (red) First Digits", x = "First Digit of Amounts", y = "Proportion") suppressWarnings(print(plotty)) ```
1. Explain why we can't look at the previous plot and conclude that Nelson committed fraud? Why aren't we certain that Nelson did *not* randomly choose 23 transactions?







**How can we calculate the likelihood of observing these first-digits if, in fact, Nelson had selected a random sample of 23 transactions (from a population that follows Benford's Law)?** To derive some methods to estimate this likelihood, let's turn to a simpler scenario.
***** ### Scenario: Rolling a die Suppose we roll a single 6-sided die 120 times. We might get results like:
Drawing



2. Assuming the die is fair, fill-in the expected frequencies.

3. We need to think of a method to determine how "far off" our observations are from our expectations. Ideally, it would be a single value we could calculate from the table displayed above. What ideas do you have?









4. Use your method to calculate the "distance" of the observations to the expectations for each of the tables displayed below. Are the observed values in each table the same "distance" from their expectations? Should they be?
Drawing









#### Chi-squared Statistic (goodness-of-fit) To measure the distance from an **observed set of frequencies** to a **theoretical distribution of expectations**, we can use a **chi-squared** statistic:       $\chi _{c-1}^{2}=\sum_{c=1}^{c}\frac{\left ( \textrm{observed}_{c} -\textrm{expected}_{c}\right )^{2}}{\textrm{expected}_{c}}=\sum \frac{\left ( \textrm{O}_{c}-\textrm{E}_{c} \right )^{2}}{\textrm{E}_{c}}$
To estimate a p-value, we compare this value to a [chi-squared distribution](http://lock5stat.com/statkey/theoretical_distribution/theoretical_distribution.html#chi).
5. Calculate this chi-squared goodness-of-fit test for the following table of die rolls. Estimate the p-value by using this [applet](http://lock5stat.com/statkey/theoretical_distribution/theoretical_distribution.html#chi)?
Drawing
We can calculate this chi-square goodness-of-fit test in R: ```{r} # Input the observed frequencies of each digit die <- c(23, 15, 19, 24, 21, 18) # Input the expected probabilities of each digit dieprob <- c(1/6, 1/6, 1/6, 1/6, 1/6, 1/6) # Run the chi-square test using chisq.test # rescale.p = TRUE just makes sure the probabilities sum to 1 chisq.test(die, p = dieprob, rescale.p = TRUE) ```
#### Randomization-based chi-squared goodness-of-fit test We could also conduct this goodness-of-fit test using randomization-based methods. In this simple example with rolling a die 120 times, we expect to see an equal proportion of 1s, 2s, 3s, 4s, 5s, and 6s. In other words, we expect `P(1) = P(2) = ... = P(6) = 1/6`. With this assumption, we can randomly sample 120 values from the discrete distribution we just identified with those probabilities. In this case, we'll have the computer randomly sample 120 values from a distribution in which there is an equal chance of choosing any number between 1-6: ```{r} # Run the chi-square test using chisq.test # Set simulate.p.value = TRUE to run the randomization-based test # B = the number of randomizations to run chisq.test(die, p = dieprob, rescale.p = TRUE, simulate.p.value = TRUE, B=5000) ```
6. The p-value from the randomization-based method differs slightly from that from the theory-based chi-squared distribution. What conditions are necessary in order to use the theoretical chi-squared distribution? Are those conditions met in this scenario?










***** ### Back to the fraud case: Is he guilty? Let's conduct this goodness-of-fit test on our potential fraud case: ```{r} # First, let's input the frequency of each first digit that we observed nelsondigits <- c(1, 1, 0, 0, 0, 0, 3, 9, 1) # Now, let's calculate the expected probabilities of each digit # First, I'll input the digits 1-9 digits <- c(1, 2, 3, 4, 5, 6, 7, 8, 9) # Now, I'll calculate the probabilities using Benford's Law benford <- log10(1 + 1/digits) # Let's take a look at those probabilities benford # Now we can run the chi-squared goodness-of-fit test # First via theory-based methods: chisq.test(nelsondigits, p = benford) # Next with randomization-based methods chisq.test(nelsondigits, p = benford, simulate.p.value = TRUE, B=5000) ``` 7. What conclusions can we draw? Explain why the p-value from the theory-based method differs so much from the randomization-based method?











***** ## Another Goodness-of-Fit Example: Exponential distribution Testing for goodness-of-fit allows us to test whether a sample of data come from a theoretical distribution. Suppose the following data represent time between breakdowns for a machine: ```{r} times <- c(0.1, 0.1, 0.6, 0.6, 1.3, 1.6, 2.4, 2.4, 3.7, 3.7, 4.9, 5.0, 6.2, 6.3, 10.3, 10.6, 0.1, 0.2, 0.7, 0.7, 1.7, 1.8, 2.4, 2.7, 4.1, 4.1, 5.2, 5.3, 7.4, 7.5, 10.9, 12.3, 0.2, 0.3, 0.3, 0.4, 0.8, 0.9, 0.9, 1.0, 1.9, 1.9, 2.0, 2.0, 2.8, 2.8, 2.8, 3.0, 4.2, 4.2, 4.3, 4.3, 5.3, 5.7, 5.7, 5.9, 7.5, 7.7, 8.1, 8.6, 13.9, 13.7, 13.9, 14.8, 0.5, 0.6, 0.6, 1.1, 1.2, 1.3, 2.1, 2.2, 2.2, 3.1, 3.5, 3.7, 4.5, 4.9, 4.9, 6.1, 6.2, 6.2, 9.2, 9.5, 10.0, 14.9, 17.3, 17.6) ```
We want to know if the **exponential distribution** is an appropriate model for the time we wait until the machine breaks down. To test this, let's place our sample data into **bins**. I'll arbitrarily choose bins of width = 3 units:
Drawing
That last bin only has 2 observations, so I'll merge it with the preceding bin:
Drawing
Now that we have our observed data in bins, we need to calculate our expected frequencies (or probabilities). We could do this by hand, using the exponential distribution:
Drawing
We could then calculate our chi-square goodness-of-fit statistic:
Drawing
We could also do this in R: ```{r} # The exponential distribution is defined by lambda = rate = 1 / E(x) # Calculate probabilities for each bin # This calculates probabilities under the exponential distribution p1 <- pexp(3, rate = 1/mean(times), lower.tail = TRUE) #P(0 < x < 3) p2 <- pexp(6, rate = 1/mean(times)) - pexp(3, rate = 1/mean(times)) #P(3 12) # Store all the probabilities as "probs" probs <- c(p1, p2, p3, p4, p5) # There's an easier way to do the following, but I think this makes sense # Let's store our observed data as frequencies in each bin # To do this, I'm going to "cut" my data into 5 bins # The bins are defined by the endpoints 0, 3, 6, 9, 12, and infinity # I name the bins "1", "2", "3", "4", and "5" cuts <- cut(times, breaks = c(0, 2.99, 6, 9, 12, Inf), labels=1:5) # Now I can get frequencies for each bin tally(~cuts, margins=TRUE) # I'll store those frequencies in a vector called "timebins" # This code does it automatically timebins <- c(length(cuts[cuts==1]), length(cuts[cuts==2]), length(cuts[cuts==3]), length(cuts[cuts==4]), length(cuts[cuts==5])) # Now I run the chi-squared test chisq.test(timebins, p = probs) # Next with randomization-based methods chisq.test(timebins, p = probs, simulate.p.value = TRUE, B=5000) ```
9. What conclusions can we draw? Explain how this method could be used to assess the normality assumption (prior to conducting an ANOVA, perhaps).










***** ## Scenario: Male pattern heart disease? [Researchers](http://news.harvard.edu/gazette/2000/01.27/bald.html) selected a sample of 663 heart disease patients and a control group of 772 males not suffering from heart disease. Each subject was classified into one of 5 baldness categories: Drawing
Drawing
```{r, echo=FALSE} # Enter data baldness <- c(rep("none", 582), rep("little", 386), rep("some", 380), rep("much", 84), rep("extreme", 3)) disease <- c(rep("yes", 251), rep("no", 331), rep("yes", 165), rep("no", 221), rep("yes", 195), rep("no", 185), rep("yes", 50), rep("no", 34), rep("yes", 2), rep("no", 1)) baldheart <- data.frame(baldness, disease) ``` ```{r} # Mosaic plot mosaicplot(table(baldheart)) ``` 10. Do these results give us evidence that heart disease is associated with baldness? Calculate the relative risk of heart disease for each category of baldness. Then, calculate odds ratios.











Can we use a chi-square test to test for a relationship between heart disease in baldness? We can if we're able to derive our **expectations** for each cell.
### Expectation: Independence 11. Write out a null hypothesis for the relationship between heart disease and baldness.







12. What's it mean for two variables, A and B, to be **independent**?







13. If A and B are **independent**, P(A | B) = ___________ and P(A and B) = ____________________?





14. Suppose we’re interested in the relationship between gender and the result of a coin toss. We get 50 men and 50 women to each flip a coin and record the result. Assuming gender has no influence on the result of a coin toss, fill-in the **expectations** for each cell: Drawing





15. Derive a general formula to calculate the expectations for any rxc contingency table.









Suppose we conduct this experiment and get the following results:
Drawing
#### Fisher's Exact Test
16. Look at the top-left cell. We expected 25 but observed 30. Using the [hypergeometric distribution](http://stattrek.com/online-calculator/hypergeometric.aspx), calculate the probability of observing 30 or more males flipping heads if gender and coin toss results are independent.







We can conduct this test (**Fisher's Exact Test**) in R: ```{r} # P(30 or more heads) = 1 - P(29 or fewer heads), where # m = 40 heads in the population # n = 60 tails in the population # k = 50 chosen for males 1 - phyper(29, 40, 60, 50, lower.tail=TRUE) # Enter the data for Fisher's Test coin <- c(rep("heads", 40), rep("tails", 60)) gender <- c(rep("male", 30), rep("female", 10), rep("male", 20), rep("female", 40)) coingender <- data.frame(coin, gender) # Examine the data tally(coin ~ gender, margins = TRUE, data=coingender) # Fisher's test fisher.test(coingender$gender, coingender$coin, alternative="less") ```
You can see the results from Fisher's Exact Test match the calculation from the hypergeometric distribution. This method is more difficult to use with larger numbers (or when our table isn't a 2x2 table). #### Chi-square test for independence Since we have expected and observed observations for categorical variables, we could also calculate a chi-square statistic. 17. Compute a chi-squared statistic and [estimate a p-value](http://lock5stat.com/statkey/theoretical_distribution/theoretical_distribution.html#chi). How many degrees-of-freedom does your test statistic have? Drawing











We can conduct this **chi-square test for independence** in R: ```{r} # Our data are still stored in the "coingender" data.frame # We use chisq.test(x, y) to run the test chisq.test(coingender$gender, coingender$coin, correct=FALSE) # We can estimate the p-value with randomization-based methods chisq.test(coingender$gender, coingender$coin, simulate.p.value=TRUE, B=5000) ```
### So... heart disease and baldness? Let's enter our heart disease / baldness data into R and run both Fisher's Exact Test and a Chi-square Test for Independence: ```{r} # Enter data baldness <- c(rep("none", 582), rep("little", 386), rep("some", 380), rep("much", 84), rep("extreme", 3)) disease <- c(rep("yes", 251), rep("no", 331), rep("yes", 165), rep("no", 221), rep("yes", 195), rep("no", 185), rep("yes", 50), rep("no", 34), rep("yes", 2), rep("no", 1)) baldheart <- data.frame(baldness, disease) # Check for entry errors tally(disease ~ baldness, margins = TRUE, data=baldheart) # Run Fisher's Exact Test fisher.test(baldheart$baldness, baldheart$disease, alternative="less") # Chi-square test for independence chisq.test(baldheart$baldness, baldheart$disease, correct=FALSE) # We can estimate the p-value with randomization-based methods chisq.test(baldheart$baldness, baldheart$disease, simulate.p.value=TRUE, B=5000) ```
18. Based on these results, what conclusions can you make?







***** ## Chi-squared tests - notes and warnings ### Conditions of the chi-square test One problem with the chi-square test is that the sampling distribution of the test statistic we calculate has an *approximate* chi-square distribution. With larger sample sizes, the approximation is good enough. With smaller sample sizes, however, p-values estimated from the chi-square distribution can be inaccurate. This is why many textbooks recommend you only use chi-square tests when the expected frequencies in each cell are **at least 5**. Even with large overall sample sizes, expected cell frequencies below 5 result in a loss of statistical power. When you have cells with less than 5 observations, you can:     • Use Fisher's Exact Test     • Use a likelihood ratio test (like the g-test we'll see in a bit)     • Merge cells into larger bins to ensure observed frequencies are at least 5 Another condition of the chi-square test is **independence**. All observations must contribute to only one cell of the contingency table. You cannot use chi-square tests on repeated-measures designs.
### Effect sizes If you have a large sample size, the chi-square test will produce a low p-value... even if the difference in proportions is small. It's imperative that whenever you analyze contingency tables, you calculate some sort of effect size. You could calculate relative risk or odds ratios to give an estimate of the differences among categories.
### Yates's Correction When you have a 2x2 contingency table, the chi-square test (with one degree of freedom) tends to produce p-values that are too low (i.e., it makes Type I errors). One suggestion to address this issue is **Yates's continuity correction**. Yates's correction simply has you subtract 0.5 from the absolute value of the deviation in the numerator of our chi-square test statistic:       $\chi _{\textrm{Yates}}^{2}=\frac{\left ( \left | O_{i}-E_{i} \right |-0.5 \right )^{2}}{E_{i}}$ This correction lowers our chi-square test statistic, thereby yielding a higher p-value. Although this may seem like a nice solution, there is evidence that this correction **over-corrects** and produces chi-square values that are too small. If you want to use Yates's correction in R, you simply tell R that `correct=TRUE`: ```{r} # Chi-square test for that baldness data chisq.test(baldheart$baldness, baldheart$disease, correct=TRUE) ```
### Likelihood ratio tests (G-test) An alternative to the chi-square test (that happens to produce results similar to the chi-square test) is the **G-test**. The G-test is an example of a **likelihood ratio test (LRT)**. We'll see other likelihood ratio tests later in the course. For now, let's conduct an LRT on a simple dataset. #### Scenario: Graduating from college Suppose you hypothesize 75% of St. Ambrose students graduate within 6 years. You then collect a sample of 1000 students and find 770 of them graduated within 6 years. You can think of this data as frequencies across two cells: 770 (graduated) and 230 (did not). With this, we could calculate a chi-square goodness-of-fit test: ```{r} # Input data graduated <- c(770, 230) # Input probabilities probabilities <- c(.75, .25) # Run the test chisq.test(graduated, p = probabilities) ```
We could also calculate the likelihood of observing this data under our hypothesis. If we think of each student as an independent trial with two possible outcomes (1 = graduated; 0 = did not), we can use the binomial distribution. We want to calculate **P(X = 770)** under a binomial distribution with **n = 1000 trials** and **p = 0.75** (our hypothesized value). Let's store this likelihood as `null`: ```{r} # P(X = 770) in binomial distribution with n=1000, p=0.75 null <- dbinom(770, size=1000, prob=0.75) null ```
Now let's an alternative hypothesis that the true proportion of students who graduate is 77%. We'll store this result as `alternative`: ```{r} # P(X = 770) in binomial distribution with n=1000, p=0.77 alternative <- dbinom(770, size=1000, prob=0.77) alternative ```
From these results, we'd conclude the data we obtained were more likely to be obtained if the true proportion of St. Ambrose students that graduate within 6 years is 77%. Let's take the ratio of the null likelihood to the alternative likelihood: ```{r} # Likelihood ratio LR <- null / alternative LR ```
This ratio will get smaller as the observed results get further away from our null hypothesis. If we take the natural log of this ratio and multiply it by -2, we get the *log-likelihood* (or **G-statistic**).       $G=-2\textrm{ln}\frac{L_{\textrm{null}}}{L_{\textrm{alternative}}}$ We multiply by -2 and take the natural log because it results in a test statistic that approximates a chi-square distribution. Let's calculate this statistic for our graduation example and determine its likelihood from a chi-square distribution: ```{r} # Calculate G-statistic G <- -2 * log(LR) G # Compare it to chi-square distribution with df = 2-1 pchisq(G, df=1, lower.tail=FALSE) ```
For this simple example, our p-value is 0.140. From this, we have very little evidence to support the belief that 77% of students graduate from St. Ambrose within 6 years.
Calculating the G-statistic can be tedious. Thankfully, there's an easier formula to use (based on the fact that in the ratio of likelihoods, a lot of stuff cancels out):       $G=2\sum_{i=1}^{c}\left [ O_{i}-\textrm{ln}\frac{O_{i}}{E_{i}} \right ]$
The G-test can also be used to test for independence. You can see how to [conduct this test in R here](http://rcompanion.org/rcompanion/b_04.html).
#### G-test vs. Chi-square test Since both the G-test and the chi-squared test can be used for both goodness-of-fit and independence, which test should you use? It depends. The G-test is more flexible and can be used for more elaborate (and additive) experimental designs. The chi-square test, on the other hand, is more easily understood and explained. The G-test uses the multinomial distribution in its calculation, whereas the chi-square test provides only an approximation.

***** # Your turn 19. Criminologists have long debated whether there is a relationship between weather conditions and the incidence of violent crime. The article *Is There a Season for Homocide?* (Criminology, 1988: 287-296) classified 1361 homicides according to season, resulting in table displayed below. Conduct a chi-square goodness-of-fit test (and the randomization-based chi-square test) to determine if seasons are associated with different numbers of violent crimes.
Drawing

20. You may remember the dataset we analyzed in [lesson #3](http://bradthiessen.com/html5/stats/m301/lab3.html) regarding the amount of weight lost by subjects on various diets (e.g., Atkins, WeightWatchers). Prior to analyzing that data, we tested the homogeneity of variance assumption with Levene's test. Let's test for normality using a chi-square goodness-of-fit test.

21. Is acupuncture effective in reducing pain? A [2007 study](http://www.ncbi.nlm.nih.gov/pubmed/17893311) looked at 1,162 patients with chronic lower back pain. The subjects were randomly assigned to one of 3 groups:
  • **Verum acupuncture** practiced according to traditional Chinese principles,
  • **Sham acupuncture** where needles were inserted into the skin but not at acupuncture points,
  • **Traditional** (non-acupuncture) therapy consisting of drugs, physical therapy, and exercise.

Results are displayed in the table below.

**(a)** Calculate the probability that an individual undergoing verum acupuncture experienced a substantial reduction in pain.

**(b)** Calculate this same probability for the subjects in the traditional group.

**(c)** Calculate and interpret the ratio of the two probabilities you just calculated.

**(d)** Run a chi-square test for independence and interpret the result.
Drawing

22. The table displayed below shows the number of black and white defendants sentenced to death in the 1960s-70s for murdering a white victim. As you can see, a higher proportion of black defendants were sentenced to death.

**(a)** Run Fisher's Exact Test to calculate the likelihood of observing results as or more extreme than these results if, in fact, race and the death penalty were independent.

**(b)** Run a chi-square test for independence. Drawing

***** # Publishing your solutions When you're ready to try the **Your Turn** exercises: - Download the appropriate [Your Turn file](http://bradthiessen.com/html5/M301.html). - If the file downloads with a *.Rmd* extension, it should open automatically in RStudio. - If the file downloads with a *.txt* extension (or if it does not open in RStudio), then: - open the file in a text editor and copy its contents - open RStudio and create a new **R Markdown** file (html file) - paste the contents of the **Your Turn file** over the default contents of the R Markdown file. - Type your solutions into the **Your Turn file**. - The file attempts to show you *where* to type each of your solutions/answers. - Once you're ready to publish your solutions, click the **Knit HTML** button located at the top of RStudio: Drawing - Once you click that button, RStudio will begin working to create a .html file of your solutions. - It may take a minute or two to compile everything, especially if your solutions contain simulation/randomization methods - While the file is compiling, you may see some red text in the console. - When the file is finished, it will open automatically. - If the file does not open automatically, you'll see an error message in the console. - Send me an [email](mailto:thiessenbradleya@sau.edu) with that error message and I should be able to help you out. - Once you have a completed report of your solutions, look through it to see if everything looks ok. Then, send that yourturn.html file to me [via email](mailto:thiessenbradleya@sau.edu) or print it out and give it to me.
Sources: [Explanation of Benford's Law](http://www.datagenetics.com/blog/march52012/index.html) [Chi-squared randomization applet](http://lock5stat.com/statkey/advanced_goodness/advanced_goodness.html) [Chi-squared distribution applet](http://lock5stat.com/statkey/theoretical_distribution/theoretical_distribution.html#chi) [List of categorical analysis methods](https://en.wikipedia.org/wiki/List_of_analyses_of_categorical_data) This document is released under a [Creative Commons Attribution-ShareAlike 3.0 Unported](http://creativecommons.org/licenses/by-sa/3.0) license.