Activity #6: R Syntax

Examples from Activity #6

9. Faked numbers in tax returns, payment records, invoices, expense account claims, and many other settings often display patterns that aren’t present in legitimate records. Some patterns, like too many round numbers, are obvious and easily avoided by a clever crook. Other patterns are more subtle. It is a striking fact that the ^irst digits of numbers in legitimate records often follow a distribution known as Benford’s Law. Here is that distribution:

First Digit	1	2	3	4	5	6	7	8	9
Proportion	.301	.176	.125	.097	.079	.067	.058	.051	.046

R comes with some built-in datasets. I’m going to use a dataset called state.x77. Let’s look at the first several rows to see what this dataset contains:

Data1 <- state.x77
head(Data1)

##            Population Income Illiteracy Life Exp Murder HS Grad Frost
## Alabama          3615   3624        2.1    69.05   15.1    41.3    20
## Alaska            365   6315        1.5    69.31   11.3    66.7   152
## Arizona          2212   4530        1.8    70.55    7.8    58.1    15
## Arkansas         2110   3378        1.9    70.66   10.1    39.9    65
## California      21198   5114        1.1    71.71   10.3    62.6    20
## Colorado         2541   4884        0.7    72.06    6.8    63.9   166
##              Area
## Alabama     50708
## Alaska     566432
## Arizona    113417
## Arkansas    51945
## California 156361
## Colorado   103766

The entire dataset has values for each of the 50 states.

Benford’s law holds for first digits, so we need to select the first digit from those variables. To do this, we use a function called substr()

Let’s get the first digits of the Area variable:

FirstArea <- substr(as.character(state.area),1,1)
FirstArea

##  [1] "5" "5" "1" "5" "1" "1" "5" "2" "5" "5" "6" "8" "5" "3" "5" "8" "4"
## [18] "4" "3" "1" "8" "5" "8" "4" "6" "1" "7" "1" "9" "7" "1" "4" "5" "7"
## [35] "4" "6" "9" "4" "1" "3" "7" "4" "2" "8" "9" "4" "6" "2" "5" "9"

We can then see the frequency of each of those first digits:

tally(~FirstArea)

## 
##  1  2  3  4  5  6  7  8  9 
##  8  3  3  8 11  4  4  5  4

To get a larger dataset, let’s get the first digits of each of the variables:

FirstDigits <- substr(as.character(Data1),1,1)
tally(~FirstDigits)

## 
##  0  1  2  3  4  5  6  7  8  9 
## 26 89 26 36 59 53 36 52 12 11

Uh oh… there aren’t supposed to be any zeros. Take another look at our dataset:

head(Data1)

##            Population Income Illiteracy Life Exp Murder HS Grad Frost
## Alabama          3615   3624        2.1    69.05   15.1    41.3    20
## Alaska            365   6315        1.5    69.31   11.3    66.7   152
## Arizona          2212   4530        1.8    70.55    7.8    58.1    15
## Arkansas         2110   3378        1.9    70.66   10.1    39.9    65
## California      21198   5114        1.1    71.71   10.3    62.6    20
## Colorado         2541   4884        0.7    72.06    6.8    63.9   166
##              Area
## Alabama     50708
## Alaska     566432
## Arizona    113417
## Arkansas    51945
## California 156361
## Colorado   103766

See Colorado at the bottom? The illiteracy rate is 0.7, so it looks like the illiteracy variable may have some leading zeros. To get rid of those leading zeros, we can multiply everything by 10:

table(substr(as.character(10*Data1),1,1))

## 
##  0  1  2  3  4  5  6  7  8  9 
##  1 89 26 36 59 56 46 57 15 15

Argh! There’s still one leading zero. It took me awhile to find it, but I eventually did. Hawaii has 0 days of frost. I’m going to convert that value to NA and run the tabulation again:

Data1["Hawaii","Frost"]<- NA
FirstDigits <- table(substr(as.character(10*Data1),1,1)); FirstDigits

## 
##  1  2  3  4  5  6  7  8  9 
## 89 26 36 59 56 46 57 15 15

There we go. Now that we have our leading digits, let’s compare their relative frequencies to what we’d expect under Benford’s Law:

# Sum all the first digits in the dataset
SumOfFirstDigits <- sum(FirstDigits)
# Get Benford's Law
BLaw <- sapply(1:9, function(x) log10(1+1/x)); BLaw

## [1] 0.30103 0.17609 0.12494 0.09691 0.07918 0.06695 0.05799 0.05115 0.04576

m <- cbind(FirstDigits/SumOfFirstDigits,BLaw)
colnames(m)<- c("Observed Prop.", "Theoretical Prop."); m

##   Observed Prop. Theoretical Prop.
## 1        0.22306           0.30103
## 2        0.06516           0.17609
## 3        0.09023           0.12494
## 4        0.14787           0.09691
## 5        0.14035           0.07918
## 6        0.11529           0.06695
## 7        0.14286           0.05799
## 8        0.03759           0.05115
## 9        0.03759           0.04576

# Boxplots of observed vs theoretical 
barplot( rbind(FirstDigits/SumOfFirstDigits,BLaw/sum(BLaw)), beside = T, col = rainbow(7)[c(2,5)], xlab = "First Digit")
title("Benford’s Law Compared to States Data")

## Warning: conversion failure on 'Benford’s Law Compared to States Data' in 'mbcsToSbcs': dot substituted for <e2>
## Warning: conversion failure on 'Benford’s Law Compared to States Data' in 'mbcsToSbcs': dot substituted for <80>
## Warning: conversion failure on 'Benford’s Law Compared to States Data' in 'mbcsToSbcs': dot substituted for <99>

plot of chunk unnamed-chunk-10

Activity #6: R Syntax

Brad Thiessen

June 18, 2014

Loading the Mosaic Package

Examples from Activity #6