Using dplyr

Load required packages

Let’s load the packages we’ll need (dplyr, mosaic, ggplot2, and Lahman):

require(dplyr)
require(mosaic)
require(ggplot2)
require(Lahman)

Data manipulation

In the last document, I showed how to use R to explore datasets using summary statistics and visualizations. We often need to manipulate our data to get it into a format where it’s easy to create these visualizations.

In this document, we’ll use the dplyr package to manipulate a large dataset.

To learn more about dplyr, you can run the command: vignette("introduction", package = "dplyr")

First, let’s load some baseball data:

### Lists datasets available in the Lahman package
data(package="Lahman")
# Loads Batting data
data(Batting)
# Gets dimensions of Batting data
dim(Batting)

## [1] 96600    24

This Batting dataset has 96,600 rows and 24 columns. Let’s get an idea of what those rows and columns represent:

head(Batting, 8)

##    playerID yearID stint teamID lgID   G G_batting  AB  R   H X2B X3B HR
## 1 aardsda01   2004     1    SFN   NL  11        11   0  0   0   0   0  0
## 2 aardsda01   2006     1    CHN   NL  45        43   2  0   0   0   0  0
## 3 aardsda01   2007     1    CHA   AL  25         2   0  0   0   0   0  0
## 4 aardsda01   2008     1    BOS   AL  47         5   1  0   0   0   0  0
## 5 aardsda01   2009     1    SEA   AL  73         3   0  0   0   0   0  0
## 6 aardsda01   2010     1    SEA   AL  53         4   0  0   0   0   0  0
## 7 aardsda01   2012     1    NYA   AL   1        NA  NA NA  NA  NA  NA NA
## 8 aaronha01   1954     1    ML1   NL 122       122 468 58 131  27   6 13
##   RBI SB CS BB SO IBB HBP SH SF GIDP G_old
## 1   0  0  0  0  0   0   0  0  0    0    11
## 2   0  0  0  0  0   0   0  1  0    0    45
## 3   0  0  0  0  0   0   0  0  0    0     2
## 4   0  0  0  0  1   0   0  0  0    0     5
## 5   0  0  0  0  0   0   0  0  0    0    NA
## 6   0  0  0  0  0   0   0  0  0    0    NA
## 7  NA NA NA NA NA  NA  NA NA NA   NA    NA
## 8  69  2  2 28 39  NA   3  6  4   13   122

Each row represents the batting statistics for a baseball player in a single year playing for a single team. So, for example, if a player played for 2 teams in a single season, that player would have two rows (one for each “stint”).

From the rows displayed above, we can see the player “aardsda01” has data from the 2004-2012 seasons (where he played for San Francisco, both Chicago teams, Boston, Seattle, and the Yankees).
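To make the “one row per stint” idea concrete, here is a minimal sketch on a made-up data frame shaped like Batting. The players, teams, and numbers are hypothetical, and the pipeline uses %>% on the assumption of a recent dplyr release. It counts the rows for each player-season and keeps the seasons that were split across more than one stint:

```r
library(dplyr)

# Hypothetical toy data in the same shape as Batting: player "b" changed
# teams during 2001, so that season is split across two rows (two stints).
toy <- data.frame(
  playerID = c("a", "b", "b", "b"),
  yearID   = c(2001, 2001, 2001, 2002),
  stint    = c(1, 1, 2, 1),
  teamID   = c("SEA", "BOS", "NYA", "NYA"),
  G        = c(150, 60, 70, 140),
  stringsAsFactors = FALSE
)

# Count the rows for each player-season, then keep seasons with several stints.
# The dplyr:: prefix avoids clashes if another attached package defines count().
multi_stint <- toy %>%
  dplyr::count(playerID, yearID) %>%
  filter(n > 1)

multi_stint
```

Only player "b" in 2001 should remain, with n equal to 2.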

Suppose I want to know which 5 players in MLB history have played in the most games. To do this, I need to add up the games played (G) across all the rows for each player.

The dplyr package allows us to do these kinds of data manipulations easily. We want to:

Group the rows of the dataset by playerID
Summarize those groups by adding up the games played
Arrange the data by listing the 5 players with the most games played

To do this step-by-step, we would use:

## Group rows of data by playerID
players <- group_by(Batting, playerID)
## Summarize the groups by taking the sum of games played
games <- summarize(players, total = sum(G))
## Arrange the data by games played (in descending order) and list the top 5
head(arrange(games, desc(total)), 5)

## Source: local data frame [5 x 2]
## 
##    playerID total
## 1  rosepe01  3562
## 2 yastrca01  3308
## 3 aaronha01  3298
## 4 henderi01  3081
## 5  cobbty01  3035

Those commands took a small fraction of a second to complete. From the output, we can see the five players with the most games played are Pete Rose, Carl Yastrzemski, Hank Aaron, Rickey Henderson, and Ty Cobb.

We could have also done this manipulation by chaining operations together with the %.% operator (dplyr later deprecated %.% in favor of %>%, which chains operations in the same way):

Batting %.%
  group_by(playerID) %.%
  summarize(total = sum(G)) %.%
  arrange(desc(total)) %.%
  head(5)

## Source: local data frame [5 x 2]
## 
##    playerID total
## 1  rosepe01  3562
## 2 yastrca01  3308
## 3 aaronha01  3298
## 4 henderi01  3081
## 5  cobbty01  3035

You can think of %.% as an arrow pointing to the right (->). It passes the result of the command on its left to the next command in the pipeline, as that command’s first argument.
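To see that chaining is only a different way of writing the same function calls, here is a small sketch on made-up data (the player IDs and game counts are hypothetical). It is written with %>%, the operator that replaced %.% in later dplyr releases; swap in %.% if you are running the older version used in this document.

```r
library(dplyr)

# Hypothetical toy data: games played per stint for three made-up players
toy <- data.frame(
  playerID = c("a", "a", "b", "c", "c", "c"),
  G        = c(100, 50, 120, 10, 20, 30),
  stringsAsFactors = FALSE
)

# Step by step, storing each intermediate result
grouped  <- group_by(toy, playerID)
totals   <- summarize(grouped, total = sum(G))
by_steps <- head(arrange(totals, desc(total)), 2)

# The same operations as one pipeline: each result becomes the first
# argument of the next function
by_pipe <- toy %>%
  group_by(playerID) %>%
  summarize(total = sum(G)) %>%
  arrange(desc(total)) %>%
  head(2)

# Both approaches should give the same two rows
identical(as.data.frame(by_steps), as.data.frame(by_pipe))
```

The chained version avoids naming the intermediate objects (grouped, totals), which is the main practical difference.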

Let’s see some other manipulation commands we may want to use:

Select columns (variables) of interest

This Batting dataset has 24 variables. Suppose we’re only interested in 14 of those columns: playerID, yearID, teamID, at-bats, hits, doubles, triples, home runs, stolen bases, times caught stealing, strikeouts, walks, times hit by pitch, and sacrifice flies.

We can select a subset of columns from our dataset with the select command:

Batting %.%
  select(playerID, yearID, teamID, AB, H, X2B, X3B, HR, SB, CS, SO, BB, HBP, SF) %.%
  head(10)

##     playerID yearID teamID  AB   H X2B X3B HR SB CS SO BB HBP SF
## 1  aardsda01   2004    SFN   0   0   0   0  0  0  0  0  0   0  0
## 2  aardsda01   2006    CHN   2   0   0   0  0  0  0  0  0   0  0
## 3  aardsda01   2007    CHA   0   0   0   0  0  0  0  0  0   0  0
## 4  aardsda01   2008    BOS   1   0   0   0  0  0  0  1  0   0  0
## 5  aardsda01   2009    SEA   0   0   0   0  0  0  0  0  0   0  0
## 6  aardsda01   2010    SEA   0   0   0   0  0  0  0  0  0   0  0
## 7  aardsda01   2012    NYA  NA  NA  NA  NA NA NA NA NA NA  NA NA
## 8  aaronha01   1954    ML1 468 131  27   6 13  2  2 39 28   3  4
## 9  aaronha01   1955    ML1 602 189  37   9 27  3  1 61 49   3  4
## 10 aaronha01   1956    ML1 609 200  34  14 26  2  4 54 37   2  7

As you can see, our dataset now has these 14 columns. Suppose we wanted to add some new columns:

Batting average = hits / at-bats
On-base percentage = (hits + walks + hit-by-pitch) / (at-bats + walks + hit-by-pitch + sacrifice flies)
Slugging percentage = (singles + 2*doubles + 3*triples + 4*HRs) / (at-bats)
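As a hand check of these three formulas, the sketch below plugs in Hank Aaron’s 1954 row from the select output above (row 8). It uses only base R arithmetic; the variable names are just for illustration.

```r
# Hank Aaron's 1954 line, copied from row 8 of the output above
AB <- 468; H <- 131; X2B <- 27; X3B <- 6; HR <- 13
BB <- 28; HBP <- 3; SF <- 4

# Singles are the hits that were not doubles, triples, or home runs
singles <- H - X2B - X3B - HR

avg <- H / AB
obp <- (H + BB + HBP) / (AB + BB + HBP + SF)
slg <- (singles + 2 * X2B + 3 * X3B + 4 * HR) / AB

round(c(Avg = avg, OBP = obp, SLG = slg), 4)
```

The three rounded values should come out as 0.2799, 0.3221, and 0.4466.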

It’s easy to add columns using the mutate command in dplyr:

## Add new columns (Avg, OBP, SLG)
Batting %.%
  select(playerID, yearID, teamID, AB, H, X2B, X3B, HR, SB, CS, SO, BB, HBP, SF) %.%
  mutate(Avg = H/AB,
         OBP = (H + BB + HBP)/(AB+BB+HBP+SF),
         SLG = (((H-X2B-X3B-HR)+(2*X2B)+(3*X3B)+(4*HR))/AB)) %.%
  head(12)

##     playerID yearID teamID  AB   H X2B X3B HR SB CS SO BB HBP SF    Avg
## 1  aardsda01   2004    SFN   0   0   0   0  0  0  0  0  0   0  0    NaN
## 2  aardsda01   2006    CHN   2   0   0   0  0  0  0  0  0   0  0 0.0000
## 3  aardsda01   2007    CHA   0   0   0   0  0  0  0  0  0   0  0    NaN
## 4  aardsda01   2008    BOS   1   0   0   0  0  0  0  1  0   0  0 0.0000
## 5  aardsda01   2009    SEA   0   0   0   0  0  0  0  0  0   0  0    NaN
## 6  aardsda01   2010    SEA   0   0   0   0  0  0  0  0  0   0  0    NaN
## 7  aardsda01   2012    NYA  NA  NA  NA  NA NA NA NA NA NA  NA NA     NA
## 8  aaronha01   1954    ML1 468 131  27   6 13  2  2 39 28   3  4 0.2799
## 9  aaronha01   1955    ML1 602 189  37   9 27  3  1 61 49   3  4 0.3140
## 10 aaronha01   1956    ML1 609 200  34  14 26  2  4 54 37   2  7 0.3284
## 11 aaronha01   1957    ML1 615 198  27   6 44  1  1 58 57   0  3 0.3220
## 12 aaronha01   1958    ML1 601 196  34   4 30  4  1 49 59   1  3 0.3261
##       OBP    SLG
## 1     NaN    NaN
## 2  0.0000 0.0000
## 3     NaN    NaN
## 4  0.0000 0.0000
## 5     NaN    NaN
## 6     NaN    NaN
## 7      NA     NA
## 8  0.3221 0.4466
## 9  0.3663 0.5399
## 10 0.3649 0.5583
## 11 0.3778 0.6000
## 12 0.3855 0.5458

As you can see, the command worked for rows 8-12. We can see, for example, that Hank Aaron (aaronha01) batted .280 in 1954 (0.2799 before rounding). We can also see the command gave us “NaN” and “NA” in several of the other rows.

The NA represents missing data, while NaN means “not a number.” We got NaN results when we divided zero by zero in our calculations. For example, the first row lists a player with no at-bats, so his batting average was calculated as 0/0 = NaN. To get around this problem, we can filter the dataset to include only the rows with at least one at-bat:

## Keep rows with at least one at-bat, then add new columns (Avg, OBP, SLG)
Batting %.%
  select(playerID, yearID, teamID, AB, H, X2B, X3B, HR, SB, CS, SO, BB, HBP, SF) %.%
  filter(AB>0) %.%
  mutate(Avg = H/AB,
         OBP = (H + BB + HBP)/(AB+BB+HBP+SF),
         SLG = (((H-X2B-X3B-HR)+(2*X2B)+(3*X3B)+(4*HR))/AB)) %.%
  head(5)

##    playerID yearID teamID  AB   H X2B X3B HR SB CS SO BB HBP SF    Avg
## 1 aardsda01   2006    CHN   2   0   0   0  0  0  0  0  0   0  0 0.0000
## 2 aardsda01   2008    BOS   1   0   0   0  0  0  0  1  0   0  0 0.0000
## 3 aaronha01   1954    ML1 468 131  27   6 13  2  2 39 28   3  4 0.2799
## 4 aaronha01   1955    ML1 602 189  37   9 27  3  1 61 49   3  4 0.3140
## 5 aaronha01   1956    ML1 609 200  34  14 26  2  4 54 37   2  7 0.3284
##      OBP    SLG
## 1 0.0000 0.0000
## 2 0.0000 0.0000
## 3 0.3221 0.4466
## 4 0.3663 0.5399
## 5 0.3649 0.5583
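The difference between NaN and NA can be checked directly in base R. The short sketch below also shows why the filter removed the 2012 row for aardsda01, whose statistics were all NA: comparing NA with a number gives NA, and filter keeps only rows where the condition is TRUE. The variable names are illustrative only.

```r
# Dividing zero by zero is undefined, so R returns NaN
no_at_bats <- 0 / 0

# Arithmetic on a missing value stays missing
missing_season <- NA / 1

is.nan(no_at_bats)       # TRUE
is.nan(missing_season)   # FALSE: NA means "missing", not "not a number"
is.na(no_at_bats)        # TRUE: is.na() treats NaN as missing too

# A comparison with NA is itself NA, so the condition AB > 0
# is not TRUE for a row where AB is missing
AB <- c(0, 2, NA)
AB > 0
```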

Let’s find the top 10 season slugging percentages in MLB history:

## Add new columns (Avg, OBP, SLG) and sort by SLG (in descending order)
Batting %.%
  select(playerID, yearID, teamID, AB, H, X2B, X3B, HR, SB, CS, SO, BB, HBP, SF) %.%
  filter(AB>0) %.%
  mutate(Avg = H/AB,
         OBP = (H + BB + HBP)/(AB+BB+HBP+SF),
         SLG = (((H-X2B-X3B-HR)+(2*X2B)+(3*X3B)+(4*HR))/AB)) %.%
  arrange(desc(SLG)) %.%
  head(11)

##     playerID yearID teamID AB H X2B X3B HR SB CS SO BB HBP SF Avg OBP SLG
## 1  chacigu01   2010    HOU  1 1   0   0  1  0  0  0  0   0  0   1   1   4
## 2  hernafe02   2008    SEA  1 1   0   0  1  0  0  0  0   0  0   1   1   4
## 3  lefebbi01   1938    BOS  1 1   0   0  1  0  0  0  0   0 NA   1  NA   4
## 4   motagu01   1999    MON  1 1   0   0  1  0  0  0  0   0  0   1   1   4
## 5  narumbu01   1963    BAL  1 1   0   0  1  0  0  0  0   0  0   1   1   4
## 6  perrypa02   1988    CHN  1 1   0   0  1  0  0  0  0   0  0   1   1   4
## 7  quirkja01   1984    CLE  1 1   0   0  1  0  0  0  0   0  0   1   1   4
## 8  rogered01   2005    BAL  1 1   0   0  1  0  2  0  0   0  0   1   1   4
## 9  sleatlo01   1958    DET  1 1   0   0  1  0  0  0  0   0  0   1   1   4
## 10   yanes01   2000    TBA  1 1   0   0  1  0  0  0  0   0  0   1   1   4
## 11 altroni01   1924    WS1  1 1   0   1  0  0  0  0  0   0 NA   1  NA   3

As you can see, there were 10 players who hit a home run in their only at-bat of the season (which yields a slugging percentage of 4.000). Let’s filter these results to only include players with more than 100 at-bats in a season:

## Keep rows with more than 100 at-bats, add new columns, and sort by SLG
Batting %.%
  select(playerID, yearID, teamID, AB, H, X2B, X3B, HR, SB, CS, SO, BB, HBP, SF) %.%
  filter(AB>100) %.%
  mutate(Avg = H/AB,
         OBP = (H + BB + HBP)/(AB+BB+HBP+SF),
         SLG = (((H-X2B-X3B-HR)+(2*X2B)+(3*X3B)+(4*HR))/AB)) %.%
  arrange(desc(SLG)) %.%
  head(10)

##     playerID yearID teamID  AB   H X2B X3B HR SB CS  SO  BB HBP SF    Avg
## 1  bondsba01   2001    SFN 476 156  32   2 73 13  3  93 177   9  2 0.3277
## 2   ruthba01   1920    NYA 457 172  36   9 54 14 14  80 150   3 NA 0.3764
## 3   ruthba01   1921    NYA 540 204  44  16 59 17 13  81 145   4 NA 0.3778
## 4  bondsba01   2004    SFN 373 135  27   3 45  6  1  41 232   9  3 0.3619
## 5  bondsba01   2002    SFN 403 149  31   2 46  9  2  47 198   9  2 0.3697
## 6   ruthba01   1927    NYA 540 192  29   8 60  7  6  89 137   0 NA 0.3556
## 7  gehrilo01   1927    NYA 584 218  52  18 47 10  8  84 109   3 NA 0.3733
## 8   ruthba01   1923    NYA 522 205  45  13 41 17 21  93 170   4 NA 0.3927
## 9  hornsro01   1925    SLN 504 203  41  10 39  5  3  39  83   2 NA 0.4028
## 10 mcgwima01   1998    SLN 509 152  21   0 70  1  0 155 162   6  4 0.2986
##       OBP    SLG
## 1  0.5151 0.8634
## 2      NA 0.8490
## 3      NA 0.8463
## 4  0.6094 0.8123
## 5  0.5817 0.7990
## 6      NA 0.7722
## 7      NA 0.7654
## 8      NA 0.7644
## 9      NA 0.7560
## 10 0.4699 0.7525

From this, we see Barry Bonds posted the highest single-season slugging percentage (.863, among seasons with more than 100 at-bats) in 2001, when he hit 73 home runs. Let’s use a similar set of commands to see which players had the most strikeouts in a single season:

## Sort the rows by strikeouts (in descending order) and list the top 5
Batting %.%
  select(playerID, yearID, teamID, SO) %.%
  arrange(desc(SO)) %.%
  head(5)

##    playerID yearID teamID  SO
## 1 reynoma01   2009    ARI 223
## 2  dunnad01   2012    CHA 222
## 3 reynoma01   2010    ARI 211
## 4 stubbdr01   2011    CIN 205
## 5 reynoma01   2008    ARI 204

Mark Reynolds struck out 223 times in 2009. Adam Dunn was a close second, with 222 strikeouts in 2012 for the White Sox.

Who had the most career strikeouts? Let’s see:

## Sum strikeouts over each player's career and list the top 5
Batting %.%
  select(playerID, yearID, teamID, SO) %.%
  group_by(playerID) %.%
  summarize(totalSO = sum(SO)) %.%
  arrange(desc(totalSO)) %.%
  head(5)

## Source: local data frame [5 x 2]
## 
##    playerID totalSO
## 1 jacksre01    2597
## 2 thomeji01    2548
## 3  sosasa01    2306
## 4 rodrial01    2032
## 5  dunnad01    2031

Reggie Jackson has struck out more times (2,597) than any other player in MLB history. Does he have the highest strikeout rate (strikeouts per at-bat)? Let’s see (for players with more than 2,000 at-bats in their careers):

## Career strikeouts per at-bat (Krate) for players with more than 2000 at-bats
Batting %.%
  select(playerID, yearID, teamID, AB, SO) %.%
  group_by(playerID) %.%
  summarize(totalSO = sum(SO),
            totalAB = sum(AB)) %.%
  filter(totalAB>2000) %.%
  mutate(Krate = totalSO/totalAB) %.%
  arrange(desc(Krate)) %.%
  head(5)

## Source: local data frame [5 x 4]
## 
##    playerID totalSO totalAB  Krate
## 1  custja01     819    2107 0.3887
## 2 branyru01    1118    2934 0.3810
## 3 reynoma01    1122    2973 0.3774
## 4  deerro01    1409    3881 0.3631
## 5 jacksbo01     841    2393 0.3514

A Google search tells me the leader, striking out nearly 39% of the time, is Jack Cust (who amassed most of his strikeouts in Oakland from 2007-2009).

I wonder who had the highest strikeout rate while playing for the Detroit Tigers…

## Highest strikeout rates using only rows for the Detroit Tigers (DET)
Batting %.%
  select(playerID, yearID, teamID, AB, SO) %.%
  group_by(playerID) %.%
  filter(teamID=="DET") %.%
  summarize(totalSO = sum(SO),
            totalAB = sum(AB)) %.%
  filter(totalAB>2000) %.%
  mutate(Krate = totalSO/totalAB) %.%
  arrange(desc(Krate)) %.%
  head(5)

## Source: local data frame [5 x 4]
## 
##    playerID totalSO totalAB  Krate
## 1  ingebr01    1189    4626 0.2570
## 2 clarkto02     721    2831 0.2547
## 3 fieldce01     926    3674 0.2520
## 4 grandcu01     618    2579 0.2396
## 5 gibsoki01     930    4170 0.2230

The answer is Brandon Inge. What about the lowest strikeout rate? To find it, I just sort the data in ascending order (by removing the desc() wrapper from arrange):

## Lowest strikeout rates using only rows for the Detroit Tigers (DET)
Batting %.%
  select(playerID, yearID, teamID, AB, SO) %.%
  group_by(playerID) %.%
  filter(teamID=="DET") %.%
  summarize(totalSO = sum(SO),
            totalAB = sum(AB)) %.%
  filter(totalAB>2000) %.%
  mutate(Krate = totalSO/totalAB) %.%
  arrange(Krate) %.%
  head(5)

## Source: local data frame [5 x 4]
## 
##    playerID totalSO totalAB   Krate
## 1 cramedo01      86    2720 0.03162
## 2  kellge01     107    3303 0.03239
## 3 bassljo01      73    2240 0.03259
## 4 gehrich01     372    8860 0.04199
## 5 kuennha01     205    4372 0.04689

Google tells me it’s Doc Cramer, who played for Detroit from 1942-1948.
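One detail worth knowing when switching between arrange(Krate) and arrange(desc(Krate)): dplyr’s arrange places missing values last in both directions. Here is a minimal sketch with made-up rates (the player IDs and values are hypothetical, and %>% is used on the assumption of a recent dplyr release):

```r
library(dplyr)

# Hypothetical strikeout rates; player "d" has a missing value
rates <- data.frame(
  playerID = c("a", "b", "c", "d"),
  Krate    = c(0.25, 0.03, 0.12, NA),
  stringsAsFactors = FALSE
)

# Ascending order: lowest rate first
lowest_first <- rates %>% arrange(Krate)

# Descending order: highest rate first
highest_first <- rates %>% arrange(desc(Krate))

lowest_first$playerID
highest_first$playerID
```

In both results player "d" should appear at the bottom, so a player whose total is NA never shows up at the top of either leaderboard.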

My favorite baseball player is Alan Trammell, a shortstop for the Detroit Tigers. To access his data, I need to know his playerID. I’ll look it up by typing in the first letters of his last name:

playerInfo("tramm")

##        playerID nameFirst nameLast
## 14308 trammal01      Alan Trammell
## 14309 trammbu01     Bubba Trammell

From this, we see there were two Trammells in MLB: Alan and Bubba. I want playerID trammal01. Let’s take a look at his batting statistics each season. Since I already know his playerID and teamID, I’ll drop those columns, along with a few others I don’t need (by putting a minus sign - in front of each name in the select command):

Batting %.%
  filter(playerID=="trammal01") %.%
  select(-playerID, -teamID, -stint, -lgID, -G_batting, -G_old)

##    yearID   G  AB   R   H X2B X3B HR RBI SB CS BB SO IBB HBP SH SF GIDP
## 1    1977  19  43   6   8   0   0  0   0  0  0  4 12   0   0  1  0    1
## 2    1978 139 448  49 120  14   6  2  34  3  1 45 56   0   2  6  3   12
## 3    1979 142 460  68 127  11   4  6  50 17 14 43 55   0   0 12  5    6
## 4    1980 146 560 107 168  21   5  9  65 12 12 69 63   2   3 13  7   10
## 5    1981 105 392  52 101  15   3  2  31 10  3 49 31   2   3 16  3   10
## 6    1982 157 489  66 126  34   3  9  57 19  8 52 47   0   0  9  6    5
## 7    1983 142 505  83 161  31   2 14  66 30 10 57 64   2   0 15  4    7
## 8    1984 139 555  85 174  34   5 14  69 19 13 60 63   2   3  6  2    8
## 9    1985 149 605  79 156  21   7 13  57 14  5 50 71   4   2 11  9    6
## 10   1986 151 574 107 159  33   7 21  75 25 12 59 57   4   5 11  4    7
## 11   1987 151 597 109 205  34   3 28 105 21  2 60 47   8   3  2  6   11
## 12   1988 128 466  73 145  24   1 15  69  7  4 46 46   8   4  0  7   14
## 13   1989 121 449  54 109  20   3  5  43 10  2 45 45   1   4  3  5    9
## 14   1990 146 559  71 170  37   1 14  89 12 10 68 55   7   1  3  6   11
## 15   1991 101 375  57  93  20   0  9  55 11  2 37 39   1   3  5  1    7
## 16   1992  29 102  11  28   7   1  1  11  2  2 15  4   0   1  1  1    6
## 17   1993 112 401  72 132  25   3 12  60 12  8 38 38   2   2  4  2    7
## 18   1994  76 292  38  78  17   1  8  28  3  0 16 35   1   1  2  0    8
## 19   1995  74 223  28  60  12   0  2  23  3  1 27 19   4   0  3  2    8
## 20   1996  66 193  16  45   2   0  1  16  6  0 10 27   0   0  1  3    3

From this, we can see he played 20 seasons from 1977-1996. We can list his 5 best seasons in terms of on-base percentage:

## Add new columns (Avg, OBP) and list the 5 best seasons by OBP
Batting %.%
  filter(playerID=="trammal01") %.%
  select(-playerID, -teamID, -stint, -lgID, -G_batting, -G_old) %.%
  mutate(Avg = H/AB,
         OBP = (H + BB + HBP)/(AB+BB+HBP+SF)) %.%
  arrange(desc(OBP)) %.%
  head(5)

##   yearID   G  AB   R   H X2B X3B HR RBI SB CS BB SO IBB HBP SH SF GIDP
## 1   1987 151 597 109 205  34   3 28 105 21  2 60 47   8   3  2  6   11
## 2   1993 112 401  72 132  25   3 12  60 12  8 38 38   2   2  4  2    7
## 3   1983 142 505  83 161  31   2 14  66 30 10 57 64   2   0 15  4    7
## 4   1984 139 555  85 174  34   5 14  69 19 13 60 63   2   3  6  2    8
## 5   1990 146 559  71 170  37   1 14  89 12 10 68 55   7   1  3  6   11
##      Avg    OBP
## 1 0.3434 0.4024
## 2 0.3292 0.3883
## 3 0.3188 0.3852
## 4 0.3135 0.3823
## 5 0.3041 0.3770
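The minus-sign form of select used in the last two pipelines drops the named columns and keeps everything else. As a small sketch on a made-up data frame (the column values are hypothetical, and %>% assumes a recent dplyr release), dropping two columns gives the same result as naming the columns to keep:

```r
library(dplyr)

# Hypothetical toy data with four columns
toy <- data.frame(
  playerID = c("x01", "x01"),
  teamID   = c("DET", "DET"),
  yearID   = c(1977, 1978),
  G        = c(19, 139),
  stringsAsFactors = FALSE
)

# Drop the columns we do not need...
dropped <- toy %>% select(-playerID, -teamID)

# ...or, equivalently here, name the columns we want to keep
kept <- toy %>% select(yearID, G)

identical(dropped, kept)
```

Dropping is usually shorter when you want most of the columns; naming is shorter when you want only a few.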

We can also get his career totals:

Batting %.%
  filter(playerID=="trammal01") %.%
  group_by(playerID) %.%
  summarize(seasons = n(),
            games = sum(G),
            AB = sum(AB),
            hits = sum(H), 
            dbl = sum(X2B),
            trpl = sum(X3B),
            HR = sum(HR),
            Avg = (sum(H) / sum(AB)),
            OBP = ((sum(H)+sum(BB)+sum(HBP))/(sum(AB)+sum(BB)+sum(HBP)+sum(SF))),
            SB = sum(SB),
            SBpct = (sum(SB)/(sum(CS)+sum(SB))))

## Source: local data frame [1 x 12]
## 
##    playerID seasons games   AB hits dbl trpl  HR    Avg    OBP  SB  SBpct
## 1 trammal01      20  2293 8288 2365 412   55 185 0.2854 0.3515 236 0.6841

Over his 20 seasons, Alan Trammell batted .285, hit 185 HR, and successfully stole bases 68.4% of the time.
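Note that the career average above is computed as sum(H) / sum(AB), not as the average of the 20 seasonal batting averages. The sketch below uses Trammell’s first two seasons from the table above (1977 and 1978) to show that the two calculations differ; the variable names are illustrative only.

```r
# Alan Trammell's 1977 and 1978 seasons, copied from the table above
H  <- c(8, 120)
AB <- c(43, 448)

# Ratio of sums: every at-bat counts equally
ratio_of_sums <- sum(H) / sum(AB)

# Mean of ratios: the 43-AB season counts as much as the 448-AB season
mean_of_ratios <- mean(H / AB)

round(c(ratio_of_sums = ratio_of_sums, mean_of_ratios = mean_of_ratios), 4)
```

The ratio of sums (about 0.2607) is the batting average over the two seasons combined; the mean of ratios (about 0.2270) is pulled down by the short 1977 season.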

Visualizations

Let’s see the relationship between career home runs and strikeouts for players with long MLB careers:

### Keep players with more than 5 rows (one row per stint), more than 400 games, and more than 1000 at-bats, with at least one SO and HR
### na.rm=TRUE is an argument that tells sum() to ignore NA values
hrSO <- Batting %.%
  group_by(playerID) %.%
  summarize(seasons = n(),
            games = sum(G, na.rm=TRUE),
            atbats = sum(AB, na.rm=TRUE),
            HR = sum(HR, na.rm=TRUE),
            Avg = (sum(H, na.rm=TRUE) / sum(AB, na.rm=TRUE)),
            SO = sum(SO, na.rm=TRUE)) %.%
  filter(seasons>5 & games>400 & atbats>1000 & SO>0 & HR>0)

### Create plot
ggplot(hrSO, aes(HR, SO)) +
  geom_point(alpha = 1/2) +
  geom_smooth() +
  scale_size_area()

## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.

This plot shows the general positive relationship between strikeouts and home runs. This could be simply due to some players having more at-bats than others, so let’s plot the relationship between “home runs per at-bat” and “strikeouts per at-bat”:

### Keep players with more than 5 rows (one row per stint), more than 400 games, and more than 1000 at-bats, with at least one SO and HR
### na.rm=TRUE is an argument that tells sum() to ignore NA values
hrSO2 <- Batting %.%
  group_by(playerID) %.%
  summarize(seasons = n(),
            games = sum(G, na.rm=TRUE),
            atbats = sum(AB, na.rm=TRUE),
            HR = sum(HR, na.rm=TRUE),
            Avg = (sum(H, na.rm=TRUE) / sum(AB, na.rm=TRUE)),
            SO = sum(SO, na.rm=TRUE),
            HRperAB = (HR/atbats),
            SOperAB = (SO/atbats)) %.%
  filter(seasons>5 & games>400 & atbats>1000 & SO>0 & HR>0)

### Create plot
ggplot(hrSO2, aes(HRperAB, SOperAB)) +
  geom_point(alpha = 1/2) +
  geom_smooth() +
  scale_size_area()

## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.
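Both summaries above pass na.rm=TRUE to sum(). The base R sketch below, on made-up numbers, shows what that argument changes. It also suggests a caveat for the earlier career-strikeout pipelines, which called sum(SO) without it: a player with any season where strikeouts were not recorded would get NA as a career total and could drop out of those rankings.

```r
# Hypothetical strikeout counts for three seasons; one was not recorded
SO <- c(45, NA, 60)

# Without na.rm, one missing season makes the whole total missing
total_default <- sum(SO)

# With na.rm = TRUE, the missing season is skipped
total_na_rm <- sum(SO, na.rm = TRUE)

c(default = total_default, na_rm = total_na_rm)
```

Skipping missing seasons treats them as if they contributed zero, so totals computed this way are lower bounds for players with unrecorded seasons.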

I wonder if this same relationship holds for pitchers. Do pitchers with more career strikeouts give up more career homeruns? Let’s see (controlling for innings pitched)…

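### Keep pitchers with more than 3000 outs recorded (IPouts), i.e. more than 1000 innings pitched
### na.rm=TRUE is an argument that tells sum() to ignore NA values
### Note: HRperIP and SOperIP are computed per out (IPouts), not per inning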
PitcherHRso <- Pitching %.%
  group_by(playerID) %.%
  summarize(seasons = n(),
            IPouts = sum(IPouts, na.rm=TRUE),
            HR = sum(HR, na.rm=TRUE),
            SO = sum(SO, na.rm=TRUE),
            HRperIP = (HR/IPouts),
            SOperIP = (SO/IPouts)) %.%
  filter(IPouts > 3000)

### Create plot
ggplot(PitcherHRso, aes(HRperIP, SOperIP)) +
  geom_point(aes(size=IPouts), alpha = .4) +
  geom_smooth() +
  scale_size_area()

## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.

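[Plot: strikeouts per out (SOperIP) against home runs per out (HRperIP) for each pitcher, with point size scaled by IPouts and a smoothed trend line]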

The size of each dot indicates the number of outs recorded (IPouts, which is three times innings pitched) for each pitcher. This relationship is a bit more complicated than it was for batters.
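Because IPouts counts outs rather than innings, the rates in the plot are per out. If you prefer the more familiar per-nine-innings scale, the conversion is simple arithmetic. Here is a sketch with a made-up career line (all three numbers are hypothetical):

```r
# Hypothetical career line for one pitcher
IPouts <- 9000   # outs recorded
SO     <- 2500   # strikeouts
HR     <- 300    # home runs allowed

# Three outs per inning
innings <- IPouts / 3

# Rates per nine innings
so_per_9 <- 9 * SO / innings
hr_per_9 <- 9 * HR / innings

c(innings = innings, SO9 = so_per_9, HR9 = hr_per_9)
```

Rescaling every pitcher by the same constant (27 outs per nine innings) changes the axis labels but not the shape of the relationship in the plot.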

Final example - team statistics

See if you can figure out what this plot shows:

TeamSummary <- Batting %.%
  group_by(teamID) %.%
  summarize(games = sum(G, na.rm=TRUE),
            atbats = sum(AB, na.rm=TRUE),
            HR = sum(HR, na.rm=TRUE),
            Avg = (sum(H, na.rm=TRUE) / sum(AB, na.rm=TRUE)),
            SO = sum(SO, na.rm=TRUE),
            HRperGame = (HR/games),
            SOperGame = (SO/games)) %.%
            filter(HRperGame>0 & SOperGame>0)

ggplot(TeamSummary, aes(HRperGame, SOperGame)) +
  geom_point(aes(size=Avg), alpha = 1/2) +
  geom_smooth() +
  scale_size_area()

## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.

[Plot: SOperGame against HRperGame for each team, with point size scaled by Avg and a smoothed trend line]