Source file ⇒ FinalMarkdown.Rmd

For my final project I decided to take a look at the hockey Hall of Fame, and see if I can possibly predict what kind of stats a player will need in order to be inducted into the Hall of Fame. So in order to do this I first obtained some data sets from kaggle, which can be found here. I only need the data sets with all of the players in the NHL up to this point (2011), a list of players that had been inducted into the Hall of Fame, and player statistics.

HOF<- read.csv("HOF.csv")
Score <- read.csv("Scoring.csv")
Master <- read.csv("Master.csv")

I need some sort of indicator to show that a player is in the Hall of Fame or not. So I created a new variable called inhof to be applied to the Master data set, which is the data set with all of the players. A 0 is given if the player doesn’t have a Hall of Fame ID, which would mean they were never inducted, and a 1 is given for that variable otherwise. The Master data set also has 31 different variables, so I just want to select the variables that have meaning to me, such as the player’s IDs, the player’s Hall of Fame IDs, their first and last names, height, weight, position, and the new indicator variable I just created. We call this new data set Master1.

Master1 <-
  Master %>% 
  mutate(inhof=ifelse(hofID == "", 0,1)) %>% 
  select(playerID,hofID,firstName, lastName, height, weight, pos, inhof)

Now I want to see how many people have been inducted into the Hall of fame at this point. All I need to do for this is see how many rows there are in the HOF data set.

HOF %>% 
  nrow()
## [1] 366

But if you look at the HOF data set, there are several different categories of people that are in the Hall of Fame. There are Players, Builders, and Referee/Linesman. Since I want to see if I can predict if a player will make it into the Hall of Fame, I can filter out all the cases of the players and see how many there are.

HOF1 <-
  HOF %>% 
  filter(category=="Player") 
HOF1 %>% 
  nrow()
## [1] 251

So now that I have an idea of the number of players that have been inducted, I can start on a decision tree to try to predict, based on a players stats, if they have what it takes to be inducted. To start off I take the Master1 data set which has the 8 variables I selected from the original data set, and I make it even smaller by only selecting the indicator variable inhof and the player ID.

inhof<-
  Master1 %>% 
  select(playerID, inhof)

I could analyze the data all together, but it wouldn’t really make any sense. The goalies inducted most likely wouldn’t have any goals, and the centers and defenseman have much different roles on the ice. So in order to get a better decision tree I have to split the Master1 data set up by the player’s position, which I included when I created the Master1 data table. So I just split the Master1 data set by position, and it creates a list. I have to do a little searching to see which elements of the list correspond to each of the positions, but once I figure it out I am able to extract them into their own data sets. Just for fun I split up the left wingers and the right wingers into their own data sets and tests, but I dont expect the data to be too different between the two.

mylist<- split(Master1, Master1$pos)
#mylist[[2]] #centers
#mylist[[6]] #Dman 
#mylist[[12]] #L
#mylist[[15]] #R
centers <- mylist[[2]]
dman<- mylist[[6]]
left <- mylist[[12]]
right<- mylist[[15]]

Centers

I first started with the centers. I joined the centers data set (that I just obtained from the split Masters1 data set) with the Score data set (which has all the stats on the players) in order to only have the players with the position of center and all of their stats. I then summerized the centerall data set to get the total goals, games played, total assists, points, powerplay goals, and shots for each of the players over the course of their careers. I then joined that data set with the inhof data set, which has the indicator variable inhof.

centerall <-
  centers %>% 
  inner_join(Score,by="playerID") %>% 
  group_by(playerID)
## Warning: Column `playerID` joining factors with different levels, coercing
## to character vector
centerall1 <-
  centerall %>%
  group_by(playerID) %>% 
  summarize(totalgoals=sum(G),gamesplayed=sum(GP),totalassists=sum(A),
            points=sum(Pts),powerplaygoals=sum(PPG),shots=sum(SOG)) %>% 
  distinct( .keep_all = TRUE)
centertotals<-
  centerall1 %>% 
  inner_join(inhof,by="playerID")
## Warning: Column `playerID` joining character vector and factor, coercing
## into character vector

From there I’m set to create the decision tree. I use a little bit of machine learning in order to do this, with the inhof indicator variable being the output, and all of the other summarized variables being the input.

mod1 <- party::ctree(
  inhof~totalgoals+totalassists+powerplaygoals+shots+points+gamesplayed, data=centertotals)
plot(mod1,type="simple")

The resulting tree is kind of interesting, but it makes sense. If you have a lot of power play goals, points, and normal goals, you have a much higher chance of making it into the Hall of Fame as a center.

Defenseman

I took the same exact approach when it came to the defenseman, with the same variables and same methods.

dmanall<-
  dman %>% 
  inner_join(Score,by="playerID") %>% 
  group_by(playerID)
## Warning: Column `playerID` joining factors with different levels, coercing
## to character vector
dmanall1<-
  dmanall %>% 
  summarize(totalgoals=sum(G),gamesplayed=sum(GP),totalassists=sum(A),
            points=sum(Pts),powerplaygoals=sum(PPG),shots=sum(SOG)) %>% 
  distinct( .keep_all=TRUE)
dmantotals<-
  dmanall1 %>% 
  inner_join(inhof,by="playerID") 
## Warning: Column `playerID` joining character vector and factor, coercing
## into character vector
mod2 <- party::ctree(
  inhof~totalgoals+totalassists+powerplaygoals+points+shots+gamesplayed, data=dmantotals)
plot(mod2,type="simple")

This tree is more interesting, due to the fact that the branches are only based on total goals and assists. It makes sense in the way that defenseman don’t score nearly as much as centers do, especially on the power play. They take less shots and they usually have less points than their center counterparts, but I find it interesting that those variables are insignificant in the overall tree.

Left Wings

As before, I did the same thing as I did with the centers, but switched the position to left wingers.

leftwingall<-
  left %>% 
  inner_join(Score,by="playerID") %>% 
  group_by(playerID)
## Warning: Column `playerID` joining factors with different levels, coercing
## to character vector
leftwingall1<-
  leftwingall %>% 
  summarize(totalgoals=sum(G),gamesplayed=sum(GP),totalassists=sum(A),
            points=sum(Pts),powerplaygoals=sum(PPG),shots=sum(SOG)) %>% 
  distinct( .keep_all=TRUE)
leftwingtotals<-
  leftwingall1 %>% 
  inner_join(inhof,by="playerID")
## Warning: Column `playerID` joining character vector and factor, coercing
## into character vector
mod3 <- party::ctree(
  inhof~totalgoals+totalassists+powerplaygoals+shots+gamesplayed+points, data=leftwingtotals)
plot(mod3,type="simple")

Yet again I’m surprised with the outcome. This one is only based on total goals, but I would expect all of the categories to play a larger role. At least power play goals and points, due to the fact that they are similar to the centers for both of those categories, and those were both significant for the centers.

Right Wings

Same exact approach as all the others, just with right wing players.

rightwingall<-
  right %>% 
  inner_join(Score,by="playerID") %>% 
  group_by(playerID)
## Warning: Column `playerID` joining factors with different levels, coercing
## to character vector
rightwingall1<-
  rightwingall %>% 
  summarize(totalgoals=sum(G),gamesplayed=sum(GP),totalassists=sum(A),
            points=sum(Pts),powerplaygoals=sum(PPG),shots=sum(SOG)) %>% 
  distinct( .keep_all=TRUE)
rightwingtotals<-
  rightwingall1 %>% 
  inner_join(inhof,by="playerID")
## Warning: Column `playerID` joining character vector and factor, coercing
## into character vector
mod4 <- party::ctree(
  inhof~totalgoals+totalassists+powerplaygoals+shots+gamesplayed+points, data=rightwingtotals)
plot(mod4,type="simple")

To make things more interesting, this tree includes power play goals and games played, along with total goals. I totally would have expected something similar to the left wing tree. But according to this result power play goals and games played are much more significant to a right winger than a left winger (of course this isn’t true).

Some of the Greats

Well lets put these trees to the test with some of the great hockey players of today that are currently playing. I’m a little biased on who the greatest players are in the league, lets see if you can tell.

Sidney Crosby

Arguably, Sidney Crosby is one of the top 5 best players to step on the ice. He currently has won three cups, two Conn Smythe’s, two Hart Memorial Trophies, two Maurice ā€œRocketā€ Richard Trophies, two Art Ross Trophies, and if that isn’t enough he has a couple of gold medals from the olympics. I could continue on with his accomplishments, but you get the gist. He will most definitely be inducted into the Hall of Fame when he retires, but lets put the decision tree to the test. In order to do this we have to scrape some data from the web.

page_url <- "https://www.hockey-reference.com/players/c/crosbsi01.html"
XPATH <- '//*[@id="stats_basic_plus_nhl"]'
table_list1 <- 
  page_url %>%
  read_html(stringsAsFactors=FALSE) %>%
  html_nodes(xpath = XPATH) %>%
  html_table(fill = TRUE)
crosby<- table_list1[[1]]
colnames(crosby)=crosby[1,]
crosby1<-crosby[-1:-14,] 
knitr::kable(crosby1)
Season Age Tm Lg GP G A PTS +/- PIM EV PP SH GW EV PP SH S S% TSA TOI ATOI FOW FOL FO% HIT BLK TK GV Awards
15 Career NHL 810 394 662 1056 158 608 270 121 3 54 390 267 5 2685 14.7 4096 16909 20:53 7596 6979 52.1 559 303 343 516

So looking at the data scraped from here Crosby has 394 goals, 1056 points, 121 power play goals, 662 assists, 2685 shots, and has played 810 games. So when we look at the decision tree, we follow the branches to see that that his probability of induction is 1. If you watch him play you will agree with this.

Evgeni Malkin

Evgeni Malkin is also a center who has dominated his position. Malkin has such a strong presence on the ice, which has carried over to his fair share of awards. He also has three cups, two Art Ross Trophies, a Conn Smythe Trophy, a Hart Memorial Trophy, and a Calder Memorial Trophy. Although he is overshadowed by Crosby a bit, Malkin has made a great career for himself.

page_url <- "https://www.hockey-reference.com/players/m/malkiev01.html"
XPATH <- '//*[@id="stats_basic_plus_nhl"]'
table_list2 <- 
  page_url %>%
  read_html(stringsAsFactors=FALSE) %>%
  html_nodes(xpath = XPATH) %>%
  html_table(fill = TRUE)
malkin<- table_list2[[1]]
colnames(malkin)=malkin[1,]
malkinall<-malkin[2:13,]
malkin1<-malkin[-1:-13,]
knitr::kable(malkin1)
Season Age Tm Lg GP G A PTS +/- PIM EV PP SH GW EV PP SH S S% TSA TOI ATOI FOW FOL FO% HIT BLK TK GV Awards
14 Career NHL 730 336 520 856 75 748 209 123 4 60 312 207 1 2522 13.3 4225 14731 20:11 3115 4025 43.6 391 254 526 579

Scraping his career stats from here we see that Malkin has played 730 games, scored 336 goals, has 520 assists, 856 points, 2522 shots, and he has 123 power play goals. When we look at the decision tree (as of December 5th 2017), and follow the correct path based on his career stats, his probability of being inducted is 0.023, which doesn’t make a lot of sense. I know that his chances are a lot less than Crosby, but I would have expected a higher probability than 0.023.

Interestingly, if Malkin scores just two more power play goals (which unless he quits the sport at this very moment, this is a given to occur), his chances of being inducted into the Hall of Fame rise to 0.615. I think this is more accurate, because he is one of the better players on the ice today. A probability of induction of 0.615 is a lot more reasonable than 0.023 in my eyes.

Phil Kessel

Phil Kessel is the traditional Cinderella story, where he was the best player on a bad team, and was traded to win back to back Stanley Cups. He is a right winger who can score goals, even if he does not fit the hockey player stereotype.

page_url <- "https://www.hockey-reference.com/players/k/kesseph01.html"
XPATH <- '//*[@id="stats_basic_plus_nhl"]'
table_list3 <- 
  page_url %>%
  read_html(stringsAsFactors=FALSE) %>%
  html_nodes(xpath = XPATH) %>%
  html_table(fill = TRUE)
kessel<- table_list3[[1]]
colnames(kessel)=kessel[1,]
kessel1<-kessel[-1:-16,]
knitr::kable(kessel1)
Season Age Tm Lg GP G A PTS +/- PIM EV PP SH GW EV PP SH S S% TSA TOI ATOI FOW FOL FO% HIT BLK TK GV Awards
17 Career NHL 860 308 374 682 -59 258 226 81 1 51 236 138 0 2886 10.7 4936 15691 18:15 290 387 42.8 135 190 338 549

The stats that were scraped from the web are from here, and we see that Phil has played 860 games, scored 308 goals, has 374 assists, 682 points, has 81 power play goals, and 2886 shots. So when we take a look at the right wing tree, we can follow the path to get a probability of 0.121. Again this seems like a low probability to me, but still pretty accurate.

Kris Letang

Kris Letang is one of the most important players on the Pittsburgh Penguins for so many reasons. Mostly because he is one of the best defenseman in the league. Lets take a look at his stats and compare it to the decision tree to see his chances at the Hall of Fame.

page_url <- "https://www.hockey-reference.com/players/l/letankr01.html"
XPATH <- '//*[@id="stats_basic_plus_nhl"]'
table_list3 <- 
  page_url %>%
  read_html(stringsAsFactors=FALSE) %>%
  html_nodes(xpath = XPATH) %>%
  html_table(fill = TRUE)
letang<- table_list3[[1]]
colnames(letang)=letang[1,]
letang1<-letang[-1:-13,]
knitr::kable(letang1)
Season Age Tm Lg GP G A PTS +/- PIM EV PP SH GW EV PP SH S S% TSA TOI ATOI FOW FOL FO% HIT BLK TK GV Awards
14 Career NHL 631 88 318 406 45 460 53 32 3 18 176 135 7 1585 5.6 3070 14830 23:30 1 2 33.3 1025 949 289 513

We can see from the data scraped from here that Letang has played in 631 games, has 88 goals, 318 assists, 406 points, 32 power play goals, and 1585 shots. If you remember, the defenseman decision tree was very unhelpful for finding out new information, but lets still put it to the test. If you follow the correct braches, it gives Letang a 0.081 probability. Again this seems low to me. I think my biases may come into play a little more with this one, but I think his odds should be more than 0.081. But then again, it is hard to predict things in hockey, and there are so many other variables I left out that contribute to a player getting in the Hall of Fame.

One More Interesting Thing…

In my opinion, I think that Malkin is one of the best players in the league currently, and he is often overlooked because he playes on a team with Crosby. But the question is is he good enough to make it into the Hall of Fame? According to the decision tree, Crosby is a shoo-in for the Hall of Fame, and Malkin’s chances are about 0.023 at the moment (but with a very good chance to jump to 0.615 in the near future). The two big differences in the two players are their goals and their points, so lets dive a little deeper…

To get the data ready, I took the data I scraped from the web for both Crosby and Malkin and took the data over the course of their careers, which is different from how I only took their totals above (I did not include the current season, 2017-2018, due to the fact it is currently going on and it is incomplete data). I then added a column named player to each of the data tables with their last name so I could differentiate where the data was coming from. I then concatenated the data using an rbind command, and only selected the variables I needed for the analysis.

crosbypoints<-crosby[2:13,]
crosbypoints["player"]<-"crosby"
  
malkinpoints<-malkin[3:12,]
malkinpoints["player"]<-"malkin"

compare<-
  rbind(crosbypoints,malkinpoints)
compare1<-
  compare %>% 
  select(Season,PTS,G,player)

The next step was to convert the points and goal variables to numeric variables so I could find the average of each for both players. I then grouped the data by player, and found the average points and goals scored a season for both players. The results are below.

compare1$PTS<- as.numeric(compare1$PTS)
compare1$G<- as.numeric(compare1$G)

compareavg<-
  compare1 %>% 
  group_by(player) %>% 
  summarize(avgpts=mean(PTS),avggoal=mean(G))
knitr::kable(compareavg)
player avgpts avggoal
crosby 85.58333 31.83333
malkin 74.70000 29.50000

So first I started with looking at the point differences between the players. To do this I used Season as the x-axis variable and points as the y-axis variable. I then broke it up by player, with Crosby being in red and Malkin being in blue. I connected the points just to show the trend over time and make it a little cleaner. I then used the data from the compareavg data table to put two horizontal dashed lines that symbolize the average amount of points scored a season by each player. Crosby on average scores 85.58 points, which is represented by the blue line, and Malkin scores on average 74.7 points a season, shown by the red line.

ggplot(data=compare1,aes(x=Season,y=PTS,group=player,color=player))+
  geom_point()+geom_line()+geom_hline(aes(yintercept = 85.583),color="red",linetype="dashed")+
  geom_hline(aes(yintercept = 74.700),color="cyan",linetype="dashed")

So from looking at this graph we can see that overall Sidney Crosby has more points than Malkin, which contributes to him having a better chance of getting into the Hall of Fame. The only times Malkin scored more points was in his first two seasons, and in the 11-12 season. But this actually leads to another interesting result. The 2011-2012 season was Malkin’s second best season for points, while it was Crosby’s worst. This happened to be the season that Sidney Crosby was out due to a concussion and concussion like symptoms. He missed the first 20 games, came back for 8, and then was out for the next 40 games. That explains why his total was so low, and the explaination for Malkin’s jump can be best explained with the thought that Malkin steps up when Crosby is not doing well, or out all-together. You can clearly see here that Malkin stepped up for the team when he was needed.

For the other graph, I used the same x variable but I changed the y variable to goals. I also altered the horizontal lines to match the average number of goals each player scores in a year.

ggplot(data=compare1,aes(x=Season,y=G,group=player,color=player))+
   geom_point()+geom_line()+geom_hline(aes(yintercept = 31.83),color="red",linetype="dashed")+
   geom_hline(aes(yintercept = 29.50),color="cyan",linetype="dashed")

As you can see on the graph, the difference in the average goals per year is a lot smaller than the difference in the number of points. Crosby on average scores 31.83 goals a year, while Malkin scores about 29.50 a year. The result is actually very similar to the points result. Malkin scored more in his first two seasons, and he clearly stepped up in the 2011-2012 season to fill in for Crosby. But still overall, Crosby has better stats.

Closing Remarks

I wish I had more time in the day so I could have devoted more time working on this. If I did have more time, I would have completely changed the criteria for the defenseman decision tree to use more relative things such as number of blocked shots. I also would have made a tree for a Goalie’s chance of getting inducted into the Hall of Fame. I also would have also liked to look at some other players such as Alex Ovechin, Erik Karlsson, Steven Stamkos, and John Tavares.