Written by Dalton Herrold and Joshua Choi
League of Legends, developed by Riot Games, is one of the most popular games in the world, with a reported peak of over 8 million players logged in at the same time every day; the total daily player count is far greater than this. In this game, there are two teams, red and blue, each consisting of 5 players. Each team's goal is to destroy the other team's home base, and the team that does so first wins the game. Games can be quite long since, as long as both teams' bases stand, the game continues. The average game length, according to Riot Games, is somewhere around 30 minutes.
Before the game starts, each player must pick a unique champion. At the time of writing there are 152 champions to choose from, and that number keeps growing as Riot regularly introduces new ones. The game gets complex fast: beyond the 10 champions in the game (5 on your team and 5 on the other), there are many systems to keep in mind, such as the economy used to purchase items, the experience system that unlocks more powerful abilities, and neutral objectives that grant powerful effects.
There is a ranked system in the game consisting of Iron, Bronze, Silver, Gold, Platinum, Diamond, Master, Grandmaster, and Challenger. To see the full ranked distribution, check out this site: https://www.leagueofgraphs.com/rankings/rank-distribution. We will be looking at a dataset of statistics from the first 10 minutes of Diamond-ranked North American solo queue games. We will try to determine whether the first 10 minutes dictate the outcome of the game, as well as which statistics are the best predictors of that outcome.
Diamond players are considered to be in the top 2.4% of players according to the resource mentioned before, meaning that their understanding of the game is deep and they typically make high impact decisions that are likely to go well for them. As such, we believe that tracking their behaviors and seeing what leads to wins in their games is an accurate way to determine who will win a game of League of Legends.
For readers who do not know much about League of Legends, this should be a fun way to generate interest in the game while showing off exploratory data analysis; if you want a basic understanding of the game first, we recommend this resource: https://leagueoflegends.fandom.com/wiki/League_of_Legends_Wiki. For those who already play, this should hopefully serve as a rough outline of what your team should prioritize in order to climb the ranks and succeed.
For this section, we will be taking advantage of the following libraries: pandas, numpy, matplotlib, and seaborn.
# Import the needed libraries for this section of the tutorial
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Let us see the entire width of the dataset when printed
pd.set_option('display.max_columns', 50)
# Make our plots look a little better
plt.style.use('ggplot')
For this tutorial, the data has had no preprocessing done to it; we will go through the steps of getting it ready for training ML models just as you normally would.
First things first, let's get the dataset into python and take a thorough look around it.
# Read the dataset from within the same directory as the notebook
games = pd.read_csv("high_diamond_ranked_10min.csv", sep=',')
games.head()
We can see that we are working with 40 columns here, which is quite a lot. Let's take a look at each column and see if there are any we can eliminate right off the bat.
Every variable prefixed with blue has a duplicate for the red team (prefixed with red instead). To avoid redundant explanations, we will not list the red versions separately here.
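If you want to verify that pairing yourself, a quick sketch like the following lists the column names by prefix (the names come straight from the CSV header, so nothing here is hard-coded):
# List the columns for each team so we can see the blue/red pairing
blue_cols = sorted(c for c in games.columns if c.startswith('blue'))
red_cols = sorted(c for c in games.columns if c.startswith('red'))
print(len(blue_cols), 'blue columns,', len(red_cols), 'red columns')
print(blue_cols)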
In this section we are going to clean up the data and "wrangle" it to be more concise and better structured. This data wrangling is important as, if done correctly, it will improve the results, and speed of training, of our ML algorithms later. Part of this process is eliminating unnecessary columns.
The only column we can eliminate right off the bat is gameId, which is the League of Legends API reference for the game. Using this number, we could request more information from the API, but since we are only working with the data already in the dataset, and the ID has no bearing on whether a game is won, let's remove it!
# Dropping the gameID column
games.drop(['gameId'], axis=1, inplace=True)
games.head()
Upon closer inspection of the data columns, it seems like certain columns may be direct linear combinations of other columns. For example, total experience and average level should be directly related, as should CSPerMin and TotalMinionsKilled, and total gold and gold per minute. We want to look at getting rid of these, as keeping them could bias our algorithm toward particular features.
Let me explain that a little further. Suppose we kept two features that encode the same information, say total experience and average level. By keeping both, we would effectively put more weight on experience than on other features, since experience would be represented twice. Since we don't want any one feature to be implicitly weighted higher than another, we will try to remove features that are linearly correlated with each other.
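One way to hunt for such pairs programmatically is to compute the pairwise correlation matrix and flag anything with a magnitude near 1. This is only a rough sketch (the 0.95 cutoff is an arbitrary choice of ours), not a substitute for actually plotting the suspect pairs below.
# Flag feature pairs whose absolute Pearson correlation is (nearly) perfect
corr = games.corr().abs()
for i, col_a in enumerate(corr.columns):
    for col_b in corr.columns[i + 1:]:
        if corr.loc[col_a, col_b] > 0.95:
            print(col_a, '~', col_b, round(corr.loc[col_a, col_b], 3))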
Let's go ahead and take a closer look at total experience and average level and see if we can conclusively say that they are linearly correlated, and if so, remove one of them.
# Creating the plot
plt.figure(figsize=(20, 5))
# Plotting for blue team on the left subplot
plt.subplot(1, 2, 1)
sns.scatterplot(x=games['blueAvgLevel'], y=games['blueTotalExperience'], alpha=0.7, color="blue")
# Labeling the axes
plt.xlabel('Average Level', fontsize=17, fontweight='bold')
plt.ylabel('Total Experience', fontsize=17, fontweight='bold')
plt.title('Total Experience vs. Average Level for Blue Team', fontsize=15, fontweight='bold')
# Plotting for red team on the right subplot
plt.subplot(1, 2, 2)
sns.scatterplot(x=games['redAvgLevel'], y=games['redTotalExperience'], alpha=0.7, color="red")
# Labeling the axes
plt.xlabel('Average Level', fontsize=17, fontweight='bold')
plt.ylabel('Total Experience', fontsize=17, fontweight='bold')
plt.title('Total Experience vs. Average Level for Red Team', fontsize=15, fontweight='bold')
plt.show()
As we can see, there is quite obviously a positive linear relationship between total experience and average level. In fact, they are probably measuring the same thing; average level just appears binned and therefore less precise. For this reason, we are going to remove average level for both the blue and red teams and keep total experience, since it is the more precise of the two.
# Dropping both red and blue average level
games.drop(['redAvgLevel', 'blueAvgLevel'], axis=1, inplace=True)
Now let's take a look at whether there is a correlation between average CS per minute and total minions killed. Once again, it seems obvious that there should be, but it would be a bad idea to assume without checking.
# Creating a new plot
plt.figure(figsize=(20, 5))
# Plotting for blue team on the left subplot
plt.subplot(1, 2, 1)
sns.scatterplot(x=games['blueCSPerMin'], y=games['blueTotalMinionsKilled'], alpha=0.7, color="blue")
plt.xlabel('Average CS Per Minute', fontsize=17, fontweight='bold')
plt.ylabel('Total Minions Killed', fontsize=17, fontweight='bold')
plt.title('Total Minions Killed vs. Average CS Per Minute for Blue Team', fontsize=15, fontweight='bold')
# Plotting for red team on the right subplot
plt.subplot(1, 2, 2)
sns.scatterplot(x=games['redCSPerMin'], y=games['redTotalMinionsKilled'], alpha=0.7, color="red")
plt.xlabel('Average CS Per Minute', fontsize=17, fontweight='bold')
plt.ylabel('Total Minions Killed', fontsize=17, fontweight='bold')
plt.title('Total Minions Killed vs. Average CS Per Minute for Red Team', fontsize=15, fontweight='bold')
plt.show()
Wow, that is a linear combination right there. The points form a perfect line and, looking at the axes, average CS per minute appears to be exactly total minions killed divided by 10 (since this is the first 10 minutes of the game). For this reason, we will remove one of these features. Because it is a perfect linear combination, it shouldn't matter which one we delete, but we are going to delete average CS per minute for both red and blue: we will do our own data scaling later, and it is a little cleaner to start from the original total CS values.
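If you would rather confirm that guess numerically than eyeball the axes, a one-line sanity check does it (shown for the blue team; red works the same way, and a result of 0 would confirm the divide-by-10 relationship):
# Sanity check: is CS per minute literally total minions killed divided by 10?
print((games['blueTotalMinionsKilled'] / 10 - games['blueCSPerMin']).abs().max())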
# Dropping the average cs per minute for both red and blue teams
games.drop(['blueCSPerMin', 'redCSPerMin'], axis=1, inplace=True)
Last but not least, let's take a look at total gold vs gold per minute. I suspect that this will look very similar to total CS vs average CS, but we won't know until we look at the graph.
# Creating a new plot
plt.figure(figsize=(20, 5))
# Plotting for blue team on the left subplot
plt.subplot(1, 2, 1)
sns.scatterplot(x=games['blueGoldPerMin'], y=games['blueTotalGold'], alpha=0.7, color="blue")
plt.xlabel('Gold Per Min', fontsize=17, fontweight='bold')
plt.ylabel('Total Gold', fontsize=17, fontweight='bold')
plt.title('Total Gold vs. Gold Per Minute for Blue Team', fontsize=15, fontweight='bold')
# Plotting for red team on the right subplot
plt.subplot(1, 2, 2)
sns.scatterplot(x=games['redGoldPerMin'], y=games['redTotalGold'], alpha=0.7, color="red")
plt.xlabel('Gold Per Min', fontsize=17, fontweight='bold')
plt.ylabel('Total Gold', fontsize=17, fontweight='bold')
plt.title('Total Gold vs. Gold Per Minute for Red Team', fontsize=15, fontweight='bold')
plt.show()
Once again these columns form a perfect linear combination, with gold per minute appearing to be total gold divided by 10. For exactly the same reason as with the last pair, we are going to drop gold per minute for both the red and blue teams.
# Dropping the average gold per minute for both red and blue teams
games.drop(['redGoldPerMin', 'blueGoldPerMin'], axis=1, inplace=True)
Whew, that was a lot of hard work! Before doing any more analysis, let's take a step back and admire the work that we have done so far.
games.head()
This is looking pretty good so far! However, there is still more work to be done. Taking another look at the dataset, there are 2 more features we can eliminate without reducing its usefulness: redGoldDiff and redExperienceDiff, since they are just the negations of blueGoldDiff and blueExperienceDiff. There is no need to carry extra complexity in our model for information that is already accounted for.
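Before dropping them, it only takes two lines to check that the red diffs really are exact negations of the blue diffs (both prints should come out to 0 if the claim holds):
# Each pair should sum to 0 for every game if red diffs are just negated blue diffs
print((games['blueGoldDiff'] + games['redGoldDiff']).abs().max())
print((games['blueExperienceDiff'] + games['redExperienceDiff']).abs().max())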
# Removing the two features that are redundant
games.drop(['redGoldDiff', 'redExperienceDiff'], axis=1, inplace=True)
Upon further inspection of the dataset, it appears we can also drop the features blueEliteMonsters and redEliteMonsters. Each of these is just the sum of the dragons and Rift Heralds that a team has killed, and since we already have separate features for those epic monsters, there is no reason to keep the combined count.
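Again, this is easy to verify before deleting anything; 0 mismatching rows would confirm that the column is just the sum of the other two:
# Count rows where elite monsters is not simply dragons + heralds
print((games['blueEliteMonsters'] != games['blueDragons'] + games['blueHeralds']).sum())
print((games['redEliteMonsters'] != games['redDragons'] + games['redHeralds']).sum())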
# Dropping the two unnecessary features
games.drop(['blueEliteMonsters', 'redEliteMonsters'], axis=1, inplace=True)
We can also get rid of the features redTotalExperience and blueTotalExperience, since blueExperienceDiff already accounts for both and presents the information in a cleaner form for training later. The same applies to each team's TotalGold, so we will go ahead and remove all four of these columns.
# Dropping the 4 columns that are already accounted for
games.drop(['redTotalExperience', 'blueTotalExperience', 'redTotalGold', 'blueTotalGold'], axis=1, inplace=True)
That last drop suggests other ways to simplify our dataset. For determining the outcome of a game, what matters is not each team's kills, deaths, and assists in isolation, but the gap between the teams, so rather than keeping these features separately we will combine them into a single feature called KDA difference.
Before calculating this difference, which involves computing the KDA for each team, let's check whether there are any null values we would have to deal with.
games.isnull().sum(axis = 0)
That is great, there are no missing values for us to have to fill in.
Now that we know this, we can get on with our KDA difference feature. We will use the formula KDA = (Kills + Assists) / Deaths, as given by Riot Games, to calculate the KDA for each team, and then subtract red's KDA from blue's to give us the difference.
There is one subtlety to mention before we get into it: if deaths is 0, we would be trying to divide by 0. So in that case, we divide by 1 instead.
def get_row_kda(row):
    # Calculate the KDA for each team (ensuring deaths is never 0)
    blue_kda = (row['blueKills'] + row['blueAssists']) / (row['blueDeaths'] if row['blueDeaths'] != 0 else 1)
    red_kda = (row['redKills'] + row['redAssists']) / (row['redDeaths'] if row['redDeaths'] != 0 else 1)
    return blue_kda - red_kda
# Create a new column blueKDADiff by applying the function we just created to every row
games['blueKDADiff'] = games.apply(lambda x: get_row_kda(x), axis=1)
# Drop the now unnecessary columns that we used to create the new column
games.drop(['blueKills', 'blueAssists', 'blueDeaths', 'redKills', 'redAssists', 'redDeaths'], axis=1, inplace=True)
Let's take a look at how that simplified our dataset.
games.head()
That cleaned up our dataset a lot, so let's look back and see if there are other places where we can simplify in the same way. We can do this for CS and jungle CS, since at the end of the day it doesn't matter how much CS either team has; what matters is the difference in CS between the two teams.
# Calculate the difference between the teams' total minions killed
def get_row_cs(row):
    blue_tot_minions = row['blueTotalMinionsKilled']
    red_tot_minions = row['redTotalMinionsKilled']
    return blue_tot_minions - red_tot_minions

# Calculate the difference between the teams' total jungle minions killed
def get_row_jg_cs(row):
    blue_tot_jg_minions = row['blueTotalJungleMinionsKilled']
    red_tot_jg_minions = row['redTotalJungleMinionsKilled']
    return blue_tot_jg_minions - red_tot_jg_minions
# Apply these two functions to every row in the dataset
games['blueCSDiff'] = games.apply(lambda x: get_row_cs(x), axis=1)
games['blueJGCSDiff'] = games.apply(lambda x: get_row_jg_cs(x), axis=1)
# Drop the now obsolete columns from our dataset
games.drop(['blueTotalMinionsKilled', 'redTotalMinionsKilled', 'blueTotalJungleMinionsKilled', 'redTotalJungleMinionsKilled'], axis=1, inplace=True)
Based on the above analysis, we can do the same thing for wards placed and wards destroyed. Which team has more vision matters more for determining who will win than how much vision either team has in absolute terms.
# Calculate the difference in wards between blue and red
def get_row_wards(row):
    return row['blueWardsPlaced'] - row['redWardsPlaced']

# Calculate the difference in wards destroyed between blue and red
def get_row_dest_wards(row):
    return row['blueWardsDestroyed'] - row['redWardsDestroyed']
# Create new columns for our new features by applying the functions to the dataset
games['blueWardDiff'] = games.apply(lambda x: get_row_wards(x), axis=1)
games['blueDestWardDiff'] = games.apply(lambda x: get_row_dest_wards(x), axis=1)
# Remove the old, unneeded features
games.drop(['blueWardsPlaced', 'redWardsPlaced', 'blueWardsDestroyed', 'redWardsDestroyed'], axis=1, inplace=True)
After that hard work, let's take a breather and look at the structure of our dataset.
games.head()
Wow, our dataset is looking pretty good! There is still a lot more work to do before we get to training an ML model but we should give ourselves a pat on the back for what we have done so far.
Next up, let's take a look around the dataset some more and see what features seem to be the most correlated with winning a game. In this next section, we may end up removing some features if we see that they do not significantly contribute to the chance of winning of a game.
During this section, we want to explore the data and any correlations that may exist, and begin to determine which features will be important to us during the prediction phase. First, we will look at some basic questions: does getting the first kill of the game give you a better chance of winning? What about the first epic monster kill? Does taking the first turret make a difference? Further down the line, we will look at more complex relationships, such as kills, deaths, and assists, to see if we can spot any trends.
First, let's take a look at the chance of winning the game given first blood. First blood is an important metric to track, not only because of the morale boost your team gets from the first kill, but also because it is worth 33% more gold than a normal kill, granting 400 gold instead of the usual 300. For this, we will calculate the conditional probability of winning given that the team got first blood, written as P(winning|first blood).
To calculate this, we will calculate the number of times a team won the game and got first blood. We will then divide that by the total number of times that a team got first blood. This will give us the conditional probability that we are after.
# Create a new plot
plt.figure(figsize=(20, 5))
# performing the calculation described above.
blue_fb_win = len(games[(games['blueFirstBlood'] == 1) & (games['blueWins'] == 1)])
blue_fb = len(games[(games['blueFirstBlood'] == 1)])
red_fb_win = len(games[(games['redFirstBlood'] == 1) & (games['blueWins'] == 0)])
red_fb = len(games[(games['redFirstBlood'] == 1)])
both = [blue_fb_win/blue_fb, red_fb_win/red_fb]
# Creating a barplot and labeling the axes
ax = sns.barplot(x=['Blue', 'Red'], y=both, alpha=0.7, palette=["Blue", "Red"])
plt.xlabel('Team', fontsize=17, fontweight='bold')
plt.ylabel('Winning Percentage', fontsize=17, fontweight='bold')
plt.title('Winning Percentage Given First Blood for Each Team', fontsize=22, fontweight='bold')
# Adding the percentage value above each bar
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x() + p.get_width() / 2.,
            height + .007,
            '{:1.2f}%'.format(100 * height),
            ha="center")
plt.show()
Based on these values, we can see that a team that achieves first blood within the first 10 minutes is favored to win at about a 60% chance, written P(winning|first blood) = 0.60. A 60% chance of winning is pretty high, certainly better than casino odds. It is important to remember, however, that a team is not guaranteed to achieve first blood within the first 10 minutes, so we must keep looking at other features as well.
Next, let's take a look at what effect dragon kills have on the favorability of a team to win. We will be using the same equation for conditional probability described above, only this time we will be finding the probability of winning the game given at least one dragon kill. This equation can be written as P(winning|at least one dragon kill).
# Create a new plot
plt.figure(figsize=(20, 5))
# Find the probability of winning given at least one
# dragon using the conditional probability formula
blue_fm_win = len(games[(games['blueDragons'] >= 1) & (games['blueWins'] == 1)])
blue_fm = len(games[(games['blueDragons'] >= 1)])
red_fm_win = len(games[(games['redDragons'] >= 1) & (games['blueWins'] == 0)])
red_fm = len(games[(games['redDragons'] >= 1)])
y = [blue_fm_win/blue_fm, red_fm_win/red_fm]
# Plot the results and label the graph
ax = sns.barplot(x=['Blue', 'Red'], y=y, alpha=0.7, palette=["Blue", "Red"])
plt.xlabel('Team', fontsize=17, fontweight='bold')
plt.ylabel('Winning Percentage', fontsize=17, fontweight='bold')
plt.title('Winning Percentage Given at Least One Dragon for Each Team', fontsize=22, fontweight='bold')
# Add the heights of the bars
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x() + p.get_width() / 2.,
            height + .007,
            '{:1.2f}%'.format(100 * height),
            ha="center")
plt.show()
We can see that this statistic is more influential for predicting a win than first blood: P(winning|at least one dragon) = 0.633, averaged across both teams. This makes sense because of the impact a single dragon can have for the rest of the game. Dragon is a neutral objective that grants a permanent buff, and those buffs stack as a team slays more dragons. Because of these powerful, permanent, and potentially game-changing effects, taking a dragon boosts your chances of winning considerably.
Next, we will determine whether a dragon or a Rift Herald is more important by looking at the probability of winning given at least one Rift Herald, written as P(winning|at least one rift herald) and computed with the same conditional probability formula. By comparing this to the result we obtained for dragons, we should be able to tell which epic monster is more influential.
# Create the plot
plt.figure(figsize=(20, 5))
# Calculate the percentage of winning for each team given at least one
# rift herald using the conditional probability formula
blue_fm_win = len(games[(games['blueHeralds'] >= 1) & (games['blueWins'] == 1)])
blue_fm = len(games[(games['blueHeralds'] >= 1)])
red_fm_win = len(games[(games['redHeralds'] >= 1) & (games['blueWins'] == 0)])
red_fm = len(games[(games['redHeralds'] >= 1)])
y = [blue_fm_win/blue_fm, red_fm_win/red_fm]
# Plot the results and label the graph
ax = sns.barplot(x=['Blue', 'Red'], y=y, alpha=0.7, palette=["Blue", "Red"])
plt.xlabel('Team', fontsize=17, fontweight='bold')
plt.ylabel('Winning Percentage', fontsize=17, fontweight='bold')
plt.title('Winning Percentage Given at Least One Rift Herald for Each Team', fontsize=22, fontweight='bold')
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x() + p.get_width() / 2.,
            height + .007,
            '{:1.2f}%'.format(100 * height),
            ha="center")
plt.show()
Rift Herald is a neutral objective that, once defeated, fights for your team and destroys turrets extremely quickly. Given the amount of gold that turret kills grant, on top of the territory gained from a dead turret, securing the Rift Herald matters for determining which team will win. Comparing it to the results for dragons, however, we can conclude that it is not as influential, since P(winning|at least one rift herald) = 0.606 < P(winning|at least one dragon) = 0.633. This makes sense: a dragon gives improved stats that last the entire game, while the Rift Herald only impacts the game for a few minutes.
Next up, we will determine the P(winning|at least one tower).
# Setup the graph
plt.figure(figsize=(20, 5))
# Calculate the percentage of winning for each team given at
# least one tower using the conditional probability formula
blue_ft_win = len(games[(games['blueTowersDestroyed'] >= 1) & (games['blueWins'] == 1)])
blue_ft = len(games[(games['blueTowersDestroyed'] >= 1)])
red_ft_win = len(games[(games['redTowersDestroyed'] >= 1) & (games['blueWins'] == 0)])
red_ft = len(games[(games['redTowersDestroyed'] >= 1)])
y = [blue_ft_win/blue_ft, red_ft_win/red_ft]
# Plot the results and label the axes
ax = sns.barplot(x=['Blue', 'Red'], y=y, alpha=0.7, palette=["Blue", "Red"])
plt.xlabel('Team', fontsize=17, fontweight='bold')
plt.ylabel('Winning Percentage', fontsize=17, fontweight='bold')
plt.title('Winning Percentage Given At Least One Turret Destroyed for Each Team', fontsize=22, fontweight='bold')
# Label the bars with their respective percentages
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x() + p.get_width() / 2.,
            height + .007,
            '{:1.2f}%'.format(100 * height),
            ha="center")
plt.show()
Based on the above, it seems that destroying a tower within the first 10 minutes increases your chances of winning significantly more than securing a dragon, Rift Herald, or first blood does. Intuitively, this makes sense, as it is much harder to destroy a tower than it is to get first blood or an epic monster kill. Additionally, given how focused League of Legends is on economy, it makes sense that a global gold reward on top of a map advantage would lead to victory.
Next, let's look at some of the other features and see if we can find correlations. For these features, we will no longer use conditional probability; we will look at histograms instead, split by which team won. If the combined data is unimodal, the feature does not separate blue wins from red wins and therefore does not significantly impact win percentage. If it is bimodal, the games blue wins and the games red wins cluster around different values of the feature, which indicates that the feature is significant.
To start, we will be looking at a histogram of Gold Difference. Just a note that gold differences are with respect to the blue team. This means that positive gold differences signify that blue is ahead in gold, and negative gold differences mean that red is ahead.
# Setup the graph
plt.figure(figsize=(20, 5))
# Split the data between blue winning and red winning
# and plot with the corresponding color
for color, data in games.groupby('blueWins'):
    sns.distplot(data['blueGoldDiff'], bins=50, kde=False, color='red' if color == 0 else 'blue')
# Label the axes
plt.xlabel('Gold Difference', fontsize=17, fontweight='bold')
plt.ylabel('Number of Wins', fontsize=17, fontweight='bold')
plt.title('Histogram of Winning\'s Relationship with Gold Difference', fontsize=22, fontweight='bold')
plt.show()
The red bars represent games the red team won, the blue bars represent games the blue team won, and the purple region is where the two distributions overlap. It is clear that once your team's gold lead hits around 5000, the game is nearly yours! This data is bimodal, which is what we expected for gold difference: the larger blue's gold lead, the more likely they are to win, and the same goes for red. Therefore, gold lead does have a significant impact. Of course, there are games where a team wins while far behind in gold, but they are few and far between (and usually due to human error).
Next, let's take a look at experience.
# Get a new graph
plt.figure(figsize=(20, 5))
# Split the data between red winning and blue winning
# and plot (using different colors for different teams)
for color, data in games.groupby('blueWins'):
    sns.distplot(data['blueExperienceDiff'], bins=50, kde=False, color='red' if color == 0 else 'blue')
# Label the axes
plt.xlabel('Experience Difference', fontsize=17, fontweight='bold')
plt.ylabel('Number of Wins', fontsize=17, fontweight='bold')
plt.title('Histogram of Winning\'s Relationship with Experience Difference', fontsize=22, fontweight='bold')
plt.show()
It is clear that having an experience lead tends to translate into a win for your team, since the data is bimodal. For both the blue and red teams, the chance of winning increases as the experience lead grows, and at around a 3750 experience lead, you can expect to win the game. As such, experience is an important factor in determining whether you will win a game of League.
Now, let's look at KDA difference.
# Start up a new plot
plt.figure(figsize=(20, 5))
# Split the data between red winning and blue winning
# and plot (using different colors for different teams)
for color, data in games.groupby('blueWins'):
    sns.distplot(data['blueKDADiff'], bins=100, kde=False, color='red' if color == 0 else 'blue')
# Label the graph
plt.xlabel('KDA difference', fontsize=17, fontweight='bold')
plt.ylabel('Number of Wins', fontsize=17, fontweight='bold')
plt.title('Histogram of Winning\'s Relationship with KDA Difference', fontsize=22, fontweight='bold')
plt.show()
This data is only barely bimodal (it looks nearly unimodal), which indicates that KDA is less important than the other factors. While the graph shows that having a KDA lead over your opponent tends to lead to wins, the cutoff is not as sharp as with the previous metrics. It appears that at a KDA difference slightly above 10 in your favor, you can expect to win, but for values inside that range it is not as clear-cut which team will win. That being said, KDA is still a factor in determining whether you will win the game.
Next, we will be looking at CS.
# Start a new plot
plt.figure(figsize=(20, 5))
# Split the data between red winning and blue winning
# and plot (using different colors for different teams)
for color, data in games.groupby('blueWins'):
    sns.distplot(data['blueCSDiff'], bins=50, kde=False, color='red' if color == 0 else 'blue')
# Label the axes
plt.xlabel('CS Difference', fontsize=17, fontweight='bold')
plt.ylabel('Number of Wins', fontsize=17, fontweight='bold')
plt.title('Histogram of Winning\'s Relationship with CS Difference', fontsize=22, fontweight='bold')
plt.show()
This data also looks only barely bimodal, indicating that CS has a marginal importance for a win. Although a CS lead is a non-trivial factor because of the gold that CS provides, upsets clearly still happen despite one. A CS lead of around 80 nearly guarantees your team the win, but as with KDA, a slight lead does not have much impact on the outcome. There is a lot of overlap in the center, meaning that unless you have a very sizeable lead, this statistic is not especially accurate in predicting who will win.
The next feature we will take a look at is jungle CS, which is the number of jungle minions killed.
# Setup a new graph
plt.figure(figsize=(20, 5))
# Split the data between red winning and blue winning
# and plot (using different colors for different teams)
for color, data in games.groupby('blueWins'):
    sns.distplot(data['blueJGCSDiff'], bins=50, kde=False, color='red' if color == 0 else 'blue')
# Label the axes
plt.xlabel('Jungle CS Difference', fontsize=17, fontweight='bold')
plt.ylabel('Number of Wins', fontsize=17, fontweight='bold')
plt.title('Histogram of Winning\'s Relationship with Jungle CS Difference', fontsize=22, fontweight='bold')
plt.show()
This graph looks unimodal, indicating an insignificant impact on whether a team wins. Jungle CS seems to matter even less than overall team CS. There is a lot of purple overlap, meaning there were many games in which the team behind in jungle CS won anyway. Because of this and the unimodal shape of the graph, it is in our best interest to remove this feature, as we do not want to add unnecessary complexity to our model.
# Removing Jungle CS as a feature
games.drop(['blueJGCSDiff'], axis=1, inplace=True)
The next feature that we will be taking a look at is wards.
# Start the graph
plt.figure(figsize=(20, 5))
# Split the data between red winning and blue winning
# and plot (using different colors for different teams)
for color, data in games.groupby('blueWins'):
    data = data[(data['blueWardDiff'] < 50) & (data['blueWardDiff'] > -50)]
    sns.distplot(data['blueWardDiff'], bins=99, kde=False, color='red' if color == 0 else 'blue')
# Label the axes
plt.xlabel('Ward Difference', fontsize=17, fontweight='bold')
plt.ylabel('Number of Wins', fontsize=17, fontweight='bold')
plt.title('Histogram of Winning\'s Relationship with Ward Difference', fontsize=22, fontweight='bold')
plt.show()
This data is very unimodal. Ward difference appears to be insignificant, so we will remove it to avoid excess complexity in the ML model. The sheer amount of purple overlap implies that although having a vision advantage from wards may feel important in game, it is not a reliable signal of who will win a game of League, with no threshold that comes close to guaranteeing a win.
games.drop(['blueWardDiff'], axis=1, inplace=True)
Finally, let's look at the last feature, wards destroyed difference.
# Start the graph
plt.figure(figsize=(20, 5))
# Split the data between red winning and blue winning
# and plot (using different colors for different teams)
for color, data in games.groupby('blueWins'):
    sns.distplot(data['blueDestWardDiff'], bins=45, kde=False, color='red' if color == 0 else 'blue')
# Label the axes
plt.xlabel('Wards Destroyed Difference', fontsize=17, fontweight='bold')
plt.ylabel('Number of Wins', fontsize=17, fontweight='bold')
plt.title('Histogram of Winning\'s Relationship with Wards Destroyed Difference', fontsize=22, fontweight='bold')
plt.show()
Once again, since the graph is very unimodal, we are going to get rid of this feature. With a distribution like that, all it would do is add unneeded complexity.
games.drop(['blueDestWardDiff'], axis=1, inplace=True)
Overall, we now have a good idea of what influences winning a game of League of Legends. Getting a tower, dragon, Rift Herald, or first blood within the first 10 minutes of the game increases your chance of winning to at least 60%, which is very good. Additionally, we concluded that gold difference and experience difference contribute the most to a team's win, while KDA difference and CS difference matter, but not nearly as much as the first two. We further concluded that jungle CS difference, ward difference, and wards destroyed difference do not appear to matter at all, and we removed them from the dataset to avoid adding unnecessary complexity when we train our ML models.
In this section, we are going to focus on trying to predict which team will win a game of League of Legends when given the first 10 minutes of data. We will be testing out many different models, attempting to find the one that gives us the best results. After this model is determined, we will attempt to improve it even further by fine-tuning its hyperparameters.
The first thing that we have to do is import the models that we want.
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, GradientBoostingClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
We will define a function that returns a dictionary of all of our untrained models. This will be important later when we do 10-fold cross-validation, since we do not want to refit an already-fitted model with new training data. Here we will also define two dictionaries: one to hold the scores of the different models and another to hold the errors.
# Function to return newly created models
def get_fresh_models():
    return {
        'Decision Tree': DecisionTreeClassifier(),
        'Logistic Regression': LogisticRegression(max_iter=1000),
        'Gaussian Naive Bayes': GaussianNB(),
        'Support Vector Classifier': SVC(max_iter=10000),
        'K-Nearest Neighbors': KNeighborsClassifier(),
        'Gradient Boost': GradientBoostingClassifier(),
        'RandomForest': RandomForestClassifier(),
        'BaggingClassifier': BaggingClassifier(),
        'LinearDiscriminantAnalysis': LinearDiscriminantAnalysis(),
    }
# Dict to hold scores
scores = {
    'Decision Tree': np.zeros(10),
    'Logistic Regression': np.zeros(10),
    'Gaussian Naive Bayes': np.zeros(10),
    'Support Vector Classifier': np.zeros(10),
    'K-Nearest Neighbors': np.zeros(10),
    'Gradient Boost': np.zeros(10),
    'RandomForest': np.zeros(10),
    'BaggingClassifier': np.zeros(10),
    'LinearDiscriminantAnalysis': np.zeros(10),
}
# Dict to hold errors
errors = {
    'Decision Tree': np.zeros(10),
    'Logistic Regression': np.zeros(10),
    'Gaussian Naive Bayes': np.zeros(10),
    'Support Vector Classifier': np.zeros(10),
    'K-Nearest Neighbors': np.zeros(10),
    'Gradient Boost': np.zeros(10),
    'RandomForest': np.zeros(10),
    'BaggingClassifier': np.zeros(10),
    'LinearDiscriminantAnalysis': np.zeros(10),
}
Now that we have our models, we will train and test them to determine which one is best. To do this, we will use 10-fold cross-validation: we split the data into 10 folds, and in each of the 10 iterations we train all of the models on 9 of the folds and score them on the held-out fold. If you would like to learn more about cross-validation, check out this beginner-friendly intro here. Using these 10 scores per model, we will determine which performed the best.
# Import the necessary functions for training, testing, and evaluating
from sklearn.model_selection import train_test_split, KFold
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler
# Create a standard scaler so we can standardize the dataset
ss = StandardScaler()
# Split the features from the correct output and scale the features
X = games[[i for i in list(games.columns) if i != 'blueWins']].values
X = ss.fit_transform(X)
y = games['blueWins'].values
# Declare a KFold object and loop through each fold, training and
# testing every model
kf = KFold(n_splits=10)
fold = -1
for train_index, test_index in kf.split(X):
    fold += 1
    # Get the train and test data for this fold
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # Get new fresh models
    models = get_fresh_models()
    # Train each model and update their score/error for this iteration
    for name, model in models.items():
        model.fit(X_train, y_train)
        scores[name][fold] = model.score(X_test, y_test)
        y_pred = model.predict(X_test)
        errors[name][fold] = mean_squared_error(y_true=y_test, y_pred=y_pred)
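As an aside, if all you needed were the accuracy scores, scikit-learn's cross_val_score could replace the manual fold loop entirely. We keep the loop above because we also want the per-fold MSE for every model, but here is a minimal sketch of the shortcut for a single model:
# Accuracy-only shortcut: 10-fold cross-validation in one call
from sklearn.model_selection import cross_val_score
lr_scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=10)
print('Logistic Regression 10-fold accuracy: %.4f' % lr_scores.mean())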
Now we have the results: scores and errors for each model across all 10 folds. Let's get them into a more readable format so we can easily determine which model is best.
# Create a new dataframe to display the analysis
analysis_df = pd.DataFrame()
# For each model, calculate the min, max, average, and standard deviation
# of its scores and errors, and store them as a column named after the model
for name, values in scores.items():
    min_score = values.min()
    max_score = values.max()
    avg_score = values.mean()
    std_score = values.std(ddof=1)
    min_mse = errors[name].min()
    max_mse = errors[name].max()
    avg_mse = errors[name].mean()
    std_mse = errors[name].std(ddof=1)
    analysis_df[name] = np.array([min_score, avg_score, max_score, std_score, min_mse, avg_mse, max_mse, std_mse])
# Transpose the dataframe so the columns are now the rows
analysis_df = analysis_df.T
# Create the column names and sort by average score
analysis_df.columns = ['Min Score', 'Avg Score', 'Max Score', 'STD Score', 'Min MSE', 'Avg MSE', 'Max MSE', 'STD MSE']
analysis_df = analysis_df.sort_values(by = 'Avg Score', ascending=False)
analysis_df
Looks like Logistic Regression is the best model at predicting wins for us, and decision trees are the weakest.
Logistic Regression is so strong partly because of the precautions we took with our data. We made sure no columns were simple linear combinations of one another, and we removed high-variance features such as ward counts that could have led to strange conclusions. Additionally, the value we are trying to predict, blueWins, is categorical, with 0 for a loss and 1 for a win, which is the ideal setup for logistic regression. We can also see from the earlier charts that the data is predominantly linear in determining a win: in the gold difference histogram, a gold difference near zero is essentially a coin flip, while moving away from the center, either team's win rate climbs steadily with the size of its gold lead. The fact that the logistic regression model fits our data so well suggests that most of the variables we kept are closely related to the win rate in this way!
Decision Trees may be among the weakest because of the sheer number of variables we are considering. Some variables, like KDA difference, do not predict a win nearly as well as, say, having destroyed a turret, yet the decision tree still splits on them, which can lead to overfitting. That said, KDA difference is still a tangible advantage and not arbitrary enough to throw away, so it remains an important metric to keep. Given our data, then, decision trees simply fall short at predicting a win. Additionally, because League of Legends is such a high-variance game, a decision tree can learn rules that do not hold up 100% of the time.
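Before tuning anything, it is worth peeking at what the logistic regression model actually leans on. The sketch below refits a plain LogisticRegression on the full standardized data and ranks the coefficients by magnitude, so treat the exact numbers as indicative rather than definitive; if the earlier analysis holds, gold difference and experience difference should sit near the top.
# Fit a logistic regression on the standardized features and rank its coefficients
feature_names = [c for c in games.columns if c != 'blueWins']
lr = LogisticRegression(max_iter=1000).fit(X, y)
for name, coef in sorted(zip(feature_names, lr.coef_[0]), key=lambda t: abs(t[1]), reverse=True):
    print('%-25s %+.3f' % (name, coef))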
Now let's take a closer look at logistic regression and see if we can do any better by using GridSearchCV to search for the optimal values of its hyperparameters. Click here to learn more about GridSearchCV.
# Import GridSearchCV for finding hyperparameters
from sklearn.model_selection import GridSearchCV
# Define our model
model = LogisticRegression()
# Create our dict of params that we want to try
params = {
    'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000],
    'penalty': ['l2'],
    'max_iter': list(range(100, 2000, 100)),
    'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']
}
# Find the hyperparameters
better_model = GridSearchCV(model, param_grid=params, refit=True, cv=10)
better_model.fit(X, y)
# Print the average accuracy of the model
print('Average Accuracy: %.4f' % better_model.best_score_)
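It is also worth printing out which hyperparameter combination the grid search actually settled on (the exact values may vary from run to run):
# Show the best hyperparameter combination found by the grid search
print(better_model.best_params_)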
Well, that is not as much of an improvement as we were hoping for. Still, a 0.01% increase is an increase! This is probably close to the best we can do for League of Legends given only the first 10 minutes of data. The small improvement is most likely due to the inherent unpredictability of League of Legends: no matter how heavily a team is favored, humans are not perfect, and any small mistake can turn the tide of an otherwise won game.
League is a complex game, and although it is probably not possible to be 100% accurate in determining who will win just from the first 10 minutes, we were still fairly accurate. We were able to find some strong predictors of whether a team will win; for example, a team taking a dragon in the first 10 minutes is a decent predictor that they will go on to win the game.
There are a lot of different factors at play, and even when a game looks won for one team, human error can always turn the tide. That is likely why we cannot squeeze much more accuracy out of only the first 10 minutes.
Our next goal is to predict which team will win at any given time during a game, and to see how that accuracy changes over the course of the match. But that is for the future. For now, 75.5% accuracy from just the first 10 minutes of a Diamond game is pretty good given the unpredictable nature of League of Legends. I would love to try this out on my own games; however, neither I nor my partner is Diamond, because we suck at the game. I hope you found this tutorial helpful, and maybe you will be able to test it out on your own games!