Python data mining – how to win at League of Legends (LoL)? Which characteristics matter most for winning a game

Time: 2021-10-15


League of Legends (LoL) is a multiplayer online battle arena game developed and published by Riot Games for Microsoft Windows and macOS.

In League of Legends, players control a “champion” with unique abilities and fight against a team of other players or computer-controlled champions.

The goal of the game is usually to destroy the opposing team’s “Nexus”, a structure at the heart of the base protected by defensive structures, although other game modes have different goals, rules, and maps. Each match is self-contained: all champions start relatively weak and grow stronger by accumulating items and experience over the course of the game.

This dataset contains statistics from the first 10 minutes of roughly 10,000 ranked games (solo queue) played by high-Elo players (Diamond I to Master).

At that level the players are roughly evenly matched. Each team has 19 features (38 in total) collected at the 10-minute mark, including kills, deaths, gold, experience, and levels.

It’s up to you to do some feature engineering to get more insights (one hypothetical sketch follows the data summary below). The column blueWins is the target (the value we are trying to predict); 1 means the blue team won.

The purpose of this article is to identify which characteristics are most relevant to winning in LoL.

Setup

In [1]:

# Import packages and data

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

import seaborn as sns

In [2]:

%matplotlib inline

sns.set_style('darkgrid')

In [3]:

df = pd.read_csv('/home/kesci/input/lol8974/high_diamond_ranked_10min.csv')

df.head()

Out[3]:

gameId blueWins blueWardsPlaced blueWardsDestroyed blueFirstBlood blueKills blueDeaths blueAssists blueEliteMonsters blueDragons … redTowersDestroyed redTotalGold redAvgLevel redTotalExperience redTotalMinionsKilled redTotalJungleMinionsKilled redGoldDiff redExperienceDiff redCSPerMin redGoldPerMin
0 4519157822 0 28 2 1 9 6 11 0 0 … 0 16567 6.8 17047 197 55 -643 8 19.7 1656.7
1 4523371949 0 12 1 0 5 5 5 0 0 … 1 17620 6.8 17438 240 52 2908 1173 24.0 1762.0
2 4521474530 0 15 0 0 7 11 4 1 1 … 0 17285 6.8 17254 203 28 1172 1033 20.3 1728.5
3 4524384067 0 43 1 0 4 5 5 1 0 … 0 16478 7.0 17961 235 47 1321 7 23.5 1647.8
4 4436033771 0 75 4 0 6 6 6 0 0 … 0 17404 7.0 18313 225 67 1004 -230 22.5 1740.4

5 rows × 40 columns
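Before going further, it is worth checking the class balance of the target blueWins: a roughly even split between wins and losses means about 50% accuracy is the baseline any model has to beat. A minimal check (the output is not shown in the original run):

# Check the class balance of the target (output not shown in the source run)
df['blueWins'].value_counts(normalize=True)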

EDA

In [4]:

# Check for missing values and data types

df.info()

<class 'pandas.core.frame.DataFrame'>

RangeIndex: 9879 entries, 0 to 9878

Data columns (total 40 columns):

gameId 9879 non-null int64

blueWins 9879 non-null int64

blueWardsPlaced 9879 non-null int64

blueWardsDestroyed 9879 non-null int64

blueFirstBlood 9879 non-null int64

blueKills 9879 non-null int64

blueDeaths 9879 non-null int64

blueAssists 9879 non-null int64

blueEliteMonsters 9879 non-null int64

blueDragons 9879 non-null int64

blueHeralds 9879 non-null int64

blueTowersDestroyed 9879 non-null int64

blueTotalGold 9879 non-null int64

blueAvgLevel 9879 non-null float64

blueTotalExperience 9879 non-null int64

blueTotalMinionsKilled 9879 non-null int64

blueTotalJungleMinionsKilled 9879 non-null int64

blueGoldDiff 9879 non-null int64

blueExperienceDiff 9879 non-null int64

blueCSPerMin 9879 non-null float64

blueGoldPerMin 9879 non-null float64

redWardsPlaced 9879 non-null int64

redWardsDestroyed 9879 non-null int64

redFirstBlood 9879 non-null int64

redKills 9879 non-null int64

redDeaths 9879 non-null int64

redAssists 9879 non-null int64

redEliteMonsters 9879 non-null int64

redDragons 9879 non-null int64

redHeralds 9879 non-null int64

redTowersDestroyed 9879 non-null int64

redTotalGold 9879 non-null int64

redAvgLevel 9879 non-null float64

redTotalExperience 9879 non-null int64

redTotalMinionsKilled 9879 non-null int64

redTotalJungleMinionsKilled 9879 non-null int64

redGoldDiff 9879 non-null int64

redExperienceDiff 9879 non-null int64

redCSPerMin 9879 non-null float64

redGoldPerMin 9879 non-null float64

dtypes: float64(6), int64(34)

memory usage: 3.0 MB
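The introduction leaves feature engineering as an exercise. As one hypothetical sketch (the columns blueKillDiff and blueGoldPerKill are invented here for illustration and are not used in the analysis below), differential and ratio features often carry more signal than raw counts:

# Hypothetical engineered features (illustrative only; not used below)
df_fe = df.copy()
df_fe['blueKillDiff'] = df_fe['blueKills'] - df_fe['blueDeaths']
# clip avoids division by zero in games with no blue kills
df_fe['blueGoldPerKill'] = df_fe['blueTotalGold'] / df_fe['blueKills'].clip(lower=1)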

In [5]:

df_clean = df.copy()

In [6]:

# Drop redundant columns: pairs such as blueFirstBlood/redFirstBlood, blueEliteMonsters/redEliteMonsters, and blueDeaths/redKills duplicate each other

cols = ['gameId', 'redFirstBlood', 'redKills', 'redEliteMonsters', 'redDragons', 'redTotalMinionsKilled',
        'redTotalJungleMinionsKilled', 'redGoldDiff', 'redExperienceDiff', 'redCSPerMin', 'redGoldPerMin', 'redHeralds',
        'blueGoldDiff', 'blueExperienceDiff', 'blueCSPerMin', 'blueGoldPerMin', 'blueTotalMinionsKilled']

df_clean = df_clean.drop(cols, axis = 1)

In [7]:

df_clean.info()

<class 'pandas.core.frame.DataFrame'>

RangeIndex: 9879 entries, 0 to 9878

Data columns (total 23 columns):

blueWins 9879 non-null int64

blueWardsPlaced 9879 non-null int64

blueWardsDestroyed 9879 non-null int64

blueFirstBlood 9879 non-null int64

blueKills 9879 non-null int64

blueDeaths 9879 non-null int64

blueAssists 9879 non-null int64

blueEliteMonsters 9879 non-null int64

blueDragons 9879 non-null int64

blueHeralds 9879 non-null int64

blueTowersDestroyed 9879 non-null int64

blueTotalGold 9879 non-null int64

blueAvgLevel 9879 non-null float64

blueTotalExperience 9879 non-null int64

blueTotalJungleMinionsKilled 9879 non-null int64

redWardsPlaced 9879 non-null int64

redWardsDestroyed 9879 non-null int64

redDeaths 9879 non-null int64

redAssists 9879 non-null int64

redTowersDestroyed 9879 non-null int64

redTotalGold 9879 non-null int64

redAvgLevel 9879 non-null float64

redTotalExperience 9879 non-null int64

dtypes: float64(2), int64(21)

memory usage: 1.7 MB

In [8]:

# Next, let's examine the relationships between the blue team's features

g = sns.PairGrid(data=df_clean, vars=['blueKills', 'blueAssists', 'blueWardsPlaced', 'blueTotalGold'], hue='blueWins', size=3, palette='Set1')

g.map_diag(plt.hist)

g.map_offdiag(plt.scatter)

g.add_legend();

/opt/conda/lib/python3.6/site-packages/seaborn/axisgrid.py: UserWarning: The `size` parameter has been renamed to `height`; please update your code.

warnings.warn(UserWarning(msg))

[Figure: pair grid of blueKills, blueAssists, blueWardsPlaced, and blueTotalGold, colored by blueWins]

We can see strong linear correlations between several of these variables.
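To make that observation concrete, here is one illustrative way (not in the original notebook) to list the most strongly correlated feature pairs:

# List the most correlated feature pairs (illustrative helper)
corr = df_clean.corr().abs()
# Keep only the upper triangle so each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
print(upper.stack().sort_values(ascending=False).head(10))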

In [9]:

# Many features are highly correlated; let's plot the correlation matrix

plt.figure(figsize=(16, 12))

sns.heatmap(df_clean.drop('blueWins', axis=1).corr(), cmap='YlGnBu', annot=True, fmt='.2f', vmin=0);

[Figure: correlation heatmap of the remaining features]

In [10]:

# Based on the correlation matrix, prune the data set a little to avoid collinearity

cols = ['blueAvgLevel', 'redWardsPlaced', 'redWardsDestroyed', 'redDeaths', 'redAssists', 'redTowersDestroyed',
        'redTotalExperience', 'redTotalGold', 'redAvgLevel']

df_clean = df_clean.drop(cols, axis=1)

In [11]:

# Next, drop the columns that correlate only weakly with blueWins

corr_list = df_clean[df_clean.columns[1:]].apply(lambda x: x.corr(df_clean['blueWins']))
cols = []
for col in corr_list.index:
    if corr_list[col] > 0.2 or corr_list[col] < -0.2:
        cols.append(col)
cols

Out[11]:

['blueFirstBlood',
 'blueKills',
 'blueDeaths',
 'blueAssists',
 'blueEliteMonsters',
 'blueDragons',
 'blueTotalGold',
 'blueTotalExperience']

In [12]:

df_clean = df_clean[cols]

df_clean.head()

Out[12]:

blueFirstBlood blueKills blueDeaths blueAssists blueEliteMonsters blueDragons blueTotalGold blueTotalExperience
0 1 9 6 11 0 0 17210 17039
1 0 5 5 5 0 0 14712 16265
2 0 7 11 4 1 1 16113 16221
3 0 4 5 5 1 0 15157 17954
4 0 6 6 6 0 0 16400 18543

In [13]:

df_clean.hist(alpha = 0.7, figsize=(12,10), bins=5);

[Figure: histograms of the selected features]

Model selection

In [14]:

from sklearn.preprocessing import MinMaxScaler

from sklearn.model_selection import train_test_split

X = df_clean

y = df['blueWins']

scaler = MinMaxScaler()

scaler.fit(X)

X = scaler.transform(X)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
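One caveat: fitting the scaler on the full dataset before splitting leaks test-set information into the scaler. A leakage-free variant of the cell above (an adjustment for reference, not the source's original flow) fits the scaler on the training split only:

# Leakage-free variant: fit the scaler on the training split only
X_train, X_test, y_train, y_test = train_test_split(df_clean, y, test_size=0.2, random_state=42)
scaler = MinMaxScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)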

Naive Bayes

In [15]:

from sklearn.naive_bayes import GaussianNB

from sklearn.metrics import accuracy_score

# Fit the model

clf_nb = GaussianNB()

clf_nb.fit(X_train, y_train)

pred_nb = clf_nb.predict(X_test)

# Get the accuracy score

acc_nb = accuracy_score(pred_nb, y_test)

print(acc_nb)

0.7176113360323887

Decision tree

In [16]:

# Fit a decision tree model

from sklearn import tree

from sklearn.model_selection import GridSearchCV

# Use a name that doesn't shadow the sklearn tree module
dt = tree.DecisionTreeClassifier()

# Search for the best parameters
grid = {'min_samples_split': [5, 10, 20, 50, 100]}

clf_tree = GridSearchCV(dt, grid, cv=5)

clf_tree.fit(X_train, y_train)

pred_tree = clf_tree.predict(X_test)

# Get the accuracy score

acc_tree = accuracy_score(pred_tree, y_test)

print(acc_tree)

0.6928137651821862

Random forest

In [17]:

# Fit the model

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier()

# Search for the best parameters

grid = {'n_estimators': [100, 200, 300, 400, 500], 'max_depth': [2, 5, 10]}

clf_rf = GridSearchCV(rf, grid, cv=5)

clf_rf.fit(X_train, y_train)

pred_rf = clf_rf.predict(X_test)

# Get the accuracy score

acc_rf = accuracy_score(pred_rf, y_test)

print(acc_rf)

0.7282388663967612
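The winning hyperparameters were not printed in the source run; a fitted GridSearchCV exposes them via best_params_:

# Inspect the parameters the grid search selected
print(clf_rf.best_params_)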

Logistic regression

In [18]:

# Fit a logistic regression model

from sklearn.linear_model import LogisticRegression

lm = LogisticRegression()

lm.fit(X_train, y_train)

# Get the accuracy score

pred_lm = lm.predict(X_test)

acc_lm = accuracy_score(pred_lm, y_test)

print(acc_lm)

0.7302631578947368

/opt/conda/lib/python3.6/site-packages/sklearn/linear_model/logistic.py: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.

FutureWarning)
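As the warning itself suggests, passing a solver explicitly silences it, for example (shown for reference; the results above were produced with the default solver):

# Specify the solver explicitly to silence the FutureWarning
lm = LogisticRegression(solver='lbfgs').fit(X_train, y_train)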

K-nearest neighbor algorithm (KNN)

In [19]:

# Fit the model

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()

# Search for the best parameters

grid = {'n_neighbors': np.arange(1, 100)}

clf_knn = GridSearchCV(knn, grid, cv=5)

clf_knn.fit(X_train,y_train)

# Get the accuracy score

pred_knn = clf_knn.predict(X_test)

acc_knn = accuracy_score(pred_knn, y_test)

print(acc_knn)

0.7171052631578947

Conclusion

In [20]:

data_dict = {'Naive Bayes': [acc_nb], 'DT': [acc_tree], 'Random Forest': [acc_rf], 'Logistic Regression': [acc_lm], 'K_nearest Neighbors': [acc_knn]}

df_c = pd.DataFrame.from_dict(data_dict, orient='index', columns=['Accuracy Score'])

print(df_c)

Accuracy Score

Naive Bayes 0.717611

DT 0.692814

Random Forest 0.728239

Logistic Regression 0.730263

K_nearest Neighbors 0.717105

From the accuracy scores, logistic regression and random forest predict best. Next, let's take a closer look at the recall and precision of these two models.

In [21]:

# Recall and precision

from sklearn.metrics import recall_score, precision_score

# Logistic regression metrics

# sklearn metrics take (y_true, y_pred); the source passed them reversed,
# which swaps per-class precision and recall
recall_lm = recall_score(y_test, pred_lm, average=None)
precision_lm = precision_score(y_test, pred_lm, average=None)
print('precision score for logistic regression: {}\n recall score for logistic regression: {}'.format(precision_lm, recall_lm))

precision score for logistic regression: [0.72959184 0.73092369]
 recall score for logistic regression: [0.72736521 0.73313192]

In [22]:

# Random forest metrics

recall_rf = recall_score(y_test, pred_rf, average=None)
precision_rf = precision_score(y_test, pred_rf, average=None)
print('precision score for random forest: {}\n recall score for random forest: {}'.format(precision_rf, recall_rf))

precision score for random forest: [0.723 0.73360656]
 recall score for random forest: [0.73550356 0.72104733]

In practice, logistic regression would generally be chosen here: its accuracy matches the random forest's and its coefficients are directly interpretable.
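As a robustness check (not part of the source run), here is a sketch comparing the two leading models with 5-fold cross-validation on the training data:

# 5-fold cross-validated accuracy for the two leading models
from sklearn.model_selection import cross_val_score
for name, model in [('Logistic Regression', LogisticRegression(solver='lbfgs')),
                    ('Random Forest', clf_rf.best_estimator_)]:
    print(name, cross_val_score(model, X_train, y_train, cv=5).mean())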

In [23]:

df_clean.columns

Out[23]:

Index(['blueFirstBlood', 'blueKills', 'blueDeaths', 'blueAssists',
       'blueEliteMonsters', 'blueDragons', 'blueTotalGold',
       'blueTotalExperience'],
      dtype='object')

In [24]:

lm.coef_

Out[24]:

array([[ 0.09223084, 1.6863957 , -4.9521688 , -0.28960539, 0.30701029,

0.29102466, 5.33422084, 1.55350489]])

In [25]:

np.exp(lm.coef_)

Out[25]:

array([[1.09661794e+00, 5.39998244e+00, 7.06806308e-03, 7.48558900e-01,

1.35935495e+00, 1.33779758e+00, 2.07311157e+02, 4.72801236e+00]])

In [26]:

coef_data = np.concatenate((lm.coef_, np.exp(lm.coef_)),axis=0)

coef_df = pd.DataFrame(data=coef_data, columns=df_clean.columns).T.reset_index().rename(columns={'index': 'Var', 0: 'coef', 1: 'oddRatio'})

coef_df.sort_values(by='coef', ascending=False)

Out[26]:

Var coef oddRatio
6 blueTotalGold 5.334221 207.311157
1 blueKills 1.686396 5.399982
7 blueTotalExperience 1.553505 4.728012
4 blueEliteMonsters 0.307010 1.359355
5 blueDragons 0.291025 1.337798
0 blueFirstBlood 0.092231 1.096618
3 blueAssists -0.289605 0.748559
2 blueDeaths -4.952169 0.007068
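Because the inputs were MinMax-scaled to [0, 1], each coefficient is the change in log-odds across a feature's full observed range: the ~207 odds ratio for blueTotalGold says that moving from the lowest to the highest observed 10-minute gold total multiplies the odds of a blue win by roughly 207. A more intuitive step size can be read off directly, e.g. a 10%-of-range increase in gold:

# Odds multiplier for a 10%-of-range increase in blueTotalGold
gold_coef = coef_df.loc[coef_df['Var'] == 'blueTotalGold', 'coef'].iloc[0]
print(np.exp(0.1 * gold_coef))  # ~1.70, i.e. about 70% higher odds of a blue win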

PCA

In [27]:

# Visualize the results using PCA
X = df_clean
y = df['blueWins']

# PCA is affected by scale, so standardize the data set first
from sklearn import preprocessing

# Standardize the features
X = preprocessing.StandardScaler().fit_transform(X)

In [28]:

from sklearn.decomposition import PCA

pca = PCA(n_components=2)

components = pca.fit_transform(X)

print(pca.explained_variance_ratio_)

[0.42435947 0.20805752]
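Together the first two components capture about 63% of the variance (0.424 + 0.208 ≈ 0.632), which can be confirmed directly:

# Cumulative variance explained by the two retained components
print(pca.explained_variance_ratio_.sum())  # ≈ 0.632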

In [29]:

# Create a DataFrame for visualization

df_vis = pd.DataFrame(data=components, columns=['pc1', 'pc2'])
df_vis = pd.concat([df_vis, df['blueWins']], axis=1)

X = df_vis[['pc1', 'pc2']]
y = df_vis['blueWins']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [30]:

# Refit the logistic regression on the PCA components

lm.fit(X_train, y_train)

/opt/conda/lib/python3.6/site-packages/sklearn/linear_model/logistic.py: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.

FutureWarning)

Out[30]:

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

In [31]:

# Visualization function. (The function body was truncated in the source; the
# mesh-grid plotting below is a standard completion sketch, not necessarily
# the author's original code.)
from matplotlib.colors import ListedColormap

def DecisionBoundary(clf):
    X = df_vis[['pc1', 'pc2']]
    y = df_vis['blueWins']
    h = .02  # mesh step size
    xx, yy = np.meshgrid(np.arange(X['pc1'].min() - 1, X['pc1'].max() + 1, h),
                         np.arange(X['pc2'].min() - 1, X['pc2'].max() + 1, h))
    # Predict a class for every mesh point, then draw the decision regions
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    plt.contourf(xx, yy, Z, cmap=ListedColormap(['#FFAAAA', '#AAAAFF']), alpha=0.5)
    plt.scatter(X['pc1'], X['pc2'], c=y, cmap=ListedColormap(['#E41A1C', '#377EB8']), s=4)
    plt.xlabel('pc1')
    plt.ylabel('pc2')

In [32]:

DecisionBoundary(lm)

[Figure: logistic regression decision boundary in the PCA plane]

This work is licensed under a CC license; reproduction must credit the author and link to this article.