Predicting Salaries with Machine Learning

Using Python to build a machine learning model to forecast NBAA salaries and analyse the most important factors

One of the richest and most competitive sports leagues is the NBAA. NBAA players' earnings have been rising over the past several years, but these salaries are determined by a complicated network of circumstances behind every jaw-dropping dunk and three-pointer.

Numerous factors are at play, including market demand, athlete performance, team success, and sponsorship agreements. Who hasn't wondered why their club spent so much money on a player who isn't doing well or admired the thought process that went into a particularly clever business deal?

In this post, we forecast NBAA salaries using machine learning in Python and identify the factors that have the most bearing on players' pay.

Understanding the problem

Understanding the foundations of the league's salary system is crucial before delving into the problem. A player is considered a free agent (FA), a term that will be used frequently in this project, when he is available to sign a contract with any team.

To preserve competitive balance among clubs, the NBAA is governed by a complicated system of rules and regulations. The salary cap and the luxury tax are the two fundamental ideas of this system.

A team's ability to spend money on player wages during a particular season is limited by the salary cap. The cap, which is based on league income, is modified each year to make sure that teams are able to manage their budgets. Additionally, it aims to promote fairness among franchises by preventing large-market teams from spending much more than their counterparts in smaller markets.

The salary cap can be distributed differently among players, with top-tier players receiving maximum pay and rookies and veterans receiving minimum salaries.

However, clubs that want to build rosters capable of contending for championships frequently go above the salary cap. When a club's payroll climbs past a set threshold above the cap, it enters luxury tax territory and pays a penalty. Numerous other rules act as exceptions, such as the mid-level exception (MLE) and the trade exception, which allow clubs to make tactical roster adjustments, but for this project, understanding the salary cap and the luxury tax is sufficient.

Because the salary cap keeps rising, the chosen strategy uses the salary as a percentage of the cap as the target instead of the actual dollar amount. This choice accounts for how the cap changes, so that the target is not distorted by the passage of time and remains comparable when analysing past seasons. It should be emphasised that this is only an approximation and not a perfect representation.
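
As a minimal sketch of this target, the conversion between a salary and its percentage of the cap could look like the following. The function names and the `CAP_BY_YEAR` dictionary are illustrative assumptions; the only figure taken from this project is the 2024 cap of 142.00 (in millions) that appears in the source code at the end of the post.

```python
# Salary cap in millions of dollars. Only the 2024 value comes from the
# source code of this post; other years would have to be filled in.
CAP_BY_YEAR = {2024: 142.00}


def pct_of_cap(salary_m: float, year: int) -> float:
    """Express a salary (in $M) as a percentage of that year's cap."""
    return salary_m / CAP_BY_YEAR[year] * 100


def salary_from_pct(pct: float, year: int) -> float:
    """Convert a predicted percentage of the cap back into $M."""
    return pct / 100 * CAP_BY_YEAR[year]


target = pct_of_cap(17.0, 2024)  # a $17M salary in the 2023-24 season
print(round(target, 2))
print(round(salary_from_pct(target, 2024), 2))
```

A model trained on this percentage can be compared across seasons, and its predictions are converted back into dollars only at the end.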

Data

The objective of this research is to forecast the earnings of players who sign new contracts for the following season using only data from the current season.

The specific stats used were:
• Average stats per game
• Total stats
• Advanced stats
• Individual attributes: age, position
• Salary-related variables: salary from the previous season, the salary cap for the previous and current seasons, and the previous season's salary as a percentage of the cap.

Only player-level features were used, because the team a player will sign with is not known in advance.

In total, the dataset had 78 columns per player, features and target combined.

Most of the data was collected with BRScraper, a Python package I recently developed that gives easy access to basketball data from Hoops Reference, covering the NBAA, the G League, and other international leagues. All guidelines about not harming the website or impairing its performance were followed.

Data Treatment

The choice of players used to train the models is an interesting point to consider. Initially, I used every available player, but because most of them were already under contract, their salary did not change much from one season to the next.

Consider a player who signs a four-year, $20M contract. He earns about $5M per year (the yearly amounts are rarely identical; usually there is some progression around $5M). When a free agent signs a new contract, however, the value can change far more drastically.
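
The arithmetic behind that example can be sketched as follows. The 5% raise is a hypothetical figure, and the code assumes each raise is a fixed percentage of the first-year salary; neither comes from the data used in this project.

```python
def yearly_salaries(total_m: float, years: int, raise_pct: float) -> list[float]:
    """Split a contract total (in $M) into yearly salaries.

    Assumes each year's raise is a fixed percentage of the FIRST-year
    salary, so year i pays first * (1 + raise_pct / 100 * i).
    """
    r = raise_pct / 100
    # The multipliers (1 + r * i) for i = 0..years-1 add up to
    # years + r * years * (years - 1) / 2.
    first = total_m / (years + r * years * (years - 1) / 2)
    return [first * (1 + r * i) for i in range(years)]


flat = yearly_salaries(20.0, 4, 0.0)    # four identical years
rising = yearly_salaries(20.0, 4, 5.0)  # hypothetical 5% annual raises
print([round(s, 2) for s in flat])
print([round(s, 2) for s in rising])
```

In both cases the salary in any given year stays close to the previous one, which is why players under contract add little information about how new deals are priced.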

Training a model on all players might produce better overall metrics (after all, most players would have salaries very close to the previous season's!), but its performance would be noticeably worse when evaluated only on free agents.

Since the objective is to estimate the salary of a player signing a new contract, only players of this kind should be included in the data. This helps the model learn the patterns that are specific to these players.
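
One way to restrict the data to free agents is an inner join between the stats table and a list of free agents, keyed on player and season. The sketch below uses made-up rows; the column names `user` and `Season S-1` follow the source code at the end of the post, but the join itself is an assumption about how the filtering could be done.

```python
import pandas as pd

# Toy stand-ins for the real tables; names and numbers are made up.
stats = pd.DataFrame({
    "user": ["Player A", "Player B", "Player C", "Player A"],
    "Season S-1": ["2022-23", "2022-23", "2022-23", "2021-22"],
    "PTS": [20.1, 8.4, 12.7, 18.3],
})
free_agents = pd.DataFrame({
    "user": ["Player A", "Player C"],
    "Season S-1": ["2022-23", "2022-23"],
})

# The inner join keeps only the rows where the player was a free agent
# after that season; everyone still under contract is dropped.
fa_stats = stats.merge(free_agents, on=["user", "Season S-1"], how="inner")
print(fa_stats)
```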

The 2023-24 season is the one of interest, but data from 2020-21 onwards is used to increase the number of samples. This is possible because the target is expressed as a percentage of the cap.

Modeling

The train-test split was designed so that the test set contains only free agents from 2023-24, all of them, while keeping the overall proportion at roughly 70/30.
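
A season-based split like this could be sketched as below. The helper name `split_by_season` and the rows are hypothetical; note that in the source code a contract signed for 2023-24 is keyed by `Season S-1` equal to `2022-23`, the season the stats come from.

```python
import pandas as pd


def split_by_season(df: pd.DataFrame, test_season: str):
    """Put every row from `test_season` in the test set, the rest in train."""
    is_test = df["Season S-1"] == test_season
    train = df[~is_test].reset_index(drop=True)
    test = df[is_test].reset_index(drop=True)
    return train, test


# Made-up rows: 7 free agents from earlier seasons, 3 from the latest one.
df = pd.DataFrame({
    "user": [f"Player {i}" for i in range(10)],
    "Season S-1": ["2020-21"] * 3 + ["2021-22"] * 4 + ["2022-23"] * 3,
})
train, test = split_by_season(df, "2022-23")
print(len(train), len(test))
```

Splitting by season rather than at random keeps the evaluation honest: the model never sees a contract from the season it is asked to predict.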

At first, a number of regression models were applied:

• Support Vector Machines (SVM)
• Elastic Net
• Random Forest
• AdaBoost
• Gradient Boosting
• Light Gradient Boosting Machine (LGBM)

Each model's performance was assessed with the root mean square error (RMSE) and the coefficient of determination (R2).
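
The evaluation loop could look roughly like the sketch below. It runs on synthetic data rather than the player dataset, covers only three of the models, and uses illustrative hyperparameters, so the numbers it prints say nothing about the real results.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic data standing in for the player features and the % of cap target.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.5, size=300)
X_train, X_test = X[:210], X[210:]  # 70/30 split
y_train, y_test = y[:210], y[210:]

candidates = {
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "Gradient Boosting": GradientBoostingRegressor(random_state=0),
    "Elastic Net": ElasticNet(alpha=0.1),
}

results = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    rmse = float(np.sqrt(mean_squared_error(y_test, pred)))  # root of the MSE
    results[name] = {"RMSE": rmse, "R2": float(r2_score(y_test, pred))}

# Lowest RMSE first
for name, scores in sorted(results.items(), key=lambda kv: kv[1]["RMSE"]):
    print(f"{name}: RMSE={scores['RMSE']:.3f}, R2={scores['R2']:.3f}")
```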

Results

The following outcomes were found after taking into account the entire dataset for all seasons:

Overall, the models performed well. Random Forest and Gradient Boosting obtained the lowest RMSE and the highest R2, while AdaBoost had the worst metrics of the models used.

Analysis of Variables

SHAP values, a method that gives a principled explanation of how each feature affects a model's predictions, make it possible to visualise the factors that matter most to the model.

A more in-depth description of SHAP and how to read its chart can be found in Predicting the NBAA MVP using Machine Learning.
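
To give an intuition for what these values measure, the sketch below computes exact Shapley values by brute force for a tiny hand-written model. It is not the `shap` library and not the model from this project: `toy_model`, its coefficients, and the two example players are all made up for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of `predict` at `x`, relative to `baseline`.

    Features outside a coalition are set to their baseline value. The cost
    grows as 2**n, so this only works for a handful of features.
    """
    n = len(x)

    def value(coalition):
        mixed = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(mixed)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley average
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


# Hypothetical toy model with an interaction between minutes and points.
def toy_model(features):
    minutes, points, age = features
    return 0.2 * minutes + 0.5 * points + 0.01 * minutes * points - 0.1 * age


x = [34.0, 25.0, 27.0]         # the player being explained
baseline = [20.0, 10.0, 27.0]  # a reference "average" player
phi = shapley_values(toy_model, x, baseline)
print([round(p, 3) for p in phi])
print(round(sum(phi), 3), round(toy_model(x) - toy_model(baseline), 3))
```

The contributions add up to the gap between the player's prediction and the baseline prediction, and a feature that equals its baseline value (age here) gets a contribution of zero. The `shap` library estimates the same quantities efficiently for real models.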

Several significant inferences may be made from this graph:

The three most important metrics are minutes per game (MP), points per game (PTS), and overall.
The previous season's salary (Salary S-1) and that salary as a percentage of the cap (% of Cap S-1) both have a significant influence, coming in at #4 and #5, respectively.
Advanced statistics are scarce among the top features: only two, WS (Win Shares) and VORP (Value Over Replacement Player), make the list.

This is unexpected, considering that most advanced statistics were created specifically to improve the evaluation of player performance. A notable omission from the top 20 is the Player Efficiency Rating (PER), which appears only in 43rd place.

This suggests that general managers may follow a relatively simple approach when negotiating salaries, sometimes overlooking the wider range of performance metrics.

Perhaps the issue is not as complicated as first thought! Simply put, whoever logs the most minutes and scores the most points earns more!

Further Results

Focusing on this year's free agents and comparing their predicted salaries with their actual ones:

Principal findings from the Random Forest model for the 2023-2024 season (values in millions).

The five most undervalued players (earning less than the model says they should) are at the top, five appropriately valued players are in the middle, and the five most overvalued players (earning more than the model says they should) are at the bottom. It is crucial to remember that these evaluations depend entirely on the model's results.

Starting at the top, Russell Westbrook, a former MVP who just signed a $4M/year contract with the Clippers, is the most undervalued player according to the model. Malik Beasley, Eric Gordon, and Mason Plumlee are in a similar position, also on low salaries. Despite earning $17M annually, D'Angelo Russell also ranks in the top five, suggesting that he ought to be paid even more.

Interestingly, these players all signed with contending teams (the Clippers, Suns, Bucks, and Lakers). It is common for players to accept less money in order to play for a team with a chance of winning the championship.

In the middle, Taurean Prince, Orlando Robinson, Kevin Knox, and Derrick Rose all earn modest salaries that appear to be fair. Caris LeVert earns $15 million a year, and he also seems to be worth that much.

At the bottom, Fred VanVleet was rated the most overvalued player.

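
A comparison of this kind can be produced by ranking players on the difference between predicted and actual salary. The sketch below uses made-up players and numbers, not the model's real output, and the column names are illustrative.

```python
import pandas as pd

# Made-up predictions and salaries in $M; not the model's real output.
results = pd.DataFrame({
    "user": ["Player A", "Player B", "Player C", "Player D"],
    "predicted": [12.0, 3.5, 20.0, 8.0],
    "actual": [4.0, 3.4, 30.0, 8.5],
})

# Positive difference: the model expects more than the player is paid.
results["difference"] = results["predicted"] - results["actual"]
ranked = results.sort_values("difference", ascending=False).reset_index(drop=True)

most_undervalued = ranked.head(1)
most_overvalued = ranked.tail(1)
# The fairest valuation is the difference closest to zero.
fairly_valued = ranked.loc[ranked["difference"].abs().nsmallest(1).index]
print(ranked)
```
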
Predicting outcomes in sports is frequently difficult. This project turned out to be more complicated than anticipated, from the choice of target to the selection of players.

There are undoubtedly many ways to improve the results. One is to use feature selection or dimensionality reduction methods to shrink the feature space and, with it, the variance.

Access to free agents from earlier seasons would also increase the number of samples. However, this data does not appear to be publicly available at the moment.

Numerous outside factors also play a role. For instance, information about the team, such as the previous year's seed, playoff result, and percentage of the cap used, could be very helpful. That said, keeping the approach that mirrors a real free agency scenario, in which the team is not known in advance, may produce a result closer to the player's "real value", independent of the signing team's circumstances.

One of the project's main premises was to use only data from the previous season to forecast the upcoming salary. Since a player's earlier performance can carry useful information, using data from more seasons may well give better results. Such data would be far larger, however, so careful feature selection would be needed to manage its complexity and high dimensionality.
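
As one possible form of the dimensionality reduction mentioned above, the sketch below standardises a set of correlated features and keeps only the principal components needed to explain 95% of the variance. The data is synthetic, and PCA with a 95% threshold is an illustrative choice, not something that was tested in this project.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, highly correlated features: 10 columns driven by 3 hidden factors.
rng = np.random.default_rng(0)
factors = rng.normal(size=(200, 3))
X = factors @ rng.normal(size=(3, 10)) + rng.normal(scale=0.01, size=(200, 10))

# Standardise, then keep as many components as needed for 95% of the variance.
reducer = make_pipeline(StandardScaler(), PCA(n_components=0.95))
X_reduced = reducer.fit_transform(X)

pca = reducer.named_steps["pca"]
print(X.shape, "->", X_reduced.shape)
print(pca.explained_variance_ratio_.round(3))
```

Per-game, total, and advanced stats overlap heavily, so a reduction along these lines could be tried before refitting the models.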

Source code for the application (Predicting Salaries with Machine Learning)

from BRScraper import nbaa
import pandas as pdd
import numpy as npp
import seaborn as sns
import matplotlib.pyplot as plt
import shap
import os
import pickle
import warnings
warnings.filterwarnings('ignore')
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import ElasticNet
from sklearn.ensemble import AdaBoostRegressor
from sklearn.ensemble import GradientBoostingRegressor
from lightgbm import LGBMRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score
# Show every column and two decimal places when displaying DataFrames
pdd.set_option('display.max_columns', None)
pdd.set_option('display.float_format', lambda x: '%.2f' % x)
models = ['SVM', 'Random Forest', 'Elastic Net', 'AdaBoost', 'Gradient Boosting', 'LGBM']
path__data = r'C:\Gabriel\aFacul\Programacao\Python\NBAA\Salários'
sep = r'/'
# Seasons to predict salaries for
desired__seasons = ['2023-24', '2022-23', '2021-22']
# Salary cap for 2024, in millions of dollars
data__cap = {'Year': [2024],
            'Cap Maximum': [142.00]}
cap = pdd.DataFrame(data__cap)
# Historical caps from Spotrac, converted from '$123,456,789' strings to millions
cap1 = pdd.read_html('https://www.spotrac.com/nbaa/cba/')[0]
cap1 = cap1[['Year', 'Cap Maximum']].copy()
cap1['Cap Maximum'] = cap1['Cap Maximum'].str.replace('$', '', regex=False).str.replace(',', '', regex=False).astype(float) / 1000000
cap = pdd.concat([cap, cap1], ignore_index=True)
cap = cap.astype({'Year': 'int32'})
cap.head()
def get__data__salary(desired__seasons):
    # Players' current salary info
    salary = nbaa.get_current_salaries()
    salary = salary.drop_duplicates(subset='user').reset_index(drop=True)
    # Normalizing names: drop dots and strip accents
    salary['user'] = salary['user'].str.replace('.', '', regex=False)
    salary['user'] = salary['user'].str.normalize('NFKD').str.encode('ascii', errors='ignore').str.decode('utf-8')
    for desired__season in desired__seasons:
        # Previous season, e.g. '2023-24' -> start '2022', end '23'
        start = str(int(desired__season[:4]) - 1)
        end = str(int(desired__season[5:7]) - 1)
        url = 'https://hoopshype.com/salaries/players/' + start + '-20' + end

        last__season = pdd.read_html(url)[0]
        # Drop the index column and the extra '(*)' salary column
        last__season = last__season.drop(['Unnamed: 0', start + '/' + end + '(*)'], axis=1)
        last__season = last__season.rename(columns={start + '/' + end: start + '-' + end})

        # Players with a current salary who are not in last season's table
        salary1 = salary[~salary['user'].isin(last__season['user'])]

        last__season['user'] = last__season['user'].str.replace('.', '', regex=False)
        last__season['user'] = last__season['user'].str.normalize('NFKD').str.encode('ascii', errors='ignore').str.decode('utf-8')

        last__season = last__season.drop_duplicates(subset='user', keep='first').reset_index(drop=True)

        salary = last__season.merge(salary, on='user', how='left', validate='1:1')

        salary = pdd.concat([salary, salary1], ignore_index=True)

        salary = salary.drop_duplicates(subset='user', keep='first').reset_index(drop=True)
    salary = salary.drop(columns=['Tm'])
    for col in salary.columns:
        if col not in ['user']:
            # '$1,234,567' -> 1.234567 (millions); missing salaries become -1
            salary[col] = salary[col].str.replace('$', '', regex=False).str.replace(',', '', regex=False).astype(float)
            salary[col] = salary[col].fillna(-1000000)
            salary[col] = salary[col] / 1000000

    return salary
def get__individual__stats(desired__seasons):
    df__stats = pdd.DataFrame()

    for desired__season in desired__seasons:
        # Stats for S-1
        year = int(desired__season[:4])
        per__game = nbaa.get_stats(year, info='per_game', rename=True)
        totals = nbaa.get_stats(year, info='totals', rename=True)
        avancados = nbaa.get_stats(year, info='advanced', rename=True)  # advanced stats

        # Dropping repeated variables
        totals = totals.drop(['Pos', 'Age', 'G', 'GS', 'Season'], axis=1).reset_index(drop=True)
        avancados = avancados.drop(['Pos', 'Age', 'G', 'MP_advanced', 'Season'], axis=1).reset_index(drop=True)

        cols = ['user', 'Season', 'Pos', 'Age', 'Tm', 'G', 'GS']

        # Defining variable types: every column outside `cols` is numeric
        for frame in (per__game, totals, avancados):
            for coluna in frame.columns:
                if coluna not in cols:
                    frame[coluna] = frame[coluna].astype(float)

        # `times` (teams): the team on each player's last row of the season
        times = per__game.drop_duplicates(subset=['user'], keep='last').reset_index(drop=True)
        times = times['Tm']

        # Keep each player's first row and attach the team found above
        per__game = per__game.drop_duplicates(subset=['user'], keep='first').reset_index(drop=True)
        totals = totals.drop_duplicates(subset=['user'], keep='first').reset_index(drop=True)
        avancados = avancados.drop_duplicates(subset=['user'], keep='first').reset_index(drop=True)

        per__game['Tm'] = times
        totals['Tm'] = times
        avancados['Tm'] = times

        # Merging the bases
        stats = per__game.merge(avancados, on=['user', 'Tm'], how='left', validate='1:1')
        stats = stats.merge(totals, on=['user', 'Tm'], how='left', validate='1:1').fillna(0)
        stats = stats.astype({'G': int, 'GS': int, 'Age': int})

        # Normalizing names
        stats['user'] = stats['user'].str.replace('.', '', regex=False)
        stats['user'] = stats['user'].str.normalize('NFKD').str.encode('ascii', errors='ignore').str.decode('utf-8')
        # Trying to match Jrs: drop the suffix except for players listed with it
        jrs = ['Jaren Jackson Jr', 'Tim Hardaway Jr', 'Gary Trent Jr', 'Larry Nance Jr',
               'Duane Washington Jr', 'Scottie Pippen Jr', 'Vince Williams Jr', 'Ron Harper Jr']
        stats['user'] = stats['user'].apply(
            lambda x: x.replace(' Jr', '') if x.endswith(' Jr') and x not in jrs else x)
        df__stats = pdd.concat([df__stats, stats], ignore_index=True)

    df__stats = df__stats.rename(columns={'Season': 'Season S-1'})

    return df__stats
def select__data(desired__seasons, stats, salary, cap):

    df__f = pdd.DataFrame()

    for desired__season in desired__seasons:
        # Label of the previous season, e.g. '2023-24' -> '2022-23'
        previous = str(int(desired__season[:4]) - 1) + '-' + str(int(desired__season[5:7]) - 1)

        # Selecting only salaries of S and S-1
        salary1 = salary[['user', desired__season, previous]]

        stats1 = stats[stats['Season S-1'] == previous]

        df = stats1.merge(salary1, on='user', how='left', validate='1:1')

        df = df.rename(columns={previous: 'Salary S-1', desired__season: 'Salary S'})

        # Removing null values (-1 marks a missing salary)
        df = df[(df['Salary S'] != -1) & (df['Salary S'].notna())].copy()

        df['Year S'] = ('20' + (df['Season S-1'].str[5:7].astype(int) + 1).astype(str)).astype(int)
        df['Year S-1'] = ('20' + df['Season S-1'].str[5:7].astype(str)).astype(int)

        cap2 = cap.rename(columns={'Cap Maximum': 'Cap Maximum S', 'Year': 'Year S'})
        df = df.merge(cap2, how='left', on='Year S', validate='m:1')

        cap2 = cap2.rename(columns={'Cap Maximum S': 'Cap Maximum S-1', 'Year S': 'Year S-1'})
        df = df.merge(cap2, how='left', on='Year S-1', validate='m:1')

        # Salaries as a percentage of the cap; negative values keep the -1 marker
        df['% of Cap S-1'] = df['Salary S-1'] / df['Cap Maximum S-1'] * 100
        df.loc[df['% of Cap S-1'] < 0, '% of Cap S-1'] = -1

        df['% of Cap S'] = df['Salary S'] / df['Cap Maximum S'] * 100
        df.loc[df['% of Cap S'] < 0, '% of Cap S'] = -1

        # Get dummies for positions; combined labels such as 'PG-SG' count as the first one
        pos = pdd.get_dummies(df['Pos']).astype(int)
        for col in pos.columns:
            if len(col) > 3:
                pos = pos.rename(columns={col: col.split('-')[0]})
        pos = pos[['PG', 'SG', 'SF', 'PF', 'C']]
        # Sum columns that now share a name (transposed, since groupby on axis=1 is deprecated)
        pos = pos.T.groupby(level=0).sum().T
        df = df.join(pos)
        df = df.drop(columns=['Pos', 'Year S', 'Year S-1']).reset_index(drop=True)
        df__f = pdd.concat([df__f, df], ignore_index=True)
    df__f = df__f.fillna(0)
    return df__f
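# Build the salary and stats tables, then combine them into the final dataset
salary = get__data__salary(desired__seasons)
stats = get__individual__stats(desired__seasons)
df = select__data(desired__seasons, stats, salary, cap)
df.to_csv(path__data + sep + 'final_data.csv', sep=',', decimal='.', index=False)
df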
def get__FAs(desired__seasons):

    # Get free agents
    FAs = pdd.DataFrame()

    for desired__season in desired__seasons:
        FAs1 = pdd.read_html('https://www.spotrac.com/nbaa/free-agents/' + desired__season[:4])[0]
        FAs1 = FAs1.iloc[:, 0]
        FAs2 = pdd.read_html('https://www.spotrac.com/nbaa/free-agents/')[1]
        FAs2 = FAs2.iloc[:, 0]

        FA = pdd.concat([FAs1, FAs2], ignore_index=True)
        # Name the column 'user' so it matches the key used below and in the other tables
        FA = FA.to_frame(name='user')

        FA['Season S-1'] = str(int(desired__season[:4]) - 1) + '-' + str(int(desired__season[5:7]) - 1)

        FAs = pdd.concat([FA, FAs], ignore_index=True)

    # Normalizing names
    FAs['user'] = FAs['user'].str.replace('.', '', regex=False)
    FAs['user'] = FAs['user'].str.normalize('NFKD').str.encode('ascii', errors='ignore').str.decode('utf-8')

    return FAs
sns.set(style='dark')
sns.set_theme(rc={'figure.dpi': 200}, font_scale=0.6)
evol = sns.barplot(data=cap, x='Year', y='Cap Maximum', color='darkblue').set(title='NBAA Salary Cap Evolution')
# Rotate the year labels after the bars are drawn so the rotation applies to this plot
plt.xticks(rotation=90)
plt.ylabel('Cap Maximum (M)')
plt.savefig(path__data + sep + 'salary_cap.jpg', dpi=300)

Output: