Injury Prediction in Competitive Runners Using Machine Learning

Running is a popular sport worldwide, with a large number of individuals participating in activities such as jogging, running, or trail running. In the United States alone, around 60 million people engaged in these activities in 2017. However, a worrisome statistic reveals that about half of the runners experience injuries annually. Dealing with these injuries can be a difficult and time-consuming process, prompting runners to adopt various strategies to minimize the chances of getting injured. Some of these preventive measures include using rollers, getting massages, and seeking guidance from professional coaches. Unfortunately, not everyone can afford these resources, making injury prevention a significant concern for many runners.

In response to this concern, experts and data scientists are increasingly embracing the capabilities of machine learning to forecast and avert injuries in competitive runners. By harnessing sophisticated algorithms and extensive datasets, machine learning models possess the ability to transform injury prevention methodologies, ultimately elevating the overall welfare of athletes.

Benefits of Injury Prediction in Competitive Runners Using Machine Learning

Machine learning can analyze extensive datasets, including injury records, training patterns, biometrics, and environmental factors, to uncover hidden patterns and correlations, aiding in more informed decision-making by coaches and athletes.
These models can continuously learn from new data, improving their accuracy and effectiveness over time and adapting to changing injury risk factors as an athlete's career progresses.
Machine learning can provide real-time injury risk assessments, allowing immediate adjustments to training loads and practices and reducing the chance of injury during intense training or competition.

Challenges of Injury Prediction in Competitive Runners Using Machine Learning

While the potential of machine learning in injury prediction is promising, there are also significant challenges that need to be addressed:

Obtaining high-quality, diverse, and reliable data is a major challenge as machine learning models heavily rely on data for training. Biased, incomplete, or unrepresentative data may result in inaccurate predictions.
The complexity of human physiology and the multifactorial nature of injuries present challenges in creating models that can comprehensively capture all contributing factors. Identifying and considering the interactions between various risk factors is crucial for model effectiveness.
Interpreting the outputs of machine learning models is challenging. While they can provide accurate predictions, understanding the underlying reasons for these predictions can be difficult. Transparent and interpretable models are vital for gaining trust and acceptance from coaches, athletes, and sports medicine professionals.

Machine Learning Injury Prediction Using Python

About the Dataset

The dataset comprises a comprehensive training log from a renowned Dutch high-level running team spanning seven years (2012-2019). It includes middle and long-distance runners competing in races between 800 meters and the marathon. This choice is justified by their comparable endurance-based training regimes. Notably, the team's head coach remained consistent throughout the data collection period.

There are records from 74 runners in the dataset, with 27 women and 47 men. On average, the athletes had been part of the team for about 3.7 years. The majority of runners competed at the national level, while some participated in international competitions. The study strictly adhered to the Declaration of Helsinki guidelines and received approval from the ethics committee of the second author's institution.

Here, we try to predict whether a runner gets injured and look for the model that predicts injuries best.

Importing Libraries

import numpy as np # linear algebra
import pandas as pd # data processing, .csv file I/O 
import sklearn as sk
import seaborn as sns 
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn import datasets, metrics

Reading the Dataset

dataframe = pd.read_csv("/kaggle/input/injury-prediction-for-competitive-runners/week_approach_maskedID_timeseries.csv")
np.random.seed(0)

Exploratory Data Analysis (EDA)

We'll begin with basic data exploration to determine if the data set contains any anomalies or inconsistencies.

missing_Values = dataframe.isnull().sum()
missing_Values 

Output:

The dataset is high-dimensional, which makes it hard to know where to begin. To simplify the analysis, we dropped the attributes that describe how the runner feels (perceived exertion, recovery, and training success), because we wanted to predict injuries from quantitative training data alone. Recovery-related attributes could carry useful signal, but survey answers are a subjective and potentially unreliable gauge of a runner's physical condition. The code below also drops 'max km one day' and the three relative weekly-distance columns ('rel total kms week ...').

dataframe.info()
dataframe = dataframe.drop(['avg training success.2', 'max training success.2', 'min training success.2', 'avg exertion', 'min exertion', 'max exertion'], axis = 1)
dataframe = dataframe.drop(['avg exertion.1', 'min exertion.1', 'max exertion.1', 'avg exertion.2', 'min exertion.2', 'max exertion.2', 'max km one day'], axis = 1)
dataframe = dataframe.drop(['avg recovery', 'min recovery', 'max recovery', 'avg recovery.1', 'min recovery.1', 'max recovery.1', 'avg recovery.2', 'min recovery.2', 'max recovery.2'], axis = 1)
dataframe = dataframe.drop(['avg training success', 'min training success', 'max training success', 'avg training success.1', 'max training success.1', 'min training success.1'], axis = 1)
dataframe = dataframe.drop(['rel total kms week 0_1', 'rel total kms week 0_2', 'rel total kms week 1_2'], axis = 1)
dataframe.info()
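
The same pruning can also be expressed by keyword instead of listing every column. This is a minimal sketch, assuming the subjective columns can be recognized by the substrings 'exertion', 'recovery', and 'training success'; the helper name drop_subjective and the toy frame are illustrative and not part of the original notebook.

```python
import pandas as pd

# Substrings assumed to mark self-reported (subjective) columns.
SUBJECTIVE_KEYWORDS = ("exertion", "recovery", "training success")

def drop_subjective(df, keywords=SUBJECTIVE_KEYWORDS):
    """Return a copy of df without columns whose name contains a keyword."""
    to_drop = [c for c in df.columns if any(k in c for k in keywords)]
    return df.drop(columns=to_drop)

# Toy frame that mimics the naming pattern of the real file.
toy = pd.DataFrame({
    "total kms": [40.0, 55.5],
    "avg exertion": [0.3, 0.4],
    "max recovery.1": [0.2, 0.1],
    "min training success.2": [0.5, 0.6],
    "injury": [0, 1],
})
print(drop_subjective(toy).columns.tolist())
```

This helper would not remove 'max km one day' or the 'rel total kms week' columns, which the code above drops explicitly.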

Output:

After careful analysis, we have successfully reduced the number of attributes in the dataset from 72 to 41. While this progress is promising, the dataset still remains high-dimensional, posing a challenge for further analysis and modeling.

Output:

Altogether, there are 74 athletes in the dataset. Now, let's focus on examining the training data of the first athlete to gain insights into their training patterns.

def indexIndividualData(id):
  df_0 = dataframe[dataframe['Athlete ID'] == id]
  index1 = df_0.index[0]
  indexLast = df_0.index[-1]
  y = indexLast - len(df_0[df_0['injury']==0]) - len(df_0[df_0['injury']==1])
  df_0 = df_0.rename(index = lambda x: x - y - 1 if x > indexLast - len(df_0[df_0['injury']==1]) else x - index1)
  df_0 = df_0.sort_values(by = 'Date')
  return df_0
def plotIndividualData(id, column):
  df_0 = indexIndividualData(id)
  plt.figure(figsize = (14,6))
  sns.lineplot(data=df_0[column])


plotIndividualData(1, "total kms")

Output:

This graph raises some intriguing questions. We wonder why there is a significant decline in training for such an extended period for this particular athlete. If each data point represents a week, being injured for over one hundred weeks seems unusual. Alternatively, if the points indicate days, one hundred days of injury is still quite substantial and merits further investigation.

The attributes with '.1', '.2', and '.3' in the data set are perplexing. We find it challenging to comprehend their significance concerning the dates attribute, even after referring to the Metadata file. The meaning and relevance of these attributes remain unclear and warrant further clarification.
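
One possible explanation, offered as an assumption rather than something the metadata confirms, is that the suffixes come from pandas itself. When a CSV header repeats a column name, read_csv keeps the first occurrence and renames the later ones by appending '.1', '.2', and so on. If the file repeats each attribute once per week of the window, the suffix would simply index the week. The tiny CSV below is made up for illustration.

```python
import io
import pandas as pd

# A tiny CSV whose header repeats the same name three times,
# as a file with one block of columns per week might.
csv_text = "total kms,total kms,total kms,injury\n40,35,50,0\n"

df = pd.read_csv(io.StringIO(csv_text))
print(df.columns.tolist())
```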

def dateInjurySubset(id, column1, column2, column3): 
  df_0 = indexIndividualData(id)
  return df_0[[column1,column2,column3]]



def plotIndividualDataDuoColumn(id, column1, column2, column3):
  df_0 = dateInjurySubset(id,column1,column2,column3)
  plt.figure(figsize = (14,6))
  sns.lineplot(data=df_0) 

print(dateInjurySubset(1,"total kms", "total kms.1","total kms.2").mean())
plotIndividualDataDuoColumn(1,"total kms", "total kms.1", "total kms.2")
print(dateInjurySubset(1,"nr. sessions", "nr. sessions.1","nr. sessions.2").mean())
plotIndividualDataDuoColumn(1,"nr. sessions", "nr. sessions.1","nr. sessions.2")
print(dateInjurySubset(2,"total kms", "total kms.1","total kms.2").mean())
plotIndividualDataDuoColumn(2,"total kms", "total kms.1", "total kms.2")
print(dateInjurySubset(2,"nr. sessions", "nr. sessions.1","nr. sessions.2").mean())
plotIndividualDataDuoColumn(2,"nr. sessions", "nr. sessions.1","nr. sessions.2") 

Output:

dfQ2 = dataframe[['Athlete ID', 'total km Z3-Z4-Z5-T1-T2', 'total km Z3-Z4-Z5-T1-T2.1', 'total km Z3-Z4-Z5-T1-T2.2', 'injury', 'nr. tough sessions (effort in Z5, T1 or T2)', 'nr. tough sessions (effort in Z5, T1 or T2).1', 'nr. tough sessions (effort in Z5, T1 or T2).2', 'total km Z5-T1-T2', 'total km Z5-T1-T2.1', 'total km Z5-T1-T2.2', 'total km Z3-4', 'total km Z3-4.1', 'total km Z3-4.2']]
dfArray = []
for i in dataframe['Athlete ID'].unique():
  dfArray.append(indexIndividualData(i))

The 'dates' attribute in this dataset is also puzzling. To gain a clearer understanding of its significance, we will attempt to visualize it and explore its patterns.

df_0 = dfArray[1]
injury = df_0[df_0['injury'] == 1]
notInjured = df_0[df_0['injury'] == 0]
print("INJURED DATES ID 1")
for i in injury['Date']:
  print(i)
print("NOT INJURED DATES ID 1:\n")
for i in notInjured['Date']:
  print(i)

Output:

The sudden jump in the 'dates' attribute from 400 to 700 for an athlete is perplexing and raises questions. To make accurate predictions and analyze the data effectively, we need consecutive and sequential dates. Despite some uncertainty and potential missing insights, we will still attempt to classify the running data. Moreover, our data exploration revealed a significant bias towards non-injured data points in the dataset. This imbalance could impact the model's performance and predictions.
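
Jumps like the one described above can be located programmatically. The sketch below assumes the 'Date' values are integer day indices and treats any step larger than seven days as a gap; the function name find_date_gaps, the threshold, and the example values are all hypothetical.

```python
import pandas as pd

def find_date_gaps(dates, max_step=7):
    """Return (previous, next) pairs where consecutive dates differ by more than max_step."""
    s = pd.Series(sorted(dates))
    step = s.diff()
    gap_positions = step[step > max_step].index
    return [(int(s[i - 1]), int(s[i])) for i in gap_positions]

# Hypothetical day indices with one large jump, similar to the 400 -> 700 jump.
example_dates = [390, 395, 400, 700, 705]
print(find_date_gaps(example_dates))
```

On the real data, the call would be find_date_gaps(df_0['Date']).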

plt.figure(figsize=(8, 8))
sns.countplot(x=dataframe["injury"])
plt.title('Unbalanced Classes')
plt.show()

Output:

In this visualization, it becomes evident how heavily skewed the dataset is towards non-injured cases.
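
The skew can also be stated as a number instead of read off the chart. A minimal sketch, using made-up labels; the helper name class_balance is illustrative.

```python
import pandas as pd

def class_balance(labels):
    """Return the share of each class label as a dict, e.g. {0: 0.9, 1: 0.1}."""
    shares = pd.Series(labels).value_counts(normalize=True)
    return {int(k): float(v) for k, v in shares.items()}

# Hypothetical labels: 9 non-injured rows for every injured row.
toy_labels = [0] * 90 + [1] * 10
print(class_balance(toy_labels))
```

On the real data, the call would be class_balance(dataframe['injury']).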

# Shuffle the full dataset.
shuffled_df_1 = dataframe.sample(frac=1,random_state=4)

# Put all rows of the injured (minority) class in a separate dataset.
injury_df_1 = shuffled_df_1.loc[shuffled_df_1['injury'] == 1]

# Randomly select 575 observations from the non-injured (majority) class.
non_injured_df_1 = shuffled_df_1.loc[shuffled_df_1['injury'] == 0].sample(n=575)

# Concatenate both dataframes again
normalized_df = pd.concat([injury_df_1, non_injured_df_1])

#plot the dataset after the undersampling
plt.figure(figsize=(8, 8))
sns.countplot(x=normalized_df['injury'])
plt.title('Balanced Classes')
plt.show()

Output:

To keep the classifiers from simply favoring the majority class, we balanced the dataset by randomly undersampling the non-injured cases. Injured and non-injured cases are now equally represented in the data used for training.

Modeling

Here, we train several machine learning classifiers and compare them by accuracy and confusion matrix.

Let's begin by demonstrating the impact of a skewed dataset. In the unbalanced dataset, the accuracy of the classifier appears to be very high because it simply identifies all data points as non-injured. However, this approach is not helpful for the objectives of this project, as it fails to accurately predict the injured cases.
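
The effect is easy to reproduce with a baseline that always predicts the majority class. The sketch below uses scikit-learn's DummyClassifier on made-up labels (980 non-injured, 20 injured); the counts are hypothetical and only illustrate why accuracy alone is misleading on skewed data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 980 non-injured (0) and 20 injured (1) rows.
y_toy = np.array([0] * 980 + [1] * 20)
X_toy = np.zeros((len(y_toy), 1))  # features are irrelevant for this baseline

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_toy, y_toy)
y_hat = baseline.predict(X_toy)

baseline_accuracy = accuracy_score(y_toy, y_hat)  # share of correct labels
injured_recall = recall_score(y_toy, y_hat)       # share of injuries detected
print(baseline_accuracy, injured_recall)
```

The baseline scores 98% accuracy here while detecting none of the injured rows.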

y = df_0['injury']
# Drop the label and the athlete identifier in a single call so that
# 'injury' does not leak into the feature matrix.
X = df_0.drop(['injury', 'Athlete ID'], axis=1)
X_train, X_test, y_train, y_test = train_test_split(
             X, y, test_size = 0.3, random_state = 0)

K = []
training = []
test = []
scores = {}
  
for k in range(2, 21):
    clf = KNeighborsClassifier(n_neighbors = k)
    clf.fit(X_train, y_train)
  
    training_score = clf.score(X_train, y_train)
    test_score = clf.score(X_test, y_test)
    K.append(k)
  
    training.append(training_score)
    test.append(test_score)
    scores[k] = [training_score, test_score]
    
for keys, values in scores.items():
    print(keys, ':', values)

Output:

# plot_confusion_matrix and plot_roc_curve were removed in scikit-learn 1.2;
# the Display classes below are the current equivalents.
metrics.ConfusionMatrixDisplay.from_estimator(clf, X_test, y_test)
metrics.RocCurveDisplay.from_estimator(clf, X_test, y_test)
plt.show()

Output:

In the confusion matrix shown above, all data points are classified as non-injured. To address this issue, we will now proceed to use the balanced dataset.

y = normalized_df['injury']
# Drop the label and the athlete identifier in a single call so that
# 'injury' does not leak into the feature matrix.
X = normalized_df.drop(['injury', 'Athlete ID'], axis=1)
X_train, X_test, y_train, y_test = train_test_split(
             X, y, test_size = 0.3, random_state = 0)

K = []
training = []
test = []
scores = {}
  
for k in range(2, 21):
    clf = KNeighborsClassifier(n_neighbors = k)
    clf.fit(X_train, y_train)
  
    training_score = clf.score(X_train, y_train)
    test_score = clf.score(X_test, y_test)
    K.append(k)
  
    training.append(training_score)
    test.append(test_score)
    scores[k] = [training_score, test_score]
    
for keys, values in scores.items():
    print(keys, ':', values)

Output:

In this analysis, we observe that as we increase the parameter k, the training accuracy declines notably (from approximately 80% to 65%), while the testing accuracy slightly improves (from around 58% to 60%). To evaluate the model's performance in handling both non-injured and injured data, we will utilize the confusion matrix.
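
The scores dictionary built in the loop can be summarized in code instead of being read by eye. A minimal sketch: the helper names best_k and train_test_gap are illustrative, and the example values are hypothetical numbers that only loosely follow the ranges quoted above.

```python
def best_k(scores):
    """Return the k with the highest test accuracy; ties go to the smallest k."""
    return max(sorted(scores), key=lambda k: scores[k][1])

def train_test_gap(scores):
    """Map each k to training accuracy minus test accuracy."""
    return {k: round(train - test, 4) for k, (train, test) in scores.items()}

# Hypothetical scores in the same {k: [training_score, test_score]} layout.
example_scores = {2: [0.80, 0.58], 12: [0.68, 0.60], 20: [0.65, 0.60]}
print(best_k(example_scores))
print(train_test_gap(example_scores))
```

A large gap between training and test accuracy at small k is the usual sign that the model is memorizing the training set.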

clf = KNeighborsClassifier(n_neighbors = 2)
clf.fit(X_train, y_train)

training_score = clf.score(X_train, y_train)
test_score = clf.score(X_test, y_test)
print("Training accuracy:", training_score)
print("Test accuracy:", test_score)

# Display classes replace the plot_* helpers removed in scikit-learn 1.2.
metrics.ConfusionMatrixDisplay.from_estimator(clf, X_test, y_test)
metrics.RocCurveDisplay.from_estimator(clf, X_test, y_test)
plt.show()

Output:

With a k value of 2, the classifier demonstrates higher accuracy in predicting non-injured data.

clf = KNeighborsClassifier(n_neighbors = 12)
clf.fit(X_train, y_train)

training_score = clf.score(X_train, y_train)
test_score = clf.score(X_test, y_test)
print("Training accuracy:", training_score)

# Accuracy on the held-out test set.
print("Accuracy:", accuracy_score(y_test, clf.predict(X_test)))

# Display classes replace the plot_* helpers removed in scikit-learn 1.2.
metrics.ConfusionMatrixDisplay.from_estimator(clf, X_test, y_test)
metrics.RocCurveDisplay.from_estimator(clf, X_test, y_test)
plt.show()

Output:

When using a high k value of 21, the injury prediction becomes more accurate. However, this also leads to a significant increase in false positives for non-injured data, meaning that the classifier mistakenly identifies more people who are not injured as injured.

After experimenting with a range of k values, we found the best trade-off at k = 12. The classifier reached an overall accuracy of about 60%, with 52% accuracy on non-injured data points and 68% on injured data points. On a balanced dataset, where guessing would score about 50%, this is only a modest improvement, so we go on to explore other binary classifiers and other balancing methods.
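
The per-class figures are row-wise rates of the confusion matrix. The sketch below shows the arithmetic on a hypothetical matrix chosen so that it reproduces the 52% and 68% figures on a balanced test set of 100 rows per class; the real counts are not given in the text.

```python
import numpy as np

def per_class_accuracy(cm):
    """Row-wise rate: correct predictions for a class / all true members of that class."""
    cm = np.asarray(cm, dtype=float)
    return cm.diagonal() / cm.sum(axis=1)

# Hypothetical confusion matrix (rows = true class, columns = predicted class).
cm = [[52, 48],   # true non-injured: 52 correct, 48 flagged as injured
      [32, 68]]   # true injured: 32 missed, 68 correct
rates = per_class_accuracy(cm)
overall = np.trace(np.asarray(cm)) / np.sum(cm)
print(rates, overall)
```

With equal class sizes, the overall accuracy is the mean of the two per-class rates, which is why 52% and 68% combine to 60%.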

The code below demonstrates the utilization of a support vector machine classifier to make injury predictions for data points. To handle the imbalanced nature of our dataset, we adopt the undersampling technique. By balancing the representation of injured and non-injured cases, we aim to improve the classifier's accuracy in making predictions for both categories.

#SVM classifier using undersampling
from imblearn.under_sampling import RandomUnderSampler
X = dataframe.drop('injury', axis = 1)
Y = dataframe['injury']
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.25, stratify = Y)
rus = RandomUnderSampler(random_state=0)
X_train, Y_train =rus.fit_resample(X_train,Y_train)
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
classifier = SVC(kernel = 'rbf', random_state = 0)
classifier.fit(X_train, Y_train)
Y_pred = classifier.predict(X_test)
print("Accuracy:",accuracy_score(Y_test, Y_pred))
print(confusion_matrix(Y_test, Y_pred))


# Plot the ROC curve for the SVM against the true test labels.
metrics.RocCurveDisplay.from_estimator(classifier, X_test, Y_test)
plt.show()

Output:

In the code below, we utilize a support vector machine classifier to make predictions about whether a data point corresponds to an injury or not. To address the issue of our imbalanced dataset, we employ the oversampling technique. This approach aims to create a more balanced representation of the data by duplicating or generating additional instances of the minority class (injured data), allowing the classifier to achieve better accuracy in predicting both injured and non-injured cases.
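
The idea behind generating new minority instances can be illustrated with a simplified interpolation scheme. This is a sketch of the general SMOTE idea, not imbalanced-learn's implementation: each synthetic point is placed at a random position on the segment between a minority sample and one of its nearest minority neighbours. The function name smote_like, the choice k=3, and the sample points are assumptions made for illustration.

```python
import numpy as np

def smote_like(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    # Pairwise distances between minority samples; a point is not its own neighbour.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X_min))    # random minority sample
        nb = rng.choice(neighbours[base])  # one of its k nearest neighbours
        lam = rng.random()                 # position along the segment
        synthetic[i] = X_min[base] + lam * (X_min[nb] - X_min[base])
    return synthetic

# Hypothetical minority-class points in two dimensions.
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [3.0, 2.5]])
new_points = smote_like(minority, n_new=5)
print(new_points.shape)
```

Because every synthetic point is a convex combination of two real points, it always falls inside the range already covered by the minority class.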

#SVM Classifier using oversampling
from imblearn.over_sampling import SMOTE
X = dataframe.drop('injury', axis = 1)
Y = dataframe['injury']
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.25, stratify = Y)
sm = SMOTE(random_state = 0)
X_train, Y_train = sm.fit_resample(X_train,Y_train)
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
classifier = SVC(kernel = 'rbf', random_state = 0)
classifier.fit(X_train, Y_train)
Y_pred = classifier.predict(X_test)
print("Accuracy:",accuracy_score(Y_test, Y_pred))


print(confusion_matrix(Y_test, Y_pred))
# Plot the ROC curve for the SVM against the true test labels.
metrics.RocCurveDisplay.from_estimator(classifier, X_test, Y_test)
plt.show()

Output:

In the code below, we implement a bagging classifier with undersampling to make predictions about whether a data point corresponds to an injury or not.

#Bagging Classifier With Undersampling
import sklearn.ensemble
X = dataframe.drop('injury', axis = 1)
Y = dataframe['injury']
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.25, stratify = Y)
rus = RandomUnderSampler(random_state=0)
X_train, Y_train =rus.fit_resample(X_train,Y_train)
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
bag = sklearn.ensemble.BaggingClassifier(n_estimators = 35)
bag.fit(X_train, Y_train)
Y_pred = bag.predict(X_test)
print("Accuracy:",accuracy_score(Y_test, Y_pred))


print(confusion_matrix(Y_test, Y_pred))
# Plot the ROC curve for the bagging model (not the earlier KNN model).
metrics.RocCurveDisplay.from_estimator(bag, X_test, Y_test)
plt.show()

Output:

In the code below, we implement a bagging classifier with oversampling to make predictions about whether a data point corresponds to an injury or not.

#Bagging Classifier With Oversampling
import sklearn.ensemble
X = dataframe.drop('injury', axis = 1)
Y = dataframe['injury']
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.25, stratify = Y)
sm = SMOTE(random_state = 0)
X_train, Y_train = sm.fit_resample(X_train,Y_train)
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
bag = sklearn.ensemble.BaggingClassifier(n_estimators = 30)
bag.fit(X_train, Y_train)
Y_pred = bag.predict(X_test)
print("Accuracy:",accuracy_score(Y_test, Y_pred))

print(confusion_matrix(Y_test, Y_pred))
# Plot the ROC curve for the bagging model (not the earlier KNN model).
metrics.RocCurveDisplay.from_estimator(bag, X_test, Y_test)
plt.show()

Output:

Here, we use an XGBoost classifier to predict whether a data point corresponds to an injury, and we use the undersampling technique to counter our imbalanced dataset.

#XGBoost model with undersampling
from xgboost import XGBClassifier
X = dataframe.drop('injury', axis = 1)
Y = dataframe['injury']
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.25, stratify = Y)
rus = RandomUnderSampler(random_state=0)
X_train, Y_train =rus.fit_resample(X_train,Y_train)
boost = XGBClassifier(max_depth = 3, n_estimators = 30)
boost.fit(X_train, Y_train)
Y_pred = boost.predict(X_test)
print("Accuracy:",accuracy_score(Y_test, Y_pred))

print(confusion_matrix(Y_test, Y_pred))
# Plot the ROC curve for the XGBoost model (not the earlier KNN model).
metrics.RocCurveDisplay.from_estimator(boost, X_test, Y_test)
plt.show()

Output:

We employ an XGBoost classifier to determine whether a data point corresponds to an injury, addressing the imbalance in our dataset with the oversampling technique.

#XGBoost model with oversampling
from xgboost import XGBClassifier
X = dataframe.drop('injury', axis = 1)
Y = dataframe['injury']
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.25, stratify = Y)
sm = SMOTE(random_state = 0)
X_train, Y_train = sm.fit_resample(X_train,Y_train)
boost = XGBClassifier(max_depth = 2, n_estimators = 30)
boost.fit(X_train, Y_train)
Y_pred = boost.predict(X_test)
print("Accuracy:",accuracy_score(Y_test, Y_pred))

print(confusion_matrix(Y_test, Y_pred))
# Plot the ROC curve for the XGBoost model (not the earlier KNN model).
metrics.RocCurveDisplay.from_estimator(boost, X_test, Y_test)
plt.show()

Output:

The model's accuracy was influenced mainly by the sampling method used. The oversampling approach performed better in classifying non-injured data points, while the undersampling method excelled in accurately identifying the injured data points.

To balance the dataset, we applied random undersampling and SMOTE oversampling, and then evaluated the resulting training sets with several binary classifiers: KNN, SVM, Bagging, and XGBoost. The highest accuracy, around 99%, was achieved with XGBoost and Bagging. Because the test sets in the SVM, Bagging, and XGBoost experiments keep the original class imbalance, accuracy alone can be misleading, and the confusion matrices should be read alongside it.

Future Aspects of Injury Prediction in Competitive Runners Using Machine Learning

Machine learning holds great promise for sports medicine and athlete care. With advancements in wearable technology and data collection methods, machine learning models can analyze comprehensive datasets, providing deeper insights into athletes' performance, biomechanics, and physiological parameters. Integrating multi-modal data, such as genetics and environmental conditions, may reveal new patterns contributing to injury risk. Longitudinal studies will be crucial in building accurate prediction models by monitoring athletes over time.

Moreover, machine learning can offer personalized injury prevention plans based on individual characteristics. By embracing these advancements, injury prediction can optimize athlete performance and foster a healthier and more successful athletic community.

Conclusion

Injury prediction in competitive runners using machine learning holds tremendous promise for transforming sports medicine and enhancing athlete well-being. With the aid of advanced algorithms and extensive datasets, machine learning models offer real-time injury risk assessments, personalized training guidance, and data-driven injury prevention strategies.

However, the full benefits of machine learning in injury prediction can be achieved by addressing challenges like data quality, model interpretability, and accounting for complex risk factor interactions. Collaboration between data scientists, sports medicine experts, coaches, and athletes will be key in overcoming these hurdles and unlocking the full potential of machine learning in competitive running.

As researchers and practitioners explore the possibilities, injury prediction through machine learning can revolutionize how competitive runners approach training, recovery, and injury prevention, ultimately leading to a healthier and more successful athletic community.
