Network Intrusion Detection System Using Machine Learning

Due to the rapid growth of the internet and communication technologies, the domain of network security has emerged as a central area of investigation. This encompasses the application of resources such as firewalls, antivirus software, and intrusion detection systems (IDS) to safeguard the security of networks and their resources within the digital expanse. Within this array of resources, network-based intrusion detection systems (NIDS) hold a critical position, as they continuously monitor network traffic to detect any potentially harmful or questionable actions.

The notion of IDS was initially introduced by James P. Anderson in 1980, paving the way for the development of various IDS products to cater to the requirements of network security. Nonetheless, the rapid advancement of technologies has brought about the expansion of networks and the management of vast volumes of data and applications, making it harder to secure network nodes and data. Current IDSs have exhibited limitations in recognizing novel attacks and reducing false alarms, which has given rise to a demand for effective and precise NIDS solutions.

To meet the demands of a robust IDS, researchers have turned to artificial intelligence, specifically machine learning (ML) and deep learning (DL) techniques. These methods have gained prominence in network security, largely due to the availability of powerful graphics processing units (GPUs). ML-based IDS relies on feature engineering to glean insights from network traffic, while DL-based IDS uses deep architectures to learn complex patterns directly from raw data.

In the past decade, researchers have proposed solutions based on ML and DL to boost the efficiency of the Network Intrusion Detection System in detecting malicious attacks. However, the escalating network traffic and mounting security threats pose challenges for NIDS to effectively pinpoint intrusions.

Application of Network Intrusion Detection System Using Machine Learning

Using Network Intrusion Detection Systems with Machine Learning has a big impact in many areas. These systems, powered by ML, help keep computer networks safe by spotting and stopping possible dangers. This makes sure that networks and important information stay secure. Here are some important ways NIDS using ML can be useful:

Anomaly Detection: Machine learning algorithms can be trained on large volumes of network traffic data to learn normal patterns and behaviors. By analyzing real-time network data, these algorithms can detect anomalies or deviations from normal behavior, which may indicate potential security threats such as intrusions or malicious activities. Anomaly detection helps identify previously unknown or zero-day attacks that traditional rule-based intrusion detection systems may miss.
Intrusion Detection and Prevention: Using machine learning, we can create models that can learn to sort network activity and recognize particular attack patterns like denial-of-service (DoS) attacks, SQL injection, malware spread, or unauthorized entry attempts. These models keep a constant watch on network behavior and can raise alarms or even take preventive actions in response to spotted attacks. For instance, they might block suspicious IP addresses or implement instant security measures.
Malware Detection: Machine learning algorithms can analyze network data, including packet payloads, to detect and classify malicious software or malware. By learning from known malware patterns and behaviors, these algorithms can identify new malware variants or previously unseen threats. Machine learning-based malware detection can enhance the efficiency and accuracy of detecting and mitigating malware infections within a network.
Threat Intelligence and Analysis: Machine learning can be applied to analyze large volumes of threat intelligence data, including security logs, vulnerability reports, and security advisories. By extracting relevant patterns and correlations from this data, machine learning algorithms can help identify emerging threats, predict attack trends, and provide actionable insights for proactive security measures. This helps organizations stay ahead of evolving threats and strengthen their overall security posture.
User and Entity Behavior Analytics (UEBA): Machine learning algorithms can analyze user behavior, such as login patterns, data access patterns, and resource usage, to detect anomalies that may indicate insider threats or compromised user accounts. UEBA systems can learn normal behavior profiles for users and entities within the network and raise alerts when deviations or suspicious activities are observed. This proactive approach to detecting insider threats helps organizations mitigate risks and prevent data breaches.
Network Traffic Analysis: Machine learning methods can analyze network traffic data to recognize connections, trends, and relationships that might signal security issues or possible weaknesses. By processing substantial amounts of real-time network data, machine learning algorithms can offer valuable information about network behavior and traffic trends, and can spot indicators of compromise (IOCs). Machine-aided network traffic analysis helps in uncovering and countering advanced persistent threats (APTs) and other intricate attacks.
Security Event Correlation: Machine learning methods are handy in connecting security events and logs originating from various sources like firewalls, intrusion detection systems, and log files. Through analyzing these linked events, machine learning models can spot intricate attack patterns, recognize organized attack sequences, and give a comprehensive outlook on security status. Using machine learning for security event correlation improves incident response effectiveness while also decreasing false alarms by pinpointing important and pertinent security incidents.
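
To make the anomaly-detection idea from the list above concrete, here is a minimal sketch using scikit-learn's IsolationForest. The two synthetic features (standing in for bytes transferred and connection duration), the contamination value, and all variable names are assumptions chosen for illustration; they are not taken from the dataset used later in this article.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Hypothetical "normal" traffic: two features clustered around typical values.
normal_traffic = rng.normal(loc=[500.0, 2.0], scale=[50.0, 0.5], size=(1000, 2))

# A few hypothetical connections far outside the normal range.
suspicious = np.array([[5000.0, 60.0], [4500.0, 45.0], [6000.0, 90.0]])

# Fit on normal traffic only, so the model learns the usual behaviour.
detector = IsolationForest(n_estimators=100, contamination=0.01, random_state=0)
detector.fit(normal_traffic)

# predict() returns 1 for inliers and -1 for anomalies;
# score_samples() is lower for more anomalous points.
labels = detector.predict(suspicious)
scores = detector.score_samples(suspicious)
print(labels)
print(scores)
```

The detector never sees a labelled attack during training, which is why this family of methods can react to traffic patterns that a signature-based rule set has no entry for.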

Challenges of Network Intrusion Detection System Using Machine Learning

Network Intrusion Detection Systems (NIDS) using Machine Learning (ML) have many applications and benefits, but they also come with challenges that need careful consideration. These challenges include:

Data Imbalance: When training intrusion detection models, an issue arises due to a notable disparity in the volume of normal traffic samples compared to malicious ones. This imbalance in the dataset can introduce bias into the models, rendering them inadequate in accurately identifying infrequent or rare attacks. Addressing this imbalance is crucial to ensure models can proficiently discern both common and unusual threats.
Dynamic Network Behavior: The perpetual evolution of network behavior poses a substantial hurdle in crafting precise intrusion detection models using machine learning. Networks exhibit continual shifts in patterns due to software updates, shifts in user behavior, and the emergence of new security threats. Constructing models that adapt to these evolving patterns, capturing legitimate actions while highlighting deviations indicative of malicious activities, presents a formidable challenge.
High-Dimensional Data: The inherent high-dimensionality of network traffic data introduces complications in terms of visualization, processing, and analysis. The sheer volume of variables contributing to network behavior poses computational challenges, potentially slowing down analysis and detection. Employing dimensionality reduction techniques becomes indispensable to streamline processing and enhance model efficiency.
Classifying New Attacks: Novel and sophisticated attacks not present in the training data pose a significant hurdle for machine learning models. These models may struggle to recognize these previously unseen threats, leading to false negatives and potential security vulnerabilities. Developing models that are adaptable and can generalize to emerging attack vectors remains a substantial challenge.
Adversarial Attacks: Attackers can manipulate network traffic to evade detection, exploiting vulnerabilities in machine learning models. Adversarial attacks necessitate ongoing model updates and robustness testing to ensure models remain effective against adversarial evasion attempts.
Model Interpretability: Many machine learning algorithms, especially deep learning models, operate as intricate "black boxes." The lack of transparency in their decision-making process presents challenges in understanding the rationale behind specific decisions. Interpreting and explaining these decisions, particularly to system administrators and security experts, proves to be a critical aspect of ensuring trust, transparency, and effective decision-making.
Privacy Concerns: Handling sensitive network data introduces privacy concerns, prompting the need for robust data anonymization and stringent security measures to safeguard sensitive information.
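
As an illustration of the Data Imbalance point above, one common mitigation is to weight classes inversely to their frequency. The sketch below uses scikit-learn's compute_class_weight on made-up labels (90 normal, 10 attack); the label counts and variable names are assumptions for the example only.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels: 90 normal connections (0) and 10 attacks (1).
y = np.array([0] * 90 + [1] * 10)

# "balanced" weights are n_samples / (n_classes * count_per_class).
classes = np.array([0, 1])
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
class_weight = dict(zip(classes.tolist(), weights.tolist()))
print(class_weight)
```

Here the rare attack class receives a weight of 5.0 and the normal class about 0.56, so each misclassified attack costs the model roughly nine times as much. Many scikit-learn classifiers accept such a dictionary, or the string "balanced", through their class_weight parameter.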

About the Dataset

The audit dataset provided comprises a diverse range of intrusions that were simulated within a military network environment. This environment was designed to replicate the conditions of a typical US Air Force LAN, capturing raw TCP/IP dump data. This involved emulating a real network setting and subjecting it to various attack simulations. In this context, a "connection" denotes a sequence of TCP packets occurring between specific time intervals, where data travels between a source and a target IP address following a defined protocol. Each of these connections is classified as either "normal" or as an "attack," with each attack being associated with a particular attack type. Every connection record encompasses approximately 100 bytes of data.

For every TCP/IP connection, a set of 41 quantitative and qualitative features is derived from both normal and attack data. These features include 3 qualitative and 38 quantitative attributes. The class variable in the dataset consists of two categories:

"Normal"
"Anomalous"

Now we will try to predict intrusions on the given dataset using various machine learning algorithms. We will also compare their accuracy to determine which works best for intrusion detection.

Importing Libraries

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from pandas.api.types import is_numeric_dtype
import warnings
import optuna
from sklearn import tree
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.tree  import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, VotingClassifier, GradientBoostingClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.naive_bayes import BernoulliNB
from lightgbm import LGBMClassifier
from sklearn.feature_selection import RFE
import itertools
from catboost import CatBoostClassifier
from xgboost import XGBClassifier
from tabulate import tabulate
import os
warnings.filterwarnings('ignore')
optuna.logging.set_verbosity(optuna.logging.WARNING)

Reading the Dataset

dataset=pd.read_csv('/kaggle/input/network-intrusion-detection/Train_data.csv')
dataset

Output:

EDA (Exploratory Data Analysis)

Exploratory Data Analysis (EDA) is a fundamental approach to analyzing data that includes methodically investigating and graphically representing datasets to extract valuable observations and trends. This process encompasses activities such as data profiling, summarizing, and visually representing information in order to grasp the spread, correlations, and features of the data. EDA seeks to pinpoint potential unusual data points, areas where data is absent, and irregularities. It also evaluates the reliability and appropriateness of the data for more advanced analysis or constructing models.
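
The kind of profiling described above can be sketched with a few pandas calls. The helper name profile and the toy DataFrame are hypothetical and only stand in for the real dataset; the column names are borrowed from it for readability.

```python
import pandas as pd

def profile(df: pd.DataFrame) -> dict:
    """Collect a few basic facts about a DataFrame."""
    return {
        "rows": df.shape[0],
        "columns": df.shape[1],
        "missing_values": int(df.isnull().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
        "categorical_columns": df.select_dtypes(exclude="number").columns.tolist(),
    }

# Hypothetical toy frame standing in for the real dataset.
toy = pd.DataFrame({
    "duration": [0, 0, 2, 0],
    "protocol_type": ["tcp", "udp", "tcp", "tcp"],
    "src_bytes": [491, 146, 0, 491],
    "class": ["normal", "normal", "anomaly", "normal"],
})
summary = profile(toy)
print(summary)
```

Calling profile(dataset) on the loaded data would report its shape together with the missing-value and duplicate counts discussed in the following subsections.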

Output:

We have 42 columns and 25192 rows in our dataset.

Output:

Missing Data

total = dataset.shape[0]
missing_columns = [col for col in dataset.columns if dataset[col].isnull().sum() > 0]
for col in missing_columns:
    null_count = dataset[col].isnull().sum()
    per = (null_count/total) * 100
    print(f"{col}: {null_count} ({round(per, 3)}%)")

The dataset contains no missing values, so the loop above prints nothing and no imputation is needed before modeling.

Duplicate Rows

Output:

Again, we don't have any duplicate rows.

Outliers

Outliers refer to data points that exhibit considerable deviation from the general pattern or trend of the remaining dataset. These data values are notably distant from the larger cluster of other values within a dataset. These outliers have the ability to influence data analysis or model outcomes, often by introducing disturbances or irregularities that do not reflect the usual characteristics of the data.

for col in dataset:
    if col != 'class' and is_numeric_dtype(dataset[col]):
        fig, ax = plt.subplots(2, 1, figsize=(12, 8))
        g1 = sns.boxplot(x = dataset[col], ax=ax[0])
        g2 = sns.scatterplot(data=dataset, x=dataset[col],y=dataset['class'], ax=ax[1])
        plt.show()

Output:

We don't have any outliers throughout the dataset.
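
Box plots are a visual check. The same 1.5 × IQR rule that box-plot whiskers conventionally use can also be applied numerically, which is easier to audit across many columns. The function name iqr_outlier_mask and the sample values below are hypothetical.

```python
import pandas as pd

def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Hypothetical column with one extreme value.
src_bytes = pd.Series([10, 12, 11, 13, 12, 11, 10, 500])
mask = iqr_outlier_mask(src_bytes)
print(src_bytes[mask].tolist())  # [500]
```

For this sample, Q1 = 10.75 and Q3 = 12.25, so the accepted range is 8.5 to 14.5 and only the value 500 is flagged.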

Correlation

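plt.figure(figsize=(40,30))
# Correlation is only defined for numeric columns; the categorical columns
# have not been label-encoded yet, so restrict the computation to numeric ones
sns.heatmap(dataset.corr(numeric_only=True), annot=True)

# import plotly.express as px
# fig = px.imshow(dataset.corr(numeric_only=True), text_auto=True, aspect="auto")
# fig.show()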

Output:

Label Encoding

Label encoding is a technique used when getting data ready for analysis. It changes categories, like types of things, into numbers. Each category gets its own number. This helps computer programs, like those used in machine learning, understand and work with the data, especially when they need numbers to do their calculations.
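
A small sketch of what label encoding does to a single column, using scikit-learn's LabelEncoder. The protocol names are sample values chosen for illustration.

```python
from sklearn.preprocessing import LabelEncoder

protocols = ["tcp", "udp", "icmp", "tcp"]

encoder = LabelEncoder()
codes = encoder.fit_transform(protocols)

# classes_ holds the sorted unique categories; a category's index is its code.
print(encoder.classes_.tolist())                   # ['icmp', 'tcp', 'udp']
print(codes.tolist())                              # [1, 2, 0, 1]
print(encoder.inverse_transform([0, 2]).tolist())  # ['icmp', 'udp']
```

The codes follow alphabetical order, not any meaningful ranking. Tree-based models are largely indifferent to that, while distance-based and linear models may read an ordering into the numbers that is not really there.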

def le(df):
    # Replace every string (object) column with integer codes, in place
    for col in df.columns:
        if df[col].dtype == 'object':
            label_encoder = LabelEncoder()
            df[col] = label_encoder.fit_transform(df[col])

le(dataset)

dataset.drop(['num_outbound_cmds'], axis=1, inplace=True)
dataset.head()

Output:

Feature Selection

Feature selection involves picking out the most meaningful and crucial attributes or factors from a dataset to use in a model or analysis. This streamlines the data, making it less complicated, and enhances the model's effectiveness. By pinpointing the correct features, we can concentrate on the most influential details, which leads to better accuracy and efficiency in our analysis or predictions.

The dataset starts with 42 columns, including the class label. Training on every attribute increases computation time and can add noise, so we will keep only the most informative ones.

X = dataset.drop(['class'], axis=1)
Y = dataset['class']

rfc = RandomForestClassifier()

rfe = RFE(rfc, n_features_to_select=10)
rfe = rfe.fit(X, Y)

# get_support() returns a boolean mask marking the columns RFE kept
selected_features = X.columns[rfe.get_support()].tolist()

selected_features

Output:

Above are the relevant features that will be suitable for our models.
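
To show how recursive feature elimination (RFE) arrives at such a list, here is a minimal sketch on synthetic data. It swaps in a DecisionTreeClassifier for speed (the article itself uses a random forest), and the dataset sizes and names such as X_demo are assumptions for the demo.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: 12 features, of which 4 are informative (an assumption for the demo).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=12, n_informative=4, n_redundant=0, random_state=0
)

# RFE repeatedly fits the estimator and drops the least important feature
# until only n_features_to_select remain.
selector = RFE(DecisionTreeClassifier(random_state=0), n_features_to_select=4)
selector.fit(X_demo, y_demo)

kept = [i for i, keep in enumerate(selector.support_) if keep]
print("kept feature indices:", kept)
print("ranking (1 = kept):", selector.ranking_.tolist())
```

The ranking_ attribute records the elimination order: kept features have rank 1, and the feature dropped first has the highest rank.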

X = X[selected_features]

Scale and Split Data

Now we will split the dataset into a Training and Testing set.

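# Split first, then fit the scaler on the training rows only, so that no
# statistics from the test set leak into preprocessing
x_train, x_test, y_train, y_test = train_test_split(X, Y, train_size=0.70, random_state=2)

scale = StandardScaler()
x_train = scale.fit_transform(x_train)
x_test = scale.transform(x_test)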
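
One way to keep the test split out of the scaler's statistics automatically is to wrap scaling and the model in a scikit-learn Pipeline. The following is a sketch on synthetic data; the data, the choice of LogisticRegression, and the names X_tr and X_te are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, train_size=0.70, random_state=2)

# fit() learns the scaling statistics from X_tr only;
# score() reuses those statistics to transform X_te.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)
print(f"Test Score: {pipe.score(X_te, y_te):.3f}")
```

Because the pipeline behaves like a single estimator, it can also be passed to cross-validation or hyperparameter search without re-introducing leakage.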

Modeling

Next, we will train the following models and assess their scores on both the training and testing sets:

KNN (K Nearest Neighbors)
Logistic Regression
Decision Tree Classifier
Random Forest Classifier
SKLearn Gradient Boosting
XGBoost
Light Gradient Boosting
AdaBoost
CatBoost
Naive Bayes
Voting Model
SVM

1. KNN (K Nearest Neighbors)

def objective(trial):
    # Sample a candidate number of neighbours and score it on the test split
    n_neighbors = trial.suggest_int('KNN_n_neighbors', 2, 16, log=False)
    obj_classifier = KNeighborsClassifier(n_neighbors=n_neighbors)
    obj_classifier.fit(x_train, y_train)
    accuracy = obj_classifier.score(x_test, y_test)
    return accuracy


KNN_study = optuna.create_study(direction='maximize')
# A single trial evaluates only one value; raise n_trials to search the range
KNN_study.optimize(objective, n_trials=1)
print(KNN_study.best_trial)

Output:

model_KNN = KNeighborsClassifier(n_neighbors=KNN_study.best_trial.params['KNN_n_neighbors'])
model_KNN.fit(x_train, y_train)

train_KNN, test_KNN = model_KNN.score(x_train, y_train), model_KNN.score(x_test, y_test)

print(f"Train Score: {train_KNN}")
print(f"Test Score: {test_KNN}")

Output:
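
The objective above scores each candidate on the test split, so the reported test score is also the score that was optimized and may be optimistic. A common alternative is to tune with cross-validation on the training split only. The sketch below uses scikit-learn's GridSearchCV rather than Optuna and runs on synthetic data; the data and variable names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, train_size=0.70, random_state=2)

# Search the same range of neighbours (2..16), scoring each candidate
# with 5-fold cross-validation on the training split only.
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": list(range(2, 17))},
    cv=5,
    scoring="accuracy",
)
search.fit(X_tr, y_tr)

print("best n_neighbors:", search.best_params_["n_neighbors"])
print(f"held-out test score: {search.score(X_te, y_te):.3f}")
```

The test split is touched once, after the search has finished, so its score remains an honest estimate of performance on unseen traffic.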

2. Logistic Regression

model_lg = LogisticRegression(random_state = 42)
model_lg.fit(x_train, y_train)

Output:

train_lg, test_lg = model_lg.score(x_train , y_train), model_lg.score(x_test , y_test)

print(f"Training Score: {train_lg}")
print(f"Test Score: {test_lg}")

Output:

3. Decision Tree Classifier

def objective(trial):
    # Sample a tree depth and a number of features considered per split
    dt_max_depth = trial.suggest_int('dt_max_depth', 2, 32, log=False)
    dt_max_features = trial.suggest_int('dt_max_features', 2, 10, log=False)
    obj_classifier = DecisionTreeClassifier(max_features = dt_max_features, max_depth = dt_max_depth)
    obj_classifier.fit(x_train, y_train)
    accuracy = obj_classifier.score(x_test, y_test)
    return accuracy

dt_study = optuna.create_study(direction='maximize')
dt_study.optimize(objective, n_trials=30)
print(dt_study.best_trial)

Output:

dt = DecisionTreeClassifier(max_features = dt_study.best_trial.params['dt_max_features'], max_depth = dt_study.best_trial.params['dt_max_depth'])
dt.fit(x_train, y_train)

train_dt, test_dt = dt.score(x_train, y_train), dt.score(x_test, y_test)

print(f"Train Score: {train_dt}")
print(f"Test Score: {test_dt}")

Output:

fig = plt.figure(figsize = (30,12))
tree.plot_tree(dt, filled=True);
plt.show()

Output:

from matplotlib import pyplot as plt

def f_importance(coef, names, top=-1):
    imp = coef
    imp, names = zip(*sorted(list(zip(imp, names))))

    # Show all features
    if top == -1:
        top = len(names)

    plt.barh(range(top), imp[::-1][0:top], align='center')
    plt.yticks(range(top), names[::-1][0:top])
    plt.title('feature importance for dt')
    plt.show()

# Names of the features, in the same order as the model's importances
features_names = selected_features

# `top` sets how many of the most important features to plot.
# abs() is a no-op here because tree-based importances are never negative;
# it only matters for signed values such as linear-model coefficients.

f_importance(abs(dt.feature_importances_), features_names, top=7)

Output:

4. Random Forest Classifier

def objective(trial):
    # Sample depth, features per split, and number of trees for the forest
    rf_max_depth = trial.suggest_int('rf_max_depth', 2, 32, log=False)
    rf_max_features = trial.suggest_int('rf_max_features', 2, 10, log=False)
    rf_n_estimators = trial.suggest_int('rf_n_estimators', 3, 20, log=False)
    obj_classifier = RandomForestClassifier(max_features = rf_max_features, max_depth = rf_max_depth, n_estimators = rf_n_estimators)
    obj_classifier.fit(x_train, y_train)
    accuracy = obj_classifier.score(x_test, y_test)
    return accuracy

rf_study = optuna.create_study(direction='maximize')
rf_study.optimize(objective, n_trials=30)
print(rf_study.best_trial)

Output:

rf = RandomForestClassifier(max_features = rf_study.best_trial.params['rf_max_features'], max_depth = rf_study.best_trial.params['rf_max_depth'], n_estimators = rf_study.best_trial.params['rf_n_estimators'])
rf.fit(x_train, y_train)

train_rf, test_rf = rf.score(x_train, y_train), rf.score(x_test, y_test)

print(f"Train Score: {train_rf}")
print(f"Test Score: {test_rf}")

Output:

from matplotlib import pyplot as plt

def f_importance(coef, names, top=-1):
    imp = coef
    imp, names = zip(*sorted(list(zip(imp, names))))

    # Show all features
    if top == -1:
        top = len(names)

    plt.barh(range(top), imp[::-1][0:top], align='center')
    plt.yticks(range(top), names[::-1][0:top])
    plt.title('feature importance for rf')
    plt.show()

# Names of the features, in the same order as the model's importances
features_names = selected_features

# `top` sets how many of the most important features to plot.
# abs() is a no-op here because tree-based importances are never negative;
# it only matters for signed values such as linear-model coefficients.

f_importance(abs(rf.feature_importances_), features_names, top=7)

Output:

5. SKLearn Gradient Boosting

SKGB = GradientBoostingClassifier(random_state=42)
SKGB.fit(x_train, y_train)

Output:

train_SKGB, test_SKGB = SKGB.score(x_train , y_train), SKGB.score(x_test , y_test)

print(f"Training Score: {train_SKGB}")
print(f"Test Score: {test_SKGB}")

Output:

6. XGBoost

model_xgb = XGBClassifier(objective="binary:logistic", random_state=42)
model_xgb.fit(x_train, y_train)

Output:

xgb_train, test_xgb = model_xgb.score(x_train , y_train), model_xgb.score(x_test , y_test)

print(f"Training Score: {xgb_train}")
print(f"Test Score: {test_xgb}")

Output:

7. Light Gradient Boosting

model_lgb = LGBMClassifier(random_state=42)
model_lgb.fit(x_train, y_train)

Output:

train_lgb, test_lgb = model_lgb.score(x_train , y_train), model_lgb.score(x_test , y_test)

print(f"Training Score: {train_lgb}")
print(f"Test Score: {test_lgb}")

Output:

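8. AdaBoost

# Default AdaBoost settings (assumed); model_ab is the name the scoring code below expects
model_ab = AdaBoostClassifier(random_state=42)
model_ab.fit(x_train, y_train)

Output: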

train_ab, test_ab = model_ab.score(x_train , y_train), model_ab.score(x_test , y_test)

print(f"Training Score: {train_ab}")
print(f"Test Score: {test_ab}")

Output:

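9. CatBoost

# Default CatBoost settings with logging silenced (assumed); model_cb is the name the scoring code below expects
model_cb = CatBoostClassifier(verbose=0)
model_cb.fit(x_train, y_train)

Output: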

train_cb, test_cb = model_cb.score(x_train , y_train), model_cb.score(x_test , y_test)

print(f"Training Score: {train_cb}")
print(f"Test Score: {test_cb}")

Output:

10. Naive Bayes

model_BNB = BernoulliNB()
model_BNB.fit(x_train, y_train)

Output:

train_BNB, test_BNB = model_BNB.score(x_train , y_train), model_BNB.score(x_test , y_test)

print(f"Training Score: {train_BNB}")
print(f"Test Score: {test_BNB}")

Output:

11. Voting Model

# Each estimator appears once; listing XGBoost twice would give it two votes
v_clf = VotingClassifier(estimators=[('KNeighborsClassifier', model_KNN), ("RandomForestClassifier", rf), ("DecisionTree", dt), ("XGBoost", model_xgb), ("LightGB", model_lgb), ("AdaBoost", model_ab), ("Catboost", model_cb)], voting = "hard")
# The ensemble must be fitted before it can be scored
v_clf.fit(x_train, y_train)

Output:

train_voting, test_voting = v_clf.score(x_train , y_train), v_clf.score(x_test , y_test)

print(f"Training Score: {train_voting}")
print(f"Test Score: {test_voting}")

Output:
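
With voting = "hard", the ensemble returns the label that most of its members predict. A minimal NumPy sketch of that rule, using made-up predictions from three hypothetical models:

```python
import numpy as np

# Hypothetical predictions from three classifiers for five connections
# (rows = models, columns = samples; 1 = attack, 0 = normal).
predictions = np.array([
    [1, 0, 1, 0, 1],
    [1, 1, 0, 0, 1],
    [0, 0, 1, 0, 1],
])

# Hard voting: each sample gets the label predicted by most models.
votes_for_attack = predictions.sum(axis=0)
majority = (votes_for_attack > predictions.shape[0] / 2).astype(int)
print(majority.tolist())  # [1, 0, 1, 0, 1]
```

Using an odd number of voters avoids ties in the binary case, which is one reason not to list the same model twice.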

12. SVM

def objective(trial):
    # 'linearSVC' is not an SVC kernel; it selects the separate LinearSVC estimator
    kernel = trial.suggest_categorical('kernel', ['linear', 'rbf', 'poly', 'linearSVC'])
    c = trial.suggest_float('c', 0.02, 1.0, step=0.02)
    if kernel in ['linear', 'rbf']:
        obj_classifier = SVC(kernel=kernel, C=c).fit(x_train, y_train)
    elif kernel == 'linearSVC':
        obj_classifier = LinearSVC(C=c).fit(x_train, y_train)
    elif kernel == 'poly':
        degree = trial.suggest_int('degree', 2, 10)
        obj_classifier = SVC(kernel=kernel, C=c, degree=degree).fit(x_train, y_train)

    accuracy = obj_classifier.score(x_test, y_test)
    return accuracy

svm_study = optuna.create_study(direction='maximize')
svm_study.optimize(objective, n_trials=30)
print(svm_study.best_trial)

Output:

# Rebuild the best model from the parameters of the best trial
best_params = svm_study.best_trial.params
kernel = best_params['kernel']
if kernel in ['linear', 'rbf']:
    model_SVM = SVC(kernel=kernel, C=best_params['c'])
elif kernel == 'linearSVC':
    model_SVM = LinearSVC(C=best_params['c'])
elif kernel == 'poly':
    model_SVM = SVC(kernel=kernel, C=best_params['c'], degree=best_params['degree'])

model_SVM.fit(x_train, y_train)

Output:

train_SVM, test_SVM = model_SVM.score(x_train , y_train), model_SVM.score(x_test , y_test)

print(f"Training Score: {train_SVM}")
print(f"Test Score: {test_SVM}")

Output:

Model Selection

Now we will compare the train and test scores of all the models we trained and select a model.

data = [["KNN", train_KNN, test_KNN],
        ["Logistic Regression", train_lg, test_lg],
        ["Decision Tree", train_dt, test_dt],
        ["Random Forest", train_rf, test_rf],
        ["GBM", train_SKGB, test_SKGB],
        ["XGBoost", xgb_train, test_xgb],
        ["AdaBoost", train_ab, test_ab],
        ["LightGBM", train_lgb, test_lgb],
        ["CatBoost", train_cb, test_cb],
        ["Naive Bayes", train_BNB, test_BNB],
        ["Voting", train_voting, test_voting],
        ["SVM", train_SVM, test_SVM]]

col_names = ["Model", "Train Score", "Test Score"]
print(tabulate(data, headers=col_names, tablefmt="fancy_grid"))

Output:

As the scores show, almost all the models achieve high accuracy, so the choice can be guided by other considerations. Ensemble methods such as Random Forest, Gradient Boosting, and Voting combine multiple models and can often handle complex relationships in the data effectively, which makes them a good default.

If interpretability matters more, Decision Trees are easy to interpret and visualize, which helps in understanding the intrusion detection process. Logistic Regression and Naive Bayes are also relatively interpretable models.

For high-dimensional data, SVM is a reasonable choice.
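
When several models reach similar accuracy, it can help to look at which kinds of errors they make, since a missed attack and a false alarm have different costs (see the Data Imbalance challenge earlier). The sketch below computes a confusion matrix, precision, and recall with scikit-learn on made-up labels; the label vectors are hypothetical and not results from this dataset.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels: 1 = attack, 0 = normal.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# For binary labels, ravel() yields the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")                    # TN=5 FP=1 FN=2 TP=2
print(f"precision={precision_score(y_true, y_pred):.2f}")    # precision=0.67
print(f"recall={recall_score(y_true, y_pred):.2f}")          # recall=0.50
```

In this toy case accuracy is 70%, yet half of the attacks go undetected, which is exactly the kind of gap that accuracy alone does not reveal.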

Future Aspects of Network Intrusion Detection System Using Machine Learning

The future of Network Intrusion Detection Systems (NIDS) paired with Machine Learning holds great promise. As NIDS models advance, they will become more adaptable in detecting emerging threats and attacks. The concept of transfer learning will speed up the sharing of knowledge, making detection even better. The ability to identify unusual activities (anomalies) will become more precise, even when they're subtle. Quick analysis in real-time will help respond faster to threats. By bringing together different types of data, a more complete picture of potential dangers will be possible.

Making sure the models' decisions are clear (model interpretability) will be a priority. The collective wisdom of defense systems will work together to counter threats. Using AI, responses to incidents will be automated. Detailed analysis will provide a more nuanced understanding of threats. Concerns about privacy will be addressed using methods that protect sensitive information. Ongoing learning and the integration of advanced computing (like quantum computing) will make NIDS even stronger. In summary, NIDS powered by Machine Learning will evolve using new methods, teamwork, and automation to improve cybersecurity.

Conclusion

Network Intrusion Detection Systems using Machine Learning represent a paradigm shift in cybersecurity. As threats become more advanced, NIDS must evolve to match their sophistication. Machine Learning equips NIDS with the adaptability, accuracy, and real-time capabilities necessary to effectively combat modern cyber threats. While challenges persist, the future holds the promise of even more advanced techniques and collaborative approaches to ensure the security of our digital landscapes. With NIDS leveraging Machine Learning, organizations can confidently navigate the complex and ever-changing landscape of cybersecurity.
