Detecting Phishing Websites using Machine Learning

Phishing is a cybercrime that involves the use of fraudulent emails, messages, and websites to steal sensitive information such as passwords, credit card details, and other personal data. With the growth of the internet and online transactions, phishing attacks have become increasingly sophisticated, making it difficult for individuals to detect and avoid them.

Phishing remains one of the most effective ways for attackers to steal money and personal or financial information.

Today's phishing attacks are complex and increasingly hard to detect. According to a survey by Intel Security, 97% of the people surveyed were unable to reliably tell legitimate emails from phishing emails.

Machine learning can be a powerful tool in detecting phishing websites. By training machine learning algorithms on a large dataset of both legitimate and fraudulent websites, the algorithms can learn to distinguish between the two. This can lead to the development of effective phishing detection systems that can automatically identify and warn users about potentially dangerous websites.

There are several types of machine learning algorithms that can be used for phishing detection, including supervised learning, unsupervised learning, and deep learning. Supervised learning algorithms are trained on labelled data, where the features of each website are used to classify it as either legitimate or phishing. Unsupervised learning algorithms, on the other hand, cluster websites based on their features, allowing the detection of outliers that may be indicative of phishing websites.

Deep learning algorithms, such as convolutional neural networks (CNNs), use complex neural network architectures to analyze website features and make predictions.

When training machine learning algorithms for phishing detection, it is important to use a large and diverse dataset of websites. This will help ensure that the algorithms are able to learn and detect phishing websites that are representative of the various types of phishing attacks that exist. Additionally, the features used by the algorithms to distinguish between legitimate and phishing websites must be carefully selected. Common features used in phishing detection include URL structure, website content, and visual cues such as the use of official logos or security certificates.
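As a rough illustration of what URL-structure features can look like, the sketch below derives a few lexical features from a URL using only the Python standard library. The function name extract_url_features and the particular features chosen are assumptions made for this example; they are not the feature set of the dataset used later in this article.

```python
import re
from urllib.parse import urlparse

def extract_url_features(url):
    """Return a small dict of lexical features for one URL (illustrative only)."""
    parsed = urlparse(url)
    hostname = parsed.hostname or ""
    return {
        "url_length": len(url),
        "hostname_length": len(hostname),
        "num_dots": url.count("."),
        "num_hyphens": url.count("-"),
        "num_digits": sum(ch.isdigit() for ch in url),
        "has_at_symbol": int("@" in url),
        "uses_https": int(parsed.scheme == "https"),
        # 1 if the host looks like a dotted-quad IPv4 address instead of a domain name
        "host_is_ip": int(bool(re.fullmatch(r"(\d{1,3}\.){3}\d{1,3}", hostname))),
    }

features = extract_url_features("http://192.168.10.5/secure-login/update.php?id=42")
print(features)
```

Each URL becomes one row of numbers like this, which is the form a classifier can be trained on.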

One of the advantages of using machine learning for phishing detection is that it can be more accurate and effective than traditional methods such as blacklists or heuristics-based systems. This is because machine learning algorithms can learn to identify phishing websites based on their features rather than relying on predefined rules or signatures. This can make them more robust to previously unseen phishing websites, although they still produce some false positives and false negatives.
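False positives and false negatives can be counted directly from a model's predictions and the true labels. The sketch below uses made-up labels (1 = phishing, 0 = legitimate) to show one way of computing the false positive rate and the false negative rate; the function name error_rates is illustrative.

```python
def error_rates(y_true, y_pred):
    """Return (false_positive_rate, false_negative_rate) for binary labels."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    negatives = y_true.count(0)
    positives = y_true.count(1)
    fpr = fp / negatives if negatives else 0.0
    fnr = fn / positives if positives else 0.0
    return fpr, fnr

# Toy data: one legitimate site is flagged, one phishing site is missed.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0]
fpr, fnr = error_rates(y_true, y_pred)
print(f"false positive rate: {fpr:.2f}, false negative rate: {fnr:.2f}")
```

A false positive blocks a legitimate website, while a false negative lets a phishing website through, so both rates are worth tracking alongside accuracy.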

Another advantage of using machine learning for phishing detection is that it can be easily integrated into existing security systems and workflows. For example, machine learning algorithms can be used to automatically scan incoming emails and flag any messages that contain links to phishing websites. They can also be integrated into browser extensions, allowing users to be warned about potentially dangerous websites before they visit them.
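A minimal sketch of the email-scanning idea is shown below. It extracts links from a message body with a regular expression and passes each one to a classifier. The names flag_suspicious_links and toy_classifier are hypothetical, and toy_classifier is only a stand-in rule so the example runs; a real system would call a trained model at that point.

```python
import re

# Simple pattern for http/https links; real email parsing would be more careful.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")

def flag_suspicious_links(email_body, is_phishing):
    """Return the links in email_body for which is_phishing(url) is True."""
    links = URL_PATTERN.findall(email_body)
    return [url for url in links if is_phishing(url)]

def toy_classifier(url):
    """Stand-in for a trained model (assumption for this example)."""
    return "@" in url or url.count("-") > 3

email = (
    "Dear customer, please verify your account at "
    "http://secure-login-update-account-verify.example.com/login "
    "or visit https://www.example.org/help for support."
)
flagged = flag_suspicious_links(email, toy_classifier)
print(flagged)
```

Passing the classifier in as a function keeps the scanning logic separate from the model, so the same code could serve a mail filter or a browser extension.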

Despite the many benefits of using machine learning for phishing detection, there are some limitations and challenges that must be addressed. One of the main challenges is ensuring that the algorithms are able to detect new and evolving types of phishing attacks. This requires ongoing updates to the training data and features used by the algorithms. Additionally, machine learning algorithms can be vulnerable to adversarial attacks, where attackers manipulate the features of phishing websites to evade detection. To address this, it is important to use robust and secure machine learning models that are resistant to these attacks.

Implementation of a Phishing Detection ML Model using Python

Dataset Details

The supplied dataset contains 11,430 URLs with 87 extracted features; together with the url column and the status label, this gives 89 columns. The dataset is intended to serve as a benchmark for phishing detection systems that employ machine learning. The features come from three separate classes: 56 are taken from the structure and syntax of URLs, 24 from the content of the corresponding pages, and 7 are obtained by querying external services. The dataset is balanced; it contains exactly 50% legitimate URLs and 50% phishing URLs.

Now let us implement this in code.

Importing Libraries

import warnings
warnings.filterwarnings("ignore")

import pandas as pd
pd.set_option("display.max_columns",None)
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

import torch
import torch.nn as nn
from torch.utils.data import TensorDataset,DataLoader

Loading the Dataset
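The loading step itself is not shown here, so the following is only a sketch of how it might look, assuming the dataset is available as a CSV file. The helper load_dataset, the file name dataset_phishing.csv, and the sample column names are assumptions for illustration; the small in-memory sample is there only so the example runs without the real file.

```python
import io
import pandas as pd

def load_dataset(source):
    """Read the phishing dataset from a CSV path or file-like object."""
    return pd.read_csv(source)

# Tiny in-memory stand-in with the same kind of layout (url, features..., status).
sample_csv = io.StringIO(
    "url,length_url,nb_dots,status\n"
    "http://www.example.com/index.html,33,3,legitimate\n"
    "http://192.168.10.5/secure-login/update.php,43,4,phishing\n"
)
sample = load_dataset(sample_csv)
print(sample.shape)
print(sample.head())

# With the real file, the calls would look like this (the path is an assumption):
# dataframe = load_dataset("dataset_phishing.csv")
# dataframe.head()
# dataframe.shape
```

The rest of the article refers to the loaded table as dataframe.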

EDA (Exploratory Data Analysis)

Output:

There are a total of 11430 rows and 89 columns in the dataset.

Output:

# removing url features
dataframe=dataframe.drop(labels="url",axis=1)
dataframe.head()

Output:

# checking object dtype features
object_features=[col for col in dataframe.columns if dataframe[col].dtype=="O"]
print(object_features)

Output:

# checking unique values and counts from the collected object features
dataframe['status'].value_counts()

Output:

We can see that the numbers of legitimate websites and phishing websites are the same (5715 each).

with plt.style.context(style="bmh"):
    fig=dataframe['status'].value_counts().plot.bar(figsize=(6,5),
                                             fontsize=15,
                                             title='Analysing status feature using bar-chart',
                                            xlabel='class labels',
                                            ylabel='number of records')
    plt.show()

Output:

with plt.style.context(style="fivethirtyeight"):
    plt.pie(x=dict(dataframe['status'].value_counts()).values(),
           labels=dict(dataframe['status'].value_counts()).keys(),
           autopct="%.2f%%",
           colors=['red','orangered'],
           startangle=90,
           explode=[0,0.05])
    centre_circle=plt.Circle((0,0),0.70,fc='white')
    fig=plt.gcf()
    fig.gca().add_artist(centre_circle)
    plt.title(label="Analysing status feature using donut-chart")
    plt.show()

Output:

class_labels=dataframe['status'].unique().tolist()
class_labels.sort()
print(class_labels)

Output:

class_dict={}
for idx,label in enumerate(class_labels):
    class_dict[label]=idx
print(class_dict)

Output:

We then encode the status column so that legitimate is 0 and phishing is 1.

# label encoding
dataframe['status']=dataframe['status'].map(class_dict)
dataframe.head()

Output:

X=dataframe.iloc[:,:-1]
y=dataframe.iloc[:,-1:]

X.head()

Output:

# data normalization using MinMaxScaler
scaler=MinMaxScaler()
scaler.fit(X.values)
X_scaled=scaler.transform(X.values)
print(X_scaled)

Output:

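We normalize the dataset so that every feature value lies in the range 0 to 1. Note that the scaler here is fitted on the full dataset; fitting it on the training split only would avoid leaking statistics from the test set into training.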
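Min-max scaling maps each value x of a feature to (x - min) / (max - min), where min and max are taken over that feature. A minimal pure-Python sketch of the same arithmetic for a single column is given below; the function name min_max_scale and the sample values are illustrative, and this is not the scikit-learn implementation.

```python
def min_max_scale(column):
    """Rescale a list of numbers to [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(value - lo) / (hi - lo) for value in column]

url_lengths = [20, 35, 50, 80]
scaled = min_max_scale(url_lengths)
print(scaled)  # [0.0, 0.25, 0.5, 1.0]
```

The smallest value becomes 0, the largest becomes 1, and everything else keeps its relative position in between.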

import pickle
with open(file="scaler.pkl",mode="wb") as file:
    pickle.dump(obj=scaler,file=file)


new_X=pd.DataFrame(data=X_scaled,columns=X.columns)
new_X.head()

Output:

Splitting the Dataset

We split the dataset into a training set and a testing set.

X_train,X_test,y_train,y_test=train_test_split(new_X,y,test_size=0.2,random_state=42,shuffle=True,stratify=y)
print(X_train.shape,y_train.shape,X_test.shape,y_test.shape)

Output:
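The stratify=y argument keeps the proportion of legitimate and phishing records the same in the training and testing sets. The sketch below illustrates that idea in plain Python by splitting each class separately; stratified_split is a hypothetical helper written for this example and is not how scikit-learn implements it internally.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) so each class keeps the same test fraction."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                      # shuffle within the class
        n_test = round(len(indices) * test_size)  # same fraction from every class
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

labels = [0] * 50 + [1] * 50  # balanced toy labels
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
print(sum(labels[i] for i in test_idx), "phishing records in the test set")
```

With balanced toy labels, half of the held-out records come from each class, mirroring the 50/50 balance of the full data.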

We then create tensors from the NumPy arrays.

train_input_tensor=torch.from_numpy(X_train.values).float()
train_label_tensor=torch.from_numpy(y_train['status'].values).float()
val_input_tensor=torch.from_numpy(X_test.values).float()
val_label_tensor=torch.from_numpy(y_test['status'].values).float()


train_input_tensor

Output:

train_label_tensor=train_label_tensor.unsqueeze(1)
train_label_tensor

Output:

val_label_tensor=val_label_tensor.unsqueeze(1)
val_label_tensor

Output:

# wrapping training tensors and validation tensors
train_dataset=TensorDataset(train_input_tensor,train_label_tensor)
val_dataset=TensorDataset(val_input_tensor,val_label_tensor)

# splitting the wrapped tensors into batches; only the training data needs to be shuffled each epoch
train_loader=DataLoader(dataset=train_dataset,batch_size=32,shuffle=True)
val_loader=DataLoader(dataset=val_dataset,batch_size=32,shuffle=False)


print(f"number of batches in train_loader: {len(train_loader)}")
print(f"number of records in train_loader: {len(train_loader.dataset)}")
print(f"number of batches in val_loader: {len(val_loader)}")
print(f"number of records in val_loader: {len(val_loader.dataset)}")

Output:
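The number of batches follows from the record counts and the batch size. The short sketch below works through the arithmetic, assuming an 80/20 split of the 11,430 rows and the default DataLoader behaviour of keeping a smaller final batch; batch_layout is an illustrative helper name.

```python
import math

def batch_layout(n_records, batch_size):
    """Return (number_of_batches, size_of_last_batch) when the last batch is kept."""
    n_batches = math.ceil(n_records / batch_size)
    last_batch = n_records - (n_batches - 1) * batch_size
    return n_batches, last_batch

n_total = 11430
n_val = n_total // 5       # 20% held out (test_size=0.2)
n_train = n_total - n_val  # remaining 80%

for name, n in [("train", n_train), ("val", n_val)]:
    n_batches, last_batch = batch_layout(n, 32)
    print(f"{name}: {n} records, {n_batches} batches, last batch has {last_batch} records")
```

Because neither split is a multiple of 32, the final batch in each loader is smaller than the others, which matters whenever a per-batch quantity is divided by the batch size.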

device=torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

Output:

class MLP(nn.Module):
    def __init__(self,dropout=0.4):
        super(MLP,self).__init__()
        self.network=nn.Sequential(
            nn.Linear(in_features=87,out_features=300), # 87 inputs, because the dataset has 87 independent features
            nn.ReLU(),
            nn.BatchNorm1d(num_features=300),
            nn.Dropout(p=dropout),
           
            nn.Linear(in_features=300,out_features=100),
            nn.ReLU(),
            nn.BatchNorm1d(num_features=100),
           
            nn.Linear(in_features=100,out_features=1),
            nn.Sigmoid()
        )
    def forward(self,x):
        x=self.network(x)
        return x


model=MLP(dropout=0.4)
print(model)

Output:
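The size of this network can be worked out by hand from the layer shapes. The sketch below counts the learnable parameters, assuming the default settings in which each linear layer has a bias and each batch normalization layer has a learnable scale and shift; the helper names are illustrative.

```python
def linear_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features

def batchnorm_params(num_features):
    """Learnable scale and shift of a batch normalization layer."""
    return 2 * num_features

layer_counts = [
    linear_params(87, 300),
    batchnorm_params(300),
    linear_params(300, 100),
    batchnorm_params(100),
    linear_params(100, 1),
]
total = sum(layer_counts)
print(layer_counts, total)
```

In PyTorch, the same total can be cross-checked with sum(p.numel() for p in model.parameters()).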

We define the optimizer and the loss function, and then create a function train_loop that trains and validates the model for a given number of epochs.

optimizer=torch.optim.Adam(params=model.parameters(),lr=0.001)
criterion=nn.BCELoss()
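Binary cross-entropy compares a predicted probability p with a label y (0 or 1) as -(y*log(p) + (1-y)*log(1-p)), averaged over the batch. The pure-Python sketch below reproduces that arithmetic on made-up values to show how the loss reacts to confident, undecided, and wrong predictions; binary_cross_entropy is an illustrative name, not the PyTorch implementation.

```python
import math

def binary_cross_entropy(probabilities, targets):
    """Mean binary cross-entropy for predicted probabilities strictly inside (0, 1)."""
    losses = [
        -(t * math.log(p) + (1 - t) * math.log(1 - p))
        for p, t in zip(probabilities, targets)
    ]
    return sum(losses) / len(losses)

targets = [1, 0]
good = binary_cross_entropy([0.9, 0.1], targets)       # confident and correct
undecided = binary_cross_entropy([0.5, 0.5], targets)  # equals log(2)
bad = binary_cross_entropy([0.1, 0.9], targets)        # confident and wrong
print(good, undecided, bad)
```

The loss grows quickly as the model becomes confidently wrong, which is what pushes the sigmoid output toward the correct label during training.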


def train_loop(model,train_loader,val_loader,device,optimizer,criterion,batch_size,epochs):
    # batch_size is kept so existing calls still work; accuracy is computed from the
    # actual number of records, so a smaller final batch is handled correctly
    model=model.to(device)
    num_train_batches=len(train_loader)
    num_val_batches=len(val_loader)
    num_train_records=len(train_loader.dataset)
    num_val_records=len(val_loader.dataset)

    history={"train_accuracy":[],"train_loss":[],"val_accuracy":[],"val_loss":[]}

    for epoch in range(epochs):
        model.train() # training mode

        train_correct=0
        train_loss=0
        val_correct=0
        val_loss=0

        for X,y in train_loader:
            X=X.to(device)
            y=y.to(device)

            # forward propagation
            outputs=model(X)
            pred=torch.round(outputs)

            # loss computation
            loss=criterion(outputs,y)

            # backward propagation
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            train_loss+=loss.item()
            train_correct+=(pred==y).sum().item()
        model.eval() # evaluation mode
        with torch.no_grad():
            for X,y in val_loader:
                X=X.to(device)
                y=y.to(device)

                outputs=model(X)
                pred=torch.round(outputs)

                loss=criterion(outputs,y)

                val_loss+=loss.item()
                val_correct+=(pred==y).sum().item()
        # accuracy: correct predictions over all records; loss: mean of the batch losses
        train_accuracy=train_correct/num_train_records
        train_loss=train_loss/num_train_batches
        val_accuracy=val_correct/num_val_records
        val_loss=val_loss/num_val_batches

        print(f"[{epoch+1:>3d}/{epochs:>3d}], train_accuracy:{train_accuracy:>5f}, train_loss:{train_loss:>5f}, val_accuracy:{val_accuracy:>5f}, val_loss:{val_loss:>5f}")

        history['train_accuracy'].append(train_accuracy)
        history['train_loss'].append(train_loss)
        history['val_accuracy'].append(val_accuracy)
        history['val_loss'].append(val_loss)
    # saving the trained weights
    PATH="/kaggle/working/trained_model.pt"
    torch.save(model.state_dict(),PATH)
    return history
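When the number of records is not a multiple of the batch size, the last batch is smaller than the others, so epoch accuracy is most safely computed as total correct predictions divided by total records. The toy sketch below compares that with averaging per-batch values that assume a fixed batch size; the data and the name epoch_accuracy are made up for illustration.

```python
def epoch_accuracy(batches):
    """batches: iterable of (predictions, labels) lists; returns overall accuracy."""
    correct = 0
    total = 0
    for preds, labels in batches:
        correct += sum(int(p == y) for p, y in zip(preds, labels))
        total += len(labels)
    return correct / total

# Two full batches of 4 and a final partial batch of 2, all predicted correctly.
batches = [
    ([1, 0, 1, 0], [1, 0, 1, 0]),
    ([0, 0, 1, 1], [0, 0, 1, 1]),
    ([1, 0], [1, 0]),
]
nominal_batch_size = 4
per_batch_average = sum(
    sum(int(p == y) for p, y in zip(preds, labels)) / nominal_batch_size
    for preds, labels in batches
) / len(batches)

print(epoch_accuracy(batches))  # 1.0
print(per_batch_average)        # (1 + 1 + 0.5) / 3, an underestimate
```

Every prediction here is correct, yet dividing the partial batch by the nominal batch size reports an accuracy below 1.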

Now we will train our model on the training dataset over 100 epochs.

history=train_loop(model,train_loader,val_loader,device,optimizer,criterion,batch_size=32,epochs=100)

Output:

Now we will plot the accuracy and loss curves.

with plt.style.context(style="fivethirtyeight"):
    # accuracy curves for training and validation
    plt.figure(figsize=(18,8))
    plt.plot(history['train_accuracy'],label="train accuracy")
    plt.plot(history['val_accuracy'],label="val accuracy")
    plt.title(label="Accuracy plots")
    plt.xlabel(xlabel='epochs')
    plt.ylabel(ylabel='accuracy')
    plt.legend() # show the curve labels
    plt.show()

    # loss curves for training and validation
    plt.figure(figsize=(18,8))
    plt.plot(history['train_loss'],label="train loss")
    plt.plot(history['val_loss'],label="val loss")
    plt.title(label="Loss plots")
    plt.xlabel(xlabel='epochs')
    plt.ylabel(ylabel='loss')
    plt.legend() # show the curve labels
    plt.show()

Output:

With the fivethirtyeight style, the training curve is drawn in blue and the validation curve in red, in both the accuracy plot and the loss plot.

For most epochs, the accuracy values fall between 95% and 99%.

In conclusion, machine learning can be a powerful tool for detecting phishing websites. By training algorithms on a large dataset of both legitimate and fraudulent websites, it is possible to develop accurate and effective systems that can automatically identify and warn users about potentially dangerous websites. However, the limitations and challenges associated with machine learning for phishing detection must be addressed to ensure that these systems remain effective and secure.
