Sarcasm Detection Using Neural Networks

Sarcasm is language used to mock, insult, or taunt someone, often by saying the opposite of what is actually meant. It can express irritation or anger, but it is also frequently used to make a conversation funny: a sarcastic remark usually conveys a negative feeling in a positive or humorous way, and it can sometimes sound unkind. On social media, people often troll each other directly or indirectly by using sarcasm. Twitter, in particular, is a popular platform where users share their thoughts and trade sarcastic remarks. Using neural networks, we can build machine learning models that detect this sarcasm automatically.

Problem Statement

We will build a sarcasm detector using neural networks with the help of machine learning models, and classify the input text as sarcastic or non-sarcastic.

Approach to the Problem Statement
Before implementing the sarcasm detector, we need to understand the structure of the problem. The task we are discussing is a binary classification problem. Analysing text for sarcasm is part of Natural Language Processing (NLP), the branch of Artificial Intelligence that helps machines understand and process human language. To predict sarcasm in the tweets, we build the model with a neural network: a machine learning architecture made up of layers of interconnected nodes that learn patterns from data.

Implementation of the Sarcasm Detector Using Neural Networks

Step 1: Importing Libraries

Code:
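The original import cell is not reproduced above. Based on the libraries referenced in the later steps (pandas, NLTK, scikit-learn, TensorFlow/Keras, Matplotlib, Seaborn, and WordCloud), a plausible set of imports might look like the following sketch; the exact list in the original article may differ.

Code (illustrative sketch):

# Data handling and visualisation
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Text cleaning and word clouds
import re
import nltk
from nltk.corpus import stopwords
from wordcloud import WordCloud

# Splitting, tokenization, padding, and the neural network
from sklearn.model_selection import train_test_split
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

Explanation: pandas and NumPy handle the tabular data, Matplotlib and Seaborn draw the plots, re and NLTK clean the text, WordCloud draws the word clouds, scikit-learn splits the data, and TensorFlow/Keras provides the tokenizer, padding utilities, and the neural network layers.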
Step 2: Loading Dataset

Code:

Output:

                                        article_link  \
0  https://www.huffingtonpost.com/entry/versace-b...
1  https://www.huffingtonpost.com/entry/roseanne-...
2  https://local.theonion.com/mom-starting-to-fea...
3  https://politics.theonion.com/boehner-just-wan...
4  https://www.huffingtonpost.com/entry/jk-rowlin...

                                               tweet  is_sarcastic
0  former versace store clerk sues over secret 'b...             0
1  the 'roseanne' revival catches up to our thorn...             0
2  mom starting to fear son's web series closest ...             1
3  boehner just wants wife to listen, not come up...             1
4  j.k. rowling wishes snape happy birthday in th...             0

Step 3: Data Preprocessing

Code:

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26709 entries, 0 to 26708
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   article_link  26709 non-null  object
 1   tweet         26709 non-null  object
 2   is_sarcastic  26709 non-null  int64
dtypes: int64(1), object(2)
memory usage: 626.1+ KB

Explanation: The info() function describes the structure of the data: the number of rows, the column names and data types, and the count of non-null values in each column.

Checking for null values in the data

Output:

article_link    0
headline        0
is_sarcastic    0
dtype: int64

Checking the counts of sarcastic and non-sarcastic texts in the dataset

Output:

0    14985
1    11724
Name: is_sarcastic, dtype: int64

We will visualize the distribution of sarcastic and non-sarcastic texts in the data.

Code:

Output:

Explanation: We have made a count plot, which shows how many sarcastic and non-sarcastic texts are in the dataset.

Checking the minimum and maximum word count

Output: (2, 39)

Visualizing the word count

Output:

Maximum length of a headline in the dataset

Output: 254

Building a vocabulary of the unique words

Output: 36599

Step 4: Data Cleaning

Code:

Explanation: Using the nltk library, we downloaded the English stop-words corpus; stop words are common words that should be ignored while processing the data. Now we clean the data by removing special characters, punctuation marks, and stop words.

Code:

Output:

airline passengers tackle man who rushes cockpit in bomb threat
Out[ ]: 'airline passengers tackle man rushes cockpit bomb threat'

Explanation: We defined a clean() function that cleans the text. Using the re module, it removes the punctuation marks and special characters and drops the stop words, as the example output shows.

Now we will make word clouds, which show the most frequently used words in the dataset.

For the sarcastic text

Code:

Output:

For the non-sarcastic text

Code:

Output:

Step 5: Training and Testing Data

Code:

Explanation: We converted the data into lists so that the dataset can be split into training and testing subsets.

Splitting the dataset into training and testing data

Code:

Output:

Training dataset   : 21367 21367
Validation dataset : 2671 2671
Testing dataset    : 2671 2671

Explanation: We split the dataset into training, validation, and test data in the ratio 80:10:10, which means 80% of the data is used for training, 10% for validation, and the remaining 10% for testing. The size of each subset is printed after the split.

Assigning parameters for training the dataset

Code:

Explanation: We set the vocabulary size to 40000, the embedding dimension to 300, and the maximum sentence length to 80. The padding type is set to post, and any unknown word is replaced by the special <OOV> (out-of-vocabulary) token. We then fit a tokenizer with these parameters on the training data: the tokenizer maps each word to an index built from the training dataset, limited by the vocabulary size, and uses the OOV token for words it has not seen. The fitted tokenizer converts each text into a sequence of word indices, and the sequences are padded to a common length. A sketch of this preprocessing pipeline, from loading the data through padding, is shown below.
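The original code cells for Steps 2-5 are not reproduced above. As a rough guide, here is a hedged sketch of one plausible version of the preprocessing pipeline they describe; the file name "Sarcasm_Headlines_Dataset.json", the variable names, and the exact cleaning rules are assumptions rather than the article's original code.

Code (illustrative sketch):

import re

import pandas as pd
import nltk
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Step 2: load the dataset (file and column names are assumptions)
df = pd.read_json("Sarcasm_Headlines_Dataset.json", lines=True)
df = df.rename(columns={"headline": "tweet"})
print(df.head())

# Step 3: inspect the structure, null values, and class balance
df.info()
print(df.isnull().sum())
print(df["is_sarcastic"].value_counts())

# Step 4: clean the text - lowercase it, strip punctuation and special
# characters, and drop the English stop words
nltk.download("stopwords")
stop_words = set(stopwords.words("english"))

def clean(text):
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return " ".join(w for w in text.split() if w not in stop_words)

df["tweet"] = df["tweet"].apply(clean)

# Step 5: split the data 80:10:10 into train / validation / test sets
texts, labels = df["tweet"].tolist(), df["is_sarcastic"].tolist()
train_texts, rest_texts, train_labels, rest_labels = train_test_split(
    texts, labels, test_size=0.2, random_state=42)
val_texts, test_texts, val_labels, test_labels = train_test_split(
    rest_texts, rest_labels, test_size=0.5, random_state=42)

# Tokenization and padding parameters described above
vocab_size, embedding_dim, max_length = 40000, 300, 80
padding_type, oov_token = "post", "<OOV>"

tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_token)
tokenizer.fit_on_texts(train_texts)
word_index = tokenizer.word_index

train_padded = pad_sequences(tokenizer.texts_to_sequences(train_texts),
                             maxlen=max_length, padding=padding_type)
val_padded = pad_sequences(tokenizer.texts_to_sequences(val_texts),
                           maxlen=max_length, padding=padding_type)
test_padded = pad_sequences(tokenizer.texts_to_sequences(test_texts),
                            maxlen=max_length, padding=padding_type)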
Making the index of the words from the tokenizer

Code:

Output:

{'

Explanation: The word index maps every word that appears in the training data to its integer index.

Converting the training dataset into sequences

Code:

Output:

[[320, 13336, 681, 3589, 2357, 46, 381, 2358, 13337, 6, 2750, 9270], [4, 7191, 2989, 2990, 22, 2, 154, 9271, 388, 2751, 6, 265, 9, 965], [156, 924, 2, 865, 1530, 2097, 599, 5049, 220, 135, 39, 45, 2, 9272], [1352, 37, 218, 382, 2, 1680, 29, 294, 22, 10, 2359, 1416, 5903, 1004], [715, 682, 5904, 1005, 9273, 662, 583, 5, 4, 95, 1292, 90], [9274, 4, 383, 71], [4, 7192, 372, 6, 470, 3590, 1979, 1467]]

Padding the training dataset to a fixed length

Output:

[[  320 13336   681 ...     0     0     0]
 [    4  7191  2989 ...     0     0     0]
 [  156   924     2 ...     0     0     0]
 ...
 [ 1020  3614     5 ...     0     0     0]
 [ 3702 12639    12 ...     0     0     0]
 [ 1247  1017  1087 ...     0     0     0]]

Tokenizing the validation and test data into sequences of indices

Code:

Output:

Training vector   : (21367, 80)
Validation vector : (2671, 80)
Testing vector    : (2671, 80)

Explanation: Here we converted the validation and test data into index sequences and padded them so that all of the vectors have the same shape and size.

Checking the padded train data at a random index

Output:

['brian boitano sobs quietly in dark

Explanation: We decoded the training vector at index 1200. The sequence of indices is first converted back into text using the reverse word mapping; since the maximum length was fixed at 80, the decoded sequence matches that length.

Step 6: Building the Model

Using different layers of a neural network, we will now build our model: a sequential model with embedding, global max pooling, dense, and dropout layers, as sketched below.
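The original model-definition cell is not shown; below is a hedged sketch of a Sequential model whose layer types and parameter counts are consistent with the summary that follows. The dense-layer widths are inferred from those parameter counts, and the activation functions and dropout rates are assumptions.

Code (illustrative sketch):

import tensorflow as tf
from tensorflow.keras import layers

vocab_size, embedding_dim, max_length = 40000, 300, 80  # as set in Step 5

model = tf.keras.Sequential([
    # Learn a 300-dimensional embedding for each of the 40000 vocabulary words
    layers.Embedding(input_dim=vocab_size, output_dim=embedding_dim,
                     input_length=max_length),
    # Collapse the sequence dimension by taking the maximum over time steps
    layers.GlobalMaxPooling1D(),
    # Small stack of dense layers with dropout for regularisation
    layers.Dense(40, activation="relu"),
    layers.Dropout(0.2),
    layers.Dense(20, activation="relu"),
    layers.Dropout(0.2),
    layers.Dense(10, activation="relu"),
    layers.Dropout(0.2),
    # Single sigmoid unit for the binary sarcastic / not-sarcastic decision
    layers.Dense(1, activation="sigmoid"),
])
model.summary()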
Code:

Output:

Model: "sequential_7"
_________________________________________________________________
 Layer (type)                                 Output Shape     Param #
=================================================================
 embedding_8 (Embedding)                      (None, 80, 300)  12000000
 global_max_pooling1d_5 (GlobalMaxPooling1D)  (None, 300)      0
 dense_23 (Dense)                             (None, 40)       12040
 dropout_16 (Dropout)                         (None, 40)       0
 dense_24 (Dense)                             (None, 20)       820
 dropout_17 (Dropout)                         (None, 20)       0
 dense_25 (Dense)                             (None, 10)       210
 dropout_18 (Dropout)                         (None, 10)       0
 dense_26 (Dense)                             (None, 1)        11
=================================================================
Total params: 12013081 (45.83 MB)
Trainable params: 12013081 (45.83 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________

Explanation: The summary() function summarizes the model and gives an overview of the layers used to build it, along with their output shapes and parameter counts.

Compiling the model

Code:

Explanation: We compiled our model using the Adam optimizer, the binary cross-entropy loss function, and accuracy as the evaluation metric.

Step 7: Deploying and Evaluating the Model

Code:

Output:

Epoch 1/10
668/668 [==============================] - 287s 430ms/step - loss: 0.0106 - accuracy: 0.9978 - val_loss: 1.1091 - val_accuracy: 0.8540
Epoch 2/10
668/668 [==============================] - 277s 414ms/step - loss: 0.0103 - accuracy: 0.9977 - val_loss: 1.0149 - val_accuracy: 0.8502
Epoch 3/10
668/668 [==============================] - 254s 380ms/step - loss: 0.0063 - accuracy: 0.9984 - val_loss: 1.4693 - val_accuracy: 0.8495
Epoch 4/10
668/668 [==============================] - 236s 354ms/step - loss: 0.0049 - accuracy: 0.9989 - val_loss: 1.5654 - val_accuracy: 0.8510
Epoch 5/10
668/668 [==============================] - 270s 404ms/step - loss: 0.0045 - accuracy: 0.9990 - val_loss: 1.2844 - val_accuracy: 0.8499
Epoch 6/10
668/668 [==============================] - 243s 364ms/step - loss: 0.0055 - accuracy: 0.9985 - val_loss: 1.9587 - val_accuracy: 0.8476
Epoch 7/10
668/668 [==============================] - 259s 387ms/step - loss: 0.0081 - accuracy: 0.9978 - val_loss: 1.9838 - val_accuracy: 0.8510
Epoch 8/10
668/668 [==============================] - 233s 349ms/step - loss: 0.0050 - accuracy: 0.9987 - val_loss: 1.7891 - val_accuracy: 0.8472
Epoch 9/10
668/668 [==============================] - 235s 352ms/step - loss: 0.0036 - accuracy: 0.9993 - val_loss: 2.2813 - val_accuracy: 0.9502
Epoch 10/10
668/668 [==============================] - 242s 362ms/step - loss: 0.0045 - accuracy: 0.9987 - val_loss: 0.2687 - val_accuracy: 0.9854

Explanation: We trained the model for a chosen number of epochs (here, 10) using the fit() method, monitoring the loss and accuracy on the validation data after each epoch. By the final epoch, the validation accuracy reaches about 98%.

Visualizing the accuracy of the model

Code:

Output:

Explanation: We plotted validation loss against training loss and validation accuracy against training accuracy across the epochs.

Evaluating the Model

Code:

Output:

84/84 [==============================] - 0s 670us/step - loss: 0.2684 - accuracy: 0.9739
The Accuracy on the test dataset : 97.39%

Explanation: Evaluating the model on the test dataset gives an accuracy of 97.39%.

Step 8: Predictions

Code:

Output:

84/84 [==============================] - 1s 5ms/step
[1, 0, 0, 1, 0, 0, 1, 1]

Explanation: We generated predictions for the test data and printed a few of the predicted labels, where 1 means sarcastic and 0 means not sarcastic.
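Putting Steps 6-8 together, here is a condensed, hedged sketch of the compiling, training, evaluation, and prediction code described above. The variable names carry over from the earlier sketches, and the batch size is an assumption inferred from the 668 steps per epoch shown in the training output.

Code (illustrative sketch):

import numpy as np

# Compile with the Adam optimizer, binary cross-entropy loss, and accuracy metric
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Labels need to be arrays for Keras
train_labels = np.array(train_labels)
val_labels = np.array(val_labels)
test_labels = np.array(test_labels)

# Step 7: train for 10 epochs, validating on the held-out validation set
history = model.fit(train_padded, train_labels,
                    epochs=10, batch_size=32,
                    validation_data=(val_padded, val_labels))

# Evaluate on the test set
loss, accuracy = model.evaluate(test_padded, test_labels)
print(f"The Accuracy on the test dataset : {accuracy * 100:.2f}%")

# Step 8: predict labels by thresholding the sigmoid output at 0.5
probabilities = model.predict(test_padded)
predicted_labels = [1 if p >= 0.5 else 0 for p in probabilities.reshape(-1)]
print(predicted_labels[:8])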
Making the Confusion Matrix

Code:

Output:

Creating the Classification Report

Code:

Output:

Classification Report:
               precision    recall  f1-score   support

Not Sarcastic       0.84      0.89      0.87      1536
    Sarcastic       0.84      0.77      0.81      1135

     accuracy                           0.84      2671
    macro avg       0.84      0.83      0.84      2671
 weighted avg       0.84      0.84      0.84      2671

Predicting the sarcasm for different statements

Code:

Output:

Enter a headline for prediction (type 'no' to quit): hello
1/1 [==============================] - 0s 58ms/step
Headline: hello
Prediction: Text is Not Sarcastic
Enter a headline for prediction (type 'no' to quit): you are a good person
1/1 [==============================] - 0s 46ms/step
Headline: you are a good person?
Prediction: Text is Sarcastic
Enter a headline for prediction (type 'no' to quit): are you doing the work?
1/1 [==============================] - 0s 33ms/step
Headline: are you doing the work?
Prediction: Text is Not Sarcastic
Enter a headline for prediction (type 'no' to quit): no

Finally, we have detected sarcasm in the text: given any input text, the model now predicts whether or not it is sarcastic. A sketch of the interactive prediction loop used above is given below.
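The original code for this interactive loop is not shown above. Here is a hedged sketch of how it might look; the clean() function, tokenizer, padding settings, and model come from the earlier sketches, and the prompt wording follows the output shown above.

Code (illustrative sketch):

# Interactive prediction loop: type a headline, or 'no' to quit
while True:
    headline = input("Enter a headline for prediction (type 'no' to quit): ")
    if headline.strip().lower() == "no":
        break
    sequence = tokenizer.texts_to_sequences([clean(headline)])
    padded = pad_sequences(sequence, maxlen=max_length, padding=padding_type)
    probability = model.predict(padded)[0][0]
    print("Headline:", headline)
    print("Prediction:", "Text is Sarcastic" if probability >= 0.5 else "Text is Not Sarcastic")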