NLP Analysis of Restaurant Reviews

Natural Language Processing (NLP) is a branch of computer science and artificial intelligence that focuses on the interaction between computer systems and human (natural) languages, specifically on how computers are programmed to process and analyse large quantities of natural-language data. Much of modern NLP relies on machine learning: text is analysed and then used for predictive analysis. Scikit-learn is an open-source machine learning library for the Python programming language. It is written mainly in Python, with some core routines written in Cython for performance. Cython is a superset of the Python language that compiles to C, so code written mostly in Python can approach C's performance.

Let's explore the steps involved in text processing and NLP.

This approach can be used to analyse any text, for example to classify books into Romance and Fiction. Here, we use an existing dataset of restaurant reviews to classify feedback as positive or negative.

Steps Involved:

Following is the stepwise explanation of how to analyse the restaurant reviews using NLP:

Step 1: Import the data with the delimiter set to '\t', since the columns are separated by a tab. The review and its category (0 or 1) are not separated by any other symbol. A tab is used because most other symbols (such as $ for prices, ... or !) can appear inside a review; if one of them were used as the delimiter, parsing would misbehave (odd output, errors).

Code:

# First, we will import required Libraries
import numpy as nmp
import pandas as pnd

# Here, we will import dataset
dataset1 = pnd.read_csv('Restaurant_Reviews.tsv', delimiter = '\t')
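
To see why the tab is a safe delimiter, here is a minimal sketch using only Python's standard csv module. The two sample reviews are hypothetical and are not taken from the dataset; the helper name parse is also made up for this illustration.

```python
import csv
import io

# Hypothetical sample in the same layout as the dataset:
# review text, a tab, then the 0/1 label.
sample = (
    "Review\tLiked\n"
    "Wow... Loved this place, really!\t1\n"
    "Crust is not good, and it cost $20.\t0\n"
)

def parse(text, delimiter):
    """Parse delimited text into a list of rows (each row is a list of fields)."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter, quoting=csv.QUOTE_NONE)
    return list(reader)

tab_rows = parse(sample, "\t")
comma_rows = parse(sample, ",")

# With a tab, every row splits cleanly into [review, label].
print(tab_rows[1])
# With a comma, the review itself is cut at its own punctuation.
print(comma_rows[1])
```

With the tab delimiter the second field of each data row is the label; with the comma delimiter the review is split in the middle and the label stays glued to the rest of the text.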

Step 2: Text Cleaning or Pre-processing

Remove punctuation and numbers: Punctuation and numbers do not help much when processing this text. If they are kept, they only increase the size of the bag of words created in a later step and reduce the effectiveness of the algorithm.
Stemming: reduce each word to its root.
Convert every word to lowercase: it is useless to keep the same word in different cases (e.g., "good" and "GOOD").

Code:

# Here, we will import the library for cleaning data
import re as RE

# Now, we will import the Natural Language Toolkit
import nltk as NLTK

NLTK.download('stopwords')

# Now, we will use the corpus module for removing stopwords
from nltk.corpus import stopwords as SW

# Here, we will import stem.porter for stemming purposes
from nltk.stem.porter import PorterStemmer as PS

# Now, we will initialize an empty list for appending clean text
corpus = []

# The stemmer and the stopword set are created once, outside the loop
ps1 = PS()
stop_words = set(SW.words('english'))

# Now, we will clean the 1000 rows (reviews)
for k in range(0, 1000):

    # column: "Review", kth row; replace every non-letter with a space
    review1 = RE.sub('[^a-zA-Z]', ' ', dataset1['Review'][k])

    # Now, we will convert all cases to lower case
    review1 = review1.lower()

    # Now, we will split into a list of words (default delimiter is whitespace)
    review1 = review1.split()

    # Now, we will drop the stopwords and stem each remaining word
    # of the kth review
    review1 = [ps1.stem(word) for word in review1
               if word not in stop_words]

    # Now, we will rejoin the list elements back into a single string
    review1 = ' '.join(review1)

    # Here, we will append each string to build the list of clean text
    corpus.append(review1)
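
The cleaning steps can also be tried on a single review without downloading anything. The following is a minimal sketch, not the tutorial's exact pipeline: STOP_WORDS is a deliberately tiny, hypothetical stop-word set standing in for NLTK's English list, clean_review is a made-up helper name, and stemming is left out to keep the example dependency-free.

```python
import re

# Hypothetical, deliberately tiny stop-word set (assumption for this sketch);
# the tutorial code uses NLTK's full English list instead.
STOP_WORDS = {"the", "is", "a", "this", "and", "was", "it"}

def clean_review(text, stop_words=STOP_WORDS):
    """Keep letters only, lowercase, drop stop words, and rejoin into one string."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)   # punctuation and digits become spaces
    words = letters_only.lower().split()              # lowercase, then split on whitespace
    kept = [word for word in words if word not in stop_words]
    return " ".join(kept)

cleaned = clean_review("Wow... Loved this place!!! 10/10")
print(cleaned)  # wow loved place
```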

Step 3: Tokenization. This involves splitting the body of the text into sentences and words. In the code above, review1.split() performs this step at the word level.
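
As an illustration of tokenization at both levels, here is a rough sketch based on regular expressions. The helper names split_sentences and split_words are hypothetical, and the rules are much simpler than those of a real tokenizer such as the ones shipped with NLTK.

```python
import re

def split_sentences(text):
    """Split text after '.', '!' or '?' when followed by whitespace (a rough heuristic)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]

def split_words(sentence):
    """Return the runs of letters (and apostrophes) found in a sentence."""
    return re.findall(r"[A-Za-z']+", sentence)

text = "The food was great. Service was slow! Would I return? Maybe."
sentences = split_sentences(text)
words = [split_words(sentence) for sentence in sentences]

print(sentences)
print(words[1])  # ['Service', 'was', 'slow']
```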

Step 4: Building the bag of words as a sparse matrix

Take all the distinct words from the reviews in the dataset, without repetition.
Each word gets its own column, so there will be many columns.
Each row is a review.
If a word occurs in a review, its count appears in that review's row under the column for that word.

For this purpose, we need the CountVectorizer class from sklearn.feature_extraction.text. We can also cap the number of features (keeping only the most frequent words) by setting the "max_features" parameter. Fit the vectorizer on the corpus and apply the same transformation to it with ".fit_transform(corpus)", then convert the result into an array. Whether a review is positive or negative is given in the second column of the dataset, selected with dataset1.iloc[:, 1]: all rows and column 1 (indexing from zero).

Code:

# Here, we will start creating the Bag of Words model
from sklearn.feature_extraction.text import CountVectorizer as CV

# For extracting a maximum of 1500 features. "max_features" is a parameter to
# experiment with to get better results
cv1 = CV(max_features = 1500)

# here, P contains the bag-of-words features (independent variable)
P = cv1.fit_transform(corpus).toarray()

# q contains the labels: whether each review is positive or negative
# (dependent variable)
q = dataset1.iloc[:, 1].values
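
To make the shape of the bag of words concrete, here is a hand-rolled sketch in plain Python. It is only an illustration of the idea (one column per distinct word, one row per review, counts in the cells) and not how CountVectorizer is implemented; the three cleaned reviews are hypothetical.

```python
from collections import Counter

# Hypothetical cleaned reviews, in the same form as the entries of `corpus`.
toy_corpus = ["wow love place", "crust not good", "not good place place"]

# One column per distinct word, in alphabetical order.
vocabulary = sorted({word for doc in toy_corpus for word in doc.split()})

# One row per review; each cell is the count of that column's word in the review.
matrix = []
for doc in toy_corpus:
    counts = Counter(doc.split())
    matrix.append([counts[word] for word in vocabulary])

print(vocabulary)  # ['crust', 'good', 'love', 'not', 'place', 'wow']
for row in matrix:
    print(row)
# [0, 0, 1, 0, 1, 1]
# [1, 1, 0, 1, 0, 0]
# [0, 1, 0, 1, 2, 0]
```

Most cells are zero even in this tiny example, which is why the real matrix is stored in sparse form before being converted with .toarray().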

The description of the dataset used:

Columns are separated by \t (tab).
The first column contains the text of the reviews.
In the second column, 0 stands for a negative review and 1 for a positive review.

Step 5: Splitting the corpus into training and test sets. For this, we need the train_test_split function from sklearn.model_selection (older scikit-learn versions exposed it in sklearn.cross_validation). The split can be 70/30, 80/20, 85/15, or 75/25. Here, 75/25 is chosen through "test_size". P is the bag-of-words matrix and q holds the labels, 0 or 1 (negative or positive).

Code:

# Now, we will Split the dataset into the Training set and Test set
# using cross validation
from sklearn.model_selection import train_test_split as tts

# Now, we will experiment with "test_size" for getting better results
P_train, P_test, q_train, q_test = tts(P, q, test_size = 0.25)
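
Because the split is random, results change from run to run. The sketch below shows one possible way to make it repeatable and to keep the class balance the same in both sets, using the random_state and stratify parameters of train_test_split. The toy arrays are hypothetical, and these two parameters are an addition for illustration, not part of the code above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 8 samples with 3 features each and balanced 0/1 labels.
P_toy = np.arange(24).reshape(8, 3)
q_toy = np.array([0, 1, 0, 1, 0, 1, 0, 1])

# random_state fixes the shuffle; stratify keeps the 0/1 ratio in both sets.
P_tr, P_te, q_tr, q_te = train_test_split(
    P_toy, q_toy, test_size=0.25, random_state=0, stratify=q_toy
)

print(P_tr.shape, P_te.shape)  # (6, 3) (2, 3)
```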

Step 6: Fit a predictive model (here, we will use the Random Forest classifier)

Random Forest is an ensemble model (that is, it is made of many trees), so we import the RandomForestClassifier class from the sklearn.ensemble module.
We use 501 trees ("n_estimators") and "entropy" as the criterion.
The model is fitted with the .fit() method, using P_train and q_train as arguments.

Code:

# Now, we will use Random Forest classification on the Training set
from sklearn.ensemble import RandomForestClassifier as RFC

# here, n_estimators is the number of trees;
# experiment with n_estimators for getting better results
model1 = RFC(n_estimators = 501,
             criterion = 'entropy')

model1.fit(P_train, q_train)
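
The same fit-and-predict pattern can be tried on a toy example. This is only a sketch under stated assumptions: the count matrix is hypothetical (two columns standing for the words "bad" and "good"), 25 trees are used instead of 501 to keep it quick, and random_state is added so that repeated runs build the same forest.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical bag-of-words counts for the columns ["bad", "good"].
P_small = np.array([[0, 2], [0, 1], [1, 0], [2, 0], [0, 3], [3, 0]])
q_small = np.array([1, 1, 0, 0, 1, 0])  # 1 = positive, 0 = negative

# Fewer trees than in the tutorial, and a fixed seed for repeatability.
toy_model = RandomForestClassifier(n_estimators=25, criterion="entropy", random_state=0)
toy_model.fit(P_small, q_small)

# Predict labels for two unseen count vectors.
P_new = np.array([[0, 2], [2, 0]])
toy_pred = toy_model.predict(P_new)
print(toy_pred)
```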

Step 7: Predict the final results with the .predict() method, passing P_test as the argument.

Code:

# Predicting the Test set results
q_pred = model1.predict(P_test)

q_pred

NOTE: Accuracy with the random forest was about 72 percent. (The figure can vary with the test size, which is 0.25 here, and from run to run, because the train/test split is random.)
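
The accuracy figure is simply the fraction of test reviews whose predicted label matches the true label. A minimal sketch in plain Python follows; the label lists are hypothetical, and the helper name accuracy is made up here (scikit-learn offers sklearn.metrics.accuracy_score for the same purpose).

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy of empty inputs")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Hypothetical true and predicted labels for eight test reviews.
true_labels = [1, 0, 1, 1, 0, 0, 1, 0]
pred_labels = [1, 0, 0, 1, 0, 1, 1, 0]

score = accuracy(true_labels, pred_labels)
print(score)  # 0.75
```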

Step 8: To judge how reliable the accuracy figure is, we need the confusion matrix.

The confusion matrix is a 2x2 matrix. In scikit-learn's output, rows correspond to the actual classes and columns to the predicted classes.

TRUE POSITIVE: The number of actual positives that were correctly predicted as positive.

TRUE NEGATIVE: The number of actual negatives that were correctly predicted as negative.

FALSE POSITIVE: The number of actual negatives that were incorrectly predicted as positive.

FALSE NEGATIVE: The number of actual positives that were incorrectly predicted as negative.

Code:

# Here, we will create the Confusion Matrix
from sklearn.metrics import confusion_matrix as CM

cm1 = CM(q_test, q_pred)

cm1

Output:

array([[105,  25],
       [ 41,  79]], dtype=int64)
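
The matrix above can be read as [[TN, FP], [FN, TP]], since scikit-learn puts actual classes in rows and predicted classes in columns. The following sketch derives accuracy, precision, and recall from those four numbers by hand. Note that this particular matrix gives about 73.6 percent accuracy, slightly different from the 72 percent quoted earlier, which would be consistent with the two figures coming from different random splits.

```python
# Values copied from the confusion matrix output above.
cm_values = [[105, 25],
             [41, 79]]

# Rows are actual classes (0, 1); columns are predicted classes (0, 1).
(tn, fp), (fn, tp) = cm_values
total = tn + fp + fn + tp

acc = (tp + tn) / total      # share of all reviews classified correctly
precision = tp / (tp + fp)   # share of predicted positives that are truly positive
recall = tp / (tp + fn)      # share of actual positives that were found

print(f"accuracy:  {acc:.3f}")        # accuracy:  0.736
print(f"precision: {precision:.3f}")  # precision: 0.760
print(f"recall:    {recall:.3f}")     # recall:    0.658
```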
