Python Code for Naive Bayes Algorithm

Suppose you're a product manager and you want to classify customer reviews into positive and negative feedback. Or, as a loan manager, you want to identify which loan applicants are safe and which are risky. Or, as a healthcare analyst, you want to predict which patients are likely to develop diabetes. All of these situations share the same underlying problem: classifying reviews, loan applications, and patients.

Naive Bayes is one of the simplest and fastest classification techniques and is well suited to large volumes of data. Naive Bayes classifiers are used successfully in many applications, including spam filtering, text classification, sentiment analysis, and recommender systems. They apply the Bayes theorem of probability to predict the class of unknown data.

Workflow for Classification

The first step in any classification task is to understand the problem and identify potential features and the label. Features are the characteristics or attributes that influence the outcome of the label. For instance, in the case of loan distribution, bank managers look at the customer's occupation, income, age, location, previous loan history, transaction history, and credit score. These attributes are known as features and help the model classify customers.

The classification process has a learning phase and an evaluation phase. In the learning phase, the classifier trains its model on a given dataset; in the evaluation phase, its performance is tested. Performance is assessed using metrics such as accuracy, error, precision, and recall.


Naive Bayes Classifier: What is it?

Naive Bayes is a statistical classification technique based on the Bayes Theorem and one of the simplest supervised learning algorithms. The naive Bayes classifier is a fast, accurate, and reliable method that performs quickly and accurately on large datasets.

The naive Bayes classifier assumes that the effect of a particular feature on a class is independent of the other features. For instance, a loan applicant's suitability depends on factors such as their income, previous loan and transaction history, age, and location. Even though these features are interrelated, they are treated as independent. This assumption is considered naive because it simplifies computation, and it is known as class conditional independence.

Bayes' theorem states that P(h|D) = P(D|h) * P(h) / P(D), where:
  • P(h): the probability that hypothesis h is true, regardless of the data. This is called the prior probability of h.
  • P(D): the probability of the data, regardless of the hypothesis. This is called the prior probability of D.
  • P(h|D): the probability of hypothesis h given the data D. This is called the posterior probability.
  • P(D|h): the probability of the data D given that hypothesis h is true. This is called the likelihood.

How Does the Naive Bayes Classifier Operate?

Let's use an example to understand better how Naive Bayes works. Consider the problem of playing sport depending on the weather: you need to classify whether or not players will play, based on the weather conditions.

First Method (in the case of a single feature): the Naive Bayes classifier performs the following steps to determine the probability of an event:

Step 1: Calculate the prior probability for the given class labels.

Step 2: Find the likelihood probability of each attribute for each class.

Step 3: Put these values into the Bayes formula and calculate the posterior probability.

Step 4: See which class has the higher probability; the input belongs to the higher-probability class.

The calculation of prior and posterior probabilities can be streamlined using two kinds of tables: a frequency table and likelihood tables. The frequency table lists the frequency of labels for each feature. There are two likelihood tables: Likelihood Table 1 shows the prior probabilities of the labels, and Likelihood Table 2 shows the posterior probabilities.

(Frequency and likelihood tables computed from the weather dataset of 14 observations, showing how often players played for each weather condition.)

You want to determine the probability of playing when the weather is overcast.

Probability of playing:

P(Yes | Overcast) = P(Overcast | Yes) * P(Yes) / P(Overcast) .......... (1)

Calculate the prior probabilities:

P(Overcast) = 4/14 = 0.29

P(Yes) = 9/14 = 0.64

Calculate the likelihood (conditional probability):

P(Overcast | Yes) = 4/9 = 0.44

Put the prior and likelihood values into equation (1): P(Yes | Overcast) = 0.44 * 0.64 / 0.29 = 0.98 (higher).

Similarly, you can calculate the probability of not playing:

Probability of not playing: P(No | Overcast) = P(Overcast | No) * P(No) / P(Overcast) .......... (2)

Calculate the prior probabilities:

P(Overcast) = 4/14 = 0.29

P(No) = 5/14 = 0.36

Calculate the likelihood (conditional probability):

P(Overcast | No) = 0/5 = 0

Put the prior and likelihood values into equation (2):

P(No | Overcast) = 0 * 0.36 / 0.29 = 0

The probability of the "Yes" class is higher. Therefore, you can conclude that players will play if the weather is overcast.

Second Method (When There Are Multiple Features)

(Frequency and likelihood tables for the two features, Weather and Temperature.)

You want to determine the probability of playing when the weather is overcast and the temperature is mild.

Probability of playing:

P(Play=Yes | Weather=Overcast, Temp=Mild) = P(Weather=Overcast, Temp=Mild | Play=Yes) * P(Play=Yes) .......... (1)

P(Weather=Overcast, Temp=Mild | Play=Yes) = P(Overcast | Yes) * P(Mild | Yes) .......... (2)

  1. Calculate the prior probability: P(Yes) = 9/14 = 0.64
  2. Calculate the likelihoods (conditional probabilities): P(Overcast | Yes) = 4/9 = 0.44 and P(Mild | Yes) = 4/9 = 0.44
  3. Put the likelihood values into equation (2): P(Weather=Overcast, Temp=Mild | Play=Yes) = 0.44 * 0.44 = 0.1936 (higher)
  4. Put the prior and likelihood values into equation (1): P(Play=Yes | Weather=Overcast, Temp=Mild) = 0.1936 * 0.64 = 0.124

Similarly, you can calculate the probability of not playing:

Probability of not playing:

P(Play=No | Weather=Overcast, Temp=Mild) = P(Weather=Overcast, Temp=Mild | Play=No) * P(Play=No) .......... (3)

P(Weather=Overcast, Temp=Mild | Play=No) = P(Weather=Overcast | Play=No) * P(Temp=Mild | Play=No) .......... (4)

  1. Calculate the prior probability: P(No) = 5/14 = 0.36
  2. Calculate the likelihoods (conditional probabilities): P(Weather=Overcast | Play=No) = 0/5 = 0 and P(Temp=Mild | Play=No) = 2/5 = 0.4
  3. Put the likelihood values into equation (4): P(Weather=Overcast, Temp=Mild | Play=No) = 0 * 0.4 = 0
  4. Put the prior and likelihood values into equation (3): P(Play=No | Weather=Overcast, Temp=Mild) = 0 * 0.36 = 0

The probability of the "Yes" class is higher. Therefore, you can conclude that players will play when the weather is overcast and the temperature is mild.

Building a Naive Bayes Classifier in Scikit-learn with a Synthetic Dataset

In the first example, we will generate synthetic data with scikit-learn, then train and evaluate the Gaussian Naive Bayes algorithm on it.

Establishing the Dataset

Scikit-learn provides a machine learning environment in which we can create datasets and test several machine learning algorithms.

In this example, we use the make_classification function to create a dataset with six features, three classes, and 800 samples.
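A minimal sketch of this step, with parameter values taken from the explanation below (the random_state value is an assumption made for reproducibility):

from sklearn.datasets import make_classification

# Create a synthetic dataset: 800 samples, 6 features, 3 classes,
# 2 informative features, and 1 cluster per class
X, y = make_classification(
    n_features=6,
    n_classes=3,
    n_samples=800,
    n_informative=2,
    random_state=1,  # assumed seed value
    n_clusters_per_class=1,
)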

Explanation

The make_classification function from the sklearn.datasets module is imported in this code.

  • The make_classification function generates a random dataset for classification tasks.
  • The function accepts the following arguments:
  • n_features: the number of features (independent variables) in the dataset. In this case, there are six features.
  • n_classes: the number of classes (target variable values) in the dataset. In this case, there are three classes.
  • n_samples: the total number of samples (observations) in the dataset. In this case, there are 800 samples.
  • n_informative: the number of informative features in the dataset. These features actually influence the target variable. In this case, there are two informative features.
  • random_state: the seed for the random number generator. This ensures the dataset is reproducible.
  • n_clusters_per_class: the number of clusters per class. This controls how well separated the classes are. In this case, there is just one cluster per class.
  • The function returns two arrays: X, an array of shape (n_samples, n_features) holding the dataset's features, and y, an array of shape (n_samples,) holding the dataset's target variable.

To display the dataset, we will use the scatter function from matplotlib.pyplot.
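A possible version of that plot, assuming the X and y arrays from the previous step (the asterisk marker follows the explanation below):

import matplotlib.pyplot as plt

# Plot the first two feature columns and color each point by its class label
plt.scatter(X[:, 0], X[:, 1], c=y, marker="*")
plt.show()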

Explanation:

The scatter() method is used to generate a scatter plot after importing the matplotlib.pyplot module.

  • It is assumed that the X and y variables are arrays or data frames that have already been defined.
  • The scatter() method takes three arguments: X[:, 0] and X[:, 1] represent the first and second columns of the X array, respectively, and c=y assigns a color to each point based on its corresponding value in the y array.
  • The marker parameter, in this example, an asterisk, determines the type of marker to be used for each point.
  • The values in the first column of X will be plotted on the x-axis, the values in the second column of X will be shown on the y-axis, and each point will be colored according to its corresponding value in y.

As we can see, there are three different target label classes, so we will be developing a multiclass classification model.

Train Test Split

Before we begin the training procedure, we must split the dataset into training and testing sets for model evaluation.
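One way this split could look, following the explanation below (the random_state value of 125 is an assumption):

from sklearn.model_selection import train_test_split

# Hold out 33% of the samples for testing; fix the seed for a reproducible split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=125
)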

Explanation

  • From the sklearn.model_selection module, this code imports the train_test_split function.
  • The training and testing sets are created using this function, which divides the dataset.
  • Four arguments, X, y, test_size, and random_state, are required by the train_test_split function.
  • The target variable is denoted by y, and X represents the input features.
  • The percentage of the dataset that ought to be allotted to the testing set is called test_size.
  • In this instance, it is set to 0.33, which denotes that testing will use 33% of the data.
  • By using random_state to configure the random number generator's seed, you can make sure that the same random split is produced each time the program is executed.
  • The function returns four variables: X_train, X_test, y_train, and y_test.
  • X_train and y_train form the training set, while X_test and y_test form the testing set.
  • These variables are used to train and test a machine learning model.

Model Construction and Training

We construct a Gaussian Naive Bayes model and train it on the training data. Then we feed the model a random test sample to get a predicted value.
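A sketch of that step, assuming the training and test splits from above (using the seventh test sample, as in the explanation below):

from sklearn.naive_bayes import GaussianNB

# Build and train a Gaussian Naive Bayes model
model = GaussianNB()
model.fit(X_train, y_train)

# Predict the class of a single test sample (the seventh element of X_test)
predicted = model.predict([X_test[6]])

print("Actual Value:", y_test[6])
print("Predicted Value:", predicted[0])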

Explanation:

The scikit-learn package is used in this code to create a Gaussian Naive Bayes classifier.

  • The code first imports the GaussianNB class from the sklearn.naive_bayes module.
  • After that, the variable "model" is given a fresh instance of the GaussianNB class.
  • The model is then trained using the fit() method, which takes the training data (X_train) and the corresponding target values (y_train).
  • The model is used to forecast the results for a single test data point, represented by the seventh element in the X_test array, once it has been trained.
  • The 'predicted' variable contains the predicted value.
  • Finally, the actual value of the test data point (y_test[6]) and the predicted value (predicted[0]) are printed.

Output:

Actual Value: 0
Predicted Value: 0

Model Assessment

We will now evaluate the model on the unseen test dataset. First, we will predict values for the test dataset and use those predictions to calculate accuracy and the F1 score.
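A hedged sketch of that evaluation, assuming the trained model from the previous step; because this is a multiclass problem, f1_score is given average="weighted" here, which matches the "weighted average" wording in the explanation below:

from sklearn.metrics import accuracy_score, f1_score

# Predict labels for the whole test set
y_pred = model.predict(X_test)

# Compare the predictions against the true labels
print("Accuracy:", accuracy_score(y_test, y_pred))
print("F1 Score:", f1_score(y_test, y_pred, average="weighted"))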

Output

Accuracy: 0.8484848484848485
F1 Score: 0.8491119695890328

Explanation:

  • This code imports the accuracy_score, confusion_matrix, ConfusionMatrixDisplay, and f1_score methods from the sklearn.metrics package.
  • These operations are used to assess how well a machine-learning model is working.
  • The code then generates predictions for the test data (X_test) using the model's predict() function.
  • The accuracy_score and f1_score functions are used to compare these predictions to the actual labels (y_test).
  • The accuracy_score function computes the accuracy of the model's predictions, and the f1_score function computes the F1 score, which is a weighted average of precision and recall; both values are then printed.
  • The output shows the accuracy and F1 score of the model trained on the given data.
  • The accuracy is 0.8484848484848485, which means the model predicted the correct outcome in about 84.8% of cases.

The F1 score, which measures the model's accuracy by accounting for both precision and recall, is 0.8491119695890328. A higher F1 value denotes a model that performs better.

We will use the confusion_matrix function to calculate the confusion matrix (true and false positives and negatives) and the ConfusionMatrixDisplay function to display it with labels.
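One way that could be written, assuming y_test and y_pred from the evaluation step (the label values 0, 1, and 2 follow the explanation below):

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

# Compute the confusion matrix for the three class labels
labels = [0, 1, 2]
cm = confusion_matrix(y_test, y_pred, labels=labels)

# Display the matrix with the label names
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=labels)
disp.plot()
plt.show()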

Explanation:

ConfusionMatrixDisplay is used in this code to generate a confusion matrix using the scikit-learn module.

  • A list of labels with the values 0, 1, and 2 is first constructed.
  • Following that, the confusion_matrix function is invoked with the labels list, the test labels (y_test), and the predicted labels (y_pred) as inputs.
  • As a result, a confusion matrix with the designated labels is produced.
  • After that, the confusion matrix and labels list are used as inputs to generate a ConfusionMatrixDisplay object.
  • The plot method is then used on the display object to display the confusion matrix graphically.

Our model has performed fairly well. We could improve its performance further with scaling, preprocessing, cross-validation, and hyperparameter tuning.

Loan Dataset with Naive Bayes Classifier

Let's train the Naive Bayes Classifier on a real dataset. We will repeat most of the previous steps, with the addition of data exploration and preparation.

Loading of data

In this example, we will load the loan data from the DataCamp Workspace using the pandas read_csv method.
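A minimal sketch of the loading step (the file name loan_data.csv comes from the explanation below; the exact path is an assumption and depends on your workspace):

import pandas as pd

# Load the loan dataset and look at the first five rows
df = pd.read_csv("loan_data.csv")
df.head()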

Explanation:

  • The "pd" alias is used in this code to import the panda's library.
  • Next, a CSV file named "loan_data.csv" is read using the pandas read_csv() function, and the resultant DataFrame is assigned to the variable df.
  • Finally, it uses the head() function to show the DataFrame's top five rows.

Exploration of Data

'.info()' will be used to learn more about the dataset.
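The call itself is a single line, assuming the df DataFrame from the loading step:

# Summarize column names, dtypes, non-null counts, and memory usage
df.info()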

  • There are 9578 rows and 14 columns in the dataset.
  • All columns are either floats or integers, except for "purpose."
  • "not.fully.paid" is our target column.

Explanation

This Python program invokes the info() function on a Pandas DataFrame object with the identifier df.

For a rapid grasp of the structure and contents of a DataFrame, the info() function delivers a summary of the DataFrame, including the number of rows and columns, the data types of each column, and the number of non-null values in each column.

Explanation:

The summary of a dataset with 9578 entries and 14 columns is displayed in this bit of code.

  • The first line displays the index's range, which is 0 to 9577.
  • The total number of non-null values and the data type for each column are displayed on the second line.
  • The column titles are displayed on the right side, and numbers identify the columns.
  • Each column's data type, which can be float64, int64, or object, is displayed in the Dtype column.
  • The dataset's memory use, which is 1.0+ MB, is also displayed.
  • The dataset and its structure are described in this summary, which might be helpful for data exploration and analysis.

In this case, we'll be building a model to identify the clients who have not fully repaid the loan. Let's examine the purpose and target columns using Seaborn's countplot.
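A possible version of that plot, assuming df from above (the 45-degree label rotation follows the explanation below):

import seaborn as sns
import matplotlib.pyplot as plt

# Count loans by purpose and split the bars by repayment status
sns.countplot(data=df, x="purpose", hue="not.fully.paid")
plt.xticks(rotation=45, ha="right")
plt.show()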

Explanation:

This code uses seaborn and matplotlib to plot the counts of the purpose column in the df dataframe, using the not.fully.paid column to separate the bars by color.

  • The figure is made using the sns.countplot() function from seaborn, with the data argument set to df to select the dataframe to use.
  • The hue parameter is set to 'not.fully.paid' to distinguish the bars by color depending on the values in that column, and the x parameter is set to 'purpose' to define the column to plot.
  • The x-axis labels are rotated by 45 degrees and aligned to the right for easier reading using the matplotlib plt.xticks() method.

The imbalance in our dataset will impact the model's performance. To gain practical experience working with unbalanced datasets, see the tutorial Resample an unbalanced Dataset.

Processing of Data

We will now convert the categorical 'purpose' column to integer dummy variables using the pandas get_dummies method.
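A sketch of that encoding step, assuming the df DataFrame from above:

import pandas as pd

# One-hot encode the categorical 'purpose' column, dropping the first level
# to avoid multicollinearity
pre_df = pd.get_dummies(df, columns=["purpose"], drop_first=True)
pre_df.head()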

Explanation:

This code uses the pandas package to generate dummy variables for the 'purpose' column of the DataFrame df.

  • The pd.get_dummies() method creates a new DataFrame with a binary column for each distinct value in the provided column(s).
  • In this instance, it makes new columns for each distinct value in the 'purpose' column.
  • The drop_first=True option drops the first dummy column to prevent multicollinearity in regression models.
  • The new DataFrame is assigned to the variable pre_df, and its first few rows are displayed using the .head() function.

The dataset will then be divided into training and testing sets once we specify the feature (X) and target (Y) variables.
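One way to do that, assuming pre_df from the encoding step (the test_size and random_state values follow the explanation below):

from sklearn.model_selection import train_test_split

# Features: every column except the target; target: 'not.fully.paid'
X = pre_df.drop(columns=["not.fully.paid"])
y = pre_df["not.fully.paid"]

# Hold out 33% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=125
)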

Explanation:

The train_test_split function from the sklearn.model_selection module is imported in this code.

  • Following that, the code constructs the variables X and y.
  • y is assigned the values in the not.fully.paid column of the pre_df dataframe, while X is assigned the values of all other columns in pre_df, excluding the not.fully.paid column.
  • This is followed by calling the train_test_split function with the parameters X, y, test_size=0.33, and random_state=125.
  • The data are divided into training and testing sets using this function.
  • The random_state argument defines the seed for the random number generator used in the splitting procedure, while the test_size parameter indicates the percentage of the data that should be utilized for testing.
  • The function returns four variables: X_train, X_test, y_train, and y_test.
  • The training and testing sets for the X and y data are contained in these variables.
  • Training sets are used to develop a machine learning model, while testing sets are used to assess the model's effectiveness.

Model Construction and Training

Building and training the model is quite easy. We will train a model with the default hyperparameters on the training dataset.
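A minimal sketch, assuming the loan training split from the previous step:

from sklearn.naive_bayes import GaussianNB

# Train a Gaussian Naive Bayes model with default hyperparameters
model = GaussianNB()
model.fit(X_train, y_train)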

Explanation:

The from sklearn.naive_bayes import GaussianNB line is used in this code to import the Gaussian Naive Bayes algorithm from the Scikit-Learn library.

  • The model = GaussianNB() expression is then used to generate a fresh instance of the GaussianNB model.
  • Finally, the fit() method is called on the model object with the training data, X_train and y_train, passed as parameters.
  • This uses the provided data to train the model.
  • Overall, using the training data supplied, this code creates and trains a Gaussian Naive Bayes model.

Model Assessment

We will measure model performance using accuracy and the F1 score; the Gaussian Naive Bayes algorithm has performed fairly well.
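A hedged sketch of that evaluation, assuming the trained loan model; since the target here is binary, f1_score's default averaging applies:

from sklearn.metrics import accuracy_score, f1_score

# Predict repayment status for the held-out loans
y_pred = model.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("F1 Score:", f1_score(y_test, y_pred))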

Explanation:

This code imports the accuracy_score, confusion_matrix, ConfusionMatrixDisplay, f1_score, and classification_report methods from the sklearn.metrics module.

  • These operations are used to assess how well a machine-learning model is working.
  • The code then uses the model's predict function to generate predictions for the test data X_test. These predictions are stored in the y_pred variable.
  • Next, the code determines the model's predictions' accuracy and F1 score using the accuracy_score and f1_score functions, respectively.
  • The accuracy_score function calculates the percentage of accurate predictions by comparing the genuine labels in y_test to the predicted labels in y_pred.

The code then uses the print function to report the accuracy and F1 score to the console. The f1_score function computes the F1 score, which is a weighted average of precision and recall.

Explanation:

This snippet is the output produced by running the code rather than the code itself.

  • It displays the F1 score and accuracy of a model that was developed using given data.
  • A greater accuracy and F1 score show that the model is working well on the data; the accuracy is a measure of how frequently the model correctly predicted the class of the data, while the F1 score is a measure of the model's precision and recall.

We can observe that the confusion matrix tells a different story because the data is imbalanced: more mislabeled samples belong to the minority target, "not fully paid."
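A possible version of that plot, assuming y_test and y_pred from the loan evaluation (the label names follow the explanation below):

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

# Confusion matrix for the binary repayment outcome
labels = ["Fully Paid", "Not fully Paid"]
cm = confusion_matrix(y_test, y_pred)

# Display the matrix with readable label names
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=labels)
disp.plot()
plt.show()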

Explanation:

This piece of code may be used to plot the confusion matrix for a binary classification issue.

  • First, the terms "Fully Paid" and "Not fully Paid" are used to establish a list of labels.
  • Next, using the confusion_matrix() function, which accepts the true labels (y_test) and predicted labels (y_pred) as inputs, the confusion matrix is computed.
  • The confusion matrix and the list of labels are then passed in to create a ConfusionMatrixDisplay object.
  • The confusion matrix plot is then shown by using the plot() function on the ConfusionMatrixDisplay object.

Check out the Naive Bayes Classification Tutorial with Scikit-learn Workspace if you need help with training or model assessment. It includes a dataset, the source code, and the results.

Problem with Zero Probability

If the dataset contains no tuple for a risky loan, the posterior probability will be zero and the model will be unable to make a prediction. This problem is known as Zero Probability because the occurrence of that particular class is zero.

Such a problem can be solved using the Laplacian correction, also known as Laplace smoothing. The Laplacian correction is one of the smoothing techniques. Here, you can assume that the dataset is large enough that adding one row of each type will not change the estimated probabilities, while fixing the zero-probability problem.

For instance, suppose the database contains 1,000 training tuples for the class loan = risky. In this database's income field there are 0 tuples with low income, 990 with medium income, and 10 with high income. Without the Laplacian correction, the probabilities of these events are 0, 0.990 (990/1000), and 0.010 (10/1000).

Now apply the Laplacian correction to this dataset by adding one more tuple for each income value. The probabilities of these events become 1/1003 = 0.001 for low income, 991/1003 = 0.988 for medium income, and 11/1003 = 0.011 for high income. The corrected probabilities are close to the uncorrected ones, but none of them is zero.
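As a quick check of that arithmetic, a small sketch in Python using the counts from the example above:

# 1000 original tuples plus one added tuple per income value
counts = {"low": 0 + 1, "medium": 990 + 1, "high": 10 + 1}
total = sum(counts.values())  # 1003

for income, count in counts.items():
    print(income, round(count / total, 3))
# prints: low 0.001, medium 0.988, high 0.011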

Advantages

  • Making predictions using this strategy is not only easy but also quick and precise.
  • Compared to continuous variables, naive Bayes performs better with discrete response variables; it can be used for multi-class prediction problems and has a very low computational cost.
  • It also performs well on text analytics problems.
  • A Naive Bayes classifier performs better than other models like logistic regression when the assumption of independence is true.

Disadvantages

  • It assumes that features are independent. In reality, it is quite unlikely that a model will receive a set of predictors that are completely independent.
  • When a training tuple for a certain class does not exist, the posterior probability is zero and the model cannot make a prediction. This is known as the Zero Probability/Frequency Problem.

Conclusion

You have completed this tutorial, so congratulations!

In this session, you learned about the Naive Bayes algorithm, how it works, the issue with its independence assumption, how it is implemented, and its advantages and disadvantages. Along the way, you have also learned how to build and evaluate models in scikit-learn for binary and multiclass problems.

Naive Bayes is the most fundamental and effective classification method. Even though machine learning has made great strides over the past few years, it has proven its usefulness. It has been successfully deployed in many applications, from text analytics to recommendation engines.