Python Code for the Naive Bayes Algorithm

Suppose you are a product manager and you want to classify customer reviews into positive and negative feedback. Or, as a loan manager, you want to identify which loan applicants are safe and which are risky. As a healthcare analyst, you want to predict which patients are likely to suffer from diabetes. All of these cases share the same problem: classifying reviews, loan applicants, and patients. Naive Bayes is the simplest and fastest classification algorithm and is well suited to handling large volumes of data. The Naive Bayes classifier is used successfully in numerous applications, including spam filtering, text classification, sentiment analysis, and recommender systems. It applies the Bayes theorem of probability to predict the class of an unknown sample.

Workflow for Classification

When performing classification, the first step is to understand the problem and identify the potential features and the label. Features are the characteristics or attributes that affect the outcome of the label. For example, in the case of loan distribution, bank managers consider the customer's occupation, income, age, location, previous loan history, transaction history, and credit score. These characteristics are known as features, and they help the model classify customers. The classification process has two phases: a learning phase and an evaluation phase. In the learning phase, the classifier trains its model on a given dataset; in the evaluation phase, the model's performance is tested. Performance is evaluated on the basis of various metrics such as accuracy, error, precision, and recall.

What Is the Naive Bayes Classifier?

Naive Bayes is a statistical classification technique based on the Bayes theorem and is one of the simplest supervised learning algorithms. The Naive Bayes classifier is a fast, accurate, and reliable method, and it performs quickly and accurately on large datasets. The Naive Bayes classifier assumes that the effect of a particular feature on a class is independent of the effects of the other features. For example, a loan applicant's desirability depends on factors such as income, previous loan and transaction history, age, and location. Even though these features are interdependent, they are still considered independently. This assumption simplifies computation, which is why it is regarded as naive. It is known as the assumption of class conditional independence.
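In symbols, for a class c and a feature vector x = (x1, x2, ..., xn), Bayes' theorem gives:

P(c | x) = P(x | c) P(c) / P(x)

and the class conditional independence assumption factorizes the likelihood as:

P(x | c) = P(x1 | c) P(x2 | c) ... P(xn | c)

The classifier then predicts the class c with the highest posterior probability P(c | x).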
How Does the Naive Bayes Classifier Operate?

Let's use an example to understand how Naive Bayes works. Consider the problem of playing sport depending on the weather: you need to calculate the probability of playing sport and classify whether or not players will play, given the weather conditions.

First Method (in the case of a single feature): The Naive Bayes classifier calculates the probability of an event in the following steps:

Step 1: Calculate the prior probability for the given class labels.
Step 2: Find the likelihood probability of each attribute for each class.
Step 3: Put these values into the Bayes formula and calculate the posterior probability.
Step 4: See which class has the higher probability; the input belongs to the class with the higher probability.

To simplify the calculation of prior and posterior probabilities, you can use two tables: a frequency table and a likelihood table. The frequency table records the frequency of each label for every feature. There are two likelihood tables: Likelihood Table 1 shows the prior probabilities of the labels, and Likelihood Table 2 shows the posterior probabilities.

Suppose you want to calculate the probability of playing when the weather is overcast.

Probability of playing:

P(Yes | Overcast) = P(Overcast | Yes) P(Yes) / P(Overcast) .....................(1)

Calculate the prior probabilities:

P(Overcast) = 4/14 = 0.29
P(Yes) = 9/14 = 0.64

Calculate the likelihood:

P(Overcast | Yes) = 4/9 = 0.44

Put the prior and likelihood into equation (1):

P(Yes | Overcast) = 0.44 * 0.64 / 0.29 = 0.98 (higher)

Similarly, you can calculate the probability of not playing:

Probability of not playing:

P(No | Overcast) = P(Overcast | No) P(No) / P(Overcast) ..........(2)

Calculate the prior probabilities:

P(Overcast) = 4/14 = 0.29
P(No) = 5/14 = 0.36

Calculate the likelihood:

P(Overcast | No) = 0/5 = 0

Put the prior and likelihood into equation (2):

P(No | Overcast) = 0 * 0.36 / 0.29 = 0

The probability of the "Yes" class is higher, so you can conclude that players will play the sport when the weather is overcast.

Second Method (when there are multiple features): Suppose you want to calculate the probability of playing when the weather is overcast and the temperature is mild.

Probability of playing:

P(Play=Yes | Weather=Overcast, Temp=Mild) = P(Weather=Overcast, Temp=Mild | Play=Yes) P(Play=Yes) ..........(1)
P(Weather=Overcast, Temp=Mild | Play=Yes) = P(Overcast | Yes) P(Mild | Yes) ..........(2)
Similarly, you can calculate the probability of not playing:

Probability of not playing:

P(Play=No | Weather=Overcast, Temp=Mild) = P(Weather=Overcast, Temp=Mild | Play=No) P(Play=No) ..........(3)
P(Weather=Overcast, Temp=Mild | Play=No) = P(Overcast | No) P(Mild | No) ..........(4)

The probability of the "Yes" class is higher, so you can conclude that players will play the sport when the weather is overcast and the temperature is mild.
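To make the single-feature calculation concrete, here is a minimal sketch in plain Python. The 14-day dataset below is hypothetical, constructed only to match the counts used above:

```python
# Hypothetical 14-day dataset matching the counts above:
# 9 "Yes" days, 5 "No" days, and 4 Overcast days that are all "Yes".
weather = ["Sunny"] * 5 + ["Overcast"] * 4 + ["Rainy"] * 5
play = ["No", "No", "Yes", "Yes", "Yes",   # Sunny days
        "Yes", "Yes", "Yes", "Yes",        # Overcast days
        "Yes", "Yes", "No", "No", "No"]    # Rainy days

n = len(play)
p_overcast = weather.count("Overcast") / n   # P(Overcast) = 4/14
p_yes = play.count("Yes") / n                # P(Yes) = 9/14
p_overcast_given_yes = sum(
    w == "Overcast" and p == "Yes" for w, p in zip(weather, play)
) / play.count("Yes")                        # P(Overcast | Yes) = 4/9

# Bayes' theorem: P(Yes | Overcast) = P(Overcast | Yes) * P(Yes) / P(Overcast)
p_yes_given_overcast = p_overcast_given_yes * p_yes / p_overcast
print(p_yes_given_overcast)  # exactly 1.0 with these counts; 0.98 in the text is due to rounding
```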
Naive Bayes Classifier Building in Scikit-learn with a Synthetic Dataset

In the first example, we will generate synthetic data with scikit-learn, then train and evaluate the Gaussian Naive Bayes algorithm.

Creating the Dataset

Scikit-learn provides a machine learning environment for creating datasets and testing various machine learning algorithms. In this instance, we use the 'make_classification' function to create a dataset with 800 samples, six features, and three classes.

Explanation: This code imports the make_classification function from the sklearn.datasets module and uses it to generate the dataset.
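A sketch of what this step might look like; the n_informative, n_clusters_per_class, and random_state values are assumptions:

```python
from sklearn.datasets import make_classification

# Generate a synthetic dataset: 800 samples, six features, three classes
X, y = make_classification(
    n_samples=800,
    n_features=6,
    n_classes=3,
    n_informative=3,         # assumed value
    n_clusters_per_class=1,  # assumed value
    random_state=1,          # assumed seed, for reproducibility
)
```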
To display the dataset, we will make use of the scatter function in matplotlib.pyplot.

Explanation: This code imports the matplotlib.pyplot module and uses the scatter() method to generate a scatter plot of the dataset.
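A sketch of the plotting code; which two features are plotted, and the marker style, are assumptions:

```python
import matplotlib.pyplot as plt

# Scatter plot of the first two features, colored by class label
plt.scatter(X[:, 0], X[:, 1], c=y, marker="*")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.show()
```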
As we can see, there are three types of target labels, so we will be developing a multiclass classification model.

Train Test Split

Before we begin the training procedure, we must divide the dataset into training and testing sets for model assessment.

Explanation: This code splits the feature matrix X and target vector y into training and testing subsets using scikit-learn's train_test_split function.
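A sketch of this step; the split ratio and random seed are assumptions:

```python
from sklearn.model_selection import train_test_split

# Hold out a test set for model evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=125
)
```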
Model Construction and Training

Construct a generic Gaussian Naive Bayes model and train it on the training data. Then feed the model a random test sample to get a predicted value.

Explanation: This code uses the scikit-learn package to create and train a Gaussian Naive Bayes classifier.
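A sketch of the training and single-sample prediction step; the sample index is an arbitrary assumption:

```python
from sklearn.naive_bayes import GaussianNB

# Build and train a Gaussian Naive Bayes model
model = GaussianNB()
model.fit(X_train, y_train)

# Predict the class of one test sample; index 6 is an arbitrary choice
predicted = model.predict([X_test[6]])
print("Actual Value:", y_test[6])
print("Predicted Value:", predicted[0])
```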
Output:
Actual Value: 0
Predicted Value: 0

Model Assessment

Now we will assess the model on the test dataset, which it has not seen during training. First, we will predict the values for the test dataset and use those predictions to calculate the accuracy and F1 score.
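A sketch of the evaluation code; the weighted F1 average is an assumption for the three-class labels:

```python
from sklearn.metrics import accuracy_score, f1_score

# Predict for the whole test set, then score the predictions
y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred, average="weighted")  # weighted average assumed

print("Accuracy:", accuracy)
print("F1 Score:", f1)
```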
Output:
Accuracy: 0.8484848484848485
F1 Score: 0.8491119695890328

Explanation: The model achieves an accuracy of about 0.848, meaning it classified roughly 84.8% of the test samples correctly. The F1 score, which measures the model's performance by accounting for both precision and recall, is about 0.849; a higher F1 value denotes a better-performing model.

To calculate the true positives and true negatives for the confusion matrix, we will use the 'confusion_matrix' function, and to display the confusion matrix with labels, we will use the 'ConfusionMatrixDisplay' function.

Explanation: This code uses ConfusionMatrixDisplay from the scikit-learn module to generate and display the confusion matrix.
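A sketch of the confusion matrix step:

```python
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

# Compute the confusion matrix and display it with class labels
cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
plt.show()
```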
Our model has performed fairly well. We may enhance model performance further with scaling, preprocessing, cross-validation, and hyperparameter tuning.

Naive Bayes Classifier with a Loan Dataset

Let's train the Naive Bayes classifier on a real dataset. We will repeat most of the steps, except for data exploration and preparation.

Loading the Data

In this example, we will load the Loan Data from the DataCamp Workspace using pandas' 'read_csv' function.

Explanation: This code reads the loan dataset into a pandas DataFrame.
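A sketch of the loading step; the file name is an assumption, so point it at your copy of the loan dataset:

```python
import pandas as pd

# Load the loan data into a DataFrame (file name assumed)
df = pd.read_csv("loan_data.csv")
df.head()
```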
Exploration of Data

We will use '.info()' to learn more about the dataset.
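```python
# Summarize the DataFrame: row count, column dtypes, and non-null counts
df.info()
```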
Explanation: This code invokes the info() method on a pandas DataFrame named df. The info() method delivers a summary of the DataFrame, including the number of rows and columns, the data type of each column, and the number of non-null values in each column, giving a quick grasp of the DataFrame's structure and contents. Here, the summary shows a dataset with 9578 entries and 14 columns.
In this case, we will be building a model to identify the customers who have not fully paid back their loan. Let's explore the purpose and target columns using Seaborn's countplot.

Explanation: This code uses the Python libraries seaborn and matplotlib to count the values of the purpose column in the df DataFrame, using the not.fully.paid column to separate the bars by color.
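A sketch of the plotting code; the tick rotation is a cosmetic assumption:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Count loans per purpose, with bars split by repayment status
sns.countplot(data=df, x="purpose", hue="not.fully.paid")
plt.xticks(rotation=45, ha="right")
plt.show()
```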
Our dataset is imbalanced, which will impact the model's performance. To gain practical experience working with imbalanced datasets, see the tutorial Resample an Imbalanced Dataset.

Data Processing

We will now use pandas' 'get_dummies' function to convert the categorical 'purpose' column into integer dummy variables.

Explanation: This code uses the pandas package to generate dummy variables for the 'purpose' column of the DataFrame df.
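A sketch of the encoding step; using drop_first to avoid a redundant dummy column is an assumed choice:

```python
# One-hot encode the categorical 'purpose' column
pre_df = pd.get_dummies(df, columns=["purpose"], drop_first=True)
```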
We will then define the feature (X) and target (y) variables and divide the dataset into training and testing sets.

Explanation: This code imports the train_test_split function from the sklearn.model_selection module and uses it to split the data.
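A sketch of this step; pre_df is the dummy-encoded DataFrame from the previous step, and the split ratio and seed are assumptions:

```python
from sklearn.model_selection import train_test_split

# Separate features from the binary target
X = pre_df.drop("not.fully.paid", axis=1)
y = pre_df["not.fully.paid"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=125
)
```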
Model Construction and Training

Building and training the model is quite easy. We will train a model with default hyperparameters on the training dataset.

Explanation: This code imports the Gaussian Naive Bayes algorithm from the scikit-learn library with the line from sklearn.naive_bayes import GaussianNB.
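A sketch of the training step:

```python
from sklearn.naive_bayes import GaussianNB

# Train a Gaussian Naive Bayes model with default hyperparameters
model = GaussianNB()
model.fit(X_train, y_train)
```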
Model Assessment

We will measure model performance using accuracy and the F1 score; the Gaussian Naive Bayes algorithm has performed fairly well.

Explanation: This code imports the accuracy_score, confusion_matrix, ConfusionMatrixDisplay, f1_score, and classification_report functions from the sklearn.metrics module.
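A sketch of the evaluation code; the weighted F1 average is an assumption:

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    ConfusionMatrixDisplay,
    f1_score,
    classification_report,  # imported per the text, unused in this minimal sketch
)

y_pred = model.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("F1 Score:", f1_score(y_test, y_pred, average="weighted"))
```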
The code then uses the print function to report the accuracy and F1 score to the console. The f1_score function computes the F1 score, which is a weighted average of precision and recall; the reported numbers are the printed output of running this code, and they confirm that the model performs reasonably well on the loan dataset.
Because the data are imbalanced, the confusion matrix tells a different story: more samples are mislabeled on the minority target, "not fully paid."

Explanation: This code plots the confusion matrix for a binary classification problem.
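A sketch of the plotting step; using model.classes_ for the display labels is an assumption:

```python
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

# Plot the confusion matrix for the binary loan-repayment labels
cm = confusion_matrix(y_test, y_pred, labels=model.classes_)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=model.classes_)
disp.plot()
plt.show()
```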
If you need help with training or model assessment, check out the Naive Bayes Classification Tutorial with Scikit-learn Workspace; it includes the dataset, the source code, and the results.

Problem with Zero Probability

If the dataset contains no tuple for a risky loan, the corresponding posterior probability will be zero, and the model will be unable to make a prediction. This issue is referred to as the zero probability problem because the specific class has no occurrences. Such a problem can be solved with the Laplacian correction, also known as Laplace smoothing. Laplacian correction is one technique for smoothing probability estimates: assuming the dataset is large enough, adding one row of each type will not noticeably change the estimated probabilities, yet it fixes the problem of probability values being zero.

For example, suppose the database contains 1000 training tuples for the class loan risk. In this database, the income column has 0 tuples for low income, 990 tuples for medium income, and 10 tuples for high income. Without the Laplacian correction, the probabilities of these events are 0, 0.990 (from 990/1000), and 0.010 (from 10/1000). Now apply the Laplacian correction to the dataset by adding one more tuple for each income-value pair. The probabilities of these events become 1/1003 = 0.001, 991/1003 = 0.988, and 11/1003 = 0.011; the corrected estimates are close to the originals, but none of them is zero.
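A minimal sketch of the add-one correction for the income example above, in plain Python:

```python
# Add-one (Laplace) smoothing for the income example above
counts = {"low": 0, "medium": 990, "high": 10}

total = sum(counts.values())  # 1000 tuples
k = len(counts)               # 3 distinct income values

for value, count in counts.items():
    raw = count / total                    # 0.0 for "low": the zero probability problem
    smoothed = (count + 1) / (total + k)   # never zero after adding one tuple per value
    print(f"{value}: raw={raw:.3f}, smoothed={smoothed:.3f}")
```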
Advantages

- It is a simple, fast, and accurate method for prediction.
- It has a very low computational cost and works efficiently on large datasets.
- It performs well in multiclass prediction and is widely used for text classification problems such as spam filtering and sentiment analysis.
- When the assumption of class conditional independence holds, it can perform well even with relatively little training data.

Disadvantages

- The assumption of independent features rarely holds in practice, since real-world features usually depend on one another.
- If a categorical value appears in the test data but was never observed during training, the model assigns it zero probability and cannot make a prediction without a smoothing technique such as the Laplacian correction described above.
- Its probability estimates are known to be poorly calibrated, so the predicted class probabilities should not be taken too literally even when the class predictions themselves are good.
Conclusion

Congratulations, you have completed this tutorial! In this tutorial, you learned about the Naive Bayes algorithm, how it works, concerns with its independence assumption, how it is implemented, and its advantages and disadvantages. Along the way, you have also learned about model building and evaluation in scikit-learn for binary and multiclass problems. Naive Bayes is one of the most basic yet effective algorithms; even though machine learning has advanced enormously in the past several years, it has proven its worth and has been applied successfully in a variety of applications, including text analytics and recommendation engines.