## Maximum Likelihood Estimation

Density estimation is the process of estimating the probability distribution of a sample of observations from a problem domain. It can be approached in several ways, but maximum likelihood estimation (MLE) is a common framework in machine learning. The goal of maximum likelihood is to find the distribution parameters that best fit the data.

## Likelihood

Probability tells us how likely an outcome is given a fixed distribution, whereas likelihood tells us how well a given distribution explains data that has already been observed. For example, if we wanted to know the probability of a mouse weighing more than 34 grams, we would change the left side of the figure below. The right side, which specifies the shape and location of the distribution, remains the same.

Likelihood works the other way around. Given that we weighed a 34-gram mouse, the likelihood of a distribution with mean = 32 and standard deviation = 2.5 is high. Here the measurement on the right side stays fixed, while we can change the shape and location of the distribution on the left side.

## Concept of MLE

The primary goal of maximum likelihood classification is to predict the class label y that maximizes the likelihood of the observed data x. We treat x as a random vector and y as a non-random parameter that influences the distribution of x. First, we must make an assumption about the distribution of x (often Gaussian). Then, learning from the data consists of the following steps:

- Split the dataset into subsets, one for each label y.
- Use only the data inside each subset to estimate the parameters of the assumed distribution of x.

When making a prediction for a new data vector x:

- Evaluate the probability density function (PDF) of the assumed distribution at x, using the estimated parameters for each label y.
- Return the label y whose PDF gives the highest value.

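
The training and prediction steps above can be sketched for a one-dimensional Gaussian as follows. This is a minimal illustration, not a definitive implementation: the function names (`gaussian_pdf`, `fit_per_class`, `predict`) and the toy data are assumptions made for this example.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Density of a univariate Gaussian N(mu, sigma^2) evaluated at x."""
    z = (x - mu) / sigma
    return np.exp(-0.5 * z ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def fit_per_class(x, y):
    """Training: split the data by label and estimate (mean, std) per subset."""
    params = {}
    for label in np.unique(y):
        subset = x[y == label]
        # The sample mean and the std with ddof=0 are the maximum
        # likelihood estimates for a Gaussian.
        params[int(label)] = (float(subset.mean()), float(subset.std()))
    return params

def predict(x_new, params):
    """Prediction: return the label whose PDF is largest at x_new."""
    return max(params, key=lambda label: gaussian_pdf(x_new, *params[label]))

# Hypothetical toy data: four observations per class.
x_train = np.array([0.2, 1.1, 1.9, 0.8, -3.5, -2.0, -0.5, -2.0])
y_train = np.array([0, 0, 0, 0, 1, 1, 1, 1])

params = fit_per_class(x_train, y_train)
print(params)
print(predict(1.0, params), predict(-2.5, params))
```

Storing one `(mean, std)` pair per label keeps the training step independent for each class, which mirrors the two bullet points of the learning procedure.
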

Let's begin with a basic example with a one-dimensional input x and two classes: y = 0 and y = 1. Assume that we estimated the parameters for both the y = 0 and y = 1 cases, resulting in the two PDFs shown above. The blue plot (y = 0) has mean μ = 1 and standard deviation σ = 1, while the orange plot (y = 1) has μ = -2 and σ = 1.5. To predict the label y for a new data point x = -1, we evaluate both PDFs: $f_{y=0}(-1) \approx 0.05$ and $f_{y=1}(-1) \approx 0.21$. The higher value, 0.21, is obtained for y = 1, so we predict the label y = 1.

That was a basic example, but in real-world settings we will have more input features to use when making predictions. So, we require a multivariate Gaussian distribution with the following PDF:

$$f(x) = \frac{1}{\sqrt{(2\pi)^d \, |\Sigma|}} \exp\left(-\frac{1}{2}(x - \mu)^\top \Sigma^{-1} (x - \mu)\right)$$

Where:

- x = a column vector with data from one observation
- d = dimension of x (x is a d×1 vector)
- μ = mean of x (also d×1)
- Σ = covariance matrix of x (d×d)

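
To make the multivariate Gaussian PDF concrete, here is a small NumPy sketch that evaluates it for a single observation. The function name `multivariate_gaussian_pdf` is an assumption for illustration. As a sanity check, with d = 1 it is applied to the one-dimensional example (μ = 1, σ = 1 and μ = -2, σ = 1.5, both evaluated at x = -1).

```python
import numpy as np

def multivariate_gaussian_pdf(x, mu, sigma):
    """Evaluate the multivariate Gaussian PDF at one observation x.

    x and mu are length-d vectors, sigma is the d x d covariance matrix.
    """
    x = np.asarray(x, dtype=float).reshape(-1)
    mu = np.asarray(mu, dtype=float).reshape(-1)
    sigma = np.atleast_2d(np.asarray(sigma, dtype=float))
    d = x.shape[0]
    diff = x - mu
    # Solve sigma @ v = diff rather than forming the inverse explicitly.
    mahalanobis_sq = diff @ np.linalg.solve(sigma, diff)
    norm_const = np.sqrt((2.0 * np.pi) ** d * np.linalg.det(sigma))
    return float(np.exp(-0.5 * mahalanobis_sq) / norm_const)

# d = 1: the covariance matrix is just the variance sigma^2.
f0 = multivariate_gaussian_pdf([-1.0], [1.0], [[1.0 ** 2]])
f1 = multivariate_gaussian_pdf([-1.0], [-2.0], [[1.5 ** 2]])
print(round(f0, 2), round(f1, 2))
```

The two values round to 0.05 and 0.21, in line with the one-dimensional example, so the larger density again belongs to y = 1.
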
To use this approach, the covariance matrix Σ must be positive definite, that is, symmetric with all eigenvalues positive. Σ comprises the covariances between all pairs of components of x: Σij = cov(xi, xj). It is therefore always symmetric, since cov(xi, xj) = cov(xj, xi), and all we have to verify is that all eigenvalues are positive; otherwise, we display a warning. If there are more observations than variables and the variables are not highly correlated with each other, this condition should be satisfied and Σ should be positive definite.

Now, we will implement it.
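
As one piece of such an implementation, the eigenvalue check on Σ might look like the sketch below. The function name `check_covariance`, the tolerance `tol`, and the example matrices are assumptions chosen for illustration.

```python
import warnings
import numpy as np

def check_covariance(sigma, tol=1e-10):
    """Return True if sigma is symmetric with all eigenvalues above tol.

    Emits a warning instead of raising when the check fails.
    """
    sigma = np.asarray(sigma, dtype=float)
    symmetric = np.allclose(sigma, sigma.T)
    # eigvalsh assumes a symmetric matrix and returns real eigenvalues
    # in ascending order, so the first one is the smallest.
    ok = bool(symmetric and np.linalg.eigvalsh(sigma)[0] > tol)
    if not ok:
        warnings.warn("Covariance matrix is not positive definite.")
    return ok

# Weakly correlated variables: positive definite.
good = np.array([[2.0, 0.3],
                 [0.3, 1.0]])

# Perfectly correlated variables: one eigenvalue is zero.
bad = np.array([[1.0, 1.0],
                [1.0, 1.0]])

print(check_covariance(good), check_covariance(bad))
```

The second matrix illustrates the highly correlated case mentioned above: its eigenvalues are 0 and 2, so the density cannot be evaluated because Σ has no inverse.
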
## Importing Libraries

## Reading the Dataset

The data concerns social network adverts and includes the gender, age, and estimated salary of the network's users. Gender is a categorical column that has to be label-encoded before the data is fed to the learner.

The encoded values are stored in a new feature called gender, leaving the original column unmodified. To train and validate the learner, we divide the data into training and testing sets.
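
A minimal sketch of the encoding and splitting steps, using pandas and scikit-learn, is shown below. The rows are made-up placeholders standing in for the real dataset, and the column names `Gender`, `Age`, `EstimatedSalary`, and `Purchased`, as well as the 25% test size, are assumptions; only the new `gender` feature is named in the text.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Hypothetical rows; the real data would be read from the dataset file.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male",
               "Female", "Male", "Female", "Male"],
    "Age": [19, 35, 26, 27, 47, 45, 32, 58],
    "EstimatedSalary": [19000, 20000, 43000, 57000,
                        25000, 26000, 150000, 144000],
    "Purchased": [0, 0, 0, 0, 1, 1, 1, 1],  # assumed target column
})

# Encode the categorical column into a new feature, keeping the original.
encoder = LabelEncoder()
df["gender"] = encoder.fit_transform(df["Gender"])

X = df[["gender", "Age", "EstimatedSalary"]]
y = df["Purchased"]

# Hold out a quarter of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
print(df[["Gender", "gender"]].head())
print(len(X_train), len(X_test))
```

`LabelEncoder` assigns integer codes in sorted order of the class names, so here `Female` becomes 0 and `Male` becomes 1.
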
The learner line in the above figure, which shows the relationship between the feature age and the prediction, was fitted using the maximum likelihood estimation principle, which is how the logistic regression model learns to classify the outcomes. Behind the scenes, for a candidate line the algorithm assigns each data point a probability of observing "1" as a function of age, and uses one minus that value as the probability of observing "0". It does this for all data points and multiplies the individual likelihoods together to obtain the likelihood of the data under that line. The procedure is repeated for other candidate lines until the one with the highest likelihood, the best-fit line, is identified.
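
The likelihood computation described above can be sketched with NumPy. This is an illustration under stated assumptions rather than how any particular library fits logistic regression: the toy `age` and `y` arrays are hypothetical, and a coarse grid search over slope `w` and intercept `b` stands in for the iterative optimizer a real solver would use.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def likelihood(w, b, age, y):
    """Product of per-point probabilities of the observed labels."""
    p = sigmoid(w * age + b)                  # P(y = 1 | age)
    per_point = np.where(y == 1, p, 1.0 - p)  # use 1 - p when y = 0
    return float(np.prod(per_point))

def log_likelihood(w, b, age, y):
    """Sum of logs; same maximizer as the product, but numerically safer."""
    p = sigmoid(w * age + b)
    return float(np.sum(y * np.log(p) + (1 - y) * np.log(1.0 - p)))

# Hypothetical toy data: older users are more often labeled 1.
age = np.array([20, 25, 30, 35, 40, 45, 50, 55], dtype=float)
y = np.array([0, 0, 0, 1, 0, 1, 1, 1])

# Try many candidate lines and keep the one with the highest likelihood.
candidates = [(w, b)
              for w in np.linspace(0.0, 0.5, 26)
              for b in np.linspace(-20.0, 0.0, 41)]
best_w, best_b = max(
    candidates, key=lambda wb: log_likelihood(wb[0], wb[1], age, y)
)
print(best_w, best_b, likelihood(best_w, best_b, age, y))
```

In practice the log-likelihood is maximized instead of the raw product, because multiplying many probabilities below 1 quickly underflows; both have the same best-fit line.
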