Density Estimation

Density estimation lies on the boundary between data modeling, feature engineering, and unsupervised learning. Some of the most popular and useful techniques for estimating density are neighbor-based approaches such as the kernel density estimate (KernelDensity) and mixture models such as Gaussian Mixtures (GaussianMixture). Because Gaussian Mixtures can also serve as an unsupervised clustering scheme, they are covered in greater detail in the context of clustering.

Density estimation is a fairly simple concept, and most people are already familiar with one common density estimation technique: the histogram.

Density Estimation: Histograms

A histogram is a straightforward data visualization in which bins are defined and the number of data points falling within each bin is counted. A histogram is depicted in the upper-left panel of the figure discussed below.

Simple 1D Kernel Density Estimation

The first plot demonstrates one of the issues with using histograms to visualize the density of points in 1D. Intuitively, a histogram can be thought of as a procedure in which a unit "block" is stacked above each point on a regular grid. However, as the upper two panels show, a different choice of gridding for these blocks can lead to drastically different interpretations of the underlying density structure. Alternatively, if we center each block on the point it represents, we obtain the estimate displayed in the lower-left panel. This is a kernel density estimate with a "top hat" kernel.

This concept generalizes to other kernel shapes: the bottom-right panel of the first figure shows a Gaussian kernel density estimate over the same distribution. Scikit-learn provides efficient kernel density estimation using either a Ball Tree or KD Tree structure through the KernelDensity estimator. The second image illustrates the available kernels, and the third figure compares kernel density estimates for a distribution of one hundred samples in one dimension. Although this example uses 1D distributions, kernel density estimation extends readily to higher dimensions. A minimal sketch of the basic API follows the figures below.

[Figure: the available kernel shapes]
[Figure: kernel density estimates for 100 samples in one dimension]
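
As a minimal sketch of basic KernelDensity usage, assuming an illustrative 1D sample and bandwidth:

import numpy as np
from sklearn.neighbors import KernelDensity

# Illustrative 1D sample drawn from a bimodal distribution.
rng = np.random.RandomState(0)
X = np.concatenate([rng.normal(-2, 0.5, 50),
                    rng.normal(3, 1.0, 50)])[:, np.newaxis]

# Fit a Gaussian-kernel KDE; `bandwidth` controls the smoothing scale.
kde = KernelDensity(kernel="gaussian", bandwidth=0.5).fit(X)

# score_samples returns the log-density, so exponentiate to get densities.
grid = np.linspace(-5, 6, 200)[:, np.newaxis]
density = np.exp(kde.score_samples(grid))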

One of the main issues with histograms, however, is that the resulting visualization can vary greatly depending on the choice of binning. Consider the upper-right panel of the figure above: it shows a histogram over the same data with the bins shifted to the right. The two visualizations look completely different, which could lead to different interpretations of the data. This sensitivity is easy to reproduce, as the sketch below shows.
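
A small illustrative sketch: binning the same data with edges shifted by half a bin width can produce noticeably different count profiles (the data and edges here are made up).

import numpy as np

# Twenty illustrative points from a standard normal distribution.
rng = np.random.RandomState(42)
data = rng.normal(0, 1, 20)

# Same bin width, but the second set of edges is shifted by half a bin.
edges = np.arange(-4, 4.5, 1.0)
counts_a, _ = np.histogram(data, bins=edges)
counts_b, _ = np.histogram(data, bins=edges + 0.5)

# The two count vectors can suggest rather different shapes for the same data.
print(counts_a)
print(counts_b)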

A histogram can also be thought of intuitively as a stack of blocks, one block per point. Stacking the blocks on a regular grid recovers the histogram. But what if, instead of stacking the blocks on a regular grid, we center each block on the point it represents and sum the total height at each location? This idea produces the lower-left visualization. It is perhaps not as tidy as a histogram, but the fact that the data determine the block locations makes it a far better representation of the underlying data.
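
In KernelDensity terms, this "blocks centered on the points" picture corresponds to a top-hat kernel. A brief sketch, with illustrative points and bandwidth:

import numpy as np
from sklearn.neighbors import KernelDensity

# Five illustrative 1D points; each contributes one "block" centered on itself.
X = np.array([-2.1, -1.3, -0.4, 1.9, 5.1])[:, np.newaxis]

# With a top-hat kernel the estimate is a sum of flat blocks of width
# 2 * bandwidth centered on the data points.
kde = KernelDensity(kernel="tophat", bandwidth=0.75).fit(X)
grid = np.linspace(-4, 7, 200)[:, np.newaxis]
density = np.exp(kde.score_samples(grid))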

Kernel Density Estimation

Kernel density estimation in scikit-learn is implemented by the KernelDensity estimator, which uses a Ball Tree or KD Tree for efficient queries (see Nearest Neighbors for a discussion of these structures). Although the procedure can be carried out in any number of dimensions, in high dimensions its performance degrades as a consequence of the curse of dimensionality.
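
The backing tree can be selected through the estimator's algorithm parameter; a brief sketch with illustrative data:

import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.RandomState(0)
X = rng.normal(size=(1000, 3))  # three-dimensional data

# 'auto' chooses between a KD Tree and a Ball Tree; either can be forced.
kde = KernelDensity(kernel="gaussian", bandwidth=0.75,
                    algorithm="kd_tree").fit(X)
log_density = kde.score_samples(X[:5])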

The following figure compares the kernel density estimates for three different kernel choices, each fit to 100 points drawn from a bimodal distribution:

[Figure: kernel density estimates for three kernel choices on the bimodal sample]
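
A sketch in the spirit of this comparison, looping over three of the kernel names KernelDensity accepts (the sample and bandwidth are illustrative):

import numpy as np
from sklearn.neighbors import KernelDensity

# 100 illustrative points from a bimodal distribution.
rng = np.random.RandomState(1)
X = np.concatenate([rng.normal(0, 1, 60),
                    rng.normal(5, 1, 40)])[:, np.newaxis]
grid = np.linspace(-4, 9, 300)[:, np.newaxis]

# Fit and evaluate a KDE for each kernel shape.
for kernel in ["gaussian", "tophat", "epanechnikov"]:
    kde = KernelDensity(kernel=kernel, bandwidth=0.75).fit(X)
    density = np.exp(kde.score_samples(grid))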

It is evident how the kernel shape influences the smoothness of the resulting distribution. The following illustrates one use of the scikit-learn kernel density estimator:

Estimating the Kernel Density of Species Distributions

This example uses a Ball Tree built with the Haversine distance metric (i.e., distances over points given in latitude and longitude) to perform a neighbors-based query, specifically a kernel density estimate, on geographical data. The dataset comes from Phillips et al. (2006). If a basemap is available, the example plots the coastlines and national boundaries of South America.

There is no learning over the data in this example; for an example of classification based on the attributes in this dataset, see Species Distribution Modeling. The example simply shows the kernel density estimate of the observed data points in geographic coordinates. A minimal sketch of this kind of query follows.
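
A minimal sketch of a haversine-metric KDE on latitude/longitude data (the coordinates below are made-up illustrative points, not the Phillips et al. dataset):

import numpy as np
from sklearn.neighbors import KernelDensity

# Hypothetical observation points given as (latitude, longitude) in degrees.
latlon_deg = np.array([[-5.0, -60.0],
                       [-10.0, -55.0],
                       [-12.0, -58.0]])

# The haversine metric expects coordinates in radians.
Xtrain = np.radians(latlon_deg)

# A Ball Tree is required here: KD Trees do not support the haversine metric.
kde = KernelDensity(bandwidth=0.04, metric="haversine",
                    kernel="gaussian", algorithm="ball_tree").fit(Xtrain)

# Log-density at a query point, also given in radians.
query = np.radians([[-8.0, -57.0]])
log_dens = kde.score_samples(query)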

Output:

 - computing KDE in spherical coordinates
 - plot coastlines from coverage
 - computing KDE in spherical coordinates
 - plot coastlines from coverage

[Figure: kernel density estimate of the species distributions]

Kernel density estimation can also be used to learn a generative model of a dataset and to draw new samples from it. Applied to the handwritten digits data, the underlying model of the data is reflected in these additional samples.

Output:

best bandwidth: 3.79269019073225

The "new" data consists of linear combinations of the input data, with weights probabilistically drawn given the KDE model.

[Figure: new samples drawn from the kernel density model of the digits data]

Examples:

  • Simple 1D Kernel Density Estimation: computation of simple kernel density estimates in one dimension.
  • Kernel Density Estimation: an example of using kernel density estimation to learn a generative model of the handwritten digits data and to draw new samples from this model.
  • Kernel Density Estimate of Species Distributions: an example of a kernel density estimate using the Haversine distance metric to visualize geographical data.

Conclusion:

In conclusion, both histograms and kernel density estimation (KDE) are useful methods for density estimation in statistics. Histograms are straightforward, easy-to-understand illustrations of a dataset's distribution: they divide it into discrete bins and give a clear visual depiction of the data's frequencies. However, the interpretation can be affected by the choice of bin size, and histograms can miss subtle patterns in the data.

Kernel density estimation, on the other hand, provides a smoother, more continuous estimate of the probability density function. It can bring out finer details in the data and does not rely on pre-established bins. Although it is less dependent on binning, the choice of bandwidth and kernel is important.

To summarize, kernel density estimation delivers a smoother, continuous estimate that can capture underlying patterns in the data more reliably, while histograms offer a simple, easy-to-understand way of visualizing data distributions. The nature of the data and the precise goals of the analysis are the major factors in choosing between these approaches.

