Kernel density estimation (KDE) Plot

Previously, we’ve seen how to use the histogram method to infer the probability density function (PDF) of a random variable (population) using a finite data sample. In this tutorial, we’ll carry on the problem of probability density function inference, but using another method: Kernel density estimation.

Kernel density estimates (KDE) are closely related to histograms, but can be endowed with properties such as smoothness or continuity by using a suitable kernel.

Why do we care?

One of the main problems in practical applications is that the needed probability distribution is usually not readily available, but rather it must be derived from other existing information (e.g. sample data). KDEs are similar to histograms in terms of being non-parametric method, so there are no restrictive assumptions about the shape of the density function, but KDE is far more superior to histograms as far as accuracy and continuity.

Overview

Let’s consider a finite data sample $\{x_1,x_2, \cdots ,x_N\}$ observed from a stochastic (i.e. continuous and random) process. We wish to infer the population probability density function.

In the histogram method, we select the left bound of the histogram (x_o ), the bin’s width (h ), and then compute the bin k probability estimator f_h(k):

  • Bin k represents the following interval $[x_o+(k-1)h, x_o+k\times h)$
  • $\hat f_h(k)$ is defined as follow: $$\hat f_h(k) =\frac{\sum_{i=1}^N I\{(k-1)h\leq x_i-x_o\leq k\times h)\}}{N}$$
  • $I\{.\}$is an event function that returns 1 (one) if the condition is true, 0 (zero) otherwise.

The choice of bins, especially the bin width (h ), has a substantial effect on the shape and other properties of $f_h(k)$.

Finally, we can think of the histogram method as follows:

  • Each observation (event) is statistically independent of all others, and its occurrence probability is equal to $\frac{1}{N}$.
  • $f_h(k)$ is simply the integral (sum) of the event probabilities in each bin.

What is a kernel

A kernel is a non-negative, real-valued, integrable function $K(.)$ satisfying the following two requirements:

$$\int_{-\infty}^{\infty}K(u)du=1$$ $$K(u)=k(-u)$$

And, as a result, the scaled function $K^*(u)$, where $K^*(u)=\lambda K(\lambda \times u)$, is a kernel as well.

Now, place a scaled kernel function at each observation in the sample and compute the new probability estimators $f_h(x)$ for a value x (compared to an earlier bin in the histogram).

$$\hat f_h(x)=\frac{1}{N}\sum_{i=1}^N K_h(x-x_i)$$ $$K_h(u)=\frac{1}{h}K(u/h)$$ $$\hat f_h(x)=\frac{1}{N\times h}\sum_{i=1}^N K((x-x_i)/h)$$

As an example, let $K(.)$ be the standardized Gaussian density function. The KDE looks like the sum of Gaussian curves, each centered on one observation.

Demonstration picture for Gaussian kernels in constructing a kernel density estimator

Note: For Gaussian kernel, the bandwidth is the same as the standard deviation of ($x-x_i$ ).

The KDE method replaces the discrete probability:

$$P(x)=\left\{\begin{matrix} \frac{1}{N} & x \in \{x_1,x_2,\cdots , x_N\} \\ 0 & x \notin \{x_1,x_2,\cdots , x_N\} \end{matrix}\right.$$

with a kernel function. This permits overlap between kernels, thus promoting continuity in the probability estimator.

Why KDE?

Due to our data sampling, we are left with a finite set of values for continuous random variables. Using a kernel instead of discrete probabilities, we promote the continuity nature in the underlying random variable.

To proceed with KDE, you’ll need to decide on two key parameters: Kernel function and bandwidth.

Which kernel should I use?

A range of kernel functions are commonly used: uniform, triangular, biweight, triweight and Epanechnikov. The Gaussian kernel is often used; $K(.)=\phi(.)$ , where $\phi$ is the standard normal density function.

How do I properly compute kernel bandwidth?

Intuitively, one wants to choose an h as small as the data allows, but there is a trade-off between the bias of the estimator and its variance.

Selection of the bandwidth of a kernel estimator is a subject of considerable research. We will outline two popular methods:

  1. Subjective selection - One can experiment by using different bandwidths and simply selecting one that “looks right” for the type of data under investigation.
  2. Selection with reference to some given distribution - Here one selects the bandwidth that would be optimal for a particular PDF. Keep in mind that you are not assuming that $f(x)$ is normal, but rather selecting an h which would be optimal if the PDF were normal. Using a Gaussian kernel, the optimal bandwidth $h_{opt}$ is defined as follows:
    $$h_{opt}=\sigma\times \sqrt[5]{\frac{4}{3N}}$$
    The normal distribution is not a “wiggly” distribution; it is uni-modal and bell–shaped. It is therefore to be expected that $h_{opt}$ will be too large for multimodal distributions. Furthermore, the sample variance ($s^2$ ) is not a robust estimator of $\sigma^2$; it overestimates if some outliers (extreme observations) are present. To overcome these problems, Silverman proposed the following bandwidth estimator:
    $$h_{opt}=\frac{0.9\times \hat\sigma}{\sqrt[5]N}$$
    $$\hat \sigma = \textrm{min}(s,\frac{R}{1.34})$$
    $$R=IQR=Q3-Q1$$
    Where $IQR$ is the interquartile range and “$s$” is the sample standard deviation.
  3. Data driven estimation – this is an area of current research using several different methods: Fourier transform, diffusion based, etc.

Process

Using the NumXL add-in for Excel, you can compute the KDE values for different kernel functions (e.g. Gaussian, uniform, triangular, etc.) and (optionally) with a bandwidth value.

For our sample data, we are using 50 randomly generated values of the normal distribution (using the random generator in the Excel Analysis Pack). We plotted the histogram for our reference:

sample data layout for the KDE in Excel using NumXL histogram-plot-random-gaussian-data

Now we are ready to construct our KDE plot. First, select the empty cell in your worksheet where you wish the output table to be generated, then locate and click on the “Descriptive Statistics” icon in the NumXL tab (or toolbar). Then, select the “Kernel density estimation” item from the drop down menu.
KDE-ICON-NUMXL-TOOLBAR

KDE-Wizard-General-Tab

Select the cells range for the values of the input variable.

Notes:

  1. The cells range includes (optional) the heading (“Label”) cell, which would be used in the output tables where it references those variables.
  2. By default, the output table cells range is set to the current selected cell in your worksheet.
  3. By default, the output graph cells range is set to the 7 cells to the right of the currently selected cell in your worksheet.

Finally, once we select the input data (X) cells range, the “Options” and “Missing Values” tabs become available (enabled).

Next, select the “Options” tab:

KDE-Wizard-option-Tab

Notes:

  1. By default, the Gaussian kernel function is selected. Let’s leave this option unchanged.
  2. By default, the “optimal bandwidth” option is checked. The KDE function will use the Silverman estimate for the bandwidth. Leave it checked.
  3. By default, the output table size is set to 5. Leave it unchanged.
  4. “Overlay Normal distribution” is checked. This option in effect instructs the wizard to generate a second curve for the Gaussian distribution for comparison purposes. Leave this option checked.

Now, click on the “Missing Values” tab.
KDE-Wizard-missing-values-Tab

In this tab, you can select an approach to handle missing values in the data set (X’s). By default, any observation with missing value would be excluded from the analysis.

This treatment is a good approach for our analysis, so let’s leave it unchanged.

Now, click “OK” to generate the output tables.
KDE-GAUSSIAN-RANDOM-DATA-OUTPUT-TABLE

Notes:

  1. The values of all X are sorted in ascending order.
  2. The summary statistics in the 1st row are computed merely to facilitate the creation of the table or computing the overlay Gaussian distribution function.

The generated plot of the KDE is shown below:

KDE-GAUSSIAN-RANDOM-DATA-OUTPUT-PLOT

Note that the KDE curve (blue) tracks very closely with the Gaussian density (orange) curve.

Case 2

Now let’s try a non-normal sample data set. We generated 50 random values of a uniform distribution between -3 and 3. Following similar steps, we plotted the histogram and the KDE:

UNIFORM-DATA-HISTOGRAM-PLOTUNIFORM-DATA-KDE-PLOT

Note that the KDE curve (blue) tracks much more closely with the underlying distribution (i.e. uniform) than the histogram.

Case 3

For our 3rd case, we generated 50 random values of a binomial distribution (p=0.2 and batch size=20). Following similar steps, we plotted the histogram and the KDE.

BINOMIAL-DATA-HISTOGRAM-PLOTBINOMIAL-DATA-KDE-PLOT

Note that KDE curve (blue) tracks much more closely with the underlying distribution (i.e. uniform) than the histogram.

Conclusion

In this tutorial, we demonstrated the process to generate a kernel density estimation in Excel using NumXL’s add-in functions.

The KDE method is a major improvement for inferring the probability density function of the population, in terms of accuracy and continuity of the function. Nevertheless, it introduce a new challenge: selecting a proper bandwidth. In the majority of cases, the Silverman estimator for the bandwidth proves to be satisfactory, but is it optimal? Do we care?

Where do we go from here?

First, to answer the question of optimality, we need to introduce additional algorithms to estimate its values. For example, in “Annals of Statistics, Volume 38, Number 5, pages 2916-2957”, Z. I. Botev, J. F. Grotowski, and D. P. Kroese described a numerical sample data-driven method for finding the optimal bandwidth using a Kernel density estimation via the diffusion approach.

Second, in cases where the range of values that the random number can take are known to be constrained from one side (e.g. prices, binomial data, etc.), or in a range (e.g. survival rate, default rate, etc.), then how do we adapt the KDE to factor in those constraints?

Finally, we defined the KDE probability estimator using a fixed bandwidth ( ) for all observations. If the bandwidth is not held fixed, but is varied depending upon the location of either the estimate (balloon estimator) or the samples (point wise estimator), this produces a particularly powerful method known as adaptive or variable bandwidth kernel density estimation.

 

Have more questions? Submit a request

0 Comments