Monday, April 22, 2013

Presenting data in histograms

I've recently been working with a relatively large number of data sets containing the errors between different measurements of the chemical shift of a specific atom in a protein. A problem I encountered was how to determine the bin width of the histograms without manually choosing a unique value for each set. This matters because if you have to fit a function to the distribution, the chosen bin width can greatly affect the resulting fit.
The Python module astroML contains an improved version of matplotlib's hist, where the form of the histogram can be chosen automatically based on different statistical models. Two noteworthy options are the Freedman-Diaconis rule and the Bayesian blocks method discussed here, with examples of usage shown here.
The Freedman-Diaconis rule sets the bin width to 2·IQR·n^(-1/3), where IQR is the interquartile range of the data and n is the number of data points. The following Python code calculates the resulting number of bins without the use of the astroML module:
def bins(t):
    """Optimal number of bins from the Freedman-Diaconis rule."""
    t = sorted(t)                     # avoid mutating the caller's data
    n = len(t)
    iqr = t[3 * n // 4] - t[n // 4]   # approximate interquartile range
    width = 2 * iqr * n ** (-1. / 3)  # Freedman-Diaconis bin width
    return int((t[-1] - t[0]) / width)
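
The returned bin count can then be passed straight to an ordinary histogram call; a minimal usage sketch, where the random data set x is just a hypothetical stand-in:

import numpy as np
import pylab

x = np.random.normal(size=1000)  # hypothetical example data
pylab.hist(x, bins=bins(x))      # histogram with the computed bin count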
An interesting alternative to histograms for estimating a distribution is Kernel Density Estimation (KDE), where a kernel (a symmetric, non-negative function that integrates to one, e.g. a normal density) is placed at each data point, and the sum of the kernels gives the kernel density estimate.
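
To make the idea concrete, here is a minimal sketch of a Gaussian KDE built by hand. The fixed bandwidth h is an assumption here, supplied by the caller rather than estimated from the data as scipy's gaussian_kde does:

import numpy as np

def kde(data, grid, h):
    # place a normal density of width h at each data point...
    u = (grid[:, None] - data[None, :]) / h
    kernels = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
    # ...and average them, rescaling so the estimate integrates to one
    return kernels.sum(axis=1) / (len(data) * h)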

As an example, the following code generates a bimodal data set x and plots it using the three methods mentioned above:

import numpy as np
from astroML.plotting import hist
from scipy.stats import gaussian_kde
import pylab

# generate data from two gaussians
x1 = np.random.normal(0, 0.5, size=1000)
x2 = np.random.normal(1.5, 0.3, size=1000)
x = np.concatenate((x1, x2))

# plot histogram using the Freedman-Diaconis rule (filled opaque blue bins)
hist(x, density=True, alpha=0.2, color='blue', bins='freedman')

# plot histogram using Bayesian blocks (green step line)
hist(x, density=True, bins='blocks', histtype='step', color='green')

# plot the KDE using gaussian kernels (red line)
my_pdf = gaussian_kde(x)
data_range = np.linspace(min(x), max(x), 100)
pylab.plot(data_range, my_pdf(data_range), color='red')

pylab.savefig('test.png')

This produces the following plot, where the filled blue bins come from the Freedman-Diaconis rule, the green step line from the Bayesian blocks method, and the red line from the KDE.
