
Fitting empirical distribution to theoretical ones with Scipy (Python)?

python
dataframe
pandas
best-practices
by Anton Shumikhin · Feb 7, 2025
TLDR

In just three steps, you can fit an empirical distribution to a theoretical one using the stats module from Scipy:

  1. Import stats: from scipy import stats.
  2. Select a theoretical distribution: for instance, stats.norm for Gaussian.
  3. Fit to the data: params = stats.norm.fit(data).

Here's a Python snippet example:

```python
from scipy import stats
import numpy as np

# Your empirical data
data = np.array([...])

# Fit and retrieve parameters
params = stats.norm.fit(data)
print(f"Mean: {params[0]}, Std: {params[1]}")
```

Check the fit quality with statistical tests or plots. Ensure the model represents your data accurately.
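One quick way to do that check is a Kolmogorov-Smirnov test against the fitted parameters. A minimal sketch, using synthetic data in place of your own:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
data = rng.normal(loc=5.0, scale=2.0, size=1000)

# Fit a normal distribution, then test the fit with Kolmogorov-Smirnov
params = stats.norm.fit(data)
ks_stat, p_value = stats.kstest(data, 'norm', args=params)

# A small KS statistic and a large p-value mean we cannot reject the fit
print(f"KS statistic: {ks_stat:.4f}, p-value: {p_value:.4f}")
```

Note that testing against parameters estimated from the same sample makes the p-value optimistic; treat it as a rough sanity check, not a rigorous test.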

Exploring all available distributions

Now, let's go fishing in the pool of Scipy's distributions, looking for the one that fits your data best:

```python
from scipy import stats
import numpy as np
import warnings

# Your empirical data
data = np.array([...])

# Empirical density: a normalized histogram to compare each fitted PDF against
hist, bin_edges = np.histogram(data, bins='auto', density=True)
bin_centers = (bin_edges[:-1] + bin_edges[1:]) / 2.0

sse = {}  # Sum of Squared Errors storage
for dist_name in dir(stats):
    distribution = getattr(stats, dist_name)
    if not isinstance(distribution, stats.rv_continuous):
        continue
    with warnings.catch_warnings():
        warnings.filterwarnings('ignore')  # Ignore warnings by doing the '🙈'
        try:
            params = distribution.fit(data)  # (shape params..., loc, scale)
        except Exception:
            continue  # Some distributions refuse to fit; skip them
        # Evaluate the fitted PDF at the bin centers & calculate SSE
        fitted_pdf = distribution.pdf(bin_centers, *params)
        sse[dist_name] = np.sum((hist - fitted_pdf) ** 2)

# The distribution wearing the best-fitting tuxedo to your data party
best_fit = min(sse, key=sse.get)
```

Note the SSE must compare the fitted PDF to the empirical density (the histogram), not to the raw data values themselves.

Smooth operator

When your fitted PDFs look more like a porcupine than a snake, especially with integer data, consider smoothing:

```python
from scipy.ndimage import gaussian_filter1d  # scipy.ndimage.filters is deprecated

# fitted_pdf here is the PDF of a fitted distribution, evaluated on a grid
smooth_pdf = gaussian_filter1d(fitted_pdf, sigma=2.0)  # Your PDF had a shave!
```

This way, you'll get visually appealing fits that could better capture your data distribution.

Understanding unusual data

Sometimes, data can surprise us and not fit well with any standard distribution, for example when it is multimodal or heavy-tailed. In such cases, consider a nonparametric estimate or a mixture of distributions instead of forcing a single family.
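One nonparametric fallback (my suggestion, not part of the original snippet) is a Gaussian kernel density estimate, which makes no family assumption at all:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Bimodal data: no single standard distribution fits it well
data = np.concatenate([rng.normal(-3, 1, 500), rng.normal(3, 1, 500)])

# Kernel density estimation sidesteps the choice of family entirely
kde = stats.gaussian_kde(data)
grid = np.linspace(-8, 8, 200)
density = kde(grid)

# Sanity check: the estimated density should integrate to roughly 1
area = density.sum() * (grid[1] - grid[0])
```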

Selecting optimal models

When in doubt between several models, use AIC, BIC, or the log-likelihood to decide:

```python
import numpy as np

# num_params: number of fitted parameters; n: sample size
aic = 2 * num_params - 2 * log_likelihood           # Lower AIC is better
bic = np.log(n) * num_params - 2 * log_likelihood   # Lower BIC is better too
```
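With scipy, the log-likelihood of a fitted model is just the sum of log-PDF values over the data, so AIC and BIC fall out directly. A sketch comparing two candidate families on synthetic normal data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
data = rng.normal(0.0, 1.0, 1000)

def aic_bic(dist, data):
    """Fit dist to data, then score it with AIC and BIC (lower is better)."""
    params = dist.fit(data)
    log_likelihood = np.sum(dist.logpdf(data, *params))
    k = len(params)  # number of fitted parameters
    aic = 2 * k - 2 * log_likelihood
    bic = np.log(len(data)) * k - 2 * log_likelihood
    return aic, bic

aic_norm, bic_norm = aic_bic(stats.norm, data)
aic_cauchy, bic_cauchy = aic_bic(stats.cauchy, data)
# On normal data, the normal model should score lower (better) than Cauchy
```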

Predicting the future

With a fitted model, you can evaluate the probability of new data occurrences:

```python
# Survival function: P(X > new_data_point) under the fitted model
p_value = distribution.sf(new_data_point, *params)  # "In what universe could this have happened?" - you, probably
```

This could be useful for anomaly detection or hypothesis testing.
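As a concrete anomaly-detection sketch: flag an observation when its tail probability under the fitted model drops below a threshold (the two-sided check and the threshold value are my choices here, not a scipy convention):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
data = rng.normal(10.0, 2.0, 1000)
params = stats.norm.fit(data)

def is_anomaly(x, alpha=0.001):
    """Two-sided tail check: is x implausibly far out under the fitted model?"""
    tail = min(stats.norm.sf(x, *params), stats.norm.cdf(x, *params))
    return tail < alpha

print(is_anomaly(10.5))  # near the mean
print(is_anomaly(25.0))  # deep in the tail
```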

Useful NumPy functions for discrete data

For discrete (integer-valued) distributions, NumPy offers handy building blocks:

  • np.bincount: it's like counting M&Ms in a box. Tallies each integer value; divide by the sample size to get an empirical PMF.
  • np.cumsum: think of it as filling a jar with marbles, one by one. A running total that turns a PMF into an empirical CDF.
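The two combine naturally. A minimal sketch building an empirical PMF and CDF from integer data:

```python
import numpy as np

# Integer-valued sample, e.g. counts of events per interval
data = np.array([0, 1, 1, 2, 2, 2, 3, 3, 5])

# Empirical PMF: np.bincount tallies each integer value from 0 to data.max()
pmf = np.bincount(data) / len(data)

# Empirical CDF: running total of the PMF
cdf = np.cumsum(pmf)

print("PMF:", pmf)
print("CDF:", cdf)
```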

Also, check out resources like Wikipedia for background on tail functions such as the complementary CDF (ccdf), which scipy exposes as sf.