
Fitting empirical distribution to theoretical ones with Scipy (Python)?

python
dataframe
pandas
best-practices
by Anton Shumikhin · Feb 7, 2025
TLDR

In just three steps, you can fit an empirical distribution to a theoretical one using the stats module from Scipy:

  1. Import stats: from scipy import stats.
  2. Select a theoretical distribution: for instance, stats.norm for Gaussian.
  3. Fit to the data: params = stats.norm.fit(data).

Here's a Python snippet example:

```python
from scipy import stats
import numpy as np

# Your empirical data
data = np.array([...])

# Fit and retrieve parameters
params = stats.norm.fit(data)
print(f"Mean: {params[0]}, Std: {params[1]}")
```

Check the fit quality with statistical tests or plots. Ensure the model represents your data accurately.
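One quick way to do that check is a Kolmogorov-Smirnov test against the fitted parameters. A minimal sketch, using synthetic data in place of your own:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
data = rng.normal(loc=5.0, scale=2.0, size=1000)

# Fit a normal distribution, then test the fit with Kolmogorov-Smirnov
params = stats.norm.fit(data)
ks_stat, p_value = stats.kstest(data, 'norm', args=params)

# A small KS statistic and a large p-value mean we cannot reject the fit
print(f"KS statistic: {ks_stat:.4f}, p-value: {p_value:.4f}")
```

Note that testing against parameters estimated from the same sample makes the p-value optimistic; treat it as a rough sanity check, not a rigorous test.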

Exploring all available distributions

Now, let's go fishing in the pool of Scipy's distributions, looking for the one that fits your data best:

```python
from scipy import stats
import numpy as np
import warnings

# Your empirical data
data = np.array([...])

# Empirical density: a normalized histogram to compare each fitted PDF against
hist, bin_edges = np.histogram(data, bins='auto', density=True)
bin_centers = (bin_edges[:-1] + bin_edges[1:]) / 2.0

sse = {}  # Sum of Squared Errors storage
for dist_name in dir(stats):
    distribution = getattr(stats, dist_name)
    if not isinstance(distribution, stats.rv_continuous):
        continue
    with warnings.catch_warnings():
        warnings.filterwarnings('ignore')  # Ignore warnings by doing the '🙈'
        try:
            params = distribution.fit(data)  # (shape params..., loc, scale)
        except Exception:
            continue  # Some distributions refuse to fit; skip them
        # Evaluate the fitted PDF at the bin centers & calculate SSE
        fitted_pdf = distribution.pdf(bin_centers, *params)
        sse[dist_name] = np.sum((hist - fitted_pdf) ** 2)

# The distribution wearing the best-fitting tuxedo to your data party
best_fit = min(sse, key=sse.get)
```

Note the SSE must compare the fitted PDF to the empirical density (the histogram), not to the raw data values themselves.

Smooth operator

When your fitted PDFs look more like a porcupine than a snake, especially with integer data, consider smoothing:

```python
from scipy.ndimage import gaussian_filter1d  # scipy.ndimage.filters is deprecated

# fitted_pdf here is the PDF of a fitted distribution, evaluated on a grid
smooth_pdf = gaussian_filter1d(fitted_pdf, sigma=2.0)  # Your PDF had a shave!
```

This way, you'll get visually appealing fits that could better capture your data distribution.

Understanding unusual data

Sometimes, data can surprise us and not fit well with any standard distribution, for example when it is multimodal or heavy-tailed. In such cases, consider a nonparametric estimate or a mixture of distributions instead of forcing a single family.
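One nonparametric fallback (my suggestion, not part of the original snippet) is a Gaussian kernel density estimate, which makes no family assumption at all:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Bimodal data: no single standard distribution fits it well
data = np.concatenate([rng.normal(-3, 1, 500), rng.normal(3, 1, 500)])

# Kernel density estimation sidesteps the choice of family entirely
kde = stats.gaussian_kde(data)
grid = np.linspace(-8, 8, 200)
density = kde(grid)

# Sanity check: the estimated density should integrate to roughly 1
area = density.sum() * (grid[1] - grid[0])
```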

Selecting optimal models

When in doubt between several models, use AIC, BIC, or the log-likelihood to decide:

```python
import numpy as np

# num_params: number of fitted parameters; n: sample size
aic = 2 * num_params - 2 * log_likelihood           # Lower AIC is better
bic = np.log(n) * num_params - 2 * log_likelihood   # Lower BIC is better too
```
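With scipy, the log-likelihood of a fitted model is just the sum of log-PDF values over the data, so AIC and BIC fall out directly. A sketch comparing two candidate families on synthetic normal data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
data = rng.normal(0.0, 1.0, 1000)

def aic_bic(dist, data):
    """Fit dist to data, then score it with AIC and BIC (lower is better)."""
    params = dist.fit(data)
    log_likelihood = np.sum(dist.logpdf(data, *params))
    k = len(params)  # number of fitted parameters
    aic = 2 * k - 2 * log_likelihood
    bic = np.log(len(data)) * k - 2 * log_likelihood
    return aic, bic

aic_norm, bic_norm = aic_bic(stats.norm, data)
aic_cauchy, bic_cauchy = aic_bic(stats.cauchy, data)
# On normal data, the normal model should score lower (better) than Cauchy
```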

Predicting the future

With a fitted model, you can evaluate the probability of new data occurrences:

```python
# Survival function: P(X > new_data_point) under the fitted model
p_value = distribution.sf(new_data_point, *params)  # "In what universe could this have happened?" - you, probably
```

This could be useful for anomaly detection or hypothesis testing.
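As a concrete anomaly-detection sketch: flag an observation when its tail probability under the fitted model drops below a threshold (the two-sided check and the threshold value are my choices here, not a scipy convention):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
data = rng.normal(10.0, 2.0, 1000)
params = stats.norm.fit(data)

def is_anomaly(x, alpha=0.001):
    """Two-sided tail check: is x implausibly far out under the fitted model?"""
    tail = min(stats.norm.sf(x, *params), stats.norm.cdf(x, *params))
    return tail < alpha

print(is_anomaly(10.5))  # near the mean
print(is_anomaly(25.0))  # deep in the tail
```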

Useful NumPy functions for discrete data

For discrete (integer-valued) distributions, NumPy offers handy building blocks:

  • np.bincount: it's like counting M&Ms in a box. Tallies each integer value; divide by the sample size to get an empirical PMF.
  • np.cumsum: think of it as filling a jar with marbles, one by one. A running total that turns a PMF into an empirical CDF.
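The two combine naturally. A minimal sketch building an empirical PMF and CDF from integer data:

```python
import numpy as np

# Integer-valued sample, e.g. counts of events per interval
data = np.array([0, 1, 1, 2, 2, 2, 3, 3, 5])

# Empirical PMF: np.bincount tallies each integer value from 0 to data.max()
pmf = np.bincount(data) / len(data)

# Empirical CDF: running total of the PMF
cdf = np.cumsum(pmf)

print("PMF:", pmf)
print("CDF:", cdf)
```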

Also, check out resources like Wikipedia for background on tail functions such as the complementary CDF (ccdf), which scipy exposes as sf.