Maximum likelihood

The probability density function p = f(x, θ) tells you the probability of a random value occurring near x. Likelihood reverses this reading: the same equation is used to estimate the unknown parameter θ from the observed p and x.
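
The two readings of the same formula can be illustrated with scipy.stats.norm (a toy sketch; the parameter values are arbitrary):

from scipy.stats import norm

# density reading: parameters fixed (mu=40, sigma=10), x varies
print(norm.pdf(30, loc=40, scale=10))  # density near x=30
print(norm.pdf(40, loc=40, scale=10))  # density near x=40

# likelihood reading: observation fixed at x=30, parameters vary
print(norm.pdf(30, loc=30, scale=10))  # plausibility of mu=30
print(norm.pdf(30, loc=40, scale=10))  # plausibility of mu=40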

Learning by example: pick the proper Gaussian

  • Observations

  • Probability of observations

  • Observed sample is considered the most likely one

  • Maximising the probability allows us to compute the distribution parameters

1. Observations.

Consider an experiment where we draw two independent random values (x₁, x₂) from the same distribution, e.g. the weights of two mice in grams: (30, 50). We have prior knowledge that the distribution is normal, centred around μ (the average mouse weight) with standard deviation σ: x ~ N(μ, σ²).
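
To make the setup concrete, such a pair could be simulated with numpy; the true values μ = 40 and σ = 10 below are assumed purely for illustration:

import numpy as np

rng = np.random.default_rng(seed=1)
x1, x2 = rng.normal(loc=40, scale=10, size=2)  # two mouse weights, grams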

2. Probability of observations.

What was the probability of encountering (x₁, x₂) = (30, 50)? Since the draws are independent, it is the product of the individual event probabilities, L = f(x₁) · f(x₂). This value depends on the unknown μ and σ, so it can be written as L(μ, σ) = f(x₁ | μ, σ) · f(x₂ | μ, σ), where f is the probability density function.
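
As a sketch, this likelihood can be written as an ordinary Python function of μ and σ, using scipy.stats.norm for the density:

from scipy.stats import norm

def likelihood(mu, sigma, xs=(30, 50)):
    """Joint density of independent observations *xs* under N(mu, sigma**2)."""
    result = 1.0
    for x in xs:
        result *= norm.pdf(x, loc=mu, scale=sigma)
    return result

print(likelihood(40, 10))  # larger than, say, likelihood(35, 5)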

3. Observed sample is considered the most likely one.

It is prudent to assume that the observed (x₁, x₂) = (30, 50) was the realisation of the most probable event. This way we make good use of the scarce available information and arrive at a better guess. If we instead decide we have just seen an extreme event, we will be systematically wrong (the story of a tourist who sees a working fountain in town provides extra intuition).

4. Maximising the probability allows us to compute the distribution parameters.

So, having observed (x₁, x₂), we choose to believe this was the event with the maximum probability (likelihood) of happening. From this assumption we can find the μ and σ that maximise L. Sometimes this can be done analytically, as with the normal distribution; sometimes we have to search for a solution numerically (hoping it is unique).
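
For the normal distribution the maximisation has an analytic solution: the sample mean and the square root of the average squared deviation. For our pair of observations this gives

μ̂ = (x₁ + x₂) / 2 = (30 + 50) / 2 = 40
σ̂² = ((30 − 40)² + (50 − 40)²) / 2 = 100, hence σ̂ = 10

which the code below reproduces numerically.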

Generalisation of the example above

We usually denote a set of parameters like μ and σ as θ, a vector of parameters. Our task is to estimate the parameter θ given:

  • a sample of observations of a random variable X = (x₁, x₂, ..., xₙ), and

  • a pre-defined probability density function f(x, θ).

Solution steps:

  1. Collect observations X = (x₁, x₂, ..., xₙ)

  2. Make a decision about which probability density function f(x, θ) is appropriate for this data

  3. Compose the joint probability of observations as a function of θ: L(θ) = f(x₁, θ) · f(x₂, θ) · ... · f(xₙ, θ).

  4. Accept the principle: “if we really did observe this event, it was the most probable outcome of all possible events given this distribution”

  5. Find the θ that maximises the joint probability of observations
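
A practical note on step 5: the product from step 3 is usually replaced by its logarithm. The log is monotonic, so it is maximised by the same θ, while a sum of log densities avoids the numerical underflow that a product of many small densities causes. A quick demonstration (the sample of 1000 identical observations is artificial):

import numpy as np
from scipy.stats import norm

xs = np.full(1000, 40.0)                # 1000 hypothetical observations
print(np.prod(norm.pdf(xs, 40, 10)))    # 0.0 -- the product underflows
print(np.sum(norm.logpdf(xs, 40, 10)))  # about -3221.5, well-behaved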

Code

The Python code below relies on the scipy.optimize.minimize solver to find the parameters of a normal distribution from the two mice weight measurements in the example above. It maximises the log-likelihood, as discussed above, and can easily be applied to more observations.

"""
Maximum likelihood with two mice.
"""

import numpy as np
from scipy.optimize import minimize

def dnorm(x, mu=0, sigma=1):
    """Normal distribution probability density function."""
    const = 1 / (sigma * np.sqrt(2 * np.pi))
    power = -(x - mu)**2 / (2 * sigma**2)
    return const * np.exp(power)

def log_likelihood(observed_x):
    """Sum of logs of probability densities at *observed_x*.

    Returns:
       a function of mu and sigma
    """
    def log_l(mu, sigma):
        logs = [np.log(dnorm(x, mu, sigma)) for x in observed_x]
        return sum(logs)
    return log_l

def maximise(f, start_mu, start_sigma):
    """Return mu and sigma that maximise *f*."""
    # minimise the negative of *f* in order to maximise *f* itself
    neg_f = lambda p: -f(mu=p[0], sigma=p[1])
    res = minimize(neg_f, x0=[start_mu, start_sigma])
    return res.x[0], res.x[1]

# two mice weights are given, similar to https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6143748/
events = [30, 50]

# construct likelihood as a function of unknown mu and sigma
l_func = log_likelihood(events)

# run maximisation procedure
# attention: a sensible pick of start values is needed, e.g. (0, 1) will fail
estimated_mu, estimated_sd = maximise(l_func, start_mu=30, start_sigma=3)

# test outcomes
#estimated_mu is 39.99999527669165
assert np.isclose(estimated_mu, 40)
#estimated_sd is 9.999976480910071
assert np.isclose(estimated_sd, 10)

# Result: observed values [30, 50] were most likely coming from
#         normal distribution with parameters μ=40 and σ=10.
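
As a cross-check, scipy.stats.norm.fit computes the maximum likelihood estimates for a normal distribution analytically and should agree with the solver:

from scipy.stats import norm

mu_hat, sigma_hat = norm.fit([30, 50])
print(mu_hat, sigma_hat)  # 40.0 10.0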