1 Maximum likelihood estimator

1.1 Log-likelihood

Let \(y_1,\ldots,y_n\) be an i.i.d. sample with density (or p.m.f.) \(f\) with parameter \(\theta\). The likelihood of the sample \(y_1,\ldots,y_n\) is \[ \mathcal L(\theta; y_1,\ldots,y_n) = \prod_{i=1}^n \mathcal L(\theta; y_i) \]

where \(\mathcal L(\theta; y_i)\) is the function \(\theta \mapsto f(y_i,\theta)\). In practice, instead of the likelihood function, we usually work with the log-likelihood:

\[ \log \mathcal L(\theta; y_1,\ldots,y_n) = \log\left(\prod_{i=1}^n \mathcal L(\theta; y_i)\right) = \sum_{i=1}^n \log \mathcal L(\theta; y_i). \]

Maximum Likelihood Estimation (MLE) consists in maximizing the log-likelihood of the sample (or, equivalently, minimizing \(-\log \mathcal L(\theta; y_1,\ldots,y_n)\)). For some distributions, this can be done by solving the score equations:

\[ \frac{\partial}{\partial\theta}\log \mathcal L(\theta; y_1,\ldots,y_n) = 0. \] For many distributions, the score equations have no explicit solution, and a numerical optimization procedure is used to obtain the MLE (use optim in R).

  • Let \(y_1,\ldots,y_n\) be a sample from a Gaussian distribution with mean value \(\mu\) and variance \(\sigma^2\). Find the MLE of \(\mu\) and \(\sigma^2\).

\[\begin{align*} \log \mathcal L(\mu,\sigma^2; y_1,\ldots,y_n) &=\sum_{i=1}^n \log \mathcal L(\mu,\sigma^2; y_i) = \sum_{i=1}^n \log\left(\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(y_i-\mu)^2}{2\sigma^2}\right)\right)\\ &=-\frac{n}{2}\log(2\pi)-\frac{n}{2}\log(\sigma^2) - \sum_{i=1}^n\left(\frac{(y_i-\mu)^2}{2\sigma^2}\right) \end{align*}\] So \[\begin{align*} \frac{\partial}{\partial\mu}\log \mathcal L(\mu,\sigma^2; y_1,\ldots,y_n) = 0 &\Leftrightarrow \frac{\partial}{\partial\mu}\left(-\frac{n}{2}\log(2\pi)-\frac{n}{2}\log(\sigma^2) - \sum_{i=1}^n\left(\frac{(y_i-\mu)^2}{2\sigma^2}\right)\right)=0\\ &\Leftrightarrow \frac{\partial}{\partial\mu}\sum_{i=1}^n\left(\frac{(y_i-\mu)^2}{2\sigma^2}\right)=0\\ &\Leftrightarrow -\frac{1}{\sigma^2}\sum_{i=1}^n (y_i-\mu)=0\\ &\Leftrightarrow \mu=\bar y_n \end{align*}\]

\[\begin{align*} \frac{\partial}{\partial\sigma^2}\log \mathcal L(\mu,\sigma^2; y_1,\ldots,y_n) = 0 &\Leftrightarrow \frac{\partial}{\partial\sigma^2}\left(-\frac{n}{2}\log(2\pi)-\frac{n}{2}\log(\sigma^2) - \sum_{i=1}^n\left(\frac{(y_i-\mu)^2}{2\sigma^2}\right)\right)=0\\ &\Leftrightarrow \frac{\partial}{\partial\sigma^2}\left(-\frac{n}{2}\log(\sigma^2) - \sum_{i=1}^n\left(\frac{(y_i-\mu)^2}{2\sigma^2}\right)\right)=0\\ &\Leftrightarrow -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^n(y_i-\mu)^2=0\\ &\Leftrightarrow \sigma^2 = \frac{1}{n}\sum_{i=1}^n (y_i - \bar y_n)^2 \end{align*}\]

  • Generate a sample (use the rnorm function) of size \(n=10^6\) from a Gaussian distribution with mean \(\mu = 2\) and variance \(\sigma^2 = 4\). Compute the MLE and illustrate the a.s. convergence of the estimator, that is, the convergence of the estimator as the sample size increases (a sketch is given below).
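A minimal R sketch of this step; the seed, the variable names, and the grid of sample sizes used for the convergence plot are my own choices, not prescribed by the exercise:

```r
set.seed(1)
n <- 1e6
y <- rnorm(n, mean = 2, sd = 2)   # variance 4, so sd = 2

# Closed-form MLEs derived above
mu_hat     <- mean(y)
sigma2_hat <- mean((y - mu_hat)^2)   # note: divides by n, not n - 1
c(mu_hat, sigma2_hat)

# Almost sure convergence: estimate of mu computed on the first k observations,
# for increasing k
k    <- seq(100, n, by = 1000)
mu_k <- cumsum(y)[k] / k
plot(k, mu_k, type = "l", log = "x",
     xlab = "sample size", ylab = expression(hat(mu)))
abline(h = 2, col = "red", lty = 2)
```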

1.2 Exponential distribution

Repeat the previous analysis using the exponential distribution, whose density is \[ f(\lambda,x) = \lambda \exp(-\lambda x), \quad x>0, \lambda>0. \]

  • Derive the MLE: here, some calculus (setting the derivative of the log-likelihood function to zero) leads to

\[ \hat \lambda = \frac{1}{\bar{y_n}} \]

  • Illustrate the almost sure convergence using \(\lambda=4\) (a sketch is given below).
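A minimal R sketch, following the same pattern as in the Gaussian case (again, names and plotting choices are illustrative):

```r
set.seed(1)
n <- 1e6
y <- rexp(n, rate = 4)

# MLE derived above: 1 / sample mean
lambda_hat <- 1 / mean(y)
lambda_hat

# Running estimate to illustrate a.s. convergence
k        <- seq(100, n, by = 1000)
lambda_k <- k / cumsum(y)[k]
plot(k, lambda_k, type = "l", log = "x",
     xlab = "sample size", ylab = expression(hat(lambda)))
abline(h = 4, col = "red", lty = 2)
```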

1.3 Gamma distribution

The density of the Gamma distribution with shape parameter \(\alpha\) and rate parameter \(\beta\) is defined by \[ f(x;\alpha,\beta) = \frac{\beta^\alpha}{\Gamma(\alpha)}x^{\alpha-1}\exp(-\beta x),\quad x>0, \alpha>0, \beta>0, \]

where the Gamma function \(\Gamma\) is defined by

\[ \Gamma(z)=\int_0^\infty t^{z-1}e^{-t}dt. \] Note that \(\Gamma(z+1) = z\Gamma(z)\) and \(\Gamma(1)=1\). If \(Y\) follows a Gamma distribution with shape parameter \(\alpha\) and rate parameter \(\beta\), \[ \mathbb E(Y) = \frac{\alpha}{\beta},\quad \mathrm{Var}(Y) = \frac{\alpha}{\beta^2}. \]

  • Check that for \(\alpha=1\), the Gamma distribution is the exponential distribution.

  • Plot the density of the Gamma distribution for several values of the parameters \(\alpha\) and \(\beta\) (use the dgamma function; a sketch follows this list).

  • Generate a sample (use the rgamma function) of size \(n=10^6\) from a Gamma distribution with shape parameter \(\alpha = 3\) and rate parameter \(\beta = 5\). Estimate the MLE using the R function optim, then using the function fitdist from the fitdistrplus package (sketches are given below).
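A possible R sketch for the density plot; the particular \((\alpha,\beta)\) pairs used below are arbitrary illustrative choices:

```r
x    <- seq(0, 5, length.out = 500)
pars <- list(c(1, 1), c(2, 1), c(3, 1), c(3, 5), c(9, 5))   # (alpha, beta) pairs

plot(x, dgamma(x, shape = pars[[1]][1], rate = pars[[1]][2]),
     type = "l", ylim = c(0, 2), xlab = "x", ylab = "density")
for (i in 2:length(pars)) {
  lines(x, dgamma(x, shape = pars[[i]][1], rate = pars[[i]][2]), col = i)
}
legend("topright", col = seq_along(pars), lty = 1,
       legend = sapply(pars, function(p) paste0("alpha=", p[1], ", beta=", p[2])))
```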
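A sketch of the estimation step, assuming the fitdistrplus package is installed; the starting values passed to optim are arbitrary:

```r
library(fitdistrplus)

set.seed(1)
n <- 1e6
y <- rgamma(n, shape = 3, rate = 5)

# Negative log-likelihood of the Gamma sample as a function of (alpha, beta)
negloglik <- function(par) {
  -sum(dgamma(y, shape = par[1], rate = par[2], log = TRUE))
}

# Numerical minimization; "L-BFGS-B" allows the positivity constraints
fit_optim <- optim(par = c(1, 1), fn = negloglik,
                   method = "L-BFGS-B", lower = c(1e-6, 1e-6))
fit_optim$par

# Same estimation with fitdistrplus
fit_fd <- fitdist(y, "gamma")
fit_fd$estimate
```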

2 Gradient descent

Classical algorithms to reach the MLE of a \(d\)-dimensional parameter \(\theta=(\theta_1,\ldots,\theta_d)\), and more generally to reach the maximum of a function \(f\) (or the minimum of the opposite function \(-f\)), are gradient descent algorithms. They consist in iterating, from a starting value \(\theta^{(0)}\),

\[ \theta^{(k+1)} = \theta^{(k)} + \gamma\, \nabla_\theta \log \mathcal L(\theta^{(k)}; y_1,\ldots,y_n), \]

where \(\gamma\) is a learning rate. If \(\gamma\) is too large, gradient descent may not converge. If \(\gamma\) is too small, convergence may be slow. I recommend starting with \(\gamma = \frac{1}{n}\).
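As a minimal illustration, the iteration can be written as a small generic R function; the name grad_ascent, the fixed number of iterations, and the returned path of iterates are my own choices:

```r
# 'grad' must return the gradient of the log-likelihood at 'theta';
# 'gamma' is the learning rate.
grad_ascent <- function(theta, grad, gamma, n_iter = 1000) {
  path <- matrix(NA, nrow = n_iter + 1, ncol = length(theta))
  path[1, ] <- theta
  for (k in seq_len(n_iter)) {
    theta <- theta + gamma * grad(theta)
    path[k + 1, ] <- theta
  }
  list(theta = theta, path = path)
}
```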

2.1 Exponential distribution

  • Simulate a sample from an exponential distribution with parameter \(\lambda= 5\).

  • Use a gradient descent algorithm to estimate the distribution parameter. Compare with the output of fitdist.

  • Determine the MLE using the fitdistrplus R package and compare with your result. Again using the fitdistrplus package, draw the likelihood curve and add the successive parameter values of the algorithm (a sketch is given after this list).
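A possible R sketch for this subsection. The sample size \(n = 2000\), the starting value, and the number of iterations are my own choices (the exercise does not fix them), and the log-likelihood curve is drawn here with base R rather than with the plotting utilities of fitdistrplus:

```r
library(fitdistrplus)

set.seed(1)
n <- 2000
y <- rexp(n, rate = 5)

# Gradient of the exponential log-likelihood: d/dlambda = n/lambda - sum(y)
grad <- function(lambda) n / lambda - sum(y)

gam    <- 1 / n   # learning rate gamma
lambda <- 1       # arbitrary starting value
path   <- lambda
for (k in 1:500) {
  # gradient ascent on the log-likelihood (equivalently, descent on its negative)
  lambda <- lambda + gam * grad(lambda)
  path   <- c(path, lambda)
}
lambda                      # gradient-descent estimate
1 / mean(y)                 # closed-form MLE
fitdist(y, "exp")$estimate  # fitdistrplus estimate

# Log-likelihood curve with the successive iterates
loglik <- function(l) n * log(l) - l * sum(y)
curve(loglik(x), from = 1, to = 8,
      xlab = expression(lambda), ylab = "log-likelihood")
points(path, loglik(path), pch = 20, col = "red")
```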

2.2 Gamma distribution

  • For \(n=2000\), simulate a sample \(y_1,\ldots,y_n\) from a Gamma distribution with shape parameter \(\alpha= 3\) and rate parameter \(\beta = 5\).

  • Use a gradient descent algorithm to estimate the distribution parameters by MLE.

The derivatives of the log-likelihood function are

\[ \frac{\partial}{\partial \alpha} \log \mathcal L(\alpha, \beta; y_1,\ldots,y_n) = n\log(\beta) - n\psi(\alpha) + \sum_{i=1}^n\log(y_i) \] where \(\psi\) is the digamma function (use the digamma R function): \(\psi(\alpha)= \Gamma'(\alpha)/\Gamma(\alpha)\), and

\[ \frac{\partial}{\partial \beta} \log \mathcal L(\alpha, \beta; y_1,\ldots,y_n) = n\frac{\alpha}{\beta} - \sum_{i=1}^ny_i. \]

  • Determine the MLE using the fitdistrplus R package and compare with your result. Again using the fitdistrplus package, draw the contour of the log-likelihood function and add the successive parameter values of the algorithm (a sketch is given below).
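A possible R sketch reusing the derivatives given above; the starting point, the number of iterations, the grid for the contour, and the use of a base R contour plot (rather than the plotting utilities of fitdistrplus) are my own choices:

```r
library(fitdistrplus)

set.seed(1)
n <- 2000
y <- rgamma(n, shape = 3, rate = 5)

# Gradient of the Gamma log-likelihood at (alpha, beta), from the formulas above
grad <- function(par) {
  a <- par[1]; b <- par[2]
  c(n * log(b) - n * digamma(a) + sum(log(y)),
    n * a / b - sum(y))
}

# Log-likelihood (vectorized element-wise in a and b)
loglik <- function(a, b) {
  n * a * log(b) - n * lgamma(a) + (a - 1) * sum(log(y)) - b * sum(y)
}

gam  <- 1 / n               # learning rate gamma
par  <- c(1, 1)             # arbitrary starting value
path <- matrix(par, nrow = 1)
for (k in 1:2000) {
  par  <- par + gam * grad(par)
  path <- rbind(path, par)
}
par                            # gradient-descent estimate of (alpha, beta)
fitdist(y, "gamma")$estimate   # fitdistrplus estimate

# Contour of the log-likelihood with the successive iterates
a_grid <- seq(0.5, 5, length.out = 100)
b_grid <- seq(0.5, 8, length.out = 100)
z <- outer(a_grid, b_grid, loglik)
contour(a_grid, b_grid, z, nlevels = 30,
        xlab = expression(alpha), ylab = expression(beta))
lines(path[, 1], path[, 2], type = "o", pch = 20, col = "red")
```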