Let \(y_1,\ldots,y_n\) be an i.i.d. sample with density (or p.m.f.) \(f\) with parameter \(\theta\). The likelihood of the sample \(y_1,\ldots,y_n\) is \[ \mathcal L(\theta; y_1,\ldots,y_n) = \prod_{i=1}^n \mathcal L(\theta; y_i) \]
where \(\mathcal L(\theta; y_i)\) is the function \(\theta \mapsto f(y_i;\theta)\). Generally, instead of the likelihood function, we consider the log-likelihood:
\[ \log \mathcal L(\theta; y_1,\ldots,y_n) = \log\left(\prod_{i=1}^n \mathcal L(\theta; y_i)\right) = \sum_{i=1}^n \log \mathcal L(\theta; y_i). \]
Maximum likelihood estimation (MLE) consists in maximizing the log-likelihood of the sample (or, equivalently, minimizing \(-\log \mathcal L(\theta; y_1,\ldots,y_n)\)). For some distributions, this can be done by solving the score equations
\[ \frac{\partial}{\partial\theta}\log \mathcal L(\theta; y_1,\ldots,y_n) =0. \]
For many distributions, however, the score equations have no explicit solution, and a numerical optimization procedure (such as optim in R) is used to obtain the MLE.
For example, consider an i.i.d. sample \(y_1,\ldots,y_n\) from the Gaussian distribution \(\mathcal N(\mu,\sigma^2)\). The log-likelihood is \[\begin{align*} \log \mathcal L(\mu,\sigma^2; y_1,\ldots,y_n) &=\sum_{i=1}^n \log \mathcal L(\mu,\sigma^2; y_i) = \sum_{i=1}^n \log\left(\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(y_i-\mu)^2}{2\sigma^2}\right)\right)\\ &=-\frac{n}{2}\log(2\pi)-\frac{n}{2}\log(\sigma^2) - \sum_{i=1}^n\frac{(y_i-\mu)^2}{2\sigma^2}. \end{align*}\] So \[\begin{align*} \frac{\partial}{\partial\mu}\log \mathcal L(\mu,\sigma^2; y_1,\ldots,y_n) = 0 &\Leftrightarrow \frac{\partial}{\partial\mu}\left(-\frac{n}{2}\log(2\pi)-\frac{n}{2}\log(\sigma^2) - \sum_{i=1}^n\frac{(y_i-\mu)^2}{2\sigma^2}\right)=0\\ &\Leftrightarrow \frac{\partial}{\partial\mu}\sum_{i=1}^n\frac{(y_i-\mu)^2}{2\sigma^2}=0\\ &\Leftrightarrow -\frac{1}{\sigma^2}\sum_{i=1}^n (y_i-\mu)=0\\ &\Leftrightarrow \mu=\bar y_n. \end{align*}\]
\[\begin{align*} \frac{\partial}{\partial\sigma^2}\log \mathcal L(\mu,\sigma^2; y_1,\ldots,y_n) = 0 &\Leftrightarrow \frac{\partial}{\partial\sigma^2}\left(-\frac{n}{2}\log(2\pi)-\frac{n}{2}\log(\sigma^2) - \sum_{i=1}^n\frac{(y_i-\mu)^2}{2\sigma^2}\right)=0\\ &\Leftrightarrow \frac{\partial}{\partial\sigma^2}\left(-\frac{n}{2}\log(\sigma^2) - \sum_{i=1}^n\frac{(y_i-\mu)^2}{2\sigma^2}\right)=0\\ &\Leftrightarrow -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^n(y_i-\mu)^2=0\\ &\Leftrightarrow \sigma^2 = \frac{1}{n}\sum_{i=1}^n (y_i - \mu)^2, \end{align*}\] and plugging in \(\hat\mu = \bar y_n\) gives \(\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n (y_i - \bar y_n)^2\).
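As a quick sanity check, the closed-form estimators can be compared with the output of a numerical optimizer. Below is a minimal sketch; the simulated sample, the seed, and the \(\log\sigma^2\) reparametrization are arbitrary illustrative choices.

```r
set.seed(1)                          # arbitrary seed, for reproducibility
y <- rnorm(1000, mean = 2, sd = 3)   # illustrative sample

# Negative log-likelihood; par = c(mu, log(sigma^2)) keeps sigma^2 > 0
negloglik <- function(par, y) {
  -sum(dnorm(y, mean = par[1], sd = sqrt(exp(par[2])), log = TRUE))
}
fit <- optim(par = c(0, 0), fn = negloglik, y = y)

c(fit$par[1], exp(fit$par[2]))    # numerical MLE of (mu, sigma^2)
c(mean(y), mean((y - mean(y))^2)) # closed-form MLE, should agree
```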
Repeat this derivation for the exponential distribution, whose density is \[ f(x;\lambda) = \lambda \exp(-\lambda x),\quad x>0, \lambda>0. \]
The log-likelihood is \(\log \mathcal L(\lambda; y_1,\ldots,y_n) = n\log\lambda - \lambda\sum_{i=1}^n y_i\), and solving \(\frac{n}{\lambda} - \sum_{i=1}^n y_i = 0\) gives \[ \hat \lambda = \frac{1}{\bar y_n}. \]
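A quick numerical check of this formula (the sample size and seed below are arbitrary):

```r
set.seed(1)
y <- rexp(1000, rate = 5)                 # simulate an exponential sample
1 / mean(y)                               # closed-form MLE
fitdistrplus::fitdist(y, "exp")$estimate  # numerical MLE, should agree
```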
The density of the Gamma distribution with shape parameter \(\alpha\) and rate parameter \(\beta\) is defined by \[ f(x;\alpha,\beta) = \frac{\beta^\alpha}{\Gamma(\alpha)}x^{\alpha-1}\exp(-\beta x),\quad x>0, \alpha>0, \beta>0, \]
where the Gamma function \(\Gamma\) is defined by
\[ \Gamma(z)=\int_0^\infty t^{z-1}e^{-t}dt. \] Note that \(\Gamma(z+1) = z\Gamma(z)\) and \(\Gamma(1)=1\). If \(Y\) follows a Gamma distribution with shape parameter \(\alpha\) and rate parameter \(\beta\), \[ \mathbb E(Y) = \frac{\alpha}{\beta},\quad \operatorname{Var}(Y) = \frac{\alpha}{\beta^2}. \]
Check that for \(\alpha=1\), the Gamma distribution is the exponential distribution.
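Indeed, since \(\Gamma(1)=1\), \[ f(x;1,\beta) = \frac{\beta}{\Gamma(1)}x^{0}\exp(-\beta x) = \beta \exp(-\beta x), \] which is the density of the exponential distribution with rate \(\lambda = \beta\).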
Plot the density of the Gamma distribution for several values of the parameters \(\alpha\) and \(\beta\) (use the dgamma function).
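A minimal sketch with base R graphics; the grid and the \((\alpha,\beta)\) pairs below are arbitrary choices.

```r
# Gamma densities for a few illustrative (alpha, beta) pairs
x <- seq(0, 5, length.out = 500)
params <- list(c(1, 1), c(2, 2), c(3, 5), c(9, 3))
plot(NULL, xlim = range(x), ylim = c(0, 1.5), xlab = "x", ylab = "density")
for (i in seq_along(params)) {
  p <- params[[i]]
  lines(x, dgamma(x, shape = p[1], rate = p[2]), col = i)
}
legend("topright", col = seq_along(params), lty = 1,
       legend = sapply(params, function(p) paste0("alpha=", p[1], ", beta=", p[2])))
```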
Generate a sample (use the rgamma function) of size \(n=10^6\) from a Gamma distribution with shape parameter \(\alpha = 3\) and rate parameter \(\beta = 5\). Estimate the parameters by maximum likelihood using the R function optim, then using the function fitdist (from library(fitdistrplus)).
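A possible solution sketch; the starting values passed to optim are an arbitrary choice.

```r
library(fitdistrplus)

set.seed(1)
y <- rgamma(1e6, shape = 3, rate = 5)

# Negative log-likelihood of the Gamma distribution
negloglik <- function(par, y) {
  if (any(par <= 0)) return(Inf)  # keep the optimizer in the valid domain
  -sum(dgamma(y, shape = par[1], rate = par[2], log = TRUE))
}
optim(par = c(1, 1), fn = negloglik, y = y)$par  # MLE of (alpha, beta) via optim
fitdist(y, "gamma")$estimate                     # MLE via fitdistrplus
```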
Classical algorithms to compute the MLE of a \(d\)-dimensional parameter \(\theta=(\theta_1,\ldots,\theta_d)\), and more generally to reach the maximum of a function \(f\) (or the minimum of the opposite function \(-f\)), are gradient descent algorithms. They consist of iterating \[ \theta^{(k+1)} = \theta^{(k)} + \gamma\, \nabla f\big(\theta^{(k)}\big), \]
where \(\gamma\) is a learning rate. If \(\gamma\) is too large, gradient descent may not converge; if \(\gamma\) is too small, convergence may be slow. I recommend starting with \(\gamma = \frac{1}{n}\).
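As an illustration, here is a minimal generic implementation; the names gradient_ascent, grad_loglik, theta0 and n_iter are my own choices, and the successive iterates are stored so they can be plotted later.

```r
# Generic gradient ascent on a log-likelihood (maximization of f)
gradient_ascent <- function(grad_loglik, theta0, gamma, n_iter = 1000) {
  theta <- theta0
  path <- matrix(NA_real_, nrow = n_iter + 1, ncol = length(theta0))
  path[1, ] <- theta
  for (k in seq_len(n_iter)) {
    theta <- theta + gamma * grad_loglik(theta)  # ascent step on log L
    path[k + 1, ] <- theta
  }
  list(theta = theta, path = path)  # final estimate and visited values
}
```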
Simulate a sample from an exponential distribution with parameter \(\lambda = 5\).
Use a gradient descent algorithm to estimate the distribution parameter. Compare with the output of fitdist; a sketch is given after the following question.
Determine the MLE using the fitdistrplus R package and compare with your result. Again using fitdistrplus, draw the likelihood curve and add the successive parameter values visited by the algorithm.
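A sketch for this exercise, reusing gradient_ascent defined above; the sample size, seed and number of iterations are arbitrary, and the overlay assumes that llplot draws the log-likelihood against the rate parameter.

```r
set.seed(1)
n <- 1000
y <- rexp(n, rate = 5)

grad <- function(lambda) n / lambda - sum(y)  # d/dlambda of log L
res <- gradient_ascent(grad, theta0 = 1, gamma = 1 / n, n_iter = 500)
res$theta                                     # should be close to 1 / mean(y)

fit <- fitdistrplus::fitdist(y, "exp")
fit$estimate                                  # MLE from fitdistrplus

# Likelihood curve with the successive iterates of the algorithm;
# iterates falling outside the plotted range are simply not shown
fitdistrplus::llplot(fit)
loglik <- sapply(res$path, function(l) sum(dexp(y, rate = l, log = TRUE)))
points(res$path, loglik, pch = 20)
```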
For \(n=2000\), simulate a sample \(y_1,\ldots,y_n\) from a Gamma distribution with shape parameter \(\alpha = 3\) and rate parameter \(\beta = 5\).
Use a gradient descent algorithm to estimate the distribution parameters by MLE; a sketch is given after the derivatives below.
The partial derivatives of the log-likelihood function are
\[ \frac{\partial}{\partial \alpha} \log \mathcal L(\alpha, \beta; y_1,\ldots,y_n) = n\log(\beta) - n\psi(\alpha) + \sum_{i=1}^n\log(y_i), \] where \(\psi\) is the digamma function (use the digamma R function): \(\psi(\alpha)= \Gamma'(\alpha)/\Gamma(\alpha)\), and
\[ \frac{\partial}{\partial \beta} \log \mathcal L(\alpha, \beta; y_1,\ldots,y_n) = n\frac{\alpha}{\beta} - \sum_{i=1}^n y_i. \]
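Putting this together, a sketch for the Gamma exercise, again reusing gradient_ascent from above; the starting point and the number of iterations are arbitrary choices.

```r
set.seed(1)
n <- 2000
y <- rgamma(n, shape = 3, rate = 5)

grad <- function(theta) {
  alpha <- theta[1]; beta <- theta[2]
  c(n * log(beta) - n * digamma(alpha) + sum(log(y)),  # d/dalpha
    n * alpha / beta - sum(y))                         # d/dbeta
}
res <- gradient_ascent(grad, theta0 = c(1, 1), gamma = 1 / n, n_iter = 5000)
res$theta                                    # estimated (alpha, beta)
fitdistrplus::fitdist(y, "gamma")$estimate   # compare with fitdistrplus
```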