1 Probability reminder

1.1 Distribution functions

  • For a continuous random variable \(X\), the probability density function (p.d.f.) is the function \(f\) whose area under the curve between \(a\) and \(b\) gives the probability:

\[ P(a\leq X\leq b)= \int_{a}^b f(x)dx. \]

  • For a discrete random variable \(X\) taking values in a discrete space \(\xi\), the probability mass function (p.m.f.) is given by the probabilities \[P(X = k), k\in\xi\]

  • the cumulative distribution function c.d.f. is

\[ F(x) = P(X\leq x) \]

  • the quantile function is

\[ Q(\alpha) = F^{-1}(\alpha);\quad \text{that is to say, the value of $x$ such that } P(X\leq x) = \alpha. \] N.B.: \(Q(0.5)\) is the median of \(X\), \(Q(0.25)\) is the first quartile and \(Q(0.75)\) is the third quartile.
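As a quick illustration (previewing the R functions of Section 2.1), the c.d.f. and the quantile function are inverses of each other; a small check for the standard Gaussian distribution:

```r
# The quantile function is the inverse of the c.d.f.: Q(F(x)) = x
# for a continuous distribution.
x <- 1.3
alpha <- pnorm(x)            # F(x) = P(X <= x)
qnorm(alpha)                 # recovers x = 1.3
qnorm(0.5)                   # median of N(0,1): 0
c(qnorm(0.25), qnorm(0.75))  # first and third quartiles of N(0,1)
```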

1.2 Expectation, variance, covariance

For a continuous random variable taking value in \(\mathbb R\) and with density \(f\), the expectation of \(X\) is

\[ \mathbb E X = \int_{\mathbb R} xf(x) dx. \]

For a discrete variable taking values on a discrete space \(\xi\), the expectation of \(X\) is

\[ \mathbb E X = \sum_{k\in\xi} k P(X=k). \]

The variance is

\[ var(X) = \mathbb E\left((X-\mathbb E(X))^2\right) = \mathbb E(X^2)-(\mathbb E(X))^2. \]

The standard deviation is \(sd(X)=\sqrt{var(X)}\). Finally, for two random variables \(X\) and \(Y\), the covariance between \(X\) and \(Y\) is

\[ cov(X,Y)=\mathbb E\left((X-\mathbb E(X))(Y-\mathbb E(Y))\right) = \mathbb E(XY) - \mathbb E(X)\mathbb E(Y). \]
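These integrals can be checked numerically with R's integrate function; a small sketch for the exponential distribution with rate \(\lambda=2\), whose theoretical moments are \(\mathbb E X = 1/\lambda = 1/2\) and \(var(X) = 1/\lambda^2 = 1/4\):

```r
# Numerical check of E(X) and var(X) = E(X^2) - (E(X))^2 for X ~ Exp(rate = 2)
EX  <- integrate(function(x) x   * dexp(x, rate = 2), 0, Inf)$value
EX2 <- integrate(function(x) x^2 * dexp(x, rate = 2), 0, Inf)$value
EX          # expectation: 0.5
EX2 - EX^2  # variance: 0.25
```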

1.3 On a sample

For an independent and identically distributed (i.i.d.) sample \(y_1,\ldots,y_n\), estimates of the expectation and variance are the mean value (or empirical mean) and the empirical variance, given by: \[ \bar y=\frac{1}{n}\sum_{i=1}^n y_i,\quad S_y^2 = \frac{1}{n}\sum_{i=1}^n(y_i - \bar y)^2 = \bar{y^2} - (\bar y)^2. \] The standard deviation (sd) of the sample is \(sd(y_1,\ldots,y_n)=\sqrt{S^2_y}\).

In the same way, the empirical covariance between the samples \(x_1,\ldots,x_n\) and \(y_1,\ldots,y_n\) is \[ C_{x,y}=\frac{1}{n}\sum_{i=1}^n(x_i-\bar x)(y_i - \bar y) = \bar{xy} - \bar x\bar y. \]

Finally, the (Pearson’s) correlation between \(x_1,\ldots,x_n\) and \(y_1,\ldots,y_n\) is \[ r_{x,y}=\frac{C_{x,y}}{\sqrt{S_x^2S_y^2}}. \] Pearson’s correlation lies in \([-1,1]\): \(\pm 1\) for perfect linear dependence and 0 for no linear dependence.
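One caveat when checking these formulas in R: the built-in var() and cov() use the unbiased \(1/(n-1)\) denominator, not the \(1/n\) empirical formulas above, while cor() is unaffected since the convention cancels. A quick sketch:

```r
set.seed(0)
n <- 5
x <- rnorm(n); y <- rnorm(n)
S2_y <- mean(y^2) - mean(y)^2            # empirical variance (1/n denominator)
C_xy <- mean(x * y) - mean(x) * mean(y)  # empirical covariance (1/n denominator)
# R's var()/cov() use 1/(n-1); rescale to compare:
all.equal(S2_y, var(y) * (n - 1) / n)     # TRUE
all.equal(C_xy, cov(x, y) * (n - 1) / n)  # TRUE
all.equal(cor(x, y), C_xy / sqrt((mean(x^2) - mean(x)^2) * S2_y))  # TRUE
```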

2 Distributions

2.1 Generating distributions

In the following exercises, we will use the R functions:

  • rnorm, rexp, rbinom, rpois to simulate random variables with respectively Gaussian, exponential, binomial and Poisson distributions
  • pnorm, pexp, pbinom, ppois to evaluate the c.d.f. of these distributions
  • qnorm, qexp, qbinom, qpois to evaluate the quantile function, i.e. the inverse of the c.d.f.
  • dnorm, dexp, dbinom, dpois to evaluate the density or p.m.f.
plot(function(x) dnorm(x,mean=0,sd=1), xlim=c(-5,5), ylab="", main="density of the gaussian distribution")

plot(function(x) pnorm(x,mean=0,sd=1), xlim=c(-5,5), ylab="", main="c.d.f. of the gaussian distribution")

 barplot(dpois(0:10,4),ylab="Probability",xlab="k",
space=2,ylim=c(0,0.2), names.arg=0:10, main="p.m.f. of the poisson distribution with parameter 4")

 plot(function(x) ppois(x,4), xlim=c(-1,10), ylab="", main="c.d.f. of the poisson distribution with parameter 4")

2.2 Simulation of classical distributions

First, fix the seed by using the following command

set.seed(1)
  • Using the R functions rexp, rnorm, rbinom, generate i.i.d. samples \(X_1\), \(X_2\) and \(X_3\) of size \(n=10000\), respectively from the exponential distribution with rate parameter \(\lambda=2\), from the Gaussian distribution with mean value \(\mu=3\) and standard deviation \(sd=2\), and from the Bernoulli distribution with probability of success \(p=0.25\).
  • Create a data frame \(X\) (R command data.frame) which contains \(X_1\), \(X_2\) and \(X_3\). Print the top of the data frame (use head).
  • Using the R function apply, calculate the mean value and the empirical variance of the samples. Compare with the theoretical values. Evaluate the Pearson’s correlation between the three samples, and also the covariance matrix (use \(cor(X)\), \(cov(X)\)).
  • Plot the histograms of the exponential and Gaussian samples and add the corresponding density curves (use hist(X1, freq=FALSE)).
  • Print the table of frequencies (use table(X3)/n) and compare these values with \(p\) and \(1-p\).
n=10000
#?rexp
#?rnorm
#?rbinom

X1<-rexp(n,2)        # exponential sample, rate lambda = 2
X2<-rnorm(n,3,2)     # Gaussian sample, mean 3, sd 2
X3<-rbinom(n,1,0.25) # Bernoulli sample, p = 0.25

X=data.frame(exp=X1,Gauss=X2,Bernouilli=X3) 
head(X)  
##          exp    Gauss Bernouilli
## 1 0.37759092 4.869048          0
## 2 0.59082139 3.580626          0
## 3 0.07285336 3.387249          0
## 4 0.06989763 5.147031          1
## 5 0.21803431 2.286873          0
## 6 1.44748427 3.790633          0
apply(X,2,mean)
##        exp      Gauss Bernouilli 
##  0.4991806  3.0078905  0.2482000
apply(X,2,var)
##        exp      Gauss Bernouilli 
##  0.2578852  4.0025920  0.1866154
cov(X)
##                    exp        Gauss   Bernouilli
## exp         0.25788516 -0.023411310 -0.006222800
## Gauss      -0.02341131  4.002592022 -0.005486475
## Bernouilli -0.00622280 -0.005486475  0.186615422
hist(X1,breaks=30,freq=FALSE) # histogram of the exponential sample, on the density scale
plot(function(x) dexp(x,2),add=TRUE,col="red",xlim=c(0,5)) # overlay the theoretical density

hist(X2,breaks=30,freq=FALSE)
plot(function(x) dnorm(x,3,2),add=TRUE,col="red",xlim=c(-10,10))

table(X3)/n
## X3
##      0      1 
## 0.7518 0.2482
pie(table(X3))

2.3 Simulating distributions using uniform samples

Let \(U\) be a random variable with the uniform distribution on \([0,1]\). Then for any continuous cumulative distribution function \(F\), the random variable \[Z=F^{-1}(U)\quad \text{has distribution } F.\] Using this result, simulate the same exponential and Gaussian samples and answer the same questions as in Exercise 1. Use the R functions runif, qexp, qnorm.

n<-10^3
U<-runif(n,0,1)
X1<-qnorm(U,2,3)
X2<-qexp(U,5)

hist(X1,breaks=30,freq=FALSE)
plot(function(x) dnorm(x,2,3),add=TRUE,col="red",xlim=range(X1))

hist(X2,breaks=30,freq=FALSE)
plot(function(x) dexp(x,5),add=TRUE,col="red",xlim=range(X2))

cor(X1,X2)
## [1] 0.9117821

Let \(X\) be a continuous random variable with c.d.f. \(F\). Then the random variable \(Z=F(X)\) is uniformly distributed on \([0,1]\).

Using this result, simulate a uniform sample on \([0,1]\) from a Gaussian distribution, then from an exponential and a Poisson distribution (note that the Poisson distribution is discrete, so \(F(X)\) is only approximately uniform in that case).

n<-10^4
X1<-rnorm(n,2,3)
X2<-rexp(n,5)
U1<-pnorm(X1,2,3)
U2<-pexp(X2,5)

hist(U1)

hist(U2)

2.3.1 Pareto distribution

The Pareto distribution has cumulative distribution function

\[ F(x) = 1 - \left(\frac{\mu}{x}\right)^k,\quad x\geq \mu,\quad k\in\mathbb R_+. \] Determine the quantile function \(Q\). Verify that the median is \(Q(0.5)=\mu\sqrt[k]{2}\). Create a function rPareto which, for \(n, \mu, k\), returns a sample of \(n\) Pareto variables with parameters \(\mu\) and \(k\). Try the function for different values of \(\mu\) and \(k\). What is the role of these parameters?

##TODO
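A minimal sketch by inverse transform: solving \(1-(\mu/x)^k=\alpha\) for \(x\) gives \(Q(\alpha)=\mu(1-\alpha)^{-1/k}\) (so indeed \(Q(0.5)=\mu\, 2^{1/k}=\mu\sqrt[k]{2}\)), and pushing a uniform sample through \(Q\) yields Pareto variables:

```r
# Inverse transform: Q(alpha) = mu * (1 - alpha)^(-1/k)
rPareto <- function(n, mu, k) {
  u <- runif(n)
  mu * (1 - u)^(-1 / k)
}

set.seed(1)
x <- rPareto(10^5, mu = 1, k = 3)
min(x)     # always >= mu: mu is the scale (left endpoint of the support)
median(x)  # close to mu * 2^(1/3); k controls the heaviness of the tail
```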

2.4 Central limit theorem (CLT), convergence in distribution

Let \(X_1,\ldots,X_n\) be an i.i.d. sample with expectation \(\mathbb E(X)=\mu\) and variance \(Var(X) =\sigma^2\). Then we have the following convergence in distribution:
\[ \sqrt{n}(\bar X_n - \mu) \overset{\mathcal D}{\rightarrow} \mathcal N(0,\sigma^2) \quad\text{as } n\rightarrow\infty. \] That is to say, the random variable \(\bar X_n\) is approximately normally distributed with mean \(\mu\) and variance \(\frac{\sigma^2}{n}\).

2.4.1 Poisson distribution

Using a ‘for’ loop, simulate \(M=10^4\) samples of size \(n=1000\) from a Poisson distribution with parameter \(\lambda=5\) and verify the theorem. Note that for a Poisson distribution, the expectation and the variance are both \(\lambda\): \(\mu = \mathbb EX= var(X)= \lambda\). Plot the histogram of the statistic together with the density of the Gaussian limit.

M=10^4;n=10^3
Z=NULL;
lambda=5
mu=lambda
sigma=sqrt(lambda)
for(m in 1:M){
 X<-rpois(n,lambda)
 Z[m]<-sqrt(n)*(mean(X) - mu)
}
hist(Z,breaks=30,freq=FALSE)
plot(function(x) dnorm(x,0,sigma),add=TRUE,col="red",xlim=c(min(Z),max(Z)))

2.4.2 Exponential distribution

Try now with an exponential distribution with rate parameter \(\lambda=5\). Note that for an exponential distribution the expectation is \(\mu = 1/\lambda\) and the variance is \(1/\lambda^2\).
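The Poisson scheme above carries over directly; a sketch with rexp, where \(\mu = 1/\lambda = 0.2\) and \(\sigma = \sqrt{1/\lambda^2} = 0.2\):

```r
M <- 10^4; n <- 10^3
lambda <- 5
mu    <- 1 / lambda   # expectation of Exp(5)
sigma <- 1 / lambda   # sd = sqrt(1/lambda^2)
Z <- numeric(M)
for (m in 1:M) {
  X <- rexp(n, lambda)
  Z[m] <- sqrt(n) * (mean(X) - mu)  # the CLT statistic
}
hist(Z, breaks = 30, freq = FALSE)
plot(function(x) dnorm(x, 0, sigma), add = TRUE, col = "red",
     xlim = range(Z))  # overlay the N(0, sigma^2) limit density
```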

2.5 Law of large numbers (LLN), almost sure convergence

Let \(X_1,\ldots,X_n\) be an i.i.d. sample with expectation \(\mu\). Then \[P(\bar X_n \rightarrow \mu)=1.\] We say that \(\bar X_n\) converges almost surely (a.s.) to \(\mu\).

Simulate a sample of size \(n=10^4\) from the exponential distribution with parameter \(\lambda=2\). For \(k\) varying from 1 to \(n\), evaluate the mean value \(Z_k=\bar X_k\). Plot the sequence \(Z_k\) and add the horizontal line \(y=0.5\) (use abline(h=0.5)).

set.seed(2)
n=10^4
#TODO
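A sketch for this exercise, computing all the running means \(Z_k=\bar X_k\) at once with cumsum (the seed is repeated so the snippet is self-contained):

```r
set.seed(2)
n <- 10^4
X <- rexp(n, rate = 2)   # E(X) = 1/lambda = 0.5
Z <- cumsum(X) / (1:n)   # running means Z_k = mean(X_1, ..., X_k)
plot(1:n, Z, type = "l", xlab = "k", ylab = expression(bar(X)[k]))
abline(h = 0.5, col = "red")  # the a.s. limit 1/lambda = 0.5
```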

2.6 Multivariate Gaussian distribution

2.6.1 Cholesky decomposition

Any symmetric positive-definite matrix \(\Sigma\) can be decomposed as the product of a lower triangular matrix \(L\) (with zeros above the diagonal) and its transpose: \[ \Sigma=LL^T. \] Finding this triangular matrix requires a specific linear-algebra algorithm (referred to as the Cholesky decomposition).

Let \[ \Sigma= \left(\begin{matrix}1 & 0.4 \\0.4 & 1\end{matrix}\right). \] Using the eigen R function, which provides the eigenvalues of a matrix, check that the matrix \(\Sigma\) is positive definite (note that a symmetric matrix is positive definite if and only if all its eigenvalues are strictly positive). Use the R function chol to get the triangular factor. Compute the matrix product to check that \(\Sigma=LL^T\). The R matrix product is %*% and the transpose operation is t().

sigma=matrix(c(1,0.4,0.4,1),2,2)
eigen(sigma)
## eigen() decomposition
## $values
## [1] 1.4 0.6
## 
## $vectors
##           [,1]       [,2]
## [1,] 0.7071068 -0.7071068
## [2,] 0.7071068  0.7071068
L=t(chol(sigma))
L%*%t(L)
##      [,1] [,2]
## [1,]  1.0  0.4
## [2,]  0.4  1.0

2.6.2 Simulation of the multivariate Gaussian distribution

Let \(X=(X_1,X_2)\) be a random vector. We say that \(X\) is multivariate Gaussian with mean value \(m=(m_1,m_2)\) and covariance matrix \(\Sigma\) if the density of \(X\) is

\[ f(x,m,\Sigma)=\frac{1}{2 \pi \sqrt{\operatorname{det} \Sigma}} \exp \left(-\frac{1}{2}(x-m)^t \Sigma^{-1}(x-m)\right),\quad x\in \mathbb R^2 \]

library(rvinecopulib)

Cn <- bicop_dist(family = "gaussian", parameters =0.8)

contour(Cn,col="blue",main="contour-plot of bivariate gaussian distribution")

## plots
plot(Cn,margins="norm",main="Bivariate gaussian density") # surface plot of copula density

To simulate a multivariate Gaussian sample with mean value \(m=(m_1,m_2)\) and covariance matrix \(\Sigma\), we can use the Cholesky decomposition. Indeed, consider \(Z_1\) and \(Z_2\), two independent random variables from the Gaussian distribution with mean value 0 and sd 1, and let \(Z=(Z_1,Z_2)\). Then the vector \(X=L Z + m\) is multivariate Gaussian with covariance matrix \(LL^T=\Sigma\) and mean value \(m\).
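A sketch of this construction in base R, checking the empirical moments against \(\Sigma\) and \(m\) (recall that chol() returns the upper triangular factor, hence the transpose):

```r
set.seed(3)
n <- 10^4
sigma <- matrix(c(1, 0.4, 0.4, 1), 2, 2)
L <- t(chol(sigma))   # lower triangular factor: sigma = L %*% t(L)
m <- c(0, 0)
Z <- matrix(rnorm(2 * n), nrow = 2)  # each column: two independent N(0,1)
X <- t(L %*% Z + m)   # n x 2 sample of the bivariate Gaussian
cov(X)       # close to sigma
colMeans(X)  # close to m
```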

  • Generate a sample of size \(n\) from the multivariate Gaussian distribution with covariance \(\Sigma\) and mean value \(m=c(0,0)\). To do that, use the Cholesky decomposition (optional) or use the R package mvtnorm:
##or using mvtnorm
library(mvtnorm)
n=10^4 #sample size
sigma=matrix(c(1,0.4,0.4,1),2,2) #matrix sigma
m=c(0,0) #mean value vector m
X<-rmvnorm(n,m,sigma)
  • Plot the sample and the two histograms

  • Calculate the mean value and the empirical variance of the two components of the sample, and the empirical covariance between the two components.

  • Start again with \[ \Sigma_2= \left(\begin{matrix}1 & 0.1 \\0.1 & 1\end{matrix}\right) \text{ and } \Sigma_3= \left(\begin{matrix}1 & 0.9 \\0.9 & 1\end{matrix}\right). \]

  • Use the following code to add the ‘contour-plot’ of the distribution

library(rvinecopulib)

plot(X[,1],X[,2],main=paste("rho=",0.4))
Cn <- bicop_dist(family = "gaussian", rotation = 0, parameters =0.4)
contour(Cn,col="blue",add=TRUE)

2.7 Other multivariate distributions

Simulate a bivariate sample \((X_1,Y_1),\ldots,(X_n,Y_n)\) in which \(X_1,\ldots,X_n\) are i.i.d. from an exponential distribution with rate parameter \(\lambda=3\), \(Y_1,\ldots,Y_n\) are i.i.d. from a Gaussian distribution with mean value \(\mu=2\) and variance \(\sigma^2=1\), and such that \(X_i\) and \(Y_i\) are correlated.

To do that:

  • Simulate a sample \((X_1,Y_1),\ldots,(X_n,Y_n)\) following a multivariate Gaussian distribution with mean value \(m=(2,2)\) and covariance 0.4 between the components, using the R package mvtnorm and the function rmvnorm.

  • Transform \(X_1,\ldots,X_n\) in order to obtain a uniform sample, following Exercise 2.

  • Transform the uniform sample in order to obtain an exponential sample following Exercise 2.

  • Plot \(Y\) vs \(X\). Plot the histograms of \(X_1,\ldots,X_n\) and \(Y_1,\ldots,Y_n\).

  • Print the Pearson’s and Spearman’s correlations.

  • Use the following R code to obtain the scatterplot of the distribution and the histograms:

xhist <- hist(X[,1], plot=FALSE)
yhist <- hist(X[,2], plot=FALSE)
top <- max(c(xhist$density, yhist$density))
nf <- layout(matrix(c(2,0,1,3),2,2,byrow=TRUE), c(3,1), c(1,3), TRUE)
par(mar=c(4,4,1,1))
#image(vx,vy,z,col=rev(heat.colors(101)))
plot(X,cex=0.2)
#points(X,cex=.2)
par(mar=c(0,3,1,1))
barplot(xhist$density, axes=FALSE, ylim=c(0, top), space=0)
lines((density(X[,1])$x-xhist$breaks[1])/diff(xhist$breaks)[1],
      dexp(density(X[,1])$x,3),col="red")
par(mar=c(3,0,1,1))
barplot(yhist$density, axes=FALSE, xlim=c(0, top), space=0, 
        horiz=TRUE)
lines(dnorm(density(X[,2])$x,2,1),(density(X[,2])$x-yhist$breaks[1])/
        diff(yhist$breaks)[1],col="red")
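Putting the steps of this section together, the whole chain (correlated Gaussian pair, then uniform margin, then exponential margin) can be sketched as follows; the Cholesky construction of Section 2.6.2 is used in place of rmvnorm so the snippet needs only base R, but rmvnorm(n, c(2,2), sigma) would give the same first step:

```r
set.seed(4)
n <- 10^4
sigma <- matrix(c(1, 0.4, 0.4, 1), 2, 2)
L <- t(chol(sigma))                         # Cholesky factor of sigma
G <- t(L %*% matrix(rnorm(2 * n), 2)) + 2   # Gaussian pair, mean (2, 2)
U <- pnorm(G[, 1], 2, 1)   # step 2: F(X) is uniform on [0, 1]
X <- qexp(U, rate = 3)     # step 3: inverse transform to Exp(3)
Y <- G[, 2]                # Gaussian margin N(2, 1)
plot(X, Y, cex = 0.2); hist(X); hist(Y)
cor(X, Y)                        # Pearson correlation
cor(X, Y, method = "spearman")   # Spearman: invariant to monotone transforms
```

Spearman’s correlation is unchanged by the monotone transforms pnorm and qexp, so it reflects the dependence of the underlying Gaussian pair, while Pearson’s correlation is reduced by the non-linear exponential margin.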