1 Probability reminder

1.1 Distribution functions

  • For a continuous random variable \(X\), the probability density function (p.d.f.) is the function \(f\) whose area under the curve between \(a\) and \(b\) gives the probability:

\[ P(a\leq X\leq b)= \int_{a}^b f(x)dx. \]

  • For a discrete random variable \(X\) taking values in a discrete space \(\xi\), the probability mass function (p.m.f.) is given by the probabilities \[P(X = k), k\in\xi\]

  • the cumulative distribution function c.d.f. is

\[ F(x) = P(X\leq x) \]

  • the quantile function is

\[ Q(\alpha) = F^{-1}(\alpha);\quad \text{that is to say, the value of $x$ such that } P(X\leq x) = \alpha. \] N.B.: \(Q(0.5)\) is the median of \(X\), \(Q(0.25)\) is the first quartile and \(Q(0.75)\) is the third quartile.
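As a quick illustration (previewing the R functions of Section 2.1), the c.d.f. and the quantile function are inverses of each other; a small check for the standard Gaussian distribution:

```r
# The quantile function is the inverse of the c.d.f.: Q(F(x)) = x
# for a continuous distribution.
x <- 1.3
alpha <- pnorm(x)            # F(x) = P(X <= x)
qnorm(alpha)                 # recovers x = 1.3
qnorm(0.5)                   # median of N(0,1): 0
c(qnorm(0.25), qnorm(0.75))  # first and third quartiles of N(0,1)
```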

1.2 Expectation, variance, covariance

For a continuous random variable taking value in \(\mathbb R\) and with density \(f\), the expectation of \(X\) is

\[ \mathbb E X = \int_{\mathbb R} xf(x) dx. \]

For a discrete variable taking values on a discrete space \(\xi\), the expectation of \(X\) is

\[ \mathbb E X = \sum_{k\in\xi} k P(X=k). \]

The variance is

\[ var(X) = \mathbb E\left((X-\mathbb E(X))^2\right) = \mathbb E(X^2)-(\mathbb E(X))^2. \]

The standard deviation is \(sd(X)=\sqrt{var(X)}\). Finally, for two random variables \(X\) and \(Y\), the covariance between \(X\) and \(Y\) is

\[ cov(X,Y)=\mathbb E\left((X-\mathbb E(X))(Y-\mathbb E(Y))\right) = \mathbb E(XY) - \mathbb E(X)\mathbb E(Y). \]
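These integrals can be checked numerically with R's integrate function; a small sketch for the exponential distribution with rate \(\lambda=2\), whose theoretical moments are \(\mathbb E X = 1/\lambda = 1/2\) and \(var(X) = 1/\lambda^2 = 1/4\):

```r
# Numerical check of E(X) and var(X) = E(X^2) - (E(X))^2 for X ~ Exp(rate = 2)
EX  <- integrate(function(x) x   * dexp(x, rate = 2), 0, Inf)$value
EX2 <- integrate(function(x) x^2 * dexp(x, rate = 2), 0, Inf)$value
EX          # expectation: 0.5
EX2 - EX^2  # variance: 0.25
```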

1.3 On a sample

For an independent and identically distributed (i.i.d.) sample \(y_1,\ldots,y_n\), estimates of the expectation and variance are the mean value (or empirical mean) and the empirical variance, given by: \[ \bar y=\frac{1}{n}\sum_{i=1}^n y_i,\quad S_y^2 = \frac{1}{n}\sum_{i=1}^n(y_i - \bar y)^2 = \bar{y^2} - (\bar y)^2. \] The standard deviation (sd) of the sample is \(sd(y_1,\ldots,y_n)=\sqrt{S^2_y}\).

In the same way, the empirical covariance between the samples \(x_1,\ldots,x_n\) and \(y_1,\ldots,y_n\) is \[ C_{x,y}=\frac{1}{n}\sum_{i=1}^n(x_i-\bar x)(y_i - \bar y) = \bar{xy} - \bar x\bar y. \]

Finally, the (Pearson’s) correlation between \(x_1,\ldots,x_n\) and \(y_1,\ldots,y_n\) is \[ r_{x,y}=\frac{C_{x,y}}{\sqrt{S_x^2S_y^2}}. \] Pearson’s correlation lies in \([-1,1]\): \(\pm 1\) for perfect linear dependence and 0 for no linear dependence.
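One caveat when checking these formulas in R: the built-in var() and cov() use the unbiased \(1/(n-1)\) denominator, not the \(1/n\) empirical formulas above, while cor() is unaffected since the convention cancels. A quick sketch:

```r
set.seed(0)
n <- 5
x <- rnorm(n); y <- rnorm(n)
S2_y <- mean(y^2) - mean(y)^2            # empirical variance (1/n denominator)
C_xy <- mean(x * y) - mean(x) * mean(y)  # empirical covariance (1/n denominator)
# R's var()/cov() use 1/(n-1); rescale to compare:
all.equal(S2_y, var(y) * (n - 1) / n)     # TRUE
all.equal(C_xy, cov(x, y) * (n - 1) / n)  # TRUE
all.equal(cor(x, y), C_xy / sqrt((mean(x^2) - mean(x)^2) * S2_y))  # TRUE
```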

2 Distributions

2.1 Generating distributions

In the following exercises, we will use the R functions:

  • rnorm, rexp, rbinom, rpois to simulate random variables with respectively Gaussian, exponential, binomial and Poisson distributions
  • pnorm, pexp, pbinom, ppois to evaluate the c.d.f. of these distributions
  • qnorm, qexp, qbinom, qpois to evaluate the quantile function, i.e. the inverse of the c.d.f.
  • dnorm, dexp, dbinom, dpois to evaluate the density or p.m.f.
plot(function(x) dnorm(x,mean=0,sd=1), xlim=c(-5,5), ylab="", main="density of the gaussian distribution")

plot(function(x) pnorm(x,mean=0,sd=1), xlim=c(-5,5), ylab="", main="c.d.f. of the gaussian distribution")

 barplot(dpois(0:10,4),ylab="Probability",xlab="k",
space=2,ylim=c(0,0.2), names.arg=0:10, main="p.m.f. of the poisson distribution with parameter 4")

 plot(function(x) ppois(x,4), xlim=c(-1,10), ylab="", main="c.d.f. of the poisson distribution with parameter 4")

2.2 Simulation of classical distributions

First, fix the seed by using the following command

set.seed(1)
  • Using the R functions rexp, rnorm, rbinom, generate i.i.d. samples \(X_1\), \(X_2\) and \(X_3\) of size \(n=10000\), respectively from the exponential distribution with rate parameter \(\lambda=2\), from the Gaussian distribution with mean value \(\mu=3\) and standard deviation \(sd=2\), and from the Bernoulli distribution with probability of success \(p=0.25\).
  • Create a data frame \(X\) (R command data.frame) which contains \(X_1\), \(X_2\) and \(X_3\). Print the top of the data frame (use head).
  • Using the R function apply, calculate the mean value and the empirical variance of the samples. Compare with the theoretical values. Evaluate the Pearson’s correlation between the three samples, and also the covariance matrix (use \(cor(X)\), \(cov(X)\)).
  • Plot the histograms of the exponential and Gaussian samples and add the corresponding density curves (use hist(X1, freq=FALSE)).
  • Print the table of frequencies (use table(X3)/n) and compare these values with \(p\) and \(1-p\).
n=10000
#?rexp
#?rnorm
#?rbinom

X1<-rexp(n,2)        # exponential sample, rate lambda = 2
X2<-rnorm(n,3,2)     # Gaussian sample, mean 3, sd 2
X3<-rbinom(n,1,0.25) # Bernoulli sample, p = 0.25

X=data.frame(exp=X1,Gauss=X2,Bernouilli=X3) 
head(X)  
##          exp    Gauss Bernouilli
## 1 0.37759092 4.869048          0
## 2 0.59082139 3.580626          0
## 3 0.07285336 3.387249          0
## 4 0.06989763 5.147031          1
## 5 0.21803431 2.286873          0
## 6 1.44748427 3.790633          0
apply(X,2,mean)
##        exp      Gauss Bernouilli 
##  0.4991806  3.0078905  0.2482000
apply(X,2,var)
##        exp      Gauss Bernouilli 
##  0.2578852  4.0025920  0.1866154
cov(X)
##                    exp        Gauss   Bernouilli
## exp         0.25788516 -0.023411310 -0.006222800
## Gauss      -0.02341131  4.002592022 -0.005486475
## Bernouilli -0.00622280 -0.005486475  0.186615422
hist(X1,breaks=30,freq=FALSE) # histogram of the exponential sample, on the density scale
plot(function(x) dexp(x,2),add=TRUE,col="red",xlim=c(0,5)) # overlay the theoretical density

hist(X2,breaks=30,freq=FALSE)
plot(function(x) dnorm(x,3,2),add=TRUE,col="red",xlim=c(-10,10))

table(X3)/n
## X3
##      0      1 
## 0.7518 0.2482
pie(table(X3))

2.3 Simulating distributions using uniform samples

Let \(U\) be a random variable with the uniform distribution on \([0,1]\). Then for any continuous cumulative distribution function \(F\), the random variable \[Z=F^{-1}(U)\quad \text{has distribution } F.\] Using this result, simulate the same exponential and Gaussian samples and answer the same questions as in Exercise 1. Use the R functions runif, qexp, qnorm.

n<-10^3
U<-runif(n,0,1)
X1<-qnorm(U,2,3)
X2<-qexp(U,5)

hist(X1,breaks=30,freq=FALSE)
plot(function(x) dnorm(x,2,3),add=TRUE,col="red",xlim=range(X1))

hist(X2,breaks=30,freq=FALSE)
plot(function(x) dexp(x,5),add=TRUE,col="red",xlim=range(X2))

cor(X1,X2)
## [1] 0.9117821

Let \(X\) be a continuous random variable with c.d.f. \(F\). Then the random variable \(Z=F(X)\) is uniformly distributed on \([0,1]\).

Using this result, simulate a uniform sample on \([0,1]\) from a Gaussian distribution, then from an exponential and a Poisson distribution (note that the Poisson distribution is discrete, so \(F(X)\) is only approximately uniform in that case).

n<-10^4
X1<-rnorm(n,2,3)
X2<-rexp(n,5)
U1<-pnorm(X1,2,3)
U2<-pexp(X2,5)

hist(U1)

hist(U2)

2.3.1 Pareto distribution

The Pareto distribution has cumulative distribution function

\[ F(x) = 1 - \left(\frac{\mu}{x}\right)^k,\quad x\geq \mu,\quad k\in\mathbb R_+. \] Determine the quantile function \(Q\). Verify that the median is \(Q(0.5)=\mu\sqrt[k]{2}\). Create a function rPareto which, for \(n, \mu, k\), returns a sample of \(n\) Pareto variables with parameters \(\mu\) and \(k\). Try the function for different values of \(\mu\) and \(k\). What is the role of these parameters?

##TODO
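A minimal sketch by inverse transform: solving \(1-(\mu/x)^k=\alpha\) for \(x\) gives \(Q(\alpha)=\mu(1-\alpha)^{-1/k}\) (so indeed \(Q(0.5)=\mu\, 2^{1/k}=\mu\sqrt[k]{2}\)), and pushing a uniform sample through \(Q\) yields Pareto variables:

```r
# Inverse transform: Q(alpha) = mu * (1 - alpha)^(-1/k)
rPareto <- function(n, mu, k) {
  u <- runif(n)
  mu * (1 - u)^(-1 / k)
}

set.seed(1)
x <- rPareto(10^5, mu = 1, k = 3)
min(x)     # always >= mu: mu is the scale (left endpoint of the support)
median(x)  # close to mu * 2^(1/3); k controls the heaviness of the tail
```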

2.4 Central limit theorem (CLT), convergence in distribution

Let \(X_1,\ldots,X_n\) be an i.i.d. sample with expectation \(\mathbb E(X)=\mu\) and variance \(Var(X) =\sigma^2\). Then we have the following convergence in distribution:
\[ \sqrt{n}(\bar X_n - \mu) \overset{\mathcal D}{\rightarrow} \mathcal N(0,\sigma^2) \quad\text{as } n\rightarrow\infty. \] That is to say, the random variable \(\bar X_n\) is approximately normally distributed with mean \(\mu\) and variance \(\frac{\sigma^2}{n}\).

2.4.1 Poisson distribution

Using a ‘for’ loop, simulate \(M=10^4\) samples of size \(n=1000\) from a Poisson distribution with parameter \(\lambda=5\) and verify the theorem. Note that for a Poisson distribution, the expectation and the variance are both \(\lambda\): \(\mu = \mathbb EX= var(X)= \lambda\). Plot the histogram of the statistic together with the density of the Gaussian limit.

M=10^4;n=10^3
Z=NULL;
lambda=5
mu=lambda
sigma=sqrt(lambda)
for(m in 1:M){
 X<-rpois(n,lambda)
 Z[m]<-sqrt(n)*(mean(X) - mu)
}
hist(Z,breaks=30,freq=FALSE)
plot(function(x) dnorm(x,0,sigma),add=TRUE,col="red",xlim=c(min(Z),max(Z)))

2.4.2 Exponential distribution

Try now with an exponential distribution with rate parameter \(\lambda=5\). Note that for an exponential distribution the expectation is \(\mu = 1/\lambda\) and the variance is \(1/\lambda^2\).
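The Poisson scheme above carries over directly; a sketch with rexp, where \(\mu = 1/\lambda = 0.2\) and \(\sigma = \sqrt{1/\lambda^2} = 0.2\):

```r
M <- 10^4; n <- 10^3
lambda <- 5
mu    <- 1 / lambda   # expectation of Exp(5)
sigma <- 1 / lambda   # sd = sqrt(1/lambda^2)
Z <- numeric(M)
for (m in 1:M) {
  X <- rexp(n, lambda)
  Z[m] <- sqrt(n) * (mean(X) - mu)  # the CLT statistic
}
hist(Z, breaks = 30, freq = FALSE)
plot(function(x) dnorm(x, 0, sigma), add = TRUE, col = "red",
     xlim = range(Z))  # overlay the N(0, sigma^2) limit density
```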

2.5 Law of large numbers (LLN), almost sure convergence

Let \(X_1,\ldots,X_n\) be an i.i.d. sample with expectation \(\mu\). Then \[P(\bar X_n \rightarrow \mu)=1.\] We say that \(\bar X_n\) converges almost surely (a.s.) to \(\mu\).

Simulate a sample of size \(n=10^4\) from the exponential distribution with parameter \(\lambda=2\). For \(k\) varying from 1 to \(n\), evaluate the mean value \(Z_k=\bar X_k\). Plot the sequence \(Z_k\) and add the horizontal line \(y=0.5\) (use abline(h=0.5)).

set.seed(2)
n=10^4
#TODO
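A sketch for this exercise, computing all the running means \(Z_k=\bar X_k\) at once with cumsum (the seed is repeated so the snippet is self-contained):

```r
set.seed(2)
n <- 10^4
X <- rexp(n, rate = 2)   # E(X) = 1/lambda = 0.5
Z <- cumsum(X) / (1:n)   # running means Z_k = mean(X_1, ..., X_k)
plot(1:n, Z, type = "l", xlab = "k", ylab = expression(bar(X)[k]))
abline(h = 0.5, col = "red")  # the a.s. limit 1/lambda = 0.5
```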

2.6 Multivariate Gaussian distribution

2.6.1 Cholesky decomposition

Any symmetric positive-definite matrix \(\Sigma\) can be decomposed as the product of a lower triangular matrix \(L\) (with zeros above the diagonal) and its transpose: \[ \Sigma=LL^T. \] Finding this triangular matrix requires a specific linear-algebra algorithm (referred to as the Cholesky decomposition).

Let \[ \Sigma= \left(\begin{matrix}1 & 0.4 \\0.4 & 1\end{matrix}\right). \] Using the eigen R function, which provides the eigenvalues of a matrix, check that the matrix \(\Sigma\) is positive definite (note that a symmetric matrix is positive definite if and only if all its eigenvalues are strictly positive). Use the R function chol to get the triangular factor. Compute the matrix product to check that \(\Sigma=LL^T\). The R matrix product is %*% and the transpose operation is t().

sigma=matrix(c(1,0.4,0.4,1),2,2)
eigen(sigma)
## eigen() decomposition
## $values
## [1] 1.4 0.6
## 
## $vectors
##           [,1]       [,2]
## [1,] 0.7071068 -0.7071068
## [2,] 0.7071068  0.7071068
L=t(chol(sigma))
L%*%t(L)
##      [,1] [,2]
## [1,]  1.0  0.4
## [2,]  0.4  1.0

2.6.2 Simulation of the multivariate Gaussian distribution

Let \(X=(X_1,X_2)\) be a random vector. We say that \(X\) is multivariate Gaussian with mean value \(m=(m_1,m_2)\) and covariance matrix \(\Sigma\) if the density of \(X\) is

\[ f(x,m,\Sigma)=\frac{1}{2 \pi \sqrt{\operatorname{det} \Sigma}} \exp \left(-\frac{1}{2}(x-m)^t \Sigma^{-1}(x-m)\right),\quad x\in \mathbb R^2 \]

library(rvinecopulib)

Cn <- bicop_dist(family = "gaussian", parameters =0.8)

contour(Cn,col="blue",main="contour-plot of bivariate gaussian distribution")

## plots
plot(Cn,margins="norm",main="Bivariate gaussian density") # surface plot of copula density

To simulate a multivariate Gaussian sample with mean value \(m=(m_1,m_2)\) and covariance matrix \(\Sigma\), we can use the Cholesky decomposition. Indeed, consider \(Z_1\) and \(Z_2\), two independent random variables from the Gaussian distribution with mean value 0 and sd 1, and let \(Z=(Z_1,Z_2)\). Then the vector \(X=L Z + m\) is multivariate Gaussian with covariance matrix \(LL^T=\Sigma\) and mean value \(m\).
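A sketch of this construction in base R, checking the empirical moments against \(\Sigma\) and \(m\) (recall that chol() returns the upper triangular factor, hence the transpose):

```r
set.seed(3)
n <- 10^4
sigma <- matrix(c(1, 0.4, 0.4, 1), 2, 2)
L <- t(chol(sigma))   # lower triangular factor: sigma = L %*% t(L)
m <- c(0, 0)
Z <- matrix(rnorm(2 * n), nrow = 2)  # each column: two independent N(0,1)
X <- t(L %*% Z + m)   # n x 2 sample of the bivariate Gaussian
cov(X)       # close to sigma
colMeans(X)  # close to m
```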

  • Generate a sample of size \(n\) from the multivariate Gaussian distribution with covariance \(\Sigma\) and mean value \(m=c(0,0)\). To do that, use the Cholesky decomposition (optional) or use the R package mvtnorm:
##or using mvtnorm
library(mvtnorm)
n=10^4 #sample size
sigma=matrix(c(1,0.4,0.4,1),2,2) #matrix sigma
m=c(0,0) #mean value vector m
X<-rmvnorm(n,m,sigma)
  • Plot the sample and the two histograms

  • Calculate the mean value and the empirical variance of the two components of the sample, and the empirical covariance between the two components.

  • Start again with \[ \Sigma_2= \left(\begin{matrix}1 & 0.1 \\0.1 & 1\end{matrix}\right) \text{ and } \Sigma_3= \left(\begin{matrix}1 & 0.9 \\0.9 & 1\end{matrix}\right). \]

  • Use the following code to add the ‘contour-plot’ of the distribution

library(rvinecopulib)

plot(X[,1],X[,2],main=paste("rho=",0.4))
Cn <- bicop_dist(family = "gaussian", rotation = 0, parameters =0.4)
contour(Cn,col="blue",add=TRUE)

2.7 Other multivariate distributions

Simulate a bivariate sample \((X_1,Y_1),\ldots,(X_n,Y_n)\) in which \(X_1,\ldots,X_n\) are i.i.d. from an exponential distribution with rate parameter \(\lambda=3\), \(Y_1,\ldots,Y_n\) are i.i.d. from a Gaussian distribution with mean value \(\mu=2\) and variance \(\sigma^2=1\), and such that \(X_i\) and \(Y_i\) are correlated.

To do that:

  • Simulate a sample \((X_1,Y_1),\ldots,(X_n,Y_n)\) following a multivariate Gaussian distribution with mean value \(m=(2,2)\) and covariance 0.4 between the components, using the R package mvtnorm and the function rmvnorm.

  • Transform \(X_1,\ldots,X_n\) in order to obtain a uniform sample, following Exercise 2.

  • Transform the uniform sample in order to obtain an exponential sample following Exercise 2.

  • Plot \(Y\) vs \(X\). Plot the histograms of \(X_1,\ldots,X_n\) and \(Y_1,\ldots,Y_n\).

  • Print the Pearson’s and Spearman’s correlations.

  • Use the following R code to obtain the scatterplot of the distribution and the histograms:

xhist <- hist(X[,1], plot=FALSE)
yhist <- hist(X[,2], plot=FALSE)
top <- max(c(xhist$density, yhist$density))
nf <- layout(matrix(c(2,0,1,3),2,2,byrow=TRUE), c(3,1), c(1,3), TRUE)
par(mar=c(4,4,1,1))
#image(vx,vy,z,col=rev(heat.colors(101)))
plot(X,cex=0.2)
#points(X,cex=.2)
par(mar=c(0,3,1,1))
barplot(xhist$density, axes=FALSE, ylim=c(0, top), space=0)
lines((density(X[,1])$x-xhist$breaks[1])/diff(xhist$breaks)[1],
      dexp(density(X[,1])$x,3),col="red")
par(mar=c(3,0,1,1))
barplot(yhist$density, axes=FALSE, xlim=c(0, top), space=0, 
        horiz=TRUE)
lines(dnorm(density(X[,2])$x,2,1),(density(X[,2])$x-yhist$breaks[1])/
        diff(yhist$breaks)[1],col="red")
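Putting the steps of this section together, the whole chain (correlated Gaussian pair, then uniform margin, then exponential margin) can be sketched as follows; the Cholesky construction of Section 2.6.2 is used in place of rmvnorm so the snippet needs only base R, but rmvnorm(n, c(2,2), sigma) would give the same first step:

```r
set.seed(4)
n <- 10^4
sigma <- matrix(c(1, 0.4, 0.4, 1), 2, 2)
L <- t(chol(sigma))                         # Cholesky factor of sigma
G <- t(L %*% matrix(rnorm(2 * n), 2)) + 2   # Gaussian pair, mean (2, 2)
U <- pnorm(G[, 1], 2, 1)   # step 2: F(X) is uniform on [0, 1]
X <- qexp(U, rate = 3)     # step 3: inverse transform to Exp(3)
Y <- G[, 2]                # Gaussian margin N(2, 1)
plot(X, Y, cex = 0.2); hist(X); hist(Y)
cor(X, Y)                        # Pearson correlation
cor(X, Y, method = "spearman")   # Spearman: invariant to monotone transforms
```

Spearman’s correlation is unchanged by the monotone transforms pnorm and qexp, so it reflects the dependence of the underlying Gaussian pair, while Pearson’s correlation is reduced by the non-linear exponential margin.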