For a continuous random variable \(X\) with density \(f\), the probability that \(X\) lies in \([a,b]\) is \[ P(a\leq X\leq b)= \int_{a}^b f(x)dx. \]
For a discrete random variable \(X\) taking values in a discrete space \(\xi\), the probability mass function (p.m.f.) gives the probabilities \[P(X = k),\quad k\in\xi.\]
The cumulative distribution function (c.d.f.) is
\[ F(x) = P(X\leq x) \]
The quantile function is \[ Q(\alpha) = F^{-1}(\alpha);\quad \text{that is, the value of $x$ such that } P(X\leq x) = \alpha. \] N.B.: \(Q(0.5)\) is the median of \(X\), \(Q(0.25)\) is the first quartile and \(Q(0.75)\) is the third quartile.
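In R the quantile functions are prefixed by q; for instance, for the standard Gaussian distribution:
qnorm(0.5) ## 0: the median of the N(0,1) distribution
qnorm(0.25) ## first quartile
qnorm(0.975) ## approximately 1.96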
For a continuous random variable taking value in \(\mathbb R\) and with density \(f\), the expectation of \(X\) is
\[ \mathbb E X = \int_{\mathbb R} xf(x) dx. \] For a discrete variable taking values in a discrete space \(\xi\), the expectation of \(X\) is \[ \mathbb E X = \sum_{k\in\xi} k P(X=k). \] The variance is \[ var(X) = \mathbb E\left((X-\mathbb E(X))^2\right) = \mathbb E(X^2)-(\mathbb E(X))^2. \] The standard deviation is \(sd(X)=\sqrt{var(X)}\). Finally, for two random variables \(X\) and \(Y\), the covariance between \(X\) and \(Y\) is \[ cov(X,Y)=\mathbb E\left((X-\mathbb E(X))(Y-\mathbb E(Y))\right) = \mathbb E(XY) - \mathbb E(X)\mathbb E(Y). \]
For an independent and identically distributed (i.i.d.) sample \(y_1,\ldots,y_n\), the expectation and the variance are estimated by the empirical mean and the empirical variance, given by: \[ \bar y=\frac{1}{n}\sum_{i=1}^n y_i,\quad S_y^2 = \frac{1}{n}\sum_{i=1}^n(y_i - \bar y)^2 = \bar{y^2} - (\bar y)^2. \] The sample standard deviation is \(sd(y_1,\ldots,y_n)=\sqrt{S^2_y}\).
In the same way, the empirical covariance between the samples \(x_1,\ldots,x_n\) and \(y_1,\ldots,y_n\) is \[ C_{x,y}=\frac{1}{n}\sum_{i=1}^n(x_i-\bar x)(y_i - \bar y) = \bar{xy} - \bar x\bar y. \]
Finally, the (Pearson’s) correlation between \(x_1,\ldots,x_n\) and \(y_1,\ldots,y_n\) is \[ r_{x,y}=\frac{C_{x,y}}{\sqrt{S_x^2S_y^2}}. \] Pearson’s correlation lies in \([-1,1]\): \(\pm 1\) indicates perfect linear dependence and 0 indicates no linear dependence.
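These empirical quantities correspond to the base R functions mean, var, sd, cov and cor; note, however, that var, sd and cov in R use the \(\frac{1}{n-1}\) normalization rather than the \(\frac{1}{n}\) used above:
x <- c(1, 2, 3, 4); y <- c(2, 1, 4, 3)
mean(x) ## empirical mean
var(x) ## variance with 1/(n-1) normalization
cov(x, y) ## covariance with 1/(n-1) normalization
cor(x, y) ## Pearson's correlation (the normalization cancels out)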
In the following exercises, we will use the R functions:
plot(function(x) dnorm(x,mean=0,sd=1), xlim=c(-5,5), ylab="", main="density of the Gaussian distribution")
plot(function(x) pnorm(x,mean=0,sd=1), xlim=c(-5,5), ylab="", main="c.d.f. of the Gaussian distribution")
barplot(dpois(0:10,4),ylab="Probability",xlab="k",
space=2,ylim=c(0,0.2), names.arg=0:10, main="p.m.f. of the Poisson distribution with parameter 4")
plot(function(x) ppois(x,4), xlim=c(-1,10), ylab="", main="c.d.f. of the Poisson distribution with parameter 4")
First, fix the seed using the following command:
set.seed(1)
n=10000
#?rexp
#?rnorm
#?rbinom
X1<-rexp(n,2) ## sample of size n from the exponential distribution with rate 2
X2<-rnorm(n,3,2) ## sample of size n from the Gaussian distribution with mean 3 and sd 2
X3<-rbinom(n,1,0.25) ## sample of size n from the Bernoulli distribution with parameter 0.25
X=data.frame(exp=X1,Gauss=X2,Bernoulli=X3)
head(X)
## exp Gauss Bernoulli
## 1 0.37759092 4.869048 0
## 2 0.59082139 3.580626 0
## 3 0.07285336 3.387249 0
## 4 0.06989763 5.147031 1
## 5 0.21803431 2.286873 0
## 6 1.44748427 3.790633 0
apply(X,2,mean)
## exp Gauss Bernoulli
## 0.4991806 3.0078905 0.2482000
apply(X,2,var)
## exp Gauss Bernoulli
## 0.2578852 4.0025920 0.1866154
cov(X)
## exp Gauss Bernoulli
## exp 0.25788516 -0.023411310 -0.006222800
## Gauss -0.02341131 4.002592022 -0.005486475
## Bernoulli -0.00622280 -0.005486475 0.186615422
hist(X1,breaks=30,freq=FALSE) ## histogram of the exponential sample on the density scale
plot(function(x) dexp(x,2),add=TRUE,col="red",xlim=c(0,5)) ## overlay the true exponential density
hist(X2,breaks=30,freq=FALSE)
plot(function(x) dnorm(x,3,2),add=TRUE,col="red",xlim=c(-10,10))
table(X3)/n
## X3
## 0 1
## 0.7518 0.2482
pie(table(X3))
Let \(U\) be a random variable distributed from the uniform distribution on \([0,1]\). Then for any continuous cumulative distribution function \(F\), the random variable \[Z=F^{-1}(U)\quad \text{has distribution } F.\] Using this result, simulate the same exponential and Gaussian samples and answer the same questions as in exercise 1. Use the R functions runif, qexp and qnorm.
n<-10^3
U<-runif(n,0,1)
X1<-qnorm(U,2,3)
X2<-qexp(U,5)
hist(X1,breaks=30,freq=FALSE)
plot(function(x) dnorm(x,2,3),add=TRUE,col="red",xlim=range(X1))
hist(X2,breaks=30,freq=FALSE)
plot(function(x) dexp(x,5),add=TRUE,col="red",xlim=range(X2))
cor(X1,X2) ## both samples are increasing transforms of the same U, hence strongly correlated
## [1] 0.9117821
Let \(X\) be a continuous random variable with c.d.f. \(F\). Then the random variable \(Z=F(X)\) is uniformly distributed on \([0,1]\).
Using this result, simulate a uniform sample on \([0,1]\) from a Gaussian distribution, then from an exponential distribution (the result requires a continuous c.d.f., so it does not apply to a discrete distribution such as the Poisson).
n<-10^4
X1<-rnorm(n,2,3)
X2<-rexp(n,5)
U1<-pnorm(X1,2,3)
U2<-pexp(X2,5)
hist(U1)
hist(U2)
The Pareto distribution has cumulative distribution function
\[ F(x) = 1 - \left(\frac{\mu}{x}\right)^k,\quad x\geq \mu,\quad k>0. \] Determine the quantile function \(Q\). Verify that the median is \(Q(0.5)=\mu\sqrt[k]{2}\). Create a function rPareto which, given \(n\), \(\mu\) and \(k\), returns a sample of \(n\) Pareto variables with parameters \(\mu\) and \(k\). Try the function for different values of \(\mu\) and \(k\). What is the role of these parameters?
Inverting \(F\), \(\alpha = 1-(\mu/x)^k\) gives \(Q(\alpha)=\mu(1-\alpha)^{-1/k}\), so that \(Q(0.5)=\mu\,2^{1/k}=\mu\sqrt[k]{2}\). One possible implementation, using the inverse transform of Exercise 2:
rPareto<-function(n,mu,k) mu*(1-runif(n))^(-1/k) ## Q(U) with U uniform on [0,1]
hist(rPareto(10^4,1,3),breaks=50,freq=FALSE) ## mu sets the lower bound (scale), k controls the tail heaviness
Let \(X_1,\ldots,X_n\) be an i.i.d. sample with expectation \(\mathbb E(X)=\mu\) and variance \(var(X)=\sigma^2\). Then we have the following convergence in distribution: \[ \sqrt{n}(\bar X_n - \mu) \overset{\mathcal D}{\rightarrow} \mathcal N(0,\sigma^2) \quad\text{as } n\rightarrow\infty. \] That is to say, the random variable \(\bar X_n\) is approximately normally distributed with mean \(\mu\) and variance \(\frac{\sigma^2}{n}\).
Using a ‘for’ loop, simulate \(M=10^4\) samples of size \(n=1000\) from a Poisson distribution with parameter \(\lambda=5\) and verify the theorem. Note that for a Poisson distribution, the expectation and the variance are both \(\lambda\): \(\mu = \mathbb EX= var(X)= \lambda\). Plot the histogram of the statistic \(\sqrt{n}(\bar X_n - \mu)\) together with the Gaussian density.
M=10^4;n=10^3
Z=NULL;
lambda=5
mu=lambda
sigma=sqrt(lambda)
for(m in 1:M){
X<-rpois(n,lambda)
Z[m]<-sqrt(n)*(mean(X) - mu)
}
hist(Z,breaks=30,freq=FALSE)
plot(function(x) dnorm(x,0,sigma),add=TRUE,col="red",xlim=c(min(Z),max(Z)))
Try now with an exponential distribution with rate parameter \(\lambda=5\). Note that for an exponential distribution the expectation is \(\mu = 1/\lambda\) and the variance is \(1/\lambda^2\).
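A minimal adaptation of the previous loop (a sketch, reusing \(M\) and \(n\) from above):
lambda=5
mu=1/lambda
sigma=1/lambda ## sd of the exponential is 1/lambda
Z=NULL
for(m in 1:M){
X<-rexp(n,lambda)
Z[m]<-sqrt(n)*(mean(X) - mu)
}
hist(Z,breaks=30,freq=FALSE)
plot(function(x) dnorm(x,0,sigma),add=TRUE,col="red",xlim=range(Z))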
Let \(X_1,\ldots,X_n\) be an i.i.d. sample with expectation \(\mu\). Then \[P(\bar X_n \rightarrow \mu)=1.\] We say that \(\bar X_n\) converges almost surely (a.s.) to \(\mu\).
Simulate a sample of size \(n=10^4\) from an exponential distribution with parameter \(\lambda=2\). For \(k\) varying from 1 to \(n\), evaluate the mean value \(Z_k=\bar X_k\). Plot the sequence \(Z_k\) and add the horizontal line \(y=0.5\) (use abline(h=0.5)).
set.seed(2)
n=10^4
X<-rexp(n,2)
Z<-cumsum(X)/(1:n) ## running means Z_k = (X_1+...+X_k)/k
plot(1:n,Z,type="l",xlab="k",ylab="Z_k")
abline(h=0.5,col="red") ## the a.s. limit is mu = 1/lambda = 0.5
Any symmetric positive-definite matrix \(\Sigma\) can be decomposed into the product of a lower triangular matrix \(L\) (with zeros above the diagonal) and its transpose: \[ \Sigma=LL^T. \] Finding this triangular matrix requires a specific linear algebra algorithm (referred to as the Cholesky decomposition).
Let \[ \Sigma= \left(\begin{matrix}1 & 0.4 \\0.4 & 1\end{matrix}\right) \] Using the R function eigen, which provides the eigenvalues of a matrix, check that the matrix \(\Sigma\) is positive definite (note that a symmetric matrix is positive definite if and only if all the eigenvalues of \(\Sigma\) are strictly positive). Use the R function chol to get the triangular factor (chol returns the upper triangular matrix \(L^T\), hence the t() in the code below). Compute the matrix product to check that \(\Sigma=LL^T\). The R matrix product is %*% and the transpose operation is t().
sigma=matrix(c(1,0.4,0.4,1),2,2)
eigen(sigma)
## eigen() decomposition
## $values
## [1] 1.4 0.6
##
## $vectors
## [,1] [,2]
## [1,] 0.7071068 -0.7071068
## [2,] 0.7071068 0.7071068
L=t(chol(sigma))
L%*%t(L)
## [,1] [,2]
## [1,] 1.0 0.4
## [2,] 0.4 1.0
Let \(X=(X_1,X_2)\) be a random vector. We say that \(X\) is multivariate Gaussian with mean value \(m=(m_1,m_2)\) and covariance matrix \(\Sigma\) if the density of \(X\) is
\[ f(x;m,\Sigma)=\frac{1}{2 \pi \sqrt{\operatorname{det} \Sigma}} \exp \left(-\frac{1}{2}(x-m)^T \Sigma^{-1}(x-m)\right),\quad x\in \mathbb R^2 \]
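As a quick check of this formula (a sketch using the function dmvnorm from the mvtnorm package, which is loaded later): at \(x=m\) the exponential term equals 1, so the density reduces to \(1/(2\pi\sqrt{\operatorname{det}\Sigma})\).
library(mvtnorm)
sigma=matrix(c(1,0.4,0.4,1),2,2)
dmvnorm(c(0,0),mean=c(0,0),sigma=sigma) ## density evaluated at x = m
1/(2*pi*sqrt(det(sigma))) ## same value from the formula above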
library(rvinecopulib)
Cn <- bicop_dist(family = "gaussian", parameters =0.8)
contour(Cn,col="blue",main="contour-plot of the bivariate Gaussian distribution")
## plots
plot(Cn,margins="norm",main="Bivariate Gaussian density") # surface plot of the bivariate Gaussian density
To simulate a multivariate Gaussian sample with mean value \(m=(m_1,m_2)\) and covariance matrix \(\Sigma\) we can use the Cholesky decomposition. Indeed, consider \(Z_1\) and \(Z_2\), two independent random variables from the Gaussian distribution with mean value 0 and sd 1, and let \(Z=(Z_1,Z_2)\). Then the vector \(X=LZ + m\) is multivariate Gaussian with covariance matrix \(\Sigma\) and mean value \(m\), since \(cov(LZ)=L\,cov(Z)\,L^T=LL^T=\Sigma\).
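A minimal sketch of this construction, reusing the matrix sigma and the lower triangular factor L computed above (with \(m=(0,0)\) and \(M\) samples):
Z<-matrix(rnorm(2*M),nrow=2) ## two independent N(0,1) samples, one per row
X<-t(L%*%Z) ## each column LZ is N(0,Sigma); add m for a nonzero mean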
##or using mvtnorm
library(mvtnorm)
sigma=matrix(c(1,0.4,0.4,1),2,2) #matrix sigma
m=c(0,0) #mean value vector m
X<-rmvnorm(M,m,sigma)
Plot the sample and the two histograms.
Calculate the mean value and the empirical variance of the two components of the sample, and the empirical covariance between the two components.
Start again with \[ \Sigma_2= \left(\begin{matrix}1 & 0.1 \\0.1 & 1\end{matrix}\right) \text{ and } \Sigma_3= \left(\begin{matrix}1 & 0.9 \\0.9 & 1\end{matrix}\right). \]
Use the following code to add the ‘contour-plot’ of the distribution
library(rvinecopulib)
plot(X[,1],X[,2],main=paste("rho=",0.4))
Cn <- bicop_dist(family = "gaussian", rotation = 0, parameters =0.4)
contour(Cn,col="blue",add=TRUE)
Simulate a bivariate sample \((X_1,Y_1),\ldots,(X_n,Y_n)\) in which \(X_1,\ldots,X_n\) are i.i.d. and follow an exponential distribution with rate parameter \(\lambda=3\), \(Y_1,\ldots,Y_n\) are i.i.d. and follow a Gaussian distribution with mean value \(\mu=2\) and variance \(\sigma^2=1\), and such that \(X_i\) and \(Y_i\) are correlated.
To do that:
Simulate \((X_1,Y_1),\ldots,(X_n,Y_n)\) following a bivariate Gaussian distribution with mean value \(m=(2,2)\) and covariance 0.4 using the R package mvtnorm and the function rmvnorm.
Transform \(X_1,\ldots,X_n\) in order to obtain a uniform sample following Exercise 2.
Transform the uniform sample in order to obtain an exponential sample following Exercise 2.
Plot \(Y\) vs \(X\). Plot the histograms of \(X_1,\ldots,X_n\) and \(Y_1,\ldots,Y_n\).
Print the Pearson’s correlation and the Spearman’s correlation. A sketch of these steps is given below.
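One possible implementation of these steps (a sketch; the matrix X built below, with the exponential sample in the first column and the Gaussian sample in the second, is reused by the plotting code that follows):
library(mvtnorm)
n<-10^4
sigma<-matrix(c(1,0.4,0.4,1),2,2)
XY<-rmvnorm(n,c(2,2),sigma) ## bivariate Gaussian, covariance 0.4, N(2,1) margins
U<-pnorm(XY[,1],2,1) ## Gaussian -> uniform (c.d.f. transform of Exercise 2)
X<-cbind(qexp(U,3),XY[,2]) ## uniform -> exponential with rate 3; Y stays Gaussian
cor(X[,1],X[,2]) ## Pearson's correlation
cor(X[,1],X[,2],method="spearman") ## Spearman's correlation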
Use the following R code to draw the scatterplot of the distribution and the histograms:
xhist <- hist(X[,1], plot=FALSE)
yhist <- hist(X[,2], plot=FALSE)
top <- max(c(xhist$density, yhist$density))
nf <- layout(matrix(c(2,0,1,3),2,2,byrow=TRUE), c(3,1), c(1,3), TRUE) ## layout: scatterplot with marginal histograms
par(mar=c(4,4,1,1))
#image(vx,vy,z,col=rev(heat.colors(101)))
plot(X,cex=0.2)
#points(X,cex=.2)
par(mar=c(0,3,1,1))
barplot(xhist$density, axes=FALSE, ylim=c(0, top), space=0)
lines((density(X[,1])$x-xhist$breaks[1])/diff(xhist$breaks)[1],
dexp(density(X[,1])$x,3),col="red") ## exponential density rescaled to the barplot's x coordinates
par(mar=c(3,0,1,1))
barplot(yhist$density, axes=FALSE, xlim=c(0, top), space=0,
horiz=TRUE)
lines(dnorm(density(X[,2])$x,2,1),(density(X[,2])$x-yhist$breaks[1])/
diff(yhist$breaks)[1],col="red")