---
title: "Some numerical methods with R, TD1"
author: Tom Rohmer (tom.rohmer@inrae.fr)
date: "2024-01-09"
output:
  html_document:
    number_sections: yes
    toc: yes
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

This TD is to be written in R Markdown:

* File > New File > R Markdown (complete title, author, date, HTML).
* Save your file in a directory of your choice.
* Write your R code in the second 'chunk' (instead of the example 'summary(cars)').
* Add your comments outside of the 'chunks'.
* Section titles start with # and subsection titles start with ##.
* Use the 'Knit' button to obtain your HTML file.

# Probability reminders

A random variable $X$ is a variable whose value is determined by the realization of a phenomenon, experiment or random event.

## Distribution functions

For a discrete random variable $X$ taking values in a space $\xi$, the probability mass function (**p.m.f.**) gives the probabilities
$$P(X = k), \quad k\in\xi.$$

Classical discrete distributions:

* Uniform distribution on $\{a,\ldots,b\}$ (e.g. roll of a die)
* Bernoulli distribution (e.g. flip of a coin)
* Binomial distribution (e.g. tossing a coin multiple times)
* Poisson distribution (counting law, e.g. number of claims in insurance)

For a continuous random variable $X$, the probability density function (**p.d.f.**) is the mathematical function $f$ whose area under the curve between $a$ and $b$ gives the probability:
$$ P(a\leq X\leq b)= \int_{a}^b f(x)dx. $$

Classical continuous distributions:

* Uniform distribution on $[a,b]$ (e.g. draw a real number in the interval $[a,b]$)
* Gaussian distribution (e.g. heights of students in a class)
* Exponential distribution (times between events or arrivals, e.g. time between arrivals at a bus stop)

The cumulative distribution function (**c.d.f.**) is
$$ F(x) = P(X\leq x) $$

* The quantile function is
$$ Q(\alpha) = F^{-1}(\alpha);\quad \text{that is to say, the value of $x$ such that } P(X\leq x) = \alpha.
$$

N.B.: $Q(0.5)$ is the median of $X$, $Q(0.25)$ is the first quartile and $Q(0.75)$ is the third quartile.

## Expectation, variance, covariance

For a continuous random variable taking values in $\mathbb R$ with density $f$, the expectation of $X$ is
$$ \mathbb E X = \int_{\mathbb R} xf(x) dx. $$

For a discrete variable taking values in a discrete space $\xi$, the expectation of $X$ is
$$ \mathbb E X = \sum_{k\in\xi} k P(X=k). $$

The variance is
$$ var(X) = \mathbb E\left((X-\mathbb E(X))^2\right) = \mathbb E(X^2)-(\mathbb E(X))^2 $$

The standard deviation is $sd(X)=\sqrt{var(X)}$.

Finally, for two random variables $X$ and $Y$, the covariance between $X$ and $Y$ is
$$ cov(X,Y)=\mathbb E\left((X-\mathbb E(X))(Y-\mathbb E(Y))\right) = \mathbb E(XY) - \mathbb E(X)\mathbb E(Y) $$

## On a sample

For an independent and identically distributed (i.i.d.) sample $y_1,\ldots,y_n$, estimates of the expectation and variance are the mean value (or average / empirical mean) and the empirical variance, given by:
$$ \bar y=mean(y_1,\ldots,y_n)=\frac{1}{n}\sum_{i=1}^n y_i,\quad S_y^2 =var(y_1,\ldots,y_n) = \frac{1}{n}\sum_{i=1}^n(y_i - \bar y)^2 = \bar{y^2} - (\bar y)^2. $$

The standard deviation (sd) of the sample is $sd(y_1,\ldots,y_n)=\sqrt{S^2_y}$.

In the same way, the empirical covariance between the samples $x_1,\ldots,x_n$ and $y_1,\ldots,y_n$ is
$$ C_{x,y}=\frac{1}{n}\sum_{i=1}^n(x_i-\bar x)(y_i - \bar y) = \bar{xy} - \bar x\bar y. $$

Finally, the (Pearson's) correlation between $x_1,\ldots,x_n$ and $y_1,\ldots,y_n$ is
$$ r_{x,y}=\frac{C_{x,y}}{\sqrt{S_x^2S_y^2}} $$

Pearson's correlation lies in $[-1,1]$: $\pm 1$ for perfect linear dependence and 0 for no linear dependence. In R, use the functions **mean, var, sd, cov, cor** (note that var, sd and cov use the denominator $n-1$ rather than $n$).
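As a quick illustration (a minimal sketch with a toy sample of made-up values, not part of the exercises), these functions can be checked against the formulas above:

```{r}
# Toy samples (hypothetical values, for illustration only)
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 6)
n <- length(x)

mean(x)                   # empirical mean: 3
var(x)                    # R uses denominator n-1: 2.5
sum((x - mean(x))^2) / n  # empirical variance with denominator n: 2
cov(x, y)                 # empirical covariance (denominator n-1)
cor(x, y)                 # Pearson's correlation, always in [-1, 1]
```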
# Generating distributions

In the following exercises, we will use the R functions:

* **rnorm, rexp, rbinom, rpois** to **simulate** random variables with respectively Gaussian, exponential, binomial and Poisson distributions
* **pnorm, pexp, pbinom, ppois** to evaluate the **c.d.f.** of random variables with respectively Gaussian, exponential, binomial and Poisson distributions
* **qnorm, qexp, qbinom, qpois** to evaluate the **quantile function**, i.e. the inverse of the c.d.f., of random variables with respectively Gaussian, exponential, binomial and Poisson distributions
* **dnorm, dexp, dbinom, dpois** to evaluate the **density or p.m.f.** of random variables with respectively Gaussian, exponential, binomial and Poisson distributions

```{r,echo=TRUE}
plot(function(x) dnorm(x,mean=0,sd=1), xlim=c(-5,5), ylab="",
     main="density of the Gaussian distribution")
plot(function(x) pnorm(x,mean=0,sd=1), xlim=c(-5,5), ylab="",
     main="c.d.f. of the Gaussian distribution")
barplot(dpois(0:10,4), ylab="Probability", xlab="k", space=2, ylim=c(0,0.2),
        names.arg=0:10, main="p.m.f. of the Poisson distribution with parameter 4")
plot(function(x) ppois(x,4), xlim=c(-1,10), ylab="",
     main="c.d.f. of the Poisson distribution with parameter 4")
```

## Simulation of classical distributions

First, fix the seed using the following command:

```{r,echo=TRUE}
set.seed(1)
```

* Using the R functions rexp, rnorm and rbinom, generate i.i.d. samples $X_1$, $X_2$ and $X_3$ of size $n=10000$, respectively from the exponential distribution with rate parameter $\lambda=2$, from the Gaussian distribution with mean $\mu=3$ and standard deviation $sd=2$, and from the Bernoulli distribution with probability of success $p=0.25$.
* Create a data frame $X$ (R command data.frame) which contains $X_1$, $X_2$ and $X_3$. Print the top of the data frame (use head).
* Using the R function apply, calculate the mean value and the empirical variance of the samples. Compare with the theoretical values.
* Evaluate the Pearson's correlation between the 3 samples, as well as the covariance matrix (use cor(X), cov(X)).
* Plot the histograms of the exponential and Gaussian samples and add the corresponding density curves (use hist(X1, freq=FALSE)).
* Print the table of frequencies (use table(X3)/n) and compare these values with $p$ and $1-p$.

```{r}
n=10000
#?rexp
#?rnorm
#?rbinom
X1<-rexp(n,2)        # exponential sample with rate 2
X2<-rnorm(n,3,2)     # Gaussian sample with mean 3 and sd 2
X3<-rbinom(n,1,0.25) # Bernoulli sample with success probability 0.25
X=data.frame(exp=X1,Gauss=X2,Bernoulli=X3)
head(X)
apply(X,2,mean)
apply(X,2,var)
cov(X)
hist(X1,breaks=30,freq=FALSE) # histogram on the density scale
plot(function(x) dexp(x,2),add=TRUE,col="red",xlim=c(0,5)) # theoretical density
hist(X2,breaks=30,freq=FALSE)
plot(function(x) dnorm(x,3,2),add=TRUE,col="red",xlim=c(-10,10))
table(X3)/n
pie(table(X3))
```

## Simulating distributions using uniform samplings

Let $U$ be a random variable distributed uniformly on $[0,1]$. Then for any cumulative distribution function $F$, the random variable
$$Z=F^{-1}(U)\quad \text{has distribution } F.$$

Using this result, simulate the same exponential and Gaussian samples and answer the same questions as in exercise 1. Use the R functions runif, qexp, qnorm, qbinom.

```{r}
set.seed(2)
n=10000
U1=runif(n)
#?qexp
##to complete
```

Conversely, let $X$ be a continuous random variable with c.d.f. $F$. Then the random variable $Z=F(X)$ is uniformly distributed on $[0,1]$. Using this result, simulate a uniform sample on $[0,1]$ from a Gaussian distribution, then from an exponential distribution.

```{r}
#to complete
```

## Central Limit Theorem (CLT), convergence in distribution

Let $X_1,\ldots,X_n$ be an i.i.d. sample with expectation $\mathbb E(X)=\mu$ and variance $Var(X) =\sigma^2$. Then,
$$ \sqrt{n}(\bar X_n - \mu) \rightarrow \mathcal N(0,\sigma^2) \quad\text{as } n\rightarrow\infty. $$

We say that the statistic $\sqrt{n}(\bar X_n - \mu)$ converges in distribution to a Gaussian distribution.
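As a quick illustration of the theorem (a sketch using a uniform distribution, deliberately different from the distributions in the exercise below; the seed, $M$ and $n$ are arbitrary choices):

```{r}
set.seed(123)
M <- 2000; n <- 500
mu <- 0.5; sigma <- sqrt(1/12)  # mean and sd of the Uniform(0,1) distribution
# For each of the M replicates, compute the standardized statistic
Z <- replicate(M, sqrt(n) * (mean(runif(n)) - mu))
# Z should be approximately N(0, sigma^2): compare empirical moments
mean(Z); sd(Z)
hist(Z, breaks=30, freq=FALSE)
plot(function(x) dnorm(x, 0, sigma), add=TRUE, col="red", xlim=c(-1.5,1.5))
```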
Using a 'for' loop, simulate $M=10^4$ samples of size $n=1000$ from a Poisson distribution with parameter $\lambda=5$ and verify the theorem. Plot the histogram of the statistic together with the density of the Gaussian.

```{r}
M=10^4; n=10^3
Z=rep(0,M) # initialization
for(m in 1:M){
  ##complete the loop
  ## Z[m]<-
}
#hist(Z,freq=FALSE)
#plot(function(x) dnorm(x,0,..),add=TRUE,col="red",xlim=c(-6,6))
```

Note that for a Poisson distribution the expectation and the variance are both equal to $\lambda=5$, so the standard deviation is $\sqrt{\lambda}=\sqrt{5}$.

Now try with an exponential distribution with rate parameter $\lambda=5$. Note that for an exponential distribution the expectation is $1/\lambda=1/5$ and the variance is $1/\lambda^2=1/25$ (so the sd is $1/5$).

## Law of large numbers (LLN), almost sure convergence

Let $X_1,\ldots,X_n$ be an i.i.d. sample with expectation $\mu$. Then
$$P(\bar X_n \rightarrow \mu)=1.$$

We say that $\bar X_n$ converges almost surely (a.s.) to $\mu$.

Simulate a sample of size $n=10^4$ from the exponential distribution with parameter $\lambda=2$. For $k$ varying from 1 to $n$, evaluate the mean value $Z_k=\bar X_k$. Plot the sequence $Z_k$ and add the horizontal line $y=0.5$ (use abline(h=0.5)).

```{r}
#TO DO
```
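One possible approach for this exercise (a sketch, with an arbitrary seed; cumsum gives the running means in one line, but a 'for' loop works just as well):

```{r}
set.seed(3)
n <- 10^4
X <- rexp(n, 2)          # exponential sample with rate lambda = 2
Z <- cumsum(X) / (1:n)   # Z[k] is the mean of X[1..k]
plot(Z, type="l", xlab="k", ylab="running mean")
abline(h=0.5, col="red") # theoretical expectation 1/lambda = 0.5
```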