---
title: "Some numerical methods with R, TD1"
author: Tom Rohmer (tom.rohmer@inrae.fr)
date: "2024-01-09"
output:
  html_document:
    number_sections: yes
    toc: yes
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

This TD is to be written in R Markdown:

* File > New File > R Markdown (complete title, author, date, HTML).
* Save your file in a directory of your choice.
* Write your R code in the second 'chunk' (instead of the example 'summary(cars)').
* Add your comments outside of the 'chunks'.
* Section titles start with # and subsection titles start with ##.
* Use the 'Knit' button to obtain your HTML file.

# Probability reminders

A random variable $X$ is a variable whose value is determined by the realization of a phenomenon, experiment or random event.

## Distribution functions

For a discrete random variable $X$ taking values in a space $\xi$, the probability mass function (**p.m.f.**) gives the probabilities
$$P(X = k), \quad k\in\xi.$$

Classical discrete distributions:

* Uniform distribution on $\{a,\ldots,b\}$ (e.g. roll of a die)
* Bernoulli distribution (e.g. flip of a coin)
* Binomial distribution (e.g. tossing a coin multiple times)
* Poisson distribution (counting law, e.g. number of claims in insurance)

For a continuous random variable $X$, the probability density function (**p.d.f.**) is the mathematical function $f$ whose area under the curve between $a$ and $b$ gives the probability:
$$ P(a\leq X\leq b)= \int_{a}^b f(x)dx. $$

Classical continuous distributions:

* Uniform distribution on $[a,b]$ (e.g. draw a real number in the interval $[a,b]$)
* Gaussian distribution (e.g. heights of students in a class)
* Exponential distribution (times between events or arrivals, e.g. time between arrivals at a bus stop)

The cumulative distribution function (**c.d.f.**) is
$$ F(x) = P(X\leq x) $$

* The quantile function is
$$ Q(\alpha) = F^{-1}(\alpha);\quad \text{that is to say, the value of $x$ such that } P(X\leq x) = \alpha.
$$

N.B.: $Q(0.5)$ is the median of $X$, $Q(0.25)$ is the first quartile and $Q(0.75)$ is the third quartile.

## Expectation, variance, covariance

For a continuous random variable taking values in $\mathbb R$ with density $f$, the expectation of $X$ is
$$ \mathbb E X = \int_{\mathbb R} xf(x) dx. $$

For a discrete variable taking values in a discrete space $\xi$, the expectation of $X$ is
$$ \mathbb E X = \sum_{k\in\xi} k P(X=k). $$

The variance is
$$ var(X) = \mathbb E\left((X-\mathbb E(X))^2\right) = \mathbb E(X^2)-(\mathbb E(X))^2 $$

The standard deviation is $sd(X)=\sqrt{var(X)}$.

Finally, for two random variables $X$ and $Y$, the covariance between $X$ and $Y$ is
$$ cov(X,Y)=\mathbb E\left((X-\mathbb E(X))(Y-\mathbb E(Y))\right) = \mathbb E(XY) - \mathbb E(X)\mathbb E(Y) $$

## On a sample

For an independent and identically distributed (i.i.d.) sample $y_1,\ldots,y_n$, estimates of the expectation and variance are the mean value (or average / empirical mean) and the empirical variance, given by:
$$ \bar y=mean(y_1,\ldots,y_n)=\frac{1}{n}\sum_{i=1}^n y_i,\quad S_y^2 =var(y_1,\ldots,y_n) = \frac{1}{n}\sum_{i=1}^n(y_i - \bar y)^2 = \bar{y^2} - (\bar y)^2. $$

The standard deviation (sd) of the sample is $sd(y_1,\ldots,y_n)=\sqrt{S^2_y}$.

In the same way, the empirical covariance between the samples $x_1,\ldots,x_n$ and $y_1,\ldots,y_n$ is
$$ C_{x,y}=\frac{1}{n}\sum_{i=1}^n(x_i-\bar x)(y_i - \bar y) = \bar{xy} - \bar x\bar y. $$

Finally, the (Pearson's) correlation between $x_1,\ldots,x_n$ and $y_1,\ldots,y_n$ is
$$ r_{x,y}=\frac{C_{x,y}}{\sqrt{S_x^2S_y^2}} $$

Pearson's correlation lies in $[-1,1]$: $\pm 1$ for perfect linear dependence and 0 for no linear dependence. In R, use the functions **mean, var, sd, cov, cor** (note that var, sd and cov use the denominator $n-1$ rather than $n$).
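As a quick illustration (a minimal sketch with a toy sample of made-up values, not part of the exercises), these functions can be checked against the formulas above:

```{r}
# Toy samples (hypothetical values, for illustration only)
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 6)
n <- length(x)

mean(x)                   # empirical mean: 3
var(x)                    # R uses denominator n-1: 2.5
sum((x - mean(x))^2) / n  # empirical variance with denominator n: 2
cov(x, y)                 # empirical covariance (denominator n-1)
cor(x, y)                 # Pearson's correlation, always in [-1, 1]
```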
# Generating distributions

In the following exercises, we will use the R functions:

* **rnorm, rexp, rbinom, rpois** to **simulate** random variables with respectively Gaussian, exponential, binomial and Poisson distributions
* **pnorm, pexp, pbinom, ppois** to evaluate the **c.d.f.** of random variables with respectively Gaussian, exponential, binomial and Poisson distributions
* **qnorm, qexp, qbinom, qpois** to evaluate the **quantile function**, i.e. the inverse of the c.d.f., of random variables with respectively Gaussian, exponential, binomial and Poisson distributions
* **dnorm, dexp, dbinom, dpois** to evaluate the **density or p.m.f.** of random variables with respectively Gaussian, exponential, binomial and Poisson distributions

```{r,echo=TRUE}
plot(function(x) dnorm(x,mean=0,sd=1), xlim=c(-5,5), ylab="",
     main="density of the Gaussian distribution")
plot(function(x) pnorm(x,mean=0,sd=1), xlim=c(-5,5), ylab="",
     main="c.d.f. of the Gaussian distribution")
barplot(dpois(0:10,4), ylab="Probability", xlab="k", space=2, ylim=c(0,0.2),
        names.arg=0:10, main="p.m.f. of the Poisson distribution with parameter 4")
plot(function(x) ppois(x,4), xlim=c(-1,10), ylab="",
     main="c.d.f. of the Poisson distribution with parameter 4")
```

## Simulation of classical distributions

First, fix the seed using the following command:

```{r,echo=TRUE}
set.seed(1)
```

* Using the R functions rexp, rnorm and rbinom, generate i.i.d. samples $X_1$, $X_2$ and $X_3$ of size $n=10000$, respectively from the exponential distribution with rate parameter $\lambda=2$, from the Gaussian distribution with mean $\mu=3$ and standard deviation $sd=2$, and from the Bernoulli distribution with probability of success $p=0.25$.
* Create a data frame $X$ (R command data.frame) which contains $X_1$, $X_2$ and $X_3$. Print the top of the data frame (use head).
* Using the R function apply, calculate the mean value and the empirical variance of the samples. Compare with the theoretical values.
* Evaluate the Pearson's correlation between the 3 samples, as well as the covariance matrix (use cor(X), cov(X)).
* Plot the histograms of the exponential and Gaussian samples and add the corresponding density curves (use hist(X1, freq=FALSE)).
* Print the table of frequencies (use table(X3)/n) and compare these values with $p$ and $1-p$.

```{r}
n=10000
#?rexp
#?rnorm
#?rbinom
X1<-rexp(n,2)        # exponential sample with rate 2
X2<-rnorm(n,3,2)     # Gaussian sample with mean 3 and sd 2
X3<-rbinom(n,1,0.25) # Bernoulli sample with success probability 0.25
X=data.frame(exp=X1,Gauss=X2,Bernoulli=X3)
head(X)
apply(X,2,mean)
apply(X,2,var)
cov(X)
hist(X1,breaks=30,freq=FALSE) # histogram on the density scale
plot(function(x) dexp(x,2),add=TRUE,col="red",xlim=c(0,5)) # theoretical density
hist(X2,breaks=30,freq=FALSE)
plot(function(x) dnorm(x,3,2),add=TRUE,col="red",xlim=c(-10,10))
table(X3)/n
pie(table(X3))
```

## Simulating distributions using uniform samplings

Let $U$ be a random variable distributed uniformly on $[0,1]$. Then for any cumulative distribution function $F$, the random variable
$$Z=F^{-1}(U)\quad \text{has distribution } F.$$

Using this result, simulate the same exponential and Gaussian samples and answer the same questions as in exercise 1. Use the R functions runif, qexp, qnorm, qbinom.

```{r}
set.seed(2)
n=10000
U1=runif(n)
#?qexp
##to complete
```

Conversely, let $X$ be a continuous random variable with c.d.f. $F$. Then the random variable $Z=F(X)$ is uniformly distributed on $[0,1]$. Using this result, simulate a uniform sample on $[0,1]$ from a Gaussian distribution, then from an exponential distribution.

```{r}
#to complete
```

## Central Limit Theorem (CLT), convergence in distribution

Let $X_1,\ldots,X_n$ be an i.i.d. sample with expectation $\mathbb E(X)=\mu$ and variance $Var(X) =\sigma^2$. Then,
$$ \sqrt{n}(\bar X_n - \mu) \rightarrow \mathcal N(0,\sigma^2) \quad\text{as } n\rightarrow\infty. $$

We say that the statistic $\sqrt{n}(\bar X_n - \mu)$ converges in distribution to a Gaussian distribution.
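As a quick illustration of the theorem (a sketch using a uniform distribution, deliberately different from the distributions in the exercise below; the seed, $M$ and $n$ are arbitrary choices):

```{r}
set.seed(123)
M <- 2000; n <- 500
mu <- 0.5; sigma <- sqrt(1/12)  # mean and sd of the Uniform(0,1) distribution
# For each of the M replicates, compute the standardized statistic
Z <- replicate(M, sqrt(n) * (mean(runif(n)) - mu))
# Z should be approximately N(0, sigma^2): compare empirical moments
mean(Z); sd(Z)
hist(Z, breaks=30, freq=FALSE)
plot(function(x) dnorm(x, 0, sigma), add=TRUE, col="red", xlim=c(-1.5,1.5))
```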
Using a 'for' loop, simulate $M=10^4$ samples of size $n=1000$ from a Poisson distribution with parameter $\lambda=5$ and verify the theorem. Plot the histogram of the statistic together with the density of the Gaussian.

```{r}
M=10^4; n=10^3
Z=rep(0,M) # initialization
for(m in 1:M){
  ##complete the loop
  ## Z[m]<-
}
#hist(Z,freq=FALSE)
#plot(function(x) dnorm(x,0,..),add=TRUE,col="red",xlim=c(-6,6))
```

Note that for a Poisson distribution the expectation and the variance are both equal to $\lambda=5$, so the standard deviation is $\sqrt{\lambda}=\sqrt{5}$.

Now try with an exponential distribution with rate parameter $\lambda=5$. Note that for an exponential distribution the expectation is $1/\lambda=1/5$ and the variance is $1/\lambda^2=1/25$ (so the sd is $1/5$).

## Law of large numbers (LLN), almost sure convergence

Let $X_1,\ldots,X_n$ be an i.i.d. sample with expectation $\mu$. Then
$$P(\bar X_n \rightarrow \mu)=1.$$

We say that $\bar X_n$ converges almost surely (a.s.) to $\mu$.

Simulate a sample of size $n=10^4$ from the exponential distribution with parameter $\lambda=2$. For $k$ varying from 1 to $n$, evaluate the mean value $Z_k=\bar X_k$. Plot the sequence $Z_k$ and add the horizontal line $y=0.5$ (use abline(h=0.5)).

```{r}
#TO DO
```
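One possible approach for this exercise (a sketch, with an arbitrary seed; cumsum gives the running means in one line, but a 'for' loop works just as well):

```{r}
set.seed(3)
n <- 10^4
X <- rexp(n, 2)          # exponential sample with rate lambda = 2
Z <- cumsum(X) / (1:n)   # Z[k] is the mean of X[1..k]
plot(Z, type="l", xlab="k", ylab="running mean")
abline(h=0.5, col="red") # theoretical expectation 1/lambda = 0.5
```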