Team members: Andrii Yaroshevych, Oleg Omelchuk

Part I: Markov Chain

Determine the TLN (that stands for the team lucky number) as a three-digit number: the team id number zero-padded on the left; e.g., the TLN is 028 for team id number 28. In this part, you will study the chances of seeing the TLN in random sequences of digits.

ID <- 1

TLN <- sprintf("%03d", ID)  # zero-pad the id on the left to three digits

Problem 1. In the first part, we will estimate the probability that a random digit sequence of length n contains the TLN (consider the cases n = 100, n = 200, n = 1000).

1. Estimate numerically the probability \(\hat{p_n}\) of the event that your TLN occurs in a random digit sequence \(d_1 \ldots d_n\) of length n.

Hint: Such a sequence can be generated with R command sample(0:9, n, replace=T). You will need to generate a sample of such sequences of sufficiently large size \(N\).

The unbiased estimator of the probability is the proportion of generated sequences that contain the TLN:

\[ \hat{p_n} = \frac{1}{N} \sum_{i=1}^N I(\text{sequence } i \text{ contains the TLN}) \]

TLN <- as.numeric(strsplit(TLN, "")[[1]])  # TLN as a vector of digits
N <- 10000
n <- c(100, 200, 1000)
p <- c()

# TRUE if the digit vector y occurs as a contiguous subsequence of x
is_subseq <- function(x, y) grepl(paste(y, collapse = ""), paste(x, collapse = ""))

for (i in seq_along(n)) {
  # each row of the matrix is one random digit sequence of length n[i]
  p[i] <- mean(apply(matrix(sample(0:9, n[i] * N, replace = TRUE), nrow = N), 1, is_subseq, TLN))
}

print("For n = 100, 200, 1000, the probability of the event that the TLN occurs in a random digit sequence is:")
## [1] "For n = 100, 200, 1000, the probability of the event that the TLN occurs in a random digit sequence is:"
print(p)
## [1] 0.0930 0.1825 0.6347

2. Identify the Markov chain structure with four states S0, S1, S2, S3 in this sequence, with Sk denoting the number of correct last digits (e.g., for the team id number 028 these states would be S0 = “*”, S1 = “0”, S2 = “02”, S3 = “028”). Determine the transition probability matrix P and find the limiting probability pn for the state “028”. Compare with the result obtained in part 1.

We define the following states:

  • S0: “*”
  • S1: “0”
  • S2: “00”
  • S3: “001”

The transition matrix, with rows and columns ordered S0, S1, S2, S3 (each particular digit appears with probability 0.1), is:

\[ P = \begin{pmatrix} 0.9 & 0.1 & 0 & 0 \\ 0.9 & 0 & 0.1 & 0 \\ 0.8 & 0 & 0.1 & 0.1 \\ 0 & 0 & 0 & 1 \\ \end{pmatrix} \]

P <- matrix(c(0.9, 0.1, 0, 0,
              0.9, 0, 0.1, 0,
              0.8, 0, 0.1, 0.1,
              0, 0, 0, 1), nrow = 4, byrow = TRUE)
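
The matrix above was written out by hand. As a sketch of how the same construction can be automated for an arbitrary TLN (the helper build_P below is our own illustration, not part of the assignment): state k is the number of matched trailing digits, and a new digit d moves the chain to the longest suffix of the matched prefix extended by d that is still a prefix of the TLN.

build_P <- function(tln) {
  L <- length(tln)
  P <- matrix(0, L + 1, L + 1)
  for (k in 0:(L - 1)) {
    for (d in 0:9) {
      s <- c(tln[seq_len(k)], d)  # matched prefix extended by the new digit
      nxt <- 0
      for (j in seq_len(min(length(s), L))) {  # longest suffix that is a prefix
        if (all(s[(length(s) - j + 1):length(s)] == tln[seq_len(j)])) nxt <- j
      }
      P[k + 1, nxt + 1] <- P[k + 1, nxt + 1] + 0.1  # each digit has probability 0.1
    }
  }
  P[L + 1, L + 1] <- 1  # the full match is absorbing
  P
}
build_P(TLN)  # reproduces the matrix above for TLN = (0, 0, 1)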

# Stationary distribution: left eigenvector of P for eigenvalue 1, normalized to sum to 1
calculate_by_eigenvector <- function(P) {
  vector <- eigen(t(P))$vectors[, 1]
  Re(vector / sum(vector))
}

# Repeated squaring: after n iterations this returns a row of P^(2^n),
# which converges to the limiting distribution
calculate_by_power_method <- function(P, n) {
  for (i in 1:n) {
    P <- P %*% P
  }
  P[1, ]
}

print("The limiting probability by eigenvector pn for the state “028” is:")
## [1] "The limiting probability by eigenvector pn for the state “028” is:"
print(calculate_by_eigenvector(P)[4])
## [1] 1
print("The limiting probability by power method pn for the state “028” is:")
## [1] "The limiting probability by power method pn for the state “028” is:"
print(calculate_by_power_method(P, 100)[4])
## [1] 1

Both methods give a limiting probability of 1: since S3 is absorbing, the TLN eventually occurs with probability 1. This is consistent with part 1, where the estimates \(\hat{p_n}\) grow toward 1 with n (0.093, 0.18, 0.63 for n = 100, 200, 1000).
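
As a further cross-check (our own addition, not required by the task), the exact probability that the TLN occurs within the first n digits is the S0-to-S3 entry of \(P^n\); these values come out close to the simulated estimates above:

matpow <- function(P, k) Reduce(`%*%`, replicate(k, P, simplify = FALSE))  # computes P^k
for (len in c(100, 200, 1000)) {
  cat("n =", len, ": exact p_n =", matpow(P, len)[1, 4], "\n")
}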

3. Determine approximately the sample size \(N\) which guarantees that the absolute error \(\lvert\hat{p_n} - p_n\vert\) of the estimate \(\hat{p_n}\) is below 0.03 with confidence level of at least 95 percent. Rerun the experiments for n = 1000 with the determined size \(N\) to illustrate the confidence interval and confidence level.

The unbiased estimator of the probability that a random digit sequence of length n contains the TLN is the proportion of sequences that contain it:

\[ \hat{p_n} = \frac{1}{N} \sum_{i=1}^N I(\text{sequence } i \text{ contains the TLN}) \]

Its variance is estimated by:

\[ \widehat{\text{Var}}(\hat{p_n}) = \frac{\hat{p_n}(1 - \hat{p_n})}{N} \]

Thus, with confidence \(1 - \alpha\), the absolute error \(\lvert\hat{p_n} - p_n\rvert\) of the estimate \(\hat{p_n}\) satisfies:

\[ \lvert\hat{p_n} - p_n\rvert \le z_{\alpha/2} \sqrt{\frac{\hat{p_n}(1 - \hat{p_n})}{N}} \]

The confidence interval for the probability \(p_n\) is defined by:

\[ 1 - \alpha = P\left(p_n \in \hat{p_n} \pm z_{\alpha/2} \sqrt{\frac{\hat{p_n}(1 - \hat{p_n})}{N}}\right) \]

Requiring the margin of error to be at most 0.03 gives \(N \ge z_{\alpha/2}^2\,\hat{p_n}(1 - \hat{p_n}) / 0.03^2\).

n <- 1000
# estimate p_n from the previous N = 10000 sequences
p <- mean(apply(matrix(sample(0:9, n * N, replace = TRUE), nrow = N), 1, is_subseq, TLN))
alpha <- 0.05
z <- qnorm(1 - alpha / 2)
N <- ceiling(z^2 * p * (1 - p) / 0.03^2)  # smallest N with margin of error <= 0.03

print("The sample size N which guarantees that the absolute error of the estimate is below 0.03 with confidence level of at least 95 percent is:")
## [1] "The sample size N which guarantees that the absolute error of the estimate is below 0.03 with confidence level of at least 95 percent is:"
print(N)
## [1] 987
print("The confidence interval for the probability is:")
## [1] "The confidence interval for the probability is:"
print(p + c(-1, 1) * z * sqrt(p * (1 - p) / N))
## [1] 0.6074077 0.6673923

Problem 2. In the setting of Problem 1, assume that the random digit generation stops at the first occurrence of the TLN (i.e., that the state S3 of the Markov chain is now absorbing). In this problem, you will estimate the average length of such sequences (i.e., the average time till absorption in the Markov chain).

1. Make necessary amendments to the transition probabilities matrix \(P\) above and solve the corresponding system to find the expected time \(E(T)\) till absorption.

Since the state S3 was already treated as absorbing in our formulation above, the transition matrix needs no amendments:

\[ P = \begin{pmatrix} 0.9 & 0.1 & 0 & 0 \\ 0.9 & 0 & 0.1 & 0 \\ 0.8 & 0 & 0.1 & 0.1 \\ 0 & 0 & 0 & 1 \\ \end{pmatrix} \]

The system of equations is:

\[ \begin{cases} \mu_1 = 1 + 0.9\mu_1 + 0.1\mu_2 \\ \mu_2 = 1 + 0.9\mu_1 + 0.1\mu_3 \\ \mu_3 = 1 + 0.8\mu_1 + 0.1\mu_3 + 0.1\mu_4 \\ \mu_4 = 0 \end{cases} \]

After solving the system of equations, we get:

\[ \mu_1 = 1000 \\ \mu_2 = 990 \\ \mu_3 = 890 \\ \mu_4 = 0 \]

Thus, \(E(T) = \mu_1 = 1000\).
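
The same numbers can be obtained in R by solving \((I - Q)\mu = \mathbf{1}\), where \(Q\) is the transient block of \(P\) (a short sketch of the standard fundamental-matrix computation):

Q <- P[1:3, 1:3]                     # transient-to-transient block (S0, S1, S2)
mu <- solve(diag(3) - Q, rep(1, 3))  # solves (I - Q) mu = 1
mu                                   # c(1000, 990, 890); E(T) = mu[1]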

2. Estimate numerically the expected length \(E(T)\) till the first occurrence of the TLN by running a sufficiently large number \(N\) of experiments.

Here, the unbiased estimator of \(E(T)\) is the average of the observed sequence lengths up to and including the first occurrence of the TLN.

\[ \hat{\theta} = \overline{T} = \frac{1}{N}\sum_{i=1}^{N}T_i \]

N <- 1000
expected_length <- numeric(N)

for (i in 1:N) {
  sequence <- sample(0:9, 2, replace = TRUE)  # the first two digits are random too
  len <- 2
  repeat {
    len <- len + 1
    sequence[len] <- sample(0:9, 1)
    # stop once the last three digits spell the TLN
    if (all(sequence[(len - 2):len] == TLN)) break
  }
  expected_length[i] <- len
}

print("The expected length E(T) till the first occurrence of the TLN is:")
## [1] "The expected length E(T) till the first occurrence of the TLN is:"
print(mean(expected_length))
## [1] 997.906

We can see that the numerical estimate of \(E(T)\) is close to the theoretical value of 1000 (for a pattern like 001, which has no self-overlap, the expected waiting time is exactly \(10^3\)).

3. Find the sample size \(N\) which guarantees that the absolute error \(\lvert\hat{\theta} - \theta\rvert\) of the estimate does not exceed 10 with confidence level of at least 95 percent.

Hint: use Chebyshev inequality and estimate the standard deviation of \(T\) by the standard error of the sample \(T_1, \ldots, T_N\).

The standard deviation of \(T\) is estimated by the sample standard deviation of \(T_1, \ldots, T_N\):

\[ s = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}(T_i - \overline{T})^2} \]

Using the normal approximation (much less conservative than the Chebyshev bound \(N \ge s^2/(\alpha\varepsilon^2)\) suggested by the hint), the margin of error \(\varepsilon\) is:

\[ \varepsilon = \frac{s}{\sqrt{N}} z_{1 - \frac{\alpha}{2}} \]

Thus, the sample size \(N\) is:

\[ N = \frac{z_{1 - \frac{\alpha}{2}}^2 s^2}{\varepsilon^2} \]

alpha <- 0.05
z <- qnorm(1 - alpha / 2)
sigma <- sd(expected_length)  # sample standard deviation of the absorption times
epsilon <- 10

N <- ceiling((z^2 * sigma^2) / epsilon^2)

print("The sample size N which guarantees that the absolute error of the estimate does not exceed 10 with confidence level of at least 95 percent is:")
## [1] "The sample size N which guarantees that the absolute error of the estimate does not exceed 10 with confidence level of at least 95 percent is:"
print(N)
## [1] 36740

Part II: Parameter estimation

Aim: In problems 3 and 4, you will have to verify that the interval estimates produced by the known rules indeed contain the parameter with probability equal to the confidence level.

Problem 3. The expected value of the exponential distribution \(\mathcal E(\lambda)\) is \(\frac{1}{\lambda}\), so a good point estimate of the parameter \(\theta = \frac{1}{\lambda}\) is the sample mean \(\overline{x}\). A confidence interval for \(\theta\) can be formed in several ways:

Necessary libraries:

library(crayon)

1. Using the exact distribution of the statistic \(2\lambda n \overline{X}\) (show it is \(\chi^2_{2n}\) and then use quantiles of the latter to get the interval endpoints)

id <- 1

n <- 1000                     # size of each sample
m <- 500                      # number of repeated experiments
alphas <- c(0.1, 0.05, 0.01)
theta <- id / 10
lambda <- 1 / theta
datas <- matrix(rexp(n * m, rate = lambda), nrow = n)  # one sample per column

A. Verify that the confidence intervals of level \(1 - \alpha\) constructed via (1)–(3) below contain the parameter \(\theta = \frac{1}{\lambda}\) approx. \(100(1 - \alpha)\%\) of the time.

  1. Approximating using chi-square distribution.

\[\mathcal{E}(\lambda) = \mathcal{G}(1,\ 1/\lambda), \qquad \sum_{i=1}^{n} X_i \sim \mathcal{G}(n,\ 1/\lambda) = \frac{1}{2\lambda}\,\chi^2_{2n}, \qquad \text{so } 2\lambda n \overline{X} \sim \chi^2_{2n}.\]

The \(100(1-\alpha)\%\) confidence interval for \(\lambda\) then is: \[\left[\frac{\chi^2(2n,\ \alpha/2)}{2n\overline{x}},\ \frac{\chi^2(2n,\ 1-\alpha/2)}{2n\overline{x}}\right]\] where \(\chi^2(v,\ p)\) is the p-th quantile of the chi-square distribution with v degrees of freedom. (An interval for \(\theta = 1/\lambda\) is obtained by inverting the endpoints.)

rate_chi <- c(0, 0, 0)
lengths_chi <- c()
for (k in 1:3) {
    for (i in 1:m) {
        xbar <- mean(datas[, i])
        left <- qchisq(alphas[k] / 2, df = 2 * n) / (2 * xbar * n)
        right <- qchisq(1 - alphas[k] / 2, df = 2 * n) / (2 * xbar * n)

        lengths_chi <- append(lengths_chi, right - left)
        if (left <= lambda && lambda <= right) {
            rate_chi[k] <- rate_chi[k] + 1
        }
    }
}
rate_chi <- rate_chi / m  # empirical coverage rates
for (k in 1:3) {
    cat("Empirical coverage for lambda when alpha =", alphas[k], "is:", rate_chi[k], "\n")
}
## Empirical coverage for lambda when alpha = 0.1 is: 0.896 
## Empirical coverage for lambda when alpha = 0.05 is: 0.95 
## Empirical coverage for lambda when alpha = 0.01 is: 0.986

2. Using the normal approximation \(\mathscr{N}\left(\mu, \sigma^{2}\right)\) for \(\overline{\mathbf{X}}\); the parameters are \(\mu=\theta\) and \(\sigma^{2}=s^{2} / n\), where \(s^{2}=\theta^{2}\) is the population variance (i.e., variance of the original distribution \(\mathscr{E}(\lambda)\) ). In other words, we form the \(Z\)-statistics \(Z:=\sqrt{n}(\overline{\mathbf{X}}-\theta) / \theta\) and use the fact that it is approximately standard normal \(\mathscr{N}(0,1)\) to find that

\[ \mathrm{P}\left(|\theta-\overline{\mathbf{X}}| \leq z_{\beta} \theta / \sqrt{n}\right)=\mathrm{P}\left(|Z| \leq z_{\beta}\right)=2 \beta-1 . \]

in other words, \(\theta\) is with probability \(2 \beta-1\) within \(\overline{\mathbf{X}} \pm z_{\beta} \theta / \sqrt{n}\). Here \[ \mu=\theta, \qquad \sigma^2 = \theta^2, \qquad \overline{X} \text{ is approximately } \mathcal{N}(\theta,\ \theta^2/n). \]

The \((1-\alpha)\) confidence interval for \(\theta\) then is: \[\left[\overline{X} - \frac{\sigma}{\sqrt{n}}Z_{1-\alpha/2},\ \overline{X} + \frac{\sigma}{\sqrt{n}}Z_{1-\alpha/2}\right]\]

sigma <- theta  # known population standard deviation of the exponential distribution

rate_known_var <- c(0, 0, 0)
lengths_known_var <- c()
for (k in 1:3) {
    for (i in 1:m) {
        xbar <- mean(datas[, i])
        z <- qnorm(1 - alphas[k] / 2)
        epsilon <- sigma / sqrt(n) * z  # margin of error
        left <- xbar - epsilon
        right <- xbar + epsilon

        lengths_known_var <- append(lengths_known_var, right - left)
        if (left <= theta && theta <= right) {
            rate_known_var[k] <- rate_known_var[k] + 1
        }
    }
}
rate_known_var <- rate_known_var / m  # empirical coverage rates
for (k in 1:3) {
    cat("Empirical coverage for theta when alpha =", alphas[k], "is:", rate_known_var[k], "\n")
}
## Empirical coverage for theta when alpha = 0.1 is: 0.896 
## Empirical coverage for theta when alpha = 0.05 is: 0.95 
## Empirical coverage for theta when alpha = 0.01 is: 0.988

3. The confidence interval constructed above uses the unknown variance \(s^{2}=\theta^{2}\) and is of little use in practice. Instead, we can solve the double inequality

\[ |\theta-\overline{\mathbf{X}}| \leq z_{\beta} \theta / \sqrt{n} \]

for \(\theta\) and get another confidence interval of confidence level \(2 \beta-1\) that is independent of the unknown parameter.

We perform the same steps as in (2), but now estimate the standard deviation from the data. (Solving the double inequality for \(\theta\) directly, as the problem statement suggests, gives the parameter-free interval \(\left[\overline{X}/(1+z_{\beta}/\sqrt{n}),\ \overline{X}/(1-z_{\beta}/\sqrt{n})\right]\); plugging in the sample standard deviation, as we do below, is asymptotically equivalent.)
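
A minimal sketch of that direct interval (the helper solved_interval is our own illustration, not part of the assignment):

solved_interval <- function(xbar, n, alpha) {
  # solve |theta - xbar| <= z * theta / sqrt(n) for theta; no variance estimate needed
  z <- qnorm(1 - alpha / 2)
  c(xbar / (1 + z / sqrt(n)), xbar / (1 - z / sqrt(n)))
}
solved_interval(mean(datas[, 1]), n, 0.05)  # interval from the first sample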

rate_unknown_var <- c(0, 0, 0)
lengths_unknown_var <- c()
for (k in 1:3) {
    for (i in 1:m) {
        xbar <- mean(datas[, i])
        s <- sd(datas[, i])  # sample standard deviation
        z <- qnorm(1 - alphas[k] / 2)
        epsilon <- s / sqrt(n) * z  # margin of error

        left <- xbar - epsilon
        right <- xbar + epsilon
        lengths_unknown_var <- append(lengths_unknown_var, right - left)
        if (left <= theta && theta <= right) {
            rate_unknown_var[k] <- rate_unknown_var[k] + 1
        }
    }
}
rate_unknown_var <- rate_unknown_var / m  # empirical coverage rates
for (k in 1:3) {
    cat("Empirical coverage for theta when alpha =", alphas[k], "is:", rate_unknown_var[k], "\n")
}
## Empirical coverage for theta when alpha = 0.1 is: 0.892 
## Empirical coverage for theta when alpha = 0.05 is: 0.944 
## Empirical coverage for theta when alpha = 0.01 is: 0.984

B. Compare their precision (lengths).

print("The avarage length using chi-square distribution:")
## [1] "The avarage length using chi-square distribution:"
print(mean(lengths_chi))
## [1] 1.303624
print("The avarage length using normal approximation with known variance:")
## [1] "The avarage length using normal approximation with known variance:"
print(mean(lengths_known_var))
## [1] 0.01302995
print("The avarage length using normal approximation with unknown variance:")
## [1] "The avarage length using normal approximation with unknown variance:"
print(mean(lengths_unknown_var))
## [1] 0.01302515

C. Give your recommendation as to which of the three methods is the best one and explain your decision

Note that the chi-square interval is for \(\lambda\), while the two normal intervals are for \(\theta = 1/\lambda\), so the raw lengths are not directly comparable: transformed to the \(\theta\) scale, the chi-square interval has length \(\approx 1.30/\lambda^2 \approx 0.013\), about as precise as the normal ones.

The normal approximation with known variance is of little practical use, since it requires the unknown \(\theta\). Of the remaining two, we recommend the normal approximation with unknown (estimated) variance: it matches the precision of the exact chi-square interval, is simpler, and applies far beyond the exponential model.
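
This scale conversion can be checked directly (a one-line sketch using the quantities computed above):

# delta method: a lambda-interval of length L corresponds to a theta-interval
# of length approximately L / lambda^2, since |d(1/lambda)/d lambda| = 1/lambda^2
mean(lengths_chi) / lambda^2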

Problem 4. Repeat parts (2)-(4) of Problem 3 (with corresponding amendments) for a Poisson distribution \(\mathscr{P}(\theta)\).

Task and directions remain the same. In other words, you have to check that the confidence intervals constructed there contain the parameter \(\theta\) with the prescribed probability.

lambda <- theta  # for the Poisson distribution, the mean equals theta
datas <- matrix(rpois(n * m, lambda = lambda), nrow = n)  # one sample per column

A. Verify that the confidence intervals of level \(1 - \alpha\) constructed via (1)–(3) below contain the parameter \(\theta = \lambda\) approx. \(100(1 - \alpha)\%\) of the time.

  1. Approximating using chi-square distribution.

The \(100(1-\alpha)\%\) confidence interval for \(\lambda\) is: \[\left[\frac{\chi^2(2n\overline{x},\ \alpha/2)}{2n},\ \frac{\chi^2(2n\overline{x},\ 1-\alpha/2)}{2n}\right]\] where \(\chi^2(v,\ p)\) is the p-th quantile of the chi-square distribution with v degrees of freedom, and \(2n\overline{x} = 2S\) is twice the total count.

rate_chi <- c(0, 0, 0)
lengths_chi <- c()
for (k in 1:3) {
    for (i in 1:m) {
        xbar <- mean(datas[, i])
        left <- qchisq(alphas[k] / 2, df = 2 * n * xbar) / (2 * n)
        right <- qchisq(1 - alphas[k] / 2, df = 2 * n * xbar) / (2 * n)

        lengths_chi <- append(lengths_chi, right - left)
        if (left <= lambda && lambda <= right) {
            rate_chi[k] <- rate_chi[k] + 1
        }
    }
}
rate_chi <- rate_chi / m  # empirical coverage rates
for (k in 1:3) {
    cat("Empirical coverage for lambda when alpha =", alphas[k], "is:", rate_chi[k], "\n")
}
## Empirical coverage for lambda when alpha = 0.1 is: 0.884 
## Empirical coverage for lambda when alpha = 0.05 is: 0.932 
## Empirical coverage for lambda when alpha = 0.01 is: 0.988
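
The coverage sits slightly below the nominal level because both endpoints use \(\chi^2_{2S}\); the classical exact (Garwood) interval instead takes the upper endpoint from \(\chi^2_{2S+2}\). A sketch of that variant (the helper garwood is our own illustration, not part of the assignment):

garwood <- function(S, n, alpha) {
  # exact (Garwood) confidence interval for a Poisson mean, S = total count
  c(qchisq(alpha / 2, df = 2 * S) / (2 * n),
    qchisq(1 - alpha / 2, df = 2 * S + 2) / (2 * n))
}
garwood(sum(datas[, 1]), n, 0.05)  # interval from the first sample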
  2. Approximating using normal approximation with known variance.

\[ \mu=\theta, \qquad \sigma^2 = \theta, \qquad \overline{X} \text{ is approximately } \mathcal{N}(\theta,\ \theta/n). \]

The \((1-\alpha)\) confidence interval for \(\lambda\) then is: \[\left[\overline{X} - \frac{\sigma}{\sqrt{n}}Z_{1-\alpha/2},\ \overline{X} + \frac{\sigma}{\sqrt{n}}Z_{1-\alpha/2}\right]\]

sigma <- sqrt(theta)  # known population standard deviation of the Poisson distribution

rate_known_var <- c(0, 0, 0)
lengths_known_var <- c()
for (k in 1:3) {
    for (i in 1:m) {
        xbar <- mean(datas[, i])
        z <- qnorm(1 - alphas[k] / 2)
        epsilon <- sigma / sqrt(n) * z  # margin of error
        left <- xbar - epsilon
        right <- xbar + epsilon

        lengths_known_var <- append(lengths_known_var, right - left)
        if (left <= theta && theta <= right) {
            rate_known_var[k] <- rate_known_var[k] + 1
        }
    }
}
rate_known_var <- rate_known_var / m  # empirical coverage rates
for (k in 1:3) {
    cat("Empirical coverage for theta when alpha =", alphas[k], "is:", rate_known_var[k], "\n")
}
## Empirical coverage for theta when alpha = 0.1 is: 0.898 
## Empirical coverage for theta when alpha = 0.05 is: 0.948 
## Empirical coverage for theta when alpha = 0.01 is: 0.99
  3. Approximating using normal approximation with unknown variance.

We perform the same steps as in (2), but now estimate the standard deviation from the data.

rate_unknown_var <- c(0, 0, 0)
lengths_unknown_var <- c()
for (k in 1:3) {
    for (i in 1:m) {
        xbar <- mean(datas[, i])
        s <- sd(datas[, i])  # sample standard deviation
        z <- qnorm(1 - alphas[k] / 2)
        epsilon <- s / sqrt(n) * z  # margin of error

        left <- xbar - epsilon
        right <- xbar + epsilon
        lengths_unknown_var <- append(lengths_unknown_var, right - left)
        if (left <= theta && theta <= right) {
            rate_unknown_var[k] <- rate_unknown_var[k] + 1
        }
    }
}
rate_unknown_var <- rate_unknown_var / m  # empirical coverage rates
for (k in 1:3) {
    cat("Empirical coverage for theta when alpha =", alphas[k], "is:", rate_unknown_var[k], "\n")
}
## Empirical coverage for theta when alpha = 0.1 is: 0.894 
## Empirical coverage for theta when alpha = 0.05 is: 0.946 
## Empirical coverage for theta when alpha = 0.01 is: 0.988

B. Compare their precision (lengths).

print("The avarage length using chi-square distribution:")
## [1] "The avarage length using chi-square distribution:"
print(mean(lengths_chi))
## [1] 0.04115955
print("The avarage length using normal approximation with known variance:")
## [1] "The avarage length using normal approximation with known variance:"
print(mean(lengths_known_var))
## [1] 1.302995
print("The avarage length using normal approximation with unknown variance:")
## [1] "The avarage length using normal approximation with unknown variance:"
print(mean(lengths_unknown_var))
## [1] 1.302071

C. Give your recommendation as to which of the three methods is the best one and explain your decision.

All three methods give intervals of nearly identical average length and close to nominal coverage. For the Poisson case, the chi-square interval is the best choice: it is based on the exact distribution of the total count rather than on a normal approximation, and, unlike method (2), it does not require knowledge of \(\theta\).

The normal approximation with estimated variance is a reasonable alternative for large \(n\), but offers no precision advantage here.