Team members: Andrii Yaroshevych, Oleg Omelchuk
Determine the TLN (the team lucky number) as the three-digit number obtained by padding the team id number with zeros on the left; e.g., the TLN is 028 for team id number 28. In this part, you will study questions about the chances to see the TLN in random sequences of digits.
ID <- 1
# pad the team id with zeros on the left to three digits
TLN <- sprintf("%03d", ID)
Hint: such a sequence can be generated with the R command sample(0:9, n, replace = TRUE). You will need to generate a sample of such sequences of sufficiently large size \(N\).
The unbiased estimator of the probability is the proportion of sequences that contain the TLN:
\[ \hat{p}_n = \frac{1}{N} \sum_{i=1}^N I(\text{sequence } i \text{ contains the TLN}) \]
TLN <- as.numeric(strsplit(TLN, "")[[1]])
N <- 10000
n <- c(100, 200, 1000)
p <- c()
# TRUE if the digit vector y occurs as a contiguous subsequence of x
is_sebseq <- function(x, y) grepl(paste(y, collapse = ""), paste(x, collapse = ""))
for (i in seq_along(n)) {
  # N random digit sequences of length n[i], one per row
  p[i] <- mean(apply(matrix(sample(0:9, n[i] * N, replace = TRUE), nrow = N), 1, is_sebseq, TLN))
}
print("For n = 100, 200, 1000, the probability of the event that the TLN occurs in a random digit sequence is:")
## [1] "For n = 100, 200, 1000, the probability of the event that the TLN occurs in a random digit sequence is:"
print(p)
## [1] 0.0930 0.1825 0.6347
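As a rough analytic cross-check (an approximation that treats the \(n-2\) possible starting positions as independent trials; since the TLN has no self-overlap, it works well here), the probability can be approximated by \(1 - (1 - 10^{-3})^{n-2}\):

```r
# Approximate P(TLN appears) by treating each of the (n - 2) starting
# positions as an independent trial with success probability 10^-3
n <- c(100, 200, 1000)
approx_p <- 1 - (1 - 1e-3)^(n - 2)
round(approx_p, 4)  # → 0.0934 0.1797 0.6316
```

These values agree closely with the simulated proportions above.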
Consider the Markov chain whose states are given by the number of correct last digits (e.g., for the team id number 028 these states will be S0 = “*”, S1 = “0”, S2 = “02”, S3 = “028”). Determine the transition probabilities matrix P and find the limiting probability \(p_n\) for the state “028”. Compare with the result obtained in part 1.
We define the states \(S_0, S_1, S_2, S_3\) analogously for our TLN “001”. The transition matrix is:
\[ P = \begin{pmatrix} 0.9 & 0.1 & 0 & 0 \\ 0.9 & 0 & 0.1 & 0 \\ 0.8 & 0 & 0.1 & 0.1 \\ 0 & 0 & 0 & 1 \\ \end{pmatrix} \]
P <- matrix(c(0.9, 0.1, 0,   0,
              0.9, 0,   0.1, 0,
              0.8, 0,   0.1, 0.1,
              0,   0,   0,   1), nrow = 4, byrow = TRUE)
# Stationary distribution: the left eigenvector of P for eigenvalue 1, normalized to sum 1
calculate_by_eigenvector <- function(P) {
  vector <- eigen(t(P))$vectors[, 1]
  return(sapply(vector / sum(vector), Re))
}
# Repeated squaring: after n iterations this computes P^(2^n)
calculate_by_power_method <- function(P, n) {
  for (i in 1:n) {
    P <- P %*% P
  }
  return(P[1, ])
}
print("The limiting probability by eigenvector pn for the state “001” is:")
## [1] "The limiting probability by eigenvector pn for the state “001” is:"
print(calculate_by_eigenvector(P)[4])
## [1] 1
print("The limiting probability by power method pn for the state “001” is:")
## [1] "The limiting probability by power method pn for the state “001” is:"
print(calculate_by_power_method(P, 100)[4])
## [1] 1
The limiting probability of the absorbing state is 1 under both methods, which is consistent with part 1: the estimated probability of seeing the TLN grows with n (0.09, 0.18, 0.63 for n = 100, 200, 1000) and tends to 1 as \(n \to \infty\).
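The chain also recovers the finite-length probabilities from part 1: since the last state is absorbing, the \((S_0, S_3)\) entry of \(P^n\) is exactly the probability that the TLN occurs within the first n digits. A quick sketch, recomputing \(P^n\) by naive repeated multiplication:

```r
P <- matrix(c(0.9, 0.1, 0,   0,
              0.9, 0,   0.1, 0,
              0.8, 0,   0.1, 0.1,
              0,   0,   0,   1), nrow = 4, byrow = TRUE)
# P^n by repeated multiplication; entry [1, 4] = P(TLN occurs within n digits)
mat_pow <- function(P, n) Reduce(`%*%`, replicate(n, P, simplify = FALSE))
sapply(c(100, 200, 1000), function(n) mat_pow(P, n)[1, 4])
```

The three values match the simulated estimates from part 1 up to Monte Carlo error.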
The unbiased estimator of the probability that a random digit sequence of length n contains the TLN is the proportion of sequences that contain the TLN.
\[ \hat{p}_n = \frac{1}{N} \sum_{i=1}^N I(\text{sequence } i \text{ contains the TLN}) \]
The variance of the estimator is:
\[ \text{Var}(\hat{p}_n) = \frac{p_n(1 - p_n)}{N} \]
Thus, the absolute error \(\lvert\hat{p}_n - p_n\rvert\) of the estimate \(\hat{p}_n\) can be estimated by the standard error:
\[ \lvert\hat{p}_n - p_n\rvert \approx \sqrt{\frac{\hat{p}_n(1 - \hat{p}_n)}{N}} \]
The confidence interval for the probability \(p_n\) is defined by:
\[ 1 - \alpha = P\left(p_n \in \hat{p}_n \pm z_{\alpha/2} \sqrt{\frac{\hat{p}_n(1 - \hat{p}_n)}{N}}\right) \]
n <- 1000
p <- mean(apply(matrix(sample(0:9, n * N, replace = T), nrow = N), 1, is_sebseq, TLN))
alpha <- 0.05
z <- qnorm(1 - alpha / 2)
N <- ceiling(z^2 * p * (1 - p) / 0.03^2)
print("The sample size N which guarantees that the absolute error of the estimate is below 0.03 with confidence level of at least 95 percent is:")
## [1] "The sample size N which guarantees that the absolute error of the estimate is below 0.03 with confidence level of at least 95 percent is:"
print(N)
## [1] 987
print("The confidence interval for the probability is:")
## [1] "The confidence interval for the probability is:"
print(p + c(-1, 1) * z * sqrt(p * (1 - p) / N))
## [1] 0.6074077 0.6673923
The transition matrix is:
\[ P = \begin{pmatrix} 0.9 & 0.1 & 0 & 0 \\ 0.9 & 0 & 0.1 & 0 \\ 0.8 & 0 & 0.1 & 0.1 \\ 0 & 0 & 0 & 1 \\ \end{pmatrix} \]
The system of equations is:
\[ \begin{cases} \mu_1 = 1 + 0.9\mu_1 + 0.1\mu_2 \\ \mu_2 = 1 + 0.9\mu_1 + 0.1\mu_3 \\ \mu_3 = 1 + 0.8\mu_1 + 0.1\mu_3 + 0.1\mu_4 \\ \mu_4 = 0 \end{cases} \]
After solving the system of equations, we get:
\[ \mu_1 = 1000 \\ \mu_2 = 990 \\ \mu_3 = 890 \\ \mu_4 = 0 \]
Thus, \(E(T) = \mu_1 = 1000\).
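The same system can be checked numerically with R's linear solver, after rearranging each equation into the form \(A\mu = b\) and substituting \(\mu_4 = 0\):

```r
# Rearranged equations: 0.1*mu1 - 0.1*mu2 = 1; -0.9*mu1 + mu2 - 0.1*mu3 = 1;
# -0.8*mu1 + 0.9*mu3 = 1 (after substituting mu4 = 0)
A <- matrix(c( 0.1, -0.1,  0,
              -0.9,  1,   -0.1,
              -0.8,  0,    0.9), nrow = 3, byrow = TRUE)
b <- c(1, 1, 1)
solve(A, b)  # → 1000 990 890
```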
Here, the unbiased estimator of \(E(T)\) is the average of the lengths of the first occurrences of the TLN in the random digit sequences.
\[ \hat{\theta} = \overline{T} = \frac{1}{N}\sum_{i=1}^{N}T_i \]
N <- 1000
expected_length <- c()
for (i in 1:N) {
  # the first two entries never match TLN[1:2], so they are safe padding;
  # resetting them each run prevents digits from a previous run leaking in
  sequence <- c(1, 1)
  len <- 3
  repeat {
    sequence[len] <- sample(0:9, 1)
    # stop once the last three digits spell out the TLN
    if (sequence[len] == TLN[3] &&
        sequence[len - 1] == TLN[2] &&
        sequence[len - 2] == TLN[1]) {
      break
    }
    len <- len + 1
  }
  expected_length[i] <- len
}
print("The expected length E(T) till the first occurrence of the TLN is:")
## [1] "The expected length E(T) till the first occurrence of the TLN is:"
print(mean(expected_length))
## [1] 997.906
We can see that the simulated expected length \(E(T)\) till the first occurrence of the TLN is close to the theoretical value of 1000.
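The value 1000 is a special case of a standard result for pattern waiting times in an iid uniform digit stream: \(E(T)\) equals the sum of \(10^k\) over all lengths k at which the pattern's length-k prefix equals its length-k suffix. Since "001" matches itself only at full length, \(E(T) = 10^3 = 1000\). A small sketch of this rule:

```r
# Expected waiting time for a digit pattern in an iid uniform digit stream:
# sum 10^k over every k such that the length-k suffix equals the length-k prefix
pattern_wait <- function(pat) {
  d <- strsplit(pat, "")[[1]]
  L <- length(d)
  sum(sapply(1:L, function(k) {
    if (identical(d[1:k], d[(L - k + 1):L])) 10^k else 0
  }))
}
pattern_wait("001")  # no self-overlap: 10^3 = 1000
pattern_wait("000")  # overlaps at every length: 10 + 100 + 1000 = 1110
```

This also shows why self-overlapping patterns such as "000" take longer to appear on average.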
Hint: use Chebyshev inequality and estimate the standard deviation of \(T\) by the standard error of the sample \(T_1, \ldots, T_N\).
The sample standard deviation of \(T_1, \ldots, T_N\) is:
\[ \sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(T_i - \overline{T})^2} \]
Using the normal approximation, the margin of error \(\varepsilon\) at confidence level \(1 - \alpha\) is:
\[ \varepsilon = \frac{\sigma}{\sqrt{N}} z_{1 - \frac{\alpha}{2}} \]
Solving for the sample size \(N\):
\[ N = \frac{z_{1 - \frac{\alpha}{2}}^2\sigma^2}{\varepsilon^2} \]
alpha <- 0.05
z <- qnorm(1 - alpha / 2)
sigma <- sd(expected_length)
epsilon <- 10
N <- (z^2 * sigma^2) / epsilon^2
print("The sample size N which guarantees that the absolute error of the estimate does not exceed 10 with confidence level of at least 95 percent is:")
## [1] "The sample size N which guarantees that the absolute error of the estimate does not exceed 10 with confidence level of at least 95 percent is:"
print(N)
## [1] 36739.27
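The hint mentions Chebyshev's inequality, which yields a distribution-free (and much more conservative) sample size: from \(P(|\overline{T} - E(T)| > \varepsilon) \le \sigma^2/(N\varepsilon^2) \le \alpha\) we get \(N \ge \sigma^2/(\alpha\varepsilon^2)\). A sketch, backing out \(\sigma \approx 978\) from the normal-approximation N computed above (the run's own sigma is not printed, so this value is reconstructed):

```r
# sigma recovered from N = z^2 * sigma^2 / epsilon^2 with N = 36739.27
z <- qnorm(0.975)
sigma <- sqrt(36739.27 * 10^2 / z^2)   # approx 978
# Chebyshev: P(|mean - E(T)| > eps) <= sigma^2 / (N * eps^2) <= alpha
N_cheb <- ceiling(sigma^2 / (0.05 * 10^2))
N_cheb  # roughly 5x the normal-approximation sample size
```

The ratio of the two sample sizes is \(1/(\alpha z_{1-\alpha/2}^2) \approx 5.2\), the price of making no distributional assumption.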
Aim: In problems 3 and 4, you will have to verify that the interval estimates produced by the known rules indeed contain the parameter with probability equal to the confidence level.
Necessary libraries:
library(crayon)
id <- 1
n <- 1000
m <- 500
alphas <- c(0.1, 0.05, 0.01)
theta <- id / 10
lambda <- 1 / theta
datas <- matrix(rexp(n * m, rate = lambda), nrow = n)
alphas <- c(0.1, 0.05, 0.01)
\[\mathcal{E}(\lambda) = \mathcal{G}(1,\ 1/{\lambda}), \qquad \sum_{i=1}^{n} X_i \sim \mathcal{G}(n,\ 1/{\lambda}) = \frac{1}{2\lambda}\chi^2_{2n}\]
The \((1-\alpha)100\%\) confidence interval for \(\lambda\) then is: \[\left[\frac{\chi^2(2n,\ \alpha/2)}{2n\overline{x}},\ \frac{\chi^2(2n,\ 1-\alpha/2)}{2n\overline{x}}\right]\] where \(\chi^2(v,\ p)\) is the p-th quantile of the chi-square distribution with v degrees of freedom.
rate_chi <- c(0, 0, 0)
lengths_chi <- c()
for (k in 1:3) {
  for (i in 1:m) {
    mean <- mean(datas[, i])
    left <- qchisq(alphas[k] / 2, df = 2 * n) / (2 * mean * n)
    right <- qchisq(1 - alphas[k] / 2, df = 2 * n) / (2 * mean * n)
    lengths_chi <- append(lengths_chi, right - left)
    if (left <= lambda & lambda <= right) {
      rate_chi[k] <- rate_chi[k] + 1
    }
  }
}
rate_chi <- rate_chi / m
for (k in 1:3) {
  cat("The empirical coverage for lambda when alpha =", alphas[k], "is:", rate_chi[k], "\n")
}
## The empirical coverage for lambda when alpha = 0.1 is: 0.896
## The empirical coverage for lambda when alpha = 0.05 is: 0.95
## The empirical coverage for lambda when alpha = 0.01 is: 0.986
\[ \mathrm{P}\left(|\theta-\overline{\mathbf{X}}| \leq z_{\beta} \theta / \sqrt{n}\right)=\mathrm{P}\left(|Z| \leq z_{\beta}\right)=2 \beta-1 . \]
in other words, \(\theta\) is with probability \(2 \beta-1\) within \(\overline{\mathbf{X}} \pm z_{\beta} \theta / \sqrt{n}\). Here \[ \mu=\theta\ \ \ \ \sigma^2 = \theta^2 \] and by the CLT \[ \overline{X}\approx\mathcal{N}(\theta,\ \theta^2/n)\]
The \((1-\alpha)\) confidence interval for \(\theta\) then is: \[\left[\overline{X} - \frac{\sigma}{\sqrt{n}}Z_{1-\alpha/2},\ \overline{X} + \frac{\sigma}{\sqrt{n}}Z_{1-\alpha/2}\right]\]
sd <- theta
rate_known_var <- c(0, 0, 0)
lengths_known_var <- c()
for (k in 1:3) {
  for (i in 1:m) {
    mean <- mean(datas[, i])
    z <- qnorm(1 - alphas[k] / 2, mean = 0, sd = 1)
    epsilon <- sd / sqrt(n) * z
    left <- mean - epsilon
    right <- mean + epsilon
    lengths_known_var <- append(lengths_known_var, right - left)
    if (left <= theta & theta <= right) {
      rate_known_var[k] <- rate_known_var[k] + 1
    }
  }
}
rate_known_var <- rate_known_var / m
for (k in 1:3) {
  cat("The empirical coverage for theta when alpha =", alphas[k], "is:", rate_known_var[k], "\n")
}
## The empirical coverage for theta when alpha = 0.1 is: 0.896
## The empirical coverage for theta when alpha = 0.05 is: 0.95
## The empirical coverage for theta when alpha = 0.01 is: 0.988
\[ |\theta-\overline{\mathbf{X}}| \leq z_{\beta} \theta / \sqrt{n} \]
for \(\theta\) and get another confidence interval of confidence level \(2 \beta-1\) that is independent of the unknown parameter.
We follow the same steps as in (2), but now estimate the standard deviation from the data.
rate_unknown_var <- c(0, 0, 0)
lengths_unknown_var <- c()
for (k in 1:3) {
  for (i in 1:m) {
    mean <- mean(datas[, i])
    sd <- sd(datas[, i])
    z <- qnorm(1 - alphas[k] / 2, mean = 0, sd = 1)
    epsilon <- sd / sqrt(n) * z
    left <- mean - epsilon
    right <- mean + epsilon
    lengths_unknown_var <- append(lengths_unknown_var, right - left)
    if (left <= theta & theta <= right) {
      rate_unknown_var[k] <- rate_unknown_var[k] + 1
    }
  }
}
rate_unknown_var <- rate_unknown_var / m
for (k in 1:3) {
  cat("The empirical coverage for theta when alpha =", alphas[k], "is:", rate_unknown_var[k], "\n")
}
## The empirical coverage for theta when alpha = 0.1 is: 0.892
## The empirical coverage for theta when alpha = 0.05 is: 0.944
## The empirical coverage for theta when alpha = 0.01 is: 0.984
print("The average length using chi-square distribution:")
## [1] "The average length using chi-square distribution:"
print(mean(lengths_chi))
## [1] 1.303624
print("The average length using normal approximation with known variance:")
## [1] "The average length using normal approximation with known variance:"
print(mean(lengths_known_var))
## [1] 0.01302995
print("The average length using normal approximation with unknown variance:")
## [1] "The average length using normal approximation with unknown variance:"
print(mean(lengths_unknown_var))
## [1] 0.01302515
Note that the chi-square interval estimates \(\lambda = 1/\theta = 10\), while the normal intervals estimate \(\theta = 0.1\), so the absolute lengths are not directly comparable: relative to the parameter being estimated, all three intervals are about equally wide. Still, the normal approximation with estimated (unknown) variance is the most widely applicable method, since it requires neither the exact distribution nor the population variance; the chi-square interval, although exact, is tied to the exponential model.
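To compare the methods on a common scale, divide each printed average length by the magnitude of the parameter the interval targets (the chi-square interval estimates \(\lambda = 10\); the normal intervals estimate \(\theta = 0.1\)). A quick sketch using the averages printed above:

```r
# Average interval lengths as printed above, divided by the parameter they estimate
rel_chi  <- 1.303624 / 10      # chi-square interval targets lambda = 10
rel_norm <- 0.01302995 / 0.1   # normal intervals target theta = 0.1
c(rel_chi, rel_norm)  # both about 0.13
```

In relative terms the two approaches are almost identical, as expected from the same underlying sample.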
Task and Directions remain the same. In other words, you have to check that the confidence intervals constructed there contain the parameter \(\theta\) with the prescribed probability.
lambda <- theta
datas <- matrix(rpois(n * m, lambda = lambda), nrow = n)
alphas <- c(0.1, 0.05, 0.01)
The \((1-\alpha)100\%\) confidence interval for \(\lambda\) is: \[\left[\frac{\chi^2(2n\overline{x},\ \alpha/2)}{2n},\ \frac{\chi^2(2n\overline{x},\ 1-\alpha/2)}{2n}\right]\] where \(\chi^2(v,\ p)\) is the p-th quantile of the chi-square distribution with v degrees of freedom.
rate_chi <- c(0, 0, 0)
lengths_chi <- c()
for (k in 1:3) {
  for (i in 1:m) {
    mean <- mean(datas[, i])
    left <- qchisq(alphas[k] / 2, df = 2 * n * mean) / (2 * n)
    right <- qchisq(1 - alphas[k] / 2, df = 2 * n * mean) / (2 * n)
    lengths_chi <- append(lengths_chi, right - left)
    if (left <= lambda & lambda <= right) {
      rate_chi[k] <- rate_chi[k] + 1
    }
  }
}
rate_chi <- rate_chi / m
for (k in 1:3) {
  cat("The empirical coverage for lambda when alpha =", alphas[k], "is:", rate_chi[k], "\n")
}
## The empirical coverage for lambda when alpha = 0.1 is: 0.884
## The empirical coverage for lambda when alpha = 0.05 is: 0.932
## The empirical coverage for lambda when alpha = 0.01 is: 0.988
\[ \mu=\theta\ \ \ \ \sigma^2 = \theta \] and by the CLT \[ \overline{X}\approx\mathcal{N}(\theta,\ \theta/n)\]
The \((1-\alpha)\) confidence interval for \(\lambda\) then is: \[\left[\overline{X} - \frac{\sigma}{\sqrt{n}}Z_{1-\alpha/2},\ \overline{X} + \frac{\sigma}{\sqrt{n}}Z_{1-\alpha/2}\right]\]
sd <- sqrt(theta)
rate_known_var <- c(0, 0, 0)
lengths_known_var <- c()
for (k in 1:3) {
  for (i in 1:m) {
    mean <- mean(datas[, i])
    z <- qnorm(1 - alphas[k] / 2, mean = 0, sd = 1)
    epsilon <- sd / sqrt(n) * z
    left <- mean - epsilon
    right <- mean + epsilon
    lengths_known_var <- append(lengths_known_var, right - left)
    if (left <= theta & theta <= right) {
      rate_known_var[k] <- rate_known_var[k] + 1
    }
  }
}
rate_known_var <- rate_known_var / m
for (k in 1:3) {
  cat("The empirical coverage for lambda when alpha =", alphas[k], "is:", rate_known_var[k], "\n")
}
## The empirical coverage for lambda when alpha = 0.1 is: 0.906
## The empirical coverage for lambda when alpha = 0.05 is: 0.952
## The empirical coverage for lambda when alpha = 0.01 is: 0.988
We follow the same steps as in (2), but now estimate the standard deviation from the data.
rate_unknown_var <- c(0, 0, 0)
lengths_unknown_var <- c()
for (k in 1:3) {
  for (i in 1:m) {
    mean <- mean(datas[, i])
    sd <- sd(datas[, i])
    z <- qnorm(1 - alphas[k] / 2, mean = 0, sd = 1)
    epsilon <- sd / sqrt(n) * z
    left <- mean - epsilon
    right <- mean + epsilon
    lengths_unknown_var <- append(lengths_unknown_var, right - left)
    if (left <= theta & theta <= right) {
      rate_unknown_var[k] <- rate_unknown_var[k] + 1
    }
  }
}
rate_unknown_var <- rate_unknown_var / m
for (k in 1:3) {
  cat("The empirical coverage for lambda when alpha =", alphas[k], "is:", rate_unknown_var[k], "\n")
}
## The empirical coverage for lambda when alpha = 0.1 is: 0.902
## The empirical coverage for lambda when alpha = 0.05 is: 0.944
## The empirical coverage for lambda when alpha = 0.01 is: 0.986
print("The average length using chi-square distribution:")
## [1] "The average length using chi-square distribution:"
print(mean(lengths_chi))
## [1] 0.04115955
print("The average length using normal approximation with known variance:")
## [1] "The average length using normal approximation with known variance:"
print(mean(lengths_known_var))
## [1] 0.04120431
print("The average length using normal approximation with unknown variance:")
## [1] "The average length using normal approximation with unknown variance:"
print(mean(lengths_unknown_var))
## [1] 0.04118954
In the Poisson case all three methods produce intervals of nearly identical length, with empirical coverage close to the nominal confidence level. The chi-square interval is marginally shorter and is exact rather than asymptotic, so it is the preferable option here; the two normal-approximation intervals are a close and convenient alternative.