Probability Theory and Mathematical Statistics

2024-09-10 2024-09-10

Preface

First Edition Preface

[[2024-09-14]] Today the makeup exam finally ended. I heard the original exam directly reused past papers. These past few days I practiced three sets of “XDU original papers” (from 2021 and two from 2023) found online. I did the 2021 paper in the morning, and in the afternoon $\frac{1}{4}$ of the questions were exact copies without any changes. I couldn’t help but laugh.

Dai Hao once said he would try his best to find the best teachers for the Qian Class. But now it seems the School of Mathematics and Statistics has no one left? Poor teaching could be excused as not focusing on education or lacking talent in teaching; but directly reusing recent past papers for exams, full of unchecked errors and omissions, made me laugh in frustration.

The exams they create have no value, and they don’t even bother to test them themselves. This is an attitude problem.

It’s fine that your university goes easy on final exams, but don’t keep fooling people with old material. You preach innovation to students, yet for yourselves, just getting by is enough. This is not the attitude for academic work, nor is it the attitude one should have for teaching.

Probability theory ends here for now. Over the past two days, I repeatedly reviewed notes, practiced problems, and corrected many errors, clarifying the knowledge structure of this course. Although the content is still relatively sparse, it should suffice as final review material. This edition will likely be the final version (probably). I’ll continue organizing Electrodynamics and Digital Signal Processing during the Mid-Autumn Festival.

Second Edition Preface

Nothing is final!!! ——Qian Xuesen

Added content on the left/right continuity of distribution functions. It seems this course is far from final…

Event Operations to Logical Operations

$A \cup B=A+B$
$A \cap B=A \cdot B$
$A-B=A \bar{B}$ Event $A$ occurs and event $B$ does not occur, easily proven by Venn diagrams. $-B$ can be interpreted as $\cdot (-B)$, where $-B$ is $\bar{B}$.
If $A \subset B$, then $A \cup B=B$, $A \cap B=A$.

After converting event operations to logical operations, most rules are shared. Using logical function operations and simplification learned in digital circuits, complex event operations can be simplified. Tips: Karnaugh maps.

Four Major Probability Formulas

$$ \begin{cases} P(A+B)=P(A)+P(B)-P(AB)\\ P(A-B)=P(A)-P(AB)=P(A \bar{B})\\ P(AB)=P(B) \cdot P(A|B)=P(A) \cdot P(B|A)\\ P(A|B)=\frac{P(AB)}{P(B)}\\ \end{cases} $$

Corollary

$P(A+B+C)$: Treat $A+B$ as a single event and apply the addition formula above, splitting twice to get:

$$ P(A+B+C)=P(A)+P(B)+P(C)-P(AB)-P(AC)-P(BC)+P(ABC) $$

Probabilities for more joint events can be derived recursively.
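The three-event formula can be checked by brute-force counting. The sketch below assumes a fair six-sided die and three events picked purely for illustration (they are not from the course material); `prob` is a hypothetical helper.

```python
from fractions import Fraction

# Sample space of one fair die roll (an assumed example).
omega = set(range(1, 7))

def prob(event):
    """Classical probability: favourable outcomes / total outcomes."""
    return Fraction(len(event & omega), len(omega))

A = {2, 4, 6}      # even number
B = {4, 5, 6}      # greater than 3
C = {1, 2, 3, 4}   # at most 4

# Left side: probability of the union, counted directly.
lhs = prob(A | B | C)

# Right side: the inclusion-exclusion expansion.
rhs = (prob(A) + prob(B) + prob(C)
       - prob(A & B) - prob(A & C) - prob(B & C)
       + prob(A & B & C))

# The difference rule A - B = A * (not B) holds for sets as well.
difference_rule_holds = (A - B) == (A & (omega - B))

print(lhs, rhs, difference_rule_holds)
```

Using `Fraction` keeps the comparison exact, so the two sides can be compared with `==` rather than a tolerance.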

Complementary event: The probability that $A$ does not occur, obvious from Venn diagrams.

$$ P(\bar{A})=P(1 \cdot \bar{A})=P(1-A)=P(1)-P(1 \cdot A)=1-P(A) $$

Non-Negativity and Normalization

Non-negativity: For any event $A$, $0 \le P(A) \le 1$. Normalization: For the total event $\Omega$, $P(\Omega)=1$.

Independence

$$ \begin{cases} P(AB)=P(A) \cdot P(B)\\ P(A|B)=P(A) \end{cases} $$

Independence is symmetric: if $A$ is independent of $B$, then $B$ is independent of $A$. The second form requires $P(B) \gt 0$.

Classical Probability Model

All elementary events have equal probability.

Eg. Coin toss, dice roll…

$$ P(A)=\frac{\text{Number of elementary events in } A}{\text{Total elementary events in } \Omega} $$

Classical conditional probability formula:

$$ P(B|A)=\frac{P(AB)}{P(A)}=\frac{\text{Elementary events in both } A \text{ and } B}{\text{Elementary events in } A} $$
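As a small illustration of the counting form, the sketch below enumerates two fair dice. The events ($A$: the sum is 8, $B$: the first die is even) are assumptions chosen for the example.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of two fair dice.
omega = list(product(range(1, 7), repeat=2))

A = [w for w in omega if sum(w) == 8]     # sum equals 8
AB = [w for w in A if w[0] % 2 == 0]      # ...and the first die is even

# Counting form: |AB| / |A|
p_counting = Fraction(len(AB), len(A))

# Definition: P(AB) / P(A)
p_formula = Fraction(len(AB), len(omega)) / Fraction(len(A), len(omega))

print(p_counting, p_formula)
```

Both routes give the same value because the common factor $1/|\Omega|$ cancels.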

Bernoulli Trials (Binomial Distribution)

$n$ independent trials, each with only two outcomes: $A$ or $\bar{A}$.

$X \sim B(n,p)$

$$ P_n(k)=C_n^kp^k(1-p)^{n-k} $$

Where $p=P(A)$, $1-p=P(\bar{A})$.
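A minimal sketch of the binomial formula using only the standard library. `binom_pmf` is a hypothetical helper name, and the values $n=10$, $p=0.3$ are arbitrary.

```python
from math import comb

def binom_pmf(k, n, p):
    """P_n(k) = C(n, k) * p^k * (1 - p)^(n - k)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 10, 0.3                      # assumed example parameters
pmf = [binom_pmf(k, n, p) for k in range(n + 1)]

total = sum(pmf)                                 # should be 1 (normalization)
mean = sum(k * pk for k, pk in enumerate(pmf))   # should be n * p

print(total, mean)
```

The sum over all $k$ equals 1 and the mean equals $np$, up to floating-point rounding.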

Geometric Probability Model

The ratio of the length/area/volume occupied by the event to the total length/area/volume of the sample space $\Omega$. When the event’s dimension is lower than $\Omega$’s dimension, its probability is always 0. ==Warning==: A probability of 0 does not mean the event cannot occur. Eg: Randomly selecting a point inside a circle, the probability of selecting any specific point is 0, but it can still happen.

Uniform Distribution

$x \sim U(a,b)$ This is the one-dimensional (length) case of the geometric probability model, with probability density:

$$ f(x)= \begin{cases} 0,x \le a\\ \frac{1}{b-a},a \lt x \le b\\ 0,x \gt b\\ \end{cases} $$

Cumulative distribution function:

$$ F(x)= \begin{cases} 0,x \le a\\ \frac{x-a}{b-a},a \lt x \le b\\ 1,x \gt b\\ \end{cases} $$

Exponential Distribution

$x \sim E(\lambda)$

Probability Density

$$ f(x)= \begin{cases} \lambda e^{-\lambda x},x \gt 0\\ 0,x \le 0\\ \end{cases} $$

Cumulative Distribution Function

$$ F(x)= \begin{cases} 1-e^{-\lambda x},x \ge 0\\ 0,x \lt 0\\ \end{cases} $$

Poisson Distribution

$X \sim P(\lambda)$ (also written $\pi(\lambda)$), $\lambda \gt 0$

$$ P(X=k)=\frac{e^{-\lambda}\lambda^k}{k!},k=0,1,2,\cdots $$

Normal Distribution

$x \sim N(\mu,\sigma^2)$

Probability Density

$$ f(x)=\frac{1}{\sqrt{2 \pi} \sigma}e^{-\frac{(x-\mu)^2}{2\sigma^2}},x \in R,\sigma \gt 0 $$

Cumulative Distribution Function

$$ F(x)=\int^{x}_{-\infty}f(t)dt $$

Clearly, $F(\mu)=\frac{1}{2}$, meaning $P(x \le \mu)=P(x \gt \mu)=\frac{1}{2}$.

Standard Normal Distribution

When $\mu=0,\sigma=1$, it becomes the standard normal distribution.

$$ \varphi(x)=\frac{1}{\sqrt{2\pi}}e^{-\frac{x^2}{2}} $$

$$ \varPhi(x)=\int^{x}_{-\infty}\varphi(t)dt $$

Corollaries

$$ \varPhi(-x)=1-\varPhi(x) $$

$$ F(x)=\varPhi(\frac{x-\mu}{\sigma}) $$

Standardization of the normal distribution:

$$ X \sim N(\mu,\sigma^2),Z=\frac{X-\mu}{\sigma}\sim N(0,1) $$
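In practice $\varPhi$ is read from a table, but it can also be evaluated through the error function, since $\varPhi(x)=\frac{1}{2}\left[1+\operatorname{erf}(x/\sqrt{2})\right]$. The sketch below uses that identity; the parameters $\mu=100$, $\sigma=15$ and the function names are assumptions for illustration.

```python
from math import erf, sqrt

def std_normal_cdf(x):
    """Phi(x) for N(0, 1), via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def normal_cdf(x, mu, sigma):
    """F(x) for N(mu, sigma^2), by standardizing first."""
    return std_normal_cdf((x - mu) / sigma)

mu, sigma = 100.0, 15.0   # assumed example parameters

# P(85 < X <= 130) = Phi(2) - Phi(-1)
p_interval = normal_cdf(130.0, mu, sigma) - normal_cdf(85.0, mu, sigma)

print(std_normal_cdf(0.0), p_interval)
```

The symmetry $\varPhi(-x)=1-\varPhi(x)$ can be checked numerically with the same function.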

Total Probability Formula

Complete Event Group

$$ \begin{cases} B_1 \cup B_2 \cup B_3 \cup \cdots \cup B_n=\Omega\\ B_i \cap B_j=\varnothing,i \ne j,1 \le i \le n,1 \le j \le n\\ \end{cases} $$

$B_1,B_2,B_3,\cdots B_n$ form a complete event group for $\Omega$.

Total Probability Formula

$$ \begin{align} P(A) &=P(AB_1 \cup AB_2 \cup \cdots \cup AB_n)\\ &=P(AB_1)+P(AB_2)+\cdots +P(AB_n)\\ &=P(B_1)P(A|B_1)+P(B_2)P(A|B_2)+\cdots +P(B_n)P(A|B_n)\\ \end{align} $$

Bayes’ Formula

$$ P(B_1|A)=\frac{P(AB_1)}{P(A)}=\frac{P(B_1)P(A|B_1)}{P(A)} $$
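A worked example of both formulas, with made-up numbers: three machines $B_1,B_2,B_3$ produce 50%, 30% and 20% of the output, with defect rates 1%, 2% and 3%, and $A$ is the event that a part is defective. All figures are hypothetical.

```python
from fractions import Fraction as F

# P(B_i): share of output per machine (assumed numbers).
prior = [F(50, 100), F(30, 100), F(20, 100)]
# P(A | B_i): defect rate per machine (assumed numbers).
likelihood = [F(1, 100), F(2, 100), F(3, 100)]

# Total probability formula: P(A) = sum of P(B_i) * P(A | B_i).
p_A = sum(pb * pa for pb, pa in zip(prior, likelihood))

# Bayes' formula: P(B_i | A) = P(B_i) * P(A | B_i) / P(A).
posterior = [pb * pa / p_A for pb, pa in zip(prior, likelihood)]

print(p_A, posterior)
```

The posteriors sum to 1 because the $B_i$ form a complete event group.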

One-Dimensional Discrete Random Variables

Probability Mass Function

$$ P(X=x_i)=p_i=\frac{\text{Count of } X=x_i}{\text{Total count}},i=1,2,\cdots $$

Cumulative Distribution Function

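$$ F(x)=\sum_{x_i \lt x}p_i,x \in R $$

This sum follows the left-continuous convention $F(x)=P(X \lt x)$. Under the right-continuous convention $F(x)=P(X \le x)$, the sum runs over $x_i \le x$ instead.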

One-Dimensional Continuous Random Variables

Probability Density Function

$$ f(x)=F'(x) $$

Cumulative Distribution Function

$$ F(x)=\int_{-\infty}^xf(t)dt $$

Interval Probability

$$ P(a \lt x \le b)=\int_a^bf(x)dx=F(b)-F(a) $$

$\because$ $P(x=a)=0,a \in R$ $\therefore$ The equality signs on the interval can be chosen freely.

Normalization

$$ F(\infty)=\int^{\infty}_{-\infty}f(x)dx=1 $$

$$ F(-\infty)=0 $$

Two-Dimensional Discrete Random Variables

Joint Probability Mass Function

$P(X=x_i,Y=y_j)$ Create a 2D table of possible values for X and Y, filling in corresponding probabilities.

Marginal Probability Mass Function

$P(X=x_i),P(Y=y_j)$ Sum the rows/columns of the joint probability table to get the marginal distributions: $P(X=x_i)=\sum_j P(X=x_i,Y=y_j)$ and $P(Y=y_j)=\sum_i P(X=x_i,Y=y_j)$.

Conditional Distribution

$P(X=x_i|Y=y_j),P(Y=y_j|X=x_i)$ Divide each row/column of the joint probability table by its marginal probability. This scales the joint probabilities so each row/column sums to 1.

Independence of Two Variables

==Independence means the product rule holds for every cell of the table.== X and Y are independent if and only if $P(X=x_i,Y=y_j)=P(X=x_i)P(Y=y_j)$ for all $i,j$. Equivalently, write the joint probability table as a matrix $\vec{A}$: X and Y are independent exactly when all rows (and all columns) are proportional, i.e. $\vec{A}$ has rank 1. For a $2 \times 2$ table this reduces to $\det \vec{A}=0$; for larger square tables $\det \vec{A}=0$ is necessary but not sufficient. If the joint probability $\ne$ the product of marginal probabilities for even one cell, i.e., $P(X=x_i,Y=y_j)\ne P(X=x_i)P(Y=y_j)$, then X and Y are not independent.
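The table operations above (marginals, conditionals, and the product check for independence) can be sketched as follows. The two joint tables are invented for illustration, with rows indexing values of $X$ and columns indexing values of $Y$; `is_independent` is a hypothetical helper.

```python
from fractions import Fraction as F

# Joint table: joint[i][j] = P(X = x_i, Y = y_j) (assumed numbers).
joint = [[F(1, 8), F(1, 8)],
         [F(3, 8), F(3, 8)]]

# Marginals: row sums give P(X = x_i), column sums give P(Y = y_j).
p_x = [sum(row) for row in joint]
p_y = [sum(col) for col in zip(*joint)]

# Conditional distribution of X given Y = y_0: divide column 0 by its sum.
cond_x_given_y0 = [joint[i][0] / p_y[0] for i in range(len(p_x))]

def is_independent(table):
    """True if every cell equals the product of its two marginals."""
    px = [sum(row) for row in table]
    py = [sum(col) for col in zip(*table)]
    return all(table[i][j] == px[i] * py[j]
               for i in range(len(px)) for j in range(len(py)))

# A table where X always equals Y: clearly dependent.
dependent = [[F(1, 2), F(0)],
             [F(0), F(1, 2)]]

print(p_x, p_y, cond_x_given_y0)
print(is_independent(joint), is_independent(dependent))
```

In the first table the rows are proportional, which is why the product check succeeds in every cell.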

Two-Dimensional Continuous Random Variables

Joint Density Function

$$ f(x,y) $$

Normalization

$$ \int^{\infty}_{-\infty}\int^{\infty}_{-\infty}f(x,y)dxdy=1 $$

Marginal Density Functions

$$ f_X(x)=\int^{\infty}_{-\infty}f(x,y)dy $$

$$ f_Y(y)=\int^{\infty}_{-\infty}f(x,y)dx $$

Conditional Density

$$ f_{Y|X}(y|x)=\frac{f(x,y)}{f_X(x)} $$

Independence

$$ f(x,y)=f_X(x)f_Y(y) $$

When this holds, X and Y are independent.

Distribution Function

Let $Z=X-Y$,

$$ \begin{align} F_Z(z) &=P(Z \lt z)\\ &=P(X-Y \lt z)\\ &=P(X \lt Y+z)\\ &=\int^{\infty}_{-\infty}\int^{y+z}_{-\infty}f(x,y)dxdy\\ \end{align} $$

The distribution function $F_Z(z)=\iint_Df(x,y)dxdy$, where $D=\{(x,y): x-y \lt z\}$. Differentiate with respect to $z$ to get the probability density function $f_Z(z)=F_Z'(z)$. ==Warning==: $F_Z(z)$ must satisfy normalization, i.e. $F_Z(-\infty)=0$ and $F_Z(\infty)=1$.
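A worked example, under the assumption (not from the notes) that $X$ and $Y$ are independent and both $U(0,1)$, so $f(x,y)=1$ on the unit square and $F_Z(z)$ is the area of the part of the square where $x \lt y+z$:

$$ F_Z(z)= \begin{cases} 0,z \lt -1\\ \int_{-z}^{1}(y+z)dy=\frac{(1+z)^2}{2},-1 \le z \le 0\\ 1-\frac{(1-z)^2}{2},0 \lt z \le 1\\ 1,z \gt 1\\ \end{cases} $$

Differentiating piecewise gives a triangular density:

$$ f_Z(z)= \begin{cases} 1+z,-1 \le z \le 0\\ 1-z,0 \lt z \le 1\\ 0,\text{otherwise}\\ \end{cases} $$

Normalization check: $F_Z(-1)=0$, $F_Z(1)=1$, and the triangle under $f_Z$ has area $\frac{1}{2} \cdot 2 \cdot 1=1$.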

Expectation and Variance

Relations

$$ DX=E(X^2)-(EX)^2 $$

$$ D(cX)=c^2DX $$

$$ D(X+Y)=D(X)+D(Y)+2Cov(X,Y) $$

When X and Y are independent, $Cov(X,Y)=0$.
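The first two relations can be verified exactly on a small discrete distribution. The sketch below assumes a fair die as the example; `expectation` and `variance` are hypothetical helper names.

```python
from fractions import Fraction as F

def expectation(pmf, g=lambda x: x):
    """E[g(X)] for a discrete distribution given as {value: probability}."""
    return sum(g(x) * p for x, p in pmf.items())

def variance(pmf):
    """DX = E(X^2) - (EX)^2."""
    return expectation(pmf, lambda x: x * x) - expectation(pmf) ** 2

# Fair six-sided die (assumed example).
die = {x: F(1, 6) for x in range(1, 7)}

# Distribution of cX with c = 3.
c = 3
scaled = {c * x: p for x, p in die.items()}

print(expectation(die), variance(die), variance(scaled))
```

Here $EX=\frac{7}{2}$, $E(X^2)=\frac{91}{6}$, so $DX=\frac{91}{6}-\frac{49}{4}=\frac{35}{12}$, and scaling by $c=3$ multiplies the variance by 9.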

Common Expectations and Variances

$(0,1)$ Distribution

$$ EX=p,DX=p(1-p) $$

$B(n,p)$ Binomial Distribution

$$ EX=np,DX=np(1-p) $$

$U(a,b)$ Uniform Distribution

$$ EX=\frac{a+b}{2},DX=\frac{(b-a)^2}{12} $$

$E(\lambda)$ Exponential Distribution

$$ EX=\frac{1}{\lambda},DX=\frac{1}{\lambda^2} $$

$P(\lambda)$ Poisson Distribution

$$ EX=\lambda,DX=\lambda $$

$N(\mu,\sigma^2)$ Normal Distribution

$$ EX=\mu,DX=\sigma^2 $$

Covariance and Correlation Coefficient

Covariance

$$ Cov(X,Y)=E(XY)-E(X)E(Y) $$

Clearly, when $X=Y$, $Cov(X,X)=DX$.

$$ Cov(X+Y,Z)=Cov(X,Z)+Cov(Y,Z) $$

$$ Cov(X-Y,Z)=Cov(X,Z)+Cov(-Y,Z)=Cov(X,Z)-Cov(Y,Z) $$

Correlation Coefficient

$$ \rho_{XY}=\frac{Cov(X,Y)}{\sqrt{DX \cdot DY}} $$

Higher $|\rho|$ means stronger linear correlation. When $Y=X$, the two variables are perfectly positively correlated, $\rho=1$. When $Y=-X$, they are perfectly negatively correlated, $\rho=-1$. In all cases $|\rho| \le 1$. $\rho=0$ means X and Y are uncorrelated. ==Warning==: Uncorrelated $\nRightarrow$ Independent, but Independent $\Rightarrow$ Uncorrelated.
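A standard counterexample for the warning, sketched under the assumption that $X$ is uniform on $\{-1,0,1\}$ and $Y=X^2$: the covariance is zero even though $Y$ is completely determined by $X$.

```python
from fractions import Fraction as F

# Joint PMF as {(x, y): probability}; X uniform on {-1, 0, 1}, Y = X^2.
joint = {(-1, 1): F(1, 3), (0, 0): F(1, 3), (1, 1): F(1, 3)}

def E(g):
    """Expectation of g(X, Y) under the joint PMF."""
    return sum(g(x, y) * p for (x, y), p in joint.items())

# Cov(X, Y) = E(XY) - E(X)E(Y)
cov = E(lambda x, y: x * y) - E(lambda x, y: x) * E(lambda x, y: y)

# Independence check at the single cell (X = 1, Y = 1).
p_x1 = sum(p for (x, y), p in joint.items() if x == 1)
p_y1 = sum(p for (x, y), p in joint.items() if y == 1)
p_x1_y1 = joint.get((1, 1), F(0))

print(cov, p_x1_y1, p_x1 * p_y1)
```

The covariance is 0, yet $P(X=1,Y=1)=\frac{1}{3}$ differs from $P(X=1)P(Y=1)=\frac{2}{9}$, so the variables are uncorrelated but not independent.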

Chebyshev’s Inequality for Probability Estimation

$$ P(|X-EX|\ge \varepsilon)\le \frac{DX}{\varepsilon^2} $$
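To get a feel for how loose the bound can be, the sketch below compares it with the exact tail probability for $X \sim B(100,0.5)$ and $\varepsilon=10$. These parameters are assumptions chosen for illustration.

```python
from math import comb

n, p, eps = 100, 0.5, 10            # assumed example parameters
mean = n * p                        # EX = np
var = n * p * (1 - p)               # DX = np(1 - p)

# Chebyshev bound: P(|X - EX| >= eps) <= DX / eps^2
bound = var / eps**2

# Exact tail probability, summing the binomial PMF over |k - EX| >= eps.
exact = sum(comb(n, k) * p**k * (1 - p)**(n - k)
            for k in range(n + 1) if abs(k - mean) >= eps)

print(bound, exact)
```

The inequality holds, and the exact probability is noticeably smaller than the bound: Chebyshev uses only the mean and variance, so it is an estimate rather than a precise value.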

Central Limit Theorem

The sum of a large number of independent, identically distributed variables is approximately normally distributed. If $x_1,x_2,\cdots,x_n$ are independent and identically distributed with finite $E(x_i)=\mu$ and $D(x_i)=\sigma^2$, then for large $n$, approximately,

$$ \sum_{i=1}^nx_i \overset{\text{approx.}}{\sim} N(\sum^{n}_{i=1}E(x_i),\sum^{n}_{i=1}D(x_i))=N(n\mu,n\sigma^2) $$
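A quick numerical check of the approximation: a $B(100,0.5)$ variable is a sum of 100 independent 0-1 variables, so its distribution function should be close to a normal one. The parameters are assumed, and the half-unit continuity correction is a common refinement added here, not something stated in the notes.

```python
from math import comb, erf, sqrt

def phi(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + erf(x / sqrt(2)))

n, p = 100, 0.5                       # assumed example parameters
mu = n * p                            # mean of the sum
sigma = sqrt(n * p * (1 - p))         # standard deviation of the sum

# Exact P(X <= 55) for X ~ B(100, 0.5); each outcome has weight 1 / 2^n.
exact = sum(comb(n, k) for k in range(56)) / 2**n

# Normal approximation, with a +0.5 continuity correction (an assumption).
approx = phi((55 + 0.5 - mu) / sigma)

print(exact, approx)
```

The two values agree to about two decimal places, which is typical for $n$ of this size with $p$ near $\frac{1}{2}$.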

Three Major Distributions

$\chi^2$ (Chi-Squared) Distribution

$$ X=x_1^2+x_2^2+\cdots +x_n^2 \sim \chi^2(n),x_i \sim N(0,1) \text{ and independent} $$

Upper $\alpha$ quantile: $\chi^2_\alpha(n)$, defined by $P(X \gt \chi^2_\alpha(n))=\alpha$. The density function is nonzero only for $x \gt 0$ (its graph lies in the first quadrant).

$t$ Distribution

$$ X=\frac{x_1}{\sqrt{x_2/n}}\sim t(n),x_1 \sim N(0,1),x_2 \sim \chi^2(n),x_1 \text{ and } x_2 \text{ independent} $$

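Upper $\alpha$ quantile: $t_\alpha(n)$, defined by $P(X \gt t_\alpha(n))=\alpha$. The density function resembles the standard normal density and is symmetric about 0, so $t_{1-\alpha}(n)=-t_\alpha(n)$.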

$F$ Distribution

$$ X=\frac{x_1/n_1}{x_2/n_2} \sim F(n_1,n_2),x_1 \sim \chi^2(n_1),x_2 \sim \chi^2(n_2),x_1 \text{ and } x_2 \text{ independent} $$

Upper $\alpha$ quantile: $F_\alpha(n_1,n_2)$, defined by $P(X \gt F_\alpha(n_1,n_2))=\alpha$. The density function is nonzero only for $x \gt 0$ (its graph lies in the first quadrant).

Estimation Methods

For simple random samples that are independent and identically distributed, estimate unknown parameters.

Method of Moments

When the sample size is large, the sample moments are close to the population moments (law of large numbers), so replace the population mean with the sample mean (population moment = sample moment).

Calculate the expectation $EX$ (first population moment) from the given probability mass/density function.
Calculate the sample mean $\bar{X}$ (first sample moment) from the given sample.
Set $EX=\bar{X}$ and solve for $\theta_0$ as $\hat{\theta}$.
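The three steps can be illustrated with an assumed population $X \sim U(0,\theta)$ and a made-up sample $1.2, 0.4, 2.0, 1.6$:

$$ EX=\frac{0+\theta}{2}=\frac{\theta}{2},\quad \bar{X}=\frac{1.2+0.4+2.0+1.6}{4}=1.3 $$

$$ \frac{\theta}{2}=\bar{X} \Rightarrow \hat{\theta}=2\bar{X}=2.6 $$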

Maximum Likelihood Estimation

The estimate maximizes the probability of the observed sample. Likelihood function for the sample:

$$ L(x_1,x_2,\cdots,x_n;\theta)= \begin{cases} P(X=x_1;\theta)P(X=x_2;\theta)\cdots P(X=x_n;\theta), \text{discrete}\\ f(x_1;\theta)f(x_2;\theta)\cdots f(x_n;\theta), \text{continuous}\\ \end{cases} $$

To find the maximum of $L$, take the derivative to find critical points. Since the product form is cumbersome, first take the logarithm before differentiating with respect to $\theta$.

$$ (\ln L)'= \begin{cases} (\ln P_1+\ln P_2+\cdots +\ln P_n)', \text{discrete}\\ [\ln f(x_1;\theta)+\ln f(x_2;\theta)+\cdots +\ln f(x_n;\theta)]', \text{continuous}\\ \end{cases} =0 $$

Solve for the critical point $\theta_0$, which is the estimate $\hat{\theta}$.
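A worked example of the procedure, assuming the population is $E(\lambda)$ with all $x_i \gt 0$ (this choice of distribution is for illustration only):

$$ L(x_1,x_2,\cdots,x_n;\lambda)=\prod_{i=1}^n \lambda e^{-\lambda x_i}=\lambda^n e^{-\lambda \sum_{i=1}^n x_i} $$

$$ \ln L=n\ln \lambda-\lambda \sum_{i=1}^n x_i $$

$$ (\ln L)'=\frac{n}{\lambda}-\sum_{i=1}^n x_i=0 \Rightarrow \hat{\lambda}=\frac{n}{\sum_{i=1}^n x_i}=\frac{1}{\bar{x}} $$

Since $(\ln L)''=-\frac{n}{\lambda^2} \lt 0$, the critical point is a maximum. For this distribution the result coincides with the method-of-moments estimate obtained from $EX=\frac{1}{\lambda}=\bar{X}$.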

Unbiasedness and Efficiency

If $E(\hat{\theta})=\theta$, then $\hat{\theta}$ is an unbiased estimator of $\theta$. If $\hat{\theta_1},\hat{\theta_2}$ are both unbiased, and $D(\hat{\theta_1}) \lt D(\hat{\theta_2})$, then $\hat{\theta_1}$ is more efficient than $\hat{\theta_2}$.

Interval Estimation

$X \sim N(\mu,\sigma^2)$; the sample mean $\bar{X}$ and sample standard deviation $S$ are typically given and serve as estimates of $\mu$ and $\sigma$. Confidence level: $1-\alpha$, usually $\alpha=5\%$.

Confidence Interval for $\mu$

$\sigma^2$ Known

Pivotal quantity (standardized):

$$ \frac{\bar{X}-\mu}{\sigma/\sqrt{n}}\sim N(0,1) $$

$$ \mu \in (\bar{x}-\frac{\sigma}{\sqrt{n}}z_{\frac{\alpha}{2}},\bar{x}+\frac{\sigma}{\sqrt{n}}z_{\frac{\alpha}{2}}) $$

Here $z_{\frac{\alpha}{2}}$ is the upper $\frac{\alpha}{2}$ quantile of $N(0,1)$.
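A minimal sketch of this interval, using `statistics.NormalDist` from the Python standard library to obtain the upper $\frac{\alpha}{2}$ quantile of $N(0,1)$ instead of a table. The function name and the sample figures ($\bar{x}=10$, $\sigma=2$, $n=16$) are assumptions for illustration.

```python
from math import sqrt
from statistics import NormalDist

def mean_ci_known_sigma(xbar, sigma, n, alpha=0.05):
    """Two-sided (1 - alpha) confidence interval for mu when sigma is known."""
    # Upper alpha/2 quantile of N(0, 1), i.e. the point with CDF 1 - alpha/2.
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * sigma / sqrt(n)
    return xbar - half_width, xbar + half_width

# Assumed example data: sample mean 10, known sigma 2, sample size 16.
lower, upper = mean_ci_known_sigma(xbar=10.0, sigma=2.0, n=16)

print(round(lower, 2), round(upper, 2))
```

With $\alpha=5\%$ the quantile is about 1.96, so the half-width here is roughly $1.96 \times 2/4=0.98$. The unknown-variance case would need a $t$ quantile, which the standard library does not provide.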

$\sigma^2$ Unknown

Pivotal quantity:

$$ \frac{\bar{X}-\mu}{S/\sqrt{n}}\sim t(n-1) $$

$$ \mu \in (\bar{x}-\frac{S}{\sqrt{n}}t_{\frac{\alpha}{2}}(n-1),\bar{x}+\frac{S}{\sqrt{n}}t_{\frac{\alpha}{2}}(n-1)) $$

Confidence Interval for $\sigma^2$

Usually $\mu$ is unknown. Pivotal quantity:

$$ \frac{(n-1)S^2}{\sigma^2}\sim \chi^2(n-1) $$

$$ \sigma^2 \in (\frac{(n-1)S^2}{\chi^2_{\frac{\alpha}{2}}(n-1)},\frac{(n-1)S^2}{\chi^2_{1-\frac{\alpha}{2}}(n-1)}) $$

Hypothesis Testing

Generally, the significance level is set at $\alpha=5\%$.

$\mu$ Test (Mean Test)

Hypothesis Formulation
$H_0: \mu = \mu_0$ (null hypothesis)
$H_1: \mu \ne \mu_0$ (alternative hypothesis)
Test Statistic Selection
- When population variance $\sigma^2$ is known:
  Use $Z = \frac{\bar{X} - \mu_0}{\sigma / \sqrt{n}} \sim N(0,1)$ under $H_0$ (Z-test)
- When population variance $\sigma^2$ is unknown:
  Use $T = \frac{\bar{X} - \mu_0}{S / \sqrt{n}} \sim t(n-1)$ under $H_0$ (T-test)
Rejection Region Determination
- For Z-test:
  $W = (-\infty, -z_{\alpha/2}) \cup (z_{\alpha/2}, \infty)$
- For T-test:
  $W = (-\infty, -t_{\alpha/2}(n-1)) \cup (t_{\alpha/2}(n-1), \infty)$
Decision Rule
Reject $H_0$ if the computed test statistic falls within the rejection region $W$.
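The known-variance case of the procedure can be sketched with the standard library. The function name and the sample figures are hypothetical, and the statistic is evaluated at the hypothesized mean $\mu_0$.

```python
from math import sqrt
from statistics import NormalDist

def z_test_two_sided(xbar, mu0, sigma, n, alpha=0.05):
    """Two-sided Z-test of H0: mu = mu0 with known sigma.

    Returns (statistic, critical value, reject H0?).
    """
    z = (xbar - mu0) / (sigma / sqrt(n))
    # Upper alpha/2 quantile of N(0, 1); rejection region is |z| > z_crit.
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return z, z_crit, abs(z) > z_crit

# Assumed example: sample mean 11 from n = 16 observations, sigma = 2, mu0 = 10.
z, z_crit, reject = z_test_two_sided(xbar=11.0, mu0=10.0, sigma=2.0, n=16)

print(z, round(z_crit, 2), reject)
```

Here the statistic is $\frac{11-10}{2/4}=2$, which exceeds the critical value of about 1.96, so $H_0$ is rejected at the 5% level.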

$\sigma^2$ Test (Variance Test)

Sample standard deviation formula:

$$ S = \sqrt{S^2} = \sqrt{\frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{X})^2} $$

Hypothesis Formulation
$H_0: \sigma^2 = \sigma_0^2$
$H_1: \sigma^2 \ne \sigma_0^2$
Test Statistic Selection
Use $\chi^2 = \frac{(n-1)S^2}{\sigma_0^2} \sim \chi^2(n-1)$ under $H_0$ (Chi-square test)
Rejection Region Determination
$W = (0, \chi^2_{1-\alpha/2}(n-1)) \cup (\chi^2_{\alpha/2}(n-1), \infty)$
Decision Rule
Reject $H_0$ if the test statistic falls within the rejection region $W$.

Supplementary Notes

Properties of Distribution Functions

For different types of random variables:

Continuous random variables: The distribution function is continuous.
Discrete random variables: The continuity of the distribution function depends on its definition.

Left-Continuous Definition

$$ F(x) = P(X \lt x) $$

In this case:

$F(x) = F(x^-) = F(x-0) = P(X \lt x)$
$F(x^+) = F(x+0) = P(X \lt x) + P(X = x)$

When $P(X = x) \ne 0$, $F(x^+) \gt F(x) = F(x^-)$, making the distribution function left-continuous but not right-continuous.

Right-Continuous Definition

$$ F(x) = P(X \le x) $$

In this case:

$F(x) = F(x^+) = F(x+0) = P(X \le x)$
$F(x^-) = F(x-0) = P(X \le x) - P(X = x)$

When $P(X = x) \ne 0$, $F(x^+) = F(x) \gt F(x^-)$, making the distribution function right-continuous but not left-continuous.

Coin Toss Example

Consider a single coin toss:

Heads (1): Probability 0.5
Tails (0): Probability 0.5

Random variable $X$ has the distribution:

$$ \begin{cases} P(X=0) = 0.5 \\ P(X=1) = 0.5 \\ P(X=\text{other values}) = 0 \\ \end{cases} $$

Cumulative probabilities:

$$ \begin{cases} P(X \lt 0) = 0 \\ P(X \le 0) = P(X \lt 1) = 0.5 \\ P(X \le 1) = 1 \\ \end{cases} $$

Using the left-continuous definition $F(x) = P(X \lt x)$:

$$ F(x) = \begin{cases} 0, & x \le 0 \\ 0.5, & 0 \lt x \le 1 \\ 1, & x \gt 1 \\ \end{cases} $$

Here:

$F(0^-) = F(0) = 0$
$F(0^+) = 0.5$
At $x=0$, there is a discontinuity point where the function is left-continuous but not right-continuous.
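The two conventions can be compared side by side for this coin toss. In the sketch below, `cdf_left` and `cdf_right` are hypothetical helper names for $P(X \lt x)$ and $P(X \le x)$.

```python
# PMF of a single fair coin toss: tails = 0, heads = 1.
pmf = {0: 0.5, 1: 0.5}

def cdf_left(x):
    """Left-continuous convention: F(x) = P(X < x)."""
    return sum(p for value, p in pmf.items() if value < x)

def cdf_right(x):
    """Right-continuous convention: F(x) = P(X <= x)."""
    return sum(p for value, p in pmf.items() if value <= x)

# The two functions differ only at the jump points x = 0 and x = 1.
for x in (-0.5, 0, 0.5, 1, 1.5):
    print(x, cdf_left(x), cdf_right(x))
```

At each jump point the two values differ by exactly $P(X=x)=0.5$, and they agree everywhere else.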
