We consider hypothesis testing in this note.

Question
What is a $p$-value?
Solution

A $p$-value is the probability, assuming the null hypothesis $H_0$ is true, of observing a test statistic at least as extreme as the one actually obtained. Suppose the observed test statistic is $t$, and under $H_0$ the statistic $T$ has a known reference distribution.

For a right-tailed test, $$ p = \mathbb{P}(T \ge t \mid H_0). $$

For a left-tailed test, $$ p = \mathbb{P}(T \le t \mid H_0). $$

For a two-sided test, when the null distribution is symmetric,

$$ p = \mathbb{P}(|T| \ge |t| \mid H_0) = 2\min\bigl\{\mathbb{P}(T \ge t \mid H_0),\; \mathbb{P}(T \le t \mid H_0)\bigr\}. $$

A smaller $p$-value means the observed result would be less likely under $H_0$, and therefore provides stronger evidence against the null hypothesis. However, a $p$-value is not the probability that $H_0$ is true.
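The three tail probabilities above can be sketched for a standard normal reference distribution using only the standard library (the function name `p_values` is illustrative; other reference distributions such as $t$ or $\chi^2$ follow the same pattern):

```python
from statistics import NormalDist

def p_values(t: float) -> dict:
    """Tail probabilities of the test statistic t, assuming T ~ N(0, 1) under H0."""
    z = NormalDist()                   # standard normal reference distribution
    right = 1 - z.cdf(t)               # P(T >= t | H0), right-tailed
    left = z.cdf(t)                    # P(T <= t | H0), left-tailed
    two_sided = 2 * min(right, left)   # symmetric two-sided p-value
    return {"right": right, "left": left, "two_sided": two_sided}

print(p_values(1.96))  # two-sided p-value is approximately 0.05
```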

Question
How many coin tosses $n$ are needed to determine whether a coin is fair?
Solution

No finite number of tosses can prove that a coin is exactly fair. Instead, we perform a hypothesis test: $$ H_0: p=\frac12, \qquad H_1: p\ne \frac12. $$

Let $H$ be the number of heads in $n$ tosses, and let $$ \hat p = \frac{H}{n}. $$ Under $H_0$, we have $$ H \sim \mathrm{Binomial}\left(n,\frac12\right), $$ and for large $n$, $$ \hat p \approx \mathcal{N}\left(\frac12,\frac{1}{4n}\right). $$

For a two-sided test at significance level $\alpha$, we reject $H_0$ when $$ \left|\hat p-\frac12\right| \ge z_{1-\alpha/2}\frac{1}{2\sqrt{n}}. $$

If we want to detect deviations of size at least $\delta$, we require $$ z_{1-\alpha/2}\frac{1}{2\sqrt{n}} \le \delta. $$ Solving for $n$ gives $$ n \ge \frac{z_{1-\alpha/2}^2}{4\delta^2}. $$

For example, at the $5\%$ significance level, $z_{0.975}=1.96$, so $$ n \ge \frac{1.96^2}{4\delta^2} = \frac{0.9604}{\delta^2}. $$

Thus, the required number of tosses depends on the smallest deviation $\delta$ from fairness that one wants to detect.

Question
How does one test whether a model hypothesis is good?
Solution

A model hypothesis is judged primarily by how well it performs on unseen data, not just on the training set; that is, we estimate its out-of-sample error.

Let the model be $\hat f$, trained on a training dataset. We then evaluate it on a separate validation or test dataset that was not used during fitting. If the loss function is $\ell(\hat f(x), y)$, the out-of-sample error is estimated by $$ \frac{1}{m}\sum_{i=1}^m \ell(\hat f(x_i), y_i), $$ where $(x_i,y_i)$ are held-out samples.
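The held-out average loss above is straightforward to compute; a minimal sketch (the function name `out_of_sample_error` and the toy model are illustrative, not from the text):

```python
def out_of_sample_error(model, heldout, loss):
    """Average loss of `model` over held-out (x, y) pairs: (1/m) * sum loss(f(x_i), y_i)."""
    return sum(loss(model(x), y) for x, y in heldout) / len(heldout)

# Toy example: a fixed linear model and squared-error loss on three held-out points.
model = lambda x: 2 * x
heldout = [(1, 2.1), (2, 3.9), (3, 6.2)]
err = out_of_sample_error(model, heldout, lambda yhat, y: (yhat - y) ** 2)
print(err)
```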

Common ways to assess out-of-sample performance:

  1. Train/validation/test split: fit on training data, tune on validation data, and report final performance on test data.
  2. Cross-validation: repeatedly split the data into training and validation folds and average the results. This is useful when the dataset is small.
  3. Compare training error and out-of-sample error:
    • low training error and low test error $\Rightarrow$ good fit,
    • low training error but high test error $\Rightarrow$ overfitting,
    • high training error and high test error $\Rightarrow$ underfitting.
  4. Check stability: a good model performs reasonably consistently across different data splits.

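The fold construction behind cross-validation can be sketched with the standard library alone (the helper name `k_fold_indices` is illustrative; libraries such as scikit-learn provide production versions):

```python
import random

def k_fold_indices(n: int, k: int, seed: int = 0):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation over n samples."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)        # shuffle once, deterministically
    folds = [idx[i::k] for i in range(k)]   # k disjoint folds covering all indices
    for i in range(k):
        val = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, val

# Each sample appears in exactly one validation fold.
for train, val in k_fold_indices(10, 5):
    print(len(train), len(val))
```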
Question
Suppose one performs the same hypothesis test on two datasets with sample sizes $n_1$ and $n_2$, and both produce a $p$-value of $0.04$. Which result is more trustworthy?
Solution

The result based on the larger sample size is more trustworthy since the underlying quantities are estimated more precisely.

For example, if we estimate a probability parameter $p$ by the sample proportion $$ \hat p = \frac{1}{n}\sum_{i=1}^n I_i, \qquad I_i \sim \mathrm{Bernoulli}(p), $$ then $\mathbb{E}[\hat p] = p$ and $\mathrm{Var}(\hat p) = p(1-p)/n$, so the standard deviation is $\sqrt{p(1-p)/n}$.

When $n$ is large, a $95\%$ confidence interval is approximately $$ \hat p \pm 1.96\sqrt{\frac{p(1-p)}{n}}. $$ Thus, as $n$ increases, the confidence interval narrows at rate $1/\sqrt{n}$. Therefore, if $n_2 > n_1$, the estimate based on $n_2$ has smaller variance, and its $p$-value of $0.04$ is the more trustworthy of the two.
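The $1/\sqrt{n}$ narrowing of the interval can be sketched numerically (the function name `ci_halfwidth` is illustrative; $p$ here stands for the true or estimated proportion):

```python
import math

def ci_halfwidth(p: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% CI half-width for a sample proportion: z * sqrt(p(1-p)/n)."""
    return z * math.sqrt(p * (1 - p) / n)

small_n = ci_halfwidth(0.5, 100)      # n1 = 100
large_n = ci_halfwidth(0.5, 10_000)   # n2 = 10,000: 100x the data
print(small_n, large_n)               # the second interval is 10x narrower
```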