Head | Math & Stats Lounge

Correlation changes after conditioning

Question Suppose you have two features, Feature A and Feature B, and in a large and sufficiently diverse population you find that their correlation is $0$. If you recompute the correlation on a restricted subpopulation, should you still expect it to be $0$?

Solution Not necessarily. Correlation measures only linear association, and its value depends on the population being sampled. Even if Feature A and Feature B have correlation $0$ in the full population, the correlation can be nonzero after restricting to a subpopulation. One way this happens is that the two features have a nonlinear relationship, and the full population is symmetric enough that the positive and negative contributions to the correlation cancel out. ...
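As a small illustration of the cancellation described above, here is a sketch under assumed data: Feature A takes the integer values $-100,\dots,100$ and Feature B is its square. Both the data and the helper name `pearson` are hypothetical choices for this example, not something from the original note.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

# Assumed toy population: A is symmetric around 0, B = A^2 (purely nonlinear).
feature_a = list(range(-100, 101))
feature_b = [a * a for a in feature_a]

# Full population: the symmetry makes the covariance cancel.
full_corr = pearson(feature_a, feature_b)

# Restricted subpopulation: keep only the rows with A > 0.
subset = [(a, b) for a, b in zip(feature_a, feature_b) if a > 0]
sub_corr = pearson([a for a, _ in subset], [b for _, b in subset])

print(f"full population: {full_corr:.6f}")
print(f"A > 0 only:      {sub_corr:.6f}")
```

On the full population the correlation is zero up to floating-point rounding, while on the $A > 0$ subpopulation it is strongly positive, because there $B$ increases monotonically with $A$.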

Reservoir sampling

Question Given a set or stream of values, describe an algorithm that draws a uniform random sample from it.

Solution When the dataset is very large or arrives as a stream, reservoir sampling applies, and this note describes it. Goal: sample $k$ items uniformly at random from $n \gg 1$ values, without storing all $n$ values in memory.

Base case $k=1$: Store the first item as the current sample. When the $i$-th item arrives $(i \ge 2)$, replace the current sample with it with probability $1/i$. After processing all $n$ items, each item is the stored sample with probability exactly $1/n$: item $i$ is chosen with probability $1/i$ and then survives each later step $m$ with probability $1 - 1/m$, so its overall probability is $\frac{1}{i} \prod_{m=i+1}^{n} \frac{m-1}{m} = \frac{1}{n}$.

General $k$ case: Put the first $k$ items into the reservoir. For each item $i=k+1,\dots,n$, generate a random integer $j$ uniformly from $\{1,\dots,i\}$. If $j \le k$, replace the $j$-th element of the reservoir with the new item; otherwise discard the new item. After processing all $n$ items, each item is in the reservoir with probability $k/n$. ...
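The general-$k$ procedure above can be sketched in Python as follows. The function name `reservoir_sample` and the optional `rng` parameter are assumptions made for this example; the logic follows the steps as stated, using 1-based item indices.

```python
import random

def reservoir_sample(stream, k, rng=random):
    """Return k items drawn uniformly at random from an iterable.

    Only the reservoir (at most k items) is kept in memory.
    If the stream has fewer than k items, all of them are returned.
    """
    reservoir = []
    for i, item in enumerate(stream, start=1):  # i is the 1-based item index
        if i <= k:
            # The first k items fill the reservoir directly.
            reservoir.append(item)
        else:
            # Draw j uniformly from {1, ..., i}; randint is inclusive on both ends.
            j = rng.randint(1, i)
            if j <= k:
                reservoir[j - 1] = item  # replace the j-th reservoir element
    return reservoir

# Usage: sample 5 values from a stream of 1000 without materializing it.
sample = reservoir_sample(iter(range(1000)), 5, random.Random(0))
print(sample)
```

Passing a seeded `random.Random` instance makes a run reproducible; with the default module-level generator each call gives a fresh sample. Setting `k=1` recovers the base case, since the new item then replaces the stored one exactly when $j = 1$, which happens with probability $1/i$.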

Hypothesis testing for coin toss

This note covers hypothesis testing.

Question What is a $p$-value?

Solution A $p$-value is the probability, assuming the null hypothesis $H_0$ is true, of observing a test statistic at least as extreme as the one actually obtained. Suppose the observed value of the test statistic is $t$, and under $H_0$ the statistic $T$ has a known reference distribution. For a right-tailed test, $$ p = \mathbb{P}(T \ge t \mid H_0). $$ For a left-tailed test, $$ p = \mathbb{P}(T \le t \mid H_0). $$ ...
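To make the two tail formulas concrete, here is a minimal sketch for a coin toss. It assumes, as an illustration only, that $T$ is the number of heads in $n$ independent tosses and that $H_0$ says the coin is fair, so $T \sim \mathrm{Binomial}(n, 1/2)$ under $H_0$. The function names and the numbers ($n = 10$, $t = 8$) are hypothetical choices for this example.

```python
from math import comb

def binom_pmf(k, n, p):
    """P(T = k) for T ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def p_value_right(t, n, p0=0.5):
    """Right-tailed p-value: P(T >= t | H0)."""
    return sum(binom_pmf(k, n, p0) for k in range(t, n + 1))

def p_value_left(t, n, p0=0.5):
    """Left-tailed p-value: P(T <= t | H0)."""
    return sum(binom_pmf(k, n, p0) for k in range(0, t + 1))

# Assumed example: 8 heads observed in 10 tosses, null hypothesis p0 = 0.5.
p_right = p_value_right(8, 10)
print(p_right)  # (45 + 10 + 1) / 1024 = 0.0546875
```

By the symmetry of the fair-coin distribution, `p_value_left(2, 10)` gives the same value: seeing at most 2 heads is exactly as likely under $H_0$ as seeing at least 8.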