Part I: Introduction
Chris Winstead
July, 2020
Symbol | Definition
---|---
X | A random variable representing an experiment with numerical outcomes.
\mathcal{S} | The sample space of possible results.
x \in \mathcal{S} | A specific numerical outcome.
p_X(x) | The probability measure of X on \mathcal{S}, evaluated at the specific value x.
A room has N people with heights h_k, where k=1,\dots,N. These people are our samples, and the sample mean of their heights is
\overline{h} = \frac{1}{N}\sum_{k=1}^{N} h_k.
Over a long period of time, thousands of random people pass through the room. Let H be a random variable representing individual height, in rounded inches. Let \mathcal{S} be the range of possible human heights, and let p_H(h) be the probability measure for human heights. Then the expected value is
\mu_H = E(H) = \sum_{h \in \mathcal{S}} hp_H(h).
In the long run, the mean is an estimate of the expected value:
\frac{1}{N}\sum_{k=1}^{N} h_k \rightarrow \sum_{h \in \mathcal{S}} hp_H(h).
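As a quick illustration, here is a minimal Python sketch of this convergence. The height distribution used below is a made-up uniform placeholder, not anything prescribed above.

```python
# Sketch: the sample mean approaches the expected value as the sample count grows.
import numpy as np

rng = np.random.default_rng(0)

heights = np.arange(60, 78)                  # rounded heights in inches (hypothetical)
p_H = np.ones(len(heights)) / len(heights)   # uniform measure, for illustration only

mu_H = np.sum(heights * p_H)                 # expected value E(H)

for N in (10, 1_000, 100_000):
    h = rng.choice(heights, size=N, p=p_H)
    print(N, h.mean(), mu_H)                 # sample mean -> mu_H as N grows
```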
Symbol | Definition
---|---
\overline{h} | Sample mean across samples h_k
E(H) | Expected value of outcomes from H
\mu_H | Mean determined by p_H(h)
\hat{\mu}_H | Estimate of \mu_H using some statistical method
\hat{H} | Estimate of H (typically \hat{H} = \hat{\mu}_H = \overline{H})
Some useful laws that apply to expectations:
Law | Condition
---|---
E(X+Y) = E(X) + E(Y) |
E(XY) = E(X)E(Y) | X, Y independent
E(aX) = aE(X) | a is a scalar constant
E(X^2) = E\left[(X-\mu_X)^2\right] + \mu_X^2 |
The variance measures the “noisiness” of a random variable:
\text{Var}(X) = E\left[(X-\mu_X)^2\right] = E\left(X^2\right) - \mu_X^2.
The standard deviation is the “root mean square” deviation of X from its mean:
\sigma_X = \sqrt{\text{Var}(X)}
Often \text{Var}(X) is written as \sigma_X^2.
Law | Condition
---|---
\text{Var}(X+Y) = \text{Var}(X) + \text{Var}(Y) | X, Y independent
\text{Var}(XY) = E(X^2)E(Y^2) - \mu_X^2\mu_Y^2 | X, Y independent
\text{Var}(aX) = a^2\text{Var}(X) | a is a scalar constant
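These identities are easy to check numerically. A minimal sketch, using an arbitrary made-up discrete distribution, verifying \text{Var}(X) = E(X^2) - \mu_X^2 and \text{Var}(aX) = a^2\text{Var}(X):

```python
# Sketch: numerical check of two variance identities on an arbitrary discrete distribution.
import numpy as np

rng = np.random.default_rng(1)
x = rng.choice([1, 2, 3], size=200_000, p=[0.2, 0.3, 0.5])

print(np.var(x), np.mean(x**2) - np.mean(x)**2)   # Var(X) = E(X^2) - mu_X^2
print(np.var(3 * x), 9 * np.var(x))               # Var(aX) = a^2 Var(X), with a = 3
```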
The precision of a statistical estimate depends on the sample count N.
The estimate \hat{\mu}_X can itself be considered a random variable. Its precision is measured by the estimator variance: \text{Var}\left(\hat{\mu}_X\right) = \text{Var}\left(\frac{1}{N}\sum_{k=1}^N x_k\right)
The x_k are independent samples, so the variance of the sum equals the sum of the variances:
\text{Var}\left(\hat{\mu}_X\right) = \frac{1}{N^2}\sum_{k=1}^{N}\text{Var}\left(X\right) = \frac{1}{N}\text{Var}\left(X\right)
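A short sketch of this 1/N behavior: repeat the sample-mean experiment many times for several values of N and compare the empirical estimator variance to \text{Var}(X)/N. The distribution of X below (standard exponential, so \text{Var}(X)=1) is an arbitrary stand-in.

```python
# Sketch: the variance of the sample-mean estimator falls off as Var(X)/N.
import numpy as np

rng = np.random.default_rng(2)

for N in (10, 100, 1000):
    estimates = [rng.exponential(size=N).mean() for _ in range(5000)]
    print(N, np.var(estimates), 1.0 / N)   # empirical estimator variance vs Var(X)/N = 1/N
```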
Importance Sampling (IS) defines an alternative estimator with smaller variance, hence requiring fewer samples.
Let’s define a random variable Y as the output from a magic genie who tells us the mean \mu_X.
Then we could make a magic estimator: \hat{\mu}_X = \overline{y}.
The magic estimator only gives one value, so its variance is zero. Also the estimate is accurate, because of magic.
Using the magic estimator, we require only a single sample.
We have a biased coin with \Pr(H) = p. Let’s estimate p.
Toss the coin N times. Let N_H be the number of H outcomes. Then
\hat{p} = \frac{N_H}{N}.
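A minimal sketch of this naive estimator; the true p is used only to simulate the coin, and the estimator itself never sees it.

```python
# Sketch: naive Monte Carlo estimate of Pr(H) for a biased coin.
import numpy as np

rng = np.random.default_rng(3)
p, N = 0.8, 1000

tosses = rng.random(N) < p        # True marks an H outcome
p_hat = tosses.mean()             # \hat{p} = N_H / N
print(p_hat)
```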
Use a two-faced coin that always comes up H. Toss it once.
The sample weight (from the oracle) is p, so the estimate is
\hat{p}^\star = p.
Magic! We can toss the coin more times if we want, but we will always get the same answer.
We don’t have a magic oracle. ¯\_(ツ)_/¯
Magic sampling shows that a process exists to estimate p with perfect accuracy in a single sample. We don’t know the optimal process, but…
We can use prior knowledge to approximate the answer. Consider the definition of expectation:
E(X) = \sum_{x\in\mathcal{S}} x p_X\left(x\right)
Given a random variable Y with outcomes in \mathcal{S} (i.e. the same universe as X), with an alternative probability measure g_Y(y), we can rewrite the expectation:
E(X) = \sum_{y\in\mathcal{S}} y g_Y(y)\left(\frac{p_X(y)}{g_Y(y)}\right)
The alternate probability measure g_Y cancels out, leaving the original E(X).
Now let’s make a Monte Carlo estimator using samples of Y to estimate E(X):
\hat{\mu}_{X}^\text{(IS)} = \frac{1}{N}\sum_{k=1}^{N}y_k\frac{p_X\left(y_k\right)}{g_Y\left(y_k\right)}
As N\rightarrow\infty, this estimator converges to E(X).
Each sample of Y is scaled by a sample weight
w\left(y_k\right) = \frac{p_X\left(y_k\right)}{g_Y\left(y_k\right)},
so we can simplify the estimator rule:
\hat{\mu}_X^\text{(IS)} = \frac{1}{N}\sum_{k=1}^{N}y_kw\left(y_k\right).
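A minimal sketch of this estimator rule for a discrete random variable. The function and the example distributions below are placeholders of my own choosing, not anything prescribed by the text.

```python
# Sketch: importance-sampling estimate of E(X) for a discrete random variable.
import numpy as np

rng = np.random.default_rng(4)

def is_estimate(values, p_X, g_Y, N):
    """Average of y_k * w(y_k), where the y_k are drawn from g_Y and w = p_X / g_Y."""
    idx = rng.choice(len(values), size=N, p=g_Y)   # sample outcomes according to g_Y
    y = values[idx]
    w = p_X[idx] / g_Y[idx]                        # sample weights w(y_k)
    return np.mean(y * w)

# Example: an arbitrary target distribution p_X and sampling distribution g_Y on {1,...,6}.
values = np.arange(1, 7)
p_X = np.array([0.1, 0.1, 0.2, 0.2, 0.2, 0.2])
g_Y = np.ones(6) / 6
print(is_estimate(values, p_X, g_Y, 100_000))      # close to E(X) = 3.9
```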
The estimator variance under importance sampling is
\sigma_{\text{IS}}^2 = \text{Var}\left(\hat{\mu}_X^{\text{(IS)}}\right) = \text{Var}\left[ \frac{1}{N}\sum_{k=1}^N y_kw(y_k)\right].
Note that the y_k are independent samples, so we can apply the variance laws:
\sigma_{\text{IS}}^2 = \left(\frac{1}{N}\right)^2\sum_{k=1}^N \text{Var}\left[y_kw(y_k)\right].
Since the y_k are identically distributed, \text{Var}\left[y_kw(y_k)\right] is a constant. So the estimator variance depends on the modified random variable Yw(Y):
\sigma_{\text{IS}}^2 = \left(\frac{1}{N}\right) \text{Var}\left[Yw(Y)\right] = \left(\frac{1}{N}\right)\left\{\left[\sum_y \left(y\frac{p(y)}{g(y)}\right)^2g(y)\right] - \mu_X^2\right\}
Choose a sample distribution g_Y to minimize \text{Var}\left(\hat{\mu}_X^\text{(IS)}\right).
In general (for a non-negative random variable X), the optimal answer is
g_Y^\star(y) = \frac{yp_X(y)}{\mu_X}
Using this distribution, every sample has weight
w(y) = \frac{\mu_X}{y},
so the importance sampling estimator becomes \hat{\mu}_X = \frac{1}{N}\sum_{i=1}^N y_i \frac{\mu_X}{y_i} = \mu_X.
Now let’s analyze an unfair six-sided die named X with distribution
p_X = [1~ 2~ 3~ 4~ 5~ 6]/21.
For this distribution, the expected value is
\mu_X = 1\cdot\frac{1}{21} + 2\cdot\frac{2}{21} + 3\cdot\frac{3}{21} + 4\cdot\frac{4}{21} + 5\cdot\frac{5}{21} + 6\cdot\frac{6}{21} = \frac{91}{21} \approx 4.333.
Monte Carlo estimation should give a result close to this value.
As with the coin-toss example, we know the answer beforehand, so we can use the optimal sampling distribution:
g_Y(y) = \frac{yp(y)}{\mu_X} = \frac{[1~ 4~ 9~ 16~ 25~ 36]}{21\times 4.333} = \frac{[1~ 4~ 9~ 16~ 25~ 36]}{91}.
Using this information, I fabricate a new die named Y with distribution g_Y.
Each time I roll Y, resulting in value y_i, I count the weighted sample value
y_iw(y_i) = y_i\frac{p(y_i)\mu_X}{y_i p(y_i)} = \mu_X.
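A sketch of this experiment: build the fabricated die with distribution g_Y^\star, roll it a few times, and observe that every weighted sample equals \mu_X exactly.

```python
# Sketch: rolling the fabricated die Y under the optimal distribution g_Y^*.
import numpy as np

rng = np.random.default_rng(5)

values = np.arange(1, 7)
p_X = values / 21.0
mu_X = np.sum(values * p_X)                  # 91/21 = 4.333...

g_star = values * p_X / mu_X                 # optimal sampling distribution
idx = rng.choice(6, size=5, p=g_star)        # five rolls of the fabricated die
print(values[idx] * p_X[idx] / g_star[idx])  # every weighted sample equals mu_X
```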
Now suppose we don’t know the exact optimal sampling distribution, but can get a close approximation. To make some example “approximate” distributions, I modified the optimal distribution with the exponent m:
g(y) = \alpha g^\star(y) y^{m-1},
where \alpha is a normalizing constant.
If m=1 then g(y) = g^\star(y).
If m>1, then g(y) resembles g^\star, but not exactly.
With an approximated sample distribution, all the weighted samples are close to \mu_X. The average over many weighted samples converges to \mu_X.
k | y_k w(y_k)
---|---
1 | 4.225
2 | 4.400
3 | 4.225
4 | 4.225
5 | 4.303
6 | 4.225
7 | 4.528
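The sketch below reproduces this kind of experiment: it distorts g^\star with the exponent m (the value m = 1.1 is an arbitrary choice) and prints a handful of weighted samples, which cluster near \mu_X like the values in the table above.

```python
# Sketch: weighted samples under a distorted distribution g(y) = alpha * g_star(y) * y**(m-1).
import numpy as np

rng = np.random.default_rng(6)

values = np.arange(1, 7)
p_X = values / 21.0
mu_X = np.sum(values * p_X)

m = 1.1                                      # distortion exponent (arbitrary choice)
g = (values * p_X / mu_X) * values**(m - 1)
g = g / g.sum()                              # alpha normalizes g to a valid distribution

idx = rng.choice(6, size=7, p=g)             # seven rolls of the distorted die
print(values[idx] * p_X[idx] / g[idx])       # weighted samples, all near mu_X
```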
Method | \text{Var}\left(\hat{\mu}_X\right)
---|---
Naive Sampling | \frac{1}{N}\text{Var}(X)
Magic Sampling | 0
Importance Sampling | complicated
Since the weighted samples Yw(Y) are all close to \mu_X, the estimator variance is small. How small? That’s not simple.
Estimator variance indicates the amount of variation we will see when repeating the same statistical experiment many times, obtaining a sequence of estimates.
The sampling gain is how much we can reduce the sample count to reach a desired variance.
Alternatively, it’s the ratio of estimator variances for a fixed number of samples.
Method | Definition | Optimal Gain |
---|---|---|
Constant Variance | \left.\frac{N~\text{(naive)}}{N~\text{(importance)}}\right|_{\text{Var}} | N~\text{(naive)} |
Constant Sample Count | \left.\frac{\text{Var}\text{(naive)}}{\text{Var}\text{(importance)}}\right|_N | \infty |
We usually use the constant sample count definition, i.e. the variance ratio.
Recall the estimator variances for naive sampling vs importance sampling:
Method | Estimator Variance
---|---
Naive | \frac{1}{N}\text{Var}\left(X\right)
Importance | \frac{1}{N}\text{Var}\left(Yw(Y)\right)
Expanding these definitions, the sampling gain works out to be
\text{Gain} = \frac{\sum_x x^2p(x) - \mu_X^2}{\sum_y \left(y\frac{p(y)}{g(y)}\right)^2g(y) - \mu_X^2}
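As a sketch, this formula can be evaluated directly for any discrete p(x) and g(y); the distributions used below are my own placeholders. Note the gain can fall below one when g is chosen poorly.

```python
# Sketch: numerical evaluation of the sampling gain for discrete distributions.
import numpy as np

def sampling_gain(values, p, g):
    mu = np.sum(values * p)
    var_naive = np.sum(values**2 * p) - mu**2            # Var(X)
    var_is = np.sum((values * p / g)**2 * g) - mu**2     # Var(Y w(Y))
    return var_naive / var_is

values = np.arange(1, 7)
p = values / 21.0
mu = np.sum(values * p)

g_distorted = (values * p / mu) * values**0.1            # distorted g with m = 1.1
g_distorted /= g_distorted.sum()

print(sampling_gain(values, p, np.ones(6) / 6))          # uniform g: gain < 1, worse than naive
print(sampling_gain(values, p, g_distorted))             # near-optimal g: large gain
```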
Returning to the coin-toss example, suppose our coin X has p=0.8. The variance of an individual coin toss is \text{Var}(X)=p(1-p).
Now let’s do importance sampling using a coin Y with probability g and variance \text{Var}(Y)=g(1-g).
For a constant sample count N, the estimator variances are
Method | Sample Variance | Estimator Variance |
---|---|---|
Naive Sampling | p(1-p) | \frac{p(1-p)}{N} |
Importance Sampling | g(1-g) | \frac{g(1-g)}{N}\left(\frac{p}{g}\right)^2 = \frac{p^2(1-g)}{gN} |
We see that the optimal importance sampling distribution is g=1. When g=p, the variances are the same. When g<p, the importance sampling variance is worse than naive sampling!
Based on the variance results, the sampling gain should be
G = \frac{g(1-p)}{p(1-g)}
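A quick sketch evaluating this gain for p=0.8 over a few values of g: G < 1 for g < p, G = 1 at g = p, and G grows without bound as g \rightarrow 1.

```python
# Sketch: coin-toss sampling gain G = g(1-p) / (p(1-g)) for a few sampling probabilities g.
p = 0.8
for g in (0.5, 0.8, 0.9, 0.99):
    G = g * (1 - p) / (p * (1 - g))
    print(g, round(G, 3))
```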
Using the “distorted” or approximate sampling distribution, g(y) = \alpha g^\star(y)y^{m-1}, the gains are measured for a range of m, using the constant-sample definition: