Importance Sampling Tutorial

Part I: Introduction

Chris Winstead

July, 2020

Introduction

Major Concepts:

Notation

Symbol Definition
X A random variable representing an experiment with numerical outcomes.
\mathcal{S} The sample space of possible results.
x \in \mathcal{S} A specific numerical outcome.
p_X(x) The probability measure of X in \mathcal{S}, evaluated for the specific value x.

Sample Mean

A room has N people with height h_k, where k=1,\dots,N. These are sample people. Among the samples, the sample mean of their height is

\overline{h} = \frac{1}{N}\sum_{k=1}^{N} h_k.

Some people in a room. The average height is 69 in.

“Expected” Value or “Expectation”

Over a long period of time, thousands of random people pass through the room. Let H be a random variable representing individual height, in rounded inches. Let \mathcal{S} be the range of possible human heights, and let p_H(h) be the probability measure for human heights. Then the expected value is

\mu_H = E(H) = \sum_{h \in \mathcal{S}} hp_H(h).

Height distribution across many people.

Estimation

In the long run, the mean is an estimate of the expected value:

\frac{1}{N}\sum_{k=1}^{N} h_k \rightarrow \sum_{h \in \mathcal{S}} hp_H(h).

More notation:

Symbol Definition
\overline{h} Sample mean across samples h_k
E(H) Expected value of outcomes from H
\mu_H Mean determined by p_H(h)
\hat{\mu}_H Estimate of \mu_H using some statistical method
\hat{H} Estimate of H (typically \hat{H} = \hat{\mu}_H = \overline{H})

Expectation Laws

Some useful laws that apply to expectations:

E(X+Y) = E(X) + E(Y)
E(XY) = E(X)E(Y) if X, Y are independent
E(aX) = aE(X) a is a scalar constant
E(X^2) = E((X-\mu_X)^2) + \mu_X^2

Variance

Measures the “noisiness” of a random variable:

\text{Var}(X) = E\left[(X-\mu_X)^2\right].
= E\left(X^2\right) - \mu_X^2

Standard Deviation is the “root mean square” value of X:

\sigma_X = \sqrt{\text{Var}(X)}

Often \text{Var}(X) is written as \sigma_X^2.

Variance Laws

\text{Var}(X+Y) = \text{Var}(X) + \text{Var}(Y) if X, Y are independent
\text{Var}(XY) = E(X^2)E(Y^2) - \mu_X^2\mu_Y^2 if X, Y are independent
\text{Var}(aX) = a^2\text{Var}(X) a is a scalar constant

Estimation Concepts

The Estimation Problem: Naive Sampling

Statistical estimation’s precision depends on sample count N.

The estimation \hat{\mu}_X can be considered a random variable. The precision is measured by the estimator variance: \text{Var}\left(\hat{\mu}_X\right) = \text{Var}\left(\frac{1}{N}\sum_{k=1}^N x_k\right)

The x_k are independent samples, so the variance of the sum equals the sum of the variances:

\text{Var}\left(\hat{\mu}_X\right)= \frac{1}{N^2}\sum_{k=1}^{N}\text{Var}\left(X\right)

= \frac{1}{N}\text{Var}\left(X\right)

Importance Sampling (IS)

IS defines an alternative estimator with smaller variance, hence requiring fewer samples.

Let’s define a random variable Y as the output from a magic genie who tells us the mean \mu_X.

Consulting the Oracle

Then we could make a magic estimator: \hat{\mu}_X = \overline{y}.

The magic estimator only gives one value, so it’s variance is zero. Also the estimation is accurate, because of magic.

Using the magic estimator, we require only a single sample.

Example: Coin Toss (Naive Method)

We have a biased coin with \Pr(H) = p. Let’s estimate p.

Toss the coin N times. Let N_H be the number of H outcomes. Then

\hat{p} = \frac{N_H}{N}.

Naive sampling gets expensive

Example: Coin Toss (Magic Method)

Use a two-faced coin. It always comes up H. Toss once.

The sample weight (from the oracle) is p, so the estimate is

\hat{p}^\star = p.

Magic! We can toss the coin more times if we want, but we will always get the same answer.

Toss the right coin only once.

Importance Sampling Preliminaries

Practical Importance Sampling

We don’t have a magic oracle. ¯\_(ツ)_/¯

Magic sampling shows that a process exists to estimate p with perfect accuracy in a single sample. We don’t know the optimal process, but…

We can use prior knowledge to approximate the answer. Consider the definition of expectation:

E(X) = \sum_{x\in\mathcal{S}} x p_X\left(x\right)

Given a random variable Y with outcomes in \mathcal{S} (i.e. the same universe as X), with alternative probability density g_Y(y), we can re-write the expectation:

E(X) = \sum_{y\in\mathcal{S}} y g_Y(y)\left(\frac{p_X(y)}{g_Y(y)}\right)

The alternate probability measure g_Y cancels out, leaving the original E(X).

Importance Sampling Estimator

Now let’s make a Monte Carlo estimator using samples of Y to estimate E(X):

\hat{\mu}_{X}^\text{(IS)} = \frac{1}{N}\sum_{k=1}^{N}y_k\frac{p_X\left(y_k\right)}{g_Y\left(y_k\right)}

As N\rightarrow\infty, this estimator equals E(X).

Each sample of Y is scaled by a sample weight

w\left(y_k\right) = \frac{p_X\left(y_k\right)}{g_Y\left(y_k\right)},

So we can simplify the estimator rule:

\hat{\mu}_X^\text{(IS)} = \frac{1}{N}\sum_{k=1}^{N}y_kw\left(y_k\right).

Importance Sampling Variance

The estimator variance under importance sampling is

\sigma_{\text{IS}}^2 = \text{Var}\left(\hat{\mu}_X^{\text{(IS)}}\right) = \text{Var}\left[ \frac{1}{N}\sum_{k=1}^N y_kw(y_k)\right].

Note that each of the y_k are independent samples, so we can apply some variance laws:

\sigma_{\text{IS}}^2 = \left(\frac{1}{N}\right)^2\sum_{k=1}^N \text{Var}\left[y_kw(y_k)\right].

Since the y_k are identically distributed, \text{Var}\left[y_kw(y_k)\right] is a constant. So the estimator variance depends on the modified random variable Yw(Y):

\sigma_{\text{IS}}^2 = \left(\frac{1}{N}\right) \text{Var}\left[Yw(Y)\right]
= \left(\frac{1}{N}\right)\left\{\left[\sum_y \left(y\frac{p(y)}{g(y)}\right)^2g(y)\right] - \mu_X^2\right\}

The Main Challenge for Importance Sampling

Choose a sample distribution g_Y to minimize \text{Var}\left(\hat{\mu}_X^\text{(IS)}\right).

In general, the optimal answer is

g_Y^\star(y) = \frac{yp_X(y)}{\mu_X}

Using this distribution, every sample has weight

w(y) = \frac{\mu_X}{y},

so the importance sampling estimator becomes \hat{\mu}_X = \frac{1}{N}\sum_{i=1}^N y_i \frac{\mu_X}{y_i} = \mu_X.

Dice Roll Example

Dice Roll Example: Naive Sampling

Now let’s analyze an unfair six-sided die named X with distribution

p_X = [1~ 2~ 3~ 4~ 5~ 6]/21.

For this distribution, the expected value is

\mu_X = \frac{1}{21} + 2\frac{2}{21} + 3\frac{3}{21} + 4\frac{4}{21} + 5\frac{5}{21} + 6\frac{6}{21} = 4.333.

Monte Carlo estimation should give a result close to this value.

Click Here for Naive Sampling Demonstraton

Dice Roll Example: Magic Sampling

As with the cointoss example, we know the answer beforehand, so we can use an optimal sampling distribution:

g_Y = \frac{yp(y)}{\mu_X} = \frac{[1~ 4~ 9~ 16~ 25~ 36]}{21\times 4.333}.

Using this information, I fabricate a new die named Y with distribution g_Y.

Each time I roll Y, resulting in value y_i, I count the weighted sample value

y_iw(y_i) = y_i\frac{p(y_i)\mu_X}{y_i p(y_i)} = \mu_X.

Click Here for Magic Sampling Demonstraton

Dice Roll Example: Importance Sampling

Now suppose we don’t know the exact optimal sampling distribution, but can get a close approximation. To make some example “approximate” distributions, I modified the optimal distribution with the exponent m:

g(y) = \alpha g^\star(y) y^{m-1},

where \alpha is a normalizing constant.

Dice Roll Example: Approximate Distributions

If m=1 then g(y) = g^\star(y).

If m>1, then g(y) resembles g^\star, but not exactly.

Approximate sample distributions.

Dice Roll Example: Sample Results

With an approximated sample distribution, all the weighted samples are close to \mu_X. The average over many weighted samples converges to \mu_X.

k 1 2 3 4 5 6
1 4.225
2 4.400
3 4.225
4 4.225
5 4.303
6 4.225
7 4.528

Dice Roll Example: Estimator Variance

Method \text{Var}\left(\hat{\mu}_X\right)
Naive Sampling \frac{1}{N}\text{Var}(X).
Magic Sampling 0
Importance Sampling complicated

Since the weighted samples Y are all close to \mu_X, the estimator variance is small. How small? That’s not simple.

Dice Roll Example: Repeated Experiments

Estimator variance indicates the amount of variation we will see when repeating the same statistical experiment many times, obtaining a sequence of estimates.

Comparison of estimator variance using different methods.

Importance Sampling Gain

Sampling Gain

The sampling gain is how much we can reduce the sample count to reach a desired variance.

Alternatively, it’s the ratio of estimator variances for a fixed number of samples.

Method Definition Optimal Gain
Constant Variance \left.\frac{N~\text{(naive)}}{N~\text{(importance)}}\right|_{\text{Var}} N~\text{(naive)}
Constant Sample Count \left.\frac{\text{Var}\text{(naive)}}{\text{Var}\text{(importance)}}\right|_N \infty

We usually use the constant sample count definition, i.e. the variance ratio.

Variance Ratio

Recall the estimator variances for naive sampling vs importance sampling:

Naive \frac{1}{N}\text{Var}\left(X\right)
Importance \frac{1}{N}\text{Var}\left(Yw(Y)\right)

Expanding these definitions, the sampling gain works out to be

\text{Gain} = \frac{\sum_x x^2p(x) - \mu_X^2}{\sum_y \left(y\frac{p(y)}{g(y)}\right)^2g(y) - \mu_X^2}

Cointoss Example: Sampling Gain

Returning to the cointoss example, suppose our coin X has p=0.8. The variance for an individual cointoss is \text{Var}(X)=p(1-p).

Now let’s do importance sampling with coin Y with probability g, with variance \text{Var}(Y)=g(1-g).

For a constant sample count N, the estimator variances are

Method Sample Variance Estimator Variance
Naive Sampling p(1-p) \frac{p(1-p)}{N}
Importance Sampling g(1-g) \frac{g(1-g)}{N}\left(\frac{p}{g}\right)^2 = \frac{p^2(1-g)}{gN}

We see that the optimal importance sampling distribution is g=1. When g=p, the variances are the same. When g<p, the importance sampling variance is worse than naive sampling!

Cointoss Example: Sampling Gain

Based on the variance results, the sampling gain should be

G = \frac{g(1-p)}{p(1-g)}

Sampling gain for the cointoss example.

Dice Roll Example: Sampling Gain

Using the “distorted” or approximate sampling distribution, g(y) = \alpha g^\star(y)y^{m-1}, the gains are measured for a range of m, using the constant-sample definition:

Sampling gains for approximate importance sampling distributions.