Part II: Application
Chris Winstead
July, 2020
A population has a height distribution shown below. We estimate their average height with the sample mean,
\overline{h} = \frac{1}{N}\sum_{k=1}^{N} h_k
| Quantity    | Value   |
|-------------|---------|
| Expectation | 66.2873 |
| N           | 50k     |
| NS estimate | 66.2287 |
| NS variance | 19.5166 |
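As a concrete illustration, here is a minimal sketch of the direct (NS) estimate. The population distribution is only given as a figure, so a normal model with mean 66 in and standard deviation 4.4 in is assumed here purely for illustration; the constants are not taken from the original example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed population model (the document only shows the distribution as a figure):
# normal with mean 66 in and standard deviation 4.4 in, chosen for illustration.
MU, SIGMA, N = 66.0, 4.4, 50_000

h = rng.normal(MU, SIGMA, N)   # draw N heights directly from the population model
h_bar = h.mean()               # NS estimate: the sample mean
h_var = h.var()                # sample variance of the heights

print(f"NS estimate: {h_bar:.4f}")
print(f"NS variance: {h_var:.4f}")
```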
To use importance sampling, we can slightly shift the original distribution to something more favorable.
| Quantity      | Value   |
|---------------|---------|
| Expectation   | 66.2275 |
| N             | 50k     |
| NS estimate   | 66.2873 |
| NS variance   | 19.5166 |
| IS estimate   | 66.2734 |
| IS variance   | 9.7964  |
| Sampling gain | 1.9922  |
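A minimal sketch of the shifted-distribution estimate, under the same assumed normal population model as above; the shift amount is illustrative, not the one used to produce the table.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Assumed normal population model as before; the +2 in shift is illustrative.
MU, SIGMA, N = 66.0, 4.4, 50_000
SHIFT = 2.0

p = norm(MU, SIGMA)           # original density p(h)
g = norm(MU + SHIFT, SIGMA)   # slightly shifted sampling density g(h)

h = g.rvs(N, random_state=rng)   # draw heights from the shifted distribution
w = p.pdf(h) / g.pdf(h)          # importance weights w(h) = p(h)/g(h)

mu_is = np.mean(w * h)           # IS estimate of the mean
var_is = np.var(w * h)           # variance of the weighted samples

print(f"IS estimate: {mu_is:.4f}")
print(f"IS variance: {var_is:.4f}")
```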
One interpretation is that IS reduces the number of low-weight samples. At the same time, we want to avoid creating outlying high-weight samples that increase the variance.
Recall the optimal sampling distribution, g^{\star}_H(h) = h\,p_H(h)/\mu_H.
In this example, the optimal distribution is very close to the original.
With optimal importance sampling, all the weighted samples are concentrated around the mean.
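To see why, substitute g^{\star}_H into the weight; every weighted sample collapses to the mean itself, so the estimator has zero variance:
h_k\, w\left(h_k\right) = h_k\,\frac{p_H\left(h_k\right)}{g^{\star}_H\left(h_k\right)} = h_k\,\frac{\mu_H\, p_H\left(h_k\right)}{h_k\, p_H\left(h_k\right)} = \mu_H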
This is where Importance Sampling really shines.
Say we want to measure a small tail probability:
p_t=\Pr\left(H > 74\text{in}\right)
To estimate this, we count how many people are taller than 74 in.
An indicator function returns a binary value indicating if the condition is met:
\mathbf{1}_t(h) = 1\text{~if~}h>74\text{in,~} 0\text{~otherwise.}
When using an indicator function, the probability estimate looks like a sample mean:
\hat{p}_t = \frac{1}{N}\sum_{k=1}^N \mathbf{1}_t\left(h_k\right)
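A minimal sketch of this counting estimate, again under the assumed normal population model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed normal population model, as in the earlier sketches.
MU, SIGMA, N = 66.0, 4.4, 50_000
THRESHOLD = 74.0               # tail threshold, in inches

h = rng.normal(MU, SIGMA, N)   # draw heights directly from the population model
tall = h > THRESHOLD           # indicator: 1 if taller than 74 in, 0 otherwise

p_hat = tall.mean()            # \hat{p}_t: fraction of samples in the tail
print(f"NS estimate: {p_hat:.4f}")
print(f"non-zero samples: {tall.sum()}")
```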
When estimating a small probability, most samples have zero weight.
A good sampling distribution should minimize zero-weight samples.
| Quantity      | Value      |
|---------------|------------|
| N             | 50k        |
| NS estimate   | 0.0037     |
| NS variance   | 19.5476    |
| IS estimate   | 0.0039     |
| IS variance   | 1.6495e-05 |
| Sampling gain | 1.1851e+06 |
The sampling gain is associated with the difference in non-zero samples:
|                     | Count | Fraction |
|---------------------|-------|----------|
| NS non-zero samples | 183   | 0.366%   |
| IS non-zero samples | 40945 | 81.9%    |
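A sketch of the comparison using a mean-translated sampling distribution; the normal population model and the 8 in shift are illustrative assumptions, so the numbers will not match the table exactly.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Assumed normal population model; an 8 in mean translation pushes most of the
# sampling distribution past the 74 in threshold.
MU, SIGMA, N = 66.0, 4.4, 50_000
THRESHOLD, SHIFT = 74.0, 8.0

p = norm(MU, SIGMA)              # original density p(h)
g = norm(MU + SHIFT, SIGMA)      # mean-translated sampling density g(h)

h = g.rvs(N, random_state=rng)
w = p.pdf(h) / g.pdf(h)          # importance weights
weighted = (h > THRESHOLD) * w   # indicator times weight; zero below the threshold

nonzero = np.count_nonzero(weighted)
print(f"IS estimate: {weighted.mean():.6f}")
print(f"IS variance: {weighted.var():.3e}")
print(f"IS non-zero samples: {nonzero} ({100 * nonzero / N:.1f}%)")
```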
Optimal sampling makes every weighted sample equal to the constant value p_t. Setting the sample weight w(h)=p(h)/g(h) = p_t on the tail and solving for g gives the optimal solution
g_H^{\star}(h) = \mathbf{1}_t(h)\, \frac{p_H(h)}{p_t}
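As a quick check, this g^{\star}_H integrates to one, since the indicator restricts the integral to the tail:
\int g_H^{\star}(h)\,dh = \frac{1}{p_t}\int \mathbf{1}_t(h)\, p_H(h)\, dh = \frac{p_t}{p_t} = 1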
Since the optimal sampling distribution is just the original distribution restricted to the tail and rescaled, a practical strategy is to alter the original distribution:
| Method           | Sampling distribution                        | Description                |
|------------------|----------------------------------------------|----------------------------|
| Mean translation | g(h) = p(h-C)                                | Shift the mean by C        |
| Scaling          | g(h) = \frac{1}{C}p\left(\frac{h}{C}\right)  | Scale by a factor C > 1    |
These methods are simple and general, but suboptimal.
Returning to our height example, there is some gain when estimating \Pr(H>75) using mean translation, and a similar gain using the scaling method.
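A sketch of both alterations applied to the assumed normal model; the constants C below are illustrative choices, not the ones behind the gains reported above.

```python
import numpy as np
from scipy.stats import norm

# Assumed normal population density p(h), as in the earlier sketches.
MU, SIGMA = 66.0, 4.4
p = norm(MU, SIGMA).pdf

def mean_translation(C):
    """g(h) = p(h - C): shift the mean by C."""
    def sample(rng, n):
        return rng.normal(MU + C, SIGMA, n)
    def weight(h):
        return p(h) / p(h - C)
    return sample, weight

def scaling(C):
    """g(h) = (1/C) p(h/C): scale the distribution by a factor C > 1."""
    def sample(rng, n):
        return C * rng.normal(MU, SIGMA, n)
    def weight(h):
        return p(h) / (p(h / C) / C)
    return sample, weight

# Estimate Pr(H > 75) with each method; C values are chosen so the bulk of the
# sampling distribution sits near the threshold.
rng = np.random.default_rng(0)
for name, (sample, weight) in [("mean translation", mean_translation(9.0)),
                               ("scaling", scaling(1.14))]:
    h = sample(rng, 50_000)
    estimate = np.mean((h > 75.0) * weight(h))
    print(f"{name}: {estimate:.6f}")
```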
The sampling distribution doesn’t have to stay fixed. It can change from one sample to the next, even on every single sample if we want.
Suppose g_k(h) is the sampling distribution used for the k^{\text{th}} sample.
Then the sample weights are
w_k\left(h_k\right) = \frac{p\left(h_k\right)}{g_k\left(h_k\right)}
This opens the door to adaptive sampling.
Let’s try mean translation. For each sample, randomly choose a bias parameter between 3 and 10.
We’re constantly changing the sampling distribution, but that doesn’t matter since the weight function takes care of it.
We see a sampling gain of 37.5, close to the peak gain.
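A minimal sketch of this adaptive scheme under the assumed normal model, targeting \Pr(H>75); the exact gain depends on the model and threshold.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Assumed normal population model; the bias range 3-10 follows the text.
MU, SIGMA, N = 66.0, 4.4, 50_000
THRESHOLD = 75.0

p = norm(MU, SIGMA)

C = rng.uniform(3.0, 10.0, N)               # fresh bias parameter for every sample
h = rng.normal(MU + C, SIGMA)               # sample k is drawn from g_k(h) = p(h - C_k)
w = p.pdf(h) / norm.pdf(h, MU + C, SIGMA)   # per-sample weights w_k = p(h_k)/g_k(h_k)

weighted = (h > THRESHOLD) * w
print(f"adaptive IS estimate: {weighted.mean():.6f}")
print(f"variance of weighted samples: {weighted.var():.3e}")
```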
There are methods, such as the VEGAS algorithm, that try to approximate the optimal sampling distribution. The general idea is to divide the domain into a grid of intervals and iteratively adjust the interval widths so that more samples are drawn where the integrand contributes the most.
Under the right conditions, the VEGAS algorithm converges to optimal sampling.
How the VEGAS grid works:
The probability density at a sample point falling in grid interval i is p_S = \left(\frac{1}{N}\right)\left(\frac{1}{\Delta x_i}\right), where N is the number of grid intervals and \Delta x_i is the width of interval i,
so thinner intervals produce higher-probability samples.
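A minimal sketch of drawing sample points from a VEGAS-style grid; the grid edges below are arbitrary placeholders, and a full implementation would also include the iterative step that reshapes the grid.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative grid over the height range; a real VEGAS run would adapt these
# edges so that more (thinner) intervals cover the important region.
edges = np.array([60.0, 70.0, 73.0, 74.5, 75.5, 78.0, 90.0])
N_BINS = len(edges) - 1          # N grid intervals, each chosen with probability 1/N
widths = np.diff(edges)          # interval widths Delta x_i

def sample_from_grid(n):
    """Pick an interval uniformly at random, then a point uniformly inside it."""
    i = rng.integers(N_BINS, size=n)            # interval index
    x = edges[i] + widths[i] * rng.random(n)    # uniform within interval i
    density = 1.0 / (N_BINS * widths[i])        # p_S = (1/N)(1/Delta x_i)
    return x, density

x, g = sample_from_grid(5)
print(np.column_stack([x, g]))   # thinner intervals yield higher densities
```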