Importance Sampling Tutorial

Part II: Application

Chris Winstead

July, 2020

“Simple” Example: Average Height

Naive Sampling

A population has the height distribution shown below. We estimate the average height by the sample mean,

\overline{h} = \frac{1}{N}\sum_{k=1}^{N} h_k

Distribution of heights.
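As a rough sketch of the naive estimator, the following Python snippet draws N samples directly from an assumed normal height model; the parameters MU and SIGMA are illustrative stand-ins for the distribution shown in the figure, not the tutorial's actual data.

    import numpy as np

    rng = np.random.default_rng(0)

    # Illustrative stand-in for the population height distribution;
    # the tutorial's actual distribution is the one shown in the figure.
    MU, SIGMA = 66.3, 4.4      # assumed mean and standard deviation, in inches
    N = 50_000

    # Naive sampling (NS): draw directly from the population distribution
    h = rng.normal(MU, SIGMA, size=N)

    ns_estimate = h.mean()         # sample mean of the heights
    ns_variance = h.var(ddof=1)    # per-sample variance

    print(f"NS estimate: {ns_estimate:.4f}")
    print(f"NS variance: {ns_variance:.4f}")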

Naive Estimation Results

Expectation:  66.2873
N:            50k
NS estimate:  66.2287
NS variance:  19.5166

Importance Sampling

To use importance sampling, we can slightly shift the original distribution to something more favorable.

Distribution of heights.
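A minimal sketch of the shifted-distribution estimator, again under the illustrative normal model and with a hypothetical shift of 0.5 inches; the weight function corrects for the change of distribution, so the estimate remains unbiased.

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)

    MU, SIGMA = 66.3, 4.4      # assumed population parameters (illustrative)
    SHIFT = 0.5                # hypothetical small shift toward larger heights
    N = 50_000

    # Draw from the shifted sampling distribution g(h) = p(h - SHIFT)
    h = rng.normal(MU + SHIFT, SIGMA, size=N)

    # Sample weights w(h) = p(h) / g(h)
    w = norm.pdf(h, MU, SIGMA) / norm.pdf(h, MU + SHIFT, SIGMA)

    weighted = h * w                   # weighted samples
    is_estimate = weighted.mean()
    is_variance = weighted.var(ddof=1)

    print(f"IS estimate: {is_estimate:.4f}")
    print(f"IS variance: {is_variance:.4f}")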

Importance Estimation Results

Expectation:    66.2275
N:              50k
NS estimate:    66.2873
NS variance:    19.5166
IS estimate:    66.2734
IS variance:    9.7964
Sampling gain:  1.9922

“Importance” and Sample Weight

One interpretation is that IS reduces the number of low-weight samples. At the same time, we want to avoid creating outlying high-weight samples that increase the variance.

Distribution of heights.

The Optimal Sampling Distribution

Recall the optimal sampling distribution, g^{\star}_H(h) = h\,p_H(h)/\mu_H.
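To see why this choice is optimal, note that every weighted sample becomes a constant:

w(h)\,h = \frac{p_H(h)}{g^{\star}_H(h)}\,h = \frac{\mu_H}{h}\,h = \mu_H

so each weighted sample returns exactly the mean, and the estimator has zero variance.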

In this example, the optimal distribution is very close to the original.

Optimal sampling distribution vs original distribution.

Weights in the Optimal Distribution

With optimal importance sampling, all the weighted samples are concentrated around the mean.

Sample weight distribution

Importance Sampling in Probability Estimation

Measuring a Probability

This is where Importance Sampling really shines.

Say we want to measure a small tail probability:

p_t=\Pr\left(H > 74\text{in}\right)

To estimate this, we count how many people are taller than 74in.

An indicator function returns a binary value indicating if the condition is met:

\mathbf{1}_t(h) = 1\text{~if~}h>74\text{in,~} 0\text{~otherwise.}

\mathbf{1}_t(h) indicates people who are taller than the blue line.

Probability is a Sample Mean

When using an indicator function, the probability estimate looks like a sample mean:

\hat{p}_t = \frac{1}{N}\sum_{k=1}^N \mathbf{1}_t\left(h_k\right)
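A minimal sketch of this indicator-based estimate, under the same illustrative normal model used above:

    import numpy as np

    rng = np.random.default_rng(0)

    MU, SIGMA = 66.3, 4.4      # assumed population parameters (illustrative)
    THRESHOLD = 74.0           # tail threshold, in inches
    N = 50_000

    h = rng.normal(MU, SIGMA, size=N)

    # Indicator function: 1 if taller than the threshold, 0 otherwise
    indicator = (h > THRESHOLD).astype(float)

    # The probability estimate is the sample mean of the indicator
    p_hat = indicator.mean()
    print(f"NS estimate of Pr(H > 74in): {p_hat:.4f}")
    print(f"Non-zero samples: {int(indicator.sum())} ({100 * p_hat:.3f}%)")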

Inefficient Sampling

When estimating a small probability, most samples have zero weight.

Naive sampling distribution, highlighting the tail probability region.

Importance Sampling with Indicator Functions

A good sampling distribution should minimize zero-weight samples.

Importance sampling distribution.
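A sketch of the same estimate with a biased sampling distribution: a hypothetical mean translation that centers the samples near the 74-inch threshold, with the weight function restoring unbiasedness.

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)

    MU, SIGMA = 66.3, 4.4      # assumed population parameters (illustrative)
    THRESHOLD = 74.0
    SHIFT = 8.0                # hypothetical bias pushing samples into the tail
    N = 50_000

    # Sample from g(h) = p(h - SHIFT), so most samples land above the threshold
    h = rng.normal(MU + SHIFT, SIGMA, size=N)

    # Weighted indicator samples: 1_t(h) * p(h) / g(h)
    w = norm.pdf(h, MU, SIGMA) / norm.pdf(h, MU + SHIFT, SIGMA)
    weighted = (h > THRESHOLD) * w

    print(f"IS estimate: {weighted.mean():.6f}")
    print(f"IS variance: {weighted.var(ddof=1):.3e}")
    print(f"Non-zero samples: {int((h > THRESHOLD).sum())}")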

Example Results

N:              50k
NS estimate:    0.0037
NS variance:    19.5476
IS estimate:    0.0039
IS variance:    1.6495e-05
Sampling gain:  1.1851e+06

The sampling gain is associated with the difference in non-zero samples:

NS non-zero samples: 183 (0.366%)
IS non-zero samples: 40945 (81.9%)

Where are the Good Samples?

Distributions and weight function.

Optimal Distribution for Probability Estimation

Optimal sampling makes every weighted sample equal to the constant p_t: the sample weight is w(h)=p(h)/g(h) = p_t wherever the indicator is nonzero. The optimal solution is

g_H^{\star}(h) = \mathbf{1}_t(h)\, \frac{p_H(h)}{p_t}

Optimal sampling distribution
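As a consistency check (not a practical recipe, since constructing g^{\star} requires the very probability we want), the following sketch samples from the assumed normal density truncated to the tail and confirms that every weighted sample equals p_t:

    import numpy as np
    from scipy.stats import norm, truncnorm

    rng = np.random.default_rng(0)

    MU, SIGMA = 66.3, 4.4      # assumed population parameters (illustrative)
    THRESHOLD = 74.0
    N = 50_000

    # g*(h) is the original density truncated to h > THRESHOLD
    a = (THRESHOLD - MU) / SIGMA                  # standardized lower bound
    h = truncnorm.rvs(a, np.inf, loc=MU, scale=SIGMA, size=N, random_state=rng)

    # Every weighted sample equals p(h) / g*(h) = p_t, so the variance is zero
    weighted = norm.pdf(h, MU, SIGMA) / truncnorm.pdf(h, a, np.inf, loc=MU, scale=SIGMA)

    print(f"estimate: {weighted.mean():.6f}  (true tail probability: {norm.sf(THRESHOLD, MU, SIGMA):.6f})")
    print(f"variance: {weighted.var(ddof=1):.3e}")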

Generalized Strategies

Translation and Scaling

Since the optimal sampling distribution is proportional to the original distribution over the region of interest, we can approximate it with simple alterations of the original distribution:

Mean translation: g(h) = p(h-C), which shifts the mean by a constant C.
Scaling: g(h) = \frac{1}{C}p\left(\frac{h}{C}\right), which scales by a factor C>1.

These methods are simple and general, but suboptimal.
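A small sketch of both alterations as sampling densities with their weight functions, again under the illustrative normal model (the function names and parameter values are placeholders):

    import numpy as np
    from scipy.stats import norm

    MU, SIGMA = 66.3, 4.4      # assumed population parameters (illustrative)

    def p(h):
        """Original (nominal) density."""
        return norm.pdf(h, MU, SIGMA)

    def translated(h, C):
        """Mean translation: g(h) = p(h - C)."""
        g = p(h - C)
        return g, p(h) / g     # density and the corresponding weight

    def scaled(h, C):
        """Scaling: g(h) = (1/C) p(h/C), with C > 1."""
        g = p(h / C) / C
        return g, p(h) / g

    h = np.linspace(66.0, 86.0, 6)
    print(translated(h, C=5.0)[1])     # weights under a 5-inch translation
    print(scaled(h, C=1.1)[1])         # weights under a 10% scaling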

Mean Translation Experiment

Returning to our height example, there is some gain when estimating \Pr(H>75) using mean translation:

Sampling Gain for \Pr(H>75) using mean translation.

Scaling Experiment

There is similar gain using the scaling method:

Sampling Gain using the scaling method.

Multiple Importance Sampling (MIS)

The sampling distribution doesn’t have to be fixed: it can change from one sample to the next, even on every single sample.

Suppose g_k(h) is the sample distribution used for the k^{\text{th}} sample.

Then the sample weights are

w_k\left(h_k\right) = \frac{p\left(h_k\right)}{g_k\left(h_k\right)}

This opens the door to adaptive sampling.

MIS Experiment

Let’s try Mean Translation. For each sample, randomly choose a bias parameter between 3 and 10.

We’re constantly changing the sampling distribution, but that doesn’t matter: the per-sample weight function accounts for it.
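A sketch of this experiment, keeping the illustrative normal model and the 75-inch threshold from the mean-translation experiment; each sample draws its own bias parameter uniformly from [3, 10].

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)

    MU, SIGMA = 66.3, 4.4      # assumed population parameters (illustrative)
    THRESHOLD = 75.0
    N = 50_000

    # Each sample gets its own bias parameter, drawn uniformly from [3, 10]
    C = rng.uniform(3.0, 10.0, size=N)

    # The k-th sample is drawn from its own distribution g_k(h) = p(h - C_k)
    h = rng.normal(MU + C, SIGMA)

    # Per-sample weights w_k(h_k) = p(h_k) / g_k(h_k)
    w = norm.pdf(h, MU, SIGMA) / norm.pdf(h, MU + C, SIGMA)
    weighted = (h > THRESHOLD) * w

    print(f"MIS estimate: {weighted.mean():.6f}")
    print(f"MIS variance: {weighted.var(ddof=1):.3e}")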

We see a sampling gain of 37.5, close to the peak gain.

MIS with random bias in the highlighted domain.

Adaptive Importance Sampling

There are methods, such as the VEGAS algorithm, which try to approximate the optimal sampling distribution. The general idea is to draw samples from a piecewise (grid-based) approximation of the distribution, then refine the grid using the weighted samples.

Under the right conditions, the VEGAS algorithm converges to optimal sampling.

Stratified Sampling

How the VEGAS grid works:

  1. Divide the domain into N intervals of size \Delta x_i.
  2. For each sample, randomly choose one interval with probability 1/N.
  3. Within the interval, choose a uniformly distributed sample point.

The probability density of the specific sample point is p_S = \left(\frac{1}{N}\right)\left(\frac{1}{\Delta x_i}\right),

so thinner intervals give their samples a higher sampling density.
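A minimal sketch of drawing from such a grid, with a hypothetical fixed set of interval edges (in a real VEGAS run, these edges are what the passes below adjust):

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical grid over the height domain [50, 90]; in VEGAS the edges
    # are refined after each pass so important regions get thin intervals.
    edges = np.linspace(50.0, 90.0, 11)      # 10 intervals
    widths = np.diff(edges)
    N_INTERVALS = len(widths)

    def draw_grid_sample():
        """Draw one stratified sample and return (h, sampling density)."""
        i = rng.integers(N_INTERVALS)                      # pick an interval w.p. 1/N
        h = rng.uniform(edges[i], edges[i + 1])            # uniform within interval i
        density = (1.0 / N_INTERVALS) * (1.0 / widths[i])  # p_S = (1/N)(1/Δx_i)
        return h, density

    for _ in range(5):
        h, g = draw_grid_sample()
        print(f"h = {h:6.2f}, sampling density = {g:.4f}")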

VEGAS Experiment (Pass 1)

VEGAS algorithm example.

VEGAS Experiment (Pass 2)

VEGAS algorithm example.

VEGAS Experiment (Pass 3)

VEGAS algorithm example.

VEGAS Experiment (Pass 4)

VEGAS algorithm example.