
Stochastic search in synthesis

Markov process

A probabilistic process over a finite set of states $\chi$ where, at each step, the probability of transitioning from a state $x$ to a state $y$ is given by a transition matrix $K(x,y): \chi \times \chi \to \mathbb{R}$ satisfying $\forall x,y.\ 0 \leq K(x,y) \leq 1$ and $\forall x.\ \sum_{y \in \chi} K(x,y) = 1$.
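As a minimal illustration, here is a sketch assuming a hypothetical 3-state chain with hand-picked probabilities: $K$ is just a row-stochastic matrix, and one step of the process samples the next state from the current state's row.

```python
import numpy as np

# Hypothetical 3-state transition matrix: K[x, y] = P(next = y | current = x).
# Every entry lies in [0, 1] and each row sums to 1.
K = np.array([
    [0.5, 0.3, 0.2],
    [0.1, 0.6, 0.3],
    [0.2, 0.2, 0.6],
])

rng = np.random.default_rng(0)

def step(x):
    """Take one step of the Markov process from state x."""
    return rng.choice(len(K), p=K[x])
```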

Markov chain

A sequence of states $x_0, x_1, x_2, \ldots$ produced by a Markov process. The probability of the whole chain can be computed as the product of the probabilities of its transitions:

\[K(x,y) = P(x_1=y|x_0 = x)\] \[K(x,y) * K(y,z) = P(x_2=z, x_1=y|x_0 = x)\]

The probability that $x_2=z$ given $x_0=x$ is $\sum_{y \in \chi} K(x,y) * K(y, z)$, which is exactly matrix multiplication. Therefore $K^2(x,z)$ is the probability of transitioning from $x$ to $z$ in two steps, and in general $K^n(x,y)$ is the probability of transitioning from state $x$ to state $y$ in exactly $n$ steps.
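Continuing the hypothetical example above, a quick sketch checks both claims: the probability of a specific chain multiplies along its transitions, and `np.linalg.matrix_power` gives the $n$-step probabilities.

```python
import numpy as np

K = np.array([[0.5, 0.3, 0.2],    # same hypothetical K as above
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])

# Probability of the chain 0 -> 1 -> 2 given x_0 = 0: K(0,1) * K(1,2).
p_chain = K[0, 1] * K[1, 2]

# P(x_2 = 2 | x_0 = 0) sums over every intermediate state y ...
p_two_step = sum(K[0, y] * K[y, 2] for y in range(len(K)))

# ... which is exactly the (0, 2) entry of K^2.
assert np.isclose(p_two_step, np.linalg.matrix_power(K, 2)[0, 2])
```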

Stationary distribution

A stationary distribution $\pi(x)$ gives the probability that a Markov process will be at a particular state in the long run. It satisfies $\pi(y) = \sum_{x \in \chi} \pi(x) * K(x,y)$ (i.e., $\pi = \pi K$ with $\pi$ viewed as a row vector), so $\pi$ is a left eigenvector of $K$ with eigenvalue 1.

The fundamental theorem of Markov chains is a very important property of Markov processes. It states that if a Markov process satisfies:

  1. The Markov process is fully connected (irreducible), i.e., it is possible to reach every state from every other state. This requirement implies that $\forall x.\ \pi(x) > 0$.

  2. The Markov process is aperiodic, i.e., it does not cycle among a fixed set of states with a fixed period.

Then $\forall x, y \in \chi.\ \lim_{n \to \infty} K^n(x,y) = \pi(y)$. This is a very powerful statement, as it allows us to approximate $\pi$ by starting at an arbitrary state and running the Markov process for a long time.
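On the same hypothetical chain (which is fully connected and aperiodic), a short sketch verifies both characterizations of $\pi$: as the left eigenvector of $K$ for eigenvalue 1, and as the limit of the rows of $K^n$.

```python
import numpy as np

K = np.array([[0.5, 0.3, 0.2],    # same hypothetical K as above
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])

# Left eigenvector: solve pi K = pi via the eigenvectors of K.T.
vals, vecs = np.linalg.eig(K.T)
pi = np.real(vecs[:, np.argmin(np.abs(vals - 1))])
pi /= pi.sum()                    # normalize into a probability distribution

# Fundamental theorem: every row of K^n converges to pi.
Kn = np.linalg.matrix_power(K, 100)
assert np.allclose(Kn, pi)        # each row ~ pi, regardless of start state
```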

Metropolis-Hastings

The fundamental theorem of Markov chains suggests an approach to synthesizing programs. Let $\chi$ be the space of programs; we can engineer a $K$ such that $\pi(x)$ is high for good programs and low for bad ones. Then we can pick a random start state and simulate the Markov process for $n$ steps for a large $n$.

Metropolis-Hastings (MH) is the most popular algorithm for engineering a transition matrix $K$ with a desired stationary distribution $\pi$.

The Metropolis-Hastings algorithm consists of the following steps:

  1. Define the desired stationary distribution $\pi$.

  2. Define a proposal distribution $J(x, y)$ such that $J(x,y) > 0$ iff $J(y,x) > 0$. $J$ defines the probability of sampling the next state $y$ given the current state $x$. In practice, the choice of $J$ can have a big impact on how fast the algorithm converges.

  3. Define an acceptance ratio

\[A(x,y) = \frac{\pi(y) * J(y,x)}{\pi(x) * J(x,y)}\]
  4. Define the desired transition matrix
\[K(x, y)= \begin{cases}J(x, y) & \text { if } x \neq y \text { and } A(x, y) \geq 1 \\ J(x, y) * A(x, y) & \text { if } x \neq y \text { and } A(x, y)<1 \\ J(x, y)+\sum_{z: A(x, z)<1} J(x, z)(1-A(x, z)) & \text { if } x=y\end{cases}\]

The reason the $K$ computed above has the desired $\pi$ comes from the key observation that $K$ satisfies detailed balance with respect to $\pi$:

\[\pi(x) * K(x, y) = \pi(y) * K(y, x)\] \[\sum_x \pi(x) * K(x, y)=\sum_x \pi(y) * K(y, x)=\pi(y) \sum_x K(y, x)=\pi(y)\]

Therefore $\pi = \pi K$, i.e., $\pi$ is the stationary distribution of $K$.
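To see this concretely, here is a minimal numerical sketch, assuming a hypothetical 3-state space with a hand-picked target $\pi$ and a uniform proposal $J$; it builds $K$ from the three cases above and checks detailed balance and stationarity.

```python
import numpy as np

n = 3
pi = np.array([0.2, 0.3, 0.5])    # hypothetical target distribution
J = np.full((n, n), 1.0 / n)      # uniform proposal: J(x, y) = 1/3

# Acceptance ratio A(x, y) = pi(y) * J(y, x) / (pi(x) * J(x, y)).
A = (pi[None, :] * J.T) / (pi[:, None] * J)

# Off-diagonal cases: K(x, y) = J(x, y) if A >= 1, else J(x, y) * A(x, y).
K = np.where(A >= 1, J, J * A)

# Diagonal case: rejected proposal mass flows back into staying put.
# (A(x, x) = 1, so the condition A(x, z) < 1 never includes z = x.)
for x in range(n):
    K[x, x] = J[x, x] + sum(J[x, z] * (1 - A[x, z])
                            for z in range(n) if A[x, z] < 1)

assert np.allclose(K.sum(axis=1), 1)   # K is a valid transition matrix
D = pi[:, None] * K
assert np.allclose(D, D.T)             # detailed balance
assert np.allclose(pi @ K, pi)         # pi K = pi
```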

One advantage of the $K$ computed above is that it is relatively easy to sample from, using the following steps (a sketch of this loop follows the list):

  1. Given a state $x$, take a sample from $J(x,y)$.

  2. Compute $A(x,y)$.

  3. If $A(x,y) \geq 1$ (i.e., state $y$ is at least as good as $x$), then transition to $y$. Otherwise transition to $y$ with probability $A(x,y)$.
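Here is a minimal sketch of that loop, assuming hypothetical callables `pi` (a positive, possibly unnormalized score), `propose` (samples from $J(x,\cdot)$), and `j_prob` (evaluates $J$):

```python
import random

def metropolis_hastings(x0, pi, propose, j_prob, n_steps):
    """Simulate the chain defined by the MH transition matrix K."""
    x = x0
    for _ in range(n_steps):
        y = propose(x)                      # 1. sample a candidate from J(x, .)
        a = (pi(y) * j_prob(y, x)) / (pi(x) * j_prob(x, y))  # 2. acceptance ratio
        if a >= 1 or random.random() < a:   # 3. accept, or stay at x
            x = y
    return x
```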

Another advantage of this algorithm is that it works even if $\pi$ is not normalized, e.g., when $\pi(x)$ is a scoring function. This is because $\pi$ only appears in the ratio $\pi(y)/\pi(x)$.

MH for program synthesis

Applying the MH algorithm to program synthesis boils down to defining:

  1. a finite program space

  2. the desired $\pi$ such that $\pi > 0$ everywhere. Note that it is inefficient to simply assign every incorrect program the same small $\epsilon > 0$: under such a distribution, the search devolves into randomly generating programs and testing them against the examples.

  3. the proposal distribution / sampling probability $J$

There are several things to notice when defining them (a toy end-to-end sketch follows the list):

  1. Neither $J(x,y)$ nor $\pi(x)$ can be 0.

  2. $\pi$ should allow us to tell whether a program is getting closer to being correct.

  3. $J$ should give priority to programs whose behavior is similar to the current program, so that we do not lose what the search has already learned by jumping to a completely different program every time.
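Putting the pieces together, here is an entirely illustrative toy sketch (the linear-function program space and all names are assumptions, not from any particular synthesizer): programs are $f(x) = a x + b$ over a small integer range, $\pi$ is an unnormalized score that is never 0 and grows with the number of passing examples, and $J$ makes small symmetric edits so proposals stay behaviorally close to the current program (and the $J$ terms cancel in $A$).

```python
import math
import random

EXAMPLES = [(0, 3), (1, 5), (2, 7)]   # input-output examples; target f(x) = 2x + 3
COEFFS = range(-10, 11)               # finite program space: (a, b) pairs

def run(prog, x):
    a, b = prog
    return a * x + b

def score(prog):
    """Unnormalized pi: strictly positive, rises as more examples pass."""
    passing = sum(run(prog, x) == y for x, y in EXAMPLES)
    return math.exp(passing)

def propose(prog):
    """Symmetric J: tweak one coefficient by +-1 (stay put at the boundary)."""
    prog = list(prog)
    i = random.randrange(2)
    new = prog[i] + random.choice([-1, 1])
    if new in COEFFS:
        prog[i] = new
    return tuple(prog)

def synthesize(n_steps=100_000):
    x = (random.choice(COEFFS), random.choice(COEFFS))   # random start program
    for _ in range(n_steps):
        y = propose(x)
        a = score(y) / score(x)            # J is symmetric, so it cancels in A
        if a >= 1 or random.random() < a:
            x = y
        if all(run(x, i) == o for i, o in EXAMPLES):
            return x                       # found a program passing all examples
    return x

print(synthesize())                        # typically prints (2, 3)
```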
