Proj. CRR Paper Reading: Optimal Speedup of Las Vegas Algorithms, Adaptive restart for stochastic synthesis

Published 2023-09-11 19:22:41. Author: 雪溯

Title: Adaptive restart for stochastic synthesis, PLDI 2021

Task

Distribute the compute budget among multiple runs of stochastic program synthesis to accelerate the search.
Here, stochastic program synthesis can be summarized as follows:
Given a set of <input, output> pairs and a set of operations (e.g., zeros(), ones(), addl, orq, shll), randomly generate a program tree that maps those inputs to the corresponding outputs.
For simplicity, this article only needs to generate simple programs with the following properties:

  1. Short programs (3-16 instructions) using a limited set of opcodes.
  2. The generated program is a straight line without loops, gotos, or branches.
  3. The generated program does not contain any I/O or memory-related operations.
  4. The generated program is deterministic and free of side effects, i.e., it contains no operation node such as rand().
    Clearly, for one program synthesis question (i.e., a set of input-output pairs), there can be multiple correct answers.
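As a toy illustration of the task (not the paper's algorithm), the following sketch randomly samples one-operation programs over a small, made-up opcode set until one maps every input pair to its output:

```python
import random

# Hypothetical toy opcode set; the paper uses x86-64 opcodes such as addl, orq, shll.
OPS = {
    "and": lambda a, b: a & b,
    "or":  lambda a, b: a | b,
    "xor": lambda a, b: a ^ b,
    "add": lambda a, b: (a + b) & 0xFF,
}

def synthesize(pairs, max_tries=100_000, seed=0):
    """Randomly sample one-operation programs until one maps every
    input pair to its expected output; return the opcode's name."""
    rng = random.Random(seed)
    for _ in range(max_tries):
        op = rng.choice(sorted(OPS))
        if all(OPS[op](a, b) == out for (a, b), out in pairs):
            return op
    return None

# (3,5)->6, (1,1)->0, (7,2)->5 is satisfied only by xor in this opcode set.
print(synthesize([((3, 5), 6), ((1, 1), 0), ((7, 2), 5)]))  # xor
```

Real synthesis searches over multi-instruction trees guided by a cost function, but the Las Vegas character is the same: correct on termination, with a random running time.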

Assumption

  1. A series of plateaus appears during stochastic search, where the cost of the candidate program barely changes and the search appears to stagnate.
    The restart strategy can help the search jump out of those plateaus.
  2. The distribution of time spent is usually heavy-tailed, consistent with the observation that the search gets slower and slower as it progresses.
    The restart strategy slows overall progress on non-heavy-tailed time distributions; since synthesis run times are heavy-tailed, the strategy suits program synthesis tasks.

Contribution

  1. Use the geometric distribution and a first-order Markov chain to model and analyze the plateaus.
  2. A new stochastic synthesis algorithm with an adaptive restart strategy.
  3. A new benchmark built from binaries in the Ubuntu 16.04 release.

Method

A stochastic program synthesizer with Luby-sequence restarts remembers candidate programs and their costs so that more running time is allocated to the cheaper candidates.
Based on:

  • Luby Sequence
  • Stochastic Superoptimization, ASPLOS 2013
  • restart on SAT problem

Background:

Luby Sequence

Paper: Optimal Speedup of Las Vegas Algorithms (1993)
Task: a strategy to distribute running time among repeated runs of a Las Vegas search to accelerate it
Solution: use the cutoff sequence {1, 1, 2, 1, 1, 2, 4, 1, 1, 2, 1, 1, 2, 4, 8, ...}
Task details:
Let A be a Las Vegas algorithm, i.e., A is a randomized algorithm that always produces the correct answer when it stops but whose running time is a random variable.
We consider the problem of minimizing the expected time required to obtain an answer from A using strategies that simulate A as follows: given a cutoff sequence {t1, t2, ..., tn}, run the algorithm for time t1; if no answer is found, restart and run it for time t2, and so on, until some run finds an answer.
Contribution:

  1. For a known distribution p, the article proves that a fixed-cutoff strategy {t*, t*, ...} is optimal, where l(t*) = inf_{t < \infty} l(t), l(t) = \frac{t - \sum_{t' < t} q(t')}{q(t)}, q(t) is the probability that a run halts within time t, and t* >= tmin, the minimum possible running time. The strategy remains optimal even if a run is allowed to be restarted after being paused.
  2. For an unknown distribution p, a close-to-optimal strategy is Suniv = (1, 1, 2, 1, 1, 2, 4, 1, 1, 2, 1, 1, 2, 4, 8, ...). The article proves that the expected running time of Suniv under p is at most 192 l_p (log(l_p) + 5), where l_p is the optimal expected time for p; this upper bound is within a logarithmic factor of the lower bound that holds for any strategy that does not know p.
    Used in this article
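The universal sequence above can be generated from its standard recursive definition; a minimal sketch:

```python
def luby(i):
    """The i-th term (1-indexed) of the Luby universal restart sequence."""
    k = 1
    while (1 << k) - 1 < i:
        k += 1
    if (1 << k) - 1 == i:                 # i == 2^k - 1: the new peak 2^(k-1)
        return 1 << (k - 1)
    return luby(i - (1 << (k - 1)) + 1)   # otherwise repeat the earlier prefix

print([luby(i) for i in range(1, 16)])
# [1, 1, 2, 1, 1, 2, 4, 1, 1, 2, 1, 1, 2, 4, 8]
```

Each block of the sequence repeats everything before it and then appends the next power of two, so every cutoff 2^j is tried infinitely often with frequency roughly proportional to 2^-j.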

Restart on SAT problem

The SAT problem is vital in many fields, including symbolic execution, program synthesis, and hybrid fuzzing. Since SAT is NP-complete, its search algorithms must record search state and may still fail to finish in time; their restart techniques are also valuable for fuzzing.
Restarts on the SAT problem: after running for ti time units, or after collecting ti conflicts, the algorithm restarts: it discards the learned variable assignments (and, in the simplest form, the learned clauses) and backtracks to the root (level 0).
DPLL: a complete depth-first search for the SAT problem. (as easy as you can imagine)
The problem that comes along with DPLL: how do we use the information carried by failures?
CDCL (conflict-driven clause learning) learns from failure by contraposition: if A implies B, then not B implies not A.
Algorithm:

  1. Select a variable and assign True or False. This is called the decision state. Remember the assignment.
  2. Apply Boolean constraint propagation (unit propagation).
  3. Build the implication graph.
  4. If there is any conflict
    a) Find the cut in the implication graph that led to the conflict
    b) Derive a new clause which is the negation of the assignments that led to the conflict
    c) Non-chronologically backtrack ("back jump") to the appropriate decision level, where the first-assigned variable involved in the conflict was assigned
  5. Otherwise, continue from step 1 until all variables are assigned.
    Restarts on CDCL: restart after running for ti time units or after collecting ti conflicts.
    A Verified SAT Solver Framework with Learn, Forget, Restart, and Incrementality, 2018
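Step 2 above (Boolean constraint propagation) can be sketched as follows; clauses are lists of signed integers in the usual DIMACS style. This is an illustrative sketch, not a full CDCL solver:

```python
def unit_propagate(clauses, assignment):
    """Boolean constraint propagation: repeatedly assign the literal of any
    unit clause; return (assignment, None) on success or
    (assignment, conflict_clause) when a clause is falsified."""
    assignment = dict(assignment)           # map: variable -> bool
    changed = True
    while changed:
        changed = False
        for clause in clauses:
            unassigned, satisfied = [], False
            for lit in clause:
                var, val = abs(lit), lit > 0
                if var not in assignment:
                    unassigned.append(lit)
                elif assignment[var] == val:
                    satisfied = True        # clause already true
                    break
            if satisfied:
                continue
            if not unassigned:
                return assignment, clause   # conflict: clause falsified
            if len(unassigned) == 1:        # unit clause: forced assignment
                lit = unassigned[0]
                assignment[abs(lit)] = lit > 0
                changed = True
    return assignment, None

print(unit_propagate([[1], [-1, 2], [-2, 3]], {}))  # all forced true, no conflict
```

In real CDCL the propagation is driven by watched literals and feeds the implication graph of step 3; here the conflict clause stands in for that graph.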

Stochastic Superoptimization

Task: superoptimize loop-free binary code
Method: the article uses stochastic search driven by Markov-chain Monte Carlo sampling. The cost function combines the competing constraints of transformation correctness and performance improvement.
Note: the article sacrifices completeness to explore the program space more efficiently.
Tool: STOKE
Walk:

  1. Opcode: replace the selected opcode with a random opcode
  2. Operand: replace a selected operand with a random operand
  3. Swap: interchange two lines of code
  4. Instruction: replace a selected instruction with "UNUSED" or an unconstrained random instruction
Cost minimization proceeds in two phases:
  1. Synthesis: focus on correctness
  2. Optimization: focus on finding the fastest program
Adaptive Restarts only uses the Markov chain to analyze the behavior of plateaus.
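The walk moves can be sketched on a straight-line program modeled as a plain list of opcode strings (operands omitted for brevity; all names are illustrative):

```python
import random

def mutate(prog, opcodes, rng):
    """Apply one random STOKE-style move (opcode, swap, or instruction)
    to a straight-line program represented as a list of opcode strings."""
    prog = list(prog)                       # keep the original untouched
    kind = rng.choice(["opcode", "swap", "instruction"])
    i = rng.randrange(len(prog))
    if kind == "opcode":                    # move 1: random opcode at position i
        prog[i] = rng.choice(opcodes)
    elif kind == "swap":                    # move 3: interchange two lines
        j = rng.randrange(len(prog))
        prog[i], prog[j] = prog[j], prog[i]
    else:                                   # move 4: UNUSED or a random instruction
        prog[i] = rng.choice(["UNUSED"] + opcodes)
    return prog

rng = random.Random(0)
print(mutate(["addl", "orq", "shll"], ["addl", "orq", "shll", "xorl"], rng))
```

Each move is a proposal in the MCMC sense: cheap to apply, reversible, and leaving the program length fixed so the walk stays inside one search space.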

Analysis on Plateaus:

Design-Stochastic Synthesis Program

The algorithm starts from (zero()). In each step, it keeps the current program p and proposes a candidate p' by applying one of the moves. If the cost of p' is lower than or equal to the cost of p, p' is accepted; otherwise, p' is accepted randomly with a probability determined by β and the cost increase of p'.
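A sketch of this acceptance rule, assuming the acceptance probability for a cost increase of delta bits is exp(-β · delta) (an assumption that reproduces the 5% / 0.25% figures quoted for β = 1):

```python
import math, random

def accept(cost_old, cost_new, beta):
    """Metropolis-style acceptance: always take improvements; take a
    worsening of delta bits with probability exp(-beta * delta)."""
    delta = cost_new - cost_old
    if delta <= 0:
        return True
    # equivalent to: random() < exp(-beta * delta)
    return -math.log(random.random()) / beta > delta
```

With β = 1, a candidate 3 bits worse is accepted with probability e^-3 ≈ 5%, and one 6 bits worse with probability e^-6 ≈ 0.25%.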
Moves:

  1. Instruction: randomly select an instruction node, or randomly select a slot of the root node. For the selected instruction, generate a random opcode, then fill the opcode's required parameters with constants or with existing nodes (chosen so that no cycle is created)
  2. Opcode: replace a selected opcode with a random one
  3. Operand: replace a selected operand with a random one
Cost function:
  1. Hamming distance between actual and expected outputs
  2. Number of incorrect test cases
  3. Log-difference: 1 + log2(|a - b|)
Acceptance: -β·ln(random(0,1)) follows an exponential distribution. When β = 1, the probability that the algorithm accepts a candidate program whose cost is 3 bits higher is about 5%, and 6 bits higher about 0.25%.
Simplest example: use only eight opcodes (and, or, xor, not, shr, shl, zero, ones); run multiple searches and trace the 35 most popular states to build a Markov chain.
Observations:
  1. A strongly connected component can cause a plateau: if the entire strongly connected component is collapsed into a single node of the Markov chain, that node evidently has a high self-loop probability.
  2. The restart strategy accelerates the search when there is a low-cost strongly connected component, i.e., when the run-time distribution is heavy-tailed.
  3. When β is large, the guidance of the cost function is weakened, and the algorithm behaves like a random walk.
Analysis:
Strongly connected components:
  • Single dominant strongly connected component: geometric distribution
  • Multiple dominant strongly connected components (one on each path):
    • Similar distributions: a sum of geometric distributions, which resembles a gamma distribution
    • Different distributions: resembles a normal distribution
Paths:
  • Only one key path: the restart strategy is actually detrimental.

Design

Design-Adaptive Restarts

The proposed method is equivalent to a Luby policy that tracks the cost function and allows runs to be paused and resumed.
The article turns the Luby sequence into a binary tree. Where the plain Luby sequence runs the ith execution for ti units in one shot, Adaptive Restarts runs the ith execution for 1 unit and then pauses it. In the next iteration it runs another unit, bringing the ith execution's total running time to 2 units; after the dth iteration the total is power(2, d) units. After each iteration, the algorithm checks the cost of the candidate program produced by each execution and prioritizes low-cost candidates by exchanging results between executions. In the dth iteration, power(2, d) new executions are added.
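The doubling schedule described above can be sketched as follows (cost-based reprioritization omitted); after iteration d, the executions started at iteration j have each received 2^(d-j) units, and the multiset of totals matches a prefix of the Luby sequence:

```python
def totals_after(d):
    """Total units consumed by each execution after iteration d (0-indexed),
    following the pause-and-double schedule described above."""
    times = []
    for it in range(d + 1):
        times = [t * 2 for t in times]  # resume: double every existing run's total
        times += [1] * (2 ** it)        # start 2^it fresh executions of 1 unit each
    return times

print(totals_after(2))  # [4, 2, 2, 1, 1, 1, 1] -- same multiset as Luby's {1,1,2,1,1,2,4}
```

This is what makes the policy "equivalent to Luby": the work distributed across runs is the same, only interleaved so that every run stays pausable and comparable by cost.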

Design-Construction of Benchmark:

Based on: Ubuntu 16.04 default packages, x86_64
Algorithm:

  1. Collect all related instructions for each live-out register r at the end of a code block. Remove all memory reads by replacing those instructions with reads from unused registers. About 187,000 program fragments are extracted.
  2. Remove redundant program fragments via instruction signatures. About 9,719 program fragments are retained.
  3. Select the program fragments that are amenable to synthesis using the following test: given an (n-1)-instruction prefix, synthesize the n-instruction prefix.
  4. Randomly choose 1,000 program fragments as the benchmark.

Experiment

Dataset: SyGuS and the self-constructed benchmark
Compared Strategies:

  1. Adaptive Restart
  2. Luby
  3. Plain (without any restart)
    Additional attention: the effect of β

Discussion

  1. Distribution-Restart: Since we cannot define a standard cost function in fuzzing for lack of expected outputs, suppose we have a target program position C; then generating a structured input that covers C can be viewed as a Las Vegas algorithm. Can we explain the synthesis time of test cases using the Markov model and the heavy-tailed distribution, and then derive a good jump-out restart strategy for test-case synthesis?
    • Synthesizing programs is essentially the same as synthesizing test cases. However, we still wonder whether code blocks in real programs can be described as a set of strongly connected Markov chains. For the most basic tree operations, if the transition matrix only considers adding a tree, removing a branch, or replacing a tree node with another tree, it might be possible. But considering all the operations needed in fuzzing, can a Markov chain still describe such a chaotic situation?
  2. Strongly connected component: When focusing on the transfer probabilities between different code positions, can we use the strongly connected component to explain the power distribution? Obviously, the distribution of test cases is deeply rooted in the distribution of code positions. If we have a method to explain the transfer between different code positions and the restart helps us jump out of a dilemma, we can also apply that to "Corpus Deletion."
  3. Cutoff sequence: We can apply the Luby sequence with/without some cost functions as a cutoff where the distribution is too complex. We can also use the expectation of the distribution(test case synthesis, code block transfer...) calculated or inferred from history as cutoffs.
    Unlike pure stochastic program synthesis, covering already-touched positions is still beneficial for fuzzing, since we need to create various contexts.
  4. Paper: it is important to understand in which situations the restart strategy can be used.
    • A simple distribution can be used to describe the restart strategy. (might be too similar to this article)
      • Can we describe restarting using choices among different paths
      • Can we describe restarting using the distribution before/after restarting
      • Use strongly connected components to compress the states and simplify the models

Questions remained

  1. Why does this article emphasize that "when β is large enough, the states in the Markov chain aggregate into a strongly connected component, and meanwhile the search algorithm behaves like the random walk algorithm"? β does not seem to have much effect on the acceptance rate, because the acceptance rate is only 0.25% when the cost increase is 6 bits and β = 1.
  2. Where is the proof that "the restart strategy can be detrimental when dealing with a non-heavy-tailed distribution"?
  3. Why does "restart on a heavy-tailed distribution" make sense? Consider a synthesis-problem state-transfer diagram (e.g., Fig. 5) whose x-axis (i.e., the cost of candidate programs) is divided into quarters. Suppose there is only one path in the first half of the x-axis and a strongly connected component with a long synthesis time yet a low cost at the 75% mark. In that case, the result is still a heavy-tailed distribution, and the restart should be beneficial, as stated before. However, the restart strategy is harmful for this problem, as it wastes search time and cannot move on to the next necessary path. Isn't restart helpful only when there are two different paths and the search gets stuck on a non-essential one (such as the upper-right corner in Fig. 5)?
  4. Unable to find the proof in the reference paper: "The distribution of −β · ln(random(0, 1)) is exponential and is motivated by MCMC theory."