Alex Sanchez-Stern, Abhishek Varghese, Zhanna Kaufman,

Dylan Zhang, Talia Ringer, Yuriy Brun

Automating Formal Verification with Reward-Free Reinforcement Learning

Formal Verification

writing proofs about programs!

more reliability than testing!

requires specialized expertise

labor intensive

Automating Formal Verification

Labor saving!

[Plot: number of proof synthesis tools over time, 2015 to 2025]

Proof Synthesis!

[Diagram: a theorem statement (∀, ∃, ⟹) goes to the Theorem Prover; a Predictor suggests proof steps, and Search explores the resulting tree of possible next steps]

Unbounded Search Space

There are some techniques that can help us prune the search tree

It's hard to explore when you don't know how close you are!

We can search more efficiently if we can evaluate proof states

forall n, n + 1 > n   (Medium)

forall n, False   (Hard)

n = n   (Easy)
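To make this concrete, here is a rough best-first search sketch in Python: `score` stands in for a learned proof-state evaluator, and every name and signature here is an illustrative assumption rather than the tool's actual interface.

```python
import heapq

def best_first_search(initial_state, candidate_tactics, apply_tactic, score, is_done, budget=1000):
    """Hypothetical best-first proof search guided by a state-scoring function.

    score(state) should be higher for states that look closer to a finished proof;
    apply_tactic(state, tactic) returns the resulting state, or None if the tactic fails.
    """
    # Max-heap via negated scores: always expand the most promising state first.
    # The counter breaks ties so states themselves never need to be comparable.
    frontier = [(-score(initial_state), 0, initial_state, [])]
    counter = 1
    while frontier and budget > 0:
        _, _, state, proof = heapq.heappop(frontier)
        budget -= 1
        if is_done(state):
            return proof  # the sequence of tactics that finished the proof
        for tactic in candidate_tactics(state):
            next_state = apply_tactic(state, tactic)
            if next_state is not None:
                heapq.heappush(frontier, (-score(next_state), counter, next_state, proof + [tactic]))
                counter += 1
    return None  # no proof found within the budget
```

Without a score, the frontier can only be expanded in a fixed depth- or breadth-first order, no matter how unpromising a branch looks.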

26% Shorter Proofs

in 27% Fewer Steps*

* Than a search without state scoring

[Diagram: the theorem statement, Theorem Prover, Predictor, and Search pipeline, repeated]

"Reward-free Reinforcement Learning"

In particular, V-learning

This Talk:

  • V-Learning
  • Limitations in Proofs
  • Adapting to Proofs

Classic V-Learning

[Diagram: example MDP with states S1 through S5 and actions A, B, C]

\(\gamma\) = time discount

finding rewards sooner is better!

V(S_1) = \text{max}\bigl(R_A + \gamma V(S_2),\ \ \gamma\bigl(R_B + \gamma V(S_4)\bigr),\ \ R_C + \gamma V(S_5)\bigr)

where R_A, R_B, and R_C stand for the rewards drawn along each branch in the diagram.

Generalizes to

V(S) = \text{max}_{a \in \text{actions}(S)}\left(R(S, a) + \gamma V(\text{next-state}(S, a))\right)

V-Learning in Practice: Iterative Updates
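A minimal sketch of what these iterative updates look like on a made-up toy MDP (the states, rewards, and transitions below are purely illustrative):

```python
# Toy iterative V-learning: repeatedly apply
#   V(S) = max_a ( R(S, a) + gamma * V(next-state(S, a)) )
# on a small hand-written MDP until the values settle.
GAMMA = 0.9

# transitions[state][action] = (reward, next_state); "DONE" is terminal.
transitions = {
    "S1": {"A": (0.0, "S2"), "B": (0.0, "S3"), "C": (0.0, "S5")},
    "S2": {"A": (1.0, "DONE")},
    "S3": {"A": (0.0, "S4")},
    "S4": {"A": (1.0, "DONE")},
    "S5": {"A": (0.0, "DONE")},
    "DONE": {},
}

V = {s: 0.0 for s in transitions}

for _ in range(100):
    for state, actions in transitions.items():
        if actions:
            V[state] = max(r + GAMMA * V[nxt] for r, nxt in actions.values())

print(V)  # states closer to a reward end up with higher values
```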

V-Learning in Proofs: The Sparse Reward Problem


[Diagram: in the standard RL abstraction, an Action takes one State to one next State; in reality, a proof Action can turn a Goal into several subgoals]

Adding extra rewards can lead to bad behavior

\forall x, P(x)

induction x

Base Case: P(0)

Inductive Case: \forall x, P(x) \implies P(x+1)

Reward-free doesn't have this problem!

What We Need

A new update equation that accounts for the branching structure of proofs

Assumptions:

  • The state of a completed proof has value 1

[Diagram: a straight-line proof, S1 → S2 → S3 → S4 → QED]

V(\text{QED}) = 1
V(S_4) = 0 + \text{max}([\gamma \cdot 1]) = \gamma
V(S_3) = 0 + \text{max}([\gamma(\gamma \cdot 1)]) = \gamma^2
V(S_2) = 0 + \text{max}([\gamma(\gamma(\gamma \cdot 1))]) = \gamma^3
V(S_1) = 0 + \text{max}([\gamma(\gamma(\gamma(\gamma \cdot 1)))]) = \gamma^4

V(S) = \gamma^{(\text{number of steps left})}

V(\text{QED}) = 1

Assumption: The state of a completed proof has value 1

[Diagram: goal G1 splits into two subgoals, G2 and G3]

V(G1) = ???

\(m\) = Steps to complete proof from G2, so V(G2) = \gamma^m

\(n\) = Steps to complete proof from G3, so V(G3) = \gamma^n

Steps to complete proof from G1 = \( m + n + 1 \)

V(G1) = \gamma^{m + n + 1} = \gamma \cdot \gamma^m \cdot \gamma^n = \gamma \cdot V(G2) \cdot V(G3)
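The same step-counting argument extends to any number of subgoals. As a short derivation under the slide's assumptions, if an action turns G into subgoals G_1, \ldots, G_k that need m_1, \ldots, m_k steps each, then

V(G) = \gamma^{1 + m_1 + \cdots + m_k} = \gamma \cdot \gamma^{m_1} \cdots \gamma^{m_k} = \gamma \prod_{i=1}^{k} V(G_i)

which is exactly the product form in the update equation below.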

Update Equation for Branch-Structured Proofs

V(G) = \text{max}_{a \in \text{actions}(G)}\left(\gamma\prod_{G' \in \text{next-states}(G, a)}V(G')\right)

There is no reward term R(S, a); instead, the state of a completed proof is anchored at V(\text{QED}) = 1.

[Diagram: an action that finishes a goal produces no new subgoals (∅)]

V(G) = \text{max}_{a \in \text{actions}(G)}\left(\gamma\prod_{G' \in \text{next-states}(G, a)}V(G')\right)

When an action completes a goal, \text{next-states}(G, a) = \emptyset, so the product over the empty set is 1 and the action is worth

\gamma\prod_{G' \in \emptyset}V(G') = \gamma(1)
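A minimal Python sketch of this branch-structured update on a made-up proof tree; the goal names and tactic strings are illustrative, not the paper's implementation:

```python
# Toy branch-structured V update:
#   V(G) = max_a ( gamma * prod_{G' in next-states(G, a)} V(G') )
GAMMA = 0.9

# children[goal][action] = list of subgoals produced by that action;
# an empty list means the action finishes the goal.
children = {
    "G1": {"induction": ["G2", "G3"]},   # G1 splits into a base case and an inductive case
    "G2": {"reflexivity": []},           # solved in one step
    "G3": {"simpl; auto": []},           # solved in one step
}

V = {g: 0.0 for g in children}

for _ in range(100):
    for goal, actions in children.items():
        best = 0.0
        for subgoals in actions.values():
            value = GAMMA
            for sub in subgoals:
                value *= V[sub]          # empty product = 1, so finishing a goal is worth gamma
            best = max(best, value)
        V[goal] = best

print(V)  # V["G2"] == V["G3"] == gamma, and V["G1"] == gamma ** 3
```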

What about useless subgoals?

Since state values are always < 1, generating more subgoals always has a cost, even if they are solved immediately.

[Diagram: state S1 has the single goal \forall x, P(x); state S2 additionally has the subgoals P(0) and \forall x, P(x) \implies P(x+1)]

V(S1) > V(S2)
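As a small worked consequence of the update equation (using only the slide's own assumptions): suppose one action produces just a useful subgoal G, and another produces the same G plus an extra subgoal G_u that can be closed in one step, so V(G_u) = \gamma. Since \gamma < 1,

\gamma \cdot V(G) \cdot V(G_u) = \gamma^2\, V(G) < \gamma\, V(G)

so the action that generates the extra subgoal is always scored strictly lower.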

What about useless subgoals?

But within a subgoal, the agent learns as if it were a whole proof!

[Diagram: subgoal states S3 and S4 within the proof of \forall x, P(x); S4 has no remaining subgoals (∅)]

V(S3) < V(S4)

26% Shorter Proofs

in 27% Fewer Steps

Benchmark: CoqGym

124 Coq Projects

68,501 Theorems

85/15 train-test split

Baseline: Proverbot9001 (updated)

QEDCartographer, except without state scoring-based search

Uses a variant of depth-first search instead

27% Faster than Proverbot9001

26% Shorter Proofs than Proverbot9001

Shorter proofs are:

  • More readable
  • More maintainable
  • Less expensive to check

Proves slightly more theorems than Proverbot9001, and proves theorems complementary to the ones Proverbot9001 proves

Automating Formal Verification with Reward-Free Reinforcement Learning

Uses a new V-value equation for branching goal structure

 

Makes producing verified-correct code easier and faster

 

Preprint available at alexsanchezstern.com