Playing with TLA+

For the uninitiated, TLA+ (Temporal Logic of Actions) is a formal methods language. At its core, it’s more like a mathematical specification tool than a programming language. PlusCal is a higher-level language that is more suited for programmers and defining distributed system interactions. PlusCal is compiled to TLA+ for model-checking.

I have been briefly exposed to TLA+ in the past, but never had a need for it in my own work. Recently, however, such a case did pop up and I had to re-learn the language all over again (including PlusCal). The specification I ended up writing didn’t catch all of the nuance of the issue. It ended up being equivalent to a truth table, but it was a fun exercise and I want to improve.

I want to be more comfortable with the framework and incorporate it into my workflow as part of the design process. The overwhelming cause of bugs I’ve seen have to do with concurrency. It just isn’t intuitive, regardless of experience. Unit tests don’t catch these issues. If you’re lucky, you’re able to simulate most concurrent interactions, but you still might miss a rare edge case.

For example, if you have an edge case that is expected to occur in 1% of randomized deployments, you could attribute it to some transient failure and move on. Consider that there are 40 deployments, containing the same logical bug, to multiple data centers daily. The probability of seeing at least one failure is now much higher. In fact,

\begin{align*} P(\text{At least one failure}) =&\ 1 - P(\text{No failures}) \\ =&\ 1 - 0.99^{40}\ \text{(assuming independence)} \\ \approx&\ 0.33 \\ \end{align*}

Now, that’s a 33% chance of a failure every day. It’s closer to an 86% chance to occur at least once during a workweek. So, we have motivation to learn!

I can’t share the problem from work, but I can choose from the pool of toy logic puzzles out there. I’ve opted to start with a few challenge problems in basic TLA+ without leveraging PlusCal since it all compiles down to TLA+. I want a solid foundation before continuing with the PlusCal journey¹. I almost reached for the wolf, goat and cabbage problem, but you’ll find many posts about it already! I’ve turned to another river crossing puzzle called the Bridge and torch problem.

An excerpt from Wikipedia:

Four people come to a river in the night. There is a narrow bridge, and it can only hold two people at a time. They have one torch and, because it’s night, the torch has to be used when crossing the bridge. Person A can cross the bridge in 1 minute, B in 2 minutes, C in 5 minutes, and D in 8 minutes. When two people cross the bridge together, they must move at the slower person’s pace. The question is, can they all get across the bridge if the torch lasts only 15 minutes?

Typically, you’d want to provide a specification which produces no errors. However, for solving these sorts of toy problems, a common approach is to demonstrate a proof by contradiction. This produces a full trace of a solution, which you can then reason about without having to devise a solution in advance.

We start with a basic module listing some required imports, constants imported by the model and variables. You can think of the constants as the configuration for the exact values described in the snippet above.

---- MODULE bridgetorchproblem ----
EXTENDS FiniteSets, Integers, Sequences, TLC

CONSTANTS
  People, \* Defined as { 1, 2, 5, 8 }
  TimeLimit \* Defined as 15

\* Some validation before the model checker is run, to ensure the constants are
\* as expected
ASSUME TimeLimit \in Int
ASSUME People \subseteq Int

VARIABLES
  start, \* Set of people at the start
  end, \* Set of people at the end
  torchStart, \* Whether the torch is at the start
  time \* The elapsed time

Next we define the initialization predicate. TLC creates branches that satisfy this predicate to start the execution. In this case, there is only one starting state, as we only use = (equality) instead of something like $x \in S$ (x \in S).

Init ==
  /\ start = People \* All people at the start
  /\ end = {} \* No one is at the end
  /\ time = 0 \* The elapsed time starts at zero
  /\ torchStart = TRUE \* The torch is also at the start

I’m first covering the Next action², which has two branching actions each with their own guard clauses. The guard clauses act as a short-circuit for not accepting following actions.

Next ==
  \/
    \* Guard clause #1:
    \* There are at least 2 people and the torch is at the start
    /\ torchStart = TRUE /\ Cardinality(start) >= 2
    /\ MoveEnd
  \/
    \* Guard clause #2:
    \* There is at least 1 person at the start and the torch is at the end
    \* (A reason to go back)
    /\ torchStart = FALSE /\ start # {}
    /\ MoveStart

You may have noticed I didn’t check that I have someone at the end for MoveStart. That’s because the torch can’t teleport to the end by itself.

For the first action, MoveEnd, I wrote a custom Max function, since it doesn’t seem to be available in the built-in TLA+ modules. For moving to the end, I pick any two distinct people and move them to the end. I add the time of the slowest person and mark the torch as not being at the start.

Max(S) == CHOOSE x \in S: \A y \in S: x >= y

MoveEnd ==
  \E first \in start:
    \E second \in (start \ {first}):
      /\ start' = start \ {first, second}
      /\ end' = end \union {first, second}
      /\ time' = time + Max({first, second})
      /\ torchStart' = FALSE

MoveStart, similarly, proceeds by selecting any one person to return and adding their walking duration to the total. You’ll notice that I don’t care who returns, slow or fast, I’m leaving the hard work for the computer.

MoveStart ==
  \E person \in end:
    /\ start' = start \union {person}
    /\ end' = end \ {person}
    /\ time' = time + person
    /\ torchStart' = TRUE

The following is simply my type-safety invariant. I write this immediately, as I’m working through the specification. This just saves time from using all of the new operators incorrectly or just making a mistake. You could accidentally put unexpected sets in sets, use the wrong data type for a given operand, etc.

Beyond that, you have the ability to further restrict the type to the smallest possible set given the context of the domain. The start and end variables are examples of this. If a new person (integer) suddenly joins the fray, I’ve definitely made a mistake.

TypeInvariant ==
  /\ time \in Int
  /\ torchStart \in BOOLEAN
  /\ start \subseteq People
  /\ end \subseteq People
  /\ start \union end = People \* Make sure all are accounted for in the system

I decided on a “bait” invariant meaning, “all people can’t get to the other side within the time limit”.

\* An invariant to determine a possible solution via proof by
\* contradiction.
Unsolvable == ~(time <= TimeLimit /\ end = People)

Originally, I had the contradiction as a liveness property. However, violated liveness properties are not captured by name in TLC – any violation halts with a trace and a generic exception.

\* Eventually, the elapsed time is greater than the time limit.
Unsolvable == <>(time > TimeLimit)

The final Spec action is as below. It indicates that a valid state graph:

Satisfies the Init predicate (definition).
Always satisfies either the Next action, or all variables (those in the sequence vars) remain unchanged.

\* I prefer putting this below the VARIABLES definition at the start
vars == <<start, end, torchStart, time>>

Spec == Init /\ [][Next]_vars

If you use the liveness property instead of the invariant, you’ll need to specify a form of fairness, otherwise the behaviour could stutter indefinitely. In short, you’ll need to add:

Spec == Init /\ [][Next]_vars /\ WF_vars(Next)

If you need a detailed explanation, this post on fairness is great.

Finally, this is the counter-example that is found. The trick is that you send over the two fastest people, return with either, send over the two slowest people, return with the fastest at the end, and finally send the two fastest. In other words, there are two counter-examples, but the model checker produced one, which is enough.

\* State 1
/\ time = 0
/\ end = {}
/\ torchStart = TRUE
/\ start = {1, 2, 5, 8}

\* State 2
/\ time = 2
/\ end = {1, 2}
/\ torchStart = FALSE
/\ start = {5, 8}

\* State 3
/\ time = 3
/\ end = {2}
/\ torchStart = TRUE
/\ start = {1, 5, 8}

\* State 4
/\ time = 11
/\ end = {2, 5, 8}
/\ torchStart = FALSE
/\ start = {1}

\* State 5
/\ time = 13
/\ end = {5, 8}
/\ torchStart = TRUE
/\ start = {1, 2}

\* State 6
/\ time = 15
/\ end = {1, 2, 5, 8}
/\ torchStart = FALSE
/\ start = {}

If you set the time limit to 14, for example, then you won’t find a counter-example and the problem is in fact unsolvable.

Finally, if you want to run this yourself, my configuration file (MCbridgetorchproblem.cfg):

SPECIFICATION
  Spec

CONSTANTS
  People <- ConstPeople
  TimeLimit = 15

INVARIANT
  TypeInvariant
  Unsolvable

CHECK_DEADLOCK
  FALSE

And my TLA model file (MCbridgetorchproblem.tla):

---- MODULE MCbridgetorchproblem ----
EXTENDS bridgetorchproblem

\* A set of uniquely named records could be used to allow duplicate walking
\* durations. Such as:
\*
\* { [name |-> "A", duration |-> 1], ... }
ConstPeople == {1, 2, 5, 8}
====

Footnotes

I’ve had the least resistance by learning a bit of PlusCal through Hillel Wayne’s writing, then starting on TLA+. ↩
An action is a predicate that describes how the state changes. I may be wrong, but I interpret this to mean all predicates are actions, but actions are not predicates. I may misuse these throughout, but I tried to be specific when referring to either. ↩

Links

Footnotes