Playing with TLA+
For the uninitiated, TLA+ (Temporal Logic of Actions) is a formal methods language. At its core, it’s more like a mathematical specification tool than a programming language. PlusCal is a higher-level language that is more suited for programmers and defining distributed system interactions. PlusCal is compiled to TLA+ for model-checking.
I have been briefly exposed to TLA+ in the past, but never had a need for it in my own work. Recently, however, such a case did pop up and I had to re-learn the language all over again (including PlusCal). The specification I ended up writing didn’t catch all of the nuance of the issue. It ended up being equivalent to a truth table, but it was a fun exercise and I want to improve.
I want to be more comfortable with the framework and incorporate it into my workflow as part of the design process. The overwhelming cause of bugs I’ve seen have to do with concurrency. It just isn’t intuitive, regardless of experience. Unit tests don’t catch these issues. If you’re lucky, you’re able to simulate most concurrent interactions, but you still might miss a rare edge case.
For example, if you have an edge case that is expected to occur in 1% of randomized deployments, you could attribute it to some transient failure and move on. Consider that there are 40 deployments, containing the same logical bug, to multiple data centers daily. The probability of seeing at least one failure is now much higher. In fact,
Now, that’s a 33% chance of a failure every day. It’s closer to an 86% chance to occur at least once during a workweek. So, we have motivation to learn!
I can’t share the problem from work, but I can choose from the pool of toy logic puzzles out there. I’ve opted to start with a few challenge problems in basic TLA+ without leveraging PlusCal since it all compiles down to TLA+. I want a solid foundation before continuing with the PlusCal journey1. I almost reached for the wolf, goat and cabbage problem, but you’ll find many posts about it already! I’ve turned to another river crossing puzzle called the Bridge and torch problem.
An excerpt from Wikipedia:
Four people come to a river in the night. There is a narrow bridge, and it can only hold two people at a time. They have one torch and, because it’s night, the torch has to be used when crossing the bridge. Person A can cross the bridge in 1 minute, B in 2 minutes, C in 5 minutes, and D in 8 minutes. When two people cross the bridge together, they must move at the slower person’s pace. The question is, can they all get across the bridge if the torch lasts only 15 minutes?
Typically, you’d want to provide a specification which produces no errors. However, for solving these sorts of toy problems, a common approach is to demonstrate a proof by contradiction. This produces a full trace of a solution, which you can then reason about without having to devise a solution in advance.
We start with a basic module listing some required imports, constants imported by the model and variables. You can think of the constants as the configuration for the exact values described in the snippet above.
---- MODULE bridgetorchproblem ----
EXTENDS FiniteSets, Integers, Sequences, TLC
CONSTANTS
People, \* Defined as { 1, 2, 5, 8 }
TimeLimit \* Defined as 15
\* Some validation before the model checker is run, to ensure the constants are
\* as expected
ASSUME TimeLimit \in Int
ASSUME People \subseteq Int
VARIABLES
start, \* Set of people at the start
end, \* Set of people at the end
torchStart, \* Whether the torch is at the start
time \* The elapsed time
Next we define the initialization predicate. TLC creates branches that satisfy
this predicate to start the execution. In this case, there is only one starting
state, as we only use =
(equality) instead of something like (x
\in S
).
Init ==
/\ start = People \* All people at the start
/\ end = {} \* No one is at the end
/\ time = 0 \* The elapsed time starts at zero
/\ torchStart = TRUE \* The torch is also at the start
I’m first covering the Next
action2, which has two branching actions each
with their own guard clauses. The guard clauses act as a short-circuit for not
accepting following actions.
Next ==
\/
\* Guard clause #1:
\* There are at least 2 people and the torch is at the start
/\ torchStart = TRUE /\ Cardinality(start) >= 2
/\ MoveEnd
\/
\* Guard clause #2:
\* There is at least 1 person at the start and the torch is at the end
\* (A reason to go back)
/\ torchStart = FALSE /\ start # {}
/\ MoveStart
You may have noticed I didn’t check that I have someone at the end for
MoveStart
. That’s because the torch can’t teleport to the end by itself.
For the first action, MoveEnd
, I wrote a custom Max
function, since it
doesn’t seem to be available in the built-in TLA+ modules. For moving to the
end, I pick any two distinct people and move them to the end. I add the time of
the slowest person and mark the torch as not being at the start.
Max(S) == CHOOSE x \in S: \A y \in S: x >= y
MoveEnd ==
\E first \in start:
\E second \in (start \ {first}):
/\ start' = start \ {first, second}
/\ end' = end \union {first, second}
/\ time' = time + Max({first, second})
/\ torchStart' = FALSE
MoveStart
, similarly, proceeds by selecting any one person to return and
adding their walking duration to the total. You’ll notice that I don’t care who
returns, slow or fast, I’m leaving the hard work for the computer.
MoveStart ==
\E person \in end:
/\ start' = start \union {person}
/\ end' = end \ {person}
/\ time' = time + person
/\ torchStart' = TRUE
The following is simply my type-safety invariant. I write this immediately, as I’m working through the specification. This just saves time from using all of the new operators incorrectly or just making a mistake. You could accidentally put unexpected sets in sets, use the wrong data type for a given operand, etc.
Beyond that, you have the ability to further restrict the type to the smallest
possible set given the context of the domain. The start
and end
variables
are examples of this. If a new person (integer) suddenly joins the fray, I’ve
definitely made a mistake.
TypeInvariant ==
/\ time \in Int
/\ torchStart \in BOOLEAN
/\ start \subseteq People
/\ end \subseteq People
/\ start \union end = People \* Make sure all are accounted for in the system
I decided on a “bait” invariant meaning, “all people can’t get to the other side within the time limit”.
\* An invariant to determine a possible solution via proof by
\* contradiction.
Unsolvable == ~(time <= TimeLimit /\ end = People)
Originally, I had the contradiction as a liveness property. However, violated liveness properties are not captured by name in TLC – any violation halts with a trace and a generic exception.
\* Eventually, the elapsed time is greater than the time limit.
Unsolvable == <>(time > TimeLimit)
The final Spec
action is as below. It indicates that a valid state graph:
- Satisfies the
Init
predicate (definition). - Always satisfies either the
Next
action, or all variables (those in the sequencevars
) remain unchanged.
\* I prefer putting this below the VARIABLES definition at the start
vars == <<start, end, torchStart, time>>
Spec == Init /\ [][Next]_vars
If you use the liveness property instead of the invariant, you’ll need to specify a form of fairness, otherwise the behaviour could stutter indefinitely. In short, you’ll need to add:
Spec == Init /\ [][Next]_vars /\ WF_vars(Next)
If you need a detailed explanation, this post on fairness is great.
Finally, this is the counter-example that is found. The trick is that you send over the two fastest people, return with either, send over the two slowest people, return with the fastest at the end, and finally send the two fastest. In other words, there are two counter-examples, but the model checker produced one, which is enough.
\* State 1
/\ time = 0
/\ end = {}
/\ torchStart = TRUE
/\ start = {1, 2, 5, 8}
\* State 2
/\ time = 2
/\ end = {1, 2}
/\ torchStart = FALSE
/\ start = {5, 8}
\* State 3
/\ time = 3
/\ end = {2}
/\ torchStart = TRUE
/\ start = {1, 5, 8}
\* State 4
/\ time = 11
/\ end = {2, 5, 8}
/\ torchStart = FALSE
/\ start = {1}
\* State 5
/\ time = 13
/\ end = {5, 8}
/\ torchStart = TRUE
/\ start = {1, 2}
\* State 6
/\ time = 15
/\ end = {1, 2, 5, 8}
/\ torchStart = FALSE
/\ start = {}
If you set the time limit to 14, for example, then you won’t find a counter-example and the problem is in fact unsolvable.
Finally, if you want to run this yourself, my configuration file
(MCbridgetorchproblem.cfg
):
SPECIFICATION
Spec
CONSTANTS
People <- ConstPeople
TimeLimit = 15
INVARIANT
TypeInvariant
Unsolvable
CHECK_DEADLOCK
FALSE
And my TLA model file (MCbridgetorchproblem.tla
):
---- MODULE MCbridgetorchproblem ----
EXTENDS bridgetorchproblem
\* A set of uniquely named records could be used to allow duplicate walking
\* durations. Such as:
\*
\* { [name |-> "A", duration |-> 1], ... }
ConstPeople == {1, 2, 5, 8}
====
Links
- Hillel Wayne’s https://learntla.com/ is a great starting point for programmers.
- Leslie Lamport, the creator of TLA+, has numerous resources and links on his page.
Footnotes
-
I’ve had the least resistance by learning a bit of PlusCal through Hillel Wayne’s writing, then starting on TLA+. ↩
-
An action is a predicate that describes how the state changes. I may be wrong, but I interpret this to mean all predicates are actions, but actions are not predicates. I may misuse these throughout, but I tried to be specific when referring to either. ↩