# Wait-Free Concurrent Computing

Note1: This is a summary of the course Concurrent Algorithms I take at EPFL.

Note2: In the following, when I use process, it doesn’t mean operation system process. It’s an abstraction of computation flow. You can think it as threads.

## Background

Regarding shared-memory concurrency, most programmers are familiar with locks, some have heard lock-free concurrency, very few have idea of wait-free concurrency. The latter two belong to the category of non-blocking algorithms.

In lock-based systems, A process holding a lock prevents all others from progressing. It’s usually implemented based on hardware primitives such as Compare and Swap(CAS). Higher level atomic objects include semaphore, mutex, monitors, etc. Coarse grained locks are slow, while fine grained locks are error-prone.

Generally there’re three variants of non-blocking algorithms, namely obstruction-free, lock-free and wait-free. They’re just three different levels of guarantee of the liveness of the concurrent system.

Obstruction-free. It’s the weakest form of guarantee among non-blocking algorithms. An algorithm is obstruction-free, if a process executes alone will eventually terminate.

Lock-free. In lock-free concurrency, it’s possible that a process is blocked by another process. However, at least one must be able to able to proceed no matter what happens to other processes.

Wait-free. Every process that invokes an operation eventually returns from the invocation. No process can ever be blocked by another computation. Wait-free algorithms are extremely efficient, thought it’s very difficult to design, test and debug.

Software Transaction Memory(STM) is a new technology in concurrent computing, it’s based on the idea transaction from database systems. STM hides locking from programmers just like GC(Garbage Collection) hides memory managenent from programmers. Depending on the implementation, STM can guarantee different levels of non-blocking liveness.

## Registers

Scientific researches often start from the simplest abstractions. Scholars studied the most basic unit of shared memory - registers. They classify registers by following criteria:

1. What values registers can hold? Binary, Multi-valued.
2. How registers are read and written? SRSW, MRSW, MRMW.
3. How registers behave to reads and writes? Safe, Regular, Atomic.

Explanations of SRSW, MRSW and MRMW:

• SRSW - Single Reader Single Writer
• MRSW - Mutiple Reader Single Writer
• MRMW - Mutiple Reader Mutiple Writer

Explanations of Safe, Regular and Atomic:

• Safe - If a read happens in parallel with a write, the read can return any value. Otherwise, a read should return the previous written value.
• Regular - If a read happens in parallel with a write, the read can return current written value or previous written value. Otherwise, a read should return the previous written value.
• Atomic - If two sequential reads get values v and v’ respectively, then v must be written before v’. In another word, all reads and writes of a atomic register can be linearized, as if each operation takes an indivisible point of time.

What’s marvelous is that, after two decades of hard work, people are able to prove following theorem:

A multi-valued MRMW atomic register can be implemented with binary SRSW safe register.

In following sections, let’s see how this is possible.

## From (binary) SRSW safe to (binary) MRSW safe

Uses an array of N (binary) SRSW registers, where N is the number of reader processs.

Read():

Write(v):
for i = 1 to N
Reg[i].write(v)


Note:

• The transformation works also for multi-valued registers and regular ones.
• It doesn’t work for atomic registers.

Why it doesn’t work for atomic registers? Suppose the previous value is 1, current writing value is 2. It’s possible to a reader to return 2, while a following reader to return 1, thus violating atomicity. This scenario is illustrated below:

p1: |---write(1)--|--------|-----------------write(2)---------------|



## From binary MRSW safe to binary MRSW regular

Use a binary MRSW safe register.

Read():

Write(v):
Reg.write(v)


Note:

• The transformation works for single reader registers
• It does not work for multi-valued registers
• It does not work for atomic registers

Why it works for binary-valued registers? The key point that if a read and a write are in parallel, then the read can return any value – which must be 1 or 0, i.e. current value or previous value, because we don’t write when the value doesn’t value.

Why it does not work for multi-valued registers? Look at following scenario:

p1: |---write(1)---|--------|-----------------write(2)---------------|



Why it doesn’t work for atomic registers? It’s because following scenario is possible(even if we use regular MRSW register):

p1: |---write(0)--|--------|-----------------write(1)---------------|



## From binary MRSW regular to M-Valued MRSW regular

Use an array of binary MRSW regular registers Reg[0,1,..,M] init to [1,0,..,0].

Read():
for i=0 to M
return i

Write(v):
Reg[v].write(1)
for i=v to 0
Reg[i].write(0)


Note:

• The transformation would not work if the Write() would first write 0s and then 1
• The transformation works for regular but NOT for atomic registers

Why it doesn’t work if Write() would first write 0s and then 1? Look at following scenario:

p1: |-----------------write(1)---------------|



Why it doesn’t work for atomic registers? Look at following scenario(assume we’re using binary MRSW regular registers):

p1: |---write(3)--|--------|-----------------write(2)---------------|



The case above happens when write(2) on register Reg[2] overlaps with the two reads.

The transformatin is not multi-valued MRSW atomic even if we use binary MRSW atomic registers:

p1: |--------write(1)-------|--|-----------------write(2)---------------|



The case above happens when operations take place in following order:

1. p2 reads Reg[1] and gets 0, and moves to Reg[2]
2. p1 writes 1 to Reg[1] and finishes write.
3. p1 writes 1 to Reg[2] and begins to set Reg[1] to 0
6. p1 writes 0 to Reg[1] and finishes write.

## From SRSW regular to SRSW atomic

Use one regular SRSW register Reg and two local variables.

Read():
if t' > t then t := t', v := v'
return v

Write(v):
t := t + 1
Reg.write(t, v)


The transformation would not work for multiple readers. Why it doesn’t work for multiple readers? Look at following scenario:

p1: |---write(1, 100)---|--|--------------write(2, 200)-----------------|



It’s not a issue when there’s only a single reader, because there’s only disorder between different readers.

## From binary MRSW atomic registers to multi-valued MRSW atomic registers

Use N binary MRSW atomic registers, where N is the maximum value supported.

Write(v):
Reg[v].write(1)
for i=v to 0
Reg[i].write(0)

r = -1
for i=0 to M
r := i
break

for i=r to 1
r := i
## !! no break here

return r


This algorithm is very tricky. We can verify that there’s no newer/older inversion in reads. Think if following scenario is possible: a process pj reads v(i), but a following read by process pk returns v(i-1). Suppose pk reads v(i-1).

If v(i) < v(i-1), then it’s impossible for a newer read to get v(i-1) > v(i) without finding v(i) first. Of course Reg[v(i)] could be set to 0 by a newer write v’ > v(i), but in that case the registers Reg[v(i)..v’-1] should all be 0, it’s impossible for a newer read to return an old value v(i-1) > v(i).

If v(i) > v(i-1), that means the writer hasn’t set Reg[v(i-1)] to 0 yet when pj finishes read. However, we know pj will reverse search the Reg[..], Reg[1..v(i)-1] must all be 0’s. So if a newer read gets v’ < v(i), it must be written by a newer write.

A tricky thing is, if we add break to the second for loop inside Read() of the algorithm above, it does’t work anymore. Following figure is an illustration of an newer/older read inversion:

p1: |--write(3)--|--|---write(1)---|---|----------write(2)------------|



The case above happens when operations take place in following order:

1. p1 writes 1 to Reg[3] and finishes write(3).
2. p2 reads Reg[1] & Reg[2] and gets 0, and reads 1 in Reg[3] and begins to move back
3. p1 writes 1 to Reg[1] and finishes write(1).
4. p1 writes 1 to Reg[2] and begins to set Reg[1] to 0
5. p2 reads Reg[2] and gets 1, finishes the read returning 2.
6. p3 reads Reg[1] and gets 1, finishes the read returning 1.
7. p1 writes 0 to Reg[1] and finishes write(2).

Also, if we changed the base registers from atomic to regular, the algorithm doesn’t work, as following scenario is possible:

p1: |---write(3)--|--------|-----------------write(2)---------------|



The scenario above happens when write to Reg[2] overlaps with the two reads, which may return current value or previous value.

## From SRSW atomic to MRSW atomic

We use N*N SRSW atomic registers RReg[(1,1),(1,2),..,(k,j),..(N,N)] to communicate among the readers. In RReg[(k,j)] the reader is pk and the writer is pj.

We also use n SRSW atomic registers WReg[1,..,N] to store new values. The writer in all these is p1. The reader in WReg[k] is pk.

Write(v):
t1 := t1+1
for j=1 to N
WReg.write(v,t1)

for j=1 to N do

(t, x) := highest(t[..], x[..])

for j=1 to N do
RReg[j, i].write(t, x)

return x


Note:

• The transformation would not work for multiple writers
• The transformation would not work if the readers do not communicate

Why does this algorithm work? Think if following scenario is possible: a process pj reads (t, v)(i), but a following read by process pk returns (r, v)(i-1). It can’t happen, as when pj finishes the read, it must have updated the register RReg[j, k] with (t, v)(i). So for a following read, it should return a value at least as newest as (t, v)(i).

## From MRSW atomic to MRMW atomic

Use N MRSW atomic registers Reg[1,..,N]; the writer of Reg[j] is pj.

Write(v):
for j=1 to N do

(t, x) := highest(t[..], x[..])

t: = t + 1
Reg[i].write(t, v)

for j=1 to N do

(t, x) = highest(t[..], x[..])

return x


Is it possible to have newer/older read inversions? Suppose that a reader reads v(i), we have to show that it’s impossible for a newer read to gets v(i-1). If the two writes write(v(i)) and write(v(i-1)) are not in parallel, the clock in Reg[v(i)] must be higher than clock in Reg[v(i-1)]. Thus, it’s impossible for a newer read to get the lower clock in Reg[v(i-1)] while ignoring Reg[v(i)].

What happens if two writes are in parallel? In such cases, the clock may be the same in two different registers. In this case, the different readers can read different values. However, following scenario can’t happen, as p5 reads 4 instead 3 implies the register holds value 4 has a higher clock, thus a newer read can’t return 3 instead 4.

p1: |-----------------------write(3)--------------------------------|

p2: |-----------------------write(4)--------------------------------|



## Counters and Snapshots

Based on registers, it’s possible to build more complex atomic wait-free objects. Counters and snapshots can be implemented based on registers.

### Counters

The processes share an array of registers Reg[1,..,N], the writer of Reg[i] is process pi.

Read():
sum := 0
for j=0 to N

return sum

Write(v):
Reg[i].write(temp)

Write'(v):
value := value + 1
Reg[i].write(value)


### Snapshots

Idea: To scan, a process keeps reading the entire snapshot, until two results at the same.

The processes share one array of N registers Reg[1,..,N], each contains a value and a timestamp. The writer of Reg[i] is process i.

A lock-free implementation is as follows:

collect():
for j=1 to N

return x

scan():
temp1 := self.collect()

while(true) do
temp2 := self.collect()
if (temp1.tsp = temp2.tsp)
return temp1.val
temp1 := temp2

update(v):
ts := ts + 1
Reg[i].write(v, ts)


A wait-free implementation is as follows:

collect():
for j=1 to N

return x

scan():
t1 := self.collect()
t2 := t1

while(true) do
t3 := self.collect()
if (t3 = t2) then return t3.val

for j=1 to N
if t3[j, 2] >= t1[j, 2] + 2
return t3[j, 3]

t2 := t3

update(v):
ts := ts + 1
Reg[i].write(v, ts, self.scan())


The implementation above is very tricky. Think about following questions:

• why is the loop not a dead loop?
• why use the for loop inside while?
• why use 2 in the timestamp comparison?
• why call scan in update?

If the reader in the while loop finds the value of a cell changes more than two times, it can conclude that the last such UPDATE must be within the interval of current READ. This is illustrated in the figure below:

p1: ---------|--------------------scan()---------------------|---

p2: ---|---update(5)----|-----------|---update(3)---|------------


In the figure above, if p1 notices that Reg[2] has been updated two times, it can safely conclude that the second UPDATE is within the interval of current SCAN.

The idea is that the writer can help the readers by writing a copy of the array along with the value to registers using the same SCAN operation, the reader can just use that copy as the return value. If there’re multiple UPDATEs in parallel, the fastest one will successfully scan a copy(that doesn’t change during two COLLECTs) and finish UPDATE. Then the second fatest writer will successfully scan and finsh UPDATE, and so on.

If a reader is slow and extremely unlucky(not fast enough to obtain two equal COLLECTs), then it waits for at most N parallel UPDATEs in order to read the help copy from a writer.

As all readers and writers can proceed until finish(no matter what happens to other processes), this implementation is indeed wait-free.

## Consensus

We’ve seen that wait-free implement counters and snapshots based on wait-free atomic registers, one question we might ask is can any object be wait-freely implemented based on registers? The answer is no. For example, it’s impossible to wait-freely implement queues based on registers.

To analyse the concurrency power of an object, it’s useful to introduce the object consensus. Consensus is an object supports only one operation propose and satisfies following conditions:

• Each process calling propose eventually terminates
• Each process calling propose gets the same return
• The value returned is among the values proposed

Given a consensus object, there’s an universal construction(introduced below) that can wait-freely implement any object. Consensus number of an object is the maximum number of processes among which the object can implement consensus. So consensus number is a good metric of the capabilities of an object. It turns out that there exists following hierarchy regarding the cababilities of objects:

• N: Compare and Swap(CAS)
• 2: Queue, Test & Set, Fetch & Increment
• 1: Registers

CAS is the most powerful, as it can solve the consensus problem among any number of processes. Registers are the weakest, which can’t even solve the consensus problem for two processes.

### 2-Process Consensus with Test & Set

Uses two registers R0 and R1, and a Test & Set object T.

propose(v):
R[i].write(v)
val := T.test&set()

if val = 0
return v
else


### 2-Process Consensus with swap

Uses two registers R0 and R1, and a swap object S.

propose(v):
R[i].write(v)
aux = i
S.swap(&aux)

if aux = ⊥
return v
else


### 2-Process Consensus with Queue

Uses two registers R0 and R1, and a queue Q, which is initialized to {winner, loser}.

propose(v):
R[i].write(v)
item := Q.dequeue()

if item = winner
return v
else


### N-Process Consensus with CAS

Uses a CAS object C.

propose(v):
val = C.c&s(⊥, v)

if val = ⊥
return v
else
return val


### N-Process Consensus with mem-to-mem-swap

mem-to-mem-swap register supports atomically exchange the value of two registers in addition to the operation READ and WRITE.

The implementation uses an array of N such registers A[1..N] with initial value 0 and one such register R with initial value 1.

propose(v):
Reg[i] = v
mem-to-mem-swap(A[i], R)

for j=0 to N do
if A[j] = 1 then return Reg[i]


### N-Process Consensus with mem-to-mem-move

mem-to-mem-move augments registers by providing an operation to atomically copy the value from one register to another register.

The implementation uses two arrays of such registers R1[1..N] and R0[1..N]. R1 is initilized to 1, and R0 is initilized to 0. An array of registers R[1..N] is to hold proposed values.

propose(v):
R[i] := v
R0[i] <- R1[i]

for j=1 to N do:
R1[j] = 0

for j=N to 1 do:
if R0[j] = 1 do:
return R[j]


### N-Process Consensus with augmented queue objects

The augmented queue objects support peek on the head of queue in addition to operations enqueue and dequene. Interestingly, with an addition of such a small feature, the consensus number of queue goes from 2 to infinity. The implementation is very straight-forward:

propose(v):
Q.enqueue(v)
Q.peek()


### Universal constructions

An object type T is universal if, given any (total) type T, an object of type T′ can be wait-freely implemented from objects of type T, together with atomic registers. Consensus is universal. The construction is as follows:

Shared Objects:
s_lreq: a list of N registers to store requests
s_lcons: an infinite list of consensus objects

Local Objects:
lperf: a list of performed requests
k: consensus object index
sn: sequence number of request
obj: a local copy of the object to be implemented

Init:
lperf := ∅
k := 0
sn := 0

Execute(req):
val = ⊥
pending = ∅
sn := sn + 1

s_lreq[i] := (req, sn, i)

REPEAT do:
foreach j in s_lreq[j] do:
pending := pending ∪ {s_lreq[j]}

pending := pending \ lperf

k := k + 1
reqs := s_lcons[k].propose(pending)

foreach (r, n, j) in reqs do:
if (r, n, j) = (req, sn, i)
val = obj.execute(r)
else
obj.execute(r)

lperf = lperf ∪ reqs

if val != ⊥ then BREAK

return val


## Consensus with Timing Assumptions

There’re different types of consensus if the termination property is different:

• Wait-free: If a correct process proposes, it eventually decides.
• Lock-Free: If a correct process proposes, at least one correct process eventually decides
• Obstruction-Free: If a correct process proposes and eventually executes alone, the process eventually decides

### Obstruction-free Consensus

It’s possible to implement Obstruction-free Consensus using only registers.

Each process maintains a local timestamp initilized to i. The processes share an array of register pairs Reg[1,..,n]; each element of the array contains two registers:

• Reg[i].T contains a timestamp (init to 0)
• Reg[i].V contains a pair (value,timestamp), initialized to (⊥,0)
propose(v):
while TRUE do:
Reg[i].T.write(ts)
(maxts, val) = highestTsValue(Reg[1..n].V)
if val = ⊥ then val := v

Reg[i].V.write((ts, val))

if ts = highestTs(Reg[1..n].T) then return val

ts := ts + n


### Lock-free Consensus

The implementation of Lock-free Consensus depends on:

• Obstruction-free Consensus, and

If a correct process invokes leader, then the invocation returns and eventually, some correct process is permanently the only leader

<>Leader encapsulates the following synchrony assumption: there is a time after which a lower and an upper bound hold on the time it takes for every process to execute a step (eventual synchrony)

The idea is to use <>leader to make sure that, eventually, one process keeps executing steps alone, until it decides.

The only change in the implementation is to add a guard to the body of while, to make sure only the leader executes the body.

propose(v):
while TRUE do:
Reg[i].T.write(ts)
(maxts, val) = highestTsValue(Reg[1..n].V)
if val = ⊥ then val := v

Reg[i].V.write((ts, val))

if ts = highestTs(Reg[1..n].T) then return val

ts := ts + n


### Wait-free Consensus

We can implement Wait-free Consensus based on Lock-free Consensus. The idea is that once the leader decides, it writes the value to a register, other processes return the value in the register.

propose(v):
Reg[i].T.write(ts)
(maxts, val) = highestTsValue(Reg[1..n].V)
if val = ⊥ then val := v

Reg[i].V.write((ts, val))

if ts = highestTs(Reg[1..n].T) then Dec.write(val)

ts := ts + n