- vlocks for Bare-Metal Mutual Exclusion
- ======================================
- Voting Locks, or "vlocks", provide a simple low-level mutual exclusion
- mechanism, with reasonable but minimal requirements on the memory
- system.
- These are intended to be used to coordinate critical activity among CPUs
- which are otherwise non-coherent, in situations where the hardware
- provides no other mechanism to support this and ordinary spinlocks
- cannot be used.
- vlocks make use of the atomicity provided by the memory system for
- writes to a single memory location. To arbitrate, every CPU "votes for
- itself", by storing a unique number to a common memory location. The
- final value seen in that memory location when all the votes have been
- cast identifies the winner.
- In order to make sure that the election produces an unambiguous result
- in finite time, a CPU will only enter the election in the first place if
- no winner has been chosen and the election does not appear to have
- started yet.
- Algorithm
- ---------
- The easiest way to explain the vlocks algorithm is with some pseudocode:
-     int currently_voting[NR_CPUS] = { 0, };
-     int last_vote = -1; /* no votes yet */
-     bool vlock_trylock(int this_cpu)
-     {
-         /* signal our desire to vote */
-         currently_voting[this_cpu] = 1;
-         if (last_vote != -1) {
-             /* another CPU has already volunteered itself */
-             currently_voting[this_cpu] = 0;
-             return false; /* not ourself */
-         }
-         /* let's suggest ourself */
-         last_vote = this_cpu;
-         currently_voting[this_cpu] = 0;
-         /* then wait until everyone else is done voting */
-         for_each_cpu(i) {
-             while (currently_voting[i] != 0)
-                 /* wait */;
-         }
-         /* result */
-         if (last_vote == this_cpu)
-             return true; /* we won */
-         return false;
-     }
-     void vlock_unlock(void)
-     {
-         last_vote = -1;
-     }
- The currently_voting[] array provides a way for the CPUs to determine
- whether an election is in progress, and plays a role analogous to the
- "entering" array in Lamport's bakery algorithm [1].
- However, once the election has started, the underlying memory system
- atomicity is used to pick the winner. This avoids the need for a static
- priority rule to act as a tie-breaker, or any counters which could
- overflow.
- As long as the last_vote variable is globally visible to all CPUs, it
- will contain only one value that won't change once every CPU has cleared
- its currently_voting flag.
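The pseudocode above can be sketched as compilable C11. This is only an
illustration under stated assumptions: sequentially consistent accesses from
<stdatomic.h> on a hosted, coherent system stand in for the globally visible
single-location writes that the algorithm relies on. The struct vlock type,
vlock_init(), VLOCK_NO_OWNER and the NR_CPUS value of 4 are hypothetical names
chosen for the sketch and are not taken from the ARM implementation [2].

```c
#include <stdatomic.h>
#include <stdbool.h>

#define NR_CPUS        4     /* assumed CPU count for this sketch */
#define VLOCK_NO_OWNER (-1)  /* "no votes yet", as in the pseudocode */

struct vlock {
	atomic_int currently_voting[NR_CPUS];
	atomic_int last_vote;
};

void vlock_init(struct vlock *v)
{
	for (int i = 0; i < NR_CPUS; i++)
		atomic_init(&v->currently_voting[i], 0);
	atomic_init(&v->last_vote, VLOCK_NO_OWNER);
}

bool vlock_trylock(struct vlock *v, int this_cpu)
{
	/* signal our desire to vote */
	atomic_store(&v->currently_voting[this_cpu], 1);

	if (atomic_load(&v->last_vote) != VLOCK_NO_OWNER) {
		/* a vote has already been cast: withdraw */
		atomic_store(&v->currently_voting[this_cpu], 0);
		return false;
	}

	/* vote for ourself, then mark our vote as cast */
	atomic_store(&v->last_vote, this_cpu);
	atomic_store(&v->currently_voting[this_cpu], 0);

	/* wait until every other CPU has finished voting */
	for (int i = 0; i < NR_CPUS; i++)
		while (atomic_load(&v->currently_voting[i]) != 0)
			; /* spin */

	/* the last vote written is the winner */
	return atomic_load(&v->last_vote) == this_cpu;
}

void vlock_unlock(struct vlock *v)
{
	atomic_store(&v->last_vote, VLOCK_NO_OWNER);
}
```

Called from a single thread, vlock_trylock(&v, 0) succeeds on a freshly
initialised lock, and a later vlock_trylock(&v, 1) fails until
vlock_unlock(&v) has been called.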
- Features and limitations
- ------------------------
- * vlocks are not intended to be fair. In the contended case, the
- _last_ CPU to attempt to take the lock is the one most likely
- to win.
- vlocks are therefore best suited to situations where it is necessary
- to pick a unique winner, but it does not matter which CPU actually
- wins.
- * Like other similar mechanisms, vlocks will not scale well to a large
- number of CPUs.
- vlocks can be cascaded in a voting hierarchy to permit better scaling
- if necessary, as in the following hypothetical example for 4096 CPUs:
-     /* first level: local election among the 16 CPUs of a town */
-     my_town = towns[this_cpu >> 4];
-     I_won = vlock_trylock(my_town, this_cpu & 0xf);
-     if (I_won) {
-         /* we won the town election, let's go for the state */
-         my_state = states[this_cpu >> 8];
-         I_won = vlock_trylock(my_state, (this_cpu >> 4) & 0xf);
-         if (I_won) {
-             /* and so on */
-             I_won = vlock_trylock(the_whole_country, (this_cpu >> 8) & 0xf);
-             if (I_won) {
-                 /* ... */
-                 vlock_unlock(the_whole_country);
-             }
-             vlock_unlock(my_state);
-         }
-         vlock_unlock(my_town);
-     }
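One way to picture the 4096-CPU hierarchy is as three 4-bit fields of the CPU
number, giving 16 voters per lock at every level. The helper below is a
hypothetical sketch of that decomposition: struct vlock_path and
vlock_path_for_cpu() are names invented here, and the assumption is that CPU
numbers lie in the range 0..4095.

```c
/* Which lock a CPU contends for at each level, and its voter ID there. */
struct vlock_path {
	unsigned town;          /* town lock index, 0..255 */
	unsigned town_voter;    /* voter ID within the town, 0..15 */
	unsigned state;         /* state lock index, 0..15 */
	unsigned state_voter;   /* voter ID within the state, 0..15 */
	unsigned country_voter; /* voter ID in the country election, 0..15 */
};

/* Assumes cpu < 4096, i.e. a 12-bit CPU number. */
struct vlock_path vlock_path_for_cpu(unsigned cpu)
{
	struct vlock_path p;

	p.town          = cpu >> 4;          /* upper 8 bits pick the town */
	p.town_voter    = cpu & 0xf;         /* low 4 bits: CPU within the town */
	p.state         = cpu >> 8;          /* upper 4 bits pick the state */
	p.state_voter   = (cpu >> 4) & 0xf;  /* middle 4 bits: town within the state */
	p.country_voter = (cpu >> 8) & 0xf;  /* upper 4 bits: state within the country */
	return p;
}
```

With this split, each town winner represents its town in the state election,
and each state winner represents its state in the country election, so no two
contenders for the same lock share a voter ID.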
- ARM implementation
- ------------------
- The current ARM implementation [2] contains some optimisations beyond
- the basic algorithm:
- * By packing the members of the currently_voting array close together,
- we can read the whole array in one transaction (providing the number
- of CPUs potentially contending the lock is small enough). This
- reduces the number of round-trips required to external memory.
- In the ARM implementation, this means that we can use a single load
- and comparison:
-     LDR     Rt, [Rn]
-     CMP     Rt, #0
- ...in place of code equivalent to:
-     LDRB    Rt, [Rn]
-     CMP     Rt, #0
-     LDRBEQ  Rt, [Rn, #1]
-     CMPEQ   Rt, #0
-     LDRBEQ  Rt, [Rn, #2]
-     CMPEQ   Rt, #0
-     LDRBEQ  Rt, [Rn, #3]
-     CMPEQ   Rt, #0
- This cuts down on the fast-path latency, as well as potentially
- reducing bus contention in contended cases.
- The optimisation relies on the fact that the ARM memory system
- guarantees coherency between overlapping memory accesses of
- different sizes, similarly to many other architectures. Note that
- we do not care which element of currently_voting appears in which
- bits of Rt, so there is no need to worry about endianness in this
- optimisation.
- If there are too many CPUs to read the currently_voting array in
- one transaction, then multiple transactions are still required. The
- implementation uses a simple loop of word-sized loads for this
- case. The number of transactions is still fewer than would be
- required if bytes were loaded individually.
- In principle, we could aggregate further by using LDRD or LDM, but
- to keep the code simple this was not attempted in the initial
- implementation.
- * vlocks are currently only used to coordinate between CPUs which are
- unable to enable their caches yet. This means that the
- implementation removes many of the barriers which would be required
- when executing the algorithm in cached memory.
- Packing of the currently_voting array does not work with cached
- memory unless all CPUs contending the lock are cache-coherent, due
- to cache writebacks from one CPU clobbering values written by other
- CPUs. (Though if all the CPUs are cache-coherent, you should
- probably be using proper spinlocks instead anyway.)
- * The "no votes yet" value used for the last_vote variable is 0 (not
- -1 as in the pseudocode). This allows statically-allocated vlocks
- to be implicitly initialised to an unlocked state simply by putting
- them in .bss.
- An offset is added to each CPU's ID for the purpose of setting this
- variable, so that no CPU uses the value 0 for its ID.
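The packing and zero-initialisation ideas above can be illustrated in C. This
is a sketch under assumptions, not the layout used in [2]: struct
packed_vlock, anyone_voting(), owner_value() and VLOCK_OWNER_OFFSET are
hypothetical names, at most four contending CPUs are assumed, and volatile is
used only to force real loads and stores, as would be appropriate for the
uncached memory described above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical offset: keeps the value 0 free to mean "no votes yet". */
#define VLOCK_OWNER_OFFSET 1u

struct packed_vlock {
	union {
		volatile uint8_t  voting[4];    /* one flag byte per CPU */
		volatile uint32_t voting_word;  /* the same four flags as one word */
	};
	volatile uint32_t owner;            /* 0 means unlocked */
};

/*
 * Check all four voting flags with a single word-sized load.
 * Only zero versus non-zero matters, so byte order is irrelevant.
 */
bool anyone_voting(const struct packed_vlock *v)
{
	return v->voting_word != 0;
}

/* Value a CPU stores in 'owner' when voting for itself; never 0. */
uint32_t owner_value(unsigned cpu)
{
	return cpu + VLOCK_OWNER_OFFSET;
}
```

A statically allocated struct packed_vlock with no initialiser is all zeroes,
so it starts out unlocked with no CPU voting, which matches the .bss
behaviour described above.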
- Colophon
- --------
- Originally created and documented by Dave Martin for Linaro Limited, for
- use in ARM-based big.LITTLE platforms, with review and input gratefully
- received from Nicolas Pitre and Achin Gupta. Thanks to Nicolas for
- grabbing most of this text out of the relevant mail thread and writing
- up the pseudocode.
- Copyright (C) 2012-2013 Linaro Limited
- Distributed under the terms of Version 2 of the GNU General Public
- License, as defined in linux/COPYING.
- References
- ----------
- [1] Lamport, L. "A New Solution of Dijkstra's Concurrent Programming
- Problem", Communications of the ACM 17, 8 (August 1974), 453-455.
- https://en.wikipedia.org/wiki/Lamport%27s_bakery_algorithm
- [2] linux/arch/arm/common/vlock.S, www.kernel.org.