vlocks for Bare-Metal Mutual Exclusion
======================================

Voting Locks, or "vlocks", provide a simple low-level mutual exclusion
mechanism, with reasonable but minimal requirements on the memory
system.

These are intended to be used to coordinate critical activity among CPUs
which are otherwise non-coherent, in situations where the hardware
provides no other mechanism to support this and ordinary spinlocks
cannot be used.


vlocks make use of the atomicity provided by the memory system for
writes to a single memory location. To arbitrate, every CPU "votes for
itself", by storing a unique number to a common memory location. The
final value seen in that memory location when all the votes have been
cast identifies the winner.

In order to make sure that the election produces an unambiguous result
in finite time, a CPU will only enter the election in the first place if
no winner has been chosen and the election does not appear to have
started yet.


Algorithm
---------

The easiest way to explain the vlocks algorithm is with some pseudo-code:


	int currently_voting[NR_CPUS] = { 0, };
	int last_vote = -1; /* no votes yet */

	bool vlock_trylock(int this_cpu)
	{
		int i;

		/* signal our desire to vote */
		currently_voting[this_cpu] = 1;
		if (last_vote != -1) {
			/* someone has already volunteered */
			currently_voting[this_cpu] = 0;
			return false; /* not ourself */
		}

		/* let's suggest ourself */
		last_vote = this_cpu;
		currently_voting[this_cpu] = 0;

		/* then wait until everyone else is done voting */
		for_each_cpu(i) {
			while (currently_voting[i] != 0)
				/* wait */;
		}

		/* result */
		if (last_vote == this_cpu)
			return true; /* we won */
		return false;
	}

	void vlock_unlock(void)
	{
		/* back to "no votes yet"; only the winner may do this */
		last_vote = -1;
	}
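
For experimenting in user space, the pseudo-code can be written out
with C11 atomics. This is only a sketch under stated assumptions: the
default sequentially-consistent atomics stand in for whatever ordering
the real memory system provides, and NR_CPUS = 4 and the NO_VOTE name
are chosen here purely for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define NR_CPUS 4        /* assumption: voter count for this sketch */
#define NO_VOTE (-1)     /* assumption: name for the "no votes yet" value */

/* Static storage is zero-filled, so nobody is voting initially. */
static atomic_int currently_voting[NR_CPUS];
static atomic_int last_vote = NO_VOTE;

bool vlock_trylock(int this_cpu)
{
	assert(this_cpu >= 0 && this_cpu < NR_CPUS);

	/* signal our desire to vote */
	atomic_store(&currently_voting[this_cpu], 1);

	if (atomic_load(&last_vote) != NO_VOTE) {
		/* an election has already started: back off */
		atomic_store(&currently_voting[this_cpu], 0);
		return false;
	}

	/* vote for ourself, then lower our flag */
	atomic_store(&last_vote, this_cpu);
	atomic_store(&currently_voting[this_cpu], 0);

	/* wait until every other voter has lowered its flag */
	for (int i = 0; i < NR_CPUS; i++)
		while (atomic_load(&currently_voting[i]) != 0)
			; /* spin */

	/* the vote that survived identifies the winner */
	return atomic_load(&last_vote) == this_cpu;
}

void vlock_unlock(void)
{
	/* only the current winner should call this */
	atomic_store(&last_vote, NO_VOTE);
}
```

Called from a single thread, vlock_trylock(0) succeeds, a following
vlock_trylock(1) fails until vlock_unlock() has been called, and then
succeeds.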

The currently_voting[] array provides a way for the CPUs to determine
whether an election is in progress, and plays a role analogous to the
"entering" array in Lamport's bakery algorithm [1].

However, once the election has started, the underlying memory system
atomicity is used to pick the winner. This avoids the need for a static
priority rule to act as a tie-breaker, or any counters which could
overflow.

As long as the last_vote variable is globally visible to all CPUs, it
will contain only one value that won't change once every CPU has cleared
its currently_voting flag.
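
To see why the result is stable, it can help to replay one contended
election step by step. The following single-threaded sketch does so for
two CPUs; the sim_* helpers and struct vlock_sim are hypothetical names
introduced for this illustration, and the interleaving shown is just
one of several possible orders.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 2
#define NO_VOTE (-1)

struct vlock_sim {
	int currently_voting[NR_CPUS];
	int last_vote;
};

/* Step 1: raise our flag; report whether the election is still open. */
static bool sim_enter(struct vlock_sim *v, int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	v->currently_voting[cpu] = 1;
	if (v->last_vote != NO_VOTE) {
		v->currently_voting[cpu] = 0;
		return false;
	}
	return true;
}

/* Step 2: cast our vote and lower our flag. */
static void sim_vote(struct vlock_sim *v, int cpu)
{
	v->last_vote = cpu;
	v->currently_voting[cpu] = 0;
}

/* Step 3: the result is only meaningful once every flag is clear. */
static bool sim_settled(const struct vlock_sim *v)
{
	for (int i = 0; i < NR_CPUS; i++)
		if (v->currently_voting[i] != 0)
			return false;
	return true;
}

/* Replay one contended election and return the winning CPU. */
int sim_contended_election(void)
{
	struct vlock_sim v = { .currently_voting = { 0 }, .last_vote = NO_VOTE };

	bool cpu0_in = sim_enter(&v, 0);   /* both CPUs see "no votes yet" */
	bool cpu1_in = sim_enter(&v, 1);
	assert(cpu0_in && cpu1_in);

	sim_vote(&v, 0);                   /* CPU 0 votes first */
	assert(!sim_settled(&v));          /* CPU 1's flag is up: CPU 0 must wait */
	sim_vote(&v, 1);                   /* CPU 1 overwrites the vote */
	assert(sim_settled(&v));           /* nobody can change last_vote now */

	return v.last_vote;
}
```

In this interleaving CPU 0 cannot read the result early, because CPU 1
raised its flag before CPU 0 voted. By the time all flags are clear,
both CPUs agree on the value left in last_vote.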


Features and limitations
------------------------

 * vlocks are not intended to be fair. In the contended case, the
   _last_ CPU to attempt to get the lock is the one most likely to
   win.

   vlocks are therefore best suited to situations where it is necessary
   to pick a unique winner, but it does not matter which CPU actually
   wins.

 * Like other similar mechanisms, vlocks will not scale well to a large
   number of CPUs.

   vlocks can be cascaded in a voting hierarchy to permit better scaling
   if necessary, as in the following hypothetical example for 4096 CPUs:

	/* first level: local election among the 16 CPUs of a town */
	my_town = towns[this_cpu >> 4];
	I_won = vlock_trylock(my_town, this_cpu & 0xf);
	if (I_won) {
		/* we won the town election, let's go for the state */
		my_state = states[this_cpu >> 8];
		I_won = vlock_trylock(my_state, (this_cpu >> 4) & 0xf);
		if (I_won) {
			/* and so on */
			I_won = vlock_trylock(the_whole_country, (this_cpu >> 8) & 0xf);
			if (I_won) {
				/* ... */
				vlock_unlock(the_whole_country);
			}
			vlock_unlock(my_state);
		}
		/* only the winner of a lock releases it */
		vlock_unlock(my_town);
	}
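
In a hierarchy like this, the voters contending for any one lock need
distinct IDs, so each level has to use a different part of the CPU
number. One possible way to derive the lock index and voter ID per
level is sketched below, assuming 16 voters per lock and three levels;
struct vlock_slot and vlock_slot_for() are hypothetical names, not part
of any existing implementation.

```c
#include <assert.h>

#define VOTERS_SHIFT 4                          /* assumption: 16 voters per lock */
#define VOTER_MASK   ((1u << VOTERS_SHIFT) - 1)
#define NR_LEVELS    3                          /* town, state, country */

struct vlock_slot {
	unsigned int lock_index;   /* which lock to contend for at this level */
	unsigned int voter_id;     /* our ID among the voters of that lock */
};

/* level 0 = town, 1 = state, 2 = country */
struct vlock_slot vlock_slot_for(unsigned int cpu, unsigned int level)
{
	unsigned int shift = level * VOTERS_SHIFT;
	struct vlock_slot slot;

	assert(cpu < 4096 && level < NR_LEVELS);

	/* the nibble for this level tells voters of one lock apart */
	slot.voter_id = (cpu >> shift) & VOTER_MASK;
	/* all higher bits select the lock itself */
	slot.lock_index = cpu >> (shift + VOTERS_SHIFT);
	return slot;
}
```

For CPU 0xABC this gives town 0xAB with voter ID 0xC, state 0xA with
voter ID 0xB, and the single country lock with voter ID 0xA.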


ARM implementation
------------------

The current ARM implementation [2] contains some optimisations beyond
the basic algorithm:

 * By packing the members of the currently_voting array close together,
   we can read the whole array in one transaction (provided the number
   of CPUs potentially contending for the lock is small enough). This
   reduces the number of round-trips required to external memory.

   In the ARM implementation, this means that we can use a single load
   and comparison:

	LDR	Rt, [Rn]
	CMP	Rt, #0

   ...in place of code equivalent to:

	LDRB	Rt, [Rn]
	CMP	Rt, #0
	LDRBEQ	Rt, [Rn, #1]
	CMPEQ	Rt, #0
	LDRBEQ	Rt, [Rn, #2]
	CMPEQ	Rt, #0
	LDRBEQ	Rt, [Rn, #3]
	CMPEQ	Rt, #0

   This cuts down on the fast-path latency, as well as potentially
   reducing bus contention in contended cases.

   The optimisation relies on the fact that the ARM memory system
   guarantees coherency between overlapping memory accesses of
   different sizes, similarly to many other architectures. Note that
   we do not care which element of currently_voting appears in which
   bits of Rt, so there is no need to worry about endianness in this
   optimisation.

   If there are too many CPUs to read the currently_voting array in
   one transaction then multiple transactions are still required. The
   implementation uses a simple loop of word-sized loads for this
   case. The number of transactions is still fewer than would be
   required if bytes were loaded individually.

   In principle, we could aggregate further by using LDRD or LDM, but
   to keep the code simple this was not attempted in the initial
   implementation.

 * vlocks are currently only used to coordinate between CPUs which are
   unable to enable their caches yet. This means that the
   implementation removes many of the barriers which would be required
   when executing the algorithm in cached memory.

   Packing of the currently_voting array does not work with cached
   memory unless all CPUs contending for the lock are cache-coherent,
   due to cache writebacks from one CPU clobbering values written by
   other CPUs. (Though if all the CPUs are cache-coherent, you should
   probably be using proper spinlocks instead anyway.)

 * The "no votes yet" value used for the last_vote variable is 0 (not
   -1 as in the pseudocode). This allows statically-allocated vlocks
   to be implicitly initialised to an unlocked state simply by putting
   them in .bss.

   An offset is added to each CPU's ID for the purpose of setting this
   variable, so that no CPU uses the value 0 for its ID.
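
The packing and the zero "no votes yet" encoding can be pictured in C
roughly as follows. This is a layout sketch only, not the actual
assembly in [2]: struct vlock, VLOCK_NR_VOTERS, VLOCK_ID_OFFSET and the
helper names are assumptions made for this example, and memcpy() merely
stands in for the word-sized load (a real implementation would need
genuine single-copy-atomic loads on suitable memory).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define VLOCK_NR_VOTERS 8   /* assumption: a multiple of 4 */
#define VLOCK_ID_OFFSET 1   /* assumption: keeps 0 free for "no votes yet" */

struct vlock {
	uint32_t last_vote;                 /* 0 means "no votes yet" */
	uint8_t  voting[VLOCK_NR_VOTERS];   /* one flag byte per CPU, packed */
};

/* File-scope storage is zero-filled, so this lock starts out unlocked. */
struct vlock demo_lock;

/* Value a CPU stores to last_vote; never 0 thanks to the offset. */
uint32_t vlock_vote_value(unsigned int cpu)
{
	assert(cpu < VLOCK_NR_VOTERS);
	return cpu + VLOCK_ID_OFFSET;
}

/* Check four flag bytes per iteration instead of one. */
bool vlock_anyone_voting(const struct vlock *v)
{
	for (size_t i = 0; i < VLOCK_NR_VOTERS; i += sizeof(uint32_t)) {
		uint32_t word;

		/* byte order is irrelevant: we only test for non-zero */
		memcpy(&word, &v->voting[i], sizeof(word));
		if (word != 0)
			return true;
	}
	return false;
}
```

With eight voters the scan needs two word-sized reads rather than eight
byte-sized ones, and a lock that has never been written reads as
unlocked with nobody voting.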


Colophon
--------

Originally created and documented by Dave Martin for Linaro Limited, for
use in ARM-based big.LITTLE platforms, with review and input gratefully
received from Nicolas Pitre and Achin Gupta. Thanks to Nicolas for
grabbing most of this text out of the relevant mail thread and writing
up the pseudocode.

Copyright (C) 2012-2013 Linaro Limited
Distributed under the terms of Version 2 of the GNU General Public
License, as defined in linux/COPYING.


References
----------

[1] Lamport, L. "A New Solution of Dijkstra's Concurrent Programming
    Problem", Communications of the ACM 17, 8 (August 1974), 453-455.

    https://en.wikipedia.org/wiki/Lamport%27s_bakery_algorithm

[2] linux/arch/arm/common/vlock.S, www.kernel.org.