The cluster MD is a shared-device RAID for a cluster.


1. On-disk format

Separate write-intent bitmaps are used for each cluster node.
The bitmaps record all writes that may have been started on that node,
and may not yet have finished. The on-disk layout is:

0                    4k                     8k                    12k
-------------------------------------------------------------------
| idle                | md super            | bm super [0] + bits |
| bm bits[0, contd]   | bm super[1] + bits  | bm bits[1, contd]   |
| bm super[2] + bits  | bm bits [2, contd]  | bm super[3] + bits  |
| bm bits [3, contd]  |                     |                     |

During "normal" functioning we assume the filesystem ensures that only one
node writes to any given block at a time, so a write request will

 - set the appropriate bit (if not already set)
 - commit the write to all mirrors
 - schedule the bit to be cleared after a timeout.
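
A minimal sketch of this write path in C, assuming hypothetical helpers
(sector_to_bitmap_chunk(), bitmap_set_bit(), write_to_all_mirrors(),
schedule_bit_clear() and the mddev fields used here are illustrative
names, not the actual md-cluster interfaces):

	/* Bit first, then data, then a delayed clear of the bit. */
	static int clustered_write(struct mddev *mddev, sector_t sector,
				   void *data, size_t len)
	{
		unsigned long chunk = sector_to_bitmap_chunk(mddev, sector);
		int err;

		/* 1. Record the write intent in this node's own bitmap
		 *    (a no-op if the bit is already set). */
		bitmap_set_bit(mddev->node_bitmap, chunk);

		/* 2. Commit the write to every mirror before returning. */
		err = write_to_all_mirrors(mddev, sector, data, len);
		if (err)
			return err;

		/* 3. Clear the bit only after a timeout, so that nearby
		 *    writes can reuse it without extra bitmap updates. */
		schedule_bit_clear(mddev->node_bitmap, chunk, BIT_CLEAR_TIMEOUT);
		return 0;
	}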

Reads are just handled normally. It is up to the filesystem to
ensure one node doesn't read from a location where another node (or the same
node) is writing.

2. DLM Locks for management

There are two locks for managing the device:

2.1 Bitmap lock resource (bm_lockres)

The bm_lockres protects individual node bitmaps. They are named in the
form bitmap001 for node 1, bitmap002 for node 2, and so on. When a node
joins the cluster, it acquires the lock in PW mode and it stays so
during the lifetime the node is part of the cluster. The lock resource
number is based on the slot number returned by the DLM subsystem. Since
DLM starts node count from one and bitmap slots start from zero, one is
subtracted from the DLM slot number to arrive at the bitmap slot number.
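
As a worked example of this numbering, the small standalone C program
below prints the mapping (the "bitmap%03d" format string is inferred
from the names quoted above and may differ from the in-kernel one):

	#include <stdio.h>

	int main(void)
	{
		char name[16];
		int dlm_slot;

		for (dlm_slot = 1; dlm_slot <= 3; dlm_slot++) {
			/* DLM slots count from one, bitmap slots from zero. */
			int bitmap_slot = dlm_slot - 1;

			snprintf(name, sizeof(name), "bitmap%03d", dlm_slot);
			printf("node %d: lockres %s protects bitmap slot %d\n",
			       dlm_slot, name, bitmap_slot);
		}
		return 0;
	}

This prints "node 1: lockres bitmap001 protects bitmap slot 0", and so
on for the remaining nodes.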

3. Communication

Each node has to communicate with other nodes when starting or ending
resync, and for metadata superblock updates.

3.1 Message Types

There are three types of messages which are passed:

3.1.1 METADATA_UPDATED: informs other nodes that the metadata has been
updated, and the node must re-read the md superblock. This is performed
synchronously.

3.1.2 RESYNC: informs other nodes that a resync is initiated or ended
so that each node may suspend or resume the region.

3.1.3 NEWDISK: informs other nodes that a device is being added to the
array, identified by its uuid and slot number (see section 5).
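
These messages could be carried in a structure along the following
lines; the field names and widths here are a sketch, not the exact
in-kernel layout:

	#include <stdint.h>

	enum cluster_msg_type {
		METADATA_UPDATED = 0,
		RESYNC_START,
		RESYNC_FINISHED,
		NEWDISK,
	};

	/* Hypothetical wire format; the payload travels in the DLM LVB
	 * of the Message lock resource (see 3.2). */
	struct cluster_msg {
		uint32_t type;		/* enum cluster_msg_type */
		uint32_t slot;		/* sender's DLM slot number */
		uint64_t low;		/* RESYNC: start of suspended range */
		uint64_t high;		/* RESYNC: end of suspended range */
		uint8_t  uuid[16];	/* NEWDISK: uuid of the new device */
		uint32_t raid_slot;	/* NEWDISK: slot of the new device */
	};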

3.2 Communication mechanism

The DLM LVB is used to communicate among the nodes of the cluster. There
are three lock resources used for the purpose:

3.2.1 Token: The resource which protects the entire communication
system. The node holding the token resource is allowed to
communicate.

3.2.2 Message: The lock resource which carries the data to
communicate.

3.2.3 Ack: The resource whose acquisition means the message has been
acknowledged by all nodes in the cluster. The BAST of the resource
is used to inform the receiving node that a node wants to communicate.

The algorithm is:

 1. receive status

   sender        receiver                   receiver
   ACK:CR         ACK:CR                     ACK:CR

 2. sender get EX of TOKEN
    sender get EX of MESSAGE

   sender        receiver                   receiver
   TOKEN:EX       ACK:CR                     ACK:CR
   MESSAGE:EX
   ACK:CR

    Sender checks that it still needs to send a message. Messages received
    or other events that happened while waiting for the TOKEN may have made
    this message inappropriate or redundant.

 3. sender write LVB
    sender down-convert MESSAGE from EX to CW
    sender try to get EX of ACK
    [ wait until all receivers have *processed* the MESSAGE ]

                                             [ triggered by bast of ACK ]
                                             receiver get CR of MESSAGE
                                             receiver read LVB
                                             receiver processes the message
                                             [ wait finish ]
                                             receiver release ACK

   sender        receiver                   receiver
   TOKEN:EX       MESSAGE:CR                 MESSAGE:CR
   MESSAGE:CW
   ACK:EX

 4. triggered by grant of EX on ACK (indicating all receivers have processed
    the message)

    sender down-convert ACK from EX to CR
    sender release MESSAGE
    sender release TOKEN

                                             receiver upconvert to PR of MESSAGE
                                             receiver get CR of ACK
                                             receiver release MESSAGE

   sender        receiver                   receiver
   ACK:CR         ACK:CR                     ACK:CR
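
The sender's half of this exchange can be sketched in C as below;
lock_resource(), convert_lock(), unlock_resource(), write_lvb() and
message_still_needed() are hypothetical wrappers around the DLM API,
not the actual md-cluster functions:

	/* Hypothetical blocking wrappers: each call returns only once
	 * the requested mode has been granted. */
	static void send_cluster_msg(struct cluster_msg *msg)
	{
		lock_resource(TOKEN, EX);	/* only the token holder sends */
		lock_resource(MESSAGE, EX);

		/* Events seen while waiting for TOKEN may have made this
		 * message inappropriate or redundant; re-check first. */
		if (message_still_needed(msg)) {
			write_lvb(MESSAGE, msg);	/* payload rides in the LVB */
			convert_lock(MESSAGE, EX, CW);	/* receivers may now get CR */

			/* EX on ACK is granted only after every receiver has
			 * processed the message and released its CR on ACK. */
			lock_resource(ACK, EX);
			convert_lock(ACK, EX, CR);	/* re-arm for the next round */
		}

		unlock_resource(MESSAGE);
		unlock_resource(TOKEN);
	}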

4. Handling Failures

4.1 Node Failure

When a node fails, the DLM informs the cluster with the slot number. The
node starts a cluster recovery thread. The cluster recovery thread:

 - acquires the bitmap<number> lock of the failed node
 - opens the bitmap
 - reads the bitmap of the failed node
 - copies the set bits to the local node's bitmap
 - cleans the bitmap of the failed node
 - releases the bitmap<number> lock of the failed node
 - initiates resync of the bitmap on the current node
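
The steps above could look like the following sketch; every helper here
(lock_bitmap(), read_remote_bitmap(), and so on) is an illustrative
name, not the actual md-cluster function:

	/* Take over the dirty regions of a failed node's bitmap so they
	 * get resynced by the current node. */
	static void recover_failed_node(struct mddev *mddev, int failed_slot)
	{
		lock_bitmap(failed_slot);		/* bitmap<number> lock */

		read_remote_bitmap(mddev, failed_slot);
		copy_set_bits_to_local(mddev, failed_slot);
		clear_remote_bitmap(mddev, failed_slot);

		unlock_bitmap(failed_slot);

		start_resync(mddev);	/* regular md resync, driven locally */
	}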

The resync process is the regular md resync. However, in a clustered
environment, when a resync is performed, it needs to tell other nodes
of the areas which are suspended. Before a resync starts, the node
sends out RESYNC_START with the (lo,hi) range of the area which needs
to be suspended. Each node maintains a suspend_list, which contains
the list of ranges which are currently suspended. On receiving
RESYNC_START, the node adds the range to the suspend_list. Similarly,
when the node performing resync finishes, it sends RESYNC_FINISHED
to other nodes and other nodes remove the corresponding entry from
the suspend_list.

A helper function, should_suspend(), can be used to check whether a
particular I/O range should be suspended or not.
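
A minimal userspace model of the suspend_list and that check is shown
below; the in-kernel version differs (it uses kernel linked lists and
locking), so this is only an illustration of the overlap test:

	#include <stdbool.h>
	#include <stdint.h>
	#include <stdio.h>

	struct suspend_info {
		int slot;			/* node performing the resync */
		uint64_t lo, hi;		/* suspended range, inclusive */
		struct suspend_info *next;
	};

	/* True if an I/O to [lo, hi] overlaps any suspended range. */
	static bool should_suspend(const struct suspend_info *list,
				   uint64_t lo, uint64_t hi)
	{
		for (; list; list = list->next)
			if (lo <= list->hi && hi >= list->lo)
				return true;
		return false;
	}

	int main(void)
	{
		struct suspend_info range = { 2, 100, 200, NULL };

		printf("suspend: %d\n", should_suspend(&range, 150, 160));
		return 0;
	}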

4.2 Device Failure

Device failures are handled and communicated with the metadata update
routine.

5. Adding a new Device

For adding a new device, it is necessary that all nodes "see" the new
device to be added. For this, the following algorithm is used:

 1. Node 1 issues mdadm --manage /dev/mdX --add /dev/sdYY, which issues
    ioctl(ADD_NEW_DISK with disc.state set to MD_DISK_CLUSTER_ADD)
 2. Node 1 sends NEWDISK with uuid and slot number
 3. Other nodes issue kobject_uevent_env with uuid and slot number
    (Steps 4,5 could be a udev rule)
 4. In userspace, the node searches for the disk, perhaps
    using blkid -t SUB_UUID=""
 5. Other nodes issue either of the following depending on whether the disk
    was found:
    ioctl(ADD_NEW_DISK with disc.state set to MD_DISK_CANDIDATE and
          disc.number set to slot number)
    ioctl(CLUSTERED_DISK_NACK)
 6. Other nodes drop the lock on no-new-devs (CR) if the device is found
 7. Node 1 attempts an EX lock on no-new-devs
 8. If node 1 gets the lock, it sends METADATA_UPDATED after unmarking the
    disk as SpareLocal
 9. If node 1 cannot get the no-new-devs lock, it fails the add operation
    and sends METADATA_UPDATED
10. Other nodes learn whether the disk was added or not from the
    METADATA_UPDATED message that follows.
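
Step 5 on a receiving node could be issued from userspace roughly as
follows. ADD_NEW_DISK, mdu_disk_info_t and MD_DISK_CANDIDATE come from
the md uapi headers; the major/minor lookup for the found disk and the
error handling are omitted for brevity:

	#include <fcntl.h>
	#include <string.h>
	#include <sys/ioctl.h>
	#include <linux/raid/md_p.h>	/* MD_DISK_CANDIDATE */
	#include <linux/raid/md_u.h>	/* ADD_NEW_DISK, mdu_disk_info_t */

	/* Accept the candidate device into slot 'slot' of array 'md_dev'. */
	int add_candidate(const char *md_dev, int major, int minor, int slot)
	{
		mdu_disk_info_t info;
		int fd = open(md_dev, O_RDWR);

		memset(&info, 0, sizeof(info));
		info.major = major;		/* the disk found via blkid */
		info.minor = minor;
		info.number = slot;		/* slot from the NEWDISK message */
		info.state = 1 << MD_DISK_CANDIDATE;

		return ioctl(fd, ADD_NEW_DISK, &info);
	}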