- Introduction
- ============
- This document describes a collection of device-mapper targets that
- between them implement thin-provisioning and snapshots.
- The main highlight of this implementation, compared to the previous
- implementation of snapshots, is that it allows many virtual devices to
- be stored on the same data volume. This simplifies administration and
- allows the sharing of data between volumes, thus reducing disk usage.
- Another significant feature is support for an arbitrary depth of
- recursive snapshots (snapshots of snapshots of snapshots ...). The
- previous implementation of snapshots did this by chaining together
- lookup tables, and so performance was O(depth). This new
- implementation uses a single data structure to avoid this degradation
- with depth. Fragmentation may still be an issue, however, in some
- scenarios.
- Metadata is stored on a separate device from data, giving the
- administrator some freedom, for example to:
- - Improve metadata resilience by storing metadata on a mirrored volume
- but data on a non-mirrored one.
- - Improve performance by storing the metadata on SSD.
- Status
- ======
- These targets are very much still in the EXPERIMENTAL state. Please
- do not yet rely on them in production. But do experiment and offer us
- feedback. Different use cases will have different performance
- characteristics, for example due to fragmentation of the data volume.
- If you find this software is not performing as expected please mail
- dm-devel@redhat.com with details and we'll try our best to improve
- things for you.
- Userspace tools for checking and repairing the metadata are under
- development.
- Cookbook
- ========
- This section describes some quick recipes for using thin provisioning.
- They use the dmsetup program to control the device-mapper driver
- directly. End users will be advised to use a higher-level volume
- manager such as LVM2 once support has been added.
- Pool device
- -----------
- The pool device ties together the metadata volume and the data volume.
- It maps I/O linearly to the data volume and updates the metadata via
- two mechanisms:
- - Function calls from the thin targets
- - Device-mapper 'messages' from userspace which control the creation of new
- virtual devices amongst other things.
- Setting up a fresh pool device
- ------------------------------
- Setting up a pool device requires a valid metadata device, and a
- data device. If you do not have an existing metadata device you can
- make one by zeroing the first 4k to indicate empty metadata.
- dd if=/dev/zero of=$metadata_dev bs=4096 count=1
- The amount of metadata you need will vary according to how many blocks
- are shared between thin devices (i.e. through snapshots). If you have
- less sharing than average you'll need a larger-than-average metadata device.
- As a guide, we suggest you calculate the number of bytes to use in the
- metadata device as 48 * $data_dev_size / $data_block_size but round it up
- to 2MB if the answer is smaller. If you're creating large numbers of
- snapshots which are recording large amounts of change, you may find you
- need to increase this.
- The largest size supported is 16GB. If the device is larger,
- a warning will be issued and the excess space will not be used.
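The sizing guideline above can be sketched as a small shell function. This is only an illustration: it assumes both sizes are given in 512-byte sectors, and it reads "2MB" and "16GB" as 2 MiB and 16 GiB. The function name is hypothetical.

```shell
#!/bin/sh
# Sketch: suggested metadata device size in bytes.
# Assumptions (not stated in the text): sizes are in 512-byte sectors,
# "2MB" means 2 MiB and "16GB" means 16 GiB.
metadata_bytes() (
    data_dev_size=$1      # size of the data device, in sectors
    data_block_size=$2    # pool block size, in sectors
    min=$((2 * 1024 * 1024))
    max=$((16 * 1024 * 1024 * 1024))
    bytes=$((48 * data_dev_size / data_block_size))
    # Round small answers up to the suggested minimum.
    [ "$bytes" -lt "$min" ] && bytes=$min
    # Space beyond the supported maximum would not be used.
    [ "$bytes" -gt "$max" ] && bytes=$max
    echo "$bytes"
)

# A 20971520-sector data device with 128-sector blocks.
metadata_bytes 20971520 128
```

Pools that record a lot of change across many snapshots may need more than this estimate, as noted above.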
- Reloading a pool table
- ----------------------
- You may reload a pool's table; indeed, this is how the pool is resized
- if it runs out of space. (N.B. While specifying a different metadata
- device when reloading is not forbidden at the moment, things will go
- wrong if it does not route I/O to exactly the same on-disk location as
- previously.)
- Using an existing pool device
- -----------------------------
- dmsetup create pool \
- --table "0 20971520 thin-pool $metadata_dev $data_dev \
- $data_block_size $low_water_mark"
- $data_block_size gives the smallest unit of disk space that can be
- allocated at a time expressed in units of 512-byte sectors.
- $data_block_size must be between 128 (64KB) and 2097152 (1GB) and a
- multiple of 128 (64KB). $data_block_size cannot be changed after the
- thin-pool is created. People primarily interested in thin provisioning
- may want to use a value such as 1024 (512KB). People doing lots of
- snapshotting may want a smaller value such as 128 (64KB). If you are
- not zeroing newly-allocated data, a larger $data_block_size in the
- region of 256000 (128MB) is suggested.
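The constraints on $data_block_size can be checked before a pool is created. The helper below is a minimal sketch with a hypothetical name; it only encodes the range and alignment rules stated above.

```shell
#!/bin/sh
# Returns 0 if $1 is an acceptable data block size, in 512-byte sectors:
# between 128 and 2097152 inclusive, and a multiple of 128.
valid_data_block_size() (
    bs=$1
    case $bs in ''|*[!0-9]*) exit 1 ;; esac   # digits only
    [ "$bs" -ge 128 ] && [ "$bs" -le 2097152 ] && [ $((bs % 128)) -eq 0 ]
)

for bs in 128 1024 200 4194304; do
    if valid_data_block_size "$bs"; then
        echo "$bs sectors: ok"
    else
        echo "$bs sectors: rejected"
    fi
done
```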
- $low_water_mark is expressed in blocks of size $data_block_size. If
- free space on the data device drops below this level then a dm event
- will be triggered, which a userspace daemon should catch so that it can
- extend the pool device. Only one such event will be sent.
- No special event is triggered if a just-resumed device's free space is below
- the low water mark. However, resuming a device always triggers an
- event; a userspace daemon should verify that free space exceeds the low
- water mark when handling this event.
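The check a daemon performs when handling the resume event might look like the following sketch. The function name is hypothetical, and all three values are assumed to be in pool blocks, as read from the pool's status and table.

```shell
#!/bin/sh
# Returns 0 (true) when free data blocks are below the low water mark.
# All three arguments are in blocks of size $data_block_size.
below_low_water() (
    used=$1 total=$2 low_water_mark=$3
    [ $((total - used)) -lt "$low_water_mark" ]
)

# 100 free blocks against a low water mark of 128.
if below_low_water 9900 10000 128; then
    echo "free space is below the low water mark"
fi
```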
- A low water mark for the metadata device is maintained in the kernel and
- will trigger a dm event if free space on the metadata device drops below
- it.
- Updating on-disk metadata
- -------------------------
- On-disk metadata is committed every time a FLUSH or FUA bio is written.
- If no such requests are made then commits will occur every second. This
- means the thin-provisioning target behaves like a physical disk that has
- a volatile write cache. If power is lost you may lose some recent
- writes. The metadata should always be consistent in spite of any crash.
- If data space is exhausted the pool will either error or queue IO
- according to the configuration (see: error_if_no_space). If metadata
- space is exhausted or a metadata operation fails, the pool will error IO
- until the pool is taken offline and repair is performed to 1) fix any
- potential inconsistencies and 2) clear the flag that imposes repair.
- Once the pool's metadata device is repaired it may be resized, which
- will allow the pool to return to normal operation. Note that if a pool
- is flagged as needing repair, the pool's data and metadata devices
- cannot be resized until repair is performed. It should also be noted
- that when the pool's metadata space is exhausted the current metadata
- transaction is aborted. Given that the pool will cache IO whose
- completion may have already been acknowledged to upper IO layers
- (e.g. filesystem) it is strongly suggested that consistency checks
- (e.g. fsck) be performed on those layers when repair of the pool is
- required.
- Thin provisioning
- -----------------
- i) Creating a new thinly-provisioned volume.
- To create a new thinly-provisioned volume you must send a message to an
- active pool device, /dev/mapper/pool in this example.
- dmsetup message /dev/mapper/pool 0 "create_thin 0"
- Here '0' is an identifier for the volume, a 24-bit number. It's up
- to the caller to allocate and manage these identifiers. If the
- identifier is already in use, the message will fail with -EEXIST.
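Since the caller manages identifiers, it can be useful to reject values that do not fit in 24 bits before sending the message. A minimal sketch, with a hypothetical function name:

```shell
#!/bin/sh
# Returns 0 if $1 is a decimal number that fits in 24 bits (0..16777215).
valid_dev_id() (
    id=$1
    case $id in ''|*[!0-9]*) exit 1 ;; esac   # digits only
    # The length check keeps very long inputs away from the numeric test.
    [ "${#id}" -le 8 ] && [ "$id" -le 16777215 ]
)

valid_dev_id 0 && echo "0 is a usable identifier"
valid_dev_id 16777216 || echo "16777216 does not fit in 24 bits"
```

This only checks the range; whether an identifier is already in use is still reported by the pool itself.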
- ii) Using a thinly-provisioned volume.
- Thinly-provisioned volumes are activated using the 'thin' target:
- dmsetup create thin --table "0 2097152 thin /dev/mapper/pool 0"
- The last parameter is the identifier for the thinp device.
- Internal snapshots
- ------------------
- i) Creating an internal snapshot.
- Snapshots are created with another message to the pool.
- N.B. If the origin device that you wish to snapshot is active, you
- must suspend it before creating the snapshot to avoid corruption.
- This is NOT enforced at the moment, so please be careful!
- dmsetup suspend /dev/mapper/thin
- dmsetup message /dev/mapper/pool 0 "create_snap 1 0"
- dmsetup resume /dev/mapper/thin
- Here '1' is the identifier for the volume, a 24-bit number. '0' is the
- identifier for the origin device.
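Because the suspend is not enforced, one option is to wrap the three commands so that the origin is always resumed, even when the message fails. The wrapper below is a sketch, not a prescribed procedure; the function name and the DMSETUP override (for dry runs) are assumptions introduced here.

```shell
#!/bin/sh
# DMSETUP is a hypothetical override so the sequence can be dry-run,
# e.g. DMSETUP=echo. It defaults to the real dmsetup program.
: "${DMSETUP:=dmsetup}"

# snapshot_thin <pool dev> <active origin dev> <new id> <origin id>
snapshot_thin() (
    pool=$1 origin_dev=$2 snap_id=$3 origin_id=$4
    $DMSETUP suspend "$origin_dev" || exit 1
    $DMSETUP message "$pool" 0 "create_snap $snap_id $origin_id"
    rc=$?
    # Always resume the origin, even if the message failed.
    $DMSETUP resume "$origin_dev" || exit 1
    exit "$rc"
)

# Example (needs root and an active pool):
#   snapshot_thin /dev/mapper/pool /dev/mapper/thin 1 0
```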
- ii) Using an internal snapshot.
- Once the snapshot is created, the user doesn't have to worry about any
- connection between the origin and the snapshot. Indeed the snapshot is no
- different from any other thinly-provisioned device and can be
- snapshotted itself via the same method. It's perfectly legal to
- have only one of them active, and there's no ordering requirement on
- activating or removing them both. (This differs from conventional
- device-mapper snapshots.)
- Activate it exactly the same way as any other thinly-provisioned volume:
- dmsetup create snap --table "0 2097152 thin /dev/mapper/pool 1"
- External snapshots
- ------------------
- You can use an external _read only_ device as an origin for a
- thinly-provisioned volume. Any read to an unprovisioned area of the
- thin device will be passed through to the origin. Writes trigger
- the allocation of new blocks as usual.
- One use case for this is VM hosts that want to run guests on
- thinly-provisioned volumes but have the base image on another device
- (possibly shared between many VMs).
- You must not write to the origin device if you use this technique!
- Of course, you may write to the thin device and take internal snapshots
- of the thin volume.
- i) Creating a snapshot of an external device.
- This is the same as creating a thin device.
- You don't mention the origin at this stage.
- dmsetup message /dev/mapper/pool 0 "create_thin 0"
- ii) Using a snapshot of an external device.
- Append an extra parameter to the thin target specifying the origin:
- dmsetup create snap --table "0 2097152 thin /dev/mapper/pool 0 /dev/image"
- N.B. All descendants (internal snapshots) of this snapshot require the
- same extra origin parameter.
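Since every descendant needs the same extra parameter, it may help to build 'thin' table lines in one place. The helper below is a sketch with a hypothetical name; it only assembles the string that would be passed to dmsetup's --table option.

```shell
#!/bin/sh
# thin_table <length in sectors> <pool dev> <dev id> [<external origin dev>]
# Prints a table line for the 'thin' target.
thin_table() (
    length=$1 pool=$2 dev_id=$3 origin=${4:-}
    table="0 $length thin $pool $dev_id"
    # The origin is appended only when one was given.
    [ -n "$origin" ] && table="$table $origin"
    echo "$table"
)

thin_table 2097152 /dev/mapper/pool 0 /dev/image
```

The output can then be used as in: dmsetup create snap --table "$(thin_table 2097152 /dev/mapper/pool 0 /dev/image)".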
- Deactivation
- ------------
- All devices using a pool must be deactivated before the pool itself
- can be.
- dmsetup remove thin
- dmsetup remove snap
- dmsetup remove pool
- Reference
- =========
- 'thin-pool' target
- ------------------
- i) Constructor
- thin-pool <metadata dev> <data dev> <data block size (sectors)> \
- <low water mark (blocks)> [<number of feature args> [<arg>]*]
- Optional feature arguments:
- skip_block_zeroing: Skip the zeroing of newly-provisioned blocks.
- ignore_discard: Disable discard support.
- no_discard_passdown: Don't pass discards down to the underlying
- data device, but just remove the mapping.
- read_only: Don't allow any changes to be made to the pool
- metadata.
- error_if_no_space: Error IOs, instead of queueing, if no space.
- Data block size must be between 64KB (128 sectors) and 1GB
- (2097152 sectors) inclusive.
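The constructor takes a count of feature arguments followed by the arguments themselves, which is easy to get out of step by hand. The following sketch assembles a thin-pool table line and derives the count automatically. The function name, device paths and low water mark value are illustrative assumptions.

```shell
#!/bin/sh
# pool_table <length> <metadata dev> <data dev> <block size> \
#            <low water mark> [<feature arg>...]
# Prints a table line for the 'thin-pool' target.
pool_table() (
    length=$1 metadata_dev=$2 data_dev=$3 block_size=$4 low_water_mark=$5
    shift 5
    table="0 $length thin-pool $metadata_dev $data_dev $block_size $low_water_mark"
    # Remaining arguments are feature args; prefix them with their count.
    if [ "$#" -gt 0 ]; then
        table="$table $# $*"
    fi
    echo "$table"
)

# /dev/sdb1 and /dev/sdb2 are placeholder device names.
pool_table 20971520 /dev/sdb1 /dev/sdb2 128 32768 skip_block_zeroing error_if_no_space
```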
- ii) Status
- <transaction id> <used metadata blocks>/<total metadata blocks>
- <used data blocks>/<total data blocks> <held metadata root>
- ro|rw|out_of_data_space [no_]discard_passdown [error|queue]_if_no_space
- needs_check|-
- transaction id:
- A 64-bit number used by userspace to help synchronise with metadata
- from volume managers.
- used data blocks / total data blocks
- If the number of free blocks drops below the pool's low water mark a
- dm event will be sent to userspace. This event is edge-triggered and
- it will occur only once after each resume so volume manager writers
- should register for the event and then check the target's status.
- held metadata root:
- The location, in blocks, of the metadata root that has been
- 'held' for userspace read access. '-' indicates there is no
- held root.
- discard_passdown|no_discard_passdown
- Whether or not discards are actually being passed down to the
- underlying device. Even if this is enabled when loading the table,
- it can get disabled if the underlying device doesn't support it.
- ro|rw|out_of_data_space
- If the pool encounters certain types of device failures it will
- drop into a read-only metadata mode in which no changes to
- the pool metadata (like allocating new blocks) are permitted.
- In serious cases where even a read-only mode is deemed unsafe
- no further I/O will be permitted and the status will just
- contain the string 'Fail'. The userspace recovery tools
- should then be used.
- error_if_no_space|queue_if_no_space
- If the pool runs out of data or metadata space, the pool will
- either queue or error the IO destined to the data device. The
- default is to queue the IO until more space is added or the
- 'no_space_timeout' expires. The 'no_space_timeout' dm-thin-pool
- module parameter can be used to change this timeout -- it
- defaults to 60 seconds but may be disabled using a value of 0.
- needs_check
- A metadata operation has failed, resulting in the needs_check
- flag being set in the metadata's superblock. The metadata
- device must be deactivated and checked/repaired before the
- thin-pool can be made fully operational again. '-' indicates
- needs_check is not set.
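A volume manager that has registered for the event typically needs the usage figures from the status fields. The sketch below parses only the leading fields described above and prints integer usage percentages. It assumes the caller passes just the target-specific status fields as one string (with any prefix added by the reporting tool already stripped); the function name is hypothetical.

```shell
#!/bin/sh
# pool_usage "<transaction id> <used md>/<total md> <used data>/<total data> ..."
# Prints: <transaction id> <metadata usage %> <data usage %>
# Percentages are integers, rounded down.
pool_usage() (
    status=$1
    if [ "$status" = "Fail" ]; then
        echo "pool has failed" >&2
        exit 1
    fi
    set -f              # no filename expansion while splitting
    set -- $status      # split on whitespace into $1, $2, ...
    transaction_id=$1
    md_used=${2%/*}   md_total=${2#*/}
    data_used=${3%/*} data_total=${3#*/}
    echo "$transaction_id $((100 * md_used / md_total)) $((100 * data_used / data_total))"
)

# Made-up sample values; only the first four fields are shown.
pool_usage "3 120/4096 8192/163840 -"
```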
- iii) Messages
- create_thin <dev id>
- Create a new thinly-provisioned device.
- <dev id> is an arbitrary unique 24-bit identifier chosen by
- the caller.
- create_snap <dev id> <origin id>
- Create a new snapshot of another thinly-provisioned device.
- <dev id> is an arbitrary unique 24-bit identifier chosen by
- the caller.
- <origin id> is the identifier of the thinly-provisioned device
- of which the new device will be a snapshot.
- delete <dev id>
- Deletes a thin device. Irreversible.
- set_transaction_id <current id> <new id>
- Userland volume managers, such as LVM, need a way to
- synchronise their external metadata with the internal metadata of the
- pool target. The thin-pool target offers to store an
- arbitrary 64-bit transaction id and return it on the target's
- status line. To avoid races you must provide what you think
- the current transaction id is when you change it with this
- compare-and-swap message.
- reserve_metadata_snap
- Reserve a copy of the data mapping btree for use by userland.
- This allows userland to inspect the mappings as they were when
- this message was executed. Use the pool's status command to
- get the root block associated with the metadata snapshot.
- release_metadata_snap
- Release a previously reserved copy of the data mapping btree.
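The compare-and-swap use of set_transaction_id described above can be sketched as follows. The caller passes the id it last read from the status line; the message fails if the pool's id has changed in the meantime. Incrementing by one is just an illustrative choice, and the function name and DMSETUP override are assumptions made for this sketch.

```shell
#!/bin/sh
# DMSETUP is a hypothetical override for dry runs (e.g. DMSETUP=echo).
: "${DMSETUP:=dmsetup}"

# bump_transaction_id <pool dev> <current id, as read from status>
# Prints the new id on success; returns non-zero if the message failed.
bump_transaction_id() (
    pool=$1 current_id=$2
    new_id=$((current_id + 1))
    $DMSETUP message "$pool" 0 "set_transaction_id $current_id $new_id" || exit 1
    echo "$new_id"
)

# Example (needs root and an active pool):
#   bump_transaction_id /dev/mapper/pool 7
```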
- 'thin' target
- -------------
- i) Constructor
- thin <pool dev> <dev id> [<external origin dev>]
- pool dev:
- the thin-pool device, e.g. /dev/mapper/my_pool or 253:0
- dev id:
- the internal device identifier of the device to be
- activated.
- external origin dev:
- an optional block device outside the pool to be treated as a
- read-only snapshot origin: reads to unprovisioned areas of the
- thin target will be mapped to this device.
- The pool doesn't store any size against the thin devices. If you
- load a thin target that is smaller than you've been using previously,
- then you'll have no access to blocks mapped beyond the end. If you
- load a target that is bigger than before, then extra blocks will be
- provisioned as and when needed.
- ii) Status
- <nr mapped sectors> <highest mapped sector>
- If the pool has encountered device errors and failed, the status
- will just contain the string 'Fail'. The userspace recovery
- tools should then be used.
|