- Direct Access for files
- -----------------------
- Motivation
- ----------
- The page cache is usually used to buffer reads and writes to files.
- It is also used to provide the pages which are mapped into userspace
- by a call to mmap.
- For block devices that are memory-like, the page cache pages would be
- unnecessary copies of the original storage. The DAX code removes the
- extra copy by performing reads and writes directly to the storage device.
- For file mappings, the storage device is mapped directly into userspace.
- Usage
- -----
- If you have a block device which supports DAX, you can make a filesystem
- on it as usual. The DAX code currently only supports files with a block
- size equal to your kernel's PAGE_SIZE, so you may need to specify a block
- size when creating the filesystem. When mounting it, use the "-o dax"
- option on the command line or add 'dax' to the options in /etc/fstab.
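As a rough userspace illustration of the two requirements above, the sketch below checks a filesystem block size against the running kernel's page size and passes the "dax" option through mount(2). The helper names dax_block_size_ok() and mount_with_dax() are hypothetical and are not part of any kernel or libc API. The device, mount point and filesystem type are left to the caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/mount.h>
#include <unistd.h>

/*
 * Hypothetical helper: true if 'fs_block_size' (in bytes) equals the
 * page size reported by the running kernel.
 */
bool dax_block_size_ok(unsigned long fs_block_size)
{
	long page_size = sysconf(_SC_PAGESIZE);

	return page_size > 0 && fs_block_size == (unsigned long)page_size;
}

/*
 * Hypothetical helper: mount 'dev' on 'dir' with the "dax" option, much
 * like "mount -t <fstype> -o dax <dev> <dir>". Returns 0 on success or
 * -1 with errno set. Requires the privileges mount(2) normally needs.
 */
int mount_with_dax(const char *dev, const char *dir, const char *fstype)
{
	return mount(dev, dir, fstype, 0, "dax");
}
```

A caller would typically choose the block size at mkfs time, confirm it with dax_block_size_ok(), and only then call mount_with_dax().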
- Implementation Tips for Block Driver Writers
- --------------------------------------------
- To support DAX in your block driver, implement the 'direct_access'
- block device operation. It is used to translate the sector number
- (expressed in units of 512-byte sectors) to a page frame number (pfn)
- that identifies the physical page for the memory. It also returns a
- kernel virtual address that can be used to access the memory.
- The direct_access method takes a 'size' parameter that indicates the
- number of bytes being requested. The function should return the number
- of bytes that can be contiguously accessed at that offset. It may also
- return a negative errno if an error occurs.
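The translation described above can be modelled in plain userspace C. This is only a sketch of the arithmetic, not the kernel's direct_access signature: struct model_dev, model_direct_access(), the 4 KiB page size and the choice of -ERANGE for an out-of-range sector are all assumptions made for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define MODEL_SECTOR_SHIFT 9	/* sectors are 512 bytes */
#define MODEL_PAGE_SHIFT   12	/* assumption: 4 KiB pages */

/* Hypothetical description of a memory-like device. */
struct model_dev {
	uint64_t phys_base;	/* physical address of byte 0, page aligned */
	char *virt_base;	/* address at which byte 0 is mapped */
	uint64_t size;		/* device size in bytes */
};

/*
 * Translate 'sector' into an address and a page frame number.
 * Returns the number of bytes contiguously accessible from that
 * offset, or a negative errno on error.
 */
long model_direct_access(const struct model_dev *dev, uint64_t sector,
			 void **kaddr, uint64_t *pfn, long size)
{
	uint64_t offset;

	/* Reject empty requests and sectors beyond the end of the device. */
	if (size <= 0 || sector >= (dev->size >> MODEL_SECTOR_SHIFT))
		return -ERANGE;

	offset = sector << MODEL_SECTOR_SHIFT;
	*kaddr = dev->virt_base + offset;
	*pfn = (dev->phys_base + offset) >> MODEL_PAGE_SHIFT;

	/* The whole model device is one contiguous range. */
	return (long)(dev->size - offset);
}
```

In this model the return value may be larger or smaller than the requested size; the caller is expected to use no more than the smaller of the two.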
- In order to support this method, the storage must be byte-accessible by
- the CPU at all times. If your device uses paging techniques to expose
- a large amount of memory through a smaller window, then you cannot
- implement direct_access. Likewise, if your device can occasionally
- stall the CPU for an extended period, you should not attempt to
- implement direct_access.
- These block devices may be used for inspiration:
- - axonram: Axon DDR2 device driver
- - brd: RAM backed block device driver
- - dcssblk: s390 dcss block device driver
- Implementation Tips for Filesystem Writers
- ------------------------------------------
- Filesystem support consists of:
- - adding support to mark inodes as being DAX by setting the S_DAX flag in
- i_flags
- - implementing the direct_IO address space operation, and calling
- dax_do_io() instead of blockdev_direct_IO() if S_DAX is set
- - implementing an mmap file operation for DAX files which sets the
- VM_MIXEDMAP and VM_HUGEPAGE flags on the VMA, and setting the vm_ops to
- include handlers for fault, pmd_fault and page_mkwrite (which should
- probably call dax_fault(), dax_pmd_fault() and dax_mkwrite(), passing the
- appropriate get_block() callback)
- - calling dax_truncate_page() instead of block_truncate_page() for DAX files
- - calling dax_zero_page_range() instead of zero_user() for DAX files
- - ensuring that there is sufficient locking between reads, writes,
- truncates and page faults
- The get_block() callback passed to the DAX functions may return
- uninitialised extents. If it does, it must ensure that simultaneous
- calls to get_block() (for example by a page-fault racing with a read()
- or a write()) work correctly.
- These filesystems may be used for inspiration:
- - ext2: the second extended filesystem, see Documentation/filesystems/ext2.txt
- - ext4: the fourth extended filesystem, see Documentation/filesystems/ext4.txt
- Shortcomings
- ------------
- Even if the kernel or its modules are stored on a filesystem that supports
- DAX on a block device that supports DAX, they will still be copied into RAM.
- The DAX code does not work correctly on architectures which have virtually
- mapped caches, such as ARM, MIPS and SPARC.
- Calling get_user_pages() on a range of user memory that has been mmapped
- from a DAX file will fail as there is no 'struct page' to describe
- those pages (this problem is being worked on). As a result, O_DIRECT
- reads/writes to those memory ranges from a non-DAX file will fail (note
- that O_DIRECT reads/writes _of a DAX file_ do work; it is the memory
- that is being accessed that is key here). Other things that will not
- work include RDMA, sendfile() and splice().