(How to avoid) Botching up ioctls
=================================

From: http://blog.ffwll.ch/2013/11/botching-up-ioctls.html

By: Daniel Vetter, Copyright © 2013 Intel Corporation

One clear insight kernel graphics hackers gained in the past few years is that
trying to come up with a unified interface to manage the execution units and
memory on completely different GPUs is a futile effort. So nowadays every
driver has its own set of ioctls to allocate memory and submit work to the GPU.
This is nice, since there's no more insanity in the form of fake-generic
interfaces that are actually only used once. But the clear downside is that
there's much more potential to screw things up.

To avoid repeating all the same mistakes again, I've written up some of the
lessons learned while botching the job for the drm/i915 driver. Most of these
only cover technicalities and not the big-picture issues like what exactly the
command submission ioctl should look like. Learning these lessons is probably
something every GPU driver has to do on its own.

Prerequisites
-------------

First the prerequisites. Without these you have already failed, because you
will need to add a 32-bit compat layer:

* Only use fixed-size integers. To avoid conflicts with typedefs in userspace
  the kernel has special types like __u32, __s64. Use them.

* Align everything to the natural size and use explicit padding. 32-bit
  platforms don't necessarily align 64-bit values to 64-bit boundaries, but
  64-bit platforms do. So we always need padding to the natural size to get
  this right.

* Pad the entire struct to a multiple of 64 bits; the structure size will
  otherwise differ between 32-bit and 64-bit. Having a different structure size
  hurts when passing arrays of structures to the kernel, or if the kernel
  checks the structure size, which e.g. the drm core does.

* Pointers are __u64, cast from/to a uintptr_t on the userspace side and
  from/to a void __user * in the kernel. Try really hard not to delay this
  conversion or, worse, fiddle the raw __u64 through your code, since that
  diminishes the checking tools like sparse can provide.

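As a minimal sketch of the rules above, here is a hypothetical ioctl argument struct (struct example_submit, ptr_to_u64() and example_fill() are made-up names, not from any real driver). The static_asserts pin the layout so it comes out identical on 32-bit and 64-bit builds. On the kernel side the __u64 would be turned back into a void __user * right away, e.g. with u64_to_user_ptr(), instead of being passed around raw.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <linux/types.h>

/* Hypothetical uapi struct for a made-up "submit" ioctl. */
struct example_submit {
	__u32 handle;	/* fixed-size integer types only */
	__u32 pad;	/* explicit padding: 'offset' must start at byte 8 */
	__u64 offset;	/* 64-bit value at its natural alignment */
	__u64 data_ptr;	/* userspace pointer, always carried as __u64 */
	__u32 flags;
	__u32 pad2;	/* pads the total size to a multiple of 64 bits */
};

/* Fail the build if the layout would differ between 32-bit and 64-bit. */
static_assert(offsetof(struct example_submit, offset) == 8,
	      "offset must be 8-byte aligned");
static_assert(sizeof(struct example_submit) % 8 == 0,
	      "struct size must be a multiple of 64 bits");
static_assert(sizeof(struct example_submit) == 32,
	      "unexpected struct size");

/* Userspace side: go through uintptr_t when storing a pointer in a __u64. */
static inline __u64 ptr_to_u64(const void *ptr)
{
	return (__u64)(uintptr_t)ptr;
}

/* Fill in the args; the compound literal zeroes all padding fields. */
static void example_fill(struct example_submit *args, __u32 handle,
			 const void *data)
{
	*args = (struct example_submit){
		.handle = handle,
		.data_ptr = ptr_to_u64(data),
	};
}
```

Zero-filling pad and pad2 in userspace matters because the kernel side should reject any nonzero padding.
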
Basics
------

With the joys of writing a compat layer avoided we can take a look at the basic
fumbles. Neglecting these will make backward and forward compatibility a real
pain. And since getting things wrong on the first attempt is guaranteed you
will have a second iteration or at least an extension for any given interface.

* Have a clear way for userspace to figure out whether your new ioctl or ioctl
  extension is supported on a given kernel. If you can't rely on old kernels
  rejecting the new flags/modes or ioctls (since doing that was botched in the
  past), then you need a driver feature flag or revision number somewhere.

* Have a plan for extending ioctls with new flags or new fields at the end of
  the structure. The drm core checks the passed-in size for each ioctl call
  and zero-extends any mismatches between kernel and userspace. That helps,
  but isn't a complete solution since newer userspace on older kernels won't
  notice that the newly added fields at the end get ignored. So this still
  needs a new driver feature flag.

* Check all unused fields and flags and all the padding for whether they're 0,
  and reject the ioctl if that's not the case. Otherwise your nice plan for
  future extensions is going right down the gutter, since someone will submit
  an ioctl struct with random stack garbage in the yet unused parts. That
  then bakes in the ABI that those fields can never be used for anything else
  but garbage.

* Have simple testcases for all of the above.

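A userspace-compilable sketch of the last two points, assuming a made-up struct example_args and made-up flag names. example_copy_args() is only a simplified model of the zero-extension the drm core performs (it is not drm_ioctl() itself, and real kernel code would use copy_from_user()). example_validate_args() rejects unknown flags and any nonzero padding or reserved field with -EINVAL.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <linux/types.h>

/* Hypothetical flags; every other bit is reserved for future extensions. */
#define EXAMPLE_FLAG_A		(1u << 0)
#define EXAMPLE_FLAG_B		(1u << 1)
#define EXAMPLE_KNOWN_FLAGS	(EXAMPLE_FLAG_A | EXAMPLE_FLAG_B)

/* Hypothetical ioctl argument struct, 24 bytes. */
struct example_args {
	__u32 handle;
	__u32 flags;
	__u64 value;
	__u32 pad;
	__u32 reserved;	/* unused today, may gain a meaning later */
};

/*
 * Simplified model of the drm core's size handling: copy as much as
 * userspace passed (up to the kernel's struct size) and zero the rest,
 * so fields an older userspace doesn't know about read as 0.
 */
static void example_copy_args(void *kdata, size_t ksize,
			      const void *udata, size_t usize)
{
	size_t n = usize < ksize ? usize : ksize;

	memcpy(kdata, udata, n);
	memset((char *)kdata + n, 0, ksize - n);
}

/* Reject unknown flags and any nonzero padding or reserved field. */
static int example_validate_args(const struct example_args *args)
{
	if (args->flags & ~EXAMPLE_KNOWN_FLAGS)
		return -EINVAL;
	if (args->pad != 0 || args->reserved != 0)
		return -EINVAL;
	return 0;
}
```

An older userspace that only passes handle and flags ends up with zeroed trailing fields and passes validation, while stack garbage in reserved is rejected outright instead of silently becoming part of the ABI.
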
Fun with Error Paths
--------------------

Nowadays we don't have any excuse left anymore for drm drivers being neat
little root exploits. This means we need both full input validation and solid
error handling paths; GPUs will die eventually in the oddest corner cases
anyway:

* The ioctl must check for array overflows. It also needs to check for
  over/underflows and clamping issues of integer values in general. The usual
  example is sprite positioning values fed directly into the hardware with the
  hardware just having 12 bits or so. Works nicely until some odd display
  server doesn't bother with clamping itself and the cursor wraps around the
  screen.

* Have simple testcases for every input validation failure case in your ioctl.
  Check that the error code matches your expectations. And finally make sure
  that you only test for one single error path in each subtest by submitting
  otherwise perfectly valid data. Without this an earlier check might reject
  the ioctl already and shadow the codepath you actually want to test, hiding
  bugs and regressions.

* Make all your ioctls restartable. First, X really loves signals and second,
  this will allow you to test 90% of all error handling paths by just
  interrupting your main test suite constantly with signals. Thanks to X's
  love for signals you'll get an excellent base coverage of all your error
  paths pretty much for free for graphics drivers. Also, be consistent with
  how you handle ioctl restarting; e.g. drm has a tiny drmIoctl helper in its
  userspace library. The i915 driver botched this with the set_tiling ioctl,
  and now we're stuck forever with some arcane semantics in both the kernel
  and userspace.

* If you can't make a given codepath restartable, make a stuck task at least
  killable. GPUs just die and your users won't like you more if you hang their
  entire box (by means of an unkillable X process). If the state recovery is
  still too tricky, have a timeout or hangcheck safety net as a last-ditch
  effort in case the hardware has gone bananas.

* Have testcases for the really tricky corner cases in your error recovery
  code; it's way too easy to create a deadlock between your hangcheck code and
  waiters.

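A sketch of the validation and restart points, with hypothetical limits (EXAMPLE_MAX_OBJECTS, an unsigned 12-bit sprite position field) and made-up helper names. The range checks reject bad input instead of letting the hardware silently truncate it; inside the kernel, check_mul_overflow() from <linux/overflow.h> is the usual tool for the multiplication check. example_ioctl_retry() is the userspace half of restartability, a loop in the spirit of libdrm's drmIoctl() that reissues the call when it was interrupted.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/types.h>

/* Hypothetical limits, for illustration only. */
#define EXAMPLE_MAX_OBJECTS	4096u
#define EXAMPLE_POS_BITS	12
#define EXAMPLE_POS_MAX		((1 << EXAMPLE_POS_BITS) - 1)	/* 4095 */

struct example_obj {
	__u32 handle;
	__u32 flags;
	__u64 offset;
};

/* Validate a user-supplied element count before sizing any copy. */
static int example_check_count(__u32 count, size_t *bytes)
{
	if (count > EXAMPLE_MAX_OBJECTS)
		return -EINVAL;
	/* Overflow-safe even if the limit above is ever raised. */
	if (count > SIZE_MAX / sizeof(struct example_obj))
		return -EINVAL;
	*bytes = (size_t)count * sizeof(struct example_obj);
	return 0;
}

/* Reject positions the 12-bit hardware field cannot hold; don't truncate. */
static int example_check_pos(__s32 x, __s32 y)
{
	if (x < 0 || x > EXAMPLE_POS_MAX || y < 0 || y > EXAMPLE_POS_MAX)
		return -EINVAL;
	return 0;
}

/* Userspace: reissue the ioctl when a signal interrupted it. */
static int example_ioctl_retry(int fd, unsigned long request, void *arg)
{
	int ret;

	do {
		ret = ioctl(fd, request, arg);
	} while (ret == -1 && (errno == EINTR || errno == EAGAIN));
	return ret;
}
```

In a test suite, each failure case would pass otherwise-valid values and break exactly one field, e.g. a valid count with an x position of EXAMPLE_POS_MAX + 1, so that no earlier check shadows the one under test.
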
Time, Waiting and Missing it
----------------------------

GPUs do almost everything asynchronously, so we have a need to time operations
and wait for outstanding ones. This is really tricky business; at the moment
none of the ioctls supported by drm/i915 get this fully right, which means
there are still tons more lessons to learn here.

* Use CLOCK_MONOTONIC as your reference time, always. It's what alsa, drm and
  v4l use by default nowadays. But let userspace know which timestamps are
  derived from different clock domains like your main system clock (provided
  by the kernel) or some independent hardware counter somewhere else. Clocks
  will mismatch if you look close enough, but if performance measuring tools
  have this information they can at least compensate. If your userspace can
  get at the raw values of some clocks (e.g. through in-command-stream
  performance counter sampling instructions), consider exposing those also.

* Use __s64 seconds plus __u64 nanoseconds to specify time. It's not the most
  convenient time specification, but it's mostly the standard.

* Check that input time values are normalized and reject them if not. Note
  that the kernel native struct ktime has a signed integer for both seconds
  and nanoseconds, so beware here.

* For timeouts, use absolute times. If you're a good fellow and made your
  ioctl restartable, relative timeouts tend to be too coarse and can
  indefinitely extend your wait time due to rounding on each restart,
  especially if your reference clock is something really slow like the display
  frame counter. With a spec lawyer hat on this isn't a bug since timeouts can
  always be extended, but users will surely hate you if their neat animations
  start to stutter due to this.

* Consider ditching any synchronous wait ioctls with timeouts and just deliver
  an asynchronous event on a pollable file descriptor. It fits much better
  into event-driven applications' main loops.

* Have testcases for corner cases, especially whether the return values for
  already-completed events, successful waits and timed-out waits are all sane
  and suited to your needs.

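A sketch of the normalization and absolute-timeout advice, assuming a hypothetical uapi struct example_timespec laid out as __s64 seconds plus __u64 nanoseconds (the helper names are made up too). Userspace reads CLOCK_MONOTONIC once and converts its relative timeout into an absolute deadline before the first call; every restart of the ioctl then waits for the same deadline instead of re-arming a rounded relative timeout.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <time.h>
#include <linux/types.h>

#define EXAMPLE_NSEC_PER_SEC 1000000000ull

/* Hypothetical uapi time value: __s64 seconds plus __u64 nanoseconds. */
struct example_timespec {
	__s64 tv_sec;
	__u64 tv_nsec;
};

/* Kernel-side style check: reject non-normalized input. */
static int example_timespec_check(const struct example_timespec *ts)
{
	if (ts->tv_nsec >= EXAMPLE_NSEC_PER_SEC)
		return -EINVAL;
	return 0;
}

/* Read the CLOCK_MONOTONIC reference time. */
static int example_now(struct example_timespec *out)
{
	struct timespec ts;

	if (clock_gettime(CLOCK_MONOTONIC, &ts) != 0)
		return -errno;
	out->tv_sec = ts.tv_sec;
	out->tv_nsec = (__u64)ts.tv_nsec;
	return 0;
}

/*
 * Userspace side: turn a relative timeout into an absolute deadline once,
 * before the first ioctl call; restarts then reuse the same deadline.
 * Overflow of tv_sec is not handled in this sketch.
 */
static int example_deadline(const struct example_timespec *now, __u64 rel_ns,
			    struct example_timespec *deadline)
{
	__u64 nsec;

	if (example_timespec_check(now) != 0)
		return -EINVAL;
	nsec = now->tv_nsec + rel_ns % EXAMPLE_NSEC_PER_SEC;
	deadline->tv_sec = now->tv_sec +
			   (__s64)(rel_ns / EXAMPLE_NSEC_PER_SEC) +
			   (__s64)(nsec / EXAMPLE_NSEC_PER_SEC);
	deadline->tv_nsec = nsec % EXAMPLE_NSEC_PER_SEC;
	return 0;
}
```

For example, a 200 ms timeout starting at 10.9 s yields a deadline of 11 s plus 100000000 ns, already normalized, so it passes the same check the kernel applies to its input.
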
Leaking Resources, Not
----------------------

A full-blown drm driver essentially implements a little OS, but specialized to
the given GPU platforms. This means a driver needs to expose tons of handles
for different objects and other resources to userspace. Doing that right
entails its own little set of pitfalls:

* Always attach the lifetime of your dynamically created resources to the
  lifetime of a file descriptor. Consider using a 1:1 mapping if your resource
  needs to be shared across processes; fd-passing over unix domain sockets
  also simplifies lifetime management for userspace.

* Always have O_CLOEXEC support.

* Ensure that you have sufficient insulation between different clients. By
  default pick a private per-fd namespace which forces any sharing to be done
  explicitly. Only go with a more global per-device namespace if the objects
  are truly device-unique. One counterexample in the drm modeset interfaces is
  that the per-device modeset objects like connectors share a namespace with
  framebuffer objects, which mostly are not shared at all. A separate
  namespace, private by default, for framebuffers would have been more
  suitable.

* Think about uniqueness requirements for userspace handles. E.g. for most drm
  drivers it's a userspace bug to submit the same object twice in the same
  command submission ioctl. But then if objects are shareable, userspace needs
  to know whether it has seen an imported object from a different process
  already or not. I haven't tried this myself yet due to lack of a new class
  of objects, but consider using inode numbers on your shared file descriptors
  as unique identifiers; it's how real files are told apart, too.
  Unfortunately this requires a full-blown virtual filesystem in the kernel.

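Userspace can already exercise two of these points against ordinary files. The sketch below opens with O_CLOEXEC, verifies FD_CLOEXEC with fcntl(), and compares two file descriptors by the (st_dev, st_ino) pair from fstat(), which is the identity check the last bullet suggests for shared objects. Device nodes like /dev/null merely stand in for fds a driver might hand out, and all helper names are made up.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdbool.h>
#include <sys/stat.h>
#include <unistd.h>

/* True if the close-on-exec flag is set on fd. */
static bool example_fd_is_cloexec(int fd)
{
	int flags = fcntl(fd, F_GETFD);

	return flags != -1 && (flags & FD_CLOEXEC);
}

/*
 * 1 if both fds refer to the same underlying file (same device and inode),
 * 0 if not, -1 if fstat() failed. For driver objects this only works if
 * the driver backs each object with its own inode.
 */
static int example_same_object(int a, int b)
{
	struct stat sa, sb;

	if (fstat(a, &sa) != 0 || fstat(b, &sb) != 0)
		return -1;
	return sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino;
}

/* Open a (stand-in) device node without leaking it across exec(). */
static int example_open_device(const char *path)
{
	return open(path, O_RDWR | O_CLOEXEC);
}
```

Two independent opens of the same node compare equal, which is exactly what lets a process tell that an imported object is one it has already seen.
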
Last, but not Least
-------------------

Not every problem needs a new ioctl:

* Think hard whether you really want a driver-private interface. Of course
  it's much quicker to push a driver-private interface than engaging in
  lengthy discussions for a more generic solution. And occasionally doing a
  private interface to spearhead a new concept is what's required. But in the
  end, once the generic interface comes around you'll end up maintaining two
  interfaces. Indefinitely.

* Consider other interfaces than ioctls. A sysfs attribute is much better for
  per-device settings, or for child objects with fairly static lifetimes (like
  output connectors in drm with all the detection override attributes). Or
  maybe only your testsuite needs this interface, and then debugfs with its
  disclaimer of not having a stable ABI would be better.

Finally, the name of the game is to get it right on the first attempt, since if
your driver proves popular and your hardware platforms long-lived then you'll
be stuck with a given ioctl essentially forever. You can try to deprecate
horrible ioctls on newer iterations of your hardware, but generally it takes
years to accomplish this. And then again years until the last user able to
complain about regressions disappears, too.