SECure COMPuting with filters
=============================

Introduction
------------

A large number of system calls are exposed to every userland process,
with many of them going unused for the entire lifetime of the process.
As system calls change and mature, bugs are found and eradicated. A
certain subset of userland applications benefit by having a reduced set
of available system calls. The resulting set reduces the total kernel
surface exposed to the application. System call filtering is meant for
use with those applications.

Seccomp filtering provides a means for a process to specify a filter for
incoming system calls. The filter is expressed as a Berkeley Packet
Filter (BPF) program, as with socket filters, except that the data
operated on is related to the system call being made: system call
number and the system call arguments. This allows for expressive
filtering of system calls using a filter program language with a long
history of being exposed to userland and a straightforward data set.

Additionally, BPF makes it impossible for users of seccomp to fall prey
to time-of-check-time-of-use (TOCTOU) attacks that are common in system
call interposition frameworks. BPF programs may not dereference
pointers, which constrains all filters to solely evaluating the system
call arguments directly.
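
As a rough illustration of the data a filter evaluates, the sketch below
computes the offsets that a BPF load instruction would use to reach the
system call number and the two halves of an argument. It assumes the
struct seccomp_data definition from <linux/seccomp.h>. The helper names
syscall_nr_offset(), arg_lo_offset() and arg_hi_offset() are
hypothetical and exist only for this example. Seccomp BPF loads operate
on 32-bit words, so a 64-bit argument has to be examined as two halves
whose order in memory depends on the machine's byte order.

```c
#include <stddef.h>
#include <stdint.h>
#include <linux/seccomp.h>

/* Offset of the system call number within struct seccomp_data. */
size_t syscall_nr_offset(void)
{
	return offsetof(struct seccomp_data, nr);
}

/*
 * Offset of the low 32 bits of argument 'n' (0..5). Hypothetical
 * helper: on little-endian machines the low half comes first.
 */
size_t arg_lo_offset(unsigned int n)
{
	size_t base = offsetof(struct seccomp_data, args) +
		      n * sizeof(uint64_t);
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
	return base;
#else
	return base + sizeof(uint32_t);
#endif
}

/* Offset of the high 32 bits of argument 'n' (0..5). */
size_t arg_hi_offset(unsigned int n)
{
	size_t base = offsetof(struct seccomp_data, args) +
		      n * sizeof(uint64_t);
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
	return base + sizeof(uint32_t);
#else
	return base;
#endif
}
```

A filter would pass these offsets as the operand of a
BPF_LD | BPF_W | BPF_ABS instruction. Only the values themselves can be
compared: a pointer argument is seen as a number and is never followed.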

What it isn't
-------------

System call filtering isn't a sandbox. It provides a clearly defined
mechanism for minimizing the exposed kernel surface. It is meant to be
a tool for sandbox developers to use. Beyond that, policy for logical
behavior and information flow should be managed with a combination of
other system hardening techniques and, potentially, an LSM of your
choosing. Expressive, dynamic filters provide further options down this
path (avoiding pathological sizes or selecting which of the multiplexed
system calls in socketcall() is allowed, for instance) which could be
construed, incorrectly, as a more complete sandboxing solution.

Usage
-----

An additional seccomp mode is added and is enabled using the same
prctl(2) call as the strict seccomp. If the architecture has
CONFIG_HAVE_ARCH_SECCOMP_FILTER, then filters may be added as below:

PR_SET_SECCOMP:
	Now takes an additional argument which specifies a new filter
	using a BPF program.
	The BPF program will be executed over struct seccomp_data
	reflecting the system call number, arguments, and other
	metadata. The BPF program must then return one of the
	acceptable values to inform the kernel which action should be
	taken.

	Usage:
		prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, prog);

	The 'prog' argument is a pointer to a struct sock_fprog which
	will contain the filter program. If the program is invalid, the
	call will return -1 and set errno to EINVAL.

	If fork/clone and execve are allowed by 'prog', any child
	processes will be constrained to the same filters and system
	call ABI as the parent.

	Prior to use, the task must call prctl(PR_SET_NO_NEW_PRIVS, 1) or
	run with CAP_SYS_ADMIN privileges in its namespace. If neither
	is true, the call will return -1 and set errno to EACCES. This
	requirement ensures that filter programs cannot be applied to
	child processes with greater privileges than the task that
	installed them.

	Additionally, if prctl(2) is allowed by the attached filter,
	additional filters may be layered on which will increase
	evaluation time, but allow for further decreasing the attack
	surface during execution of a process.

The above call returns 0 on success and non-zero on error.
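
The calling sequence described above can be sketched as follows. This
is a minimal illustration, not a reference implementation: the helper
name install_seccomp_filter() is hypothetical, and the sketch assumes
the task lacks CAP_SYS_ADMIN and therefore sets no_new_privs first.

```c
#include <stddef.h>
#include <sys/prctl.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/*
 * Hypothetical helper: attach the 'len' BPF instructions at 'insns' as
 * a seccomp filter for the calling task. Returns 0 on success, or -1
 * with errno set by the prctl(2) call that failed.
 */
int install_seccomp_filter(struct sock_filter *insns, unsigned short len)
{
	struct sock_fprog prog = {
		.len = len,
		.filter = insns,
	};

	/* Required unless the task has CAP_SYS_ADMIN in its namespace. */
	if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
		return -1;

	/* Fails with EINVAL if the program is invalid. */
	if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog) != 0)
		return -1;

	return 0;
}
```

A caller would pass an array built with the BPF_STMT() and BPF_JUMP()
macros from <linux/filter.h> together with its element count. Note that
no_new_privs stays set for the task even if the second call fails.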

Return values
-------------

A seccomp filter may return any of the following values. If multiple
filters exist, the return value for the evaluation of a given system
call will always use the highest-precedence value. (For example,
SECCOMP_RET_KILL will always take precedence.)

In precedence order, they are:

SECCOMP_RET_KILL:
	Results in the task exiting immediately without executing the
	system call. The exit status of the task (status & 0x7f) will
	be SIGSYS, not SIGKILL.

SECCOMP_RET_TRAP:
	Results in the kernel sending a SIGSYS signal to the triggering
	task without executing the system call. siginfo->si_call_addr
	will show the address of the system call instruction, and
	siginfo->si_syscall and siginfo->si_arch will indicate which
	syscall was attempted. The program counter will be as though
	the syscall happened (i.e. it will not point to the syscall
	instruction). The return value register will contain an
	arch-dependent value -- if resuming execution, set it to
	something sensible. (The architecture dependency is because
	replacing it with -ENOSYS could overwrite some useful
	information.)

	The SECCOMP_RET_DATA portion of the return value will be passed
	as si_errno.

	SIGSYS triggered by seccomp will have a si_code of SYS_SECCOMP.

SECCOMP_RET_ERRNO:
	Results in the lower 16 bits of the return value being passed
	to userland as the errno without executing the system call.

SECCOMP_RET_TRACE:
	When returned, this value will cause the kernel to attempt to
	notify a ptrace()-based tracer prior to executing the system
	call. If there is no tracer present, -ENOSYS is returned to
	userland and the system call is not executed.

	A tracer will be notified if it requests PTRACE_O_TRACESECCOMP
	using ptrace(PTRACE_SETOPTIONS). The tracer will be notified
	of a PTRACE_EVENT_SECCOMP and the SECCOMP_RET_DATA portion of
	the BPF program return value will be available to the tracer
	via PTRACE_GETEVENTMSG.

	The tracer can skip the system call by changing the syscall number
	to -1. Alternatively, the tracer can change the system call
	requested by changing the system call to a valid syscall number. If
	the tracer asks to skip the system call, then the system call will
	appear to return the value that the tracer puts in the return value
	register.

	The seccomp check will not be run again after the tracer is
	notified. (This means that seccomp-based sandboxes MUST NOT
	allow use of ptrace, even of other sandboxed processes, without
	extreme care; ptracers can use this mechanism to escape.)

SECCOMP_RET_ALLOW:
	Results in the system call being executed.
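
For the SECCOMP_RET_TRAP case, the siginfo fields listed above can be
collected in a SIGSYS handler roughly as sketched below. The struct
trap_info type and the decode_seccomp_trap() helper are hypothetical
names introduced for this example. The fallback definition of
SYS_SECCOMP is an assumption for libc headers that do not provide it.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stdbool.h>

#ifndef SYS_SECCOMP
#define SYS_SECCOMP 1	/* assumed fallback for older libc headers */
#endif

/* Hypothetical container for the fields a SIGSYS handler may want. */
struct trap_info {
	int syscall_nr;		/* si_syscall: attempted system call */
	unsigned int arch;	/* si_arch: architecture of the call */
	void *call_addr;	/* si_call_addr: syscall instruction */
	int filter_data;	/* si_errno: SECCOMP_RET_DATA portion */
};

/*
 * Fill 'out' from 'si' if the signal was raised by seccomp.
 * Returns false for any other signal or si_code.
 */
bool decode_seccomp_trap(const siginfo_t *si, struct trap_info *out)
{
	if (si->si_signo != SIGSYS || si->si_code != SYS_SECCOMP)
		return false;

	out->syscall_nr = si->si_syscall;
	out->arch = si->si_arch;
	out->call_addr = si->si_call_addr;
	out->filter_data = si->si_errno;
	return true;
}
```

A handler installed with sigaction(2) and SA_SIGINFO would call this
helper on its siginfo_t argument. Setting the return value register
before resuming is architecture specific and is not shown here.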

If multiple filters exist, the return value for the evaluation of a
given system call will always use the highest-precedence value.

Precedence is only determined using the SECCOMP_RET_ACTION mask. When
multiple filters return values of the same precedence, only the
SECCOMP_RET_DATA from the most recently installed filter will be
returned.
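
The combination rule can be modelled in userspace as sketched below.
This is an illustration of the rule as stated here, not the kernel's
code. The function name combine_filter_results() is hypothetical, and
the sketch assumes that, for the five actions listed in this document,
a numerically lower action value in <linux/seccomp.h> means higher
precedence.

```c
#include <stddef.h>
#include <stdint.h>
#include <linux/seccomp.h>

/*
 * Userspace model of how several filter results are combined.
 * 'rets' holds one return value per filter, ordered from the oldest
 * filter (index 0) to the most recently installed one (index n - 1).
 */
uint32_t combine_filter_results(const uint32_t *rets, size_t n)
{
	uint32_t result;
	size_t i;

	if (n == 0)
		return SECCOMP_RET_ALLOW;	/* no filters installed */

	/* Start from the newest filter so that it wins any tie. */
	result = rets[n - 1];
	for (i = n - 1; i-- > 0; ) {
		/* Only the action bits take part in the comparison. */
		if ((rets[i] & SECCOMP_RET_ACTION) <
		    (result & SECCOMP_RET_ACTION))
			result = rets[i];
	}
	return result;
}
```

With two filters that both return SECCOMP_RET_ERRNO, the errno value
chosen is the one from the filter installed last. A SECCOMP_RET_KILL
from any filter overrides every other result.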

Pitfalls
--------

The biggest pitfall to avoid during use is filtering on system call
number without checking the architecture value. Why? On any
architecture that supports multiple system call invocation conventions,
the system call numbers may vary based on the specific invocation. If
the numbers in the different calling conventions overlap, then checks in
the filters may be abused. Always check the arch value!
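
One way to perform the architecture check is to begin every filter with
a short prologue like the one sketched below. The helper name
emit_arch_check() is hypothetical, and killing the task on a mismatch
is a policy choice made for this example rather than a requirement. The
AUDIT_ARCH_* constants come from <linux/audit.h>.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/*
 * Hypothetical helper: write a three-instruction prologue to 'out'
 * that kills the task unless seccomp_data.arch equals 'expected_arch'
 * (an AUDIT_ARCH_* value). Returns the number of instructions written.
 * Syscall number checks are meant to follow the prologue.
 */
size_t emit_arch_check(struct sock_filter *out, uint32_t expected_arch)
{
	const struct sock_filter prologue[] = {
		/* A = seccomp_data.arch */
		BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
			 offsetof(struct seccomp_data, arch)),
		/* On a match skip the kill, otherwise fall through to it. */
		BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, expected_arch, 1, 0),
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
	};

	memcpy(out, prologue, sizeof(prologue));
	return sizeof(prologue) / sizeof(prologue[0]);
}
```

On x86-64, for instance, the caller would pass AUDIT_ARCH_X86_64 and
append the instructions that load and compare seccomp_data.nr after the
prologue, so that system call numbers are only interpreted for the
calling convention they belong to.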

Example
-------

The samples/seccomp/ directory contains both an x86-specific example
and a more generic example of a higher level macro interface for BPF
program generation.

Adding architecture support
---------------------------

See arch/Kconfig for the authoritative requirements. In general, if an
architecture supports both ptrace_event and seccomp, it will be able to
support seccomp filter with minor fixup: SIGSYS support and seccomp return
value checking. Then it must just add CONFIG_HAVE_ARCH_SECCOMP_FILTER
to its arch-specific Kconfig.

Caveats
-------

The vDSO can cause some system calls to run entirely in userspace,
leading to surprises when you run programs on different machines that
fall back to real syscalls. To minimize these surprises on x86, make
sure you test with
/sys/devices/system/clocksource/clocksource0/current_clocksource set to
something like acpi_pm.

On x86-64, vsyscall emulation is enabled by default. (vsyscalls are
legacy variants on vDSO calls.) Currently, emulated vsyscalls will
honor seccomp, with a few oddities:

- A return value of SECCOMP_RET_TRAP will set a si_call_addr pointing to
  the vsyscall entry for the given call and not the address after the
  'syscall' instruction. Any code which wants to restart the call
  should be aware that (a) a ret instruction has been emulated and (b)
  trying to resume the syscall will again trigger the standard vsyscall
  emulation security checks, making resuming the syscall mostly
  pointless.

- A return value of SECCOMP_RET_TRACE will signal the tracer as usual,
  but the syscall may not be changed to another system call using the
  orig_rax register. It may only be changed to -1 in order to skip the
  currently emulated call. Any other change MAY terminate the process.
  The rip value seen by the tracer will be the syscall entry address;
  this is different from normal behavior. The tracer MUST NOT modify
  rip or rsp. (Do not rely on other changes terminating the process.
  They might work. For example, on some kernels, choosing a syscall
  that only exists in future kernels will be correctly emulated (by
  returning -ENOSYS).)

To detect this quirky behavior, check for (addr & ~0x0C00) ==
0xFFFFFFFFFF600000. (For SECCOMP_RET_TRACE, use rip. For
SECCOMP_RET_TRAP, use siginfo->si_call_addr.) Do not check any other
condition: future kernels may improve vsyscall emulation and current
kernels in vsyscall=native mode will behave differently, but the
instructions at 0xF...F600{0,4,8,C}00 will not be system calls in these
cases.
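
The address test can be written as a small C helper, sketched below.
The function name is_vsyscall_entry() is hypothetical. Note the
parentheses around the masking operation: in C, == binds more tightly
than &, so the mask must be applied before the comparison.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: return true if 'addr' is one of the four x86-64
 * vsyscall entry addresses 0xFFFFFFFFFF600{0,4,8,C}00. 'addr' is rip
 * for SECCOMP_RET_TRACE or si_call_addr for SECCOMP_RET_TRAP.
 */
bool is_vsyscall_entry(uint64_t addr)
{
	return (addr & ~UINT64_C(0x0C00)) == UINT64_C(0xFFFFFFFFFF600000);
}
```

A SIGSYS handler could call this with
(uint64_t)(uintptr_t)siginfo->si_call_addr and avoid trying to resume
the call when the result is true.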

Note that modern systems are unlikely to use vsyscalls at all -- they
are a legacy feature and they are considerably slower than standard
syscalls. New code will use the vDSO, and vDSO-issued system calls
are indistinguishable from normal system calls.