Debugging hibernation and suspend
        (C) 2007 Rafael J. Wysocki <rjw@sisk.pl>, GPL

1. Testing hibernation (aka suspend to disk or STD)

To check if hibernation works, you can try to hibernate in the "reboot" mode:

        # echo reboot > /sys/power/disk
        # echo disk > /sys/power/state

The system should then create a hibernation image, reboot, resume and get back
to the command prompt where you started the transition. If that happens,
hibernation is most likely to work correctly. Still, you need to repeat the
test at least a couple of times in a row for confidence, because some problems
only show up on a second attempt at suspending and resuming the system.
Moreover, hibernating in the "reboot" and "shutdown" modes causes the PM core
to skip some platform-related callbacks which on ACPI systems might be
necessary to make hibernation work. Thus, if your machine fails to hibernate
or resume in the "reboot" mode, you should try the "platform" mode:

        # echo platform > /sys/power/disk
        # echo disk > /sys/power/state

which is the default and recommended mode of hibernation.

Unfortunately, the "platform" mode of hibernation does not work on some systems
with broken BIOSes. In such cases the "shutdown" mode of hibernation might
work:

        # echo shutdown > /sys/power/disk
        # echo disk > /sys/power/state

(it is similar to the "reboot" mode, but it requires you to press the power
button to make the system resume).
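
The command pairs above differ only in the mode string, so they can be wrapped
in a small helper. The following is a minimal sketch, not part of the kernel
interface: the function name hibernate_in_mode and the POWER_DIR variable are
made up for illustration, and POWER_DIR exists only so that the function can be
tried against a scratch directory instead of the real /sys/power.

```shell
# Sketch: select one of the hibernation modes discussed above and start
# the transition. POWER_DIR is a hypothetical override used for dry runs;
# it defaults to the real sysfs directory.
hibernate_in_mode() {
    mode=$1
    power_dir=${POWER_DIR:-/sys/power}

    # Only the three modes described in this section are accepted here.
    case $mode in
        reboot|platform|shutdown) ;;
        *) echo "unsupported mode: $mode" >&2; return 2 ;;
    esac

    echo "$mode" > "$power_dir/disk" || return 1
    echo disk > "$power_dir/state"
}
```

Run as root, "hibernate_in_mode reboot" is then equivalent to the first pair
of commands, and calling it a few times in a row gives the repeated test
recommended above.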

If neither "platform" nor "shutdown" hibernation mode works, you will need to
identify what goes wrong.

a) Test modes of hibernation

To find out why hibernation fails on your system, you can use a special testing
facility available if the kernel is compiled with CONFIG_PM_DEBUG set. Then,
there is the file /sys/power/pm_test that can be used to make the hibernation
core run in a test mode. There are 5 test modes available:

freezer
- test the freezing of processes

devices
- test the freezing of processes and suspending of devices

platform
- test the freezing of processes, suspending of devices and platform
  global control methods(*)

processors
- test the freezing of processes, suspending of devices, platform
  global control methods(*) and the disabling of nonboot CPUs

core
- test the freezing of processes, suspending of devices, platform global
  control methods(*), the disabling of nonboot CPUs and suspending of
  platform/system devices

(*) the platform global control methods are only available on ACPI systems
    and are only tested if the hibernation mode is set to "platform"

To use one of them it is necessary to write the corresponding string to
/sys/power/pm_test (e.g. "devices" to test the freezing of processes and
suspending of devices) and issue the standard hibernation commands. For
example, to use the "devices" test mode along with the "platform" mode of
hibernation, you should do the following:

        # echo devices > /sys/power/pm_test
        # echo platform > /sys/power/disk
        # echo disk > /sys/power/state

Then, the kernel will try to freeze processes, suspend devices, wait a few
seconds (5 by default, but configurable by the suspend.pm_test_delay module
parameter), resume devices and thaw processes. If "platform" is written to
/sys/power/pm_test, then after suspending devices the kernel will additionally
invoke the global control methods (e.g. ACPI global control methods) used to
prepare the platform firmware for hibernation. Next, it will wait a
configurable number of seconds and invoke the platform (e.g. ACPI) global
methods used to cancel hibernation etc.

Writing "none" to /sys/power/pm_test causes the kernel to switch to the normal
hibernation/suspend operations. Also, when read, /sys/power/pm_test contains a
space-separated list of all available tests (including "none" that represents
the normal functionality) in which the current test level is indicated by
square brackets.
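
Since the current level is the only bracketed entry in that list, it can be
extracted with standard text tools. The sketch below assumes nothing beyond
the format described above; the function name current_pm_test and its optional
file argument are illustrative only.

```shell
# Print the test level that is currently selected, i.e. the entry shown
# in square brackets when /sys/power/pm_test is read.
# The optional argument is a hypothetical override of the file path.
current_pm_test() {
    pm_test_file=${1:-/sys/power/pm_test}
    # Put every entry on its own line, then keep only the bracketed one
    # and strip the brackets.
    tr ' ' '\n' < "$pm_test_file" | sed -n 's/^\[\(.*\)\]$/\1/p'
}
```

This is handy for checking that "none" has been restored before attempting a
real hibernation.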

Generally, as you can see, each test level is more "invasive" than the previous
one and the "core" level tests the hardware and drivers as deeply as possible
without creating a hibernation image. Obviously, if the "devices" test fails,
the "platform" test will fail as well and so on. Thus, as a rule of thumb, you
should try the test modes starting from "freezer", through "devices", "platform"
and "processors" up to "core" (repeat the test on each level a couple of times
to make sure that any random factors are avoided).
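
The rule of thumb above can be sketched as a loop over the test levels. This
is an illustration under two assumptions: that a failed test shows up as a
failed write to /sys/power/state (a hang obviously cannot be detected this
way), and that the hibernation mode has already been chosen through
/sys/power/disk. The names run_pm_test_levels and POWER_DIR are hypothetical.

```shell
# Walk through the test levels from least to most invasive, repeating
# each one, and stop at the first level that fails.
# $1 is the number of repetitions per level (default 2). POWER_DIR is a
# hypothetical override of /sys/power used for dry runs.
run_pm_test_levels() {
    repeats=${1:-2}
    power_dir=${POWER_DIR:-/sys/power}

    for level in freezer devices platform processors core; do
        i=0
        while [ "$i" -lt "$repeats" ]; do
            echo "$level" > "$power_dir/pm_test" || return 1
            if ! echo disk > "$power_dir/state"; then
                echo "pm_test level '$level' failed" >&2
                echo none > "$power_dir/pm_test"
                return 1
            fi
            i=$((i + 1))
        done
        echo "pm_test level '$level' passed"
    done

    # Switch back to the normal hibernation/suspend operations.
    echo none > "$power_dir/pm_test"
}
```

The first level reported as failed is the one to investigate using the
descriptions of the individual test modes.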

If the "freezer" test fails, there is a task that cannot be frozen (in that case
it usually is possible to identify the offending task by analysing the output of
dmesg obtained after the failing test). Failure at this level usually means
that there is a problem with the tasks freezer subsystem that should be
reported.
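
To narrow down the dmesg output mentioned above, it may help to filter it for
freezer-related lines. The sketch below assumes only that such messages
contain the substring "freez" (as in "freeze" or "freezing"); the exact
wording differs between kernel versions, so treat the pattern as a starting
point. The function name freezer_messages is made up for this example.

```shell
# Print kernel log lines that mention freezing.
# With no argument the live kernel log is read through dmesg; a file
# name can be given instead to scan a saved log.
freezer_messages() {
    if [ "$#" -gt 0 ]; then
        grep -i 'freez' "$1"
    else
        dmesg | grep -i 'freez'
    fi
}
```

Saving the log to a file right after the failing test and passing that file
to the function keeps the evidence available for a bug report.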

If the "devices" test fails, most likely there is a driver that cannot suspend
or resume its device (in the latter case the system may hang or become unstable
after the test, so please take that into consideration). To find this driver,
you can carry out a binary search according to the rules:
- if the test fails, unload half of the drivers currently loaded and repeat
(that would probably involve rebooting the system, so always note what drivers
have been loaded before the test),
- if the test succeeds, load half of the drivers you have unloaded most
recently and repeat.
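
One step of this binary search needs a list of the loaded modules and a way to
pick half of them. The following sketch reads a file in the /proc/modules
layout (module name in the first column) and prints the names from its first
half. The function name first_half_of_modules is hypothetical, and the sketch
does not account for dependencies between modules, which may prevent some of
them from being unloaded.

```shell
# Print the names of the first half (rounded up) of the modules listed
# in a file that uses the /proc/modules layout.
# The optional argument is a hypothetical override of the file path.
first_half_of_modules() {
    modules_file=${1:-/proc/modules}
    total=$(wc -l < "$modules_file")
    half=$(( (total + 1) / 2 ))
    head -n "$half" "$modules_file" | cut -d' ' -f1
}
```

Before each test it is worth saving the complete list, for instance with
"cut -d' ' -f1 /proc/modules > modules-before-test.txt", so that the set of
loaded drivers can be restored after a reboot. The names printed by the
function can then be passed to "modprobe -r".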

Once you have found the failing driver (there can be more than just one of
them), you have to unload it every time before hibernation. In that case please
make sure to report the problem with the driver.

It is also possible that the "devices" test will still fail after you have
unloaded all modules. In that case, you may want to look in your kernel
configuration for the drivers that can be compiled as modules (and test again
with these drivers compiled as modules). You may also try to use some special
kernel command line options such as "noapic", "noacpi" or even "acpi=off".

If the "platform" test fails, there is a problem with the handling of the
platform (e.g. ACPI) firmware on your system. In that case the "platform" mode
of hibernation is not likely to work. You can try the "shutdown" mode, but that
is rather a poor man's workaround.

If the "processors" test fails, the disabling/enabling of nonboot CPUs does not
work (of course, this can only be an issue on SMP systems) and the problem
should be reported. In that case you can also try to switch the nonboot CPUs
off and on using the /sys/devices/system/cpu/cpu*/online sysfs attributes and
see if that works.
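
Switching the CPUs off and on by hand can be sketched as follows. The loop
touches every CPU directory that has an "online" attribute; on many systems
the boot CPU has no such attribute and is therefore skipped. The names
cycle_nonboot_cpus and CPU_DIR are hypothetical, and CPU_DIR exists only so
that the function can be tried on a scratch directory.

```shell
# Take every CPU that exposes an "online" attribute offline and bring it
# back, stopping at the first failure.
# CPU_DIR is a hypothetical override of the sysfs directory.
cycle_nonboot_cpus() {
    cpu_dir=${CPU_DIR:-/sys/devices/system/cpu}

    for attr in "$cpu_dir"/cpu*/online; do
        # Skip the unexpanded pattern if no CPU has the attribute.
        [ -e "$attr" ] || continue
        echo 0 > "$attr" || { echo "cannot offline: $attr" >&2; return 1; }
        echo 1 > "$attr" || { echo "cannot online: $attr" >&2; return 1; }
    done
}
```

If this fails or hangs for a particular CPU, include that information when
reporting the problem.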

If the "core" test fails, which means that suspending of the system/platform
devices has failed (these devices are suspended on one CPU with interrupts off),
the problem is most probably hardware-related and serious, so it should be
reported.

A failure of any of the "platform", "processors" or "core" tests may cause your
system to hang or become unstable, so please beware. Such a failure usually
indicates a serious problem that may very well be related to the hardware, but
please report it anyway.

b) Testing minimal configuration

If all of the hibernation test modes work, you can boot the system with the
"init=/bin/bash" command line parameter and attempt to hibernate in the
"reboot", "shutdown" and "platform" modes. If that does not work, there
probably is a problem with a driver statically compiled into the kernel and you
can try to compile more drivers as modules, so that they can be tested
individually. Otherwise, there is a problem with a modular driver and you can
find it by loading half of the modules you normally use and binary searching
in accordance with the algorithm:
- if there are n modules loaded and the attempt to suspend and resume fails,
unload n/2 of the modules and try again (that would probably involve rebooting
the system),
- if there are n modules loaded and the attempt to suspend and resume succeeds,
load n/2 modules more and try again.

Again, if you find the offending module(s), they must be unloaded every time
before hibernation, and please report the problem with them.

c) Advanced debugging

If hibernation does not work on your system even in the minimal configuration
and compiling more drivers as modules is not practical or some modules cannot
be unloaded, you can use one of the more advanced debugging techniques to find
the problem. First, if there is a serial port in your box, you can boot the
kernel with the 'no_console_suspend' parameter and try to log kernel messages
using the serial console. This may provide you with some information about the
reasons for the suspend (resume) failure. Alternatively, it may be possible to
use a FireWire port for debugging with firescope
(http://v3.sk/~lkundrak/firescope/). On x86 it is also possible to use the
PM_TRACE mechanism documented in Documentation/power/s2ram.txt.

2. Testing suspend to RAM (STR)

To verify that STR works, it is generally more convenient to use the s2ram
tool available from http://suspend.sf.net and documented at
http://en.opensuse.org/SDB:Suspend_to_RAM (S2RAM_LINK).

Before doing that, you can test STR with the help of /sys/power/pm_test.
Namely, after writing "freezer", "devices", "platform", "processors", or "core"
into /sys/power/pm_test (available if the kernel is compiled with
CONFIG_PM_DEBUG set) the suspend code will work in the test mode corresponding
to the given string. The STR test modes are defined in the same way as for
hibernation, so please refer to Section 1 for more information about them. In
particular, the "core" test allows you to test everything except for the actual
invocation of the platform firmware in order to put the system into the sleep
state.
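
A single STR test run can be sketched in the same way as for hibernation. The
sketch assumes that suspend to RAM is requested by writing "mem" to
/sys/power/state, which is the usual string for it but is not spelled out in
this document. The names str_pm_test and POWER_DIR are hypothetical.

```shell
# Run one suspend-to-RAM transition at the given pm_test level and then
# switch back to normal operation.
# POWER_DIR is a hypothetical override of /sys/power used for dry runs.
str_pm_test() {
    level=$1
    power_dir=${POWER_DIR:-/sys/power}

    echo "$level" > "$power_dir/pm_test" || return 1
    echo mem > "$power_dir/state"
    status=$?

    # Restore the normal suspend path whether or not the test succeeded.
    echo none > "$power_dir/pm_test"
    return "$status"
}
```

For example, "str_pm_test core" exercises everything short of the actual
invocation of the platform firmware.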

Among other things, the testing with the help of /sys/power/pm_test may allow
you to identify drivers that fail to suspend or resume their devices. They
should be unloaded every time before an STR transition.

Next, you can follow the instructions at S2RAM_LINK to test the system, but if
it does not work "out of the box", you may need to boot it with
"init=/bin/bash" and test s2ram in the minimal configuration. In that case,
you may be able to search for failing drivers by following the procedure
analogous to the one described in section 1. If you find some failing drivers,
you will have to unload them every time before an STR transition (i.e. before
you run s2ram), and please report the problems with them.

There is a debugfs entry which shows the suspend to RAM statistics. Here is an
example of its output.

        # mount -t debugfs none /sys/kernel/debug
        # cat /sys/kernel/debug/suspend_stats
        success: 20
        fail: 5
        failed_freeze: 0
        failed_prepare: 0
        failed_suspend: 5
        failed_suspend_noirq: 0
        failed_resume: 0
        failed_resume_noirq: 0
        failures:
          last_failed_dev:      alarm
                                adc
          last_failed_errno:    -16
                                -16
          last_failed_step:     suspend
                                suspend

The "success" field is the number of successful suspend to RAM transitions and
the "fail" field is the number of failed ones. The other counters give the
number of failures in the individual steps of suspend to RAM. suspend_stats
lists only the last 2 failed devices, along with the error number and the
failed step of suspend for each of them.
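
The counters in this file are plain "name: value" lines, so a single one can be
picked out with awk. The sketch below assumes the layout shown in the example
output; the function name suspend_stat and its optional file argument are
illustrative only.

```shell
# Print the value of one counter (e.g. "success", "fail" or
# "failed_suspend") from a suspend_stats dump.
# $2 is a hypothetical override of the file path.
suspend_stat() {
    field=$1
    stats_file=${2:-/sys/kernel/debug/suspend_stats}
    # Split each line at the colon and print the value of the first line
    # whose name matches exactly.
    awk -F': *' -v key="$field" '$1 == key { print $2; exit }' "$stats_file"
}
```

Comparing "suspend_stat fail" before and after a series of STR transitions
shows whether any of them failed.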