xnu: For some signals, it’s not possible to distinguish between hardware faults and asynchronous software triggers (kill() and raise()), frustrating user-space fault handling and proper re-raise

Originator:mark
Number:rdar://FB8707529 Date Originated:2020-09-17
Status:Open Resolved:
Product:macOS Product Version:
Classification:Incorrect/Unexpected Behavior Reproducible:
 
SIGBUS, SIGFPE, SIGILL, and SIGSEGV are distinguished from other signals in that they arise synchronously as a result of sustaining a hardware trap. In xnu, the hardware traps give rise to Mach exceptions which, if handled by the default in-kernel handler, are delivered to the faulting process as a POSIX signal (10.15.6 xnu-6153.141.1/bsd/uxkern/ux_exception.c handle_ux_exception()). A signal handler responding to one of these signals has the opportunity to handle the fault. If unhandled, the signal will recur when the signal handler returns. Should the signal handler uninstall itself (restoring SIG_DFL) and return, the recurring signal will be treated according to the system’s default disposition, which for these signals is termination with an optional core dump (SA_KILL | SA_CORE, 10.15.6 xnu-6153.141.1/bsd/sys/signalvar.h sigprop[]).

Because they are part of the signal interface, it is also possible for these four signals, normally delivered synchronously as described above, to occur asynchronously, being sent from software by the kill() system call, the raise() library function, or the kill command. When triggered in this way, these signals will be delivered to a signal handler, but they will not recur autonomously when the signal handler returns, because no hardware fault exists that would cause them to re-raise. As such, a signal handler that wants to perform its own action, uninstall itself, and have the default termination action occur, will need to explicitly re-raise the signal from within the signal handler.

Signal handlers need to be able to distinguish between these hardware-based signals occurring naturally and synchronously from hardware traps and those generated asynchronously from software. Only true hardware traps should be the target of user-space fault handling from the signal handler. Autonomous re-raise is also available for hardware trap-based signals, but not for others.

A re-raised signal with the default disposition guarantees the desired termination, but if applied to hardware faults, it results in information loss, as the underlying cause of the hardware fault is no longer available via the EXC_CORPSE_NOTIFY and EXC_CRASH interfaces used by crash reporters such as Apple’s own ReportCrash and Chrome’s Crashpad. This is undesirable as it obscures the true cause of the crash. A well-designed signal handler will observe the attributes of a signal to determine whether it will re-raise autonomously or if an explicit software re-raise is necessary. Therefore, for this purpose as well, it is important for the signal handler to be able to distinguish between signals arising from hardware faults that will re-raise autonomously, and those arising asynchronously from software that will not and that must be re-raised manually.

Ordinarily, it’s possible to use the siginfo::si_code field to distinguish between signals that arose as hardware traps (and thus will be re-raised autonomously) and those that arose asynchronously from software (and will not). A well-behaved signal handler would consult this field to determine whether software re-raise is necessary and appropriate. In xnu, this does not function as intended. On x86_64, SIGBUS always appears to a signal handler as a hardware trap, even when originating asynchronously in software. On x86_64, SIGSEGV arising out of a hardware #GP trap (general protection fault) will appear to the signal handler as though it was generated asynchronously in software. Regardless of the actual source of the signal, SIGTRAP on x86_64 always appears as if it originated synchronously in hardware.

Quoting POSIX.1-2008, 2018 Edition, XRAT “Rationale”/B.2.4 “Signal Concepts”/“Signal Actions”:

> Historically, an si_code value of less than or equal to zero indicated that
> the signal was generated by a process via the kill() function, and values of
> si_code that provided additional information for implementation-generated
> signals, such as SIGFPE or SIGSEGV, were all positive.

For SIGBUS on all architectures, si_code is always nonzero, even when generated by kill().

> XSI applications should check whether si_code is SI_USER or SI_QUEUE in
> addition to checking whether it is less than or equal to zero.

Note that xnu currently defines but never uses the SI_* constants such as SI_USER and SI_QUEUE. I don’t consider it important for xnu to adopt SI_USER, provided that it correctly implements what POSIX calls the “historical” behavior of setting si_code to 0 for signals generated asynchronously in software and to a positive value for signals generated from hardware traps.
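As an illustration of the check discussed above, here is a minimal sketch of a handler that consults si_code before deciding whether to re-raise. It is not taken from t_signals.cc or Crashpad; the names WillReraiseAutonomously and HandleFatalSignal are hypothetical, and the logic assumes the POSIX “historical” convention (positive si_code for hardware traps, zero or negative for kill() and raise()).

```cpp
#include <signal.h>
#include <unistd.h>

// Returns true if the signal described by |info| is expected to recur on its
// own once the handler returns. Assumes the POSIX "historical" convention:
// si_code > 0 for hardware traps, si_code <= 0 for kill() and raise().
bool WillReraiseAutonomously(const siginfo_t* info) {
  switch (info->si_signo) {
    case SIGBUS:
    case SIGFPE:
    case SIGILL:
    case SIGSEGV:
      return info->si_code > 0;
    default:
      // SIGTRAP and the software signals never recur by themselves.
      return false;
  }
}

// Hypothetical handler, meant to be installed with sigaction() and SA_SIGINFO.
void HandleFatalSignal(int signo, siginfo_t* info, void*) {
  // Restore the default disposition so that the next delivery terminates.
  struct sigaction action = {};
  action.sa_handler = SIG_DFL;
  sigemptyset(&action.sa_mask);
  if (sigaction(signo, &action, nullptr) != 0) {
    _exit(1);
  }

  if (!WillReraiseAutonomously(info)) {
    // No hardware fault is outstanding, so re-raise explicitly. The signal
    // stays pending until this handler returns, then SIG_DFL applies.
    raise(signo);
  }
  // For a genuine hardware trap, simply return: the faulting instruction
  // runs again and the signal recurs with its original cause intact.
}
```

A handler written this way makes the wrong choice whenever si_code misreports the origin of the signal, which is the failure described in the rest of this report.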

I’m attaching a test program, t_signals.cc, that demonstrates the problem. Compilation and usage:

mark@sweet16 zsh% clang++ -std=c++17 -g -Wall -Werror t_signals.cc -o t_signals
mark@sweet16 zsh% ./t_signals
usage: t_signals <handle> <trigger> <signal>
  handle: catch,
          nocatch
  trigger: cause,
           cause/nocatch,
           raise
  signal: abrt,
          alrm,
          bus,
          fpe,
          ill,
          pipe,
          segv,
          segv/gp,
          sys,
          trap

“handle” determines whether the program installs its own signal handlers in-process or simply leaves all signals at their default (SIG_DFL) dispositions. “trigger” determines whether the signal arises naturally due to some fault (“cause”), or via the raise() library function (which internally uses the kill() system call). “signal” specifies the signal to cause or raise. A bit of explanation on “segv/gp”: with trigger set to “cause”, it causes SIGSEGV to be raised by triggering a #GP trap (general protection fault), to exhibit the subtly different handling afforded to such traps compared to other causes of SIGSEGV. “segv” without the “/gp” achieves the signal in “cause” mode via an ordinary null pointer dereference.

For the signals that can arise from hardware traps, SIGBUS, SIGFPE, SIGILL, SIGSEGV, and SIGTRAP, the program arranges to generate a true trap, and xnu’s full fault handling path comes into play in order to deliver a signal to the process. (A note on SIGTRAP: although originating from a hardware fault, it’s intended for use as a software trap, and the program counter advances to the next instruction before the fault occurs, so it will not re-raise autonomously, in contrast to the other, harder, hardware faults.)

The expected behavior is for a signal handler to execute in the program, and for the program to subsequently exit with the proper status reported via waitpid. For example:

mark@sweet16 zsh% PS1='%n@%m %? zsh%# '
mark@sweet16 0 zsh% ./t_signals catch cause bus
expect_signal = 10
signal = 10 (Bus error: 10)
code = 2 (genuine)
pid = 0, uid = 0
reraise_autonomously = 1
zsh: bus error  ./t_signals catch cause bus
mark@sweet16 138 zsh%

This is fine. The signal handler detected that the signal originated from a hardware trap (code = 2 (genuine)), would re-raise autonomously (reraise_autonomously = 1), and the signal did in fact recur on its own. SIGBUS was visible to the parent, zsh, via waitpid. I have placed %? in PS1 so that the exit status is visible, here, 138. (138 & 0x7f = 10 = SIGBUS). However, a SIGBUS generated via the kill() system call is not handled correctly:

mark@sweet16 0 zsh% ./t_signals catch raise bus
expect_signal = 10
signal = 10 (Bus error: 10)
code = 2 (genuine)
pid = 0, uid = 0
reraise_autonomously = 1
t_signals: unexpected successful exit
mark@sweet16 0 zsh%

In this case, reraise_autonomously was incorrectly detected as 1 because xnu set siginfo::si_code to a positive value, BUS_ADRERR. This happens unconditionally when delivering SIGBUS in 10.15.6 xnu-6153.141.1/bsd/dev/i386/unix_signal.c sendsig(). As a result, the signal handler did not attempt to re-raise the signal manually, and the signal, having originated asynchronously in software, did not recur following the signal handler’s return. The program proceeded to complete executing in main(), exiting with status 0.
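The same observation can be made without the full test program. The following is a rough sketch, separate from t_signals.cc, that installs a recording handler, sends the signal with raise(), and reports the si_code the handler saw. RaiseAndObserveCode and the globals are hypothetical names, and the sketch assumes a single-threaded caller that does not have the signal blocked.

```cpp
#include <signal.h>

namespace {

volatile sig_atomic_t g_seen_signo = 0;
volatile sig_atomic_t g_seen_code = 0;

// Records what the kernel reported; does nothing else.
void RecordSignal(int signo, siginfo_t* info, void*) {
  g_seen_signo = signo;
  g_seen_code = info->si_code;
}

}  // namespace

// Raises |signo| in software with a recording handler installed, then
// restores the previous disposition. On success, returns true and stores the
// si_code that the handler observed in |*code|.
bool RaiseAndObserveCode(int signo, int* code) {
  struct sigaction action = {};
  action.sa_sigaction = RecordSignal;
  action.sa_flags = SA_SIGINFO;
  sigemptyset(&action.sa_mask);

  struct sigaction old_action;
  if (sigaction(signo, &action, &old_action) != 0) {
    return false;
  }

  g_seen_signo = 0;
  // In a single-threaded program, raise() does not return until the handler
  // has run.
  raise(signo);
  sigaction(signo, &old_action, nullptr);

  if (g_seen_signo != signo) {
    return false;
  }
  *code = g_seen_code;
  return true;
}
```

Under the historical convention, RaiseAndObserveCode(SIGBUS, &code) would be expected to leave code at zero or a negative value. The transcript above instead shows xnu on x86_64 reporting 2 for this software-raised signal.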

A related problem occurs on x86_64 for SIGSEGV when originating from a #GP trap (general protection fault). In 10.15.6 xnu-6153.141.1/bsd/dev/i386/unix_signal.c sendsig(), handling EXC_I386_GPFLT, si_code is cleared, making the signal appear asynchronous, when it is obviously not. A new si_code value in the SEGV_ namespace should be introduced to properly convey that such signals originate synchronously from a hardware trap. This can be diagnosed by running t_signals on x86_64:

mark@sweet16 0 zsh% ./t_signals catch cause segv/gp
expect_signal = 11
signal = 11 (Segmentation fault: 11)
code = 0 (software kill())
pid = 0, uid = 0
reraise_autonomously = 0
zsh: segmentation fault  ./t_signals catch cause segv/gp
mark@sweet16 139 zsh%

This was detected as a software kill(). Although the signal was re-raised, the re-raise was manual and not autonomous, which has resulted in information loss to the crash reporter. Examining the report produced by ReportCrash in ~/Library/Logs/DiagnosticReports for this execution compared to a cause/nocatch version (no in-process signal handler, therefore no manual re-raise, just ordinary SIG_DFL handling the first time the signal appears) reveals differences.

mark@sweet16 0 zsh% ./t_signals nocatch cause segv/gp
zsh: segmentation fault  ./t_signals nocatch cause segv/gp
mark@sweet16 139 zsh% diff -u ~/Library/Logs/DiagnosticReports/t_signals_2020-09-17-1832{00,59}_sweet16.crash
[…filtered to present only salient differences…]
+Termination Signal:    Segmentation fault: 11
+Termination Reason:    Namespace SIGNAL, Code 0xb
+Terminating Process:   exc handler [35165]
[…]
-Error Code:      0x020000b8
-Trap Number:     133
+Error Code:      0x00000000
+Trap Number:     13
[…]

Because the SIGSEGV was incorrectly detected as originating asynchronously from software, (1) the signal handler, if being used as a user-space fault handler, would not have attempted to recover from the fault condition, even though one existed, and (2) it performed manual re-raise, even though the signal would have re-raised autonomously. As a result of the manual re-raise, ReportCrash did not output the “Termination” section further identifying the cause of the crash, and the trap number is presented as 133 (T_SYSCALL) instead of 13 (T_GENERAL_PROTECTION).

SIGTRAP handling is also incorrect on x86_64. Recalling that SIGTRAP never recurs autonomously, the “code =” (from siginfo::si_code) printed by the test program’s signal handler must be inspected. On x86_64, si_code always appears as 1 (TRAP_BRKPT), indicating that a #BP (breakpoint) trap was sustained, such as via the int3 instruction, even when the SIGTRAP was raised asynchronously in software.

Finally, it’s not possible to distinguish between genuine SIGPIPE and SIGSYS originating in the kernel and those sent from user space by kill(). In all cases, si_code is reported as 0. SIGALRM shares the same problem, although it is presently possible to distinguish genuine SIGALRM as they set si_pid and si_uid.

Resolution:

Improvements must be made to the sendsig() function in 10.15.6 xnu-6153.141.1/bsd/dev/i386/unix_signal.c (for x86_64).

For SIGBUS and SIGTRAP, sendsig() should consider ut->uu_code when setting sinfo64.si_code, setting it to 0 when signals did not arise from a hardware trap. For SIGSEGV, when ut->uu_code is EXC_I386_GPFLT, it should set sinfo64.si_code to a new nonzero positive constant in the SEGV_* domain, or make appropriate reuse of existing SEGV_* constants. In the SIGSEGV ut->uu_code “default” case, FPE_NOOP should be replaced with a literal 0 for clarity. FPE_NOOP, while having value 0, incorrectly conveys that this condition has something to do with SIGFPE, when it does not. (SEGV_NOOP is another alternative.)

Similar handling should be applied to the software signals SIGALRM, SIGPIPE, and SIGSYS, setting sinfo.si_code to a positive integer when raised for genuine intended purposes by the kernel, and to 0 when arising from a user space kill().

Optionally, all cases that set si_code to 0 to indicate an asynchronous software source could be changed to use SI_USER per POSIX. However, maintaining the traditional value of 0 is not problematic.

Also optionally, the si_pid and si_uid fields can be populated unconditionally for signals raised asynchronously from software, such as by the kill() system call. This would aid in tracing the origin of such signals. Presently, si_pid and si_uid are inconsistently set.
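To make the requested x86_64 behavior concrete, the si_code selection can be modeled as a small pure function. This is only a sketch of the proposed policy, not xnu code: ProposedSiCode, its parameters, and kHypotheticalSegvGeneralProtection are invented for illustration, and the function takes the hardware-trap and #GP determinations as inputs rather than deriving them from ut->uu_code.

```cpp
#include <signal.h>

// Hypothetical placeholder for the new SEGV_* constant requested above. No
// such constant exists today; the value is arbitrary but positive.
constexpr int kHypotheticalSegvGeneralProtection = 100;

// Models the proposed policy:
//  - signals not arising from a hardware trap get si_code 0;
//  - SIGSEGV from a #GP trap gets a positive SEGV_*-style code;
//  - every other hardware trap keeps the positive code it reports today.
int ProposedSiCode(int signo,
                   bool from_hardware_trap,
                   bool is_general_protection,
                   int current_si_code) {
  if (!from_hardware_trap) {
    return 0;
  }
  if (signo == SIGSEGV && is_general_protection) {
    return kHypotheticalSegvGeneralProtection;
  }
  return current_si_code;
}
```

With this policy, a SIGBUS sent by kill() would report 0 rather than BUS_ADRERR, and a SIGSEGV from a #GP trap would report a positive value rather than 0, so a handler checking for si_code > 0 would classify both correctly.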

Additional notes:

I tested on macOS 11.0db6 20A5364e on x86_64, and on macOS 10.15.6 19G2021 on x86_64.

The test program is based on library and test code at https://chromium.googlesource.com/crashpad/crashpad/+/refs/heads/master/util/posix/signals.cc and https://chromium.googlesource.com/crashpad/crashpad/+/refs/heads/master/util/posix/signals_test.cc.

Comments

base64(gzip(t_signals.cc))

H4sIAP4UZF8CA6UZW24bOfJfp2AUONty9HKSyWbs2APFVhIBduyR7EmCbNCguimJUL+myZbtxAbmGgvs AnuWPcqcZKtI9oOyZCuz/kjUxWJVsapYL3Y6xAtoNH36lLSE9Pe9p093/k5aU9L6SIMA/mVpGqdEuoJP IxqItueRVlx+12qPeeQFmc/Ia0Btzw5sQBTboICPpyxahoVcChsWUm/WwX9ssOa6BJM+j++AgNEyLOXR dAl2LTphSKO7UCGpvAuVPGR3oZeUL+FmEQcREFYB0mAap1zOwiqmF1K5Qhl3BUdSPpvwiJH3vQ9Hx323 P/hwPnSuGsT5Tv5RI8RnXiCvE4YgxiOZupcpTRKWuikTWSD3NFZMNDpZiUT2CezXqLfkcsYDRpzViPuk tUOePCHKyvipBDJ71wlw26jVIhoykVCPke+1Guhpd1cfkWxP3ZDt1WArYVcJ83Iv26vVxnEckI88CEYK MmQp5YL1MhlHcRhnIrh2vDgSksAOHk1iV5Jt87MBbAjpdIjeKkjKPMYXzCexnDHw7RmNyIJTMqOpf0lT RiYUpBVNIjJvRqgAjFgwojj6mhQV15E3S4E5sla75yCc0yA08jWm02iqD72ZplzgES0+EtQjNMGcFZz9 OYkjcvXqJXG02nAbj8ho8O582Dsj40wS6i9o5OECnAAWQX+ZJ3kcaWJJDGRYCgJcglAkiiWcOfa8LEXS sIXDj8sIzMsiWJFZGiGtSRqHiqDWuqYFyvEDlrZr8Kk1jOYBDPAUo+DWgeDKUvGeheTF4L42FoL2apry aQSaA3YiZyiIB5YYMwIRAuwrwUJKpCW7KK0iZspaStPGJBVfaCuQhp8XByKXoOAYuS4xWDY94ROlB3UA DgxztQouwXFIlIVjcBzUK1xfb4bKi+FixhNzntHA3daKoBFENjJijPz5x79GFb32lL3En3/8+7//ySIf yH0a9s4Ra0hxhQYM1jpv2s/aL8q95DCOPJZI3IdOcXY6Gnxq77Sedbuvmprws+7OK9L3ORJpgo6mcDQU 0Gi/rRQyg/gEwcgDimM2owsOMd6PmVBnyuAE6hzm2liHUdrP7VXqcelCiFl8SbIE/E+CW2tFxhNNr9sG 04PivNNRU6l5mQGYQUc6Xzl7xBagHRDKzy+UEaxyiAUNMqX/Lpw4pHAjtNJ3uu2dn9ovNf5VlLVe7vz0 vL3zYqe90xkLvzNnaaT+Qf9teyTRJ3PV/cFMAzv1BSGO8vl9vIdvLkbk5kbFUP1XLr09669bGhwfr1sa 9d/91oBYWi46SmUHcJ4qVCvyEe5xe6PPHw4Hp2vXT/qjd7+uXf31on/RX7t6PjjpD9euXoz6EOZrt7Xa IuY+6FkeMXVv3utY0Yt8E5+1055GQ6XChyP0W8qDDKIeurYE85JJFql7gg6VBT5x2RWXztzg9eHjEORq oFMDNl5V8nvGvbkJCeAt8MkkXGu1weAVVNMsUrxMfDCxDn1Rsiup3Y3LP//4p4nSPp9MuIcRQsZE0AkD V4d4FcRZEUlsUu08IEI+S1VQXJYdAuTOzzsqJuoojhSoFs/XanX1JwZXWGNhIq8Fk84Te7ktqBtSMW/s qVpgeWkS0KkAZt3Vy/nRlTu6R2+P9zYO+BAsnUJm/NUkS6I1IWIGQSLTBnpQV5ubrDMm0rw1OaJ/lQTc 4xL0nEf7SoZC1lxWk5xBgQqKBZM2ecM8muVxzLYMhro0hBr3GmoN5mWS6WCls6PBHQexN8e4o5zRpHVN DRIQsuRhyHwOIRDo+CzgC1VQ5AK2yZGGXaNvglKgVoKolkWSB2UcW/I+HW9EHusKWZgXQ90Em41UOpLn Z/YtWkolWHvRtDgWOGzIct/GWgPSf8qUAlXhARkD1aHpQEE2ncKy3zYWfnR/8VXcZCgItZ4A8gPmzoOJ 
DiGaj2Mcr7kiZDSJQt/OzHk0kwTKSDlx6lb1CH675W+Jf0T1pl1WNmtWKAZZ7X2/kDpxQi5Ujn/UqJNd Uq8reXM+FQbE2RINxUNLLFO9qPRgbTJl0dIWq0SyJbOWMGN0UTQRT6QqWXThqaSDfJxB3rSFTEBPyK5J MvPrDk9AsQGAadFItcndapFVktrMNRoqpGyeLfJ9KtOgK1wIOoX7pyw9MZJB+wB3qknqGS7uki1BXuuL dEBeGyc+yPvGAyUtNhmtA88FEznaMneIEXMZd+Gag+2bsG8dYv4XxQr1HkwjDdKEmLQJTY3ZMaQ32qEU eQ+iVsQuoeNUbkKQBmm4Cd44EwbtMUSLR6Z+c1yXpuHLF67buH/7JMkV8phFkGXvxwZ320SmhCcbqVmw 6aIifCk7NGHuBsLj/s40+YEDiGuxiWDYH+Zopgrtfxqcu297g+OLYb8swQ7RT+ygqa+JgOAPjaVTANCn VIPyrvdmeL5rYGDncZxKR3FS5kwZnesPlYor246HJ+U2U7JwHIykC8zI6heEhu+3OS0NanPpqiq9LRcu SOth4dN/vqeyzQ4JwabguRDN/XwblhVM6t3OQJWk7rDfO4byQgPXlBVqtJGmTlVVoNaCVr04pTkayecc cBhWpSICxhJHTSig5WPQarh6PPMa6kbQNqRy0PbB7m5Ir0wosciu1yJ0EKUSdZXlQQ+6TWSY+Bz1N2WS RQunfn5ydjQYljIjGpFQBLoJlbMvZ73z9+5J79PXfF1Exp8KHExGOGPh31gJbIBGtkRnS7Q/mb96JekY 99Oy/JL/gATTgZ8rQ6iyGHjeBDNMOBfIqMJtr2JUQHn9oLUMiaqtSgpZFPBoXiG/AUG9Z4Xt8QJtQzua JNglg/Dw0zGOBVVGCBymzEXtNcnZ8PQcffCI3OjfH4eD8/4dxVX/Tnpn7tlw8FsP8EA7TRC0qoyC777C RHn7Rw/pBjatOAiS84IYSq+Jv9GNUMgrCG2nTPW/CcQcFzxWvla+eZAL28jbiPt9/PD98VHp5FBiQPmm a5FJnM4dSw0If9iGuG+1Ryi6SyfW9aYiMbo4POyPRit0hsFSUpmJPUvOdAFiWnNWHPLCmqNKpSd6DxrT OgZse/gUhlL9oXixPpHq5ha/oU4R5fAKrQYVBrSpCy6wgaxMBnGKSSXx4wjaWVlMbtRsy1T8NEHLqN0+ a42vW99YCoW2iHX7g/MhvTlJYz/zmJl3tC2rA6A0Ok75Egi6/JuaacEtmzOhqUE15M2vFXUIdQRZFS30 Nd59oAg9TMDnTPUkuZQxjko9VgyDoLUCmA9g6I6kGsUVDY+en7btXIWmU/+442xSDV1KhH215NQ7dWNl xLJsrNFw+F21M4g2UIMC6AgTSKU0UjMCb8a8OblR6DdKYGwab/By3rRJLyKxVg8YDecSKWNVingwLjww L2xVVLo34CQ0gpTOoIdMQTVNbAwx/ZdWH1/rs3BRJYZDRjNUMxO/9r3pEs5eX31hfs9iyYE/KOsZ6She Od5EaS9HWKVCy8dNrWQ50OD4GB3IrsT4cyjF3Aa5uVlTnqkLD4w4zstd9GVH1WKBRca6Q1D0iNCpt4MY fKarS7dAMPhXPXmhFcti7v5Qdzaoej3qCGtPd+KLL8++2nEuYU6+tlGQRuR7g33Bqfv1/4r6qrDwqqFd YNLLg+FlymWF2c5XMK6HTeM3Fk8c76/EQaRYEYSg9vPtj/YrpG06Vw8RetBcOG4tzbWIA4hOgQqf24Sb EAC1nkp9CDzI64GCxzZfnwN/2HOVWK07cuEjQaSfqDBKxNFuuaA81wt4XTfVq49xJ43rs3SvXnX139/M /91NzrXqro4+qzL2cZLSaUj1SzLEITqNYgEaJEkmZutXcYCI47B666PPQEoPXxJa+IxJU5UvRL1yo5QX 
QmrAtOC86P78E17YtYzjxHbHTRoETftHHAkf5P5isFIWxNe++oZxapzOyeO/FqbMNLZ0rVW3KIv09Ivl zztky9djotWxuzq1G5bjG7v/NKNA7D/oWOSzMNh2i15cPATr+Q6+26jtNJ1CaFEBaRt+L7581dSw7QAn iNglqTwbO2OwBpJyFC6EQX0p0PJICW3/Ire9aaPNKKkYNavnZayWwAHcYkK8T7AjLGbcMvWgqVFMMPzV zVym3rBKzxVUJhQMpniVMe4OtZxW1VPXSKt07mybgZKbP2c0UHlYnJtotUrwZ5pVhinAEnuZGJCpTBXu EV5RVHZ+mGLFTyoU7z2rjsWVRxRTxplhRtE6k210gb1aUdXzqWYx/4Be1ot88/z/5StOJxTe9zqOwcDJ zTjktpmDgzQ04OPhSQEeZ0JDoX0vgN4s8DUUGx4A31e3f69PIJ03TaGMyOXN/V7nEH2apgYq6KsCoJlX GQUYh04ajFnDsF0XbjR6Z4q9e6uypcocwp+h97k8HBZSGoqhToFv9eUq3oRUtsC+wTzv4YiYPEFjuHAD 8ln67h07FPdlyZ+egz8t7W7jd2OpwdPclzGN2VdGq+KpyqK0xu+WnxCsGFbEl+XLnhO947Xq6a/6uFH1 Q/wzLti0AOB8FgD8zvpGj7MA4FTWNziS9Y0u1CRVyK8XA5sp+oYN+GwzRU/QADPiu/NuKWhlACgovi+W i/vWG4/BqT5r6g3FU2ZOIX/AHPVcPNiHt6d6VbkeqlfboOJwlrpLx7HfK61d2LDQHxwr5pSW64bb3JOW 46DtQeblyRgzZwbtfHTlWDkZu14hJllAcK5RLz21Ohk2gw5Msf8DzBWA+NonAAA=

