[Chrome] arm64: uncatchable spurious SIGILL received when a page fault occurs while calling a signal handler for any signal

Number: rdar://FB8922558
Date Originated: 2020-11-30
Status: Open
Resolved:
Product: macOS
Product Version: 11.0.1 20B29
Classification: Application Crash
Reproducible: Always

When the kernel wants to invoke a user-space signal handler, it writes the signal handler’s stack frame and then sets the register state to jump to the signal handler. For arm64, this happens in xnu bsd/dev/arm/unix_signal.c sendsig. (x86_64 is bsd/dev/i386/unix_signal.c sendsig, which is similar.)

I have discovered that, when a user-space signal handler is expected to be invoked, on arm64, sometimes an uncatchable SIGILL is sent to the process instead. The “bad:” label at the bottom of sendsig is the source of this SIGILL. Notably, it sets the disposition for SIGILL to SIG_DFL (the default action for this signal being to terminate the process), unblocks the signal, and sends it via psignal_locked. This results in process termination by uncatchable SIGILL, regardless of the signal originally being sent in sendsig.

This occurs whenever the copyout call in sendsig fails. The copyout call is responsible for writing the user-space signal handler’s stack frame. It should not be possible for copyout to fail in a well-behaved program where the stack frame is to be written to memory that is validly mapped in the target process. However, on arm64, if the copyout triggers a page fault, the copyout fails. Instead, the absent page should be paged in and the copyout should succeed.

This bug reproduces at random and in the wild in Go (golang) executables. We are tracking this bug at https://github.com/golang/go/issues/42774. The Go runtime makes heavy use of SIGURG being sent cross-thread within a process. The design of Go’s use of SIGURG is discussed at https://go.googlesource.com/proposal/+/master/design/24543-non-cooperative-preemption.md.

In Go’s case, the user-space signal handler uses sigaltstack and SA_ONSTACK to provide an alternate signal stack, but presumably the bug can occur even with the signal handler’s stack frame on the main thread stack, particularly if the signal handler’s stack frame would occupy space on a different page than the existing stack top. In the wild, the bug occurs most readily on a system under heavy load (particularly, memory pressure). Both “all.bash” and “go test io -run='^$' -bench=BenchmarkCopy -count=10” have been used to demonstrate the bug. In these cases, the bug reproduces spuriously and at random.

Go’s runtime also has a user-space signal handler for SIGILL, which is not invoked when this bug occurs.

In order to provide a more concrete reproduction, I’ve distilled the basic operations into a pair of C++ test programs: a “simple” version that, in my experience, reproduces 100% of the time, and a more complete version that simulates Go’s use of SIGURG. For these tests, I’ve allocated the sigaltstack stack_t region using a file-backed mmap, as pages allocated in this way are least likely to be present, can easily be removed from the unified buffer cache for testing by running purge, and are thus most likely to reproduce the bug. Note, however, that when the bug was discovered, it was with memory regions allocated with MAP_ANONYMOUS and not file-backed.
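
The core of the simple reproduction described above can be sketched roughly as follows. This is a minimal illustration and not the attached t_fault_sigaltstack_simple.cc: the function name RunHandlerOnFileBackedAltStack, the temporary file path, the 1 MiB stack size, and the choice of MAP_SHARED are all assumptions made for this sketch. It maps an untouched file-backed region, registers it with sigaltstack, installs a SIGURG handler with SA_ONSTACK, and raises the signal, so that the kernel’s write of the signal handler’s stack frame lands on a page that is not yet resident.

```cpp
// Sketch only; names, sizes, and paths are hypothetical, not from the radar.
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

namespace {

volatile sig_atomic_t g_handler_ran = 0;

// Async-signal-safe handler: only records that it was invoked.
void HandleSigurg(int) { g_handler_ran = 1; }

}  // namespace

// Returns true if the SIGURG handler ran on a file-backed alternate stack.
bool RunHandlerOnFileBackedAltStack() {
  const size_t kStackSize = 1 << 20;  // assumed size (1 MiB)

  // Back the alternate stack with a freshly created, never-touched file.
  char path[] = "/tmp/altstack.XXXXXX";
  int fd = mkstemp(path);
  if (fd < 0) return false;
  unlink(path);
  if (ftruncate(fd, kStackSize) != 0) {
    close(fd);
    return false;
  }
  void* stack = mmap(nullptr, kStackSize, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);
  close(fd);
  if (stack == MAP_FAILED) return false;

  // Register the region as this thread's alternate signal stack.
  stack_t ss = {};
  ss.ss_sp = stack;
  ss.ss_size = kStackSize;
  ss.ss_flags = 0;
  if (sigaltstack(&ss, nullptr) != 0) return false;

  // Ask for the handler to run on the alternate stack.
  struct sigaction sa = {};
  sa.sa_handler = HandleSigurg;
  sa.sa_flags = SA_ONSTACK;
  sigemptyset(&sa.sa_mask);
  if (sigaction(SIGURG, &sa, nullptr) != 0) return false;

  // The kernel must now write the handler's frame near the top of `stack`,
  // faulting on a page that has never been touched.
  assert(g_handler_ran == 0);
  raise(SIGURG);
  const bool ran = (g_handler_ran == 1);

  // Restore the default disposition and release the alternate stack.
  signal(SIGURG, SIG_DFL);
  stack_t disable = {};
  disable.ss_flags = SS_DISABLE;
  sigaltstack(&disable, nullptr);
  munmap(stack, kStackSize);
  return ran;
}
```

On an unaffected system the function would be expected to return true. On an affected arm64 system, the behavior described in this report suggests that the process would instead be terminated by SIGILL during raise, before the function returns.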

Steps to reproduce (t_fault_sigaltstack_simple):

mark@arm-and-hammer zsh% clang++ -Wall -Werror -g t_fault_sigaltstack_simple.cc -o t_fault_sigaltstack_simple
mark@arm-and-hammer zsh% ./t_fault_sigaltstack_simple; echo $?

Expected behavior:

mark@arm-and-hammer zsh% ./t_fault_sigaltstack_simple; echo $?

Observed behavior:

mark@arm-and-hammer zsh% ./t_fault_sigaltstack_simple; echo $?
zsh: illegal hardware instruction  ./t_fault_sigaltstack_simple

Steps to reproduce (t_fault_sigaltstack):

mark@arm-and-hammer zsh% clang++ -Wall -Werror -std=c++11 -g t_fault_sigaltstack.cc -o t_fault_sigaltstack
mark@arm-and-hammer zsh% ./t_fault_sigaltstack -n 8 -t 60; echo $?

NOTE: If the bug doesn’t reproduce quickly, encourage the memory used for the threads’ alternate signal stacks to be paged out by running “purge”, possibly in a loop. From another shell:

admin@arm-and-hammer zsh% while true; do sudo purge; done

Expected behavior:

mark@arm-and-hammer zsh% ./t_fault_sigaltstack -n 8 -t 60; echo $?
[…wait -t seconds…]

Observed behavior:

mark@arm-and-hammer zsh% ./t_fault_sigaltstack -n 8 -t 60; echo $?
[…wait up to -t seconds, possibly use purge as described above…]
zsh: illegal hardware instruction  ./t_fault_sigaltstack -n 8 -t 60

System information:

mark@arm-and-hammer zsh% sw_vers
ProductName:	macOS
ProductVersion:	11.0.1
BuildVersion:	20B29
mark@arm-and-hammer zsh% xcodebuild -version
Xcode 12.2
Build version 12B45b
mark@arm-and-hammer zsh% uname -m
mark@arm-and-hammer zsh% system_profiler SPHardwareDataType | grep 'Model Identifier'
      Model Identifier: ADP3,2

This occurs on shipping M1-based hardware as well.

This does not occur on x86_64 (tested 10.15.7 19H15 and 11.0.1 20B29), and does not occur when running x86_64 code on arm64 under Rosetta translation (tested 11.0.1 20B29).

Additional information:

As a workaround, it is possible to mlock the page that the kernel will write the user-space signal handler’s stack frame to. This page can generally be known when using sigaltstack and SA_ONSTACK, as it’s the top page in the stack_t region. This workaround is not complete because there are still circumstances under which a signal handler may run on the thread’s main stack, even with an alternate signal stack configured, and the page used for a signal handler’s stack frame on a thread’s main stack cannot generally be known ahead of time. The workaround is undesirable as well, because it requires wired memory of at least one page (16kB on arm64) for each thread.
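
A rough sketch of this workaround, for the sigaltstack case only, might look like the following. The function name InstallWiredAltStack and its parameters are hypothetical, and the sketch assumes the stack grows downward so that the first handler frame is written into the highest page of the region. It does nothing for handlers that end up running on the thread’s main stack.

```cpp
// Sketch only; InstallWiredAltStack is a hypothetical helper.
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

// Allocates an anonymous alternate signal stack of at least `size` bytes for
// the calling thread and tries to wire (mlock) its top page. Returns the base
// of the region, or nullptr on failure. *locked reports whether mlock worked;
// the stack is still installed if it did not.
void* InstallWiredAltStack(size_t size, bool* locked) {
  assert(locked != nullptr);
  const size_t page = static_cast<size_t>(sysconf(_SC_PAGESIZE));

  // Round up so that the top page is page-aligned.
  size = (size + page - 1) / page * page;
  if (size == 0) return nullptr;

  void* base = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANON, -1, 0);
  if (base == MAP_FAILED) return nullptr;

  // Assumption: the stack grows down, so the signal frame is written into
  // the highest page of the region. Wiring costs one page per thread.
  char* top_page = static_cast<char*>(base) + size - page;
  *locked = (mlock(top_page, page) == 0);

  stack_t ss = {};
  ss.ss_sp = base;
  ss.ss_size = size;
  ss.ss_flags = 0;
  if (sigaltstack(&ss, nullptr) != 0) {
    munmap(base, size);
    return nullptr;
  }
  return base;
}
```

Each thread that may receive signals would call this once, for example InstallWiredAltStack(64 * 1024, &locked), before any handler installed with SA_ONSTACK can run on it.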

When this bug occurs, crash reports written by ReportCrash contain some characteristics that identify this bug, or at the very least, identify that the SIGILL did not originate from a genuine hardware trap.

mark@arm-and-hammer zsh% cat Library/Logs/DiagnosticReports/t_fault_sigaltstack_simple_2020-11-30-*_arm-and-hammer.crash
Process:               t_fault_sigaltstack_simple [1285]
Code Type:             ARM-64 (Native)
Crashed Thread:        0  Dispatch queue: com.apple.main-thread

Exception Type:        EXC_CRASH (SIGILL)
Exception Codes:       0x0000000000000000, 0x0000000000000000
Exception Note:        EXC_CORPSE_NOTIFY

Termination Signal:    Illegal instruction: 4
Termination Reason:    Namespace SIGNAL, Code 0x4
Terminating Process:   t_fault_sigaltstack_simple [1285]

Application Specific Information:
dyld2 mode

In contrast to a genuine SIGILL:
 - In this case, “Exception Type” indicates EXC_CRASH (SIGILL). With a genuine illegal instruction, it will be EXC_BAD_ACCESS (SIGILL).
 - In this case, the “Exception Codes” are always both zero. With a genuine illegal instruction, codes[0] is never 0, and is nearly always EXC_ARM_UNDEFINED, represented numerically as 1. codes[1] will be the faulting instruction (which may be 0, as this is in fact an illegal instruction on arm64).
 - In this case, “Terminating Process” indicates that the signal was sent by the process itself (in actuality, it was sent by the kernel on behalf of the process). With a genuine illegal instruction, this field would show “exc handler [1285]”, indicating that the Mach exception handler server in the kernel, operating on behalf of the process, generated the signal.
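
The three distinguishing characteristics above could be checked mechanically along these lines. This is only a sketch: the struct CrashReportFields and the function LooksLikeSpuriousSigill are hypothetical names, the fields are assumed to have been copied out of the crash report by other means, and the heuristic encodes nothing beyond the three observations listed above.

```cpp
// Sketch only; the struct and function names are hypothetical.
#include <cassert>
#include <cstdint>
#include <string>

// Fields taken from a ReportCrash .crash file.
struct CrashReportFields {
  std::string exception_type;        // e.g. "EXC_CRASH (SIGILL)"
  std::uint64_t exception_codes[2];  // the two "Exception Codes" values
  std::string terminating_process;   // e.g. "exc handler [1285]"
};

// Returns true when the report matches the pattern described for this bug:
// EXC_CRASH (SIGILL), both exception codes zero, and a terminating process
// that is not the Mach exception handler.
bool LooksLikeSpuriousSigill(const CrashReportFields& r) {
  const bool crash_type = r.exception_type == "EXC_CRASH (SIGILL)";
  const bool zero_codes =
      r.exception_codes[0] == 0 && r.exception_codes[1] == 0;
  const bool from_exc_handler =
      r.terminating_process.compare(0, 11, "exc handler") == 0;
  return crash_type && zero_codes && !from_exc_handler;
}

// Checks the heuristic against the two cases described in this report.
void CheckAgainstReportedExamples() {
  const CrashReportFields spurious = {
      "EXC_CRASH (SIGILL)", {0, 0}, "t_fault_sigaltstack_simple [1285]"};
  const CrashReportFields genuine = {
      "EXC_BAD_ACCESS (SIGILL)", {1, 0}, "exc handler [1285]"};
  assert(LooksLikeSpuriousSigill(spurious));
  assert(!LooksLikeSpuriousSigill(genuine));
}
```

A match only indicates that the SIGILL did not come from a genuine hardware trap; it does not by itself prove that the failed copyout in sendsig was the cause.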

These conditions exist in crash reports for t_fault_sigaltstack too. In that case, the crash report may blame either the thread that sent the signal or the thread that was to receive the signal as the crash thread.

mark@arm-and-hammer zsh% cat Library/Logs/DiagnosticReports/t_fault_sigaltstack_2020-11-30-*_arm-and-hammer.crash
Crashed Thread:        0  Dispatch queue: com.apple.main-thread
Thread 0 Crashed:: Dispatch queue: com.apple.main-thread
0   libsystem_kernel.dylib              0x0000000191e4fcec __pthread_kill + 8
1   libsystem_pthread.dylib             0x0000000191e80c24 pthread_kill + 292
2   t_fault_sigaltstack                 0x0000000102b311a4 main + 816 (t_fault_sigaltstack.cc:268)
3   libdyld.dylib                       0x0000000191e9cf54 start + 4

mark@arm-and-hammer zsh% cat Library/Logs/DiagnosticReports/t_fault_sigaltstack_2020-11-30-*_arm-and-hammer.crash
Crashed Thread:        1
Thread 0:: Dispatch queue: com.apple.main-thread
0   libsystem_kernel.dylib              0x000000018298bcec __pthread_kill + 8
1   libsystem_pthread.dylib             0x00000001829bcc24 pthread_kill + 292
2   t_fault_sigaltstack                 0x00000001003b91a4 main + 816 (t_fault_sigaltstack.cc:268)
3   libdyld.dylib                       0x00000001829d8f54 start + 4

Thread 1 Crashed:
0   libsystem_kernel.dylib              0x0000000182983d24 semaphore_wait_trap + 8
1   libdispatch.dylib                   0x0000000182810988 _dispatch_sema4_wait + 28
2   libdispatch.dylib                   0x0000000182811050 _dispatch_semaphore_wait_slow + 132
3   t_fault_sigaltstack                 0x00000001003b9844 (anonymous namespace)::Semaphore::Wait() + 32 (t_fault_sigaltstack.cc:62)
4   t_fault_sigaltstack                 0x00000001003b9744 (anonymous namespace)::ThreadMain(void*) + 540 (t_fault_sigaltstack.cc:154)
5   libsystem_pthread.dylib             0x00000001829bd06c _pthread_start + 320
6   libsystem_pthread.dylib             0x00000001829b7da0 thread_start + 8

This bug was discovered during the course of updating golang to run on mac-arm64, tracked at https://github.com/golang/go/issues/42774


This is similar to FB8759414 (https://github.com/golang/go/issues/41702). Although that bug affects x86_64 as well, and the conditions to trigger it are different, the same uncatchable spurious SIGILL is being sent.




