Write to system register S3_6_C15_C1_5 (a.k.a. SPRR_PERM_EL0) in pthread_jit_write_protect_np can rarely fail

Originator: lewurm
Number: rdar://FB10500605
Date Originated: 06/29/2022
Status: Fixed
Resolved:
Product: macOS
Product Version:
Classification:
Reproducible:
 
We observe a rare crash (SIGTRAP) in pthread_jit_write_protect_np() in our CI setup for TruffleRuby [1]. It runs on HotSpot (OpenJDK), and the stack trace looks like this:


    Thread 0 Crashed:: Dispatch queue: com.apple.main-thread
    0   libsystem_pthread.dylib         0x000000018515c6f0 pthread_jit_write_protect_np + 516
    1   libjvm.dylib                    0x000000010119e394 Threads::create_vm(JavaVMInitArgs*, bool*) + 140
    2   libjvm.dylib                    0x0000000100d5aa04 JNI_CreateJavaVM + 120
    3   ruby                            0x0000000100699260 main + 892
    4   libdyld.dylib                   0x0000000185179430 start + 4


Disassembling pthread_jit_write_protect_np() we see the following instruction at offset 516:

    (lldb) dis -n pthread_jit_write_protect_np
    libsystem_pthread.dylib`pthread_jit_write_protect_np:
    0x19bb34f5c <+0>:   pacibsp
    0x19bb34f60 <+4>:   stp    x29, x30, [sp, #-0x10]!
    0x19bb34f64 <+8>:   mov    x29, sp
    [...]
    0x19bb35160 <+516>: brk    #0x1

The brk instruction is what raises the SIGTRAP. So how do we get here?

    0x19bb34fe4 <+136>: movk   x0, #0xc110
    0x19bb34fe8 <+140>: movk   x0, #0xffff, lsl #16
    0x19bb34fec <+144>: movk   x0, #0xf, lsl #32
    0x19bb34ff0 <+148>: movk   x0, #0x0, lsl #48
    0x19bb34ff4 <+152>: ldr    x0, [x0]
    0x19bb34ff8 <+156>: msr    S3_6_C15_C1_5, x0
    0x19bb34ffc <+160>: isb
    0x19bb35000 <+164>: movk   x1, #0xc110
    0x19bb35004 <+168>: movk   x1, #0xffff, lsl #16
    0x19bb35008 <+172>: movk   x1, #0xf, lsl #32
    0x19bb3500c <+176>: movk   x1, #0x0, lsl #48
    0x19bb35010 <+180>: ldr    x9, [x1]
    0x19bb35014 <+184>: mrs    x10, S3_6_C15_C1_5
    0x19bb35018 <+188>: b      0x19bb350ac               ; <+336>
    [...]
    0x19bb350ac <+336>: cmp    x9, x10
    0x19bb350b0 <+340>: b.ne   0x19bb35160               ; <+516>
    [...]
    0x19bb35160 <+516>: brk    #0x1

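So the verification fails: the value read back from S3_6_C15_C1_5 (x10) does not match the value that was just written, which is reloaded from the same address, 0xfffffc110, into x9. The b.ne is therefore taken and execution lands on the brk.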

We managed to replace pthread_jit_write_protect_np() with a custom implementation that retries the write to S3_6_C15_C1_5 until it succeeds. However, it looks like a context switch must happen between retries; my guess is that whatever the kernel does during the switch “fixes up” the situation.
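
A minimal sketch of that retry loop in C, for illustration only. It is not the code from our patch: the names perm_reg_ops, set_perm_with_retry and the fake_* helpers are hypothetical, the register access is hidden behind callbacks so the control flow can run on any machine, and using sched_yield() to encourage a context switch between attempts is an assumption (it does not guarantee one).

```c
#include <assert.h>
#include <sched.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical indirection over the permission register. On Apple silicon
 * the callbacks would wrap `msr S3_6_C15_C1_5, x` followed by `isb`, and
 * `mrs x, S3_6_C15_C1_5`; here they are pluggable for testing. */
typedef struct {
    void (*write_reg)(uint64_t value);
    uint64_t (*read_reg)(void);
} perm_reg_ops;

/* Write `wanted`, read it back, and retry on mismatch instead of trapping.
 * Assumption: yielding gives the kernel a chance to run between attempts. */
static bool set_perm_with_retry(const perm_reg_ops *ops, uint64_t wanted,
                                int max_attempts)
{
    assert(ops != NULL && max_attempts > 0);
    for (int attempt = 0; attempt < max_attempts; attempt++) {
        ops->write_reg(wanted);
        if (ops->read_reg() == wanted) {
            return true;        /* read-back matches: the write stuck */
        }
        sched_yield();          /* let the scheduler run before retrying */
    }
    return false;               /* caller decides whether to abort */
}

/* Simulated register that silently drops the first few writes. */
static uint64_t fake_value;
static int fake_writes_to_drop;

static void fake_write(uint64_t value)
{
    if (fake_writes_to_drop > 0) {
        fake_writes_to_drop--;  /* this write is lost */
        return;
    }
    fake_value = value;
}

static uint64_t fake_read(void)
{
    return fake_value;
}

static const perm_reg_ops fake_ops = { fake_write, fake_read };
```

With the simulated register, a caller would set fake_writes_to_drop to the number of writes that should be lost and then call set_perm_with_retry(&fake_ops, value, max_attempts). Bounding the number of attempts keeps a persistent failure visible instead of spinning forever.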

Here is our workaround for HotSpot with more details on the issue: https://gist.githubusercontent.com/lewurm/3ae189f55de13621708aefb52d12fe1d/raw/09f70b66d91c7961b7229f9be3f76ac355c95bf4/jit-protect.patch

Unfortunately we have not been able to come up with a reproducer outside of our CI setup.

We are observing this on Macmini9,1 machines.


[1] https://github.com/oracle/truffleruby/

Comments

This was fixed in macOS 13.0 Beta 3 (22A5295i). The cause was a race in the kernel's process setup.

Some hints can be found in an XNU source update: https://github.com/apple-oss-distributions/xnu/blob/1031c584a5e37aff177559b9f69dbd3c8c3fd30a/bsd/kern/kern_exec.c#L4081-L4099

This is also tracked as rdar://96307913

I managed to come up with a reproducer that also works on macOS 12.4 on an M1 Pro: https://gist.github.com/lewurm/40fb8f7edb81f5e715ee6c7217feed32

Crash report: https://gist.github.com/lewurm/74f5a8c2b2291756e64b49070aea68d7

