APFS: Something is changing timestamps (last-modified time) of executables and dylibs, triggering build systems to rebuild unnecessarily

Originator:mark
Number:rdar://32928868 Date Originated:2017-06-22
Status:Closed Resolved:2017-08-01
Product:macOS + SDK Product Version:10.13db2 17A291j
Classification:Crash/Hang/Data Loss Reproducible:Sometimes
 
I’ve found that something is changing the timestamps (last-modified time) of executables and dylibs on an APFS filesystem. This does not occur on HFS+.

Like many other build systems, Chrome’s build system tracks whether a particular build output is up to date based on whether its timestamp is more recent than the timestamps of any of the things that it depends on.

I’ve experienced timestamps advance before my very eyes. It’s particularly alarming to do a build, not change anything, and find that an attempt to rebuild, which shouldn’t need to do any work at all, wants to do several thousand things.

So far, I’ve only caught timestamps advancing. This causes an over-build problem. I don’t know if timestamps ever move backwards, although this would have the potential to be a larger problem, because it could mask the fact that files had changed and cause downstream dependencies to not be rebuilt when they ought to be (an under-build problem).

I’m classifying this as data loss because timestamps are metadata that do have significance.

Steps to Reproduce:
Preconditions: Be sure that Xcode is installed. Be sure that it’s been launched, that you’ve accepted its license, and that it’s had the opportunity to install any components necessary. Be sure that xcode-select points to it.

Check out and build Chromium. (A more complete guide to building Chromium is at https://chromium.googlesource.com/chromium/src/+/master/docs/mac_build_instructions.md.)

% git clone https://chromium.googlesource.com/chromium/tools/depot_tools.git
% PATH="${PATH}:$(pwd)/depot_tools"
% mkdir chromium
% cd chromium
% fetch chromium
[…wait…]
% cd src
% gn gen out/debug
Done. Made 6755 targets from 1296 files in 4569ms
% ninja -C out/debug chrome
ninja: Entering directory `out/debug'
[…wait…]
[29471/29471] STAMP obj/chrome/chrome.stamp

Then, without making any changes, build it again.

% ninja -C out/debug chrome

Expected Results:
The second attempt to build (the rebuild) should find that there’s nothing to do, and exit quickly. This is what happens when the build is on an HFS+ filesystem.

% ninja -C out/debug chrome
ninja: Entering directory `out/debug'
ninja: no work to do.

Observed Results:
The second (and often third, fourth, …) attempts to build find that there’s work to do, and rebuilds various files, along with everything “downstream” from the files it’s rebuilt. This can result in thousands of recompiles.

litterbox@litterbox zsh% ninja -C out/debug chrome
ninja: Entering directory `out/debug'
[2700/2700] STAMP obj/chrome/chrome.stamp

Use the -n argument to ninja to do a dry run (ninja -n -C out/debug chrome), which will tell you whether it’s planning on rebuilding anything without actually doing the rebuild. Use the “-d explain” option to get ninja to tell you why it’s decided to rebuild something. For example:

% ninja -d explain -n -C out/debug chrome
[…]
ninja explain: output obj/net/http/generate_transport_security_state.inputdeps.stamp older than most recent input transport_security_state_generator (1498153036 vs 1498153760)
[…]

In this case, transport_security_state_generator is an executable that’s run to produce transport_security_state_static.h. You can see this in the build description at https://chromium.googlesource.com/chromium/src/+/526c7ebbc8501df55df5ecb0ee90e08368ee6852/net/http/BUILD.gn#7.  generate_transport_security_state.inputdeps.stamp is an output whose timestamp is updated after transport_security_state_static.h is updated. We can look at all of the files involved:

litterbox@litterbox zsh% ls -lT out/debug/transport_security_state_generator out/debug/obj/net/http/generate_transport_security_state.inputdeps.stamp out/debug/gen/net/http/transport_security_state_static.h net/http/transport_security_state_static.json net/http/transport_security_state_static.pins net/http/transport_security_state_static.template
-rw-r--r--  1 litterbox  staff  2469636 Jun 21 15:58:09 2017 net/http/transport_security_state_static.json
-rw-r--r--  1 litterbox  staff    90854 Jun 15 12:14:10 2017 net/http/transport_security_state_static.pins
-rw-r--r--  1 litterbox  staff     2013 Jun 15 12:14:10 2017 net/http/transport_security_state_static.template
-rw-r--r--  1 litterbox  staff  1649554 Jun 22 13:37:20 2017 out/debug/gen/net/http/transport_security_state_static.h
-rw-r--r--  1 litterbox  staff        0 Jun 22 13:37:16 2017 out/debug/obj/net/http/generate_transport_security_state.inputdeps.stamp
-rwxr-xr-x  1 litterbox  staff   580756 Jun 22 13:49:20 2017 out/debug/transport_security_state_generator

Sure enough, transport_security_state_generator appears newer than transport_security_state_static.h and generate_transport_security_state.inputdeps.stamp, and since it has appeared to change, it must be run again to produce its output. But its timestamp, 13:49:20, isn’t correct! The file was created at 13:37:16 and no part of the build system changed it.

litterbox@litterbox zsh% GetFileInfo out/debug/transport_security_state_generator
file: "/chrome/chrome/src/out/debug/transport_security_state_generator"
type: "\0\0\0\0"
creator: "\0\0\0\0"
attributes: avbstclinmedz
created: 06/22/2017 13:37:16
modified: 06/22/2017 13:49:20

Version:
10.13db2 17A291j with Xcode 9b2 9M137d.
I also experienced this with 10.13db1 17A264c with Xcode 9b1 9M136h.

Notes:
Running “ls -ltT out/debug” is informative. If you run it right after the build, wait a while without doing anything, and then run it again, you’ll spot timestamps advancing on executables and dylibs. For example, when I first finished the initial build, I had

-rwxr-xr-x  1 litterbox  staff  21302808 Jun 22 13:49 mksnapshot

Without doing any rebuild at all or doing anything else that would cause this file to be touched, a half hour later, I now have

-rwxr-xr-x  1 litterbox  staff  21302808 Jun 22 14:13 mksnapshot

You’ll also note that each dylib in out/debug is accompanied by a .TOC file. The .TOC file is always updated immediately after its corresponding .dylib is linked, so the timestamp of any .TOC file should always be at least as fresh as its companion .dylib. Yet there are numerous examples such as

-rwxr-xr-x   2 litterbox  staff   10772916 Jun 22 14:02 libbase.dylib
-rw-r--r--   1 litterbox  staff     342086 Jun 22 13:36 libbase.dylib.TOC

Configuration:
This only occurs on an APFS volume. When I do the build on an HFS+ volume, I don’t see this problem.

Comments

2017-08-01 19:58 UTC from Apple

Thanks for your update. We are closing this bug report.

For any issues remaining in Bug ID 32934061, please file a new bug report.

If you have questions regarding the resolution of this issue, please update your bug report with that information.

Please be sure to regularly check new Apple releases for any updates that might affect this issue.

2017-08-01 15:02 UTC to Apple

I can no longer reproduce this bug in 10.11db4 17A315i. Thanks.

However, bug 32934061 was erroneously duped into this one (32928868). I am still experiencing the performance problems described in bug 32934061 (operations on APFS are slower than equivalents on HFS+).

You can close bug 32928868, but please reopen bug 32934061.

2017-07-24 20:22 from Apple

We believe this issue has been resolved in the latest macOS 10.13 beta.

Please test with the latest beta. If you still have issues, please update your bug report with any relevant logs or information that could help us investigate.

macOS https://developer.apple.com/download/

2017-07-10 20:20 from Apple

A solution is under investigation. We will follow up with you again when it is available.

2017-07-07 16:19 from Apple

A solution is under investigation. We will follow up with you again when it is available.

2017-06-30 14:38 UTC to Apple

I believe that this can be mitigated by having ld call msync().

I’m using Xcode 9b2 9M137d with ld64-301.1. ld::tool::OutputFile::writeOutputFile() normally uses mmap() when writing a regular file to an HFS+ or APFS filesystem. It does not call msync(). This should be done before the rename() call.

msync(wholeBuffer, fileSize, MSASYNC); // or MS_SYNC

This may also improve the situation in bug 32926460 (now marked a duplicate of 32747641).

Because APFS behaves differently than HFS+ in this regard, unless APFS is changed to behave more like HFS+, it would be prudent to audit all uses of mmap(…, PROT_WRITE, MAP_SHARED, …) in the developer tools and elsewhere on the system to ensure that either msync() is used or determined to be unnecessary, and a note should be published advising developers of this behavior change, such as in the 10.13 release notes for developers.

2017-06-29 22:17 UTC to Apple

It appears that the timestamps are updated as mapped files are paged out to the filesystem. I determined this based on the fs_usage traces and was able to devise a much more reliable reproduction by writing a program to waste memory by allocating and zeroing blocks, forcing the kernel to page things out.

To produce the logs in the attached 32928868.1.tar.bz2, I did a full build of Chrome (output in ninja.1) and then ran the waste program to waste memory for 30 seconds. All the while, fs_usage was running to capture filesystem activity (the tail end of fs_usage’s output just from while waste was running, excluding the full build of Chrome, is in fs_usage.tail). At the conclusion, I asked ninja, the build tool, to say what it wanted to rebuild and why (ninja.2). There are also files containing timestamps of everything in the build directory from before (find.2) and after (find.3) waste was running. All source code is in the archive, and test.sh is the main test driver.

Looking at ninja.2, you can determine which files’ timestamps are newer than they ought to be:

litterbox@litterbox zsh% sed -Ene 's/^.most recent input (.) (.*$/1/p' ninja.2 | sort | uniq Helpers/crashpad_handler brotli crashpad_handler flatc mksnapshot protoc transport_security_state_generator

By looking for these files in find.2 and find.3, you can see the timestamps at the time that “ninja” completed and the time that “waste” completed, and that some of them changed during this window, even though the build was done by then:

before waste after waste delta path 1498771151.302104 1498771151.302104 0.000000 Helpers/crashpad_handler 1498771145.349302 1498771145.349302 0.000000 brotli 1498771151.302104 1498771151.302104 0.000000 crashpad_handler 1498771146.402073 1498771146.402073 0.000000 flatc 1498770754.525862 1498771174.757142 420.231280 mksnapshot 1498771159.690814 1498771194.570975 34.880161 protoc 1498771146.823954 1498771146.823954 0.000000 transport_security_state_generator

For the files that ninja wanted to rebuild but that show a “delta” of 0, the files were paged out and their timestamps updated while ninja build was still running, before waste even ran. You can confirm this by looking at the timestamps in ninja.1, which give link times, after which the timestamps should not have changed:

link time path 1498770499.967061 brotli 1498770600.998057 crashpad_handler 1498770536.979447 flatc 1498770754.615710 mksnapshot 1498770561.685248 protoc 1498770531.198556 transport_security_state_generator

fs_usage.tail is an “fs_usage -f filesys -w -e” trace filtered to begin at 1498771173.525731 (when waste began, see the timestamps file). This is equal to 2017-06-29 21:19:33.525731 UTC. The mksnapshot and protoc timestamps were updated after this time, so fs_usage.tail can be examined for the events that caused them to be updated.

For protoc, I see

21:19:54.576518 PgOut[AT2] D=0x01a969e4 B=0x100000 /dev/disk1s1 out/debug_goma/protoc 0.005555 W kernel_task.474

21:19:54 is 1498771194, the timestamp that protoc has been updated to have.

For mksnapshot, I see 70 similar PgOut[AT2] events starting at 21:19:34.155321 and ending at 21:19:34.757232. 21:19:34 is 1498771174, the timestamp that mksnapshot has been updated to have.

Note that the files listed are just the ones that ninja felt were relevant reasons to want to rebuild things. Many, many other files in the build output directory saw their timestamps updated after the build completed. Run “diff -u find.2 find.3” to see them. There are 23 in total, and note that they are all linker output. As before, this doesn’t reflect the full set of files whose timestamps were changed after they were linked, just those whose timestamps changed after the whole build completed. Many other files had their timestamps updated after they were written but before the build finished.

I can reproduce this problem in the absence of the waste program, but I’ve found that it’s a much more reliable reproduction when using that program to encourage page-outs.

I have the full fs_usage log which I’m not uploading because it’s nearly 3.5GB long, although I can provide it if you’d really like to see it.

32928868.1.tar.bz2 2.03MB

2017-06-28 16:50 UTC from Apple

Please run:

sudo fs_usage -f filesys -w

Trip the condition.

Ctrl-C and attach the output from fs_usage to the bug report.

Please provide your response or results by updating your bug report. If uploading files, please compress first.


Please note: Reports posted here will not necessarily be seen by Apple. All problems should be submitted at bugreport.apple.com before they are posted here. Please only post information for Radars that you have filed yourself, and please do not include Apple confidential information in your posts. Thank you!