hexdump -C prints nonprintable characters since macOS 12.3

Originator: mark
Number: rdar://FB9973780
Date Originated: 2022-04-04
Status: Open
Resolved:
Product: macOS
Product Version: 12.3.1 21E258
Classification: Incorrect/Unexpected Behavior
Reproducible: Always
 
This bug report concerns /usr/bin/hexdump, part of shell_cmds.

From “man hexdump”:
     -C      Canonical hex+ASCII display.  Display the input offset in
             hexadecimal, followed by sixteen space-separated, two column,
             hexadecimal bytes, followed by the same sixteen bytes in %_p format
             enclosed in ``|'' characters.
[…]
     _p          Output characters in the default character set.  Nonprinting
                 characters are displayed as a single “.”.

Within Terminal:

% sw_vers
ProductName:	macOS
ProductVersion:	12.3.1
BuildVersion:	21E258
% hexdump -C -n 64 /usr/lib/dyld | less
00000000  ca fe ba be 00 00 00 03  00 00 00 07 00 00 00 03  |<CA><FE><BA><BE>............|
00000010  00 00 40 00 00 09 e4 00  00 00 00 0e 01 00 00 07  |..@...<E4>.........|
00000020  00 00 00 03 00 0a 40 00  00 0b 4e 40 00 00 00 0e  |......@...N@....|
00000030  01 00 00 0c 80 00 00 02  00 15 c0 00 00 0b 38 30  |..........<C0>...80|
00000040
(END)

Here, <CA>, <FE>, <BA>, <BE>, <E4>, and <C0> on the rightmost side are displayed in inverse video, an indication that “less” received those bytes as 0xca, 0xfe, 0xba, 0xbe, 0xe4, and 0xc0 intact. hexdump -C is outputting nonprintable characters, in contravention of its documented and intended behavior. These characters should have appeared as dots.

Steps to reproduce:

1. Run hexdump -C on a file with nonprintable characters. Here, /usr/lib/dyld is used, in conjunction with -n 64 to limit the amount of output produced. Examine the output via something that will highlight nonprintable characters such as less, or even by piping to another instance of hexdump or od.

% hexdump -C -n 64 /usr/lib/dyld | less
or
% hexdump -C -n 64 /usr/lib/dyld | od -A x -t x1

2. Look for nonprintable characters.

Expected results: There should be no nonprintable characters.

Observed results: Nonprintable characters are apparent. These are displayed in inverse video when using “less”. Following the “od” example above, they are apparent beginning at offset 0x3d, which shows “ca” (indicating byte 0xca was output) instead of the expected “2e” (indicating that a “.” character was output).
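
The check in step 2 can also be automated. The following C sketch is not part of hexdump or of the original report, and the helper name first_non_ascii is hypothetical. It assumes that correct hexdump -C output under LANG=en_US.UTF-8 is ASCII-only, and returns the offset of the first byte that is not.

```c
#include <stddef.h>

/*
 * Hypothetical helper, not part of shell_cmds: return the offset of the
 * first byte in buf[0..len) that lies outside 7-bit ASCII, or -1 if every
 * byte is ASCII. Assumption: correct hexdump -C output in a UTF-8 locale
 * contains only ASCII bytes.
 */
long first_non_ascii(const unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (buf[i] > 0x7f)
            return (long)i;
    }
    return -1;
}
```

Applied to the first line of the affected output, such a helper would be expected to report offset 0x3d, matching the od observation above.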

Note: testing was conducted with the Terminal-default value of LANG=en_US.UTF-8. The specific behaviors here may be locale-sensitive, so please conduct testing in Terminal at its defaults, allowing LANG to be set appropriately.

This regressed in macOS 12.3 (hexdump from shell_cmds-240.100.15). I checked a system running macOS 12.2 (hexdump from shell_cmds-234) and found that its hexdump -C behaved correctly, in accordance with “Expected results” above.

Comments

2022-04-06 to Apple

Considering the most recent published open-source version of shell_cmds, shell_cmds-234 from macOS 12.0–12.2:

https://github.com/apple-oss-distributions/shell_cmds/blob/shell_cmds-234/hexdump/display.c#L182

    case F_P:
            (void)printf(pr->fmt, isprint(*bp) && isascii(*bp) ? *bp : '.');
            break;

Although source for shell_cmds-240.100.15 (macOS 12.3) has not yet been published publicly, it appears that the change in that version is to remove isascii(*bp), so that the line would read

            (void)printf(pr->fmt, isprint(*bp) ? *bp : '.');

The intent of this change would be to allow non-ASCII printable characters to appear as themselves, rather than as the placeholder character '.'. Indeed, the change does have the intended effect for locales (LC_CTYPE) that specify a single-byte character set. For example, after configuring Terminal to interpret terminal output in a single-byte text encoding (Terminal:Preferences…:(selected profile):Advanced:International:Text Encoding) by changing it from the default “Unicode (UTF-8)” to, for example, “Western (ISO Latin 1)”, and then opening a new Terminal window according to that profile:

    % echo $LANG
    en_US.ISO8859-1
    % hexdump -n 64 -C /usr/lib/dyld
    00000000  ca fe ba be 00 00 00 03  00 00 00 07 00 00 00 03  |Êþº¾............|
    00000010  00 00 40 00 00 09 e4 00  00 00 00 0e 01 00 00 07  |..@...ä.........|
    00000020  00 00 00 03 00 0a 40 00  00 0b 4e 40 00 00 00 0e  |......@...N@....|
    00000030  01 00 00 0c 80 00 00 02  00 15 c0 00 00 0b 38 30  |..........À...80|
    00000040

In ISO-8859-1, 0xca = 'Ê', 0xfe = 'þ', etc. (https://en.wikipedia.org/wiki/ISO/IEC_8859-1). As it is a single-byte encoding, it is not undesirable to treat bytes such as 0xca as printable, and output those characters in the rightmost pane of hexdump -C output. Perhaps more illustrative:

    % python3 -c "open('bytes', 'wb').write(b''.join(i.to_bytes(1, 'little') for i in range(0, 256)))"
    % hexdump -C bytes
    00000000  00 01 02 03 04 05 06 07  08 09 0a 0b 0c 0d 0e 0f  |................|
    00000010  10 11 12 13 14 15 16 17  18 19 1a 1b 1c 1d 1e 1f  |................|
    00000020  20 21 22 23 24 25 26 27  28 29 2a 2b 2c 2d 2e 2f  | !"#$%&'()*+,-./|
    00000030  30 31 32 33 34 35 36 37  38 39 3a 3b 3c 3d 3e 3f  |0123456789:;<=>?|
    00000040  40 41 42 43 44 45 46 47  48 49 4a 4b 4c 4d 4e 4f  |@ABCDEFGHIJKLMNO|
    00000050  50 51 52 53 54 55 56 57  58 59 5a 5b 5c 5d 5e 5f  |PQRSTUVWXYZ[\]^_|
    00000060  60 61 62 63 64 65 66 67  68 69 6a 6b 6c 6d 6e 6f  |`abcdefghijklmno|
    00000070  70 71 72 73 74 75 76 77  78 79 7a 7b 7c 7d 7e 7f  |pqrstuvwxyz{|}~.|
    00000080  80 81 82 83 84 85 86 87  88 89 8a 8b 8c 8d 8e 8f  |................|
    00000090  90 91 92 93 94 95 96 97  98 99 9a 9b 9c 9d 9e 9f  |................|
    000000a0  a0 a1 a2 a3 a4 a5 a6 a7  a8 a9 aa ab ac ad ae af  | ¡¢£¤¥¦§¨©ª«¬­®¯|
    000000b0  b0 b1 b2 b3 b4 b5 b6 b7  b8 b9 ba bb bc bd be bf  |°±²³´µ¶·¸¹º»¼½¾¿|
    000000c0  c0 c1 c2 c3 c4 c5 c6 c7  c8 c9 ca cb cc cd ce cf  |ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏ|
    000000d0  d0 d1 d2 d3 d4 d5 d6 d7  d8 d9 da db dc dd de df  |ÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞß|
    000000e0  e0 e1 e2 e3 e4 e5 e6 e7  e8 e9 ea eb ec ed ee ef  |àáâãäåæçèéêëìíîï|
    000000f0  f0 f1 f2 f3 f4 f5 f6 f7  f8 f9 fa fb fc fd fe ff  |ðñòóôõö÷øùúûüýþÿ|
    00000100

hexdump from shell_cmds-234 (macOS 12.2) would have shown '.' placeholders for the entire range [0xa0, 0xff]. In this regard, shell_cmds-240.100.15 (macOS 12.3) is an improvement.

The bug reported here arises because this doesn’t work so well for locales using a multi-byte character set, including the default, which uses UTF-8. After resetting the text encoding in Terminal’s preferences to its default, “Unicode (UTF-8)”, hexdump -C outputs bytes in the range [0xa0, 0xff] as-is instead of as placeholders. The issue is that Terminal, or any other program properly configured to interpret UTF-8 (via the LANG, LC_ALL, or LC_CTYPE environment variables or some other mechanism), will not find valid UTF-8. 0xa0 is not the start of a valid UTF-8 sequence, as everything in the range [0x80, 0xbf] is only valid in UTF-8 as a continuation byte. Most bytes in the range [0xc0, 0xff] are UTF-8 lead bytes (0xc0, 0xc1, and [0xf5, 0xff] never occur in valid UTF-8 at all), but a lead byte is only valid when followed by the appropriate number of continuation bytes ([0x80, 0xbf]). Upon encountering such invalid UTF-8, Terminal writes its own '?' placeholders:

    000000a0  a0 a1 a2 a3 a4 a5 a6 a7  a8 a9 aa ab ac ad ae af  |????????????????|
    000000b0  b0 b1 b2 b3 b4 b5 b6 b7  b8 b9 ba bb bc bd be bf  |????????????????|
    000000c0  c0 c1 c2 c3 c4 c5 c6 c7  c8 c9 ca cb cc cd ce cf  |????????????????|
    000000d0  d0 d1 d2 d3 d4 d5 d6 d7  d8 d9 da db dc dd de df  |????????????????|
    000000e0  e0 e1 e2 e3 e4 e5 e6 e7  e8 e9 ea eb ec ed ee ef  |????????????????|
    000000f0  f0 f1 f2 f3 f4 f5 f6 f7  f8 f9 fa fb fc fd fe ff  |????????????????|

As in the original report, piping this output to less or od shows that hexdump itself is emitting bytes with those values:

    000000a0  a0 a1 a2 a3 a4 a5 a6 a7  a8 a9 aa ab ac ad ae af  |<A0><A1><A2><A3><A4><A5><A6><A7><A8><A9><AA><AB><AC><AD><AE><AF>|
    000000b0  b0 b1 b2 b3 b4 b5 b6 b7  b8 b9 ba bb bc bd be bf  |<B0><B1><B2><B3><B4><B5><B6><B7><B8><B9><BA><BB><BC><BD><BE><BF>|
    000000c0  c0 c1 c2 c3 c4 c5 c6 c7  c8 c9 ca cb cc cd ce cf  |<C0><C1><C2><C3><C4><C5><C6><C7><C8><C9><CA><CB><CC><CD><CE><CF>|
    000000d0  d0 d1 d2 d3 d4 d5 d6 d7  d8 d9 da db dc dd de df  |<D0><D1><D2><D3><D4><D5><D6><D7><D8><D9><DA><DB><DC><DD><DE><DF>|
    000000e0  e0 e1 e2 e3 e4 e5 e6 e7  e8 e9 ea eb ec ed ee ef  |<E0><E1><E2><E3><E4><E5><E6><E7><E8><E9><EA><EB><EC><ED><EE><EF>|
    000000f0  f0 f1 f2 f3 f4 f5 f6 f7  f8 f9 fa fb fc fd fe ff  |<F0><F1><F2><F3><F4><F5><F6><F7><F8><F9><FA><FB><FC><FD><FE><FF>|

This is incorrect. When configured to output UTF-8 (via LANG, LC_ALL, or LC_CTYPE) and when Terminal is configured to expect UTF-8, this invalid UTF-8 should never be produced.
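
To make the validity argument concrete, here is a minimal C sketch that classifies a single byte by the role it can play in UTF-8, following the byte ranges in RFC 3629. It is purely illustrative: the names utf8_role, utf8_classify, and valid_alone_in_utf8 are hypothetical and appear nowhere in shell_cmds.

```c
/* Roles a single byte can play in UTF-8 (byte ranges per RFC 3629). */
enum utf8_role {
    UTF8_ASCII,        /* 0x00-0x7f: a complete one-byte character    */
    UTF8_CONTINUATION, /* 0x80-0xbf: valid only after a lead byte     */
    UTF8_LEAD,         /* 0xc2-0xf4: must be followed by continuation */
    UTF8_INVALID       /* 0xc0, 0xc1, 0xf5-0xff: never valid          */
};

/* Hypothetical helper: classify one byte in isolation. */
enum utf8_role utf8_classify(unsigned char b)
{
    if (b <= 0x7f)
        return UTF8_ASCII;
    if (b <= 0xbf)
        return UTF8_CONTINUATION;
    if (b == 0xc0 || b == 0xc1 || b >= 0xf5)
        return UTF8_INVALID;
    return UTF8_LEAD;
}

/* A byte emitted entirely on its own is valid UTF-8 only if it is ASCII. */
int valid_alone_in_utf8(unsigned char b)
{
    return utf8_classify(b) == UTF8_ASCII;
}
```

Under this classification, no byte in [0x80, 0xff] is valid when written by itself, which is the situation in the rightmost pane, where each input byte is printed independently of its neighbors.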

That isprint returns true in a UTF-8 locale for input in the range [0xa0, 0xff] is non-portable, although not unreasonable. Valid printable Unicode code points are present in that range, but hexdump in macOS 12.3 errs in assuming that isprint(*bp) implies that printf("%c", *bp) will also be valid. In a multi-byte character set such as UTF-8, the code point (such as 0xbf) need not have a single-byte encoded representation: in this case, the sequence 0xc2 0xbf is the valid UTF-8 encoding of Unicode code point 0xbf, '¿'.
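
The difference between a code point and its encoded form can be shown with a small encoder. This is a sketch for illustration only; utf8_encode_small is a hypothetical name, and it handles only code points below U+0800, which is enough to cover the range [0x00, 0xff] discussed here.

```c
#include <stddef.h>

/*
 * Hypothetical helper: encode a code point below U+0800 as UTF-8.
 * Writes 1 or 2 bytes to out and returns the number of bytes written,
 * or 0 if cp is outside the range this sketch supports.
 */
size_t utf8_encode_small(unsigned int cp, unsigned char out[2])
{
    if (cp < 0x80) {
        /* ASCII: the code point is its own one-byte encoding. */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        /* Two bytes: 110xxxxx 10xxxxxx. */
        out[0] = (unsigned char)(0xc0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3f));
        return 2;
    }
    return 0;
}
```

For code point 0xbf this yields the two bytes 0xc2 0xbf, whereas printf("%c", 0xbf) writes the single byte 0xbf, which is not valid UTF-8 on its own.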

While it would be possible to make hexdump output valid UTF-8 in this case by changing the printf format string from "%c" to "%lc" or "%C" (the two are equivalent; this could be achieved via https://github.com/apple-oss-distributions/shell_cmds/blob/shell_cmds-234/hexdump/parse.c#L369) and casting the *bp argument to (wchar_t) (to avoid -Wformat warnings), I consider this inadvisable. The intent of the rightmost pane in hexdump -C output is to present the file in its original form as nearly as possible, and in a byte-oriented fashion. Where the file’s contents cannot be interpreted in a character-per-byte fashion, it is not appropriate to attempt to interpret them according to an arbitrary encoding and output multi-byte sequences. Even if the file contents are UTF-8-encoded, it is not appropriate for hexdump to output these multi-byte sequences, because that would violate the byte-oriented spirit, making it more difficult to make sense of the rightmost pane, which is expected to contain precisely 16 characters (possibly including '.' placeholders) corresponding to the 16 bytes interpreted.

Having established that the removal of isascii is valuable for single-byte encodings, though, an alternative solution is available in btowc.

https://pubs.opengroup.org/onlinepubs/9699919799/functions/btowc.html:

    > The btowc() function shall return WEOF if c has the value EOF or if (unsigned
    > char) c does not constitute a valid (one-byte) character in the initial shift
    > state. Otherwise, it shall return the wide-character representation of that
    > character.

As btowc can distinguish between code points valid as single-byte encodings and those that are not, it can be used to determine whether printf("%c", *bp) will produce valid output. Thus:

            (void)printf(pr->fmt, isprint(*bp) && btowc(*bp) != WEOF ? *bp : '.');

would be a viable fix at https://github.com/apple-oss-distributions/shell_cmds/blob/shell_cmds-234/hexdump/display.c#L182 (in shell_cmds-240.100.15, where I expect the isascii of shell_cmds-234 has been removed). It retains the desirable properties of removing the artificial ASCII-only restriction and allowing printable characters at *bp to pass as-is when they are valid single-byte encodings, but improves upon it by avoiding outputting bad data for *bp that, according to the current locale as set by LANG, LC_ALL, or LC_CTYPE, cannot be expressed with a single-byte encoding.
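
The proposed expression can be exercised in isolation. The sketch below wraps it in a hypothetical helper, pane_char, which does not exist in shell_cmds; it only mirrors the expression suggested above for the F_P case. It assumes the caller has already selected a locale, for example with setlocale(LC_CTYPE, ""), since both isprint and btowc depend on the current LC_CTYPE.

```c
#include <ctype.h>
#include <wchar.h>

/*
 * Hypothetical helper mirroring the proposed F_P expression: return the
 * byte itself when it is printable AND is a complete single-byte
 * character in the current locale, otherwise the '.' placeholder.
 */
int pane_char(unsigned char b)
{
    return (isprint(b) && btowc(b) != WEOF) ? b : '.';
}
```

With a single-byte locale such as en_US.ISO8859-1, a byte like 0xca would be expected to pass through unchanged, while with en_US.UTF-8 btowc should return WEOF for it and the placeholder would be printed instead. If no locale has been set, the program runs in the "C" locale, where only printable ASCII passes through.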

