hexdump -C prints nonprintable characters since macOS 12.3
Originator: mark
Number: rdar://FB9973780
Date Originated: 2022-04-04
Status: Open
Resolved:
Product: macOS
Product Version: 12.3.1 21E258
Classification: Incorrect/Unexpected Behavior
Reproducible: Always
This bug report concerns /usr/bin/hexdump, part of shell_cmds. From “man hexdump”:

    -C    Canonical hex+ASCII display. Display the input offset in hexadecimal, followed by sixteen space-separated, two column, hexadecimal bytes, followed by the same sixteen bytes in %_p format enclosed in ``|'' characters.
    […]
    _p    Output characters in the default character set. Nonprinting characters are displayed as a single “.”.

Within Terminal:

    % sw_vers
    ProductName: macOS
    ProductVersion: 12.3.1
    BuildVersion: 21E258
    % hexdump -C -n 64 /usr/lib/dyld | less
    00000000  ca fe ba be 00 00 00 03  00 00 00 07 00 00 00 03  |<CA><FE><BA><BE>............|
    00000010  00 00 40 00 00 09 e4 00  00 00 00 0e 01 00 00 07  |..@...<E4>.........|
    00000020  00 00 00 03 00 0a 40 00  00 0b 4e 40 00 00 00 0e  |......@...N@....|
    00000030  01 00 00 0c 80 00 00 02  00 15 c0 00 00 0b 38 30  |..........<C0>...80|
    00000040
    (END)

Here, <CA>, <FE>, <BA>, <BE>, <E4>, and <C0> on the rightmost side are displayed in inverse video, an indication that “less” received those bytes as 0xca, 0xfe, 0xba, 0xbe, 0xe4, and 0xc0 intact. hexdump -C is outputting nonprintable characters, in contravention of its documented and intended behavior. These characters should have appeared as dots.

Steps to reproduce:
1. Run hexdump -C on a file with nonprintable characters. Here, /usr/lib/dyld is used, in conjunction with -n 64 to limit the amount of output produced. Examine the output via something that will highlight nonprintable characters, such as less, or by piping to another instance of hexdump or od:
    % hexdump -C -n 64 /usr/lib/dyld | less
   or
    % hexdump -C -n 64 /usr/lib/dyld | od -A x -t x1
2. Look for nonprintable characters.

Expected results: There should be no nonprintable characters.

Observed results: Nonprintable characters are apparent. These are displayed in inverse video when using “less”. Following the “od” example above, they are apparent beginning at offset 0x3d, which shows “ca” (indicating byte 0xca was output) instead of the expected “2e” (indicating that a “.” character was output).

Note: testing was conducted with the Terminal-default value of LANG=en_US.UTF-8. The specific behaviors here may be locale-sensitive, so please conduct testing in Terminal at its defaults, allowing LANG to be set appropriately.

This regressed in macOS 12.3 (hexdump from shell_cmds-240.100.15). I checked a system running macOS 12.2 (hexdump from shell_cmds-234) and found that its hexdump -C behaved correctly, in accordance with “Expected results” above.
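Step 2 of the reproduction ("look for nonprintable characters") can also be done mechanically on captured output. The following is a minimal sketch in C, not part of hexdump or shell_cmds; the helper name first_suspect_offset is hypothetical. It assumes the output of hexdump -C has been read into a buffer, and it reports the offset of the first byte that is neither printable ASCII nor a newline.

```c
#include <stddef.h>

/*
 * Hypothetical helper (not from shell_cmds): scan captured `hexdump -C`
 * output and return the offset of the first byte that is neither printable
 * ASCII (0x20..0x7e) nor a newline. Returns -1 if every byte is acceptable.
 */
long first_suspect_offset(const unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        unsigned char b = buf[i];
        if (b != '\n' && (b < 0x20 || b > 0x7e))
            return (long)i;
    }
    return -1;
}
```

For the first output line in this report, such a scan would stop at offset 0x3d, the first position inside the rightmost pane, which is the same offset the od example points to.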
Comments
2022-04-06 to Apple
Considering the most recent published open-source version of shell_cmds, shell_cmds-234 from macOS 12.0–12.2:
https://github.com/apple-oss-distributions/shell_cmds/blob/shell_cmds-234/hexdump/display.c#L182
Although source for shell_cmds-240.100.15 (macOS 12.3) has not yet been published, it appears that the change in that version is to remove isascii(*bp), so that the line would rely on isprint(*bp) alone.
The intent of this change would be to allow non-ASCII printable characters to appear as themselves, rather than as the placeholder character '.'. Indeed, the change does have the intended effect for locales (LC_CTYPE) that specify a single-byte character set. For example, after changing Terminal's text encoding (Terminal:Preferences…:(selected profile):Advanced:International:Text Encoding) from the default “Unicode (UTF-8)” to a single-byte encoding such as “Western (ISO Latin 1)” and opening a new Terminal window with that profile, these bytes appear as characters in the rightmost pane.
In ISO-8859-1, 0xca = 'Ê', 0xfe = 'þ', etc. (https://en.wikipedia.org/wiki/ISO/IEC_8859-1). As it is a single-byte encoding, it is not undesirable to treat bytes such as 0xca as printable, and to output those characters in the rightmost pane of hexdump -C output. Perhaps more illustrative: hexdump from shell_cmds-234 (macOS 12.2) would have shown '.' placeholders for the entire range [0xa0, 0xff]. In this regard, shell_cmds-240.100.15 (macOS 12.3) is an improvement.
The bug reported here arises because this doesn’t work so well for locales using a multi-byte character set, including the default, which uses UTF-8. After resetting the text encoding in Terminal’s preferences to its default, “Unicode (UTF-8)”, hexdump -C outputs bytes within the range [0xa0, 0xff] as-is instead of as placeholders. The issue is that Terminal, or any other program properly configured to interpret UTF-8 (via the LANG, LC_ALL, or LC_CTYPE environment variables or some other mechanism), will not find valid UTF-8. 0xa0 is not the start of a valid UTF-8 sequence, as everything in the range [0x80, 0xbf] is only valid in UTF-8 as a continuation byte. The bytes in the range [0xc2, 0xf4] are valid UTF-8 lead bytes, but only when followed by continuation bytes ([0x80, 0xbf]); 0xc0, 0xc1, and [0xf5, 0xff] never appear in valid UTF-8 at all. Upon encountering such invalid UTF-8, Terminal writes its own '?' placeholders.
The illustrative example from the original report, piping this output to less or od, shows that bytes with those values are being output from hexdump.
This is incorrect. When configured to output UTF-8 (via LANG, LC_ALL, or LC_CTYPE) and when Terminal is configured to expect UTF-8, this invalid UTF-8 should never be produced.
That isprint returns true in a UTF-8 locale for input in the range [0xa0, 0xff] is non-portable, although not unreasonable. Valid printable Unicode code points are present in that range, but hexdump in macOS 12.3 errs in assuming that isprint(*bp) implies that printf("%c", *bp) will also be valid. In a multi-byte character set such as UTF-8, the code point (such as 0xbf) need not have a single-byte encoded representation: in this case, the sequence 0xc2 0xbf is the valid UTF-8 encoding of Unicode code point 0xbf, '¿'.
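To make the distinction between a code point and its encoded bytes concrete, here is a minimal sketch in C. It is an illustration only, not code from hexdump; the function names utf8_encode_small and is_utf8_continuation are hypothetical. The encoder covers only code points below 0x800, which is enough for the [0xa0, 0xff] range discussed here.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: encode a code point below 0x800 as UTF-8.
 * Writes 1 or 2 bytes into out and returns the count, or 0 if the
 * code point is outside the range this sketch handles.
 */
size_t utf8_encode_small(uint32_t cp, unsigned char out[2])
{
    if (cp < 0x80) {                 /* ASCII: one byte, unchanged */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                /* two bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xc0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3f));
        return 2;
    }
    return 0;                        /* not handled in this sketch */
}

/* A byte of the form 10xxxxxx (0x80..0xbf) is only valid after a lead byte. */
int is_utf8_continuation(unsigned char b)
{
    return (b & 0xc0) == 0x80;
}
```

Encoding code point 0xbf this way yields the two bytes 0xc2 0xbf, whereas the raw byte 0xbf by itself is a continuation byte and cannot begin a valid sequence.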
While it would be possible to make hexdump output valid UTF-8 in this case by changing the printf format string from "%c" to "%lc" or "%C" (the two are equivalent; this could be achieved via https://github.com/apple-oss-distributions/shell_cmds/blob/shell_cmds-234/hexdump/parse.c#L369) and casting the *bp argument to (wchar_t) (to avoid -Wformat warnings), I consider this inadvisable. The intent of the rightmost pane in hexdump -C output is to present the file in its original form as nearly as possible, and in a byte-oriented fashion. Where the file’s contents cannot be interpreted in a character-per-byte fashion, it is not appropriate to attempt to interpret them according to an arbitrary encoding and output multi-byte sequences. Even if the file contents are UTF-8-encoded, it is not appropriate for hexdump to output these multi-byte sequences, because that would violate the byte-oriented spirit, making it difficult to make sense of the rightmost pane, which is expected to contain precisely 16 characters (possibly including '.' placeholders) corresponding to the 16 bytes interpreted.
Having established that the removal of isascii is valuable for single-byte encodings, though, an alternative solution is available in btowc.
Per https://pubs.opengroup.org/onlinepubs/9699919799/functions/btowc.html, btowc returns WEOF when its argument does not constitute a valid single-byte character in the initial shift state.
As btowc can distinguish between code points valid as single-byte encodings and those that are not, it can be used to determine whether printf("%c", *bp) will produce valid output. Thus, a check based on btowc would be a viable fix at https://github.com/apple-oss-distributions/shell_cmds/blob/shell_cmds-234/hexdump/display.c#L182 (in shell_cmds-240.100.15, where I expect the isascii of shell_cmds-234 has been removed). It retains the desirable properties of removing the artificial ASCII-only restriction and allowing printable characters at *bp to pass as-is when they are valid single-byte encodings, but improves upon it by avoiding outputting bad data for *bp that, according to the current locale as set by LANG, LC_ALL, or LC_CTYPE, cannot be expressed with a single-byte encoding.
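A btowc-based check of this kind might look roughly like the following. This is a sketch under stated assumptions, not the shell_cmds source and not necessarily the exact expression originally proposed: pane_char is a hypothetical helper name, and the sketch assumes the calling program has already selected its locale (for example with setlocale(LC_CTYPE, "")), so that isprint and btowc both reflect LANG, LC_ALL, or LC_CTYPE.

```c
#include <ctype.h>
#include <stdio.h>
#include <wchar.h>

/*
 * Hypothetical helper (not from shell_cmds): choose the character to show
 * in the rightmost pane for one input byte. The byte passes through only if
 * the current locale considers it printable AND it is a complete, valid
 * single-byte character (btowc does not return WEOF); otherwise '.' is used.
 */
int pane_char(unsigned char b)
{
    return (isprint(b) && btowc(b) != WEOF) ? b : '.';
}
```

In a single-byte locale such as ISO-8859-1, a byte like 0xca would be expected to pass both tests and appear as itself. In a UTF-8 locale, btowc is expected to return WEOF for every byte at or above 0x80, so the placeholder is emitted regardless of what isprint reports.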