fold: process streams as bytes, not strings, to handle non-utf8 data #8241
GNU testsuite comparison:
Thank you for your contribution!
A few questions:
- What locale did you use to perform your tests?
- Can you check there is no discrepancy with GNU's fold with LC_ALL being en_US, en_US.UTF-8, C and C.UTF-8?
A few remarks on unwraps, but otherwise it looks good to me!
As an extra remark, please squash your "clippy fixes" commit into the first one. Thanks!
ffd3838 to faa6a9b
Here's the locale I was using for all tests:
I've tried all three of my new tests with those locales and they all give the exact same output as each other (and as the GNU coreutils fold).
I also consolidated all the code changes into one commit and re-pushed, so the commit history should be simpler now.
Thank you for your contribution!
This fixes #8227 by making fold process its input as bytes rather than strings. Because it was reading input as a string, anything that wasn't valid UTF-8 (including valid Latin-1-encoded data, as in the bug report) would cause an error. GNU's fold appears to operate on bytes, so this improves compatibility as well.
I didn't need to change any tests, and I added three that work on non-UTF-8 files.
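To illustrate the idea (this is a minimal sketch, not the code in this PR), the snippet below reads each line with read_until, which works on raw bytes, instead of read_line, which fails on invalid UTF-8. The helper name fold_bytes and its signature are hypothetical, and the sketch only counts bytes: it ignores the special handling a real fold gives to tabs, backspaces, and the -s option.

```rust
use std::io::{self, BufRead, Cursor, Write};

/// Hypothetical helper (not the PR's actual code): wraps each input line
/// so that no output line holds more than `width` bytes.
fn fold_bytes<R: BufRead, W: Write>(mut input: R, mut output: W, width: usize) -> io::Result<()> {
    assert!(width > 0, "width must be positive");
    let mut line = Vec::new();
    loop {
        line.clear();
        // read_until operates on raw bytes, so invalid UTF-8 is not an error
        // (read_line would return an InvalidData error here).
        if input.read_until(b'\n', &mut line)? == 0 {
            break; // end of input
        }
        // Separate the line body from its trailing newline, if any.
        let had_newline = line.last() == Some(&b'\n');
        let body = if had_newline { &line[..line.len() - 1] } else { &line[..] };
        // Emit the body in chunks of at most `width` bytes.
        for (i, chunk) in body.chunks(width).enumerate() {
            if i > 0 {
                output.write_all(b"\n")?;
            }
            output.write_all(chunk)?;
        }
        if had_newline {
            output.write_all(b"\n")?;
        }
    }
    Ok(())
}

fn main() -> io::Result<()> {
    // "café au lait" encoded as Latin-1: the lone 0xE9 byte is not valid UTF-8.
    let latin1: &[u8] = b"caf\xe9 au lait\n";
    assert!(std::str::from_utf8(latin1).is_err());

    let mut out = Vec::new();
    fold_bytes(Cursor::new(latin1), &mut out, 4)?;
    assert_eq!(out, b"caf\xe9\n au \nlait\n");
    Ok(())
}
```

Because the width is counted in bytes, a multi-byte UTF-8 character takes up more than one unit of width in this sketch and can be split across output lines.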
There is a change in behavior that isn't covered by the tests: Unicode isn't "properly" handled anymore. Take this example, where "test.input" contains these emoji:
Before my change:
And after:
That looks like a regression, but it matches GNU fold behavior: