Skip to content

Conversation

phinjensen
Copy link
Contributor

This fixes #8227 by making fold process its input as bytes, rather than strings. Because it was reading input as a string, anything that wasn't valid UTF-8 (including valid Latin 1-encoded data, as the bug report has) would cause an error. GNU's fold appears to operate on bytes, so this improves its compatibility as well.

I didn't need to change any tests and I added three that work on non-UTF8 files.

There is a change in behavior that isn't covered by the tests: Unicode isn't "properly" handled anymore. Take this example, where "test.input" contains these emoji:

🐕‍🦺🐕‍🦺🐕‍🦺

Before my change:

% fold -w1 test.input
🐕

🦺
🐕

🦺
🐕

🦺

And after:

% fold -w1 test.input

































That looks like a regression, but it matches GNU fold behavior:

% diff <(fold -w1 test.input) <(gfold -w1 test.input)
%

Copy link

GNU testsuite comparison:

Skipping an intermittent issue tests/misc/stdbuf (passes in this run but fails in the 'main' branch)

Copy link

GNU testsuite comparison:

Skip an intermittent issue tests/timeout/timeout (fails in this run but passes in the 'main' branch)
Skipping an intermittent issue tests/misc/stdbuf (passes in this run but fails in the 'main' branch)
Skipping an intermittent issue tests/misc/tee (passes in this run but fails in the 'main' branch)

Copy link
Collaborator

@RenjiSann RenjiSann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your contribution !

A few questions:

  • What locale did you use to perform your tests ?
  • Can you check there is no discrepancy with GNU's foldwith LC_ALL being en_US, en_US.UTF-8, C and C.UTF-8 ?

A few remarks on unwraps, but otherwise it looks good to me !

@RenjiSann
Copy link
Collaborator

As an extra remark, please stash your "clippy fixes" commit in the first one.
It is useful to split implementation and tests, and we shall keep both commits in the main branch, if we ever need to revert/check something, but clippy fixes are not useful to keep track of.

Thanks !

@phinjensen
Copy link
Contributor Author

* What locale did you use to perform your tests ?

Here's the locale I was using for all tests:

% locale
LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL=
* Can you check there is no discrepancy with GNU's `fold`with `LC_ALL` being `en_US`, `en_US.UTF-8`, `C` and `C.UTF-8` ?

I've tried all three of my new tests with those locales and they all give the exact same output as each other (and as the coreutils fold).

I also consolidated all the code changes into one commit and re-pushed, so the commits history should be simpler now.

@phinjensen phinjensen requested a review from RenjiSann June 23, 2025 03:09
Copy link

GNU testsuite comparison:

Skipping an intermittent issue tests/misc/stdbuf (passes in this run but fails in the 'main' branch)

@RenjiSann RenjiSann merged commit b8228fb into uutils:main Jun 23, 2025
116 of 117 checks passed
@RenjiSann
Copy link
Collaborator

Thank you for your contribution !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

fold: failed to read line: stream did not contain valid UTF-8
2 participants