Earlopain
Contributor

Closes #104, I think I found the fix myself 💪

Without this change, a UTF-8 escaped multibyte character inside a set gets split up into its individual bytes.

  # (This currently includes \^, \-, \&, \:, although these could potentially
  # be meta chars when not escaped, depending on their position in the set.)
- any > (escaped_set_alpha, 1) {
+ any > (escaped_set_alpha, 1) | utf8_multibyte > (escaped_alpha, 1) {
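
To illustrate the byte-splitting problem outside of Ragel, here is a minimal Python sketch. It assumes the scanner consumes its input byte by byte, so a rule that accepts "any" after a backslash only takes the first byte of a multibyte character. The names UTF8_MULTIBYTE, ESCAPE_ANY_BYTE, ESCAPE_ANY_CHAR and escaped_token are hypothetical stand-ins, not the project's code, and the multibyte pattern is a simplified approximation of well-formed UTF-8.

```python
import re

# Hypothetical stand-ins for the scanner rules; not the project's Ragel code.
# Simplified UTF-8 multibyte sequence: a lead byte plus 1-3 continuation bytes
# (overlong forms and surrogates are not excluded here).
UTF8_MULTIBYTE = (
    rb"(?:[\xC2-\xDF][\x80-\xBF]"
    rb"|[\xE0-\xEF][\x80-\xBF]{2}"
    rb"|[\xF0-\xF4][\x80-\xBF]{3})"
)

# Old rule: a backslash followed by any single byte.
ESCAPE_ANY_BYTE = re.compile(rb"\\.", re.DOTALL)
# New rule: a backslash followed by a multibyte sequence, or else any single byte.
ESCAPE_ANY_CHAR = re.compile(rb"\\(?:" + UTF8_MULTIBYTE + rb"|.)", re.DOTALL)


def escaped_token(rule, source):
    """Return the bytes matched by `rule` at the start of `source`, or None."""
    match = rule.match(source.encode("utf-8"))
    return match.group(0) if match else None


source = "\\ü]"  # tail of a set such as [\ü], starting at the backslash
old = escaped_token(ESCAPE_ANY_BYTE, source)
new = escaped_token(ESCAPE_ANY_CHAR, source)

print(len(old))             # backslash plus only the first byte of "ü"
print(new.decode("utf-8"))  # backslash plus the whole character
```

With the old rule the token ends in the middle of the character, which would also explain why token locations end up shifted; with the multibyte alternative listed first, the whole character stays in one token.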
Contributor Author

Earlopain commented on Aug 27, 2025

This simpler version also works, and no tests fail with it. I carried the > (escaped_alpha, 1) part over to utf8_multibyte for consistency, but I'm not really sure what it is supposed to do:

Suggested change
- any > (escaped_set_alpha, 1) | utf8_multibyte > (escaped_alpha, 1) {
+ any | utf8_multibyte {

In fact, Ragel doesn't seem to produce different output with or without it.

Earlopain force-pushed the fix-utf8-escapes-in-sets branch from bc5f0e4 to 2efa904 on August 27, 2025 14:24
jaynetics merged commit a611c88 into ammar:master on Sep 15, 2025
12 checks passed
Linked issue: UTF8-escaped characters mess up locations when they are part of a character class
2 participants