Commit 8a44dc8

Author: Andy C

[doc] Add info about Eggex bytes and code points

- doc/ref: Add re-chars section for Eggex
- doc/eggex: more details

1 parent ce71351 commit 8a44dc8

3 files changed: +66 -48 lines changed

doc/eggex.md

Lines changed: 23 additions & 36 deletions
@@ -182,6 +182,9 @@ We've reserved syntactic space for PCRE and Python variants:
 - lazy/non-greedy: `x{L +}`, `x{L 3,4}`
 - possessive: `x{P +}`, `x{P 3,4}`
 
+(Oils doesn't have these features, because Oils translates Eggex to POSIX ERE
+syntax.)
+
 ### Negation Consistently Uses !
 
 You can negate named char classes:
@@ -258,13 +261,12 @@ You can also add type conversion functions:
 
 Example:
 
-    [ a-f 'A'-'F' \yFF \u{03bc} \n \\ \' \" \0 ]
-
-Terms:
-
-- Ranges: `a-f` or `'A' - 'F'`
-- Literals: `\n`, `\y01`, `\u{3bc}`, etc.
-- Sets specified as strings: `'abc'`
+    [ a b c ] # individual characters / code points
+    [ '?' '*' '+' ] # quoted characters
+    [ 'a' 'bc' '?*+' ] # reduce the number of quotes
+    [ a-f 'A'-'F' x y ] # ranges of code points
+    [ \n \\ \' \" ] # backslash escapes
+    [ \yFF \u{03bc} ] # a byte and a code point
 
 Only letters, numbers, and the underscore may be unquoted:
 
@@ -409,6 +411,17 @@ Yes:
 
 ## POSIX ERE Limitations
 
+In theory, Eggex can be translated to many different syntaxes, like Perl or
+Python.
+
+In practice, Oils translates Eggex to POSIX Extended Regular Expressions (ERE)
+syntax. This syntax is understood by `libc`, e.g. GNU libc or musl libc.
+
+But not all of Eggex can be translated to ERE syntax. And the translation is
+straightforward and "dumb", rather than smart.
+
+Here are some limitations.
+
 ### Repetition of Strings Requires Grouping
 
 Repetitions like `* + ?` apply only to the last character, so literal strings
@@ -425,41 +438,15 @@ Yes:
 
 Also OK:
 
-    ('foo')+ # this is a CAPTURING group in ERE
+    ('foo')+ # but note this is a CAPTURING group in ERE
 
 This is necessary because ERE doesn't have non-capturing groups like Perl's
 `(?:...)`, and Eggex only does "dumb" translations. It doesn't silently insert
 constructs that change the meaning of the pattern.
 
-### Unicode char literals are limited in range
-
-ERE can't represent this set of 1 character reliably:
-
-    / [ \u{0100} ] / # This char is 2 bytes encoded in UTF-8
-
-These sets are accepted:
-
-    / [ \u{1} \u{2} ] / # set of 2 chars
-    / [ \y01 \y02 ] ] / # set of 2 bytes
-
-They happen to be identical when translated to ERE, but may not be when
-translated to PCRE.
-
-### Don't put non-ASCII bytes in string sets in char classes
-
-<!-- TODO: $'' will be disallowed in YSH -->
-
-This is a sequence of characters:
-
-    / $'\xfe\xff' /
-
-This is a **set** of characters that is illegal:
-
-    / [ $'\xfe\xff' ] / # set or sequence? It's confusing
-
-This is a better way to write it:
+### Bytes and Code Points are Limited in Range
 
-    / [ \yfe \yff ] / # set of 2 chars
+See [re-chars](ref/chap-expr-lang.html#re-chars) in the Oils Reference.
 
 ### Char class literals: `^ - ] \`
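
Note on the grouping rule in the doc/eggex.md changes above: the sketch below illustrates why a repeated literal string needs explicit grouping, and what it means for that group to be capturing. It uses Python's `re` module purely as a stand-in, which is an assumption for illustration; Oils hands the translated pattern to libc's POSIX ERE engine, which this code does not call. The helper name `fully_matches` is hypothetical.

```python
import re

# Stand-in engine: Python's `re`, NOT libc's POSIX ERE (assumption for
# illustration). For these simple patterns, `+` and `( )` behave alike.

# Without grouping, `+` binds only to the last character:
# 'fo' followed by one or more 'o'.
ungrouped = re.compile(r"foo+")

# With grouping, `+` repeats the whole string 'foo'.
# The parentheses also create a capturing group.
grouped = re.compile(r"(foo)+")


def fully_matches(pattern: re.Pattern, text: str) -> bool:
    """Hypothetical helper: True if the entire text matches the pattern."""
    return pattern.fullmatch(text) is not None


if __name__ == "__main__":
    print(fully_matches(ungrouped, "fooo"))    # repeated 'o' only
    print(fully_matches(ungrouped, "foofoo"))  # repeated 'foo' is not matched
    print(fully_matches(grouped, "foofoo"))    # repeated 'foo' is matched
    print(grouped.groups)                      # number of capturing groups
```

The capturing group is the visible cost of the "dumb" translation: the pattern gains a numbered group that a non-capturing construct would avoid.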

doc/ref/chap-expr-lang.md

Lines changed: 41 additions & 11 deletions
@@ -735,36 +735,66 @@ splicing of strings:
     var dq = "hi $name"
     var eggex = / @dq /
 
+
 ### class-literal
 
-An eggex character class literal specifies a set. It can have individual
-characters and ranges:
+An eggex *character class literal* specifies a **set** of code points. It's
+enclosed in brackets:
+
+    var vowels = / [a e i o u] / # A set of 5 vowels
+
+A class literal can have individual code points:
 
-    [ 'x' 'y' 'z' a-f A-F 0-9 ] # 3 chars, 3 ranges
+    [ a e i o u '?' '*' '+' ]
 
-Omit quotes on ASCII characters:
+It can also have ranges of code points, denoted with a hyphen:
 
-    [ x y z ] # avoid typing 'x' 'y' 'z'
+    [ a-f A-F 0-9 ]
 
-Sets of characters can be written as strings
+To reduce the number of quotes, you can write a set of characters as a string:
 
-    [ 'xyz' ] # any of 3 chars, not a sequence of 3 chars
+    [ 'xyz' ] # any of 3 chars, NOT a sequence of 3 chars
 
-Backslash escapes are respected:
+You can also use backslash escapes:
 
     [ \\ \' \" \0 ]
-    [ \xFF \u{3bc} ]
+    [ \y7F \u{3bc} ] # a byte and a code point
 
-(Note that we don't use `\yFF`, as in J8 strings.)
+    [ \y01 - \y7F ] # range of bytes
+    [ \u{1} - \u{7F} ] # range of code points
 
-Splicing:
+The `@` operator lets you refer to string variables:
 
+    var str_var = 'xyz'
     [ @str_var ]
 
 Negation always uses `!`
 
     ![ a-f A-F 'xyz' @str_var ]
 
+### re-chars
+
+Oils usually invokes `libc` in UTF-8 mode. In this mode, the regex engine
+can't match bytes like `0xFF`; it can only match code points.
+
+    var x = / [ \y7F \u{3bc} ] / # a byte and a code point
+
+Oils translates Eggex to POSIX extended regex (ERE) syntax. Here are some
+restrictions when translating bytes and code points to ERE:
+
+- The `NUL` byte `\y00` isn't allowed.
+  - Its synonym, code point zero `\u{0}`, also isn't allowed.
+- Bytes `\y80` to `\yFF` aren't allowed, because they're outside the ASCII
+  range.
+
+Reminders:
+
+- In the ASCII range, bytes and code points are the same.
+  - That is, `\y01` to `\y7F` are synonyms for `\u{1}` to `\u{7F}`.
+- Outside of the ASCII range, they are different, so Eggex disallows them.
+  - For example, `\u{FF}` is a code point, and `\yFF` is a byte, but they are
+    not the same.
+
 ### named-class
 
 Perl-like shortcuts for sets of characters:
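
Note on the re-chars reminders in this diff: the byte versus code point distinction can be checked with a short Python sketch. This is an illustration of UTF-8 encoding only and does not reproduce how Oils or libc handle patterns. The helper names `utf8_bytes` and `is_valid_utf8` are hypothetical.

```python
def utf8_bytes(code_point: int) -> bytes:
    """Hypothetical helper: the UTF-8 encoding of one code point."""
    return chr(code_point).encode("utf-8")


def is_valid_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if the bytes decode as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# In the ASCII range, a code point and its single byte coincide,
# which is why \y01 to \y7F can act as synonyms for \u{1} to \u{7F}.
ascii_same = all(utf8_bytes(cp) == bytes([cp]) for cp in range(0x01, 0x80))

if __name__ == "__main__":
    print(ascii_same)
    print(utf8_bytes(0xFF))         # code point U+00FF takes two bytes
    print(utf8_bytes(0x3BC))        # code point U+03BC takes two bytes
    print(is_valid_utf8(b"\xff"))   # the lone byte 0xFF is not valid UTF-8
```

Code point `U+00FF` encodes as the two bytes `0xC3 0xBF`, while the single byte `0xFF` never appears in valid UTF-8, so the two notations cannot be treated as the same thing outside the ASCII range.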

doc/ref/toc-ysh.md

Lines changed: 2 additions & 1 deletion
@@ -314,7 +314,8 @@ X [External Lang] BEGIN END when (awk)
 match-ops ~ !~ ~~ !~~
 [Eggex] re-literal / d+ ; re-flags ; ERE /
 re-primitive %zero 'sq'
-class-literal [c a-z 'abc' @str_var \\ \xFF \u{3bc}]
+class-literal [c a-z 'abc' @str_var \\ \y01 \u{3bc}]
+re-chars \y01 \u{3bc}
 named-class dot digit space word d s w
 re-repeat d? d* d+ d{3} d{2,4}
 re-compound seq1 seq2 alt1|alt2 (expr1 expr2)
