Commit dae938f

author
Andy C
committed
[libc unicode] Print warnings if the locale is not UTF-8
spec/sh-usage: Test that env vars like LC_ALL LANG are respected

Update the docs.
1 parent 919f255 commit dae938f

5 files changed: +98 -24 lines changed

bin/oils_for_unix.py

Lines changed: 8 additions & 5 deletions
@@ -95,19 +95,22 @@ def InitLocale(environ):
         # passing None queries it
         #lo = locale.setlocale(locale.LC_CTYPE, None)
     except pylocale.Error:
-        #print('INVALID')
+        # we could make this nicer
+        print_stderr('oils: setlocale() failed')
         locale_name = '' # unknown value
     #log('LOC %s', locale_name)

     if locale_name not in ('', 'C'):
-        # Check that it's utf-8
+        # Check if the codeset is UTF-8, unless OILS_LOCALE_OK=1 is set
+        if environ.get('OILS_LOCALE_OK') == '1':
+            return
+
         codeset = pylocale.nl_langinfo(pylocale.CODESET)
         #log('codeset %s', codeset)

         if not match.IsUtf8Codeset(codeset):
-            # TODO: enable this if not OILS_LOCALE_OK=1
-            #print_stderr('Warning: not UTF-8')
-            pass
+            print_stderr("oils warning: codeset %r doesn't look like UTF-8" % codeset)
+            print_stderr(' Set OILS_LOCALE_OK=1 to remove this message')


 # TODO: Hook up valid applets (including these) to completion
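
The check enabled by this hunk can be sketched as standalone Python. This is an illustration under stated assumptions, not the Oils source: `looks_like_utf8()` is a hypothetical stand-in for `match.IsUtf8Codeset()` (whose exact rules are not shown in the diff), the stdlib `locale` module stands in for `pylocale`, and the `setlocale(LC_CTYPE, '')` call is assumed because it lies outside the hunk. `locale.nl_langinfo()` is only available on Unix-like systems.

```python
import locale
import sys


def looks_like_utf8(codeset):
    # Hypothetical stand-in for match.IsUtf8Codeset(): accept 'UTF-8' or
    # 'utf8' in any letter case. The real matcher may differ.
    return codeset.lower() in ('utf-8', 'utf8')


def locale_warnings(locale_name, codeset, environ):
    """Return the warning lines the check would print for these inputs."""
    if locale_name in ('', 'C'):
        return []  # unknown or plain C locale: nothing to check
    if environ.get('OILS_LOCALE_OK') == '1':
        return []  # the user opted out of the warning
    if looks_like_utf8(codeset):
        return []
    return [
        "oils warning: codeset %r doesn't look like UTF-8" % codeset,
        ' Set OILS_LOCALE_OK=1 to remove this message',
    ]


def init_locale(environ):
    """Set the process locale from the environment, then warn on stderr."""
    try:
        # Assumption: '' means "take LC_CTYPE from the process environment".
        locale_name = locale.setlocale(locale.LC_CTYPE, '')
    except locale.Error:
        print('oils: setlocale() failed', file=sys.stderr)
        locale_name = ''  # unknown value
    codeset = locale.nl_langinfo(locale.CODESET)
    for line in locale_warnings(locale_name, codeset, environ):
        print(line, file=sys.stderr)
```

Keeping the decision in a pure function (`locale_warnings`) separates it from the global side effect of `setlocale()`, which makes the opt-out path easy to check in isolation.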

doc/ref/chap-special-var.md

Lines changed: 4 additions & 0 deletions
@@ -239,6 +239,10 @@ When the shell process exists, print GC stats to stderr.

 When the shell process exists, print GC stats to this file descriptor.

+### `OILS_LOCALE_OK`
+
+Suppress the warning about `libc` locales that are not UTF-8.
+
 ## Float

 ### NAN

doc/ref/toc-ysh.md

Lines changed: 2 additions & 2 deletions
@@ -380,10 +380,10 @@ X [External Lang] BEGIN END when (awk)
   [YSH read]      _reply
   [History]       YSH_HISTFILE
   [Interactive]   OILS_COMP_UI
-  [Oils VM]       OILS_VERSION
+  [Oils VM]       OILS_VERSION          LIB_YSH
                   OILS_GC_THRESHOLD     OILS_GC_ON_EXIT
                   OILS_GC_STATS         OILS_GC_STATS_FD
-                  LIB_YSH
+                  OILS_LOCALE_OK
   [Float]         NAN                   INFINITY
   [Module]        __provide__
   [Other Env]     HOME                  PATH

doc/unicode.md

Lines changed: 11 additions & 17 deletions
@@ -133,37 +133,31 @@ Not unicode aware:

 - `strcmp()` does byte-wise and UTF-8 wise comparisons?

-### libc functions used
-
-TODO
-
-- Note: GNU readline calls `setlocale()`, which means that the `oils-for-unix`
-  process is affected by the environment
-
 ### Data Languages

 - Decoding JSON/J8 validates UTF-8
 - Encoding JSON/J8 decodes and validates UTF-8
   - So we can distinguish valid UTF-8 and invalid bytes like `\yff`

-## Implementation Notes
+## libc locale

-Unlike bash and CPython, Oils doesn't call `setlocale()`. (Although GNU
-readline may call it.)
+At startup, Oils calls `setlocale()`, which initializes the global libc locale
+from the environment. (GNU readline also calls `setlocale()`, but Oils may or
+may not link against GNU readline.)

-It's expected that your locale will respect UTF-8. This is true on most
-distros. If not, then some string operations will support UTF-8 and some
-won't.
+The locale affects the behavior of say `?` in globs, and `.` in libc regexes.
+
+Oils only supports UTF-8. If the locale is not UTF-8, Oils prints a warning to
+stderr. You can silence it with `OILS_LOCALE_OK=1`.
+
+### Some string operations use libc, and some don't

 For example:

 - String length like `${#s}` is implemented in Oils code, not libc, so it will
   always respect UTF-8.
 - `[[ s =~ $pat ]]` is implemented with libc, so it is affected by the locale
-  settings. Same with Oils `(x ~ pat)`.
-
-TODO: Oils should support `LANG=C` for some operations, but not `LANG=X` for
-other `X`.
+  settings. This is also true of YSH `(x ~ pat)`.

 ### List of Low-Level UTF-8 Operations

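
The distinction described in doc/unicode.md, between operations that count code points and operations that follow the libc locale, can be imitated in Python. This sketch does not call libc: it assumes that a single-character wildcard in the C locale consumes one byte and in a UTF-8 locale consumes one code point, and it models that with `re` on `bytes` versus `str`. All function names are hypothetical.

```python
import re

# The glob '_?_' written as a regex: '?' becomes '.', which matches one unit.
CHAR_PATTERN = re.compile(r'_._', re.DOTALL)   # one unit = one code point
BYTE_PATTERN = re.compile(rb'_._', re.DOTALL)  # one unit = one byte


def matches_as_code_points(name):
    """Model a UTF-8 locale: the wildcard consumes a whole code point."""
    return CHAR_PATTERN.fullmatch(name) is not None


def matches_as_bytes(name):
    """Model the C locale: the wildcard consumes exactly one byte."""
    return BYTE_PATTERN.fullmatch(name.encode('utf-8')) is not None


def lengths(s):
    """Return (code points, UTF-8 bytes) for a string."""
    return len(s), len(s.encode('utf-8'))
```

Under this model `_x_` matches either way, while `_μ_` matches only when the wildcard is counted in code points, because `μ` occupies two bytes in UTF-8.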

spec/sh-usage.test.sh

Lines changed: 73 additions & 0 deletions
@@ -122,3 +122,76 @@ status=0

 ## N-I bash/dash/mksh/zsh STDOUT:
 ## END
+
+#### Set LC_ALL LC_CTYPE LC_COLLATE LANG - affects glob ?
+
+# note: test/spec-common.sh sets LC_ALL
+unset LC_ALL
+
+touch _x_ _μ_
+
+LC_ALL=C $SH -c 'echo LC_ALL _?_'
+LC_ALL=C.UTF-8 $SH -c 'echo LC_ALL _?_'
+echo
+
+LC_CTYPE=C $SH -c 'echo LC_CTYPE _?_'
+LC_CTYPE=C.UTF-8 $SH -c 'echo LC_CTYPE _?_'
+echo
+
+LC_COLLATE=C $SH -c 'echo LC_COLLATE _?_'
+LC_COLLATE=C.UTF-8 $SH -c 'echo LC_COLLATE _?_'
+echo
+
+LANG=C $SH -c 'echo LANG _?_'
+LANG=C.UTF-8 $SH -c 'echo LANG _?_'
+
+## STDOUT:
+LC_ALL _x_
+LC_ALL _x_ _μ_
+
+LC_CTYPE _x_
+LC_CTYPE _x_ _μ_
+
+LC_COLLATE _x_
+LC_COLLATE _x_
+
+LANG _x_
+LANG _x_ _μ_
+## END
+
+## N-I dash/mksh STDOUT:
+LC_ALL _x_
+LC_ALL _x_
+
+LC_CTYPE _x_
+LC_CTYPE _x_
+
+LC_COLLATE _x_
+LC_COLLATE _x_
+
+LANG _x_
+LANG _x_
+## END
+
+
+#### LC_ALL=invalid
+
+# note: test/spec-common.sh sets LC_ALL
+unset LC_ALL
+
+touch _x_ _μ_
+
+LC_ALL=invalid $SH -c 'echo LC_ALL _?_' 2> err.txt
+
+#cat err.txt
+wc -l err.txt
+
+## STDOUT:
+LC_ALL _x_
+1 err.txt
+## END
+
+## N-I dash/mksh/zsh STDOUT:
+LC_ALL _x_
+0 err.txt
+## END
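
The expected output of the first test case follows from the POSIX precedence among locale variables: `LC_ALL` overrides the per-category variable (here `LC_CTYPE`), which overrides `LANG`. That is why the test unsets `LC_ALL` first, and why `LC_COLLATE` alone leaves the `?` glob unchanged. A minimal sketch of that lookup, where `effective_locale()` is a hypothetical helper and the fallback to `'C'` when nothing is set is an assumption (POSIX leaves the default implementation-defined):

```python
def effective_locale(environ, category='LC_CTYPE'):
    """Return the locale name that applies to one category.

    Precedence: LC_ALL, then the category's own variable, then LANG.
    Unset and empty values are both skipped.
    """
    for name in ('LC_ALL', category, 'LANG'):
        value = environ.get(name)
        if value:
            return value
    return 'C'  # assumed default when no variable is set
```

For example, `effective_locale({'LC_COLLATE': 'C.UTF-8'})` returns `'C'`, since `LC_COLLATE` is not consulted for the character type category.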
