You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Rollup merge of rust-lang#68232 - Mark-Simulacrum:unicode-tables, r=joshtriplett
Optimize size/speed of Unicode datasets
The overall implementation has the same general idea as the prior approach,
which was based on a compressed trie structure, but modified to use less space
(and, coincidentally, be an overall performance improvement).
Sizes | Old | New | New/current
-- | -- | -- | --
Alphabetic | 4616 | 2982 | 64.60%
Case_Ignorable | 3144 | 2112 | 67.18%
Cased | 2376 | 934 | 39.31%
Cc | 19 | 43 | 226.32%
Grapheme_Extend | 3072 | 1734 | 56.45%
Lowercase | 2328 | 985 | 42.31%
N | 2648 | 1239 | 46.79%
Uppercase | 1978 | 934 | 47.22%
White_Space | 241 | 140 | 58.09%
| | |
Total | 20422 | 11103 | 54.37%
This table shows the size of the old and new tables in bytes. The most important
of these tables is "Grapheme_Extend", as it is present in essentially all Rust
programs due to being called from `str`'s Debug impl (`char::escape_debug`). In
a representative case given by this [blog post] for the embedded world, the
shrinking in this PR shrinks the final binary by 1,604 bytes, from 14,440 to
12,836.
The performance of these new tables, based on the (rough) benchmark of linearly
scanning the entire valid set of chars, querying for each `is_*`, is roughly
~50% better, though in some cases is either on par or slightly (3-5%) worse. In
practice, I believe the size benefits of this PR are the main concern. The new
implementation has been tested to be equivalent to the current nightly in terms
of returned values on the set of valid chars.
A (relatively) high-level explanation of the specific compression scheme used
can be found [in the generator].
This is split into three commits -- the first adds the generator which produces
the Rust code for the tables, the second adds support code for the lookup, and
the third actually swaps the current implementation out for the new one.
[blog post]: https://jamesmunns.com/blog/fmt-unreasonably-expensive/
[in the generator]: https://github.com/Mark-Simulacrum/rust/blob/unicode-tables/src/tools/unicode-table-generator/src/raw_emitter.rs
0 commit comments