Description
I have been experimenting with Utf8Validator and found that the existing handling uses three lookup tables, indexed by the upper and lower nibbles of the first byte and the upper nibble of the second byte in each pair of consecutive bytes, to catch various error scenarios.
Effectively, we refer to twelve bits, 8 from the first byte and 4 from the second byte, for lookups in 16-byte tables. The following PoC implementation[1] uses two 64-byte lookup tables accessed using 6-bit indices. For the first lookup, the index is composed of the least significant 6 bits of the first byte; for the second lookup, the index concatenates the upper nibble of the second byte with the most significant two bits of the first byte.
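To make the two indexing schemes concrete, here is a minimal scalar sketch of how the indices could be derived for one pair of consecutive bytes. It is not taken from the PoC: the class and method names are hypothetical, the table contents are deliberately omitted, and the bit order of the second 6-bit index (upper nibble of the second byte in the high bits, top two bits of the first byte in the low bits) is an assumption, since the description above does not pin it down.

```java
// Hypothetical helper, for illustration only; not part of Utf8Validator or the PoC.
final class LookupIndices {

    // Existing scheme as described: three 4-bit indices, one per 16-byte table.
    static int firstHighNibble(byte first)   { return (first & 0xFF) >>> 4; }
    static int firstLowNibble(byte first)    { return first & 0x0F; }
    static int secondHighNibble(byte second) { return (second & 0xFF) >>> 4; }

    // Proposed scheme, first lookup: least significant 6 bits of the first byte.
    static int firstLowSixBits(byte first) {
        return first & 0x3F;
    }

    // Proposed scheme, second lookup: upper nibble of the second byte combined
    // with the most significant two bits of the first byte.
    // ASSUMPTION: the nibble occupies the high 4 bits of the 6-bit index.
    static int combinedIndex(byte first, byte second) {
        int topTwoOfFirst = (first & 0xFF) >>> 6;   // 0..3
        int highOfSecond  = (second & 0xFF) >>> 4;  // 0..15
        return (highOfSecond << 2) | topTwoOfFirst; // 0..63
    }

    public static void main(String[] args) {
        byte first = (byte) 0xE2;   // lead byte of a 3-byte sequence
        byte second = (byte) 0x82;  // continuation byte
        System.out.println("3-table indices: "
                + firstHighNibble(first) + ", "
                + firstLowNibble(first) + ", "
                + secondHighNibble(second));
        System.out.println("2-table indices: "
                + firstLowSixBits(first) + ", "
                + combinedIndex(first, second));
    }
}
```

Both schemes consume the same twelve bits of input; the two-table variant only regroups them into two 6-bit indices, so each index stays within the 0..63 range of a 64-byte table.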
I see around a 5-7% performance improvement[2] over the three-table lookup.
The algorithm can be directly ported to Utf8Validator.
Best Regards,
Jatin
[1] https://github.com/jatin-bhateja/external_staging/blob/main/Code/java/vector-api/simd_json/ThreeVsTwoTableLookup.java
[2] https://github.com/jatin-bhateja/external_staging/blob/main/Code/java/vector-api/simd_json/performance_3Tvs2T_lookup.txt