Use of two table lookups instead of existing three table lookups for Utf8Validator.

I have been experimenting with Utf8Validator and found that the existing implementation uses three lookup tables, indexed by the upper and lower nibbles of the first byte and the upper nibble of the second byte in each pair of consecutive bytes, to catch the various error scenarios.

Effectively, we refer to twelve bits, 8 from the first byte and 4 from the second byte, for lookups in 16-byte tables. The following PoC implementation[1] uses two 64-byte lookup tables accessed using 6-bit indices. For the first lookup, the index is composed of the least significant 6 bits of the first byte; for the second lookup, the index concatenates the upper nibble of the second byte with the two most significant bits of the first byte.
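
The index computation described above can be sketched in scalar form as follows. This is only an illustration, not the PoC code: the class and method names are hypothetical, the bit order used when concatenating the second index is an assumption (the PoC[1] may pack the bits differently), and the table contents are omitted entirely. It shows that both schemes consume the same twelve bits of each byte pair.

```java
// Hypothetical scalar sketch; names and the bit packing of secondIndex are assumptions.
final class TwoTableIndexSketch {

    // Existing scheme: three 4-bit indices into three 16-entry tables.
    static int[] threeTableIndices(byte first, byte second) {
        int firstHigh = (first & 0xFF) >>> 4;   // upper nibble of first byte
        int firstLow = first & 0x0F;            // lower nibble of first byte
        int secondHigh = (second & 0xFF) >>> 4; // upper nibble of second byte
        return new int[] {firstHigh, firstLow, secondHigh};
    }

    // Proposed scheme, first lookup: least significant 6 bits of the first byte.
    static int firstIndex(byte first) {
        return first & 0x3F;
    }

    // Proposed scheme, second lookup: upper nibble of the second byte
    // concatenated with the top 2 bits of the first byte.
    // Assumption: the nibble occupies the high 4 bits of the 6-bit index.
    static int secondIndex(byte first, byte second) {
        int secondHigh = (second & 0xFF) >>> 4;
        int firstTop2 = (first & 0xFF) >>> 6;
        return (secondHigh << 2) | firstTop2;
    }

    public static void main(String[] args) {
        // 0xC3 0xA9 is the UTF-8 encoding of U+00E9.
        byte first = (byte) 0xC3;
        byte second = (byte) 0xA9;
        System.out.println(firstIndex(first));          // prints 3
        System.out.println(secondIndex(first, second)); // prints 43
    }
}
```

Each index stays in the range 0..63, so it can address a 64-byte table directly.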

I see around a 5-7% performance improvement[2] over the three-table lookup.

The algorithm can be ported directly to Utf8Validator.

Best Regards,
Jatin

[1] https://github.com/jatin-bhateja/external_staging/blob/main/Code/java/vector-api/simd_json/ThreeVsTwoTableLookup.java
[2] https://github.com/jatin-bhateja/external_staging/blob/main/Code/java/vector-api/simd_json/performance_3Tvs2T_lookup.txt


