Use of two table lookups instead of existing three table lookups for Utf8Validator. #69

@jatin-bhateja

I have been experimenting with Utf8Validator and found that the existing handling uses three lookup tables, indexed by the upper and lower nibbles of the first byte and the upper nibble of the second byte in each pair of consecutive bytes, to catch various error scenarios.
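As a rough scalar illustration of the indexing described above, the three 4-bit indices for one byte pair could be derived as follows. This is only a sketch: the class and method names are hypothetical, the table contents are omitted, and the real Utf8Validator operates on vectors rather than single bytes.

```java
// Hypothetical sketch: derives the three 4-bit indices used by a
// three-table lookup for one pair of consecutive bytes.
// Names are illustrative and not taken from Utf8Validator.
final class ThreeTableIndexSketch {

    // Upper nibble of the first byte (0..15).
    static int firstHighNibble(byte first) {
        return (first & 0xFF) >>> 4;
    }

    // Lower nibble of the first byte (0..15).
    static int firstLowNibble(byte first) {
        return first & 0x0F;
    }

    // Upper nibble of the second byte (0..15).
    static int secondHighNibble(byte second) {
        return (second & 0xFF) >>> 4;
    }

    public static void main(String[] args) {
        // 0xE2 0x82 are the first two bytes of the UTF-8 encoding of U+20AC.
        byte first = (byte) 0xE2;
        byte second = (byte) 0x82;
        System.out.printf("%d %d %d%n",
                firstHighNibble(first),
                firstLowNibble(first),
                secondHighNibble(second));
    }
}
```

Each index selects one entry from a 16-entry table, so the three lookups together consume 4 + 4 + 4 = 12 bits of the byte pair.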

Effectively, we refer to twelve bits, 8 from the first byte and 4 from the second byte, for lookups in 16-byte tables. The following PoC implementation[1] uses two 64-byte lookup tables accessed using 6-bit indices. For the first lookup, the index is composed of the least significant 6 bits of the first byte; for the second lookup, the index concatenates the upper nibble of the second byte with the most significant two bits of the first byte.
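A minimal scalar sketch of the two 6-bit indices, for readers who want to see the bit manipulation spelled out. The names are hypothetical, and the bit order within the second index (second byte's nibble in the high bits, first byte's top two bits in the low bits) is an assumption; the linked PoC[1] is the authoritative layout and also defines the table contents, which are omitted here.

```java
// Hypothetical sketch: derives the two 6-bit indices for a two-table
// lookup over one pair of consecutive bytes. Not taken from the PoC.
final class TwoTableIndexSketch {

    // Index into the first 64-entry table: low 6 bits of the first byte.
    static int firstIndex(byte first) {
        return first & 0x3F;
    }

    // Index into the second 64-entry table: upper nibble of the second
    // byte concatenated with the top two bits of the first byte.
    // Assumed layout: [second bits 7..4][first bits 7..6].
    static int secondIndex(byte first, byte second) {
        int topTwoOfFirst = (first & 0xFF) >>> 6;        // 0..3
        int highNibbleOfSecond = (second & 0xFF) >>> 4;  // 0..15
        return (highNibbleOfSecond << 2) | topTwoOfFirst;
    }

    public static void main(String[] args) {
        // 0xE2 0x82 are the first two bytes of the UTF-8 encoding of U+20AC.
        byte first = (byte) 0xE2;
        byte second = (byte) 0x82;
        System.out.printf("%d %d%n",
                firstIndex(first),
                secondIndex(first, second));
    }
}
```

The two indices together cover the same twelve bits as before (6 + 2 from the first byte, 4 from the second), only split across two lookups instead of three.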

I see around a 5-7% performance improvement[2] over the three-table lookup.

The algorithm can be directly ported to Utf8Validator.

Best Regards,
Jatin

[1] https://github.com/jatin-bhateja/external_staging/blob/main/Code/java/vector-api/simd_json/ThreeVsTwoTableLookup.java
[2] https://github.com/jatin-bhateja/external_staging/blob/main/Code/java/vector-api/simd_json/performance_3Tvs2T_lookup.txt