Why does s3.read_parquet() return different data types depending on chunk size? #3123

@jakov7

Description

import awswrangler as wr
import pandas as pd

df = pd.DataFrame({"cat1": ['a', 'b', 'b']})
df["cat1"] = df["cat1"].astype("category")

wr.s3.to_parquet(
    df=df,
    path='s3://DWH/test',
    dataset=True
)
# Read the dataset back with every chunk size from 1 to 3 and print the dtype of each chunk.
for chunk_size in range(1, 4):
    print(f"chunk_size: {chunk_size}")
    for df in wr.s3.read_parquet("s3://DWH/test", chunked=chunk_size):
        print(df["cat1"].dtypes)

This returns the category dtype for every chunk:

chunk_size: 1
category
category
category
chunk_size: 2
category
category
chunk_size: 3
category

Next, a second DataFrame that introduces a new category ('xxx') is written to the same path:

df = pd.DataFrame({"cat1": ['a', 'b', 'b', 'xxx']})
df["cat1"] = df["cat1"].astype("category")

# Written to the same path as the first DataFrame.
wr.s3.to_parquet(
    df=df,
    path='s3://DWH/test',
    dataset=True
)
for chunk_size in range(1, 8):
    print(f"chunk_size: {chunk_size}")
    for df in wr.s3.read_parquet("s3://DWH/test", chunked=chunk_size):
        print(df["cat1"].dtypes)

This returns a mix of category and object dtypes:

chunk_size: 1
category
category
category
category
category
category
category
chunk_size: 2
category
category
category
category
chunk_size: 3
category
object
object
chunk_size: 4
category
category
chunk_size: 5
object
object
chunk_size: 6
object
object
chunk_size: 7
object
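
One possible explanation, not confirmed against the awswrangler source: the dataset now appears to hold rows from both writes (7 in total), so some chunks may be assembled from rows with different category sets. pandas falls back to object when it concatenates categorical columns whose categories differ. The sketch below reproduces only that pandas behaviour locally, without S3; first_file and second_file are illustrative stand-ins, not names from the library.

```python
import pandas as pd

# Two frames standing in for the two writes (hypothetical names).
first_file = pd.DataFrame({"cat1": ["a", "b", "b"]}).astype({"cat1": "category"})
second_file = pd.DataFrame({"cat1": ["a", "b", "b", "xxx"]}).astype({"cat1": "category"})

# Same category set on both sides: the categorical dtype survives.
same = pd.concat([first_file, first_file], ignore_index=True)
print(same["cat1"].dtype)   # category

# Different category sets: pandas falls back to object.
mixed = pd.concat([second_file, first_file], ignore_index=True)
print(mixed["cat1"].dtype)  # object
```

This would be consistent with the output above, where the chunk sizes that give object (3, 5, 6, 7) need at least one chunk drawn from both writes if the 4-row data is read first, but the reader's internal chunking logic is an assumption here.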

The problem occurs when a new category is introduced. Is there a way to make sure the category dtype is always returned? pyarrow_additional_kwargs does not help with this.
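
A possible workaround, sketched under the assumption that the full set of categories is known up front: cast the column in each chunk to one fixed pd.CategoricalDtype after reading. enforce_categories and ALL_CATEGORIES are hypothetical names, not part of awswrangler, and the chunks are built locally so the snippet runs without S3.

```python
from typing import Iterable, Iterator, List

import pandas as pd

# Assumed to be known in advance; values outside this list would become NaN.
ALL_CATEGORIES = ["a", "b", "xxx"]


def enforce_categories(
    chunks: Iterable[pd.DataFrame], column: str, categories: List[str]
) -> Iterator[pd.DataFrame]:
    """Yield each chunk with `column` cast to one shared categorical dtype."""
    dtype = pd.CategoricalDtype(categories=categories)
    for chunk in chunks:
        yield chunk.assign(**{column: chunk[column].astype(dtype)})


# Local stand-ins for chunks that came back with mixed dtypes.
chunks = [
    pd.DataFrame({"cat1": ["a", "b", "b"]}).astype({"cat1": "category"}),
    pd.DataFrame({"cat1": ["xxx", "a", "b"]}),  # object dtype
]
fixed = list(enforce_categories(chunks, "cat1", ALL_CATEGORIES))
for df in fixed:
    print(df["cat1"].dtype)
```

With the real reader, the iterable passed in would be the generator returned by wr.s3.read_parquet("s3://DWH/test", chunked=chunk_size). This only normalises the dtype after the fact; it does not change what the reader itself returns.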
