Closed
Labels
question (Further information is requested)
Description
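import awswrangler as wr
import pandas as pd

# Write a three-row DataFrame with a categorical column as a Parquet dataset.
df = pd.DataFrame({"cat1": ['a', 'b', 'b']})
df["cat1"] = df["cat1"].astype("category")
wr.s3.to_parquet(
    df=df,
    path='s3://DWH/test',
    dataset=True
)

# Read the dataset back in chunks of 1, 2 and 3 rows and print each chunk's dtype.
for chunk_size in range(1, 4):
    print(f"chunk_size: {chunk_size}")
    for chunk in wr.s3.read_parquet("s3://DWH/test", chunked=chunk_size):
        print(chunk["cat1"].dtype)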
Every chunk comes back with the category dtype:
chunk_size: 1
category
category
category
chunk_size: 2
category
category
chunk_size: 3
category
# Write a second DataFrame, with a new category 'xxx', to the same path.
df = pd.DataFrame({"cat1": ['a', 'b', 'b', 'xxx']})
df["cat1"] = df["cat1"].astype("category")
wr.s3.to_parquet(
    df=df,
    path='s3://DWH/test',
    dataset=True
)

# Read the dataset back with chunk sizes 1 through 7.
for chunk_size in range(1, 8):
    print(f"chunk_size: {chunk_size}")
    for chunk in wr.s3.read_parquet("s3://DWH/test", chunked=chunk_size):
        print(chunk["cat1"].dtype)
This time the chunks come back with mixed dtypes:
chunk_size: 1
category
category
category
category
category
category
category
chunk_size: 2
category
category
category
category
chunk_size: 3
category
object
object
chunk_size: 4
category
category
chunk_size: 5
object
object
chunk_size: 6
object
object
chunk_size: 7
object
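A possible explanation, not confirmed here: chunk_size: 1 yields seven chunks, which suggests the second write was appended to the three rows already under s3://DWH/test, so the dataset holds two files whose cat1 columns carry different category sets. In plain pandas, concatenating categoricals with different categories drops the category dtype, which could be what happens when a chunk spans both files. The sketch below reproduces only that pandas behavior with in-memory Series; the variable names are illustrative and nothing in it touches S3 or awswrangler internals.

```python
import pandas as pd
from pandas.api.types import union_categoricals

# Two categorical Series with different category sets, standing in for
# the cat1 column of the two writes (an assumption about the cause).
first = pd.Series(["a", "b", "b"], dtype="category")
second = pd.Series(["a", "b", "b", "xxx"], dtype="category")

# Identical category sets: the result stays categorical.
same = pd.concat([first, first], ignore_index=True)
print(same.dtype)  # category

# Different category sets: pandas falls back to a non-categorical dtype.
mixed = pd.concat([first, second], ignore_index=True)
print(isinstance(mixed.dtype, pd.CategoricalDtype))  # False

# union_categoricals merges the category sets and keeps the dtype.
merged = pd.Series(union_categoricals([first, second]))
print(merged.dtype)  # category
print(list(merged.cat.categories))
```

If this is the cause, the dtype of a chunk would depend on whether its rows come from one file or from both, which would match the way the output changes with chunk_size.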
The problem occurs when a new category is introduced. Is there a way to make sure the category dtype is always returned? Passing pyarrow_additional_kwargs does not help with this.
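One workaround that might help, assuming the full set of categories is known up front, is to re-cast the column on every chunk after reading. The sketch below is not an awswrangler feature: enforce_category is a hypothetical helper, and the in-memory DataFrames stand in for the chunks that wr.s3.read_parquet(..., chunked=...) would yield.

```python
from typing import Iterable, Iterator, Sequence

import pandas as pd


def enforce_category(
    chunks: Iterable[pd.DataFrame], column: str, categories: Sequence[str]
) -> Iterator[pd.DataFrame]:
    """Yield each chunk with `column` cast to one fixed categorical dtype.

    Hypothetical helper. Values missing from `categories` become NaN.
    """
    dtype = pd.CategoricalDtype(categories=list(categories))
    for chunk in chunks:
        yield chunk.assign(**{column: chunk[column].astype(dtype)})


# Stand-ins for chunks read from S3: one categorical, one plain strings.
chunks = [
    pd.DataFrame({"cat1": pd.Series(["a", "b", "b"], dtype="category")}),
    pd.DataFrame({"cat1": ["b", "xxx"]}),
]

fixed = list(enforce_category(chunks, "cat1", ["a", "b", "xxx"]))
for chunk in fixed:
    print(chunk["cat1"].dtype)  # category for every chunk
```

With the real dataset, the first argument would be the iterator returned by wr.s3.read_parquet("s3://DWH/test", chunked=chunk_size). The cost is that the category list has to be maintained outside the Parquet files.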