Description
Describe the bug
Sorry, it's me again, and again about the error on missing columns: #1370
When writing with dataset=True and non-empty partition_cols, a subsequent read with validate_schema=True and non-empty columns fails, because partition columns are removed from the dataframe before writing.
Is this now intended behavior? As someone using partitioning and datasets a lot, I find this a non-intuitive and breaking change in behavior. Especially as partition columns are transparently removed during write and added back during read.
How to Reproduce
>>> import awswrangler as wr
>>> import pandas as pd
>>> df = pd.DataFrame({"column":[1], "partition": [2]})
>>> wr.catalog.create_database("test_partition_columns")
# write simple partitioned dataset
>>> wr.s3.to_parquet(
... df,
... dataset=True,
... database="test_partition_columns",
... table="test_table",
... path="s3://<redacted>/test_partition_columns/test_table/",
... partition_cols=["partition"],
... )
{
'paths': [
's3://<redacted>/test_partition_columns/test_table/partition=2/06cdfcd5ebd14e92a88b1ee18fd5061d.snappy.parquet'
],
'partitions_values': {
's3://<redacted>/test_partition_columns/test_table/partition=2/': ['2']
}
}
# reading all columns works just fine
>>> wr.s3.read_parquet(
... "s3://<redacted>/test_partition_columns/test_table/",
... dataset=True,
... validate_schema=True,
... )
column partition
0 1 2
# reading a column that is there ('partition', see above) explicitly fails
>>> wr.s3.read_parquet(
... "s3://<redacted>/test_partition_columns/test_table/",
... dataset=True,
... validate_schema=True,
... columns=["column", "partition"],
... )
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/rschmidtke/miniconda3/envs/<redacted>/lib/python3.9/site-packages/awswrangler/s3/_read_parquet.py", line 777, in read_parquet
return _read_parquet(
File "/home/rschmidtke/miniconda3/envs/<redacted>/lib/python3.9/site-packages/awswrangler/s3/_read_parquet.py", line 540, in _read_parquet
table=_read_parquet_file(
File "/home/rschmidtke/miniconda3/envs/<redacted>/lib/python3.9/site-packages/awswrangler/s3/_read_parquet.py", line 489, in _read_parquet_file
raise exceptions.InvalidArgument(f"column: {column} does not exist")
awswrangler.exceptions.InvalidArgument: column: partition does not exist
# not requesting the 'partition' column still returns it, though this may be pre-existing behavior
>>> wr.s3.read_parquet(
... "s3://<redacted>/test_partition_columns/test_table/",
... dataset=True,
... validate_schema=True,
... columns=["column"],
... )
column partition
0 1 2
Expected behavior
The read should not fail. I'm not sure how best to implement this, though: when dataset=True and columns is non-empty, perhaps extract the PartitionKeys from the table input and remove them from columns if present?
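The suggested fix above could look roughly like this. This is only a sketch, not awswrangler's actual internals: `filter_partition_columns` is a hypothetical helper, and I'm assuming the partition key names would be fetched from the Glue table input's PartitionKeys before validating the requested columns against the Parquet file schema.

```python
# Hypothetical sketch of the proposed fix: before validating the requested
# columns against the Parquet file schema, drop any that are partition keys
# (which are encoded in the S3 path, not stored in the file itself).

def filter_partition_columns(columns, partition_keys):
    """Return the requested columns minus any partition keys.

    `partition_keys` would come from the Glue table input's
    PartitionKeys (assumption; the real internals may differ).
    """
    keys = set(partition_keys)
    return [c for c in columns if c not in keys]


# With the repro above: 'partition' is a partition key, so only
# 'column' would be validated against the file schema.
print(filter_partition_columns(["column", "partition"], ["partition"]))
```

The partition columns would then be added back to the resulting dataframe from the path, as already happens today when columns is empty.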
Your project
No response
Screenshots
No response
OS
Ubuntu 20.04.4 LTS
Python version
3.9.13
AWS DataWrangler version
2.16.1
Additional context
No response