validate_schema=True breaks with partition columns and dataset=True #1426

@robert-schmidtke

Description

Describe the bug

Sorry, it's me again, and again about the error on missing columns: #1370

When a dataset is written with dataset=True and non-empty partition_cols, reading it back with validate_schema=True and an explicit columns list fails, because the partition columns are removed from the dataframe before writing and therefore don't exist in the Parquet files.

Is this now intended behavior? As someone who uses partitioning and datasets a lot, I find this a counter-intuitive and breaking change, especially since partition columns are transparently removed during write and added back during read.

How to Reproduce

>>> import awswrangler as wr
>>> import pandas as pd

>>> df = pd.DataFrame({"column":[1], "partition": [2]})

>>> wr.catalog.create_database("test_partition_columns")

# write simple partitioned dataset
>>> wr.s3.to_parquet(
...     df,
...     dataset=True,
...     database="test_partition_columns",
...     table="test_table",
...     path="s3://<redacted>/test_partition_columns/test_table/",
...     partition_cols=["partition"],
... )
{
    'paths': [
        's3://<redacted>/test_partition_columns/test_table/partition=2/06cdfcd5ebd14e92a88b1ee18fd5061d.snappy.parquet'
    ],
    'partitions_values': {
        's3://<redacted>/test_partition_columns/test_table/partition=2/': ['2']
    }
}

# reading all columns works just fine
>>> wr.s3.read_parquet(
...     "s3://<redacted>/test_partition_columns/test_table/",
...     dataset=True,
...     validate_schema=True,
... )
   column partition
0       1         2

# explicitly requesting a column that does exist in the dataset ('partition', see above) fails
>>> wr.s3.read_parquet(
...     "s3://<redacted>/test_partition_columns/test_table/",
...     dataset=True,
...     validate_schema=True,
...     columns=["column", "partition"],
... )
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/rschmidtke/miniconda3/envs/<redacted>/lib/python3.9/site-packages/awswrangler/s3/_read_parquet.py", line 777, in read_parquet
    return _read_parquet(
  File "/home/rschmidtke/miniconda3/envs/<redacted>/lib/python3.9/site-packages/awswrangler/s3/_read_parquet.py", line 540, in _read_parquet
    table=_read_parquet_file(
  File "/home/rschmidtke/miniconda3/envs/<redacted>/lib/python3.9/site-packages/awswrangler/s3/_read_parquet.py", line 489, in _read_parquet_file
    raise exceptions.InvalidArgument(f"column: {column} does not exist")
awswrangler.exceptions.InvalidArgument: column: partition does not exist

# omitting the 'partition' column still returns it, though this may already be existing behavior
>>> wr.s3.read_parquet(
...     "s3://<redacted>/test_partition_columns/test_table/",
...     dataset=True,
...     validate_schema=True,
...     columns=["column"],
... )
   column partition
0       1         2

Expected behavior

It should not fail. I'm not sure how best to implement this, though. When dataset=True and columns is non-empty, perhaps extract the PartitionKeys from the table input and remove them from columns before validating each Parquet file?
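A minimal sketch of that idea (the helper name and call site are hypothetical, not awswrangler internals): since partition values live in the S3 path rather than in the Parquet files, the requested columns could be split into file columns (validated against each file's schema) and partition columns (re-attached from the path afterwards).

```python
# Hypothetical helper: split the user-requested columns into those that
# should exist inside the Parquet files and those that are partition keys.
def split_requested_columns(columns, partition_keys):
    # Columns to validate against and read from each Parquet file.
    file_columns = [c for c in columns if c not in partition_keys]
    # Columns whose values come from the S3 path (e.g. ".../partition=2/").
    partition_columns = [c for c in columns if c in partition_keys]
    return file_columns, partition_columns

file_cols, part_cols = split_requested_columns(
    ["column", "partition"], partition_keys=["partition"]
)
print(file_cols)  # ['column']
print(part_cols)  # ['partition']
```

With this split, schema validation would only check file_cols against each file, avoiding the InvalidArgument error above, while part_cols would be populated from the partition path as usual.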


OS

Ubuntu 20.04.4 LTS

Python version

3.9.13

AWS DataWrangler version

2.16.1



Labels

bug (Something isn't working), minor release (Will be addressed in the next minor release), ready to release
