read_csv: read file as binary when encoding_errors is set to ignore #1723

cnfait · 2022-10-27T10:10:24Z

Feature or Bugfix

Bugfix

Detail

read_csv chokes on encoding errors even when passing encoding_errors='ignore'. This happens because we wrap the S3 object in a TextIOWrapper after retrieving it and pass that to pd.read_csv, so decoding happens in our wrapper and pandas never gets to apply encoding_errors.
When encoding_errors='ignore' is specified, we now keep the object as raw bytes (mode=rb). In this case pandas is responsible for wrapping it in a TextIOWrapper and for handling the encoding and any encoding errors.
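
The mechanism can be illustrated with the standard library alone. This is a minimal sketch, not the code in this PR: `parse_rows` is a hypothetical consumer standing in for pd.read_csv, and an in-memory `io.BytesIO` stands in for the S3 object. It shows that once the caller wraps the bytes in a strict TextIOWrapper, the error policy is already fixed, whereas a binary handle lets the consumer choose it.

```python
import csv
import io

# Sample payload (an assumption for illustration): 0xff is never valid UTF-8.
RAW = b"name,value\ncaf\xff,1\ntea,2\n"


def parse_rows(binary_handle, encoding="utf-8", errors="strict"):
    """Hypothetical consumer standing in for pandas.

    It receives a binary handle and does the text wrapping itself,
    so it controls how decoding errors are handled.
    """
    text = io.TextIOWrapper(binary_handle, encoding=encoding, errors=errors, newline="")
    return list(csv.reader(text))


def read_prewrapped(data):
    """Sketch of the old behaviour: the caller wraps the bytes in text mode.

    The wrapper uses the default strict error policy, and nothing
    downstream can change it afterwards.
    """
    text = io.TextIOWrapper(io.BytesIO(data), encoding="utf-8", newline="")
    return list(csv.reader(text))


if __name__ == "__main__":
    try:
        read_prewrapped(RAW)
    except UnicodeDecodeError as exc:
        print("pre-wrapped handle failed:", exc)

    # Binary handle: the consumer applies the requested error policy.
    print(parse_rows(io.BytesIO(RAW), errors="ignore"))
```

With the binary handle, the invalid byte is dropped and the remaining fields are parsed normally.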

I'm actually thinking we should never wrap the S3 object in a TextIOWrapper ourselves. As far as I can tell there is no advantage in doing that, and pandas will take care of it anyway. mode should always be set to rb in our code... but I'm curious about others' opinions!

Relates

Wrangler S3 read_csv API is not honouring "encoding_errors=ignore" option #1668

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

malachi-constant · 2022-10-27T10:25:42Z

AWS CodeBuild CI Report

CodeBuild project: GitHubCodeBuild8756EF16-4rfo0GHQ0u9a
Commit ID: 16410de
Result: SUCCEEDED
Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

kukushking · 2022-10-27T10:41:38Z

Would you add a test case covering this, please?

malachi-constant · 2022-10-27T16:13:59Z

AWS CodeBuild CI Report

CodeBuild project: GitHubCodeBuild8756EF16-4rfo0GHQ0u9a
Commit ID: 39bcd61
Result: FAILED
Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

cnfait · 2022-10-27T16:20:06Z

Would you add a test case covering this, please?

Added a test case in which a UnicodeDecodeError is raised by default and is not raised when passing encoding_errors='ignore'.
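
A test along these lines might look like the sketch below. It is not the test added in this PR: it calls pd.read_csv on an in-memory buffer instead of going through the S3 read path, and the sample bytes and the helper name `read_bad_csv` are assumptions for illustration. It assumes pandas 1.3 or later, where the encoding_errors argument is available.

```python
import io

import pandas as pd

# Sample payload (an assumption for illustration): 0xff is never valid UTF-8.
BAD_CSV = b"col1,col2\nabc\xff,1\ndef,2\n"


def read_bad_csv(**pandas_kwargs) -> pd.DataFrame:
    """Hypothetical helper: read the payload from a binary, in-memory handle."""
    return pd.read_csv(io.BytesIO(BAD_CSV), **pandas_kwargs)


def test_encoding_errors_ignore() -> None:
    # Default (strict) decoding is expected to fail on the invalid byte.
    try:
        read_bad_csv()
    except UnicodeDecodeError:
        pass
    else:
        raise AssertionError("expected UnicodeDecodeError with default settings")

    # With encoding_errors="ignore", the invalid byte is dropped during decoding.
    df = read_bad_csv(encoding_errors="ignore")
    assert df["col1"].tolist() == ["abc", "def"]
    assert df["col2"].tolist() == [1, 2]
```

Because the handle is binary in both calls, the only difference between the failing and the passing read is the encoding_errors argument.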

malachi-constant

Nice! Test worked for me.

kukushking

Awesome, thanks!

cnfait added the bug (Something isn't working) label Oct 27, 2022

cnfait self-assigned this Oct 27, 2022

read_csv: read file as binary when encoding_errors is set to ignore

0c3de78

cnfait force-pushed the read-csv-encoding-errors-param branch from 16410de to 0c3de78 Compare October 27, 2022 10:15

cnfait marked this pull request as ready for review October 27, 2022 10:27

cnfait requested review from kukushking, jaidisido, LeonLuttenberger and malachi-constant October 27, 2022 10:28

read_csv: add test case for encoding_errors pandas argument

f9cb833

cnfait force-pushed the read-csv-encoding-errors-param branch from 39bcd61 to f9cb833 Compare October 27, 2022 16:10

malachi-constant approved these changes Oct 27, 2022

cnfait mentioned this pull request Oct 27, 2022

Wrangler S3 read_csv API is not honouring "encoding_errors=ignore" option #1668

Closed

jaidisido approved these changes Oct 28, 2022

kukushking approved these changes Oct 28, 2022

cnfait merged commit 7b1d252 into main Oct 28, 2022

cnfait deleted the read-csv-encoding-errors-param branch October 28, 2022 09:49

kukushking added this to the 2.18.0 milestone Dec 2, 2022
