feat: Support reading CSV files with inconsistent column counts #17553
Conversation
Enable DataFusion to read directories containing CSV files with different numbers of columns by implementing schema union during inference.

Changes:
- Modified CSV schema inference to create a union schema from all files
- Extended infer_schema_from_stream to handle varying column counts
- Added tests for the schema building logic and integration scenarios

Requires CsvReadOptions::new().truncated_rows(true) to handle files with fewer columns than the inferred schema.

Fixes apache#17516
Looks like some CI checks to address as well; would suggest running the standard cargo clippy, fmt & test commands locally before pushing so we don't need to wait for CI checks on GitHub to catch these
I feel this file can be included at the bottom of datafusion/datasource-csv/src/file_format.rs instead of being a separate file, given it unit tests the build_schema_helper() function only.
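A rough sketch of that layout (the module and test names here are placeholders, not taken from the PR):

// at the bottom of datafusion/datasource-csv/src/file_format.rs
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn build_schema_helper_unions_columns_across_files() {
        // exercise build_schema_helper() directly here instead of keeping
        // these cases in a separate test file
    }
}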
/// number of lines that were read
/// number of lines that were read.
///
/// This method now supports CSV files with different numbers of columns.
/// This method now supports CSV files with different numbers of columns.
/// This method can handle CSV files with different numbers of columns.
// Verify that the union schema is being used correctly
// We should be able to find records from both files
println!("✅ Successfully read {} record batches with {} total rows", results.len(), total_rows);

// Verify schema has all expected columns
for batch in &results {
    assert_eq!(batch.schema().fields().len(), 6, "Each batch should use the union schema with 6 fields");
}

println!("✅ Successfully verified CSV schema inference fix!");
println!(" - Read {} files with different column counts (3 vs 6)", temp_dir.path().read_dir().unwrap().count());
println!(" - Inferred schema with {} columns", schema.fields().len());
println!(" - Processed {} total rows", total_rows);
Could we remove these prints and keep the tests to assert statements only, to cut down on verbosity?
// Verify we can actually read the data
let results = df.collect().await?;

// Calculate total rows across all batches
let total_rows: usize = results.iter().map(|batch| batch.num_rows()).sum();
assert_eq!(total_rows, 6, "Should have 6 total rows across all batches");
I feel we should assert the actual row contents as well
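For example, something along these lines (a sketch only; the column names and values below are placeholders, not the PR's actual test data):

use datafusion::assert_batches_eq;

// Compare the full rendered batches instead of only counting rows.
// A sort/ORDER BY may be needed first so the row order across files is deterministic.
let expected = [
    "+----+------+",
    "| c1 | c2   |",
    "+----+------+",
    "| 1  | a    |",
    "| 2  | b    |",
    "+----+------+",
];
assert_batches_eq!(expected, &results);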
// Handle files with different numbers of columns by extending the schema
if fields.len() > column_type_possibilities.len() {
    // New columns found - extend our tracking structures
    for field in fields.iter().skip(column_type_possibilities.len()) {
        column_names.push(field.name().clone());
        let mut possibilities = HashSet::new();
        if records_read > 0 {
            possibilities.insert(field.data_type().clone());
        }
        column_type_possibilities.push(possibilities);
    }
}
What happens if file A has columns t1, t3 but file B has columns t1, t2, t3?
Do we only allow files having subset of columns of other files in the exact correct order?
AKA we don't support union by name?
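For example (hypothetical files, not from the PR): if file_a.csv has the header t1,t3 and file_b.csv has t1,t2,t3, a purely positional union would read file_a's second column as t2, so its t3 values would land under the wrong field.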
// We take the minimum of fields.len() and column_type_possibilities.len()
// to avoid index out of bounds when a file has fewer columns
let max_fields_to_process = fields.len().min(column_type_possibilities.len());
Is this minimum strictly necessary? Above we check that if fields > column_type_possibilities then column_type_possibilities is increased to match fields; if fields < column_type_possibilities then we only need to iterate over fields anyway.
{
    let mut set = HashSet::new();
    set.insert(DataType::Int64);
    set
},
{
    let mut set = HashSet::new();
    set.insert(DataType::Int64);
    set
},
HashSet::from([DataType::Int64])
FYI, applies for the rest too
let df = ctx
    .read_csv(
        temp_path.to_str().unwrap(),
        CsvReadOptions::new().truncated_rows(true)
If we do decide to reuse this existing config option then we should update the documentation for it as we're now repurposing it for something different (albeit similar) in purpose
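For instance, the option's doc comment could be extended along these lines (the wording is only a suggestion, and the exact field/builder name should match whatever CsvReadOptions already exposes):

/// Whether to allow truncated rows (rows with fewer fields than the schema)
/// when parsing CSV.
///
/// This now also controls reading a directory of CSV files with differing
/// column counts: files with fewer columns than the inferred union schema
/// are only accepted when this is set to true.
pub truncated_rows: bool,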
Summary
Enables DataFusion to read directories containing CSV files with
different numbers of columns by implementing schema union during
inference.
Previously, attempting to read multiple CSV files with different
column counts would fail with:
Arrow error: Csv error: incorrect number of fields for line 1,
expected 17 got 20
This was particularly problematic for evolving datasets where newer
files include additional columns (e.g., railway services data where
newer files added platform information).
Changes
- Modified CSV schema inference and extended infer_schema_from_stream to create a union schema from all files instead of rejecting files with different column counts
- Reading files with fewer columns than the inferred schema requires explicit opt-in via truncated_rows(true)
- Added unit tests for the schema building logic and an integration test with real CSV scenarios
Usage
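A minimal usage sketch (the directory path and the async setup here are placeholders; the option name follows the PR description):

use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();

    // Directory containing CSV files with differing column counts,
    // e.g. older files with 3 columns and newer files with 6.
    let df = ctx
        .read_csv(
            "/path/to/csv_dir",
            // Opt in to reading files that have fewer columns than the
            // inferred union schema.
            CsvReadOptions::new().truncated_rows(true),
        )
        .await?;

    df.show().await?;
    Ok(())
}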
Test Results
CSV files
Example
Before this fix: reading such a directory failed with the Arrow error shown above (incorrect number of fields).
After this fix: the directory is read using the inferred union schema, with truncated_rows(true) opting in to files that have fewer columns than that schema.
Closes #17516