feat: Support reading CSV files with inconsistent column counts #17553
Conversation
Enable DataFusion to read directories containing CSV files with different numbers of columns by implementing schema union during inference.

Changes:
- Modified CSV schema inference to create a union schema from all files
- Extended infer_schema_from_stream to handle varying column counts
- Added tests for the schema building logic and integration scenarios

Requires CsvReadOptions::new().truncated_rows(true) to handle files with fewer columns than the inferred schema.

Fixes apache#17516
Looks like some CI checks to address as well; would suggest running the standard cargo clippy, fmt & test commands locally before pushing so we don't need to wait for CI checks on GitHub to catch these
I feel this file can be included at the bottom of datafusion/datasource-csv/src/file_format.rs instead of being a separate file, given it unit tests the build_schema_helper() function only.
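A rough sketch of that layout (the module and test names here are placeholders, not taken from the PR):

// at the bottom of datafusion/datasource-csv/src/file_format.rs
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn build_schema_helper_unions_columns_across_files() {
        // exercise build_schema_helper() directly here instead of keeping
        // these cases in a separate test file
    }
}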
/// number of lines that were read
/// number of lines that were read.
///
/// This method now supports CSV files with different numbers of columns.
/// This method now supports CSV files with different numbers of columns.
/// This method can handle CSV files with different numbers of columns.
// Verify that the union schema is being used correctly
// We should be able to find records from both files
println!("✅ Successfully read {} record batches with {} total rows", results.len(), total_rows);

// Verify schema has all expected columns
for batch in &results {
    assert_eq!(batch.schema().fields().len(), 6, "Each batch should use the union schema with 6 fields");
}

println!("✅ Successfully verified CSV schema inference fix!");
println!(" - Read {} files with different column counts (3 vs 6)", temp_dir.path().read_dir().unwrap().count());
println!(" - Inferred schema with {} columns", schema.fields().len());
println!(" - Processed {} total rows", total_rows);
Could we remove these prints and keep the tests to assert statements only, to cut down on verbosity?
// Verify we can actually read the data
let results = df.collect().await?;

// Calculate total rows across all batches
let total_rows: usize = results.iter().map(|batch| batch.num_rows()).sum();
assert_eq!(total_rows, 6, "Should have 6 total rows across all batches");
I feel we should assert the actual row contents as well
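For example, something along these lines (a sketch only; the column names and values below are placeholders, not the PR's actual test data):

use datafusion::assert_batches_eq;

// Compare the full rendered batches instead of only counting rows.
// A sort/ORDER BY may be needed first so the row order across files is deterministic.
let expected = [
    "+----+------+",
    "| c1 | c2   |",
    "+----+------+",
    "| 1  | a    |",
    "| 2  | b    |",
    "+----+------+",
];
assert_batches_eq!(expected, &results);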
// Handle files with different numbers of columns by extending the schema
if fields.len() > column_type_possibilities.len() {
    // New columns found - extend our tracking structures
    for field in fields.iter().skip(column_type_possibilities.len()) {
        column_names.push(field.name().clone());
        let mut possibilities = HashSet::new();
        if records_read > 0 {
            possibilities.insert(field.data_type().clone());
        }
        column_type_possibilities.push(possibilities);
    }
}
What happens if file A has columns t1, t3 but file B has columns t1, t2, t3?
Do we only allow files having subset of columns of other files in the exact correct order?
AKA we don't support union by name?
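For example (hypothetical files, not from the PR): if file_a.csv has the header t1,t3 and file_b.csv has t1,t2,t3, a purely positional union would read file_a's second column as t2, so its t3 values would land under the wrong field.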
// We take the minimum of fields.len() and column_type_possibilities.len()
// to avoid index out of bounds when a file has fewer columns
let max_fields_to_process = fields.len().min(column_type_possibilities.len());
Is this minimum strictly necessary? Above we check that if fields > column_type_possibilities then column_type_possibilities is increased to match fields; if fields < column_type_possibilities then we only need to iterate over fields anyway.
{
    let mut set = HashSet::new();
    set.insert(DataType::Int64);
    set
},
{
    let mut set = HashSet::new();
    set.insert(DataType::Int64);
    set
},
HashSet::from([DataType::Int64])
FYI, applies for the rest too
let df = ctx
    .read_csv(
        temp_path.to_str().unwrap(),
        CsvReadOptions::new().truncated_rows(true)
If we do decide to reuse this existing config option then we should update the documentation for it as we're now repurposing it for something different (albeit similar) in purpose
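For instance, the option's doc comment could be extended along these lines (the wording is only a suggestion, and the exact field/builder name should match whatever CsvReadOptions already exposes):

/// Whether to allow truncated rows (rows with fewer fields than the schema)
/// when parsing CSV.
///
/// This now also controls reading a directory of CSV files with differing
/// column counts: files with fewer columns than the inferred union schema
/// are only accepted when this is set to true.
pub truncated_rows: bool,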
Summary
Enables DataFusion to read directories containing CSV files with
different numbers of columns by implementing schema union during
inference.
Previously, attempting to read multiple CSV files with different
column counts would fail with:
Arrow error: Csv error: incorrect number of fields for line 1,
expected 17 got 20
This was particularly problematic for evolving datasets where newer
files include additional columns (e.g., railway services data where
newer files added platform information).
Changes
- Modified CSV schema inference and extended infer_schema_from_stream to create a union schema from all files instead of rejecting files with different column counts
- Reading files with fewer columns than the inferred schema requires explicit opt-in via truncated_rows(true)
- Added unit tests for the schema building logic and an integration test with real CSV scenarios
Usage
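A minimal usage sketch (the directory path and the async setup here are placeholders; the option name follows the PR description):

use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();

    // Directory containing CSV files with differing column counts,
    // e.g. older files with 3 columns and newer files with 6.
    let df = ctx
        .read_csv(
            "/path/to/csv_dir",
            // Opt in to reading files that have fewer columns than the
            // inferred union schema.
            CsvReadOptions::new().truncated_rows(true),
        )
        .await?;

    df.show().await?;
    Ok(())
}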
Test Results
CSV files
Example
Before this fix: reading such a directory failed with the Arrow error shown above (incorrect number of fields).
After this fix: the directory is read using the inferred union schema, with truncated_rows(true) opting in to files that have fewer columns than that schema.
Closes #17516