I’m trying to upload a large pandas DataFrame to SQL Server using pyodbc. Because there’s so much data, my code splits it into batches. The weird thing is that with batch_size = 1 everything inserts fine, but with a batch_size of 2 or higher some batches fail to insert.
I think the problem is with null values in my float and integer columns. I’m not sure how to handle np.nan values properly when doing batch inserts.
Here’s a simple example that shows the issue:
import pandas as pd
import numpy as np
import pyodbc
sample_data = {
    "record_id": [
        "ID_A", "ID_B", "ID_C", "ID_D", "ID_E",
        "ID_F", "ID_G", "ID_H", "ID_I", "ID_J"
    ],
    "person_name": [
        "John", None, "Mary", "John", "Steve",
        "John", "John", None, "David", "Mary"
    ],
    "person_location": [
        "New York", "Boston", "Chicago", "New York", None,
        "New York", "Chicago", None, "Boston", "Boston"
    ],
    "person_score": [
        425.0, 178.0, np.nan, np.nan, 642.0,
        777.0, 533.0, 289.0, 811.0, 156.0
    ],
    "person_rating": [
        85.2, 42.7, 67.8, 91.3, 15.4,
        np.nan, 28.9, np.nan, 73.1, np.nan
    ],
    "person_value": [
        12.5, np.nan, 88.7, 45.2, 76.3,
        69.8, np.nan, 31.6, 18.9, 52.4
    ]
}
test_df = pd.DataFrame(sample_data)
# cast the score column to pandas' nullable Int64 so it can hold missing values
test_df['person_score'] = test_df['person_score'].astype('Int64')
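For completeness, after that cast person_score is pandas' nullable Int64 (its missing values are pd.NA), while person_rating and person_value stay float64 with np.nan:

print(test_df.dtypes)
# record_id           object
# person_name         object
# person_location     object
# person_score         Int64
# person_rating      float64
# person_value       float64
print(test_df.loc[2, 'person_score'])   # <NA>
print(test_df.loc[5, 'person_rating'])  # nan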
My insert code looks like this:
primary_keys = ['record_id']
table_name = 'PersonDataTable'
batch_size = 3  # rows per executemany() batch
connection_string = "xxxxxxx"

# Parameterised INSERT built from the dataframe's columns
# (stand-in, since the original statement isn't shown)
column_names = list(test_df.columns)
insert_statement = (
    f"INSERT INTO {table_name} ({', '.join(column_names)}) "
    f"VALUES ({', '.join('?' for _ in column_names)})"
)

db_conn = pyodbc.connect(connection_string)
db_cursor = db_conn.cursor()
db_cursor.fast_executemany = True

# Process batches
for i, start_row in enumerate(range(0, len(test_df), batch_size)):
    current_batch = test_df.iloc[start_row:start_row + batch_size]
    # Replace missing values with None before binding
    # (DataFrame.map needs pandas >= 2.1; use applymap on older versions)
    batch_values = current_batch.map(lambda val: None if pd.isna(val) else val).values.tolist()
    try:
        db_cursor.executemany(insert_statement, batch_values)
        db_conn.commit()
        print(f"Batch {i} inserted successfully")
    except Exception as error:
        print(f"Batch {i} failed: {error}")
        continue

db_cursor.close()
db_conn.close()
The error I get suggests that np.nan values aren’t being converted to None when a batch contains more than one record. With batch_size = 1, the exact same data inserts without any problems.
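For reference, a quick way to see what actually ends up in batch_values is to print the parameter types for the first batch:

# inspect the converted parameters for the first batch
first_batch = test_df.iloc[0:batch_size]
converted = first_batch.map(lambda val: None if pd.isna(val) else val)
for row in converted.values.tolist():
    print([type(v).__name__ for v in row])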
Why does the batch size matter for null value handling? Is there a better way to convert np.nan to None for SQL Server inserts when using executemany?
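For example, would casting the whole frame to object dtype before replacing the missing values be the right approach, or is there something better suited to fast_executemany? Roughly this (I haven’t confirmed it’s the recommended way):

# cast to object first so pandas doesn't coerce None back to NaN,
# then swap every missing value for None
object_df = test_df.astype(object).where(test_df.notna(), None)
batch_values = object_df.values.tolist()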