I’m trying to upload a large pandas DataFrame to SQL Server using pyodbc. Because there’s so much data, my code splits it into batches. The weird thing is that with batch_size = 1 everything inserts fine, but with a batch_size of 2 or higher some batches fail to insert.
I think the problem is with null values in my float and integer columns. I’m not sure how to handle np.nan values properly when doing batch inserts.
Here’s a simple example that shows the issue:
import pandas as pd
import numpy as np
import pyodbc
sample_data = {
    "record_id": [
        "ID_A", "ID_B", "ID_C", "ID_D", "ID_E",
        "ID_F", "ID_G", "ID_H", "ID_I", "ID_J"
    ],
    "person_name": [
        "John", None, "Mary", "John", "Steve",
        "John", "John", None, "David", "Mary"
    ],
    "person_location": [
        "New York", "Boston", "Chicago", "New York", None,
        "New York", "Chicago", None, "Boston", "Boston"
    ],
    "person_score": [
        425.0, 178.0, np.nan, np.nan, 642.0,
        777.0, 533.0, 289.0, 811.0, 156.0
    ],
    "person_rating": [
        85.2, 42.7, 67.8, 91.3, 15.4,
        np.nan, 28.9, np.nan, 73.1, np.nan
    ],
    "person_value": [
        12.5, np.nan, 88.7, 45.2, 76.3,
        69.8, np.nan, 31.6, 18.9, 52.4
    ]
}
test_df = pd.DataFrame(sample_data)
# cast the score column to pandas' nullable Int64 so it can hold missing values
test_df['person_score'] = test_df['person_score'].astype('Int64')
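For completeness, after that cast person_score is pandas' nullable Int64 (its missing values are pd.NA), while person_rating and person_value stay float64 with np.nan:

print(test_df.dtypes)
# record_id           object
# person_name         object
# person_location     object
# person_score         Int64
# person_rating      float64
# person_value       float64
print(test_df.loc[2, 'person_score'])   # <NA>
print(test_df.loc[5, 'person_rating'])  # nan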
My insert code looks like this:
primary_keys = ['record_id']
table_name = 'PersonDataTable'
batch_size = 3  # rows per executemany() batch
connection_string = "xxxxxxx"

# Parameterised INSERT built from the dataframe's columns
# (stand-in, since the original statement isn't shown)
column_names = list(test_df.columns)
insert_statement = (
    f"INSERT INTO {table_name} ({', '.join(column_names)}) "
    f"VALUES ({', '.join('?' for _ in column_names)})"
)

db_conn = pyodbc.connect(connection_string)
db_cursor = db_conn.cursor()
db_cursor.fast_executemany = True

# Process batches
for i, start_row in enumerate(range(0, len(test_df), batch_size)):
    current_batch = test_df.iloc[start_row:start_row + batch_size]
    # Replace missing values with None before binding
    # (DataFrame.map needs pandas >= 2.1; use applymap on older versions)
    batch_values = current_batch.map(lambda val: None if pd.isna(val) else val).values.tolist()
    try:
        db_cursor.executemany(insert_statement, batch_values)
        db_conn.commit()
        print(f"Batch {i} inserted successfully")
    except Exception as error:
        print(f"Batch {i} failed: {error}")
        continue

db_cursor.close()
db_conn.close()
The error I get suggests that np.nan values aren’t being converted to None when a batch contains more than one record. With batch_size = 1, the exact same data inserts without any problems.
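For reference, a quick way to see what actually ends up in batch_values is to print the parameter types for the first batch:

# inspect the converted parameters for the first batch
first_batch = test_df.iloc[0:batch_size]
converted = first_batch.map(lambda val: None if pd.isna(val) else val)
for row in converted.values.tolist():
    print([type(v).__name__ for v in row])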
Why does the batch size matter for null value handling? Is there a better way to convert np.nan to None for SQL Server inserts when using executemany?
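For example, would casting the whole frame to object dtype before replacing the missing values be the right approach, or is there something better suited to fast_executemany? Roughly this (I haven’t confirmed it’s the recommended way):

# cast to object first so pandas doesn't coerce None back to NaN,
# then swap every missing value for None
object_df = test_df.astype(object).where(test_df.notna(), None)
batch_values = object_df.values.tolist()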