Hey folks, I need some advice on my Azure Data Factory setup. I’m syncing a big Azure SQL DB (millions of rows) with an on-prem read-only DB every 5 minutes. I’m using an UPSERT with a time-based filter to keep things efficient:
WHERE [HistoryDate] >= '@{formatDateTime(pipeline().parameters.windowStart, 'yyyy-MM-dd HH:mm')}'
AND [HistoryDate] < '@{formatDateTime(pipeline().parameters.windowEnd, 'yyyy-MM-dd HH:mm')}'
My worry is about pipeline failures. If the connection drops and I rerun the pipeline later, the time window will be different. This means I might miss data from the original time slot.
Am I on the right track here? How do you handle sync failures without losing data in time-sensitive setups? Any tips or tricks would be super helpful!
Thanks in advance for your input!
hey haz, good question! i’ve dealt with similar issues. one trick is to use a separate control table to track successful sync windows. when rerunning, check this table and adjust your time filter accordingly. also, consider adding a buffer to your end time (like +10 mins) to catch any stragglers. hope this helps!
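A minimal sketch of that control-table idea, assuming T-SQL on the Azure SQL side (table and column names here are just placeholders, not anything ADF provides):

```sql
-- Hypothetical control table recording each sync window's outcome
CREATE TABLE dbo.SyncControl (
    SyncId      INT IDENTITY PRIMARY KEY,
    WindowStart DATETIME2 NOT NULL,
    WindowEnd   DATETIME2 NOT NULL,
    Status      VARCHAR(20) NOT NULL,  -- 'Running' / 'Succeeded' / 'Failed'
    CompletedAt DATETIME2 NULL
);

-- On rerun, a Lookup activity can find the earliest window that never
-- succeeded and use it as the new lower bound for the copy
SELECT MIN(WindowStart) AS ResumeFrom
FROM dbo.SyncControl
WHERE Status <> 'Succeeded';
```

The pipeline would insert a 'Running' row before the copy, then flip it to 'Succeeded' only after the copy activity completes, so a crash mid-run leaves a visible gap to resume from.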
Your approach is sound, but there’s room for improvement. Consider implementing a watermark-based system. Store the last successfully processed timestamp in a separate table. When rerunning after a failure, use this watermark instead of fixed time windows. This ensures you don’t miss data, even if the original time slot has passed.
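A rough T-SQL sketch of the watermark pattern described above (the table, the 'History' name, and the @RunStart parameter are illustrative assumptions):

```sql
-- Hypothetical watermark table: one row per synced table, holding the
-- last timestamp that was fully copied
CREATE TABLE dbo.SyncWatermark (
    TableName    SYSNAME PRIMARY KEY,
    LastSyncedAt DATETIME2 NOT NULL
);

-- A Lookup activity reads the watermark to build the lower bound:
SELECT LastSyncedAt FROM dbo.SyncWatermark WHERE TableName = 'History';

-- The copy then filters roughly like:
--   WHERE [HistoryDate] >= @LastSyncedAt AND [HistoryDate] < @RunStart
-- where @RunStart is captured once at pipeline start.

-- Only after the copy succeeds does a final activity advance the watermark:
UPDATE dbo.SyncWatermark
SET LastSyncedAt = @RunStart
WHERE TableName = 'History';
```

Because the watermark only moves on success, a failed run simply reuses the old lower bound on the next execution, so no window is ever skipped.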
Additionally, implement error handling and retry logic within your pipeline. This can help mitigate temporary connection issues without requiring manual intervention. For data consistency, you might also want to add a ‘ProcessedFlag’ column in your source database. Mark records as processed only after successful sync, allowing you to easily identify and retry failed transfers.
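If the source side is writable, the ProcessedFlag idea could look something like this (column and bound names are assumptions for illustration):

```sql
-- Hypothetical flag column on the source table
ALTER TABLE dbo.History ADD ProcessedFlag BIT NOT NULL DEFAULT 0;

-- The copy activity selects only unprocessed rows:
--   SELECT ... FROM dbo.History WHERE ProcessedFlag = 0 AND [HistoryDate] < @RunStart

-- After the sink confirms success, flip the flag for exactly the rows
-- the copy covered (same @RunStart bound the copy used)
UPDATE dbo.History
SET ProcessedFlag = 1
WHERE ProcessedFlag = 0
  AND [HistoryDate] < @RunStart;
```

One caveat worth noting: this requires write access to the source, which may rule it out if the on-prem side is the read-only end of the sync.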
Remember, regular monitoring and alerting are crucial for quickly addressing sync failures in time-sensitive setups.
heya Haz! have you thought about using change tracking in your source DB? it can pinpoint exactly what changed since the last sync, sidestepping the time-window problem entirely. also, what's your error handling like? maybe add some logging to capture failed runs' details? curious to hear more about your setup!
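For anyone curious, SQL Server / Azure SQL change tracking looks roughly like this (the 'History' table, 'Id' key column, and version bookkeeping are assumptions, a sketch rather than a drop-in solution):

```sql
-- Enable change tracking at the database and table level
ALTER DATABASE CURRENT
SET CHANGE_TRACKING = ON (CHANGE_RETENTION = 2 DAYS, AUTO_CLEANUP = ON);

ALTER TABLE dbo.History ENABLE CHANGE_TRACKING;

-- Pull everything changed since the last synced version
DECLARE @last_version BIGINT = 0;  -- in practice, read this from a control table
SELECT ct.Id, ct.SYS_CHANGE_OPERATION, h.*
FROM CHANGETABLE(CHANGES dbo.History, @last_version) AS ct
LEFT JOIN dbo.History AS h ON h.Id = ct.Id;  -- LEFT JOIN so deletes still appear

-- After a successful sync, store this as the new @last_version
SELECT CHANGE_TRACKING_CURRENT_VERSION();
```

Like the watermark approach, the trick is to persist the new version number only after the copy succeeds, so a failed run naturally re-reads the same changes.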