Hey folks, I need some advice on my Azure Data Factory setup. I’m syncing a big Azure SQL DB (millions of rows) with an on-prem read-only DB every 5 minutes. I’m using an UPSERT with a time-based filter to keep things efficient:
WHERE [HistoryDate] >= '@{formatDateTime(pipeline().parameters.windowStart, 'yyyy-MM-dd HH:mm')}'
AND [HistoryDate] < '@{formatDateTime(pipeline().parameters.windowEnd, 'yyyy-MM-dd HH:mm')}'
My worry is about pipeline failures. If the connection drops and I rerun the pipeline later, the time window will be different. This means I might miss data from the original time slot.
Am I on the right track here? How do you handle sync failures without losing data in time-sensitive setups? Any tips or tricks would be super helpful!
Thanks in advance for your input!
hey haz, good question! i’ve dealt with similar issues. one trick is to use a separate control table to track successful sync windows. when rerunning, check this table and adjust your time filter accordingly. also, consider adding a buffer to your end time (like +10 mins) to catch any stragglers. hope this helps!
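A minimal sketch of that control-table idea, assuming T-SQL on the Azure SQL side (table and column names here are just placeholders, not anything ADF provides):

```sql
-- Hypothetical control table recording each sync window's outcome
CREATE TABLE dbo.SyncControl (
    SyncId      INT IDENTITY PRIMARY KEY,
    WindowStart DATETIME2 NOT NULL,
    WindowEnd   DATETIME2 NOT NULL,
    Status      VARCHAR(20) NOT NULL,  -- 'Running' / 'Succeeded' / 'Failed'
    CompletedAt DATETIME2 NULL
);

-- On rerun, a Lookup activity can find the earliest window that never
-- succeeded and use it as the new lower bound for the copy
SELECT MIN(WindowStart) AS ResumeFrom
FROM dbo.SyncControl
WHERE Status <> 'Succeeded';
```

The pipeline would insert a 'Running' row before the copy, then flip it to 'Succeeded' only after the copy activity completes, so a crash mid-run leaves a visible gap to resume from.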
Your approach is sound, but there’s room for improvement. Consider implementing a watermark-based system. Store the last successfully processed timestamp in a separate table. When rerunning after a failure, use this watermark instead of fixed time windows. This ensures you don’t miss data, even if the original time slot has passed.
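A rough T-SQL sketch of the watermark pattern described above (the table, the 'History' name, and the @RunStart parameter are illustrative assumptions):

```sql
-- Hypothetical watermark table: one row per synced table, holding the
-- last timestamp that was fully copied
CREATE TABLE dbo.SyncWatermark (
    TableName    SYSNAME PRIMARY KEY,
    LastSyncedAt DATETIME2 NOT NULL
);

-- A Lookup activity reads the watermark to build the lower bound:
SELECT LastSyncedAt FROM dbo.SyncWatermark WHERE TableName = 'History';

-- The copy then filters roughly like:
--   WHERE [HistoryDate] >= @LastSyncedAt AND [HistoryDate] < @RunStart
-- where @RunStart is captured once at pipeline start.

-- Only after the copy succeeds does a final activity advance the watermark:
UPDATE dbo.SyncWatermark
SET LastSyncedAt = @RunStart
WHERE TableName = 'History';
```

Because the watermark only moves on success, a failed run simply reuses the old lower bound on the next execution, so no window is ever skipped.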
Additionally, implement error handling and retry logic within your pipeline. This can help mitigate temporary connection issues without requiring manual intervention. For data consistency, you might also want to add a ‘ProcessedFlag’ column in your source database. Mark records as processed only after successful sync, allowing you to easily identify and retry failed transfers.
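If the source side is writable, the ProcessedFlag idea could look something like this (column and bound names are assumptions for illustration):

```sql
-- Hypothetical flag column on the source table
ALTER TABLE dbo.History ADD ProcessedFlag BIT NOT NULL DEFAULT 0;

-- The copy activity selects only unprocessed rows:
--   SELECT ... FROM dbo.History WHERE ProcessedFlag = 0 AND [HistoryDate] < @RunStart

-- After the sink confirms success, flip the flag for exactly the rows
-- the copy covered (same @RunStart bound the copy used)
UPDATE dbo.History
SET ProcessedFlag = 1
WHERE ProcessedFlag = 0
  AND [HistoryDate] < @RunStart;
```

One caveat worth noting: this requires write access to the source, which may rule it out if the on-prem side is the read-only end of the sync.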
Remember, regular monitoring and alerting are crucial for quickly addressing sync failures in time-sensitive setups.
heya Haz! have you thought about using change tracking in your source DB? it can pinpoint exactly what changed since the last sync, sidestepping the time-window problem entirely. also, what's your error handling like? maybe add some logging to capture failed runs' details? curious to hear more about your setup!
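For anyone curious, SQL Server / Azure SQL change tracking looks roughly like this (the 'History' table, 'Id' key column, and version bookkeeping are assumptions, a sketch rather than a drop-in solution):

```sql
-- Enable change tracking at the database and table level
ALTER DATABASE CURRENT
SET CHANGE_TRACKING = ON (CHANGE_RETENTION = 2 DAYS, AUTO_CLEANUP = ON);

ALTER TABLE dbo.History ENABLE CHANGE_TRACKING;

-- Pull everything changed since the last synced version
DECLARE @last_version BIGINT = 0;  -- in practice, read this from a control table
SELECT ct.Id, ct.SYS_CHANGE_OPERATION, h.*
FROM CHANGETABLE(CHANGES dbo.History, @last_version) AS ct
LEFT JOIN dbo.History AS h ON h.Id = ct.Id;  -- LEFT JOIN so deletes still appear

-- After a successful sync, store this as the new @last_version
SELECT CHANGE_TRACKING_CURRENT_VERSION();
```

Like the watermark approach, the trick is to persist the new version number only after the copy succeeds, so a failed run naturally re-reads the same changes.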