I’m working on a project to move data from SQL Server to Databricks using the medallion architecture. We have about 50 tables to transfer. The process is straightforward up to the bronze layer, but I’m unsure about the best way to handle the silver and gold layers.
For the silver layer, each table needs different transformations. Should I make a separate notebook for every table? Or is there a way to use one notebook for all of them?
The gold layer is even trickier. Each gold table combines 3-4 silver tables. Do I need individual notebooks here too? And how can I ensure that a gold table only updates after all its dependent tables are ready?
I’m trying to find the most efficient way to manage this process. Any advice on structuring the workflows would be really helpful. Thanks!
hey max! for silver layer, u could try using a single notebook with dynamic parameters. set up a config file with table-specific transformations, then use that in ur notebook.
for gold, same idea but more complex. u might need some sorta dependency tracking system. maybe look into airflow or databricks workflows for orchestration?
good luck with ur project!
For the silver layer, I recommend creating a parameterized notebook that can handle various table transformations. Implement a configuration-driven approach where you define transformation logic for each table in a separate config file. This way, you can reuse the same notebook for multiple tables, reducing code duplication and simplifying maintenance.
Regarding the gold layer, consider implementing a modular approach. Create a main orchestrator notebook that calls individual transformation notebooks for each gold table. Use Databricks jobs or Azure Data Factory to manage dependencies and ensure proper execution order. This approach provides flexibility while maintaining clear separation of concerns.
To track dependencies and trigger updates efficiently, implement a metadata-driven system. Store table metadata, including last update timestamps and dependencies, in a control table. Use this information to determine which gold tables need updating based on changes in their source silver tables.
Have u considered using delta live tables for this? They can handle transformations and dependencies pretty smoothly. might simplify ur silver and gold layer setup.
What kinda performance requirements do u have? That could impact how u structure things. curious to hear more about ur specific use case!