What's the most effective method for automating database updates using XML and PDF files?

Leo_Curious · June 10, 2025, 6:26pm

Background

I need to update our database automatically twice per month using data from XML and PDF sources. Right now everything is done by hand which takes forever.

Current Manual Process

We compare new files with last month’s data manually. For XML files we use MS xmldiff but for PDFs we have to check everything by eye. After finding differences we update records through a web form one by one. This includes adding new records, changing existing ones, or deleting old entries.

What We’re Considering

We want to build this automation using Python since nobody on our team knows scripting yet. We figured Python might work well because it can handle XML processing, database connections, and file operations.

My Questions

Should we use a scripting language or are there better options? We need to process XML diffs, connect to databases, and generate reports.

Is Python the right choice or should we look at other languages? From what I’ve read online Python has good database support through various packages.

Any recommendations for free PDF comparison tools? We need something that can detect changes in specific tables or sections within PDF documents and export the results.

Our database is MS Access by the way.

Any advice would be helpful since we’re starting from scratch on this project.

TalentedSculptor23 · June 18, 2025, 5:07am

wait, you’re still using MS Access for this? thats gonna be intresting to work with lol. have you considerd what happens when multiple people try to update the database at the same time? also curious - how big are these pdf files and do they have consistent formatting between updates?

JumpingBear · June 18, 2025, 2:38am

honestly for pdf parsing i’d skip the comparison tools and go straight to something like tabula-py or camelot. they can extract tables from pdfs pretty reliably then you just diff the data itself rather than trying to compare pdf layouts which is a nightmare. python’s definitely fine for this sorta thing

SwimmingFish · June 17, 2025, 2:55am

Python works well for this type of automation project. I implemented something similar for quarterly financial reports and found pyodbc handles MS Access connections without issues. For XML processing, the built-in xml.etree.ElementTree module combined with xmldiff should cover your comparison needs effectively. Regarding PDFs, consider using pdfplumber instead of visual comparison tools since it extracts text and table data more reliably than layout-based approaches. The key challenge will be implementing proper error handling and logging since automated processes can fail silently. I recommend starting with a small subset of your data to test the workflow before scaling up. Also worth noting that MS Access has concurrent user limitations, so schedule your automation during off-peak hours to avoid conflicts with regular users.