Automating database updates from XML and PDF files: What's the optimal solution?

Our current situation:

We need to update our database twice monthly using data from XML and PDF files. Right now everything is done manually which takes forever.

Current workflow:

  • Compare new XML files with previous versions using MS xmldiff
  • For PDF files we have to check everything by hand (really painful)
  • Make database changes through a web form one by one
  • Handle new records, updates, and deletions

This whole process is boring and eats up so much time. We want to automate as much as possible.

What we’re considering:

We’re thinking about using Python since nobody on our team knows scripting yet. We’d have to learn from scratch.

Main questions:

  1. Should we go with a scripting approach or are there better tools out there for handling XML comparison and database updates?

  2. Is Python a solid choice for this type of work? We need something that can handle file processing and connect to databases easily.

  3. Does anyone know good free tools for comparing PDF files? We need to detect changes in specific tables and sections.

Our database is MS Access by the way. Any advice would be great!

what’s your data volume like? have you tried access’s import wizards? they sometimes handle basic xml without any coding. but for pdf comparison, you’ll probably need something more powerful than what access can do out of the box.

python’s def the way to go! I’ve done similar stuff, and it really shines with xml parsing using lxml. for pdfs, try tabula-py or pdfplumber to get that table data nice. and connecting to access? super easy with pyodbc!

Go with PowerShell since you’re already using Access and Microsoft tools. I migrated something similar last year - PowerShell’s great because it handles XML natively and connects to Access databases without any hassle. For PDFs, you can hook into Adobe Acrobat’s COM interface (if you’ve got the full version) or use external tools like PDFtk. It’s easier to learn than Python if your team’s Windows-focused, and you don’t have to rebuild everything from scratch. Just automate piece by piece. I’d start with the XML comparison and database updates, then move to PDF processing once you’ve got the hang of it.

skip python if your team’s never scripted before. check out pentaho data integration (it’s free) or talend open studio instead. both are drag-and-drop etl tools that handle xml and database connections without any coding. you’ll still need to learn them, but it’s way easier than starting from zero with python.