I’m building an app that needs to store lots of structured documents. I have about 1000 categories organized in a tree structure. Each category has thousands of documents (maybe up to 10000 each). The documents are just a few KB each and I want to store them as YAML files.
My users need to do these things:
- Get documents by their ID
- Search documents by certain fields inside them
- Edit documents and track who made changes with comments
- See the full history of changes for any document
I know I could use MongoDB or CouchDB for this, but I had a crazy idea. What if I just use git as my database? Here’s how I think it would work (I’ve put a rough sketch of the basics right after the list):
- Each category becomes a folder
- Each document becomes a file
- Reading documents means just reading files from disk
- Editing documents means making git commits with the user and comment
- History comes from git log and checking out old versions
- Search would be harder; I’d probably need to export the data to a real database for indexing
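To make it concrete, here’s a rough sketch of the read/write/history part, assuming Python and shelling out to the git CLI (the repo path and helper names are just placeholders):

```python
import subprocess
from pathlib import Path

REPO = Path("/srv/docstore")  # hypothetical repo root (a normal git working tree)

def _git(*args):
    # run a git command inside the repo and return its stdout
    return subprocess.run(
        ["git", "-C", str(REPO), *args],
        check=True, capture_output=True, text=True,
    ).stdout

def get_document(category, doc_id):
    # reading is just reading a file from the working tree, no git involved
    return (REPO / category / f"{doc_id}.yaml").read_text()

def save_document(category, doc_id, content, user, comment):
    # editing is a commit that records who changed what and why
    path = REPO / category / f"{doc_id}.yaml"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(content)
    rel = str(path.relative_to(REPO))
    _git("add", rel)
    _git("commit", "--author", f"{user} <{user}@example.com>", "-m", comment, "--", rel)

def get_history(category, doc_id):
    # full change history comes straight from git log
    return _git("log", "--follow", "--format=%H|%an|%ad|%s",
                "--", f"{category}/{doc_id}.yaml")
```

Reads never touch git at all, which is most of the appeal; only writes pay the cost of a commit.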
Has anyone tried this before? I’m wondering if there are any major problems with this approach. Would git be too slow compared to a real database? Are there reliability issues I should worry about?
I think having multiple servers that sync with git push/pull would actually be pretty robust.
What do you think - will this work or am I missing something obvious?
that’s actually fascinating! I’m curious though - how’d you handle the git repo size growing over time? with 10k docs per category and full history, wouldn’t that balloon pretty quickly? Also wondering about your backup strategy - did you just rely on the distributed nature or have separate backups too?
I built something like this for a documentation system two years ago - worked better than expected. Git actually handles thousands of small files pretty well if you structure things right. We used the same folder-per-category setup and rarely hit performance issues day-to-day.

But there are some gotchas to watch out for. Concurrent writes are a pain since git wasn’t built for database-style access. We ended up adding a simple queue to handle writes one at a time, which prevented corruption but made things more complex. Search was the biggest headache. We had to build a separate search index that updated after each commit, which kind of kills the whole ‘git as database’ simplicity.

The upside? Deployment and backups became dead simple. Every clone is a full backup, and we could replicate everything across multiple servers easily. For your scale, this could work, but ask yourself if dealing with concurrent access and search indexing is really worth skipping a proper database.
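To make those gotchas concrete, this is roughly the shape we ended up with (a from-memory sketch, not our actual code - the SQLite FTS index and all the names are just illustrative): a single writer thread drains a queue so commits never race each other, and the search index gets refreshed right after each commit.

```python
import queue
import sqlite3
import subprocess
import threading
from pathlib import Path

REPO = Path("/srv/docstore")  # hypothetical repo root

# single writer thread: every commit goes through this queue, so two requests
# can never fight over git's index lock or interleave their staging
write_queue = queue.Queue()

# stand-in search index: one SQLite FTS5 table, refreshed after each commit
index_db = sqlite3.connect("search_index.db", check_same_thread=False)
index_db.execute(
    "CREATE VIRTUAL TABLE IF NOT EXISTS docs USING fts5(doc_id, category, body)"
)
index_db.commit()

def _commit(category, doc_id, content, user, comment):
    # write the file and commit it with the user and comment attached
    path = REPO / category / f"{doc_id}.yaml"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(content)
    rel = str(path.relative_to(REPO))
    subprocess.run(["git", "-C", str(REPO), "add", rel], check=True)
    subprocess.run(
        ["git", "-C", str(REPO), "commit",
         "--author", f"{user} <{user}@example.com>", "-m", comment, "--", rel],
        check=True,
    )

def _reindex(category, doc_id, content):
    # keep the search index in sync with what just landed in the repo
    with index_db:
        index_db.execute("DELETE FROM docs WHERE doc_id = ?", (doc_id,))
        index_db.execute(
            "INSERT INTO docs (doc_id, category, body) VALUES (?, ?, ?)",
            (doc_id, category, content),
        )

def writer_loop():
    while True:
        category, doc_id, content, user, comment = write_queue.get()
        try:
            _commit(category, doc_id, content, user, comment)
            _reindex(category, doc_id, content)
        finally:
            write_queue.task_done()

threading.Thread(target=writer_loop, daemon=True).start()

# callers never run git themselves; they just enqueue the write
write_queue.put(("guides", "doc-123", "title: hello\n", "alice", "fix typo"))
write_queue.join()  # wait until the commit and reindex have finished
```

The queue fixes the corruption problem, but it also means every write in the system funnels through one process, which is exactly the kind of thing a real database handles for you.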
sounds like overengineering to me. git gets messy with merge conflicts when multiple people edit docs at once - you’ll waste more time fixing conflicts than building features. plus git wasn’t built for this, so performance will tank as your repo grows. why not use postgres with jsonb columns? same flexibility, way fewer headaches.
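something like this covers everything except the history requirement (untested sketch, assuming psycopg2 - for change tracking you’d add a history table or an audit trigger on top):

```python
import psycopg2

# hypothetical connection string and table names, just to show the shape
conn = psycopg2.connect("dbname=docstore")

with conn, conn.cursor() as cur:
    # one table instead of folders full of YAML files
    cur.execute("""
        CREATE TABLE IF NOT EXISTS documents (
            id         text PRIMARY KEY,
            category   text NOT NULL,
            body       jsonb NOT NULL,
            updated_by text,
            updated_at timestamptz NOT NULL DEFAULT now()
        )
    """)
    # a GIN index makes "search by fields inside the document" an indexed query
    cur.execute(
        "CREATE INDEX IF NOT EXISTS documents_body_idx "
        "ON documents USING gin (body)"
    )

    # upsert a document, recording who touched it last
    cur.execute(
        "INSERT INTO documents (id, category, body, updated_by) "
        "VALUES (%s, %s, %s::jsonb, %s) "
        "ON CONFLICT (id) DO UPDATE SET body = EXCLUDED.body, "
        "updated_by = EXCLUDED.updated_by, updated_at = now()",
        ("doc-123", "guides", '{"title": "hello", "status": "published"}', "alice"),
    )

    # search by a field inside the document
    cur.execute(
        "SELECT id FROM documents WHERE body @> %s::jsonb",
        ('{"status": "published"}',),
    )
    print(cur.fetchall())
```

no write queue, no separate search index, and concurrent edits are just transactions.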