Optimizing Document Storage for RAG: Integrated vs. Separate Vector Databases?

Question

I'm implementing a RAG pipeline on AWS. Should document chunks be stored alongside their vectors in the vector database, or separately (for instance, in S3)? And how do you link the metadata between the two? I'm looking for scalable, cost-effective advice.

In my experience, storing document chunks separately from vectors in a service like S3 provides more flexibility and cost control over time. Keeping only the vectors in a dedicated database keeps the index small and queries fast, while allowing document content to be updated or reprocessed independently. Clear metadata linking, such as document IDs or other unique references stored with each vector entry, maintains a consistent relationship between the vector entries and their corresponding documents and reduces redundancy. This approach also eases scaling, since each component can be optimized separately as overall storage needs grow.
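To make the linking idea concrete, here is a minimal sketch, not a definitive implementation. It assumes two plain dictionaries standing in for S3 and the vector database, and the names `VectorRecord`, `index_chunk`, `retrieve`, and the `chunks/<doc_id>/<n>.txt` key layout are all made up for illustration. The point is only that the vector entry carries a reference to the chunk, not the chunk text itself.

```python
import math
from dataclasses import dataclass

# In-memory stand-ins (assumptions): a real setup would use S3 and a vector DB.
object_store = {}   # key -> chunk text
vector_index = {}   # chunk_id -> VectorRecord


@dataclass(frozen=True)
class VectorRecord:
    chunk_id: str      # unique reference for this chunk
    doc_id: str        # ID of the parent document
    s3_key: str        # where the chunk text lives
    embedding: tuple   # the vector itself


def index_chunk(doc_id, chunk_no, text, embedding):
    """Write the text to the object store and only a reference to the index."""
    chunk_id = f"{doc_id}#{chunk_no:04d}"
    s3_key = f"chunks/{doc_id}/{chunk_no:04d}.txt"  # hypothetical key layout
    object_store[s3_key] = text
    vector_index[chunk_id] = VectorRecord(chunk_id, doc_id, s3_key, tuple(embedding))
    return chunk_id


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_embedding, k=1):
    """Rank by similarity, then follow each record's key to fetch the text."""
    ranked = sorted(
        vector_index.values(),
        key=lambda r: cosine(query_embedding, r.embedding),
        reverse=True,
    )
    return [(r.chunk_id, object_store[r.s3_key]) for r in ranked[:k]]
```

With this layout, rewriting or reprocessing a chunk's text only touches the object store, as long as the key stays the same. The embedding still needs refreshing if the content changes meaningfully.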

I lean toward an integrated approach if scale isn't huge. Embedding the metadata in each record can simplify retrieval, though you might hit limits as the data grows. It's a tradeoff between manageability and cost optimization once you scale out into separate services.

Hey, I've been testing a hybrid solution that stores docs in S3 and vectors in a dedicated DB. It's cost-effective and quite responsive, though syncing the metadata can sometimes get out of step. How do you all handle real-time updates? Any neat hacks or experiences?
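One common way to keep the two stores in step is to keep a content hash next to each vector entry and re-embed only when the hash changes. The sketch below assumes that approach. `vector_metadata`, `sync_chunk`, `remove_orphans`, and the toy `fake_embed` function are hypothetical names, and the dictionary stands in for the vector database's metadata. On AWS, something like this could be triggered from S3 event notifications, but that wiring is not shown here.

```python
import hashlib

# Stand-in for the vector DB's metadata (assumption): chunk_id -> entry.
vector_metadata = {}


def fake_embed(text):
    # Placeholder only; a real system would call an embedding model here.
    return [float(len(text)), float(sum(map(ord, text)) % 97)]


def content_hash(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def sync_chunk(chunk_id, text, embed=fake_embed):
    """Re-embed a chunk only if its content hash differs from the stored one.

    Returns True when the vector entry was created or refreshed.
    """
    digest = content_hash(text)
    current = vector_metadata.get(chunk_id)
    if current is not None and current["sha256"] == digest:
        return False  # unchanged, so skip the embedding call
    vector_metadata[chunk_id] = {"sha256": digest, "embedding": embed(text)}
    return True


def remove_orphans(live_chunk_ids):
    """Drop vector entries whose chunks no longer exist in document storage."""
    stale = set(vector_metadata) - set(live_chunk_ids)
    for chunk_id in stale:
        del vector_metadata[chunk_id]
    return sorted(stale)
```

Because `sync_chunk` is idempotent, it is safe to call again when an update event is delivered more than once. A periodic `remove_orphans` pass covers deletions that were missed.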

Based on personal experience with AWS RAG systems, maintaining separate storage solutions for documents and vectors has proven advantageous. It allows the document storage in S3 to be scaled independently of the vector database, which is useful when document revisions or reprocessing are frequent. Using unique identifiers to link metadata reliably ensures that retrieval remains consistent even if the data source for the documents changes. This separation strategy tends to minimize impact on query performance while enhancing overall scalability and cost management.
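As a rough illustration of identifiers that survive a change of data source, the sketch below derives each chunk ID from a logical document ID and chunk number, never from a storage path, and keeps the physical location in a separate lookup. The namespace URL, the `locations` table, and the function names are assumptions made for this example.

```python
import uuid

# Hypothetical namespace; any fixed UUID works as long as it never changes.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.invalid/rag-chunks")

# Logical doc_id -> current storage prefix (assumption: kept in a small table).
locations = {}


def chunk_uuid(doc_id, chunk_no):
    """Deterministic ID: the same doc_id and chunk_no always give the same UUID."""
    return str(uuid.uuid5(NAMESPACE, f"{doc_id}/{chunk_no}"))


def chunk_uri(doc_id, chunk_no):
    """Resolve the chunk's current location at read time."""
    return f"{locations[doc_id]}/{chunk_no:04d}.txt"


# Usage: moving the document changes the lookup, not the vector entries.
locations["handbook"] = "s3://bucket-a/chunks/handbook"
before = chunk_uuid("handbook", 3)
locations["handbook"] = "s3://bucket-b/archive/handbook"
after = chunk_uuid("handbook", 3)
print(before == after, chunk_uri("handbook", 3))
```

The vector database then stores only the UUID and `doc_id`, so moving documents to another bucket or prefix does not require touching the vector entries.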