Hi there! I’m exploring the implementation of a RAG solution on AWS and would like to gain insights into the optimal strategies for storing document chunks. I would appreciate hearing your experiences and preferences on this topic.
More specifically, I’m interested in a few key areas:
- When it comes to document chunk storage, do you merge them with your vector data in the same database, or do you opt for a distinct storage option like S3 for the objects? If S3 is used, does that mean you have to retrieve chunks repeatedly?
- For those who use separate storage solutions, which options do you prefer? (Examples include S3, document-oriented databases, etc.)
- For users who combine storage solutions, which vector databases do you recommend that manage this effectively? I’m considering pgvector in PostgreSQL.
- How are metadata and the connection between vectors and their corresponding documents managed in your setup?
- Are there any challenges or valuable lessons you’ve encountered that you’d be willing to share?
I’m particularly keen on scalable and cost-efficient solutions, especially as my document collection expands.
In my experience, storing chunks in the same database as the vectors can streamline operations significantly, but it’s crucial to use a database that handles both well. The pgvector extension for PostgreSQL, for example, gives you simpler queries (chunk text, metadata, and similarity search in one statement) and cohesive metadata management. This approach also makes it easier to keep vectors and documents in sync, and it can perform better if the database is tuned for both workloads. That said, check the database’s scalability and throughput limits carefully so it still meets your requirements as your data grows.
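To make the single-table idea concrete, here is a rough sketch of what the schema and a similarity query could look like with pgvector. The table name `chunks`, its columns, and the 1536-dimension setting are assumptions for illustration, not a recommendation; the placeholders follow the `%s` style used by psycopg, and the helpers only build the parameters, they don't open a connection.

```python
from typing import Sequence, Tuple

# Assumption: 1536 dimensions; use whatever your embedding model produces.
EMBEDDING_DIM = 1536

# Hypothetical schema: chunk text, metadata, and vector live in one row,
# so there is no separate store to keep in sync.
SCHEMA_SQL = f"""
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE IF NOT EXISTS chunks (
    id        bigserial PRIMARY KEY,
    doc_id    text    NOT NULL,             -- link back to the source document
    chunk_no  integer NOT NULL,             -- position of the chunk in the document
    content   text    NOT NULL,             -- the chunk text itself
    metadata  jsonb   NOT NULL DEFAULT '{{}}',
    embedding vector({EMBEDDING_DIM}) NOT NULL,
    UNIQUE (doc_id, chunk_no)
);
"""

# <=> is pgvector's cosine-distance operator. The query vector is passed twice:
# once for the returned distance, once for the ORDER BY.
SEARCH_SQL = """
SELECT doc_id, chunk_no, content, metadata,
       embedding <=> %s::vector AS cosine_distance
FROM chunks
ORDER BY embedding <=> %s::vector
LIMIT %s;
"""


def to_vector_literal(values: Sequence[float]) -> str:
    """Format a list of floats as pgvector's text form, e.g. '[0.1,2.0]'."""
    return "[" + ",".join(repr(float(v)) for v in values) + "]"


def search_params(query_embedding: Sequence[float], k: int) -> Tuple[str, str, int]:
    """Build the parameter tuple that matches the placeholders in SEARCH_SQL."""
    literal = to_vector_literal(query_embedding)
    return (literal, literal, k)
```

With a psycopg cursor this would be used roughly as `cur.execute(SEARCH_SQL, search_params(query_embedding, 5))`, and each returned row already carries the chunk text and metadata, so no second lookup is needed.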
Keeping document storage separate, for example in S3, often scales better and gives you more flexibility. In high-traffic scenarios it can also improve performance, since chunk retrieval can be optimized independently of vector operations. You just need an efficient caching or preloading strategy to avoid fetching the same chunks repeatedly. Hope that helps!
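P.S. In case it's useful, the caching idea can be sketched as a small LRU cache in front of whatever function fetches a chunk. Everything here is illustrative: `ChunkCache` and `fake_fetch` are made-up names, and the fetcher is a stand-in so the snippet runs without AWS credentials. With boto3, the fetcher would typically wrap `get_object` and decode the body.

```python
from collections import OrderedDict
from typing import Callable, List


class ChunkCache:
    """Small LRU cache so hot chunks are not re-fetched from object storage."""

    def __init__(self, fetch: Callable[[str], str], max_items: int = 1024) -> None:
        self._fetch = fetch          # e.g. a function that reads one S3 object
        self._max_items = max_items
        self._items: "OrderedDict[str, str]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, key: str) -> str:
        if key in self._items:
            self._items.move_to_end(key)      # mark as most recently used
            self.hits += 1
            return self._items[key]
        self.misses += 1
        text = self._fetch(key)               # only hit storage on a miss
        self._items[key] = text
        if len(self._items) > self._max_items:
            self._items.popitem(last=False)   # evict the least recently used
        return text


# Stand-in fetcher (assumption). A real one might look like:
#   s3.get_object(Bucket="my-bucket", Key=key)["Body"].read().decode("utf-8")
# where "my-bucket" is a hypothetical bucket name.
calls: List[str] = []


def fake_fetch(key: str) -> str:
    calls.append(key)
    return f"text of {key}"


cache = ChunkCache(fake_fetch, max_items=2)
cache.get("doc1/0")   # miss, fetched
cache.get("doc1/0")   # hit, served from memory
cache.get("doc1/1")   # miss
cache.get("doc2/0")   # miss, evicts doc1/0
cache.get("doc1/0")   # miss again because it was evicted
```

The cache size is the main knob: too small and popular chunks keep getting evicted, too large and you are paying for memory instead of S3 requests.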
Hey, that’s an intriguing question! What factors make you lean towards integrating or separating document storage? I’ve heard people mention the ease of scalability with S3, but doesn’t that complicate the retrieval and sync process a bit? Any particular use case or challenge you’re dealing with right now?
Using a document-oriented database helps maintain structure within data chunks and metadata, enabling direct queries and more efficient retrieval. It works well for setups that need frequent data updates or transformations, but one downside is cost, so factor that in when choosing a storage solution.