How do price comparison websites organize products from different retailers?

I’m trying to understand how price comparison sites work on the frontend side. My main question is about how these platforms manage to show the same product from multiple stores on one page.

Here are my main questions:

  • Do these sites maintain their own product catalog with images and details, then use web scraping to find matching items across different retailers using product names or SKU numbers?

  • Or do they scrape retailer websites first and then use database queries to group similar items together without having a pre-built product database?

I’m working on a similar project and want to understand the best approach for matching products across different online stores. Any insights would be really helpful!

Interesting thread! Quick question though - what happens when retailers change their URLs or restructure their sites? Do these comparison platforms constantly update their scraping logic? Also, have you thought about using product APIs instead of scraping? They might be more reliable, even if fewer stores offer them.

Most price comparison sites I’ve worked with use both methods you mentioned. They start with a core product database—standardized info and images from manufacturer partnerships or data providers. Then they match products using UPCs and part numbers when scraping retailer sites. For products without clear identifiers, they use machine learning to parse titles and descriptions, plus human reviewers handle the tricky cases. The big challenge is data normalization since retailers describe the same product differently. You need standardized attributes to make comparisons work.
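To make the identifier-first, fuzzy-fallback matching concrete, here's a minimal sketch. The catalog, UPC values, and the 0.6 threshold are all hypothetical illustrations, not a real system's data; production systems would use a proper database and a trained model rather than `difflib`:

```python
from difflib import SequenceMatcher

# Hypothetical canonical catalog keyed by UPC (real systems would build
# this from manufacturer feeds or data providers, as described above).
catalog = {
    "012345678905": "Acme Wireless Mouse M100",
    "036000291452": "Acme Mechanical Keyboard K200",
}

def normalize(title: str) -> str:
    """Lowercase and collapse whitespace so retailer titles compare cleanly."""
    return " ".join(title.lower().split())

def match_listing(listing: dict, threshold: float = 0.6):
    """Match a scraped listing to the catalog: exact UPC first,
    then fuzzy title similarity as a fallback."""
    upc = listing.get("upc")
    if upc in catalog:
        return upc
    # Fallback for listings without a usable identifier:
    # fuzzy-match the normalized title against each catalog entry.
    best_upc, best_score = None, 0.0
    for cat_upc, cat_title in catalog.items():
        score = SequenceMatcher(
            None, normalize(listing["title"]), normalize(cat_title)
        ).ratio()
        if score > best_score:
            best_upc, best_score = cat_upc, score
    return best_upc if best_score >= threshold else None

# A listing with a UPC matches directly; one without falls back to titles.
print(match_listing({"upc": "012345678905", "title": "Mouse"}))
print(match_listing({"title": "ACME wireless  mouse M100 - Black"}))
```

The two-tier structure mirrors the reply above: cheap exact identifier lookups handle most cases, and the expensive fuzzy path only runs when identifiers are missing.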

From my experience building something similar, most sites scrape first and group products afterward. It's easier to build that way, even if it's less efficient. You can use fuzzy string matching on product titles and compare prices within groups. The tricky part is handling variants like different colors or sizes - they shouldn't be grouped together, but matching algorithms often conflate them.
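A rough sketch of that scrape-then-group approach, including the variant pitfall mentioned above. The color list, sample titles, and 0.8 threshold are made up for illustration; the idea is to strip variant tokens out before fuzzy-matching, so different colors never merge even when their titles are nearly identical:

```python
from difflib import SequenceMatcher

# Hypothetical variant vocabulary; a real system would extract colors,
# sizes, capacities, etc. from structured attributes where possible.
COLORS = {"black", "white", "red", "blue"}

def split_variant(title: str):
    """Separate variant tokens (here: colors) from the base title so
    'Phone X Black' and 'Phone X White' land in different groups."""
    tokens = title.lower().split()
    base = " ".join(t for t in tokens if t not in COLORS)
    variant = tuple(sorted(t for t in tokens if t in COLORS))
    return base, variant

def group_listings(titles, threshold=0.8):
    """Greedy grouping: a title joins the first group whose base title
    is similar enough AND whose variant tokens match exactly."""
    groups = []  # list of (base, variant, member_titles)
    for title in titles:
        base, variant = split_variant(title)
        for g_base, g_variant, members in groups:
            if variant == g_variant and \
               SequenceMatcher(None, base, g_base).ratio() >= threshold:
                members.append(title)
                break
        else:
            groups.append((base, variant, [title]))
    return [members for _, _, members in groups]

listings = [
    "Acme Phone X 128GB Black",
    "ACME Phone X 128 GB Black",
    "Acme Phone X 128GB White",
]
print(group_listings(listings))
```

Without the variant split, the white phone would score well above the threshold against the black ones and get merged in - which is exactly the failure mode described above.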