Sunday, October 9, 2022

Good structure / design pattern for scraping multiple pages?

In my new project I want to scrape and aggregate different data from multiple websites.

We are talking about a number of specific, previously known websites, each of which can provide up to N attributes. Think of the same product listed on different online stores. The input is always a link, and the result should be the scraped data. I have already implemented the scraping itself, and it works for the different websites, so this question is only about the structure.

Now for my questions:

  1. Do you have any tips on how to structure the basic "framework" in the most sustainable way, so it stays low-maintenance and extensible? My first thought was to have the different scrapers inherit, OOP-style, from a general abstract scraper, so they can all run together in the main script. However, I am not quite sure about this yet. I would appreciate any ideas concerning the design.
  2. Since some pages provide less info, I thought of saving all the information I can find into one big dict. Does this idea make sense, or should I define a separate class?
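A minimal sketch of what questions 1 and 2 could look like together: an abstract base scraper that each site-specific scraper subclasses, plus a dataclass as the shared result type (its optional fields handle sites that expose fewer attributes, with a catch-all dict for site-specific extras). All names here (`Product`, `ShopAScraper`, `shop-a.example`) are hypothetical placeholders, not part of the original post:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Optional
from urllib.parse import urlparse


@dataclass
class Product:
    """Shared result type; optional fields cover sites with fewer attributes."""
    url: str
    name: Optional[str] = None
    price: Optional[float] = None
    extra: dict = field(default_factory=dict)  # catch-all for site-specific data


class Scraper(ABC):
    """Abstract base: one subclass per site."""
    # domains this scraper handles, e.g. {"shop-a.example"}
    domains: set = set()

    @abstractmethod
    def scrape(self, url: str) -> Product:
        """Fetch and parse one page; implemented per site."""


class ShopAScraper(Scraper):
    domains = {"shop-a.example"}

    def scrape(self, url: str) -> Product:
        # a real implementation would use Requests/BS4/Selenium here
        return Product(url=url, name="demo item", price=9.99)


def build_registry(scrapers):
    """Map each domain to the scraper instance that handles it."""
    return {d: s for s in scrapers for d in s.domains}


def scrape_url(url: str, registry) -> Product:
    """Dispatch a link to the right scraper based on its domain."""
    domain = urlparse(url).netloc
    if domain not in registry:
        raise ValueError(f"no scraper registered for {domain}")
    return registry[domain].scrape(url)


registry = build_registry([ShopAScraper()])
result = scrape_url("https://shop-a.example/item/1", registry)
print(result.name)  # -> demo item
```

With this layout, supporting a new site means adding one subclass and listing it in the registry; the main script only ever calls `scrape_url`. The dataclass gives you the "one big dict" flexibility (via `extra`) while keeping the common attributes typed and discoverable.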

Additional information (if important):

  • It's only about the structure and the main script; the scraping itself already works.
  • There is not much data involved (n < 500), so I don't expect problems with IP bans. However, I would like to design it as "smart" as possible.
  • For scraping I use BS4, Requests, Selenium, and AJAX requests.
  • It is primarily about targeted data retrieval, not complicated crawling.

Thanks in advance!
