In my new project I want to scrape and aggregate data from multiple websites.
We are talking about a fixed set of known websites, each of which can provide up to N attributes (comparable to the same product listed on different online stores). The input is always a link, and the result should be the scraped data. I have already implemented the scraping itself and it works for the different websites, so this is only about the structure.
Now for my questions:
- Do you have any tips on how to structure the basic "framework" in a sustainable way so that it stays low-maintenance and extensible? My first thought was to have the different scrapers inherit, OOP-style, from a general abstract scraper so that they can all run together in the main script. However, I am not quite sure about this yet and would appreciate any ideas on the design.
- Since some pages provide less information, I thought about saving all the information that can be found into one big dict. Does this idea make sense, or should I define a separate class?
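To make the first question concrete, here is a minimal sketch of the abstract-base-class approach. All names (`BaseScraper`, `ShopAScraper`, `scrape_url`, the example domain) are hypothetical; the idea is that each site-specific scraper implements the same small interface, and the main script only talks to that interface:

```python
from abc import ABC, abstractmethod


class BaseScraper(ABC):
    """Common interface that every site-specific scraper implements."""

    @abstractmethod
    def can_handle(self, url: str) -> bool:
        """Return True if this scraper is responsible for the given URL."""

    @abstractmethod
    def scrape(self, url: str) -> dict:
        """Fetch the page and return the extracted attributes."""


class ShopAScraper(BaseScraper):
    """Hypothetical scraper for one specific store."""

    def can_handle(self, url: str) -> bool:
        return "shop-a.example.com" in url

    def scrape(self, url: str) -> dict:
        # A real implementation would use Requests/BS4 or Selenium here.
        return {"url": url, "price": 9.99}


def scrape_url(url: str, scrapers: list[BaseScraper]) -> dict:
    """Dispatch a URL to the first scraper that claims it."""
    for scraper in scrapers:
        if scraper.can_handle(url):
            return scraper.scrape(url)
    raise ValueError(f"No scraper registered for {url}")
```

With this shape, adding a new website only means writing one new subclass and appending it to the list passed to `scrape_url`; the main script never changes.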
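For the second question, a middle ground between "one big dict" and a hand-rolled class is a dataclass with optional fields, so sparse pages simply leave attributes as `None`. The field names (`name`, `price`, `rating`) and the merge helper below are illustrative assumptions, not part of the original project:

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class Product:
    """One record per scraped page; attributes a site lacks stay None."""

    url: str
    name: Optional[str] = None
    price: Optional[float] = None
    rating: Optional[float] = None

    def merged_with(self, other: "Product") -> "Product":
        """Fill this record's gaps with values found on another site."""
        merged = {
            key: value if value is not None else getattr(other, key)
            for key, value in asdict(self).items()
        }
        return Product(**merged)
```

Compared with a plain dict, this gives you typo-safe attribute access, default values for missing data, and a natural place to put aggregation logic such as the merge above.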
Additional information (if important):
- It's only about the structure and the main script; the scraping itself is already working.
- There is not much data involved (n < 500), so I don't expect problems with IP bans. However, I would like to design it as "smart" as possible.
- For scraping I use BS4, Requests, Selenium, and direct AJAX/XHR requests.
- It is primarily about targeted data retrieval, no complicated crawling.
Thanks in advance!