I am working on a scraper that is growing bigger and bigger and I'm worried about making the wrong design choices.
I have never written more than short scripts in Python, and I'm at a loss as to how to design a larger project.
The scraper retrieves data from different but similarly themed websites, so a separate implementation is needed for each site.
The desired raw text of each website is then put through a parser which searches for the required values.
Once the values are retrieved, they should be stored in a database in third normal form (3NF).
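To make the parse-and-store step concrete, here is a rough sketch of what I mean. It assumes SQLite via the standard `sqlite3` module and a made-up `price` field; the regex, the table names (`site`, `reading`) and the function names are all hypothetical placeholders, not what my real code looks like.

```python
import re
import sqlite3

# Hypothetical pattern: the real parser would search for whatever values are required.
PRICE_RE = re.compile(r"price:\s*(\d+(?:\.\d+)?)", re.IGNORECASE)


def parse(raw_text):
    """Extract the required values from the raw site text, or None if absent."""
    match = PRICE_RE.search(raw_text)
    return {"price": float(match.group(1))} if match else None


def init_db(conn):
    """Create two normalized tables: one row per site, many readings per site."""
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS site (
            id   INTEGER PRIMARY KEY,
            name TEXT UNIQUE NOT NULL
        );
        CREATE TABLE IF NOT EXISTS reading (
            id      INTEGER PRIMARY KEY,
            site_id INTEGER NOT NULL REFERENCES site(id),
            price   REAL NOT NULL
        );
        """
    )


def store(conn, site_name, values):
    """Insert the site if it is new, then attach the parsed values to it."""
    conn.execute("INSERT OR IGNORE INTO site (name) VALUES (?)", (site_name,))
    site_id = conn.execute(
        "SELECT id FROM site WHERE name = ?", (site_name,)
    ).fetchone()[0]
    conn.execute(
        "INSERT INTO reading (site_id, price) VALUES (?, ?)",
        (site_id, values["price"]),
    )
    conn.commit()
```

The point of the sketch is only that parsing and storing do not need to know which site the text came from, so they could be shared by all site implementations.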
In its final form, the scraper should run on a cloud service and periodically check all the different sites for new data. Speed and performance are not the top priority, but they would be nice to have. Most importantly, the required data should be retrieved without unnecessary duplication of code.
I'm using the Selenium WebDriver and have implemented the driver object as a singleton, so all requests go through the same driver object. The website text is then part of the state of that object.
All the other functionality is currently modelled as functions, with everything in one file. To add another website to the project, I first copied the script and changed only the retrieval part. It soon occurred to me that this approach does not scale, so I wanted to ask for design recommendations.
Would you implement a Retriever base class and inherit from it for every website, or is there an even better way to go?
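To make the question concrete, this is roughly the structure I have in mind. It is only a sketch: `Retriever`, `ExampleSiteRetriever`, `run` and the URL are names I made up for illustration, and the driver is assumed to be any object with a `get(url)` method and a `page_source` attribute, as a Selenium WebDriver has. Here the driver is passed in rather than fetched from a singleton, which is an assumption about how it could be wired up.

```python
from abc import ABC, abstractmethod


class Retriever(ABC):
    """Base class: the shared workflow lives here, site-specific retrieval does not."""

    def __init__(self, driver):
        # Assumption: the driver is injected, so every retriever can share one instance.
        self.driver = driver

    @abstractmethod
    def retrieve(self) -> str:
        """Return the raw text of interest for one website."""

    def run(self, parse, store):
        """Retrieve, parse, and store; identical for every site."""
        raw = self.retrieve()
        values = parse(raw)
        store(values)
        return values


class ExampleSiteRetriever(Retriever):
    """Hypothetical subclass: only the retrieval part differs per site."""

    url = "https://example.com/listing"  # placeholder URL

    def retrieve(self) -> str:
        self.driver.get(self.url)
        return self.driver.page_source
```

Adding a new website would then mean writing one small subclass with its own `retrieve`, while `parse` and `store` stay shared.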
Many thanks for any ideas!