Tuesday, August 18, 2020

How can I decouple the two components of my Python application?

I'm trying to learn Python development, and I have been reading about architectural patterns and code design because I want to stop hacking my code together. I am currently implementing an application, and I know its structure is problematic, as you'll see, but I don't know how to improve it.

I'm implementing a web crawler that writes its results to a MongoDB instance.

This is my general structure:

Spiders

crawlers.py
connections.py
utils.py
__init__.py

crawlers.py implements a base class, Crawler, and each specific crawler inherits from it. Each Crawler has an attribute, table_name, and a method, crawl. In connections.py, I implemented a pymongo driver to connect to the DB. Its write method expects a crawler as a parameter. Now here comes the tricky part: the second crawler depends on the results of the first, so I end up with something like this:

from pymongo import InsertOne

class Crawler1(Crawler):
    def __init__(self):
        super().__init__('Crawler 1', 'table_A')

    def crawl(self):
        operations = []  # list of InsertOne, built while crawling
        return operations

class Crawler2(Crawler):
    def __init__(self):
        super().__init__('Crawler 1', 'table_A')

    def crawl(self, list_of_codes):
        operations = []  # list of InsertOne, built after crawling the list of codes/links
        return operations
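The Crawler base class itself is not shown. A minimal sketch consistent with the description above might look like the following. The `name` attribute, the use of `abc`, and the `DemoCrawler` subclass are assumptions made for illustration, and plain dicts stand in for InsertOne so the snippet runs without pymongo.

```python
from abc import ABC, abstractmethod


class Crawler(ABC):
    """Base class sketch: every crawler has a table_name and a crawl method."""

    def __init__(self, name, table_name):
        self.name = name              # assumed attribute, e.g. 'Crawler 1'
        self.table_name = table_name  # collection the results are written to

    @abstractmethod
    def crawl(self, *args):
        """Return a list of write operations (InsertOne in the real code)."""


class DemoCrawler(Crawler):
    # Hypothetical subclass, only here to show the contract being used.
    def __init__(self):
        super().__init__('Demo', 'table_demo')

    def crawl(self, *args):
        # Plain dicts stand in for InsertOne operations.
        return [{'code': 1}, {'code': 2}]
```

Making crawl abstract means a subclass that forgets to implement it fails at instantiation time rather than when the driver calls it.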

Then, in connections.py, I create a class whose write method expects a crawler.

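from pymongo import MongoClient

class MongoDriver:
    def __init__(self):
        self.db = MongoClient(...)

    def write(self, crawler, *crawl_args, **kwargs):
        # Positional arguments are forwarded to crawl(), keyword arguments to bulk_write().
        operations = crawler.crawl(*crawl_args)
        self.db[crawler.table_name].bulk_write(operations, **kwargs)

    def get_list_of_codes(self):
        query = {}
        # Assumption: the codes are read from the collection Crawler1 writes to.
        return [x['field'] for x in self.db['table_A'].find(query)]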

And here comes the biggest problem (I think there are many others, some of which I can barely grasp and others I'm still totally blind to): my connection code needs to know about the crawlers. For example:

mongo_driver = MongoDriver()
crawler1 = Crawler1()
crawler2 = Crawler2()
mongo_driver.write(crawler1)
mongo_driver.write(crawler2, mongo_driver.get_list_of_codes())

How would one go about solving this? And what else is particularly worrisome about this design? Thanks for the feedback!
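One direction that might help, offered as a sketch under assumptions rather than a definitive answer: let the storage class accept only a table name and a list of documents, and move the "second crawler needs the first crawler's output" knowledge into a small orchestration function. Every name below (`InMemoryStore`, `CodeCrawler`, `DetailCrawler`, `run_pipeline`, the `code` field, `table_B`) is hypothetical, and an in-memory dict replaces MongoDB so the example runs on its own.

```python
class InMemoryStore:
    """Hypothetical stand-in for the Mongo driver; it knows nothing about crawlers."""

    def __init__(self):
        self.tables = {}

    def write(self, table_name, documents):
        self.tables.setdefault(table_name, []).extend(documents)

    def read_field(self, table_name, field):
        return [doc[field] for doc in self.tables.get(table_name, [])]


class CodeCrawler:
    table_name = 'table_A'

    def crawl(self):
        # A real crawler would fetch pages here; these values are made up.
        return [{'code': 'a1'}, {'code': 'b2'}]


class DetailCrawler:
    table_name = 'table_B'

    def crawl(self, codes):
        # Takes its input as plain data, not as a driver or another crawler.
        return [{'code': code, 'detail': 'detail for ' + code} for code in codes]


def run_pipeline(store):
    """Only this function knows the order and the data flow between crawlers."""
    first, second = CodeCrawler(), DetailCrawler()
    store.write(first.table_name, first.crawl())
    codes = store.read_field(first.table_name, 'code')
    store.write(second.table_name, second.crawl(codes))


store = InMemoryStore()
run_pipeline(store)
print(store.tables['table_B'])
```

With pymongo, the body of write would presumably become a bulk_write call on the matching collection. The crawlers and the store can then be tested separately, and swapping the storage backend touches only one class.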
