Sunday, December 26, 2021

How To Approach Server OOP Design?

I am developing a program that processes scraped data from several websites. We will call it the "main server".

The scraping itself is done by a number of separate programs distributed across different remote computers. Each scraping program sends its data to my main server program over a socket. A single website has multiple scraping programs pulling data from it (each Scraper is responsible for its own portion of the website).
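To make the data flow concrete, each scraper holds one TCP connection to the main server and writes its packets as JSON; the scraper side looks roughly like this (host and port are placeholders):

import json
import socket

# Rough sketch of the scraper side: one persistent TCP connection to the
# main server, over which JSON packets are sent (host and port are placeholders)
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.connect(('main-server.example.com', 9000))

def send_packet(packet):
    sock.sendall(json.dumps(packet).encode())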

I have a few classes in mind for the main server, the relevant ones are Bot, Server, Website, and Scraper.

Bot is the "program" class; it contains the executive data at the highest level of abstraction:

class Bot:
    def __init__(self):
        self.websites = dict()

    def add_website(self, name):
        self.websites[name] = Website(name)

    def process_data(self):
        for website in self.websites.values():
            ...

Website is self-explanatory:

class Website:
    def __init__(self, name):
        self.name = name
        self.pages = dict()
        self.scrapers = list()

    def add_page(self, pageDescription):
        ...

Server is where things start to get a bit shaky:

import json
import socket
from _thread import start_new_thread

class Server:
    def __init__(self, addr, scraperPort, placerPort):
        self.addr = addr
        self.scraperPort = scraperPort
        self.placerPort = placerPort
        self.socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

    def handle_requests(self, c):
        while True:
            try:
                data = c.recv(65535)
            except (ConnectionAbortedError, ConnectionResetError):
                return
            decoded = data.decode()
            try:
                requestJson = json.loads(decoded)
            except json.JSONDecodeError:
                continue  # malformed packet, skip it
            if requestJson['type'] == 'hello':
                requestJson['socket'] = c

            packetQueue.append(requestJson)

    def start(self):
        # bind to the scraper-facing port and accept connections forever
        self.socket.bind((self.addr, self.scraperPort))
        self.socket.listen(1000)
        while True:
            (client, address) = self.socket.accept()
            # handle each scraper connection on its own thread
            start_new_thread(self.handle_requests, (client,))

As you can see, Server is responsible for accepting socket connections from all scrapers. Once it accepts a connection, the first thing it receives is a "hello" packet, which contains the Website the scraper belongs to as well as a unique identifier. The goal is to "attach" that socket to the proper Website/Scraper object, so that further communication over this socket is routed to that Website.
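For reference, the hello packet is just a small JSON object along these lines (example values):

{
    "type": "hello",
    "website": "example-website",
    "guid": "scraper-7f3a"
}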

However, as it stands, Server has no awareness of the dictionary of Website objects. Even if I made Server a member of the high-level Bot object, Server itself would still have no awareness of those objects from inside handle_requests(). Right now, as you can see on the last line of handle_requests(), I am adding each packet to a queue, and each Scraper polls this queue:

class Scraper:
    def poll_packet_queue(self):
        packet = packetQueue.peek()
        if packet is None:
            return
        if packet['website'] == self.website and packet['guid'] == self.guid:
            # the packet was designated for this website, and this scraper's portion of the website data
            ...
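For reference, packetQueue is a shared, module-level structure visible to both the Server's handler threads and the Scraper objects; conceptually it is just a thread-safe queue with append() and peek(), something like this (simplified):

import collections
import threading

# Simplified thread-safe queue shared by the Server threads and the Scrapers;
# peek() returns None when the queue is empty
class PacketQueue:
    def __init__(self):
        self._items = collections.deque()
        self._lock = threading.Lock()

    def append(self, packet):
        with self._lock:
            self._items.append(packet)

    def peek(self):
        with self._lock:
            return self._items[0] if self._items else None

packetQueue = PacketQueue()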

This feels messy and wrong. The first issue is that when one of the sockets (scrapers) accepted by Server throws a connection error, it is important that I clear the website data corresponding to that scraper's socket. However, since Server's handle_requests() has no direct awareness of, or communication with, the Scraper/Website objects besides the packetQueue, it is impossible to know which scraper's data to clear when the socket throws the error. The only way to make this work with the current design is to keep track of the unique identifier and website associated with the socket INSIDE the server object, using local variables in handle_requests(). On disconnect, I would create a disconnectPacket from the locally stored identifier and website and push it onto the packetQueue, where it would eventually be picked up by the scraper object polling the queue:

    def handle_requests(self, c):
        website = None
        identifier = None

        while True:
            try:
                data = c.recv(65535)
            except (ConnectionAbortedError, ConnectionResetError):
                # build a disconnect packet from the locally stored identity
                disconnectPacket = dict()
                disconnectPacket['type'] = 'disconnect'  # so the Scraper can tell it apart from data packets
                disconnectPacket['guid'] = identifier
                disconnectPacket['website'] = website
                packetQueue.append(disconnectPacket)
                return
            decoded = data.decode()
            try:
                requestJson = json.loads(decoded)
            except json.JSONDecodeError:
                continue  # malformed packet, skip it
            if requestJson['type'] == 'hello':
                requestJson['socket'] = c
                identifier = requestJson['guid']
                website = requestJson['website']

            packetQueue.append(requestJson)
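
The Scraper polling the queue would then have to recognize that packet and clear its own data; roughly something like this (clear_website_data() is a stand-in for whatever cleanup ends up being needed):

class Scraper:
    def poll_packet_queue(self):
        packet = packetQueue.peek()
        if packet is None:
            return
        if packet['website'] == self.website and packet['guid'] == self.guid:
            if packet.get('type') == 'disconnect':
                # the server-side socket for this scraper died -> drop its data
                self.clear_website_data()  # stand-in for the actual cleanup
            else:
                ...  # normal data packet handling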

But again, this feels messy. Is there a better way to design this program? If so, is there a more fundamental rule behind it that I can apply to similar situations in the future?
