I am developing a program that processes scraped data from several websites. We will call it the "main server".
The scraping itself happens in a series of other programs distributed across different remote computers. Each scraping program sends its data to my main server program over a socket. A single website has multiple scraping programs pulling data from it (each Scraper is responsible for its own portion of the website).
I have a few classes in mind for the main server; the relevant ones are Bot, Server, Website, and Scraper.
Bot is the "program" class. It contains the executive data at the highest level of abstraction:
class Bot:
    def __init__(self):
        # maps a website name to its Website object
        self.websites = dict()

    def add_website(self, name):
        self.websites[name] = Website(name)

    def process_data(self):
        for website in self.websites.values():
            ...
Website is self-explanatory:
class Website:
    def __init__(self, name):
        self.name = name
        self.pages = dict()
        self.scrapers = list()

    def add_page(self, pageDescription):
        ...
Server is where things start to get a bit shaky:
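import json
import socket
from _thread import start_new_thread

class Server:
    def __init__(self, addr, scraperPort, placerPort):
        self.addr = addr
        self.socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

    def handle_requests(self, c):
        while True:
            try:
                data = c.recv(65535)
            except (ConnectionAbortedError, ConnectionResetError):
                return
            if not data:
                # an empty read means the peer closed the connection
                return
            decoded = data.decode()
            try:
                # parse the JSON text into a dict
                requestJson = json.loads(decoded)
            except json.JSONDecodeError:
                # error: skip the malformed packet
                continue
            if requestJson['type'] == 'hello':
                # remember which socket the hello packet arrived on
                requestJson['socket'] = c
            # packetQueue is a queue shared with the Scraper objects
            packetQueue.append(requestJson)

    def start(self):
        # PORT is defined elsewhere; bind() takes a single (host, port) tuple
        self.socket.bind(('localhost', PORT))
        self.socket.listen(1000)
        while True:
            (client, address) = self.socket.accept()
            # the thread arguments must be a tuple, hence the trailing comma
            start_new_thread(self.handle_requests, (client,))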
As you can see, Server is responsible for accepting socket connections from all scrapers. Once it accepts a connection, it will receive a "hello" packet, which contains the Website the scraper is for as well as a unique identifier. The goal is to "attach" that socket to the proper Website/Scraper object, so that further communication over this socket is routed to that Website.
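To make the handshake concrete, here is a minimal sketch of what a "hello" packet could look like on the wire. The keys type, website, and guid are the ones used in the code in this post; the values and the helper names (encode_packet, decode_packet, routing_key) are made up purely for illustration.

```python
import json

# Hypothetical example values; only the keys come from the code in this post.
hello_packet = {"type": "hello", "website": "example.com", "guid": "scraper-01"}

def encode_packet(packet):
    """Serialize a packet dict to the bytes a scraper would send."""
    return json.dumps(packet).encode()

def decode_packet(data):
    """Parse bytes read from a socket back into a packet dict."""
    return json.loads(data.decode())

def routing_key(packet):
    """Return the (website, guid) pair that identifies the target Scraper."""
    return (packet["website"], packet["guid"])

wire_bytes = encode_packet(hello_packet)
received = decode_packet(wire_bytes)
```

The (website, guid) pair is the piece of information that would be needed to find the right Website/Scraper object for a given socket.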
However, as it stands, Server has no awareness of the dictionary of Website objects. Even if I made Server a member of the high-level Bot object, Server itself would still have no awareness of these objects from inside handle_requests(). As of right now, you can see that on the last line of handle_requests() I am adding each packet to a queue. Each Scraper is polling this queue:
class Scraper:
    def poll_packet_queue(self):
        if not packetQueue:
            return
        # peek at the oldest packet without removing it from the queue
        packet = packetQueue[0]
        if packet['website'] == self.website and packet['guid'] == self.guid:
            # the packet was designated for this website, and this scraper's portion of the website data
            ...
This feels messy and wrong. The first issue is that when one of the sockets (scrapers) accepted by Server throws a connection error, it is important that I clear the website data corresponding to that scraper's socket. However, since Server's handle_requests() has no direct awareness of, or communication with, the Scraper/Website objects besides the packetQueue, it is impossible to know which scraper's data to clear when the socket throws the error. The only way this would be possible with the current design is to keep track of the unique identifier and website associated with the socket INSIDE the server object using local variables in handle_requests(), and on disconnect, create a local disconnectPacket with the locally stored identifier and website, which would eventually be picked up by the scraper object polling the packetQueue:
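def handle_requests(self, c):
    # website and identifier of the scraper on this socket, learned from its hello packet
    website = None
    identifier = None
    while True:
        try:
            data = c.recv(65535)
        except (ConnectionAbortedError, ConnectionResetError):
            data = b''
        if not data:
            # connection lost (an empty read means the peer closed it):
            # queue a disconnect packet for the matching scraper to pick up
            disconnectPacket = dict()
            disconnectPacket['guid'] = identifier
            disconnectPacket['website'] = website
            packetQueue.append(disconnectPacket)
            return
        decoded = data.decode()
        try:
            # parse the JSON text into a dict
            requestJson = json.loads(decoded)
        except json.JSONDecodeError:
            # error: skip the malformed packet
            continue
        if requestJson['type'] == 'hello':
            requestJson['socket'] = c
            identifier = requestJson['guid']
            website = requestJson['website']
        packetQueue.append(requestJson)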
But again, this feels messy. Is there a better way to design this program? If so, is there a more fundamental rule behind it that I can apply to similar situations in the future?