I'm working on a small application that is supposed to scrape/parse a few websites, and I'm wondering about the best way to structure it (keeping DRY/SOLID in mind).
Here's some pseudocode:
```ruby
class ScraperScheduler
  def perform
    ScraperWorker.perform_async(ParserTypeOne.new)
    ScraperWorker.perform_async(ParserTypeTwo.new)
    ScraperWorker.perform_async(ParserTypeThree.new)
    ScraperWorker.perform_async(ParserTypeFour.new)
  end
end

class ScraperWorker
  def initialize(scraper)
    @scraper = scraper
  end

  def perform
    html = RestClient.get(@scraper.url)
    @scraper.perform_async(html)
  end
end

class ParserTypeOne
  def perform(html)
    # parse page with Nokogiri
    page = Nokogiri::HTML(html)
    parsed_objects.each do |o|
      PersistToDB.perform(o)
    end
  end
end

class PersistToDB
  def perform(o)
    # split o into several ActiveRecord objects
    # check if unique and save to DB
  end
end
```
The `ScraperScheduler` class is basically just a cron job that sidekiq-scheduler will run once a day. The `perform` methods are there so I can turn everything into Sidekiq jobs, but I don't think that's necessary for every one of these classes. Some questions/concerns I have:
- `ScraperWorker` basically only performs the HTTP request, yet in my example it knows about the `url` and `perform_async` methods of the parser. Is there a way to do this in a more loosely coupled way?
- The `ParserTypeOne` job should just be extracting the data from the HTML with Nokogiri. Is it too tightly coupled to `PersistToDB`? How can I call `PersistToDB` differently?
- Any other suggestions?
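To make the second concern concrete, here's a minimal sketch (plain Ruby, with Nokogiri and the database stubbed out, and the regex extraction and hash shape made up for illustration) of a parser that only returns extracted objects, leaving persistence to the caller:

```ruby
class ParserTypeOne
  # Pure extraction: takes HTML, returns plain hashes. It has no
  # knowledge of PersistToDB, so persistence can change without
  # touching any parser class.
  def perform(html)
    # Stand-in for Nokogiri extraction; real code would use
    # Nokogiri::HTML(html).css(...) here.
    html.scan(/\d+/).map { |n| { value: n.to_i } }
  end
end

class PersistToDB
  # Collects what would be saved; a real version would build
  # ActiveRecord objects and check uniqueness before saving.
  def self.saved
    @saved ||= []
  end

  def self.perform(o)
    saved << o unless saved.include?(o) # crude uniqueness check
  end
end

# The caller wires the two together:
ParserTypeOne.new.perform("<li>1</li><li>2</li>").each { |o| PersistToDB.perform(o) }
```

With this split, the parser can be unit-tested against fixture HTML without any database, and `PersistToDB` can be swapped out or batched independently.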
I know this would work fine as-is; I'm just interested in ideas on how to improve it. Suggestions?
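One direction I've considered for the first concern, sketched here as plain Ruby: Sidekiq serializes job arguments as JSON, so passing live parser objects to `perform_async` wouldn't round-trip anyway. Instead, the worker could receive only a class name and talk to parsers through a single uniform interface (`url` plus `call`). The fetcher lambda, `call` method name, and example URL are assumptions made so the sketch runs without Sidekiq, RestClient, or Nokogiri:

```ruby
class ScraperWorker
  # In the real app this would `include Sidekiq::Worker` and use
  # RestClient.get; the HTTP fetch is injectable here so the sketch
  # stays self-contained.
  def perform(parser_class_name, fetcher: ->(url) { "<html>#{url}</html>" })
    parser = Object.const_get(parser_class_name).new
    html   = fetcher.call(parser.url)
    parser.call(html) # the only parser method the worker relies on
  end
end

class ParserTypeOne
  def url
    "http://example.com/one" # assumed URL, for illustration only
  end

  # Returns parsed objects instead of enqueuing or persisting anything,
  # so the worker never needs to know about perform_async on the parser.
  def call(html)
    [html.length] # stand-in for Nokogiri extraction
  end
end

ScraperWorker.new.perform("ParserTypeOne")
```

The worker now depends only on a duck type (`url`/`call`), and the scheduler would enqueue plain strings like `ScraperWorker.perform_async("ParserTypeOne")`, which is also friendlier to Sidekiq's retry/serialization model.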