Tuesday, May 1, 2018

Organising code design-wise for a web scraper

I'm working on a small application that is supposed to scrape/parse a few websites, and I'm wondering what the best way to achieve this would be (keeping DRY/SOLID in mind).

Here's some pseudocode:

class ScraperScheduler
  def perform
    ScraperWorker.perform_async(ParserTypeOne.new)
    ScraperWorker.perform_async(ParserTypeTwo.new)
    ScraperWorker.perform_async(ParserTypeThree.new)
    ScraperWorker.perform_async(ParserTypeFour.new)
  end
end

class ScraperWorker
  def initialize(scraper)
    @scraper = scraper
  end

  def perform
    # fetch the raw HTML, then hand it off to the parser job
    html = RestClient.get(@scraper.url)
    @scraper.perform_async(html)
  end
end

class ParserTypeOne
  def perform(html)
    # parse the page with Nokogiri
    page = Nokogiri::HTML(html)

    # parsed_objects stands in for whatever the extraction step returns
    parsed_objects.each do |o|
      PersistToDB.perform(o)
    end
  end
end

class PersistToDB
  def self.perform(o)
    # split o into several ActiveRecord objects
    # check for uniqueness and save to the db
  end
end
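
For the uniqueness check I'm currently leaning on ActiveRecord's find_or_create_by; the body of perform might look roughly like this (Item, external_id and title are hypothetical stand-ins for my actual model and attributes):

# inside PersistToDB.perform -- Item and its attributes are hypothetical
Item.find_or_create_by(external_id: o[:external_id]) do |item|
  # the block only runs when no record with this external_id exists yet
  item.title = o[:title]
end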

The ScraperScheduler class is basically just a cron job that sidekiq-scheduler will trigger once a day. The perform methods are there so I can make Sidekiq jobs out of everything, though I don't think that is necessary for every one of these classes.
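
One wrinkle I'm already aware of: Sidekiq serializes job arguments to JSON, so the scheduler can't really hand parser instances to perform_async the way the pseudocode above does. Roughly, I picture passing class names as strings instead, something like this sketch:

require 'sidekiq'

class ScraperScheduler
  include Sidekiq::Worker

  PARSERS = %w[ParserTypeOne ParserTypeTwo ParserTypeThree ParserTypeFour].freeze

  def perform
    # class names travel through Redis as plain strings
    PARSERS.each { |name| ScraperWorker.perform_async(name) }
  end
end

Some questions/concerns I have: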

  1. ScraperWorker basically only performs the HTTP request, yet in my example it knows about the url and perform_async properties of the parser. Is there a way to do this in a more loosely coupled way? (A rough sketch of one idea follows this list.)
  2. The ParserTypeOne job should just be extracting the data from the HTML with Nokogiri. Is it too closely coupled to PersistToDB? How can I call PersistToDB differently? (Also addressed in the sketch below.)
  3. Any other suggestions?
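
Here is the rough sketch I mentioned for questions 1 and 2: the worker receives the parser's class name and URL as plain arguments, so it knows nothing about the parser beyond #perform, and the persister is injected into the parser instead of being hard-coded. extract_objects and the CSS selectors are hypothetical placeholders:

require 'sidekiq'
require 'rest-client'
require 'nokogiri'

class ScraperWorker
  include Sidekiq::Worker

  def perform(parser_name, url)
    html = RestClient.get(url)
    parser = Object.const_get(parser_name).new
    parser.perform(html)
  end
end

class ParserTypeOne
  # anything that responds to #perform(o) can act as the persister
  def initialize(persister: PersistToDB)
    @persister = persister
  end

  def perform(html)
    page = Nokogiri::HTML(html)
    extract_objects(page).each { |o| @persister.perform(o) }
  end

  private

  # hypothetical extraction; the real selectors depend on the site
  def extract_objects(page)
    page.css('.item').map { |node| { title: node.at_css('.title')&.text } }
  end
end

The scheduler from the earlier sketch would then pass the URL along as well, e.g. ScraperWorker.perform_async('ParserTypeOne', 'https://example.com/one'), and swapping PersistToDB for a fake in tests becomes trivial.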

I know this would work fine; I'm just interested in a few ideas on how to improve it. Suggestions?
