I want to design an application which executes millions of similar tasks every few hours/days.
To make it easier to explain, I'm going to use scraping Amazon as an example (it is not the real project I'm working on).
Let's say we have a long list of products whose prices we would like to collect from Amazon once a day and store somewhere (database, file, etc.). The list changes over time: every day a few products get removed and new ones get added. The list of products can itself be stored in a database, file, etc. We are okay with running duplicate tasks, but we should minimize them to avoid getting our IPs blocked.
- A simple solution is to have a cron job that runs once a day, gets the list of products, loops through them, collects the price for each one, and stores it. Repeat the next day.
One issue is that a single loop leaves little room for parallelization. We could split the work and have multiple threads handle it, but what if one server cannot process the whole list in a day and we need to distribute the work among multiple servers? We could split the list into batches and give each server one batch. But then how do we handle server failures? Would we need another task to reschedule the remaining work for all the failed or incomplete tasks? And how do we make sure the workload is evenly distributed among servers?
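To make the batch-splitting idea concrete, here is a minimal sketch, assuming each product has a string ID and each server knows its own index and the total number of servers. The names `shard_of` and `batch_for` are hypothetical, made up for illustration. A stable hash is used so that every server computes the same assignment without coordinating.

```python
import hashlib


def shard_of(product_id: str, num_servers: int) -> int:
    """Map a product ID to a server index using a stable hash.

    Python's built-in hash() is randomized per process for strings,
    so a cryptographic digest is used to get the same answer everywhere.
    """
    digest = hashlib.sha256(product_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_servers


def batch_for(products, server_index: int, num_servers: int):
    """Return the subset of products this server is responsible for."""
    return [p for p in products if shard_of(p, num_servers) == server_index]
```

This spreads products roughly evenly, but it does nothing about the failure question above: if one server dies, its batch is simply not processed until something else notices and reassigns it.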
- A better solution is to queue the tasks, then workers can take new tasks from the queue as they become available.
But what about the code that has to queue up the tasks? It has to go through millions of products and create millions of tasks in the queue. The queuing step can itself fail partway through, and then we end up with either missing tasks or more duplicate tasks than expected.
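The worker side of the queue approach could look roughly like the sketch below. It uses Python's in-process `queue.Queue` and threads purely as a stand-in for a real distributed queue, and `fetch_price` is a hypothetical placeholder, not real scraping code.

```python
import queue
import threading


def fetch_price(product_id: str) -> float:
    # Hypothetical placeholder for the real work (e.g. an HTTP request).
    return float(len(product_id))


def run_workers(product_ids, num_workers: int = 4) -> dict:
    """Enqueue every product, then let workers pull tasks as they become free."""
    tasks = queue.Queue()
    for pid in product_ids:
        tasks.put(pid)

    prices = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                pid = tasks.get_nowait()
            except queue.Empty:
                return  # nothing left to do
            price = fetch_price(pid)
            with lock:
                prices[pid] = price
            tasks.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return prices
```

Load balancing comes for free here, since a fast worker just takes more tasks. The enqueue loop at the top of `run_workers` is exactly the part I'm worried about: in this sketch it is a single pass with no checkpoint, so a crash in the middle leaves the queue partially filled.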
- Another option is to use a relational database that supports row locking. Each worker would lock a row that hasn't been processed that day, complete the work, update the timestamp, and then unlock the row.
I'm not sure whether this would scale, or whether the database could handle all the workers as their number increases. I'm also not sure whether we would run into deadlocks and similar problems.