Friday, July 14, 2017

What is the best practice for listening to events, grouping them, and submitting them in a batch?

Let's say my system wants to listen to users' click events and save them into archive storage. I know where each event comes from (userId, of which there are about a few hundred) and which url was clicked (url, with effectively infinite variations).

class ClickEvent {
  String userId; // who clicked (a few hundred distinct users)
  String url;    // what was clicked (effectively infinite variations)
}

If my system potentially receives thousands of events per second, I do not want to put that massive load onto the storage by calling it once for every incoming click event. Assume the storage is an AWS S3-like object store or a data warehouse, which is much better at storing a small number of large files than at handling tens of thousands of requests per second.
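For concreteness in the sketches below, assume the downstream storage sits behind a small batch-oriented interface. ArchiveStorage and writeBatch are placeholder names I am introducing here, not a real AWS API:

import java.util.List;

// Hypothetical sink: one call persists many click events at once,
// e.g. as a single large file/object, instead of one request per click.
interface ArchiveStorage {
  void writeBatch(String userId, List<String> urls);
}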

My current approach is to use Google Guava's Cache library (or any cache with expiration support).

Assume that the cache key is userId, and the cache value is the list of that user's urls (e.g., a List<String>).

  • Cache miss -> add an entry to the cache: (userId, [url1])
  • Cache hit -> append the new url to the list: (userId, [url1, url2, ...])
  • A cache entry expires a configurable X minutes after the initial write, or once it holds 10000 urls.
  • Upon expiration of an entry, I push the accumulated data into the storage, ideally collapsing up to 10000 small separate transactions into one large transaction (see the sketch after this list).
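A minimal sketch of this approach with Guava, assuming the hypothetical ArchiveStorage interface above. The coarse synchronized block is just for clarity; a production version would need more careful handling of the race between appends and an in-flight flush:

import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import com.google.common.cache.RemovalListener;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

public class ClickEventBatcher {
  private static final int MAX_URLS_PER_BATCH = 10000;

  private final ArchiveStorage storage;
  private final Cache<String, List<String>> buffer;

  public ClickEventBatcher(ArchiveStorage storage, long maxAgeMinutes) {
    this.storage = storage;
    this.buffer = CacheBuilder.newBuilder()
        // Time-based flush: X minutes after the entry was first written.
        // Mutating the list in place below does not reset the write timestamp.
        .expireAfterWrite(maxAgeMinutes, TimeUnit.MINUTES)
        // Any removal (expiry or explicit invalidate) pushes one large batch.
        .removalListener((RemovalListener<String, List<String>>) n ->
            storage.writeBatch(n.getKey(), n.getValue()))
        .build();
  }

  public void onClick(ClickEvent event) {
    synchronized (this) {
      List<String> urls =
          buffer.asMap().computeIfAbsent(event.userId, k -> new ArrayList<>());
      urls.add(event.url);
      if (urls.size() >= MAX_URLS_PER_BATCH) {
        // Size-based flush: invalidation fires the removal listener.
        buffer.invalidate(event.userId);
      }
    }
  }

  public void cleanUp() {
    // Guava evicts lazily, so expirations only fire during cache activity
    // or an explicit cleanUp() call.
    buffer.cleanUp();
  }
}

One caveat with this design: Guava runs removal listeners synchronously on the thread that triggers the eviction, so a slow storage write would block that caller; wrapping the listener with RemovalListeners.asynchronous(...) and an executor avoids that.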

I am not sure whether there is a "standard" or better way (or even a well-known library) to tackle this problem, that is, accumulating thousands of events per second and saving them to the storage/file/data warehouse all at once, instead of passing a high-TPS load on to downstream services. This feels like one of the common use cases for big data systems.
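Whatever ends up doing the batching, one Guava-specific detail matters for my time-based flush: entries for idle users will not expire on their own, because Guava performs eviction lazily. A small usage sketch of the batcher above, with a scheduled cleanUp() to cover that case (the logging lambda is just a stand-in for real storage):

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class Main {
  public static void main(String[] args) {
    // Stand-in storage that just logs batch sizes; a real one would
    // write a large file to S3 or a warehouse.
    ClickEventBatcher batcher = new ClickEventBatcher(
        (userId, urls) -> System.out.println(userId + ": flushing " + urls.size() + " urls"),
        5 /* X minutes */);

    // Poke the cache periodically so expired entries of idle users
    // actually get flushed.
    ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    scheduler.scheduleAtFixedRate(batcher::cleanUp, 1, 1, TimeUnit.MINUTES);

    ClickEvent event = new ClickEvent();
    event.userId = "user-1";
    event.url = "https://example.com/page";
    batcher.onClick(event);
  }
}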
