Wednesday, February 11, 2015

Design a MapReduce job to find the maximum value below a given threshold

The query:


I am trying to design a query to find the live version of a web page as of a given date.


The date is passed as a run-time argument. The input to a mapper is the following key -> value pair: webpage_id -> (revision_id, revision_timestamp).


For each webpage_id, the job has to output the revision_id of the latest revision of the page that happened before the given date.
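
For example, if webpage 42 has revisions r1 (2014-01-05), r2 (2014-06-20) and r3 (2015-03-01), and the given date is 2015-01-01, the job should output 42 -> r2.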


Current design:


The mappers would discard any record with revision_timestamp after the threshold date and would output all other records.
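
To make this concrete, here is a rough sketch of the filtering mapper I have in mind; the tab-separated input layout (webpage_id, revision_id, revision_timestamp as epoch milliseconds) and the threshold.date configuration key are placeholders for my actual setup.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class RevisionFilterMapper extends Mapper<LongWritable, Text, Text, Text> {
    private long threshold;

    @Override
    protected void setup(Context context) {
        // The threshold date is passed as a run-time argument and stored in the job configuration.
        threshold = context.getConfiguration().getLong("threshold.date", Long.MAX_VALUE);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split("\t");
        String webpageId = fields[0];
        String revisionId = fields[1];
        long timestamp = Long.parseLong(fields[2]);

        // Discard revisions after the threshold date; emit everything else.
        if (timestamp <= threshold) {
            context.write(new Text(webpageId), new Text(revisionId + "\t" + timestamp));
        }
    }
}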


The combiners would then sort all the revisions for a given webpage and output only the latest one (this is done by using an internal data structure and emitting key-value pairs at the cleanup stage of the combiners).
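
A sketch of the "keep only the latest revision" step; since the framework already groups values by webpage_id, I pick the maximum timestamp directly here instead of buffering until the cleanup stage. The value layout matches the mapper sketch above.

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class LatestRevisionReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text webpageId, Iterable<Text> revisions, Context context)
            throws IOException, InterruptedException {
        String latestRevision = null;
        long latestTimestamp = Long.MIN_VALUE;

        // Keep only the revision with the largest timestamp for this webpage.
        for (Text revision : revisions) {
            String[] fields = revision.toString().split("\t");
            long timestamp = Long.parseLong(fields[1]);
            if (timestamp > latestTimestamp) {
                latestTimestamp = timestamp;
                latestRevision = revision.toString();
            }
        }

        if (latestRevision != null) {
            context.write(webpageId, new Text(latestRevision));
        }
    }
}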


The reducers would do the same thing as the combiners, but on the combiner's output.
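
The driver would then wire the same class in as both combiner and reducer; the class names and arguments are the same placeholders as above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LatestRevisionJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // args[2] is the threshold date as epoch milliseconds (the run-time argument).
        conf.setLong("threshold.date", Long.parseLong(args[2]));

        Job job = Job.getInstance(conf, "latest-revision-before-date");
        job.setJarByClass(LatestRevisionJob.class);
        job.setMapperClass(RevisionFilterMapper.class);
        job.setCombinerClass(LatestRevisionReducer.class);
        job.setReducerClass(LatestRevisionReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}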


Idea:


I want to further optimize the job. I think it would be a good idea to keep a "global" variable holding the latest revision processed so far for each webpage. Before a mapper outputs a record, it would check whether the revision is the "globally latest" one for that webpage; if it isn't, the mapper would not emit it, and if it is, the mapper would emit the record and update the global variable. I think this could reduce the amount of data transferred over the network and speed up the job. Do you think this idea is feasible, and is it likely to improve performance?
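
The closest thing I can write without external coordination is a per-mapper (not truly global) version of the idea, where each mapper tracks the best revision it has seen so far for each webpage and emits everything at cleanup; this is only an approximation of the cross-mapper variable I describe above.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class InMapperLatestRevisionMapper extends Mapper<LongWritable, Text, Text, Text> {
    private long threshold;
    private final Map<String, String> latestPerPage = new HashMap<>();
    private final Map<String, Long> latestTimestamp = new HashMap<>();

    @Override
    protected void setup(Context context) {
        threshold = context.getConfiguration().getLong("threshold.date", Long.MAX_VALUE);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context) {
        String[] fields = value.toString().split("\t");
        String webpageId = fields[0];
        long timestamp = Long.parseLong(fields[2]);

        // Keep the record only if it is before the threshold and newer than
        // anything this mapper has already seen for the same webpage.
        if (timestamp <= threshold
                && timestamp > latestTimestamp.getOrDefault(webpageId, Long.MIN_VALUE)) {
            latestTimestamp.put(webpageId, timestamp);
            latestPerPage.put(webpageId, fields[1] + "\t" + timestamp);
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Emit one record per webpage seen by this mapper.
        for (Map.Entry<String, String> entry : latestPerPage.entrySet()) {
            context.write(new Text(entry.getKey()), new Text(entry.getValue()));
        }
    }
}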


Question:


Is there a way to create and update such global variables? I read that one of ZooKeeper's features is to act as a key-value store, but I cannot find a code example of how to initialize or access a ZooKeeper record/variable inside a map task.
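
For reference, this is roughly the kind of read/update I imagine doing against a ZooKeeper znode from inside a map task; the connect string, znode path, and value encoding are placeholders, and I have left out error handling, retries, and waiting for the connection event.

import java.nio.charset.StandardCharsets;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class LatestRevisionRegistry {
    private final ZooKeeper zk;

    public LatestRevisionRegistry(String connectString) throws Exception {
        // A real client should wait for the connected event via the watcher;
        // omitted here to keep the sketch short.
        this.zk = new ZooKeeper(connectString, 30000, event -> { });
    }

    /** Returns the timestamp stored for the webpage, or Long.MIN_VALUE if absent. */
    public long get(String webpageId) throws Exception {
        String path = "/latest-revision/" + webpageId;
        Stat stat = zk.exists(path, false);
        if (stat == null) {
            return Long.MIN_VALUE;
        }
        return Long.parseLong(new String(zk.getData(path, false, null), StandardCharsets.UTF_8));
    }

    /** Stores the timestamp, creating the znode on first use.
        Assumes the parent znode /latest-revision already exists. */
    public void put(String webpageId, long timestamp) throws Exception {
        String path = "/latest-revision/" + webpageId;
        byte[] data = Long.toString(timestamp).getBytes(StandardCharsets.UTF_8);
        if (zk.exists(path, false) == null) {
            zk.create(path, data, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        } else {
            zk.setData(path, data, -1); // -1 = skip the version check
        }
    }
}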


Are there any other ways in which I can improve the performance of my MapReduce job?

