Thursday, November 29, 2018

Data processing pipeline design

I have a use case for which I need to build a data processing pipeline:

  • Customer contact lead data coming from different sources (CSV, database, API) first has to be mapped to a set of universal schema fields (a minimal mapping sketch follows this list). There could be ~100k rows arriving each day that need to be processed.
  • Then some of the fields have to be cleansed, validated and enriched. For example, the email field has to be validated by calling an external API to check that it is valid and does not bounce, and the address field has to be standardized to a particular format. There are other operations like estimating city and state from the zip code and validating phone numbers. At least 20 operations are already planned, with more to come in the future.
  • The above rules are not fixed and can change based on what the user wants to do with their data (the choices are saved from the user interface). For example, for a particular data set, a user may choose only to standardize phone numbers but not to check whether they are valid: thus the operations performed on the data are dynamic.
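To make the first bullet concrete, here is a minimal sketch of how the per-source mapping to the universal schema could look; the source names, raw column names and schema fields below are simplified placeholders, not the real schema:

    import pandas as pd

    # Hypothetical per-source mapping from raw column names to universal schema fields.
    # Source names, column names and schema fields are placeholders for illustration.
    SOURCE_FIELD_MAPS = {
        "crm_csv":   {"Phone": "phone_number", "PostalCode": "zip", "EmailAddr": "email"},
        "leads_api": {"phone": "phone_number", "zip_code": "zip", "email": "email"},
    }

    def to_universal_schema(df, source):
        """Rename one source's raw columns to the universal schema field names."""
        mapping = SOURCE_FIELD_MAPS[source]
        # Keep only the mapped columns and rename them; unmapped columns are dropped.
        return df[list(mapping)].rename(columns=mapping)

    # Example: leads = to_universal_schema(pd.read_csv("crm_leads.csv"), "crm_csv")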

Here is what I am doing currently:

  1. Load the data as a pandas data frame (I have considered Spark, but the data set is not large enough [max 200 MB] to justify it). I have a list of user-defined operations that need to be performed on each field, like

    actions = {"phone_number": ['cleanse', 'standardise'], "zip": ["enrich", "validate"]}

As I mentioned above, the actions are dynamic and vary from data source to data source based on what the user chooses to do with each field. There is a lot of custom business logic like this that applies only to a specific field.

  2. I have a custom function for each operation that the user can define for a field. I call them based on the "actions" dictionary and pass the data frame to each function; the function applies its logic to the data frame and returns the modified data frame (a dispatch sketch follows the snippet below).
def cleanse_phone_no(df, configs):
    # Logic
    return modified_df
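
Here is a minimal sketch of that dispatch, assuming every handler follows the cleanse_phone_no(df, configs) signature; the decorator, the registry and the digits-only rule are illustrative placeholders, not the actual business logic:

    import pandas as pd

    # Hypothetical registry keyed by (field, operation); every handler takes
    # (df, configs) and returns the modified data frame, like cleanse_phone_no above.
    OPERATIONS = {}

    def operation(field, name):
        """Register a handler for one (field, operation) pair used in 'actions'."""
        def register(func):
            OPERATIONS[(field, name)] = func
            return func
        return register

    @operation("phone_number", "cleanse")
    def cleanse_phone_number(df, configs):
        # Illustrative placeholder rule: keep digits only.
        df["phone_number"] = df["phone_number"].astype(str).str.replace(r"\D", "", regex=True)
        return df

    def run_pipeline(df, actions, configs):
        """Apply the user-selected operations field by field, in the saved order."""
        for field, ops in actions.items():
            for op in ops:
                df = OPERATIONS[(field, op)](df, configs)
        return df

    # Example:
    # cleaned = run_pipeline(pd.read_csv("leads.csv"), {"phone_number": ["cleanse"]}, configs={})

Adding a new operation then only means registering one more function; the "actions" dictionary saved from the user interface decides which ones actually run.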

I am not sure if this is the right approach. Things are going to get complicated when I have to call external APIs to enrich certain fields in the future, so I am considering a producer-consumer model:

a. Have a producer module that splits each row in the file (one contact record) into a single message in a queue like AMQ or Kafka.

b. Have the processing logic in consumers; they take one message at a time and process it.

c. The advantage I see with this approach is that it simplifies the data processing: data is handled one record at a time, which gives more control and granularity. The disadvantage is the computational overhead of processing records one by one, which I could offset to an extent by running multiple consumers (a rough sketch follows below).
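
For reference, here is a rough sketch of (a) and (b) using the kafka-python client; the broker address, topic name and group id are placeholders, and the consumer would call the same per-field operations described above:

    import json
    import pandas as pd
    from kafka import KafkaProducer, KafkaConsumer  # pip install kafka-python

    # Producer: publish one contact record per message (broker/topic names are placeholders).
    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    for record in pd.read_csv("leads.csv").to_dict(orient="records"):
        producer.send("contact-records", record)
    producer.flush()

    # Consumer: run several of these processes in the same group to scale out.
    consumer = KafkaConsumer(
        "contact-records",
        bootstrap_servers="localhost:9092",
        group_id="lead-processors",
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    )
    for message in consumer:
        record = message.value  # a single contact as a dict
        # apply the user-selected cleanse/validate/enrich operations to this record

Scaling out would then just mean starting more consumer processes in the same consumer group, since Kafka distributes partitions across the group.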

Here are my questions:

  • What is your opinion about the approach? Do you have any suggestions for a better approach?
  • Is there a more elegant pattern I can use to apply the custom rules to the data set than what I am doing currently?
  • Is using a producer-consumer model to process the data one row at a time rather than the entire data set advisable (considering all the complexity in the logic that will come in the future)? If so, should I use AMQ or Kafka?
