Thursday, December 26, 2019

ETL design - what queue should I use instead of my SQL table and still be able to process in parallel?

Need your help with re-designing my system. We have a very simple ETL, but it is also very old, and now that we handle a massive amount of data it has become extremely slow and inflexible.

The first process is the collector:

Collector process - always up (a minimal sketch follows the list)

  1. The collector consumes a message from the queue (RabbitMQ).
  2. It parses the message properties (JSON format) into a Java object (for example, if the JSON contains the fields 'id', 'name', and 'color', we create a Java object with an int field 'id' and String fields 'name' and 'color').
  3. After parsing, it writes the object to a CSV file as a row containing all of the object's properties.
  4. It sends an ack and continues to the next message in the queue.
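To make the flow concrete, here is a minimal sketch of the collector loop, assuming the RabbitMQ Java client and Jackson; the queue name, output file, and the `Message` class are hypothetical stand-ins for our real ones:

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import com.rabbitmq.client.*;

import java.io.FileWriter;
import java.io.PrintWriter;

public class Collector {
    // Hypothetical message shape, matching the example fields above.
    public static class Message {
        public int id;
        public String name;
        public String color;
    }

    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        Channel channel = new ConnectionFactory().newConnection().createChannel();
        PrintWriter csv = new PrintWriter(new FileWriter("collector-output.csv", true), true);

        DeliverCallback onDeliver = (consumerTag, delivery) -> {
            // 2. parse the JSON body into a Java object
            Message m = mapper.readValue(delivery.getBody(), Message.class);
            // 3. write the object as one CSV row
            csv.println(m.id + "," + m.name + "," + m.color);
            // 4. ack and continue to the next message
            channel.basicAck(delivery.getEnvelope().getDeliveryTag(), false);
        };
        // 1. consume from the queue with manual acks
        channel.basicConsume("input-queue", false, onDeliver, consumerTag -> { });
    }
}
```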

Processing workflow - runs once every hour

  1. A process named 'Loader' loads all the CSV files (the collector's output) into a DB table named 'Input' using SQL INFILE LOAD; all new rows get a 'Not handled' status. The 'Input' table acts like a queue in this design.
  2. A process named 'Processor' reads all the records with 'Not handled' status from the table, transforms them into Java objects, applies some enrichment, and then inserts the records into another table named 'Output' with new fields. **Each iteration we process 1000 rows in parallel, using JDBC batch updates for the DB inserts** (a sketch of this step follows the list).
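Here is a minimal sketch of that Processor step, assuming plain JDBC batching; the JDBC URL, table columns, thread count, and `enrich()` are hypothetical placeholders:

```java
import java.sql.*;
import java.util.*;
import java.util.concurrent.*;

public class Processor {
    static final int CHUNK = 1000;
    static final String URL = "jdbc:mysql://localhost/etl"; // hypothetical

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(8);
        List<Future<?>> pending = new ArrayList<>();

        try (Connection read = DriverManager.getConnection(URL);
             Statement st = read.createStatement();
             ResultSet rs = st.executeQuery(
                     "SELECT id, name, color FROM Input WHERE status = 'Not handled'")) {
            List<String[]> chunk = new ArrayList<>(CHUNK);
            while (rs.next()) {
                chunk.add(new String[] { rs.getString(1), rs.getString(2), rs.getString(3) });
                if (chunk.size() == CHUNK) {
                    // hand off a full 1000-row chunk to a worker thread
                    List<String[]> rows = chunk;
                    chunk = new ArrayList<>(CHUNK);
                    pending.add(pool.submit(() -> insertBatch(rows)));
                }
            }
            if (!chunk.isEmpty()) {
                List<String[]> rows = chunk;
                pending.add(pool.submit(() -> insertBatch(rows)));
            }
        }
        for (Future<?> f : pending) f.get(); // propagate worker failures
        pool.shutdown();
    }

    static void insertBatch(List<String[]> rows) {
        try (Connection write = DriverManager.getConnection(URL);
             PreparedStatement ps = write.prepareStatement(
                     "INSERT INTO Output (id, name, color, enriched) VALUES (?, ?, ?, ?)")) {
            for (String[] r : rows) {
                ps.setString(1, r[0]);
                ps.setString(2, r[1]);
                ps.setString(3, r[2]);
                ps.setString(4, enrich(r)); // hypothetical enrichment
                ps.addBatch();
            }
            ps.executeBatch(); // one round trip per chunk
        } catch (SQLException e) {
            throw new RuntimeException(e);
        }
    }

    static String enrich(String[] row) {
        return "enriched:" + row[1]; // placeholder for the real enrichment logic
    }
}
```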

The major problem in this flow:

The messages are not flexible in the existing flow: if I want to add a new property to the JSON message (for example, 'city'), I also have to add a 'city' column to the table (because of the CSV INFILE LOAD). The table contains a massive amount of data, and it is not feasible to add a column every time the message changes.
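To illustrate the coupling (all names hypothetical): the column list in the INFILE LOAD must match the CSV layout exactly, so every new JSON field forces a DDL change before the load can succeed:

```java
// Adding 'city' to the JSON means running the ALTER first, then changing
// the collector's CSV row and this statement in lockstep.
String ddl  = "ALTER TABLE Input ADD COLUMN city VARCHAR(64)";   // must run first
String load = "LOAD DATA INFILE '/data/collector-output.csv' " +
              "INTO TABLE Input FIELDS TERMINATED BY ',' " +
              "(id, name, color)";                               // 'city' must be added here too
```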

My conclusion

A table is not the right choice for this design.

I have to get rid of the CSV writing and remove the 'Input' table to have a flexible system. I thought of maybe using a queue such as Kafka instead of the table, and perhaps tools like Kafka Streams for the enrichment. This would give me flexibility, and I wouldn't need to add a column to the table every time I want to add a field to the message. The huge problem is that I won't be able to process in parallel the way I do today.
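For context, here is a minimal sketch of what the consuming side could look like with the plain Kafka consumer API (the topic name, group id, and broker address are hypothetical). The value is consumed as raw JSON, so new fields would not require schema changes; as far as I understand, parallelism would then be bounded by the topic's partition count, since consumers in the same group split the partitions between them:

```java
import org.apache.kafka.clients.consumer.*;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class EnrichmentWorker {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "enrichers"); // same group = partitions are shared
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("raw-messages"));
            // Run N copies of this worker (up to the partition count) for parallelism.
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> rec : records) {
                    // The value is the raw JSON string; no table schema to keep in sync.
                    String enriched = rec.value() + " (enriched)"; // placeholder enrichment
                    System.out.println(enriched); // placeholder for the real Output sink
                }
            }
        }
    }
}
```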

What can I use instead of a table that will allow me to process the data in parallel?
