Wednesday, 28 March 2018

Spark ETL: Generating Unique Identifiers for Entities

We have a requirement in Spark where every record coming from the feed is broken into a set of entities, for example {col1,col2,col3} => Resource, {col4,col5,col6} => Account, {col7,col8} => EntityX, etc.

Now I need a unique identifier generated in the ETL layer that can be persisted to the database table for each of the entities mentioned above. This unique identifier acts as a lookup value to identify each table's records and to generate a sequence in the DB.

1. The first approach was to use Redis to generate keys for each entity, identified by the natural unique columns in the feed. This approach was not stable: the Redis instance we used crashed during peak hours, and because Redis executes commands on a single thread, it would be slow when I run many ETL jobs in parallel.

2. My thought is to use a cryptographic hash such as SHA-256 rather than a 32-bit hash (for example CRC32). A 32-bit hash has only 2^32 possible values, so collisions between different inputs are a real possibility. SHA-256 produces a 256-bit digest, so the range of hash values is 2^256 and the probability of a collision is negligible. (SHA-256 is a hash function; it does not encrypt the data.)
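To make the second option concrete, here is a minimal sketch in plain Python of deriving a deterministic key from an entity's natural columns, plus a rough birthday-bound estimate of the collision probability. The names `entity_key`, `collision_probability` and `SEP`, the separator character, and the choice to treat nulls as empty strings are all assumptions made for illustration; they are not part of the original feed or pipeline.

```python
import hashlib
import math

# Hypothetical separator: the ASCII unit-separator control character,
# assumed not to occur in feed values.
SEP = "\x1f"


def entity_key(entity, natural_values):
    """Return a deterministic 64-hex-character SHA-256 key for one entity row."""
    # Prefix with the entity name so identical values in different
    # entities (Resource vs. Account) produce different keys.
    # Assumption: None is treated the same as an empty string.
    parts = [entity] + ["" if v is None else str(v) for v in natural_values]
    # Length-prefix every part so ("ab", "c") and ("a", "bc") do not
    # produce the same byte string after joining.
    encoded = SEP.join(f"{len(p)}:{p}" for p in parts)
    return hashlib.sha256(encoded.encode("utf-8")).hexdigest()


def collision_probability(n_keys, bits):
    """Birthday-bound approximation: p ~ 1 - exp(-n(n-1) / 2^(bits+1))."""
    exponent = -n_keys * (n_keys - 1) / 2.0 ** (bits + 1)
    # -expm1(x) equals 1 - exp(x) and stays accurate for tiny |x|.
    return -math.expm1(exponent)


if __name__ == "__main__":
    print(entity_key("Resource", ["v1", "v2", "v3"]))
    print(collision_probability(100_000, 32))   # 32-bit hash
    print(collision_probability(10**9, 256))    # SHA-256
```

Because the key depends only on the entity name and the natural column values, every parallel ETL job computes the same key for the same record without any shared service. Inside Spark itself, the built-in `sha2` and `concat_ws` functions from `pyspark.sql.functions` could likely play the same role without a Python UDF, though that variant is not shown here.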

But the second option is not well accepted by many people. What other options/solutions are there to create unique keys in the ETL layer that can be looked up in the DB for comparison?

Thanks in Advance, Rajesh Giriayppa
