Wednesday, October 5, 2022

How to read and validate a fixed-width file in PySpark

I have a fixed-width file with multiple record types, as described below:

A record block starts with Record Type 01 and ends with Record Type 14. Record Types 13 and 14 can each have multiple rows, but Record Types 01 and 11 always have exactly one row per block.

The first 2 bytes represent the "Record Type", e.g. 00, 01, 11, 13, 14, 99.

The next 6 bytes represent the "Account Number", e.g. 778899, 997788.

After the "Account Number" field, the schema of each record type is different:

Record Type 11 holds the master-level information for Record Types 13 and 14. It has the Group Entity Name (in short form), e.g. TATA, RELI, and a Date (in DDMMYY), e.g. 120822, 080822.

Record Type 13 has the Business Entity Name (20 bytes) under the Group Entity.

Record Type 14 has the Business Entity Name (20 bytes), followed by the Share Holders' Names (20 bytes) in order of their holdings.
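To make the layout above concrete, here is a minimal sketch in plain Python that slices one line into fields. The names `parse_line` and `chunks` are hypothetical, and two details are assumptions read off the sample data rather than stated in the layout: the Group Entity Name is taken to be 4 characters wide, and each Share Holder name in Record Type 14 is taken to occupy its own 20-byte slot.

```python
from datetime import datetime


def chunks(text, width):
    """Split text into consecutive fixed-width pieces and strip the padding."""
    return [text[i:i + width].strip() for i in range(0, len(text), width)]


def parse_line(line):
    """Parse one fixed-width line into a dict of named fields."""
    record_type = line[0:2]
    # Header (00) and trailer (99) lines carry no account number.
    if record_type in ("00", "99"):
        return {"record_type": record_type}

    record = {"record_type": record_type, "account_no": line[2:8]}
    payload = line[8:]

    if record_type == "11":
        # Assumption: 4-character group name, then a DDMMYY date.
        record["group_name"] = payload[0:4]
        record["as_of_date"] = datetime.strptime(payload[4:10], "%d%m%y").date()
    elif record_type == "13":
        record["business_entity"] = payload[0:20].strip()
    elif record_type == "14":
        # Assumption: entity name, then one 20-byte slot per share holder.
        fields = chunks(payload, 20)
        record["business_entity"] = fields[0] if fields else ""
        record["shareholders"] = [name for name in fields[1:] if name]
    return record
```

In PySpark the same slicing could probably be expressed by reading the file with `spark.read.text` and applying `pyspark.sql.functions.substring` (which is 1-based) to the resulting `value` column, or by wrapping a function like this one in a UDF.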

00 - Header
01778899
11778899TATA120822
13778899TataCommunications
13778899TataElectronics
13778899TataDigital
14778899Tata Communications Tata Sons Chandrashekaran Mistry
14778899Tata Electronics Tata Sons Mistry Gopal
14778899Tata Digital Tata Sons Mistry Chandrashekaran
01997788
11997788RELI080822
13997788Reliance Retail
13997788Reliance Jio
13997788Reliance Energy
14997788Reliance Retail Reliance Future Retail Ambani
14997788Reliance Jio Reliance Adani Meta
99 - Trailer

The validation rules are as follows:

Each record block should start with Record Type 01.

Record Type 11 should come immediately after Record Type 01.

Record Type 13 should come immediately after Record Type 11.

Record Type 14 should come immediately after Record Type 13.
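One way to read these rules is as a table of allowed transitions between consecutive record types. The sketch below, in plain Python, checks a list of record types in file order. The names `ALLOWED_NEXT` and `validate_sequence` are hypothetical, and the table encodes some assumptions beyond the stated rules: repeated 13 and 14 rows are allowed, all 13 rows come before the 14 rows of a block, the header 00 is followed by 01, and a block may be followed by either a new 01 or the trailer 99.

```python
# Record types that may legally follow each record type (partly assumed).
ALLOWED_NEXT = {
    "00": {"01"},
    "01": {"11"},
    "11": {"13"},
    "13": {"13", "14"},
    "14": {"14", "01", "99"},
}


def validate_sequence(record_types):
    """Return (index, previous_type, current_type) for each invalid transition.

    The index is the 0-based position of the offending line in the input.
    """
    errors = []
    for i in range(1, len(record_types)):
        prev, cur = record_types[i - 1], record_types[i]
        if cur not in ALLOWED_NEXT.get(prev, set()):
            errors.append((i, prev, cur))
    return errors


# A block that skips Record Type 11 is flagged at the line that follows 01.
print(validate_sequence(["00", "01", "13", "14", "99"]))
```

In PySpark, a comparable check might attach a line index to each row and fetch the previous record type with `pyspark.sql.functions.lag` over a window ordered by that index. This depends on the original line order being preserved when the file is read, which is not guaranteed by default.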

Could you please help me with the design? Thanks a lot.
