Tuesday, 8 September 2020

Design pattern for a data parsing and feature engineering pipeline

I know this question has been asked a few times on these boards in the past, but I have a more generic version of it, which might be applicable to someone else's projects in the future as well.

In short, I am building an ML system (in Python, though the language choice is not critical here) whose ML model sits at the end of a pipeline of steps:

  • Data upload
  • Data parsing
  • Feature engineering
  • Feature engineering (in a different logical bracket from the previous step)
  • Feature engineering (in a different logical bracket from the previous steps)

... (more steps like the last 3)

  • Data passed to ML model

Each of the steps above has its own series of actions (sub-steps) to perform in order to build its output, which is then used as the input to the next step, and so on. These sub-steps can either be completely decoupled from one another, or some of them may need other sub-steps inside the same big step to complete first, because they consume the data those sub-steps produce. What I need now is a custom pipeline that makes it very easy to add new steps (both big and small) without upsetting the existing ones.
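To make the dependency part concrete, here is a minimal sketch of how the sub-steps of one big step could be ordered so that each runs only after the sub-steps it needs. The sub-step names and the `requirements` mapping are made up for illustration; `graphlib.TopologicalSorter` is in the standard library from Python 3.9.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical sub-steps of one big step: name -> names of the sub-steps
# whose output it needs. An empty set means the sub-step is decoupled.
requirements = {
    "parse_dates": set(),
    "parse_prices": set(),
    "daily_returns": {"parse_dates", "parse_prices"},
    "rolling_volatility": {"daily_returns"},
}


def execution_order(requirements):
    """Return sub-step names so that each comes after everything it requires.

    Raises graphlib.CycleError if the requirements contain a cycle.
    """
    return list(TopologicalSorter(requirements).static_order())


order = execution_order(requirements)
print(order)
```

The exact order of the decoupled sub-steps is not fixed, only the guarantee that a sub-step never runs before its requirements.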

So far, I have this idea of what it might look like from an architecture perspective, as shown below:

[Image: Architecture concept]

Looking at this architecture, I immediately think of the Chain of Responsibility design pattern: one chain manages the BIG STEPS (1, 2, ..., n), and each BIG STEP has its own small Chain of Responsibility inside it, which runs the NO_REQ steps independently and then the REQ steps (looping over the REQ steps until they are all done). With a shared interface for running the logic inside big and small steps, it would probably run rather neatly.
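A rough sketch of that idea, under my own assumptions: the data travels through the pipeline as a plain dict, every step exposes the same `run(data)` method, and a REQ step declares the dict keys it needs. The class names `SmallStep`, `BigStep` and `Pipeline` and the example steps are hypothetical, not an established API.

```python
from typing import Callable, Dict, Iterable, List

Data = Dict[str, object]  # assumption: a dict carried through the whole pipeline


class SmallStep:
    """One unit of work; `requires` lists the data keys that must exist first."""

    def __init__(self, name: str, func: Callable[[Data], object],
                 requires: Iterable[str] = ()):
        self.name = name
        self.func = func
        self.requires = set(requires)

    def run(self, data: Data) -> Data:
        data[self.name] = self.func(data)  # store the output under the step name
        return data


class BigStep:
    """Runs NO_REQ small steps first, then loops over REQ steps until all are done."""

    def __init__(self, name: str, steps: List[SmallStep]):
        self.name = name
        self.steps = steps

    def run(self, data: Data) -> Data:
        pending = []
        for step in self.steps:
            if step.requires:
                pending.append(step)   # REQ step: postpone
            else:
                step.run(data)         # NO_REQ step: run straight away
        while pending:
            ready = [s for s in pending if s.requires.issubset(data)]
            if not ready:
                raise RuntimeError(f"unsatisfiable requirements in {self.name!r}")
            for step in ready:
                step.run(data)
            pending = [s for s in pending if s not in ready]
        return data


class Pipeline:
    """Chains big steps: the output of one is the input of the next."""

    def __init__(self, big_steps: List[BigStep]):
        self.big_steps = big_steps

    def run(self, data: Data) -> Data:
        for big_step in self.big_steps:
            data = big_step.run(data)
        return data


pipeline = Pipeline([
    BigStep("parsing", [
        SmallStep("numbers", lambda d: [int(x) for x in d["raw"].split(",")]),
    ]),
    BigStep("features", [
        # Listed first on purpose: it still runs after "mean" because of `requires`.
        SmallStep("centered", lambda d: [x - d["mean"] for x in d["numbers"]],
                  requires={"mean"}),
        SmallStep("mean", lambda d: sum(d["numbers"]) / len(d["numbers"])),
    ]),
])
result = pipeline.run({"raw": "1,2,3"})
```

Because big and small steps share the same `run(data)` shape, the pipeline does not care what happens inside a big step, and the order in which small steps are listed does not matter.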

Still, I am wondering whether there is a better way of doing it. What I do not like about Chain of Responsibility is that anyone adding a new BIG/SMALL step would always have to edit the "guts" of the logic that sets up the step bags, to manually include the newly added step. I would love to build something that instead just scans a folder specific to the steps under each BIG STEP and builds the lists of NO_REQ and REQ steps on its own (to uphold the Open/Closed principle from SOLID).
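What I have in mind for the folder scan is roughly the sketch below. It assumes a convention I made up for illustration: each step lives in its own `.py` file that defines a callable `run(data)` and, optionally, a `REQUIRES` collection naming the steps it depends on. The function name `discover_steps` is hypothetical; the `importlib.util` calls are from the standard library.

```python
import importlib.util
from pathlib import Path


def discover_steps(folder):
    """Import every .py file in `folder` and split its steps into NO_REQ and REQ.

    Assumed convention (hypothetical): a step module defines `run(data)` and,
    optionally, `REQUIRES`, a collection of the step names it depends on.
    Returns two lists of (step_name, requires, run_function) tuples.
    """
    no_req, req = [], []
    for path in sorted(Path(folder).glob("*.py")):
        if path.name.startswith("_"):
            continue  # skip __init__.py and private helper modules
        spec = importlib.util.spec_from_file_location(path.stem, str(path))
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)  # executes the file's top-level code
        if not callable(getattr(module, "run", None)):
            continue  # not a step module under this convention
        requires = set(getattr(module, "REQUIRES", ()))
        entry = (path.stem, requires, module.run)
        (req if requires else no_req).append(entry)
    return no_req, req
```

With something like this, adding a step would mean dropping a new file into the folder of the relevant BIG STEP, with no edits to the code that sets up the step bags. Note that importing a file runs its top-level code, so the folder should only hold trusted modules.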

I would be grateful for any ideas.
