Here is the context for the question. A top-level module (see the designs below) contains code that, as its name suggests, manages an ETL pipeline. A directory called etl_helpers organizes the modules that make up the pipeline. The discussion concerns Python, which supports OOP as well as structured/functional approaches to programming. I have left out the directory __init__.py files for convenience.
I am interested in opinions on which design adheres best to standard architectural practices and SOLID principles. I understand that this is one of those topics where people may have strong opinions one way or the other. I am interested in those opinions.
Allow me to give my thoughts. First, I don't think it would make much difference whether I used OOP or a structured programming paradigm for this functionality. A structured paradigm, in my opinion and along the lines of Rich Hickey's comments on simple versus complex, would be the simpler implementation. In this case there is no reason to create a construct with state. So let's assume the code is structured and not OOP.
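To make the stateless option concrete, here is a minimal sketch of what I have in mind. The function names, the 'name,amount' input format, and the list used as a sink are all hypothetical and chosen only for illustration; in Design I each function would live in its own file under etl_helpers.

```python
from typing import Dict, Iterable, List

def extract(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Parse 'name,amount' lines into raw records (would live in extract.py)."""
    records = []
    for line in lines:
        name, amount = line.strip().split(",")
        records.append({"name": name, "amount": amount})
    return records

def transform(records: List[Dict[str, str]]) -> List[dict]:
    """Title-case names and convert amounts to float (would live in transform.py)."""
    return [
        {"name": r["name"].title(), "amount": float(r["amount"])}
        for r in records
    ]

def load(records: List[dict], sink: list) -> int:
    """Append records to the sink and return how many were loaded (load.py)."""
    sink.extend(records)
    return len(records)

def run_pipeline(lines: Iterable[str], sink: list) -> int:
    """What manage_the_etl_pipeline.py might do: compose the three steps."""
    return load(transform(extract(lines)), sink)
```

No step keeps state between calls, so there is nothing a class would need to encapsulate; the only mutable object is the sink that the caller passes in.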
I would go with Design I. Succinctly stated, Design I supports readability and maintainability at least as well as, if not better than, the other designs.
The directory structure preserves the level of abstraction in the code (readability). Design III merges the three files into one module that sits at the same level of abstraction as the manage_the_etl_pipeline.py module. Design II puts the code one level of abstraction lower, where it belongs, but still collects it into a single module. If I were using OOP, with classes encapsulating the objects, my objections would not be as strong.
The goal of the SOLID principles is the creation of mid-level software structures that (Software Architecture: SA Martin):
- Tolerate change,
- Are easy to understand, and
- Are the basis of components that can be used in many software systems.
I think Design I best meets these goals.
I could point to the Single Responsibility Principle, which is defined as (SA Martin): a module should be responsible to one, and only one, actor. Design I should satisfy the Liskov Substitution Principle as well. Further, each module in the etl_helpers directory is at the same level of abstraction.
I could also mention that as Dijkstra stressed, at every level, from the smallest function to the largest component, software is like a science and, therefore, is driven by falsifiability. Software architects strive to define modules, components, and services that are easily falsifiable (testable). To do so, they employ restrictive disciplines similar to structured programming, albeit at a much higher level (SA Martin).
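As a sketch of what "easily falsifiable" means for Design I, a single step can be tested with no extract or load code in sight. The transform function below is a hypothetical stand-in for whatever transform.py would export, and the rule it applies (drop rows with no amount) is an assumption made up for the example.

```python
def transform(records):
    """Hypothetical transform step: drop rows with no amount, convert the rest to float."""
    return [
        {**r, "amount": float(r["amount"])}
        for r in records
        if r.get("amount") not in (None, "")
    ]

def test_transform_drops_rows_without_amount():
    rows = [{"id": 1, "amount": "3.5"}, {"id": 2, "amount": ""}]
    assert transform(rows) == [{"id": 1, "amount": 3.5}]

def test_transform_does_not_mutate_input():
    rows = [{"id": 1, "amount": "3.5"}]
    transform(rows)
    # The original list and its dicts are left unchanged.
    assert rows == [{"id": 1, "amount": "3.5"}]

# A test runner such as pytest would discover these; calling them directly also works.
test_transform_drops_rows_without_amount()
test_transform_does_not_mutate_input()
```

The same tests could be written against Design II or III, but there the test file would import one module that also carries the extract and load code.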
One last point. "Everything" in Python is a namespace: built-in, global, enclosing, and function (local) namespaces, plus user-created ones such as dictionaries, SimpleNamespace, and dataclasses. Namespaces group similar "things" together. Collapsing everything into one module-level (global) namespace, as Designs II and III do, confounds that grouping.
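A rough way to picture the namespace argument, using SimpleNamespace objects as stand-ins for real modules (the run, extract_rows, transform_rows, and load_rows names are invented for this sketch):

```python
from types import SimpleNamespace

# Design I: three separate namespaces, so each can reuse the short name `run`.
extract = SimpleNamespace(run=lambda: ["a", "b"])
transform = SimpleNamespace(run=lambda rows: [r.upper() for r in rows])
load = SimpleNamespace(run=lambda rows: len(rows))

loaded = load.run(transform.run(extract.run()))

# Design II / III: one flat namespace, so every name has to carry its own prefix.
extract_transform_load = SimpleNamespace(
    extract_rows=extract.run,
    transform_rows=transform.run,
    load_rows=load.run,
)
names_in_merged = sorted(vars(extract_transform_load))
```

With separate namespaces the grouping is done by the module name; in the merged namespace it has to be done by naming convention.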
I have expressed some of the reasons why Design I might be preferred, but what are the compelling reasons, if there are any, that would suggest another design is superior?
Finally, let me reference an interesting research paper I read recently that seems to support the view that the other designs are anti-patterns: Architecture_Anti-patterns_Automatically.pdf. It can be found on the web.
SEVERAL DESIGNS FOR COMPARISON
DESIGN I:
manage_the_etl_pipeline.py
-- etl_helpers
     extract.py
     transform.py
     load.py
Of course one could also use:
DESIGN II:
manage_the_etl_pipeline.py
-- etl_helpers
     extract_transform_load.py
or probably even:
DESIGN III:
manage_the_etl_pipeline.py
extract_transform_load.py
I referred to online literature and looked in GoF Design Patterns, Software Architecture (Martin), Clean Code (Martin), and Clean Code in Python.