Thursday, December 26, 2019

How to combine and maintain multiple applications that do the same thing using sound engineering principles?

I currently have around 15 automation applications I have inherited that are all doing similar things, but they were written by different people. Almost all of the code has been copy-pasted over a three-year period: whenever a new automation was needed, whoever was responsible for making it would copy the most recent project and modify it. I also have new projects coming up that do similar things, and I would like to avoid repeating what has happened in the past.

I'm trying to consolidate the duplicated functionality and am looking for ideas about the best way to do this so that these projects are easier to maintain and debug. First, I will describe what these automation programs do. They are all C# console applications run by the task scheduler at set times of day on a couple of machines, although they are currently all being moved to the same machine. Ideally, I think the design should assume they will not be on the same machine. They can loop and wait, but all of them are supposed to complete their execution by the end of the day (23:59:59 local time).

They all retrieve files from a source and copy them to a working directory. This working directory is usually on the local machine, but not always. Some transfer the source files to the working directory via FTP, some via a UNC path, and probably other ways as well. The retrieved source files come in different formats, sometimes JSON, XML, or CSV. Sometimes pre-processing needs to be done, because the source files are malformed XML, CSV files with bad records, and so on; I am responsible for fixing or removing bad entries. From here, depending on the type of data in these files, we send the data off to different external services for more processing, sometimes combined with our own internal data. For instance, the source file might contain addresses, but this specific automation will also need information from our internal databases along with files located on an image server.
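
Since the transport (FTP, UNC path, etc.) is the only thing that really varies in this retrieval step, it feels like a natural seam for a shared abstraction. Here is a minimal sketch of what I have in mind; the names (IFileSource, UncFileSource, FtpFileSource) are hypothetical, and the FTP variant assumes the classic .NET Framework WebClient support for ftp:// URIs:

```csharp
using System;
using System.IO;
using System.Net;

// Hypothetical abstraction over "retrieve source files to a working directory".
public interface IFileSource
{
    // Copies the remote file into workingDirectory and returns the local path.
    string FetchTo(string workingDirectory);
}

public sealed class UncFileSource : IFileSource
{
    private readonly string _uncPath;
    public UncFileSource(string uncPath) => _uncPath = uncPath;

    public string FetchTo(string workingDirectory)
    {
        string local = Path.Combine(workingDirectory, Path.GetFileName(_uncPath));
        File.Copy(_uncPath, local, overwrite: true);
        return local;
    }
}

public sealed class FtpFileSource : IFileSource
{
    private readonly Uri _ftpUri;                 // e.g. ftp://host/path/file.csv
    private readonly NetworkCredential _credential;

    public FtpFileSource(Uri ftpUri, NetworkCredential credential)
    {
        _ftpUri = ftpUri;
        _credential = credential;
    }

    public string FetchTo(string workingDirectory)
    {
        string local = Path.Combine(workingDirectory, Path.GetFileName(_ftpUri.LocalPath));
        using (var client = new WebClient { Credentials = _credential })
        {
            client.DownloadFile(_ftpUri, local);  // WebClient accepts ftp:// URIs
        }
        return local;
    }
}
```

Each automation would then depend only on IFileSource, and the concrete transport could come from configuration.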

This is the most varied part of each of the applications, but it is limited to about six different places the data can go (i.e., six different outgoing formats). Each destination has a different format the data we compile needs to be in, but each current automation application has its own way of handling this. After we send the formatted data out, we sometimes get it back and send it on to another destination.

This can get complicated. Let's say we have initial data D0 that has been preprocessed (bad entries removed, combined with internal data), and two external file processes, A and B. We format D0 for A; call the result D1. D1 is sent to A and we get back D2 in a different format. D2 needs to be in a new format to be processed by B, so we create D3 for B, which gives us back D4. D4 then usually goes through some finalization specific to that automation.
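
To make that concrete, each automation could be expressed as an ordered pipeline of steps, so the D0 -> D4 chain above is just configuration rather than copy-pasted code. This is only a sketch; Dataset and the step classes in the usage comment are hypothetical names:

```csharp
using System;
using System.Collections.Generic;

// A dataset as it moves through the chain: where the current file lives
// and which format it is currently in (e.g. "M", "Ai", "Ao").
public sealed class Dataset
{
    public string Path { get; set; }
    public string Format { get; set; }
}

public interface IPipelineStep
{
    string Name { get; }
    Dataset Run(Dataset input);
}

public sealed class Pipeline
{
    private readonly List<IPipelineStep> _steps = new List<IPipelineStep>();

    public Pipeline Then(IPipelineStep step) { _steps.Add(step); return this; }

    public Dataset Run(Dataset d0)
    {
        var current = d0;
        foreach (var step in _steps)
        {
            Console.WriteLine($"Running step: {step.Name}");
            current = step.Run(current);   // produces D1, D2, ... in turn
        }
        return current;
    }
}

// Usage (FormatForA, SendToA, etc. are hypothetical step classes):
// var d4 = new Pipeline()
//     .Then(new FormatForA())   // D0 -> D1
//     .Then(new SendToA())      // D1 -> D2
//     .Then(new FormatForB())   // D2 -> D3
//     .Then(new SendToB())      // D3 -> D4
//     .Run(d0);
```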

After the data has been fully processed, we move the files to a final location, update the necessary values in databases, generate reports, and send emails to the people who need them. Along the way, errors are tracked and written to logs, and error emails are sent out when a critical part of the process fails.
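
The logging and error-email behavior sounds like yet another piece every app currently duplicates, so it could live in one shared wrapper. A rough sketch, where logError and sendErrorEmail stand in for whatever logging and email code the apps already contain:

```csharp
using System;

// Shared error-handling wrapper so every automation logs and notifies
// the same way when a step fails.
public static class Guarded
{
    public static T Run<T>(string stepName, Func<T> step,
                           Action<string> logError, Action<string> sendErrorEmail)
    {
        try
        {
            return step();
        }
        catch (Exception ex)
        {
            var message = $"Step '{stepName}' failed: {ex.Message}";
            logError(message);
            sendErrorEmail(message);   // only critical steps would notify
            throw;                     // surface a failure to the scheduler
        }
    }
}
```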

My intuition tells me a shared, versioned library is the way to go for most of this, so that when we make a change or a fix, we can roll it out to every app by updating the configs. My colleagues are resistant to this and want to go the web-service route, so I'd like any points and counterpoints to bring up in discussions.

The next part I would like suggestions on is the different data formats we use in formatting. My initial thought is to make a large class that contains all the data from our internal systems and the data we collect from sources, and perhaps even the data that can come back from external processing. Call our monolithic class format M, and our external processing input and output format classes Ai, Ao, Bi, Bo, and so on. I'd then need an M->Ai and an Ao->M conversion for every external process. Sometimes, though, we need to do M->Ai and Ao->M before we can do M->Bi, because process B relies on data from A's processing. I'm not sure whether it's better to do M->Ai->Ao->Bi->Bo->M, or something like M->Ai->Ao->M->Bi->Bo->M. Converting every external process format directly to every other format seems impractical, because the number of pairwise converters grows quadratically with the number of external sources. I'm leaning towards some system that tracks which external processes have been applied to the monolithic data in M.
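
To illustrate the hub idea: every external format converts to and from M only, so each new external process costs two converters (M->Xi and Xo->M) instead of a converter to every other format, and M itself records which processes have already been applied so that prerequisites like "B requires A" can be checked. The class and method names below follow the notation above but are otherwise hypothetical:

```csharp
using System;
using System.Collections.Generic;

// The monolithic format M: internal data, source data, external results,
// plus a record of which external processes have already run against it.
public sealed class M
{
    public List<string> AppliedProcesses { get; } = new List<string>();  // e.g. "A", "B"
    // ...internal data, source data, and results from external processing...
}

public interface IExternalProcess<TIn, TOut>
{
    string Name { get; }
    TIn FromM(M data);                  // M -> Ai
    TOut Execute(TIn input);            // send out, get the processed file back
    M MergeIntoM(M data, TOut result);  // Ao -> M
}

public static class ProcessRunner
{
    public static M Run<TIn, TOut>(M data, IExternalProcess<TIn, TOut> process,
                                   IEnumerable<string> prerequisites)
    {
        foreach (var p in prerequisites)
            if (!data.AppliedProcesses.Contains(p))
                throw new InvalidOperationException(
                    $"Process '{process.Name}' requires '{p}' to run first.");

        var input = process.FromM(data);              // M -> Ai
        var output = process.Execute(input);          // Ai -> Ao (external)
        var merged = process.MergeIntoM(data, output); // Ao -> M
        merged.AppliedProcesses.Add(process.Name);
        return merged;
    }
}
```

A call like ProcessRunner.Run(data, processB, new[] { "A" }) would then refuse to run B before A has been applied, which is the kind of tracking I'm leaning towards.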

Thank you; I'm looking forward to hearing any suggestions.
