Sunday, December 8, 2019

OO Design Principles for document parsing

I have been tasked with creating a software application that will parse a set of 150 documents, extracting certain data elements (not HTML elements) such as tables, comments, invalid content, invalid characters, etc. Unfortunately, there appears to be very little conformity in how these documents are arranged or formatted. For example, a data table can look very different from one document to another, and some documents have no table at all.

I successfully parsed 7 of the most important documents. The original requirements only called for taking care of these. I ended up creating an abstract class that provided a default implementation for the common parsing that all documents need. I created 7 implementation classes that provided custom parsing for non-conforming scenarios. I knew this was not an ideal design approach but it worked splendidly. The customer was very pleased.

Unfortunately the customer was so pleased that now they want all 150 documents to be parsed! Needless to say I don't want to create 150 implementation classes. I then took out a spreadsheet and arrived at 7 different types of parsing I would need to do for a document. Two of the types would be required for ALL documents. Then there are combinations of the 5 remaining types of parsing that any document could need to be fully parsed.

Now I'm trying to come up with a sound design approach. My first instinct is to combine direct inheritance for the required parsing activities with composition via interfaces for the optional ones. For example, all documents will have a header and a set of invalid characters (too many spaces, multiple carriage returns, etc.) that will need to be removed. Some documents may have a data table that needs to be removed. Others will have technical notes that will need to be removed.
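To make the idea concrete, here is a minimal sketch of that split: the two required activities are always applied, and the optional ones are passed in per document. It uses plain functions as the "interface" rather than an inheritance hierarchy, which is one possible reading of the approach, not the only one. All function names and the matching rules inside them (header ends at the first blank line, tables are pipe-delimited, notes start with `NOTE:`) are assumptions made up for illustration.

```python
import re
from typing import Callable, Iterable

# A parsing step takes document text and returns the cleaned text.
Step = Callable[[str], str]


def strip_header(text: str) -> str:
    """Required step (hypothetical rule): drop everything up to the first blank line."""
    _, sep, body = text.partition("\n\n")
    return body if sep else text


def normalize_whitespace(text: str) -> str:
    """Required step: collapse runs of spaces/tabs and runs of blank lines."""
    text = re.sub(r"[ \t]{2,}", " ", text)
    return re.sub(r"\n{3,}", "\n\n", text)


def remove_tables(text: str) -> str:
    """Optional step (hypothetical rule): drop lines that look like pipe-delimited rows."""
    return "\n".join(
        line for line in text.split("\n") if not line.lstrip().startswith("|")
    )


def remove_tech_notes(text: str) -> str:
    """Optional step (hypothetical rule): drop lines that start with 'NOTE:'."""
    return "\n".join(
        line for line in text.split("\n") if not line.startswith("NOTE:")
    )


# Applied to every document, in this order.
REQUIRED_STEPS = (strip_header, normalize_whitespace)


class DocumentParser:
    """Runs the required steps first, then whatever optional steps were composed in."""

    def __init__(self, optional_steps: Iterable[Step] = ()):
        self.steps = REQUIRED_STEPS + tuple(optional_steps)

    def parse(self, text: str) -> str:
        for step in self.steps:
            text = step(text)
        return text


raw = "HEADER v1\n\nBody   text\n| a | b |\nNOTE: internal\nEnd"
parser = DocumentParser([remove_tables, remove_tech_notes])
cleaned = parser.parse(raw)
```

With this shape, a new combination of activities is a new list of steps rather than a new class, so 150 documents do not imply 150 subclasses.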

I've come up with 34 possible combinations of parsing activities. How did I come up with this? 2 + 2^5 = 34: the 2 required activities plus the 2^5 = 32 ways of choosing among the 5 optional ones. I have not finished analyzing all 150 documents, but it looks like roughly one-third to one-half of them will fall into three or four combinations of parsing activities.
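The count can be sanity-checked by enumeration. Each of the five optional activities is either applied or not, so there are 2^5 = 32 possible subsets (including the empty one); the figure of 34 adds the 2 required activities on top of that. The activity names below are placeholders: tables, technical notes, comments, and invalid content are mentioned in this post, while `footnotes` is purely a stand-in for the fifth type.

```python
from itertools import combinations

# Placeholder names for the five optional parsing activities.
OPTIONAL = ("tables", "tech_notes", "comments", "invalid_content", "footnotes")


def all_optional_combinations(options):
    """Return every subset of the optional activities, from none to all of them."""
    return [
        frozenset(chosen)
        for size in range(len(options) + 1)
        for chosen in combinations(options, size)
    ]


combos = all_optional_combinations(OPTIONAL)
print(len(combos))  # 32
```

Since the required activities run for every document, the number of distinct parser configurations to support is at most 32, and in practice far fewer if most documents share three or four combinations.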

I need to come up with a sound design that handles these seven parsing activities and will be flexible enough to deal with new one-off documents that rear their ugly heads from time to time.
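One way to leave room for one-off documents is to keep the document-to-combination mapping in data and to allow a per-document override. The sketch below is only an illustration under assumed names: the profile names, document IDs, the `one_off` decorator, and the form-feed cleanup in `parse_doc_099` are all hypothetical, and falling back to the `plain` profile for unknown documents is a design assumption rather than a requirement stated here.

```python
from typing import Callable, Dict, FrozenSet, Tuple, Union

# Named combinations of optional activities (hypothetical profiles).
PROFILES: Dict[str, FrozenSet[str]] = {
    "plain": frozenset(),
    "report": frozenset({"tables", "tech_notes"}),
}

# Which profile each document uses (hypothetical document IDs).
DOCUMENT_PROFILE: Dict[str, str] = {
    "doc-001": "plain",
    "doc-002": "report",
}

# Custom parsers for documents that fit no profile.
ONE_OFF: Dict[str, Callable[[str], str]] = {}


def one_off(doc_id: str):
    """Decorator that registers a custom parser for a single document."""
    def register(func: Callable[[str], str]) -> Callable[[str], str]:
        ONE_OFF[doc_id] = func
        return func
    return register


@one_off("doc-099")
def parse_doc_099(text: str) -> str:
    # Hypothetical quirk: this document contains stray form-feed characters.
    return text.replace("\x0c", "")


Plan = Tuple[str, Union[Callable[[str], str], FrozenSet[str]]]


def plan_for(doc_id: str) -> Plan:
    """Return ('custom', parser) for one-offs, else ('profile', optional activities)."""
    if doc_id in ONE_OFF:
        return ("custom", ONE_OFF[doc_id])
    profile = DOCUMENT_PROFILE.get(doc_id, "plain")  # assumed default
    return ("profile", PROFILES[profile])
```

Adding a document then means adding one entry to `DOCUMENT_PROFILE`, and a truly odd document gets its own small function without touching the shared steps.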
