The program OCRs text forms and saves the data as one row per form. End users consume the data through SQL: I want them to be able to query the contents of a document by just selecting one row from the table.
The OCR is nowhere near perfect, but there are a lot of small tricks I can apply post-scan to improve bits of it. I think of them as 'grooming filters' on the data.
So I might OCR a form, which generates the initial row of imperfect data. Then later I realize I can run various operations on the row data to improve it. So maybe:
- removeSpuriousPunctuation(row)
- implyCorrectColumnValueFromOtherColumn(row)
- updateColumnWithExternalLookup(row), etc.
All of those filters can improve the quality of individual columns in a given row.
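Concretely, here's a rough Python sketch of what a couple of those filters might look like. The column names ('name', 'state', 'zip') and the lookup table are made up for illustration; the real filters would depend on the actual forms.

```python
import re

# Stand-in for a real external lookup (a DB table, an API, etc.).
ZIP_TO_STATE = {"94103": "CA", "10001": "NY"}

def remove_spurious_punctuation(row: dict) -> dict:
    """Strip runs of stray punctuation that OCR tends to hallucinate."""
    groomed = dict(row)
    groomed["name"] = re.sub(r"[.,;:]{2,}", "", row.get("name", ""))
    return groomed

def imply_state_from_zip(row: dict) -> dict:
    """If 'state' came out blank but 'zip' did not, fill it in from the lookup."""
    groomed = dict(row)
    if not groomed.get("state") and groomed.get("zip"):
        groomed["state"] = ZIP_TO_STATE.get(groomed["zip"][:5], "")
    return groomed

# Filters compose: groomed = imply_state_from_zip(remove_spurious_punctuation(raw_row))
```

Each filter takes the row (here just a dict of column name to value) and returns an improved copy, which makes them easy to chain and, later, easy to log.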
But I'd like to be able to track the changes caused by these grooming filters, so that (1) I can look at a given row and understand which filters made it look the way it does, and (2) I can undo the changes of a given filter if it turns out to be behaving badly.
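I don't know if there's a named pattern for this, but mechanically I imagine something like the sketch below: wrap each filter so it records the columns it actually changed, and treat 'undo' as replaying every filter except the misbehaving one against the raw OCR row (since filters can feed off each other's output, simply reverting old values seems riskier). Change, apply_filter, and rebuild_without are invented names.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Change:
    """One column-level edit made by one filter run."""
    row_id: str
    filter_name: str
    column: str
    old_value: Any
    new_value: Any

def apply_filter(row: dict, row_id: str,
                 filter_fn: Callable[[dict], dict], log: list[Change]) -> dict:
    """Run one grooming filter and log every column it actually changed."""
    groomed = filter_fn(row)
    for column in row.keys() | groomed.keys():
        if row.get(column) != groomed.get(column):
            log.append(Change(row_id, filter_fn.__name__, column,
                              row.get(column), groomed.get(column)))
    return groomed

def rebuild_without(raw_row: dict, row_id: str, bad_filter: str,
                    filters: list[Callable[[dict], dict]]) -> tuple[dict, list[Change]]:
    """'Undo' a misbehaving filter by replaying every other filter from the raw OCR row."""
    log: list[Change] = []
    row = dict(raw_row)
    for f in filters:
        if f.__name__ != bad_filter:
            row = apply_filter(row, row_id, f, log)
    return row, log
```

The log answers question (1) directly -- for any row I can list which filters touched which columns -- and rebuild_without answers (2), as long as I keep the raw OCR row around.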
Is there a canonical pattern for a model like this?
The additional twist, remember, is that I don't want the end users to have to deal with this more complicated model. They don't need to care about the intermediate versions of a row -- they just want to search a table, get a row, and know that row is the most-groomed version of the OCR'd document.
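One way I could square that with the tracking above (again just a sketch, with invented table and column names) is to keep the groomed values in the one table end users query, and tuck the raw OCR output and the change log into side tables that only the grooming pipeline ever touches:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The only table end users ever query: one row per form, always the most-groomed values.
conn.execute("CREATE TABLE documents (row_id TEXT PRIMARY KEY, name TEXT, state TEXT, zip TEXT)")

# Pipeline-only tables: the untouched OCR output and the per-filter change log.
conn.execute("CREATE TABLE documents_raw (row_id TEXT PRIMARY KEY, name TEXT, state TEXT, zip TEXT)")
conn.execute("""CREATE TABLE document_changes (
                    row_id TEXT, filter_name TEXT, column_name TEXT,
                    old_value TEXT, new_value TEXT,
                    applied_at TEXT DEFAULT CURRENT_TIMESTAMP)""")

def publish(row_id: str, groomed: dict, log: list) -> None:
    """Overwrite the user-facing row with the groomed values and append this run's changes."""
    conn.execute("INSERT OR REPLACE INTO documents VALUES (?, ?, ?, ?)",
                 (row_id, groomed.get("name"), groomed.get("state"), groomed.get("zip")))
    conn.executemany(
        "INSERT INTO document_changes (row_id, filter_name, column_name, old_value, new_value) "
        "VALUES (?, ?, ?, ?, ?)",
        [(c.row_id, c.filter_name, c.column, c.old_value, c.new_value) for c in log])
    conn.commit()
```

End users keep doing SELECT * FROM documents WHERE ... and never need to know that documents_raw or document_changes exist. But if there's a more canonical way to structure this, that's exactly what I'm asking about.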