Wednesday, May 27, 2015

SQL data model patterns for 'version control' of rows of OCR'd data

The program OCRs text forms. It saves the data one row per form. The end users consume the data through SQL. I want them to be able to query the contents of the document by just selecting one row of the table.

The OCR is nowhere near perfect. But there are a lot of small tricks I can do to improve bits of it post-scan. I think of them as 'grooming filters' on the data.

So I might OCR a form, which generates the initial row of imperfect data. Then later I realize I can run various operations on the row data to improve it. For example:

  1. removeSpuriousPunctuation(row)
  2. implyCorrectColumnValueFromOtherColumn(row)
  3. updateColumnWithExternalLookup(row), etc.

All those filters can improve the quality of columns of a given row.
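To make the idea of a grooming filter concrete, here is a minimal Python sketch of the first one. It assumes a row is represented as a plain dict and that a filter returns only the columns it wants to change, leaving the row itself untouched; the function name, the dict layout and the sample values are all hypothetical, not part of the actual program.

```python
import string


def remove_spurious_punctuation(row):
    """Hypothetical grooming filter: propose cleaned values for text columns.

    Returns a dict of {column: new_value} for the columns that would change.
    The input row is not modified, so the caller decides how to record
    and apply the proposed changes.
    """
    changes = {}
    for column, value in row.items():
        if isinstance(value, str):
            # Strip stray punctuation and spaces from both ends of the value.
            cleaned = value.strip(string.punctuation + " ")
            if cleaned != value:
                changes[column] = cleaned
    return changes


# Made-up sample row standing in for one OCR'd form.
row = {"name": ".John Smith,", "city": "Boston"}
changes = remove_spurious_punctuation(row)
```

Having each filter return its proposed changes, rather than editing the row in place, leaves room for recording what each filter did before anything is written.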

But I'd like to be able to track the changes caused by these grooming filters so that (1) I can look at a given row and understand which filters made it look the way it does, and (2) I can undo the changes made by a given filter if it's behaving badly.
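One possible shape for this, offered as a sketch rather than a canonical answer, is a change log table next to the main table: every filter writes the old and new value of each column it touches, which covers both the "what made this row look like this" question and the undo. The table names `forms` and `form_changes`, the columns `name` and `city`, and the helper functions are all assumptions made for illustration, and SQLite is used only because it ships with Python.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE forms (
    form_id INTEGER PRIMARY KEY,
    name    TEXT,
    city    TEXT
);
CREATE TABLE form_changes (
    change_id   INTEGER PRIMARY KEY AUTOINCREMENT,
    form_id     INTEGER NOT NULL REFERENCES forms(form_id),
    filter_name TEXT NOT NULL,
    column_name TEXT NOT NULL,
    old_value   TEXT,
    new_value   TEXT
);
""")

# Column names are interpolated into SQL below, so restrict them to a known set.
COLUMNS = {"name", "city"}


def apply_change(conn, form_id, filter_name, column, new_value):
    """Update one column and log the old and new values."""
    assert column in COLUMNS
    old_value = conn.execute(
        f"SELECT {column} FROM forms WHERE form_id = ?", (form_id,)
    ).fetchone()[0]
    conn.execute(
        "INSERT INTO form_changes "
        "(form_id, filter_name, column_name, old_value, new_value) "
        "VALUES (?, ?, ?, ?, ?)",
        (form_id, filter_name, column, old_value, new_value),
    )
    conn.execute(
        f"UPDATE forms SET {column} = ? WHERE form_id = ?", (new_value, form_id)
    )


def undo_filter(conn, filter_name):
    """Revert a filter's changes, newest first.

    A value is only restored if it still equals what the filter wrote,
    i.e. no later filter has changed it since (a simplifying assumption).
    """
    logged = conn.execute(
        "SELECT change_id, form_id, column_name, old_value, new_value "
        "FROM form_changes WHERE filter_name = ? ORDER BY change_id DESC",
        (filter_name,),
    ).fetchall()
    for change_id, form_id, column, old_value, new_value in logged:
        assert column in COLUMNS
        cur = conn.execute(
            f"UPDATE forms SET {column} = ? WHERE form_id = ? AND {column} IS ?",
            (old_value, form_id, new_value),
        )
        if cur.rowcount:
            conn.execute(
                "DELETE FROM form_changes WHERE change_id = ?", (change_id,)
            )


# Made-up sample data: one raw OCR row, then two filters applied to it.
conn.execute("INSERT INTO forms VALUES (1, '.John Smith,', 'Bostan')")
apply_change(conn, 1, "remove_spurious_punctuation", "name", "John Smith")
apply_change(conn, 1, "update_column_with_external_lookup", "city", "Boston")

history = conn.execute(
    "SELECT filter_name, column_name FROM form_changes "
    "WHERE form_id = 1 ORDER BY change_id"
).fetchall()

undo_filter(conn, "update_column_with_external_lookup")
current = conn.execute(
    "SELECT name, city FROM forms WHERE form_id = 1"
).fetchone()
```

The guard in `undo_filter` is deliberately conservative: if a later filter built on the value being undone, the row is left alone instead of guessing how to unwind the chain.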

Is there a canonical pattern for a model like this?

The additional twist, remember, is that I don't want the end user to have to deal with this more complicated model. They don't need to care about the intermediate versions of a row -- they just want to search a table, get a row, and know that row is the most-groomed version of the OCR'd document.
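One way to hide the extra machinery from end users, sketched here under assumptions, is to store every version of a row in an append-only table and expose only the newest version per form through a view. In this sketch the thing the users query, `forms`, is a view over a hypothetical `form_versions` table; the column names and sample data are made up, and SQLite stands in for whatever database is actually used.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Append-only history: one row per form per grooming step.
CREATE TABLE form_versions (
    form_id     INTEGER NOT NULL,
    version     INTEGER NOT NULL,
    filter_name TEXT,              -- NULL for the raw OCR output
    name        TEXT,
    city        TEXT,
    PRIMARY KEY (form_id, version)
);

-- What end users query: exactly one row per form, the latest version.
CREATE VIEW forms AS
SELECT v.form_id, v.name, v.city
FROM form_versions AS v
WHERE v.version = (
    SELECT MAX(version) FROM form_versions WHERE form_id = v.form_id
);
""")

# Made-up sample data: form 1 has been groomed twice, form 2 not at all.
conn.executemany(
    "INSERT INTO form_versions VALUES (?, ?, ?, ?, ?)",
    [
        (1, 0, None, ".John Smith,", "Bostan"),
        (1, 1, "remove_spurious_punctuation", "John Smith", "Bostan"),
        (1, 2, "update_column_with_external_lookup", "John Smith", "Boston"),
        (2, 0, None, "Jane Doe", "Denver"),
    ],
)

rows = conn.execute(
    "SELECT form_id, name, city FROM forms ORDER BY form_id"
).fetchall()
```

With this layout, the end user's query stays a plain `SELECT` against `forms`, while the full grooming history remains available in `form_versions` for anyone who needs it.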
