Friday, 24 June 2022

Best design for a generic reader [closed]

I'm trying to implement a text reader that is agnostic to the document type (Word, PDF, etc.).

Each reader has:

  • Common functionality and attributes: for example, we want to extract the whole text of the input file and store it in a "fulltext" attribute. So I created a Document class to hold the common attributes, and each reader needs a function that reads the whole text.
  • Attributes specific to the document type, which I need for downstream applications (NLP, for example). With Word, the win32 API can extract text at paragraph granularity, because Word lets the author delimit portions of text as paragraphs. That way we keep the structure the author intended. (Recovering paragraphs from a PDF is not easy, since paragraph spacing differs from one PDF to another; it is much harder than sentence splitting, which is easy with a regex.) This is really useful for my later NLP processing, and since only the win32 API can give it to me, I need to handle it inside the read_doc function. (I can imagine other format-specific things: for example, read_pptx extracting text at slide granularity.)

I feel that my current API and implementation choices are not good, and I am a bit lost on how to design all this. What would you suggest?

I was thinking:

  • removing the "document" parameter from read_pdf, read_img, etc.; passing it in as an argument feels unnatural and kind of ugly
  • perhaps turning read_doc, read_ppt, etc. into classes rather than functions, so that each can offer several methods (extracting the whole text, but also other functionality); they would inherit from a generic Reader class (a kind of abstract class). This Reader class would have at minimum a document attribute and an extract_fulltext method.

Here is my current code:

import logging
import os
import subprocess
import traceback
from tempfile import TemporaryDirectory

import docx
import fitz  # PyMuPDF
import win32com.client as win32

logger = logging.getLogger(__name__)

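class Document:
    """Attributes shared by every document type."""

    def __init__(self, filepath):
        self.filepath = filepath
        self.filename = os.path.basename(filepath)
        self.readable = None
        self.fulltext = ""
        self.doctype = None
        # splitext yields '' (not the whole path) when there is no extension.
        self.extension = os.path.splitext(filepath)[1].lower().lstrip('.')

    def to_dict(self, drop=None):
        # Return a copy, minus the keys in `drop`, so callers cannot
        # mutate the instance through the returned dict.
        drop = set(drop or ())
        return {k: v for k, v in vars(self).items() if k not in drop}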

def read_pdf(filepath, document):
    """Fill `document` with the full text and the per-page text of a PDF."""
    try:
        doc = fitz.open(filepath)
    except Exception as e:
        logger.error("Could not read PDF.|{}|{}".format(e, filepath))
        raise
    # PDF-specific granularity: text keyed by zero-based page index.
    pagetext = {}
    for i, page in enumerate(doc):
        pagetext[i] = page.get_text()
    document.fulltext = '\n'.join(pagetext.values())
    document.pagetext = pagetext
    return document

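# Placeholders: each reader should fill in `document` and return it.
def read_pptx(filepath, document):
    return document

def read_ppt(filepath, document):
    return document

def read_docx(filepath, document):
    return document

def read_doc(filepath, document):
    return document

def read_img(filepath, document):
    return document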


def is_textdoc(text):
    """
    Based on the volume of text (nchar), sort a doc into two classes: scanned doc (image) or text doc
    :param text: text extracted from the file
    :return: bool, True for a textual document, False for a scanned image
    """
    nchar = 20
    return len(text) > nchar

def build_doc(filepath):
    document = Document(filepath)
    # Map each supported extension to its reader function.
    app_ref = {'pdf': read_pdf, 'doc': read_doc, 'docx': read_docx,
               'pptx': read_pptx, 'ppt': read_ppt}
    try:
        print("Reading: {}".format(filepath))
        # Readers fill in `document` in place, so the return value is not needed.
        app_ref[document.extension](filepath, document)
        document.readable = True
    except Exception as e:
        # An unsupported extension (KeyError) also ends up here.
        logger.error("Format not read.|{}|{}".format(e, filepath))
        traceback.print_exc()
        document.readable = False
    if is_textdoc(document.fulltext):
        document.doctype = 'text'
    else:
        # Too little text: assume a scanned document and try the image reader.
        read_img(filepath, document)
        if is_textdoc(document.fulltext):
            document.doctype = 'img'
    return document
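
The class-based idea from the second bullet of my list could look roughly like this. It is only a minimal sketch, not a settled design. The names `TxtReader`, `extract_units`, `units`, and `get_reader` are hypothetical, and `Document` is a simplified version of the class in my code. A plain-text reader stands in for the real PDF/Word/PowerPoint readers so that the example runs without third-party libraries. Each subclass would return its own granularity (page, paragraph, slide), and the base class would build the full text from it.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class Document:
    """Attributes shared by every document type (simplified for the sketch)."""
    filepath: str
    fulltext: str = ""
    # Format-specific chunks keyed by granularity name ("page", "paragraph", ...).
    units: dict = field(default_factory=dict)


class Reader(ABC):
    """Hypothetical base class: one subclass per file format."""

    extensions = ()  # lowercase extensions handled by the subclass

    def __init__(self, filepath):
        self.document = Document(str(filepath))

    @abstractmethod
    def extract_units(self):
        """Return (granularity_name, list_of_text_chunks) for this format."""

    def extract_fulltext(self):
        # Shared logic: store the format-specific chunks, then join them.
        name, chunks = self.extract_units()
        self.document.units[name] = chunks
        self.document.fulltext = "\n".join(chunks)
        return self.document


class TxtReader(Reader):
    """Stand-in reader: blank lines delimit paragraphs in a plain-text file."""

    extensions = ("txt",)

    def extract_units(self):
        text = Path(self.document.filepath).read_text(encoding="utf-8")
        return "paragraph", [p.strip() for p in text.split("\n\n") if p.strip()]


def get_reader(filepath, readers=(TxtReader,)):
    """Pick the first reader class that declares the file's extension."""
    ext = Path(filepath).suffix.lower().lstrip(".")
    for cls in readers:
        if ext in cls.extensions:
            return cls(filepath)
    raise ValueError(f"No reader registered for extension: {ext!r}")
```

With this layout the document is no longer passed into the read functions: `get_reader(path).extract_fulltext()` returns the filled `Document`, and a PDF or Word reader would only need to implement `extract_units`.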
