Saturday, December 4, 2021

Python: multiple methods or abstract classes when dealing with a data-flow requirement

I have a design question that I am not sure how to handle. I have a script, preprocessing.py, that reads a .csv file with a text column, which I would like to preprocess by removing punctuation, certain characters, and so on.

What I have done so far is write a class with several methods, as follows:

import pandas as pd


class Preprocessing(object):
    def __init__(self, file):
        # Load the CSV once; the methods below modify self.my_data in place.
        self.my_data = pd.read_csv(file)

    def remove_punctuation(self):
        # regex=False treats the pattern as a literal string
        self.my_data['text'] = self.my_data['text'].str.replace('#', '', regex=False)

    def remove_hyphen(self):
        self.my_data['text'] = self.my_data['text'].str.replace('-', '', regex=False)

    def remove_words(self):
        self.my_data['text'] = self.my_data['text'].str.replace('reference', '', regex=False)

    def save_data(self):
        self.my_data.to_csv('my_data.csv')


def preprocessing(file_my):
    # The call order here defines the order of the transformations.
    f = Preprocessing(file_my)
    f.remove_punctuation()
    f.remove_hyphen()
    f.remove_words()
    f.save_data()
    return f


if __name__ == '__main__':
    preprocessing('/path/to/file.csv')

Although it works fine, I would like to be able to extend the code easily and have several smaller classes instead of one large class. So I decided to use an abstract class:

import pandas as pd
from abc import ABC, abstractmethod

my_data = pd.read_csv('/Users/kgz/Desktop/german_web_scraping/file.csv')

class Preprocessing(ABC):
    @abstractmethod
    def processor(self):
        pass


class RemovePunctuation(Preprocessing):
    def processor(self):
        # regex=False treats the pattern as a literal string
        return my_data['text'].str.replace('#', '', regex=False)


class RemoveHyphen(Preprocessing):
    def processor(self):
        return my_data['text'].str.replace('-', '', regex=False)


class Removewords(Preprocessing):
    def processor(self):
        return my_data['text'].str.replace('reference', '', regex=False)


# Each step reads the original my_data['text'], so the results are
# three independent Series rather than one chained transformation.
final_result = [cls().processor() for cls in Preprocessing.__subclasses__()]
print(final_result)

So now each class is responsible for one task, but there are a few issues I do not know how to handle since I am new to abstract classes. First, I am reading the file outside the classes, and I am not sure whether that is good practice. If not, should I pass the data as an argument to the processor method, or have another class that is responsible for reading the data?
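On the first point, one possible approach is to pass the data into processor instead of reading a module-level variable. The following is only a minimal sketch under that assumption: the type hints and the in-memory sample frame are made up for illustration, and the sample stands in for the result of pd.read_csv.

```python
from abc import ABC, abstractmethod

import pandas as pd


class Preprocessing(ABC):
    @abstractmethod
    def processor(self, text: pd.Series) -> pd.Series:
        """Return a transformed copy of the text column."""


class RemovePunctuation(Preprocessing):
    def processor(self, text: pd.Series) -> pd.Series:
        # regex=False treats '#' as a literal character
        return text.str.replace('#', '', regex=False)


# Hypothetical sample data standing in for pd.read_csv(...)
sample = pd.DataFrame({'text': ['#hello', 'wor#ld']})
cleaned = RemovePunctuation().processor(sample['text'])
print(cleaned.tolist())
```

With this shape the step no longer depends on where the data came from, so reading the file can live in the calling code or in a separate loader.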

Second, having one class with several methods allowed for a flow, so every transformation happened in order (i.e., first punctuation is removed, then hyphens are removed, etc.), but I do not know how to handle this order and dependency with abstract classes.
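On the ordering question, one option (an assumption, not the only design) is to keep the steps in an explicit list and feed each step's output into the next. The sketch below reuses the class names from the question; the run_pipeline helper, the processor(text) signature, and the sample string are hypothetical and chosen so that the order visibly matters.

```python
from abc import ABC, abstractmethod

import pandas as pd


class Preprocessing(ABC):
    @abstractmethod
    def processor(self, text: pd.Series) -> pd.Series:
        """Return a transformed copy of the text column."""


class RemovePunctuation(Preprocessing):
    def processor(self, text):
        return text.str.replace('#', '', regex=False)


class RemoveHyphen(Preprocessing):
    def processor(self, text):
        return text.str.replace('-', '', regex=False)


class Removewords(Preprocessing):
    def processor(self, text):
        return text.str.replace('reference', '', regex=False)


def run_pipeline(text, steps):
    # Hypothetical helper: apply the steps in list order, chaining the results.
    for step in steps:
        text = step.processor(text)
    return text


# Made-up sample: 'refer-ence' only becomes 'reference' once the hyphen is gone.
sample = pd.Series(['#refer-ence text'])

in_order = run_pipeline(sample, [RemovePunctuation(), RemoveHyphen(), Removewords()])
reversed_order = run_pipeline(sample, [Removewords(), RemoveHyphen(), RemovePunctuation()])

print(in_order.tolist())
print(reversed_order.tolist())
```

Because the list states the order explicitly, adding a step means writing one small class and placing it at the right position in the list, without relying on the order in which subclasses happen to be defined.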
