Wednesday, 5 January 2022

What is the best way to organise a log parser function?

I've been writing a log parser to extract information from some logs and then use it elsewhere. The idea is to run it over a series of log files and store the useful information in a database for future use. I'm using Python 3.8.

The types of information extracted from the logs are JSON-like strings (which I store in dictionaries), plain alphanumeric strings, timestamps (which I convert to datetime objects), integers and floats, sometimes as values inside those dictionaries.

I've made a parse_logs(filepath) function that takes a file path and returns a list of dictionaries containing all the messages. A message can consist of several of the above types, and to parse those logs I've written a number of helper functions that isolate each message from the log lines into a list of strings and then manipulate those lists of lines to extract the various kinds of information.

This has resulted in a main parse_logs(filepath: str) -> list function with multiple helper functions (like extract_datetime_from_header(header_line: str) -> datetime, extract_message(messages: list) -> list and process_message(message: list) -> dict) that each do one specific thing, but aren't useful to any other part of the project I'm working on, since they exist purely to support this function. The only additional thing I want to do (right now, at least) is take those messages and save their information in a database.
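To make that more concrete, here's a rough sketch of the functional layout I have in mind. The helper bodies are placeholders and the datetime format string is just an example, since the real parsing is specific to our log format:

```python
from datetime import datetime


def _extract_datetime_from_header(header_line: str) -> datetime:
    # Placeholder: the real format string depends on our header layout.
    return datetime.strptime(header_line[:19], "%Y-%m-%d %H:%M:%S")


def _extract_messages(lines: list) -> list:
    # Placeholder: group the raw log lines into per-message chunks (lists of strings).
    return []


def _process_message(message: list) -> dict:
    # Placeholder: turn one chunk of lines into a dict of parsed fields.
    return {}


def parse_logs(filepath: str) -> list:
    with open(filepath) as f:
        lines = f.readlines()
    return [_process_message(chunk) for chunk in _extract_messages(lines)]
```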

So, there are two main ways I'm thinking of organising my code. One is making a LogParser class that has the log path and a message list as attributes, and all of the functions as methods. (In that case, what should the indentation level of the helper functions be? Should they be their own methods, or should they just be functions defined inside the method they are supposed to support?) The other is keeping a base function (and nesting all helper functions inside it, as I assume I wouldn't want them imported as standalone functions) and just running that function with only the path as an argument; it would return the message list to a caller function that takes the list and moves each message into its place in the database.

Another thing I'm considering is whether to use dataclasses instead of dictionaries for the data (rough sketch below). The speed difference won't matter much, since this is a script that will run just a few times a day as a cron job, and it won't matter much whether it takes 5 seconds or 20 (unless the difference is far greater; I've only tested it on log samples of about half a MB, whereas the expected logs are 4-6 GB).

My final concern is keeping the message objects in memory and feeding them directly to the database writer. From a bit of testing and estimating, 150 MB seems like a reasonable ceiling for a worst-case scenario (a log full of only useful data that's 40% larger than the largest log we currently have), so even if we scale to three times that amount, I think a machine with 16 GB of RAM should handle it without trouble.
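To illustrate the class option together with the dataclass idea, here's a rough sketch of what I mean. The Message fields and the placeholder method bodies are just examples, not our real schema:

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Message:
    # Example fields only; the real messages carry more/other attributes.
    timestamp: datetime
    text: str
    payload: dict = field(default_factory=dict)


class LogParser:
    def __init__(self, filepath: str):
        self.filepath = filepath
        self.messages: list = []

    def parse(self) -> list:
        with open(self.filepath) as f:
            lines = f.readlines()
        self.messages = [self._process_message(chunk)
                         for chunk in self._extract_messages(lines)]
        return self.messages

    def _extract_messages(self, lines: list) -> list:
        # Placeholder: group lines into per-message chunks.
        return []

    def _process_message(self, message: list) -> Message:
        # Placeholder: build a Message from one chunk of lines.
        return Message(timestamp=datetime.now(), text="", payload={})
```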

So, with all that said, I'd like to ask about best practices for organising the code, namely:

  1. Is the class/OOP approach better practice than the functional one? Is it more readable/maintainable?
  2. Should I use dataclasses or stick to dictionaries? What are the advantages and disadvantages of each? Which is more maintainable and which is more efficient?
  3. If I mostly care about handling the data from the database rather than from these objects (dicts or dataclasses), which is the more efficient way to go?
  4. Is it alright to keep the message objects in memory until the database transaction is complete, or should I handle it differently? I've thought of a few options: doing a single transaction after I finish parsing a log (but I was told this could scale badly, since the temporary list of messages keeps growing in memory until it's used in the transaction, and a single large transaction can itself be slow); writing every message (as a dictionary) to an intermediate file on disk as it's parsed, then passing that file to the function that handles the database transactions in batches (I was told that's not good practice either); or writing directly to the database while parsing, either after every message or in small batches so the message list never grows too large (see the sketch after this list). I've even thought of going the producer/consumer route, with a shared queue that the producer (log parser) appends to while the consumer (database writer) drains, until the log is fully parsed. But that's not something I've done before (except a few times for interview questions, where it was fairly simplistic and still felt hard to debug and maintain), so I don't feel confident doing it right now. What are the best practices regarding the above?
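For the batching option in point 4, this is roughly what I picture. I'm using sqlite3 only as a stand-in for whatever database we end up with, the table and column names are invented for the example, and the message dicts are assumed to already hold DB-ready values (e.g. the JSON parts serialised to strings):

```python
import sqlite3
from itertools import islice


def write_in_batches(messages, db_path: str, batch_size: int = 1000) -> None:
    # `messages` can be a full list or a lazy generator of parsed message dicts;
    # either way only one batch is materialised per executemany() call.
    conn = sqlite3.connect(db_path)
    try:
        it = iter(messages)
        while True:
            batch = list(islice(it, batch_size))
            if not batch:
                break
            conn.executemany(
                "INSERT INTO messages (timestamp, payload) "
                "VALUES (:timestamp, :payload)",
                batch,
            )
            conn.commit()  # one transaction per batch, not one per log file
    finally:
        conn.close()
```

The part I like about this is that the same writer works whether I pass it the full in-memory list or a generator that parses messages lazily, so the memory decision stays separate from the database code.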

Thank you very much for your time! I know I've asked quite a lot, but I wanted to write down all the thoughts I had and hear some other people's opinions on them. In the meantime I'm going to try implementing all of the above ideas (except perhaps the producer/consumer one) and see which feels most maintainable, readable and intuitively correct to me.
