Wednesday, 5 January 2022

What is the best way to organise a log parser function?

I've been writing a log parser to extract some information from logs and use it elsewhere. The idea is to run it over a series of log files and store the useful information in a database for future use. The language I'm using is Python (3.8).

The types of information extracted from the logs are JSON-type strings (which I store in dictionaries), normal alphanumeric strings, timestamps (which we convert to datetime objects), integers and floats, sometimes as values in a dictionary-type format.

I've made a parse_logs(filepath) function that takes a filepath and returns a list of dictionaries with all the messages in them. A message can consist of several of the above types. To parse the logs, I've written a number of functions that isolate the messages from the log lines into lists of strings and then manipulate the lines that make up a message to extract various kinds of information.

This has resulted in a main parse_logs(filepath: str) -> list function with multiple helper functions (like extract_datetime_from_header(header_line: str) -> datetime, extract_message(messages: list) -> list and process_message(message: list) -> dict) that each do a specific thing, but are not useful to any other part of the project I'm working on, as they exist only to support this function. The only additional thing I wish to do (right now, at least) is take those messages and save their information in a database.

So, there are two main ways I'm thinking of organising my code. One is making a LogParser class that has the path to the log and a message list as attributes, and all of the functions as methods. (In that case, at what level should the helper functions be defined? Should they be their own methods, or should they just be functions defined inside the method they are supposed to support?) The other is just having a base function (nesting all helper functions inside it, as I assume I wouldn't want them imported as standalone functions) and running it with only the path as an argument; it would return the message list to a caller function that takes the list, parses it and moves each message to its place in the database.

Another thing I'm considering is whether to use dataclasses instead of dictionaries for the data. The speed difference won't matter much, since it's a script that's going to run just a few times a day as a cron job, and it won't matter much whether it takes 5 seconds or 20 to run (unless the difference is far larger; I've only tested it on example logs of half a MB instead of the expected 4-6 GB ones).

My final concern is keeping the message objects in memory and feeding them directly to the database writer. I've done a bit of testing and estimating, and 150 MB seems like a reasonable ceiling for a worst-case scenario (that is, a log full of only useful data that's 40% larger than the current largest log we have; so even if we scale to 3 times that amount, I think a 16 GB RAM machine should handle it without any trouble).

So, with all that said, I'd like to ask for best practices on how to organise the code, namely:

  1. Is the class/OOP way a better practice than the functional way? Is it more readable/maintainable?
  2. Should I use dataclasses or stick to dictionaries? What are the advantages/disadvantages of both? Which is more maintainable and which is more efficient?
  3. If I care about handling data from the database and not from these objects (dicts or dataclasses), which is the more efficient way to go?
  4. Is it alright to keep the message objects in memory until the database transaction is complete, or should I handle it differently? I've considered several options: (a) a single transaction after I finish parsing a log, though I was told this scales badly, since the temporary list of messages keeps growing in memory until the transaction runs, and that a single large transaction could itself be slow; (b) writing every message, as it's parsed (as a dictionary object), to an intermediary (is that the correct word?) file on disk and then passing that file to the function that handles the db transactions in batches, though I was told that's not good practice either; (c) writing directly to the db while parsing, either after every message or in small batches so that the message list never grows too large. I've even thought of going the producer/consumer route, with a shared variable that the producer (log parser) appends to while the consumer (database writer) consumes from it, until the log is fully parsed. But I haven't done that before (except a few times for interview questions, which were rather simplistic, and it felt hard to debug or maintain), so I don't feel confident doing it right now. What are the best practices regarding the above?
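To make the "write in small batches while parsing" option concrete, here is a minimal sketch, not a recommendation. It assumes a simplified log format of one message per line (`<ISO timestamp> <text>`), a two-field `Message` dataclass and SQLite as the database; `Message`, `parse_lines`, `batched` and `store` are hypothetical names invented for this illustration. The parser is a generator, so only one batch of messages is held in memory at a time.

```python
import sqlite3
from dataclasses import dataclass
from datetime import datetime
from itertools import islice
from typing import Iterable, Iterator, List


@dataclass
class Message:
    # Hypothetical fields; real messages would carry more kinds of data.
    timestamp: datetime
    text: str


def parse_lines(lines: Iterable[str]) -> Iterator[Message]:
    """Yield one Message per line of the assumed form '<ISO timestamp> <text>'."""
    for line in lines:
        stamp, _, text = line.strip().partition(" ")
        if not text:
            continue  # skip lines that have no message body
        try:
            yield Message(datetime.fromisoformat(stamp), text)
        except ValueError:
            continue  # skip lines whose first field is not a timestamp


def batched(items: Iterable[Message], size: int) -> Iterator[List[Message]]:
    """Group an iterable into lists of at most `size` items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def store(messages: Iterable[Message], connection: sqlite3.Connection,
          batch_size: int = 1000) -> int:
    """Insert messages batch by batch; each batch is its own transaction."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS messages (timestamp TEXT, text TEXT)"
    )
    written = 0
    for batch in batched(messages, batch_size):
        with connection:  # commits on success, rolls back on an exception
            connection.executemany(
                "INSERT INTO messages VALUES (?, ?)",
                [(m.timestamp.isoformat(), m.text) for m in batch],
            )
        written += len(batch)
    return written
```

With a real file this could be called as `store(parse_lines(open(path)), connection)`. Because the helpers are plain module-level functions, they can be tested individually, which is one argument for not nesting them inside the main function.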

Thank you very much for your time! I know it's a bit of a lot that I've asked, but I did feel like writing down all of the thoughts that I had and read some people's opinions on them. Till then I'm gonna try to do an implementation for all of the above ideas (except perhaps the producer/consumer) and see which feels more maintainable, human readable and intuitively correct to me.

Should/Can we use names from dependency interface in our wrapping interface?

Let us assume I'm making a wrapper for a third-party library, e.g. OpenSSL. The wrapper allows client code to establish a TLS session without depending directly on OpenSSL.

So basically, the library provides an abstraction over the OpenSSL TLS implementation (from the client code's point of view). To configure OpenSSL, its interface offers a set of defines (e.g. for cipher suites, TLS versions, etc.). The configuration is done by the client code; my library just needs to "forward" it to OpenSSL. So the question is how to construct the wrapping library's interface in a way that keeps the abstraction (does not use names from the OpenSSL library directly) without defining all the required values again (or creating new values and mapping them to OpenSSL values).

E.g., if OpenSSL has #define TLS_1_1 32, should my library make the same define, keeping the interface header file clean of any OpenSSL names: #define MY_LIBRARY_TLS_1_1 32? Or should I include the openssl.h file and use the OpenSSL name: #define MY_LIBRARY_TLS_1_1 TLS_1_1?

The second option makes maintenance a bit simpler (if the OpenSSL values change, my interface needs no modification), but it creates a direct dependency from my interface on the OpenSSL interface and forces me to ship the OpenSSL header with my library.

What is the right way to do this?

Tuesday, 4 January 2022

How to find specific pattern in a paragraph in Python?

I want to find a specific pattern in a paragraph. Each match must contain both letters (a-zA-Z) and digits (0-9), and its length must be 5 or more. How can I implement this in Python?

My code is:

import re

text = "I love5 verye mu765ch"
print(re.findall('(?=.*[0-9])(?=.*[a-zA-Z]{5,})', text))

This does not return the words I expect.

The expected result is:

love5
mu765ch

Valid matches look like:

9aacbe
aver23893dk
asdf897
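One possible approach, sketched under the assumption that a "word" is a run of ASCII letters and digits delimited by word boundaries: use two lookaheads confined to the current word (one requiring a digit, one requiring a letter), followed by a length check of five or more characters. The function name `find_mixed_tokens` is made up for this example.

```python
import re

# A run of letters/digits that contains at least one digit and at least
# one letter, and is five or more characters long.
MIXED_TOKEN = re.compile(
    r"\b(?=[a-zA-Z0-9]*[0-9])(?=[a-zA-Z0-9]*[a-zA-Z])[a-zA-Z0-9]{5,}\b"
)


def find_mixed_tokens(text: str) -> list:
    """Return every alphanumeric word that mixes letters and digits."""
    return MIXED_TOKEN.findall(text)


print(find_mixed_tokens("I love5 verye mu765ch"))  # ['love5', 'mu765ch']
```

Unlike `.*`, the lookaheads here use `[a-zA-Z0-9]*`, so they cannot look past the end of the current word, and the final `[a-zA-Z0-9]{5,}` actually consumes the word so that `findall` returns it.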

Java Class with a lot of fields [closed]

I need to design a class with a lot of fields to receive some optional and some mandatory data. For the mandatory fields I use the Fluent or Step Builder design pattern (as I understand it, they are the same). But for the optional fields I also need to design some structure and group them, so that the class is convenient to use. What would be the best options?

For example, there is a Data class with many optional fields.

  @Accessors(chain = true)
  @Setter
  public class Data{
    String A;
    String B;
    String C;
    String D;
    String E;
    String F;
  }

I could split the fields logically into Group1 and Group2, and then either use setters, as in the class Data1, or final fields, as in the class Data2.

  @Accessors(chain = true)
  @Setter
  public class Data1{
    Group1 group1;
    Group2 group2;
  }

  public class Data2{
    public final Group1 group1=new Group1();
    public final Group2 group2=new Group2();
  }

  @Accessors(chain = true)
  @Setter
  public class Group1{
    String A;
    String B;
    String C;
  }

  @Accessors(chain = true)
  @Setter
  public class Group2{
    String D;
    String E;
    String F;
  }

Another question is what the best naming for a setter would be: setA(…), withA(…) or just a(…)?

In which cases is the observer pattern superior to the PubSub pattern?

At the moment I'm learning about important OOP design patterns and am currently studying the differences between the Observer and the PubSub pattern and how to implement them in Python. For that reason, I've been asked to implement a simple notification service for a smart home with the help of the Observer pattern. The notification service should send notification messages when the house is empty and, e.g., the light is still turned on.

Now, in a subsequent task, I'm asked to argue why the Observer pattern is superior to the PubSub pattern in the above example. At the moment I am a little unsure how to answer this question, because I would have considered the PubSub pattern more suitable for the above example. Nevertheless, I've come up with an argument for why the Observer pattern might be superior to the PubSub pattern.
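For reference, the Observer variant of this task could look roughly like the following minimal sketch. The class names `SmartHome` and `NotificationService` and the two state flags are assumptions made for illustration; the actual assignment may define them differently. The subject holds direct references to its observers and calls their `update` method itself, with no broker in between.

```python
class SmartHome:
    """Subject: keeps the house state and notifies attached observers."""

    def __init__(self):
        self._observers = []
        self.occupied = True
        self.light_on = False

    def attach(self, observer):
        self._observers.append(observer)

    def detach(self, observer):
        self._observers.remove(observer)

    def set_state(self, *, occupied=None, light_on=None):
        """Change part of the state, then notify every observer."""
        if occupied is not None:
            self.occupied = occupied
        if light_on is not None:
            self.light_on = light_on
        for observer in list(self._observers):
            observer.update(self)


class NotificationService:
    """Observer: records a message when the empty house has a light on."""

    def __init__(self):
        self.sent = []

    def update(self, home):
        if not home.occupied and home.light_on:
            self.sent.append("House is empty but the light is still on")
```

A short usage example: after `home.attach(service)`, calling `home.set_state(light_on=True)` sends nothing while the house is occupied, and a following `home.set_state(occupied=False)` appends one message to `service.sent`.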

  • PubSub pattern relies on a many-to-many communication.
  • Confidentiality and authenticity of messages is strongly coupled to the security of the broker that mediates all dataflows.
  • Publishers and subscribers can easily connect to a broker and exchange data via specific topics which is also true for an attacker (many-to-many communication).
  • Passive attackers outside the publish/subscribe network can eavesdrop the communication and try to discover content of events and subscriptions.

Due to the above "weaknesses", an attacker could potentially monitor whether any of the house owners is currently at home, and use that to plan a burglary, if the PubSub pattern is not implemented carefully. In contrast, from my point of view, the Subject in the Observer pattern could establish a one-to-one communication when calling the "update" method and set up a secure channel to maintain the confidentiality and integrity of the state change notification it sends.

At the moment, I'm not quite sure about the correctness of my argumentation and would like to ask if anyone could give me a hint to answer this question.

Thank you in advance for your help.

Best regards,

RatbaldMeyer

Monday, 3 January 2022

C program to fill an n x n matrix in a spiral form

I am implementing a C program using functions to fill a square matrix in a spiral form. Here is what I already did:

#include <stdio.h>
#include <conio.h>

const N = 5;
int top = 0;
int bottom = N - 1;
int right = 0;
int left = N -1;

int main(){
    int z = 1 /*N = 5*/;
    int Array[100][100];
    while (z <= (N*N))
    {
        FillRowForward(Array, z);
        FillColumnDownward(Array, z);
        FillRowBackward(Array, z);
        FillColumnUpward(Array, z);
    }

    printf("Two dimensional array elements: \n");

    for (int i = 0; i < N; i++)
    {
        // printf("\t");
        for (int j = 0; j < N; j++)
        {
            printf("%d \t", Array[i][j]);
        }
        printf("\n");
    }

    return 0;
}


/*Definition of functions*/

int FillRowForward(int A[][N], /*int top, int left, int right,*/ int Z)
{
    for (int i = right; i <= left; i++)
    {
        A[top][i] = Z++;
    }
}

int FillColumnDownward(int A[][N], /*int top, int bottom, int right,*/ int Z)
{
    for (int j = top + 1; j <= bottom; j++)
    {
        A[j][bottom] = Z++;
    }
}

int FillRowBackward(int A[][N], /*int bottom, int left, int right,*/ int Z)
{
    for (int i = left - 1; i >= top; i--)
    {
        A[bottom][i] = Z++;
    }
}

int FillColumnUpward(int A[][N], /*int top, int bottom, int right,*/ int Z)
{
    for (int j = bottom - 1; j >= top + 1; j--)
    {
        A[j][left] = Z++;
    }

}

The first function (FillRowForward) is supposed to fill the first row, the next one is supposed to fill the last column downward, and so on until the whole matrix is filled. But when I run it, it only shows a black, blank screen with no output. I need some help with this, please!
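One thing worth noting about the code above: z is passed to each helper by value, so the copy incremented inside the helpers never changes z in main, and the condition z <= N*N stays true forever. The boundaries top, bottom, left and right are also never moved inward after a side is filled. The intended algorithm, with a running counter and shrinking boundaries, can be sketched as follows. It is written in Python purely to show the logic (the same structure carries over to C), and the function name `fill_spiral` is made up for this illustration.

```python
def fill_spiral(n):
    """Return an n x n matrix filled with 1..n*n in clockwise spiral order."""
    matrix = [[0] * n for _ in range(n)]
    top, bottom, left, right = 0, n - 1, 0, n - 1
    value = 1
    while top <= bottom and left <= right:
        # Top row, left to right.
        for col in range(left, right + 1):
            matrix[top][col] = value
            value += 1
        top += 1
        # Right column, top to bottom.
        for row in range(top, bottom + 1):
            matrix[row][right] = value
            value += 1
        right -= 1
        # Bottom row, right to left (only if a row is still left).
        if top <= bottom:
            for col in range(right, left - 1, -1):
                matrix[bottom][col] = value
                value += 1
            bottom -= 1
        # Left column, bottom to top (only if a column is still left).
        if left <= right:
            for row in range(bottom, top - 1, -1):
                matrix[row][left] = value
                value += 1
            left += 1
    return matrix


for row in fill_spiral(3):
    print(row)
# [1, 2, 3]
# [8, 9, 4]
# [7, 6, 5]
```

In C, the counter and the four boundaries would have to be shared between the helpers, for example by passing pointers to them or by keeping them in a single function.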

How to avoid passing around runtime parameter everywhere

I have a client app that lets users take part in a match. In almost all cases, except maybe in tests, a user takes part in at most one match at a time. Every match has a unique ID; let's call it MatchID. I have many services that provide information about the match, e.g.:

interface IMatchPlayersRepository
{
    string[] GetMatchPlayers(string matchId);
    string GetBestPlayer(string matchId);
}

interface IMatchFieldsInfo
{
    int GetFieldSize(string matchId);
}

interface IMatchScoreboardProvider
{
    Scoreboard GetScoreboard(string matchId, string someOtherParam);
}

// And many others...

Implementations of those use one another, e.g.:

public class MatchScoreboardProvider : IMatchScoreboardProvider
{
    public MatchScoreboardProvider(IMatchPlayersRepository playersRepository)
    { //...
    }

    public Scoreboard GetScoreboard(string matchId, string someOtherParam)
    {
        var bestPlayer = _playersRepository.GetBestPlayer(matchId);
        //...
    }
}

How do I prevent passing matchId everywhere? Using factories only moves the problem to factory methods, e.g.:

public class MatchPlayersRepositoryFactory
{
    public IMatchPlayersRepository Get(string matchId) => new MatchPlayersRepository(matchId);
}

public class MatchScoreboardProvider : IMatchScoreboardProvider
{
    // Created from factory
    public MatchScoreboardProvider(string matchId, MatchPlayersRepositoryFactory playersRepositoryFactory)
    { //...
    }

    public Scoreboard GetScoreboard(string someOtherParam)
    {
        var bestPlayer = _playersRepositoryFactory.Get(_matchId).GetBestPlayer();
        //...
    }
}

Ideally, I would have another DI "composition root" for every match, where all objects are created by this new DI container, which passes matchId to them. However, creating a new "composition root" seems like a really bad idea, from what I understand.
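One way to picture the per-match idea without a second full composition root is a small scope object that is created once per match and wires the match-specific services around a shared context. The sketch below uses Python only to keep the example short; `MatchContext` and `MatchScope` are hypothetical names, the in-memory player data is invented, and whether a given DI container supports such scoped lifetimes is an assumption to check against its documentation.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class MatchContext:
    """Holds the runtime value that would otherwise be passed everywhere."""
    match_id: str


class MatchPlayersRepository:
    def __init__(self, context: MatchContext, players_by_match: Dict[str, List[str]]):
        self._context = context
        self._players_by_match = players_by_match  # stand-in for a real data source

    def get_match_players(self) -> List[str]:
        return self._players_by_match[self._context.match_id]

    def get_best_player(self) -> str:
        # Placeholder rule: treat the first listed player as the best one.
        return self.get_match_players()[0]


class MatchScoreboardProvider:
    def __init__(self, players_repository: MatchPlayersRepository):
        self._players_repository = players_repository

    def get_scoreboard(self, some_other_param: str) -> str:
        best = self._players_repository.get_best_player()
        return f"{some_other_param}: best player is {best}"


class MatchScope:
    """Created once per match; the only place that sees the match id."""

    def __init__(self, match_id: str, players_by_match: Dict[str, List[str]]):
        context = MatchContext(match_id)
        self.players = MatchPlayersRepository(context, players_by_match)
        self.scoreboard = MatchScoreboardProvider(self.players)
```

With this arrangement, `MatchScope("m1", data).scoreboard.get_scoreboard("final")` needs no match id argument, and the service interfaces lose their matchId parameters; the cost is that these services can no longer be shared singletons across matches.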