Monday, October 18, 2021

Dealing with large binary data files

I am working with large binary files (approx. 2 GB each) that contain raw data. These files have a well-defined structure: each file is an array of events, and each event is an array of data banks. Each event and data bank has its own structure (header, data type, etc.).

From these files, all I have to do is extract whatever data I might need, and then I just analyze and play with the data. I might not need all of the data; sometimes I just extract XType data, other times just YType, etc.
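
To make this concrete, the layout I have in mind is roughly like the sketch below. The field names and sizes are only illustrative, not the real format; the point is just that every event and bank starts with a small fixed header describing what follows.

#include <cstdint>

// Illustrative layout only: the real format differs, but the idea is the same.
struct EventHeader {
    uint32_t event_id;
    uint32_t num_banks;   // how many data banks follow this header
    uint32_t event_size;  // total size of the event in bytes
};

struct BankHeader {
    char     name[4];     // bank identifier
    uint32_t data_type;   // which kind of data the bank holds (XType, YType, ...)
    uint32_t data_size;   // number of payload bytes that follow
};

// A file is then just: [EventHeader][BankHeader][payload][BankHeader][payload]...[EventHeader]...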

I don't want to shoot myself in the foot, so I am asking for guidance/best practice on how to deal with this. I can think of 2 possibilities:

Option 1

  • Define a DataBank class; this will contain the actual data (std::vector<T>) plus whatever structure the bank has.
  • Define an Event class; this has a std::vector<DataBank> plus whatever structure.
  • Define a MyFile class; this is a std::vector<Event> plus whatever structure.

The constructor of MyFile will take a std::string (name of the file) and will do all the heavy lifting of reading the binary file into the classes above.

Then, whatever I need from the binary file will just be a method of the MyFile class; I can loop through Events and DataBanks, and everything I could need is already in this "unpacked" object.
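
A minimal interface sketch of what I mean for Option 1, using the illustrative headers above; the member names and the XData placeholder are just assumptions, and the parsing bodies are omitted:

#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

struct XData { double value; };    // placeholder for whatever one XType sample looks like

class DataBank {
public:
    uint32_t data_type = 0;            // taken from the bank header
    std::vector<uint8_t> payload;      // raw bank data, or std::vector<T> once decoded
};

class Event {
public:
    uint32_t event_id = 0;             // taken from the event header
    std::vector<DataBank> banks;
};

class MyFile {
public:
    explicit MyFile(const std::string& filename);  // heavy lifting: reads the whole file into 'events'
    std::vector<XData> getXData() const;           // walks events/banks and collects the XType data
    std::size_t getNumEvents() const { return events.size(); }
private:
    std::vector<Event> events;
};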

The workflow here would be like:

int main() {
    MyFile data_file("data.bin");
    std::vector<XData> my_data = data_file.getXData();
    // Play with my_data, and never again use the data_file object
    // ...
    return 0;
}

Option 2

  • Write functions that take a std::string (the file name) as an argument and extract whatever I need from the file, e.g. std::vector<XData> getXData(std::string), int getNumEvents(std::string), etc.; a rough sketch follows the workflow below.

The workflow here would be like:

int main() {
    std::vector<XData> my_data = getXData("data.bin");
    // Play with my_data, and I didn't create a massive object
    // ...
    return 0;
}
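
A rough sketch of what such a free function could look like, reusing the illustrative EventHeader/BankHeader structs and the XData placeholder from above; X_TYPE is an assumed tag value, the payload decoding is left out, and I am ignoring endianness/padding concerns:

#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

constexpr uint32_t X_TYPE = 1;  // assumed tag identifying XType banks

std::vector<XData> getXData(const std::string& filename) {
    std::vector<XData> result;
    std::ifstream in(filename, std::ios::binary);
    EventHeader ev;
    while (in.read(reinterpret_cast<char*>(&ev), sizeof(ev))) {
        for (uint32_t b = 0; b < ev.num_banks; ++b) {
            BankHeader bank;
            if (!in.read(reinterpret_cast<char*>(&bank), sizeof(bank))) return result;
            if (bank.data_type == X_TYPE) {
                std::vector<char> payload(bank.data_size);
                in.read(payload.data(), payload.size());
                // ... decode the payload bytes into XData values and append to result ...
            } else {
                in.seekg(bank.data_size, std::ios::cur);  // skip banks I don't need
            }
        }
    }
    return result;
}

The difference from Option 1 is visible here: only the banks I care about are ever held in memory.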

Pros and Cons that I see

Option 1 seems like the cleaner option: I would only "unpack" the binary file once, in the MyFile constructor. But I will have created a huge object that contains all the data from a 2 GB file, much of which I will never use. If I need to analyze 20 files (2 GB each), will I need 40 GB of RAM? I don't understand how objects this large are handled; will this affect performance?

Option 2 seems faster: I will just extract whatever data I need, and that's it; I won't "unpack" the entire binary file only to later pull out the data I care about. The problem is that I will have to deal with the binary file structure in every function; if that structure ever changes, it will be a pain. On the plus side, I will only create objects for the data I actually play with.

As you can see from my question, I don't have much experience dealing with large structures and files. I appreciate any advice.
