I am working with large binary files (approx. 2 GB each) that contain raw data. These files have a well-defined structure: each file is an array of events, and each event is an array of data banks. Each event and each data bank has its own structure (header, data type, etc.).
From these files, all I have to do is extract whatever data I might need, and then I just analyze and play with the data. I might not need all of the data; sometimes I just extract XType data, other times just YType, etc.
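For concreteness, this is roughly the layout I am describing (all names and field types below are placeholders, not my actual format):

#include <cstdint>

// Hypothetical headers, just to illustrate "file -> events -> data banks".
struct BankHeader {
    uint32_t bank_type;    // e.g. XType, YType, ...
    uint32_t payload_size; // number of bytes of bank data that follow
};

struct EventHeader {
    uint32_t event_id;
    uint32_t num_banks;    // how many data banks follow this header
};
// On disk: [EventHeader][BankHeader][payload][BankHeader][payload]...[EventHeader]...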
I don't want to shoot myself in the foot, so I am asking for guidance/best practice on how to deal with this. I can think of 2 possibilities:
Option 1

- Define a DataBank class; this will contain the actual data (std::vector<T>) and whatever structure it has.
- Define an Event class; this has a std::vector<DataBank> plus whatever structure.
- Define a MyFile class; this is a std::vector<Event> plus whatever structure.
The constructor of MyFile will take a std::string (the name of the file) and will do all the heavy lifting of reading the binary file into the classes above.
Then, whatever I need from the binary file will just be a method of the MyFile class; I can loop through Events, I can loop through DataBanks, and everything I could need is already in this "unpacked" object.
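A rough sketch of what I have in mind for these classes (XData, the accessors, and the members are just placeholders, not a finished design):

#include <string>
#include <vector>

struct XData { /* fields of one XType record */ };

class DataBank {
public:
    int type() const { return type_; }
    const std::vector<char>& data() const { return data_; }
private:
    int type_ = 0;
    std::vector<char> data_;   // the actual bank payload
};

class Event {
public:
    const std::vector<DataBank>& banks() const { return banks_; }
private:
    std::vector<DataBank> banks_;
};

class MyFile {
public:
    explicit MyFile(const std::string& filename) {
        // open `filename` and walk the event/bank structure once,
        // filling events_; all the heavy lifting lives here
    }
    const std::vector<Event>& events() const { return events_; }
    std::vector<XData> getXData() const {
        std::vector<XData> out;
        // loop over events_ and their banks, decode XType banks into out
        return out;
    }
private:
    std::vector<Event> events_;
};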
The workflow here would be like:
int main() {
    MyFile data_file("data.bin");
    std::vector<XData> my_data = data_file.getXData();
    // Play with my_data, and never again use the data_file object
    // ...
    return 0;
}
Option 2

- Write functions that take a std::string as an argument and extract whatever I need from the file, e.g. std::vector<XData> getXData(std::string), int getNumEvents(std::string), etc. (sketched below).
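Roughly, these free functions would look like this (the actual parsing is only indicated by comments, and the error handling is just a placeholder):

#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>

struct XData { /* fields of one XType record */ };

// Opens the file, walks the event/bank structure, and keeps only XType data.
std::vector<XData> getXData(const std::string& filename) {
    std::ifstream in(filename, std::ios::binary);
    if (!in) throw std::runtime_error("cannot open " + filename);
    std::vector<XData> out;
    // read each event header, then each bank header; decode XType banks
    // into out and seekg past everything else
    return out;
}

int getNumEvents(const std::string& filename) {
    std::ifstream in(filename, std::ios::binary);
    if (!in) throw std::runtime_error("cannot open " + filename);
    int count = 0;
    // read each event header, seekg past its banks, increment count
    return count;
}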
The workflow here would be like:
int main() {
    std::vector<XData> my_data = getXData("data.bin");
    // Play with my_data, and I didn't create a massive object
    // ...
    return 0;
}
Pros and Cons that I see
Option 1 seems like the cleaner option: I would only "unpack" the binary file once, in the MyFile constructor. But I will have created a huge object holding all the data from a 2 GB file, much of which I will never use. If I need to analyze 20 files (2 GB each), will I need 40 GB of RAM? I don't understand how such large objects are handled; will this affect performance?
Option 2 seems faster: I just extract whatever data I need and that's it; I won't "unpack" the entire binary file only to later extract the data I care about, and I only create objects for the data I will actually play with. The problem is that I will have to deal with the binary file structure in every function; if that structure ever changes, it will be a pain.
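For example, my understanding is that if I process the files one at a time, each MyFile is destroyed at the end of its loop iteration, so I would never hold more than one file's worth of data at once (placeholder names reused from the sketches above, file names made up):

#include <string>
#include <vector>

// Assumed to exist as in the Option 1 sketch above.
struct XData {};
class MyFile {
public:
    explicit MyFile(const std::string&) {}
    std::vector<XData> getXData() const { return {}; }
};

int main() {
    const std::vector<std::string> files = {"run01.bin", "run02.bin" /* ... */};
    std::vector<XData> all_x;
    for (const auto& name : files) {
        MyFile data_file(name);                      // ~2 GB unpacked here
        std::vector<XData> x = data_file.getXData();
        all_x.insert(all_x.end(), x.begin(), x.end());
    }   // data_file goes out of scope each iteration, freeing that file's memory
    return 0;
}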
As you can see from my question, I don't have much experience dealing with large structures and files. I appreciate any advice.