Thursday, April 6, 2017

Concurrently parsing records in a binary file in Go

I have a binary file that I want to parse. The file is broken up into records that are 1024 bytes each. The high-level steps needed are:

  1. Read 1024 bytes at a time from the file (a minimal read-loop sketch follows this list).
  2. Parse each 1024-byte "record" (chunk) and place the parsed data into a map or struct.
  3. Return the parsed data to the user and any error(s).
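
A minimal sketch of step 1, assuming the file really is a whole number of 1024-byte records with no header; the file name is a placeholder and the parse step is left as a comment. `io.ReadFull` is handy here because it distinguishes a clean end-of-file from a truncated final record:

```go
package main

import (
	"fmt"
	"io"
	"log"
	"os"
)

const recordSize = 1024

func main() {
	// "data.bin" is a placeholder path for the binary file.
	f, err := os.Open("data.bin")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	buf := make([]byte, recordSize)
	count := 0
	for {
		// io.ReadFull returns io.EOF on a clean record boundary and
		// io.ErrUnexpectedEOF if the final record is truncated.
		_, err := io.ReadFull(f, buf)
		if err == io.EOF {
			break
		}
		if err != nil {
			log.Fatal(err)
		}
		count++ // a real program would parse buf here (step 2)
	}
	fmt.Println("records read:", count)
}
```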

I'm not looking for code, just design/approach help.

Due to I/O constraints, I don't think it makes sense to attempt concurrent reads from the file. However, I see no reason why the 1024-byte records can't be parsed concurrently using goroutines. I'm new to Go, so I wanted to see if this makes sense or if there is a better (faster) way:

  1. A main function opens the file and reads 1024 bytes at a time into byte arrays (records).
  2. The records are passed to a function that parses the data into a map or struct. The parser function would be called as a goroutine on each record.
  3. The parsed maps/structs are collected into a slice via a channel. I would preallocate the slice's backing array with capacity equal to the file size (in bytes) divided by 1024, since that should be the exact number of records (assuming no errors). A sketch of this pipeline follows the list.
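
Sketched below is one way that pipeline could look under the design above: a single reader goroutine, a pool of parser goroutines sized to the CPU count, and bounded channels so only a limited number of raw chunks are in memory at once. `data.bin`, `Record`, `parseRecord`, and the channel capacity of 64 are placeholders, not anything prescribed by the problem:

```go
package main

import (
	"fmt"
	"io"
	"log"
	"os"
	"runtime"
	"sync"
)

const recordSize = 1024

// Record is a placeholder for the parsed form of one 1024-byte chunk.
type Record struct {
	First byte
}

// parseRecord is a hypothetical parser for a single chunk (step 2).
func parseRecord(chunk []byte) Record {
	return Record{First: chunk[0]}
}

func main() {
	f, err := os.Open("data.bin") // placeholder path
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Size the result slice up front, as in step 3: file size / record size.
	info, err := f.Stat()
	if err != nil {
		log.Fatal(err)
	}
	all := make([]Record, 0, int(info.Size()/recordSize))

	chunks := make(chan []byte, 64)  // bounded: caps raw chunks in flight
	results := make(chan Record, 64) // bounded: caps parsed records in flight

	// One reader goroutine: sequential I/O, one fresh buffer per record so
	// the parsers can safely hold on to it.
	go func() {
		defer close(chunks)
		for {
			buf := make([]byte, recordSize)
			if _, err := io.ReadFull(f, buf); err == io.EOF {
				return
			} else if err != nil {
				log.Fatal(err)
			}
			chunks <- buf
		}
	}()

	// A small pool of parser goroutines.
	var wg sync.WaitGroup
	for i := 0; i < runtime.NumCPU(); i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for c := range chunks {
				results <- parseRecord(c)
			}
		}()
	}
	go func() {
		wg.Wait()
		close(results)
	}()

	// Collect the parsed records (step 3).
	for r := range results {
		all = append(all, r)
	}
	fmt.Println("parsed records:", len(all))
}
```

Note that results come off the channel in whatever order the parsers finish, so if record order matters you would send each chunk together with its index and write into the preallocated slice by index rather than appending.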

I'd also have to make sure I don't run out of memory, as the file can be anywhere from a few hundred MB up to 256 TB (rare, but possible). Does this design make sense? Will it be slower than simply parsing the file in a linear fashion as I read it 1024 bytes at a time, or will passing these records as byte arrays perform better? Or am I thinking about the problem all wrong?
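
For a rough sense of scale under the sketch above (an assumption about the design, not a measurement): the raw data in flight is bounded by the channel capacities plus one buffer per worker, on the order of a hundred 1024-byte buffers, so roughly 100 KiB regardless of file size. What does grow with the file is the collected result slice: a 256 TB file holds 256 TB / 1024 B, about 2.5 × 10^11 records, so even a few bytes of parsed output per record will not fit in memory, and at that scale the parsed records would have to be streamed to a consumer (or to disk) rather than accumulated in a slice.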

Cross-posted on Software Engineering
