I'm building a decoder for a very non-compliant binary format, and I need to search a binary file (specifically, a partial byte buffer) for frame headers. This calls for an efficient multiple-pattern string search algorithm, so I decided on Aho-Corasick. The problem is that some headers are a non-whole number of bytes long, e.g. 11 bits. These are the approaches I've come up with:
- Modify the Aho-Corasick trie to store children in interval trees rather than hash tables. Given the example pattern `0b11111111111`, the first node would store the range `[0xFF, 0xFF]` and the second node the range `[0xE0, 0xFF]` (a sketch of the node layout follows this list). Unfortunately this greatly expands the number of suffix links and dictionary suffix links among trie nodes, so I suspect this solution is equivalent to the next one.
- Expand the partial-byte pattern into all possible matching byte patterns: `0b11111111111` -> `[0xFFE0, 0xFFE1, ..., 0xFFFE, 0xFFFF]` (expansion sketch below the list). Obviously this greatly increases the total pattern length `m` (2^5 = 32 expansions for the 11-bit example), which inflates the automaton and its construction time.
- Convert the bytes to bits using the `bitarray` module (the individual bits are still backed by bytes in memory, but are now individually addressable; sketch below the list). Unfortunately this increases the search space 8-fold, as well as the search pattern lengths.
- Truncate each pattern to a whole number of bytes (i.e. `0b11111111111` -> `0b11111111`), then manually check the three remaining bits of each candidate using standard integer bit shifting (verification sketch below the list). This greatly increases the number of pattern occurrences to verify in the search space.
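To make the first approach concrete, here is a minimal sketch of what a range-keyed node might look like, with a plain list scan standing in for the interval tree; it shows only the goto step, not the suffix links that make the approach expensive, and all names are my own hypothetical ones:

```python
class RangeNode:
    """Trie node whose children are keyed by byte ranges instead of exact bytes."""
    def __init__(self):
        self.children = []   # list of ((lo, hi), RangeNode); an interval tree in a real build
        self.output = None   # pattern id if a pattern ends here

    def child_for(self, byte):
        # Linear scan for clarity; an interval tree would make this O(log r).
        for (lo, hi), node in self.children:
            if lo <= byte <= hi:
                return node
        return None

# The 11-bit pattern 0b11111111111 becomes two edges: [0xFF, 0xFF] then [0xE0, 0xFF].
root, mid, leaf = RangeNode(), RangeNode(), RangeNode()
leaf.output = 0
root.children.append(((0xFF, 0xFF), mid))
mid.children.append(((0xE0, 0xFF), leaf))
assert root.child_for(0xFF) is mid and mid.child_for(0xE7) is leaf
```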
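The expansion in the second approach is mechanical; a sketch, assuming patterns arrive as (value, bit length) pairs (my own convention):

```python
def expand_partial_pattern(value: int, nbits: int) -> list[bytes]:
    """Expand an nbits-long bit pattern into every byte string whose
    first nbits bits equal the pattern (low bits of the last byte are free)."""
    nbytes = (nbits + 7) // 8        # bytes needed to hold the pattern
    free = nbytes * 8 - nbits        # number of don't-care bits in the last byte
    base = value << free             # left-align the pattern in the byte string
    return [(base | tail).to_bytes(nbytes, "big") for tail in range(1 << free)]

# 0b11111111111 (11 bits) -> [b'\xff\xe0', b'\xff\xe1', ..., b'\xff\xff']
patterns = expand_partial_pattern(0b11111111111, 11)
assert len(patterns) == 32 and patterns[0] == b"\xff\xe0" and patterns[-1] == b"\xff\xff"
```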
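The third approach maps directly onto `bitarray`'s built-in search. A minimal sketch (the toy buffer is mine); since my headers start on byte boundaries, I can discard hits at unaligned bit offsets:

```python
from bitarray import bitarray

buf = bytes([0x00, 0xFF, 0xF0, 0xFF, 0xE1])   # toy search space

bits = bitarray()
bits.frombytes(buf)                 # 8x blow-up in addressable positions

pattern = bitarray('11111111111')   # the 11-bit header
# Keep only matches that start on a byte boundary, converted to byte offsets.
hits = [pos // 8 for pos in bits.itersearch(pattern) if pos % 8 == 0]
print(hits)                         # -> [1, 3]
```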
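And the fourth approach for a single header, with `bytes.find` standing in for the real multi-pattern search, since only the verification step matters here:

```python
def find_header(buf: bytes, start: int = 0) -> int:
    """Find 0b11111111111 as the byte 0xFF followed by a byte whose
    top three bits are set. Returns the byte offset, or -1."""
    i = buf.find(b"\xff", start)
    while i != -1:
        if i + 1 < len(buf) and buf[i + 1] >> 5 == 0b111:  # check the 3 leftover bits
            return i
        i = buf.find(b"\xff", i + 1)   # false candidate: keep scanning
    return -1

print(find_header(bytes([0x00, 0xFF, 0x10, 0xFF, 0xE1])))   # -> 3
```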
Am I crazy to think all these approaches have exactly the same Big-Theta complexity? ... Or is one approach more efficient?

Aho-Corasick runs in O(n + m + z), where n is the length of the search text, m is the total length of all search patterns, and z is the number of matches. By my accounting the approaches differ only in constant factors: expanding a pattern with k free trailing bits multiplies its contribution to m by 2^k, the bit-level search multiplies both n and m by 8, truncation inflates z with false candidates that each cost O(1) to verify, and the interval-tree trie keeps m small but adds a per-transition lookup cost. Aho-Corasick also reports overlapping matches, which I don't need, but that's moot since my patterns cannot overlap.
A final option would be a different string-searching algorithm altogether, or some feature of Python byte comparisons I'm not aware of (one idea sketched below).
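For instance, I wonder whether the `re` module already covers this: bytes patterns support byte-range character classes, which is essentially the interval-tree idea from the first approach, though I don't know how efficiently the engine handles many alternatives:

```python
import re

# 0b11111111111 == the byte 0xFF followed by a byte in [0xE0, 0xFF];
# multiple headers could be joined with b"|".
header = re.compile(b"\xff[\xe0-\xff]")

buf = bytes([0x00, 0xFF, 0x10, 0xFF, 0xE1])
print([m.start() for m in header.finditer(buf)])   # -> [3]
```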