I'm building a decoder for a very non-compliant binary format, and I need to search a binary file (specifically, a partial byte buffer) for frame headers. This calls for an efficient multiple-pattern string search algorithm, so I decided on Aho-Corasick. The problem is that some headers are a non-whole number of bytes long, e.g. 11 bits. These are the approaches I've come up with:
- Modify the Aho-Corasick trie to store children in interval trees rather than hash tables. Given the example pattern `0b11111111111`, the first node would store the range `[0xFF, 0xFF]` and the second node the range `[0xE0, 0xFF]` (a sketch of the node layout follows this list). Unfortunately this greatly expands the number of suffix links and dictionary suffix links among trie nodes, so I suspect this solution is equivalent to the next one.
- Expand the partial-byte pattern into all possible matching byte patterns: `0b11111111111` -> `[0xFFE0, 0xFFE1, ..., 0xFFFE, 0xFFFF]` (expansion sketch below the list). Obviously this greatly increases the total pattern length `m` (2^5 = 32 expansions for the 11-bit example), which inflates the automaton and its construction time.
- Convert the bytes to bits using the `bitarray` module (the individual bits are still backed by bytes in memory, but are now individually addressable; sketch below the list). Unfortunately this increases the search space 8-fold, as well as the search pattern lengths.
- Truncate each pattern to a whole number of bytes (i.e. `0b11111111111` -> `0b11111111`), then manually check the three remaining bits of each candidate using standard integer bit shifting (verification sketch below the list). This greatly increases the number of pattern occurrences to verify in the search space.
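To make the first approach concrete, here is a minimal sketch of what a range-keyed node might look like, with a plain list scan standing in for the interval tree; it shows only the goto step, not the suffix links that make the approach expensive, and all names are my own hypothetical ones:

```python
class RangeNode:
    """Trie node whose children are keyed by byte ranges instead of exact bytes."""
    def __init__(self):
        self.children = []   # list of ((lo, hi), RangeNode); an interval tree in a real build
        self.output = None   # pattern id if a pattern ends here

    def child_for(self, byte):
        # Linear scan for clarity; an interval tree would make this O(log r).
        for (lo, hi), node in self.children:
            if lo <= byte <= hi:
                return node
        return None

# The 11-bit pattern 0b11111111111 becomes two edges: [0xFF, 0xFF] then [0xE0, 0xFF].
root, mid, leaf = RangeNode(), RangeNode(), RangeNode()
leaf.output = 0
root.children.append(((0xFF, 0xFF), mid))
mid.children.append(((0xE0, 0xFF), leaf))
assert root.child_for(0xFF) is mid and mid.child_for(0xE7) is leaf
```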
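The expansion in the second approach is mechanical; a sketch, assuming patterns arrive as (value, bit length) pairs (my own convention):

```python
def expand_partial_pattern(value: int, nbits: int) -> list[bytes]:
    """Expand an nbits-long bit pattern into every byte string whose
    first nbits bits equal the pattern (low bits of the last byte are free)."""
    nbytes = (nbits + 7) // 8        # bytes needed to hold the pattern
    free = nbytes * 8 - nbits        # number of don't-care bits in the last byte
    base = value << free             # left-align the pattern in the byte string
    return [(base | tail).to_bytes(nbytes, "big") for tail in range(1 << free)]

# 0b11111111111 (11 bits) -> [b'\xff\xe0', b'\xff\xe1', ..., b'\xff\xff']
patterns = expand_partial_pattern(0b11111111111, 11)
assert len(patterns) == 32 and patterns[0] == b"\xff\xe0" and patterns[-1] == b"\xff\xff"
```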
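The third approach maps directly onto `bitarray`'s built-in search. A minimal sketch (the toy buffer is mine); since my headers start on byte boundaries, I can discard hits at unaligned bit offsets:

```python
from bitarray import bitarray

buf = bytes([0x00, 0xFF, 0xF0, 0xFF, 0xE1])   # toy search space

bits = bitarray()
bits.frombytes(buf)                 # 8x blow-up in addressable positions

pattern = bitarray('11111111111')   # the 11-bit header
# Keep only matches that start on a byte boundary, converted to byte offsets.
hits = [pos // 8 for pos in bits.itersearch(pattern) if pos % 8 == 0]
print(hits)                         # -> [1, 3]
```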
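And the fourth approach for a single header, with `bytes.find` standing in for the real multi-pattern search, since only the verification step matters here:

```python
def find_header(buf: bytes, start: int = 0) -> int:
    """Find 0b11111111111 as the byte 0xFF followed by a byte whose
    top three bits are set. Returns the byte offset, or -1."""
    i = buf.find(b"\xff", start)
    while i != -1:
        if i + 1 < len(buf) and buf[i + 1] >> 5 == 0b111:  # check the 3 leftover bits
            return i
        i = buf.find(b"\xff", i + 1)   # false candidate: keep scanning
    return -1

print(find_header(bytes([0x00, 0xFF, 0x10, 0xFF, 0xE1])))   # -> 3
```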
Am I crazy to think all these approaches have exactly the same Big-Theta complexity? ... Or is one approach more efficient?

Aho-Corasick runs in O(n + m + z), where n is the length of the search text, m is the total length of all search patterns, and z is the number of matches. By my accounting the approaches differ only in constant factors: expanding a pattern with k free trailing bits multiplies its contribution to m by 2^k, the bit-level search multiplies both n and m by 8, truncation inflates z with false candidates that each cost O(1) to verify, and the interval-tree trie keeps m small but adds a per-transition lookup cost. Aho-Corasick also reports overlapping matches, which I don't need, but that's moot since my patterns cannot overlap.
A final option would be a different string-searching algorithm altogether, or some feature of Python byte comparisons I'm not aware of (one idea sketched below).
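For instance, I wonder whether the `re` module already covers this: bytes patterns support byte-range character classes, which is essentially the interval-tree idea from the first approach, though I don't know how efficiently the engine handles many alternatives:

```python
import re

# 0b11111111111 == the byte 0xFF followed by a byte in [0xE0, 0xFF];
# multiple headers could be joined with b"|".
header = re.compile(b"\xff[\xe0-\xff]")

buf = bytes([0x00, 0xFF, 0x10, 0xFF, 0xE1])
print([m.start() for m in header.finditer(buf)])   # -> [3]
```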