jeudi 17 août 2017

Multiple-String Search Efficiency for Partial Matches

I'm building a decoder for a very non-compliant binary file and I need to search the binary file (specifically, a partial byte buffer) for frame headers. This means using an efficient multiple-pattern string search algorithm. I decided aho-corasick. The problem is that some headers are partial bytes, ie 11 bits. These are the approaches I've come up with:

  1. Modify the aho-corasick trie to store children in interval trees rather than hashes. So given the example pattern 0b11111111111, the first node will store the range [0xff-0xff] and the second node will store the range [0xe0-0xff]. Unfortunately this greatly expands the number of suffix links and dictionary suffix links among trie nodes, so I think this solution is equivalent to the following solution:

  2. Expand the partial bytes pattern into all possible matching byte patterns: 0b11111111111 -> [0xffe0, 0xffe1, ..., 0xfffe, 0xffff]. Obviously increasing the number of possible patterns greatly increases the search time.

  3. Convert the bytes to bits using the bitarray module (the individual bits are still backed by bytes in-memory, but now individually accessible). Unfortunately this increases the search space 8-fold, as well as the search pattern lengths.

  4. Truncate the pattern to be a multiple of bytes (ie 0b11111111111 -> 0b11111111) and then manually check for the three remaining bits using standard integer bit shifting. This greatly increases the number of pattern occurrences in the search space.

Am I crazy to think all these approaches have exactly the same Big-Theta complexity? ... Or is one approach more efficient?

Aho-corasick complexity is O(n + m + z) where n is the length of the search text, m is the total length of all search patterns, and z is the number of matches. Aho-corasick detects overlapping patterns which isn't necessary but can't happen since my patterns do not overlap.

  1. Use a different string-searching algorithm. Or some feature of python byte comparisons I'm not aware of.

Aucun commentaire:

Enregistrer un commentaire