Wednesday, 11 December 2019

Iteratively matching substrings and then removing matches

I have a list of N strings containing patterns that I would like to match. I am doing this with the difflib library:

from difflib import SequenceMatcher

def longestSubstring(str1, str2):
    # Find the longest contiguous block common to both strings
    seqMatch = SequenceMatcher(None, str1, str2)
    match = seqMatch.find_longest_match(0, len(str1), 0, len(str2))
    return match

a = "grandfather's clock"
b = 'father'

longestSubstring(a, b).size  # returns 6, which is the length of 'father'

I would like to store this information for every pair of strings that I may have (N is in the hundreds rather than the thousands or higher, and each string is at most 100 characters long).

Once the information is stored, I need to remove the pairs in order of longest substring match, and then iteratively match again on the remaining parts of the strings that have not been matched.

e.g.

str1 = 'abcdefghijk'
str2 = 'bcde'
str3 = 'fghz'

result = {'a'   : False,
          'bcde': True,
          'fgh' : True,
          'ijk' : False,
          'z'   : False}
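To make the intended iteration concrete, here is a minimal sketch of one greedy way it could work. It is an illustration rather than a tested solution, and it assumes that a fragment can only be matched against fragments from a different source string. The names `longest_match`, `greedy_split` and `min_size` are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def longest_match(s1, s2):
    # autojunk=False switches off difflib's heuristic for long sequences
    matcher = SequenceMatcher(None, s1, s2, autojunk=False)
    return matcher.find_longest_match(0, len(s1), 0, len(s2))


def greedy_split(strings, min_size=1):
    # Each pool entry is (index of the source string, unmatched fragment)
    pool = [(i, s) for i, s in enumerate(strings) if s]
    result = {}
    while True:
        best = None  # (match, pool position of first piece, of second piece)
        for p, q in combinations(range(len(pool)), 2):
            (src_p, frag_p), (src_q, frag_q) = pool[p], pool[q]
            if src_p == src_q:
                continue  # never match a string against its own fragments
            m = longest_match(frag_p, frag_q)
            if m.size >= max(min_size, 1) and (best is None or m.size > best[0].size):
                best = (m, p, q)
        if best is None:
            break  # nothing left to match
        m, p, q = best
        (src_p, frag_p), (src_q, frag_q) = pool[p], pool[q]
        result[frag_p[m.a:m.a + m.size]] = True
        # Whatever lies to the left and right of the match goes back in the pool
        leftovers = [
            (src_p, frag_p[:m.a]), (src_p, frag_p[m.a + m.size:]),
            (src_q, frag_q[:m.b]), (src_q, frag_q[m.b + m.size:]),
        ]
        del pool[q]  # q > p, so delete q first to keep p valid
        del pool[p]
        pool.extend(piece for piece in leftovers if piece[1])
    for _, frag in pool:
        result.setdefault(frag, False)  # unmatched remainders
    return result


result = greedy_split(['abcdefghijk', 'bcde', 'fghz'])
```

For the three example strings this marks 'bcde' and 'fgh' as True and leaves 'a', 'ijk' and 'z' as False. Every round recomputes all pairwise matches, which is probably acceptable for a few hundred short strings but could be cached if it turns out to be slow. A fragment that is matched in one place and left over in another keeps the value True in this sketch.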

My current plan is to store the match.size value for each pair as an entry in a square numpy.array whose side length equals the number of strings. The exception is the diagonal: when i == j, array[i][j] = 0, so that strings do not match themselves.

e.g.

import numpy as np

str1 = 'abcdefghijk'
str2 = 'bcde'
str3 = 'fghz'

matches = np.array([[0, 4, 3],
                    [4, 0, 0],
                    [3, 0, 0]])
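A minimal sketch of how such a matrix could be filled in with difflib is shown here. The helper names `longest_match_size` and `match_matrix` are hypothetical, and the result is built as a plain list of lists, which can be passed to np.array if a numpy array is wanted.

```python
from difflib import SequenceMatcher


def longest_match_size(s1, s2):
    # autojunk=False switches off difflib's heuristic for long sequences
    matcher = SequenceMatcher(None, s1, s2, autojunk=False)
    return matcher.find_longest_match(0, len(s1), 0, len(s2)).size


def match_matrix(strings):
    n = len(strings)
    sizes = [[0] * n for _ in range(n)]  # the diagonal stays 0 (no self-matches)
    for i in range(n):
        for j in range(i + 1, n):
            # The longest common substring length is symmetric, so fill both cells
            size = longest_match_size(strings[i], strings[j])
            sizes[i][j] = sizes[j][i] = size
    return sizes


matches = match_matrix(['abcdefghijk', 'bcde', 'fghz'])
print(matches)  # [[0, 4, 3], [4, 0, 0], [3, 0, 0]]
```

Only the upper triangle is computed, so N strings need N * (N - 1) / 2 calls to the matcher.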

However, I do not know how to do this iteratively, especially when the match falls in the middle of a string. Please note that I am open to changing how I match the strings, or how I iterate through them after matching, if someone knows a better way to do either.

I can also edit the question if more detail is needed.
