I have a list of N strings containing patterns that I would like to match against each other. I am doing this with the difflib library:
from difflib import SequenceMatcher

def longestSubstring(str1, str2):
    # Return the Match(a, b, size) for the longest block common to both strings
    seqMatch = SequenceMatcher(None, str1, str2)
    match = seqMatch.find_longest_match(0, len(str1), 0, len(str2))
    return match

a = "grandfather's clock"  # double quotes, because the text contains an apostrophe
b = 'father'
longestSubstring(a, b).size  # returns 6, which is the length of 'father'
I would like to store this information for every pair of strings that I may have (N is in the hundreds rather than the thousands or higher, and each string is at most 100 characters long).
Once the information is stored, I need to remove the pairs in order of longest substring match, and then iteratively repeat the matching on the remaining parts of the strings that have not been matched yet.
e.g.
str1 = 'abcdefghijk'
str2 = 'bcde'
str3 = 'fghz'
result = {'a': False,
          'bcde': True,
          'fgh': True,
          'ijk': False,
          'z': False}
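One way the remove-and-rematch procedure described above could look is sketched below. This is only a greedy sketch under some assumptions of mine: the names longest_match, greedy_match and min_size are hypothetical, pieces cut from the same original string are never matched against each other, and ties between equally long matches are broken by iteration order.

```python
from difflib import SequenceMatcher
from itertools import combinations

def longest_match(a, b):
    # autojunk=False disables difflib's heuristic for very frequent elements
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return matcher.find_longest_match(0, len(a), 0, len(b))

def greedy_match(strings, min_size=1):
    # Each fragment is (index of the source string, remaining text).
    fragments = [(i, s) for i, s in enumerate(strings)]
    result = {}
    while True:
        best = None  # (match, position of fragment x, position of fragment y)
        for x, y in combinations(range(len(fragments)), 2):
            if fragments[x][0] == fragments[y][0]:
                continue  # assumption: pieces of one source string never pair up
            m = longest_match(fragments[x][1], fragments[y][1])
            if m.size >= min_size and (best is None or m.size > best[0].size):
                best = (m, x, y)
        if best is None:
            break  # no remaining pair shares a long enough substring
        m, x, y = best
        (owner_x, text_x), (owner_y, text_y) = fragments[x], fragments[y]
        result[text_x[m.a:m.a + m.size]] = True
        # Whatever lies left and right of the match goes back into the pool.
        leftovers = [
            (owner_x, text_x[:m.a]), (owner_x, text_x[m.a + m.size:]),
            (owner_y, text_y[:m.b]), (owner_y, text_y[m.b + m.size:]),
        ]
        fragments = [f for k, f in enumerate(fragments) if k not in (x, y)]
        fragments += [f for f in leftovers if f[1]]
    for _, text in fragments:
        result.setdefault(text, False)  # unmatched remainders
    return result

result = greedy_match(['abcdefghijk', 'bcde', 'fghz'])
```

Every round recomputes all pairs, which is wasteful; for a few hundred short strings that may be tolerable, but caching the scores of fragments that did not change would be an obvious refinement.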
My current plan is to store the match.size value for each pair as an entry in a square numpy.array whose side length equals the number of strings. The exception is the diagonal: if i == j, then array[i][j] = 0, so that strings do not match themselves.
e.g.
str1 = 'abcdefghijk'
str2 = 'bcde'
str3 = 'fghz'
matches = np.array([[0, 4, 3],
                    [4, 0, 0],
                    [3, 0, 0]])
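A minimal sketch of how such a table could be filled is shown here. The helper name pairwise_match_sizes is hypothetical, and it builds a plain nested list so that it runs without extra dependencies; wrapping the result in np.array(...) should give the array form above.

```python
from difflib import SequenceMatcher
from itertools import combinations

def pairwise_match_sizes(strings):
    """Return a symmetric N x N table of longest-common-substring lengths."""
    n = len(strings)
    sizes = [[0] * n for _ in range(n)]  # diagonal stays 0: no self-matches
    for i, j in combinations(range(n), 2):
        # autojunk=False disables difflib's heuristic for very frequent elements
        matcher = SequenceMatcher(None, strings[i], strings[j], autojunk=False)
        match = matcher.find_longest_match(0, len(strings[i]), 0, len(strings[j]))
        # The length is the same in both directions, so compute each pair once.
        sizes[i][j] = sizes[j][i] = match.size
    return sizes

matches = pairwise_match_sizes(['abcdefghijk', 'bcde', 'fghz'])
```

Only the upper triangle is computed, which halves the number of SequenceMatcher calls compared with looping over every (i, j).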
However, I do not know how to do this iteratively, especially when the match is in the middle of a string. Please note that I am open to changing my method for matching the strings, or for iterating through the strings after matches have been made, if someone knows a better way to do either.
I can also edit the question if more detail is needed.