Friday, 28 October 2022

Regular Expression to match on common starting characters including wildcard characters in either/both test and lookup strings

I'm trying to construct a regular expression to match an input/test string variable against a list of strings. The logic needs to be as follows:

  1. If there is an item in the list that is simply a single asterisk "*", consider that a match

  2. If there is an item in the list where the entirety of the string matches the beginning of the test string, consider that a match

Test String   List Item   Expected Result
ABCD          A           Match
ABCD          AB          Match
ABCD          ABC         Match
ABCD          ABCD        Match
ABCD          ABCDE       Not a match
ABCD          AC          Not a match
ABCD          ABD         Not a match
ABCD          XA          Not a match
ABCD          XABC        Not a match

  3. If there is an item in the list which contains one or more period characters ("."), they should each be treated as a single-character wildcard (the equivalent of "?", or could also be [A-Z]{1})

Test String   List Item   Expected Result
ABCD          A.C         Match
ABCD          AB.D        Match
ABCD          A.CD        Match
ABCD          A..D        Match
ABCD          A.B         Not a match
ABCD          A.BC        Not a match

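The first three rules can also be written as a plain character-by-character comparison, which may help pin down the intended behaviour before looking for a regex. This is only a minimal sketch in Python (the question does not name a language); `item_matches` and `any_match` are hypothetical helper names, and it assumes the test string itself contains no periods.

```python
def item_matches(test: str, item: str) -> bool:
    """Check one list item against the test string (rules 1 to 3 only)."""
    # Rule 1: a lone asterisk always counts as a match.
    if item == "*":
        return True
    # Rule 2: the whole item must fit at the start of the test string.
    if len(item) > len(test):
        return False
    # Rule 3: a period in the item stands for any single character.
    return all(i == "." or i == t for i, t in zip(item, test))


def any_match(test: str, items: list[str]) -> bool:
    """True if at least one item in the list matches the test string."""
    return any(item_matches(test, item) for item in items)


print(any_match("ABCD", ["XA", "A.C"]))
print(any_match("ABCD", ["ABCDE", "A.B"]))
```

The two calls at the end check a list containing "A.C" and a list containing only non-matching items from the tables above.
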
  4. Similarly, any period characters in the test string should also be treated as single-character wildcards

Test String   List Item   Expected Result
A..D          ABCD        Match
A..D          AB.D        Match
A..D          A.CD        Match
A..D          A..D        Match
A..D          ABC         Not a match
A..D          ACBD        Not a match
A..D          A.BC        Not a match

(NB: Period characters cannot appear at the very start or very end of either the test string or any of the list items; they are always surrounded by alpha characters)

So, taking the "ABCD" example, a (poor) regular expression that (I think?) would work would be something like:

^((\*)|(A)|(AB)|(ABC)|(ABCD)|(A\.C)|(A\.CD)|(AB\.D)|(A\.\.D))$

The "A..D" example is slightly more straightforward (again, I think?):

^((\*)|(A)|(A.)|(A..)|(A..D))$

However, the test string is dynamic (a string variable), so I would have to construct this pattern with some kind of loop or nested loops over the characters in the test string every time I need to run a pattern match. That is fine if the test string is short like "ABCD", but the pattern complexity grows exponentially as the length of the test string increases.

For example, if "ABCD" is changed to "ABCDE", the equivalent pattern becomes:

^((\*)|(A)|(AB)|(ABC)|(ABCD)|(ABCDE)|(A\.C)|(A\.CD)|(AB\.D)|(A\.\.D)|(A\.CDE)|(AB\.DE)|(ABC\.E)|(A\.\.DE)|(A\.C\.E)|(AB\.\.E)|(A\.\.\.E))$

So... I'm just wondering if there's a smarter way of constructing a regex pattern that meets these rules, based on an input/test String variable of arbitrary length?
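
One possible direction, offered only as a sketch: instead of enumerating every alternative, each character of the test string can become one level of a nested optional group, so the pattern grows linearly with the length of the test string. The Python below is an assumption on my part (the question does not name a language), `build_pattern` is a hypothetical helper name, and it reads a period in the test string as "any single character", as the "A..D" pattern above does.

```python
import re


def build_pattern(test: str) -> re.Pattern:
    """Build a regex from the test string, to be applied to each list item."""
    if not test:
        raise ValueError("test string must not be empty")

    def char_class(c: str) -> str:
        # A period in the test string accepts any character in the item;
        # a letter accepts itself or a period (the item's wildcard).
        return "." if c == "." else f"[{re.escape(c)}.]"

    # Wrap from the last character inwards, so the item may stop after any
    # character: C1(?:C2(?:C3(?:C4)?)?)?
    body = ""
    for c in reversed(test[1:]):
        body = f"(?:{char_class(c)}{body})?"
    return re.compile(rf"^(?:\*|{char_class(test[0])}{body})$")


pattern = build_pattern("ABCD")
print(pattern.pattern)
print([item for item in ["XA", "A.C", "ABCDE", "*"] if pattern.match(item)])
```

For "ABCD" this builds a pattern with one character class per character rather than one alternative per possible list item. Very long test strings produce deeply nested groups, which some regex engines may reject, so a plain loop over the characters might be the safer choice there.
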
