
Python Regex Module Fuzzy Match: Substitution Count Not As Expected

Background The Python module regex allows fuzzy matching. You can specify the allowable number of substitutions (s), insertions (i), deletions (d), and total errors (e). The fuzzy_

Solution 1:

The behaviour appears to depend on the error limits set in the fuzzy constraint.

Lowering the substitution limit to s<3 lowers the counts reported in the fuzzy_counts tuple:

>>> import regex
>>> reference = "(TATGGGA[CT][GC]AAAG[CT]CT[AC]AA[GA]CCATGTG){s<3,i<3,d<3,e<4}"
>>> query = "TATGGACCAAAGTCTCAAGCCATGTG"
>>> match = regex.search(reference, query, regex.BESTMATCH)
>>> print(match.fuzzy_counts)
(1, 0, 1)

Lowering the substitution limit further, to s<2, returns the expected substitution count for this match:

>>> import regex
>>> reference = "(TATGGGA[CT][GC]AAAG[CT]CT[AC]AA[GA]CCATGTG){s<2,i<3,d<3,e<4}"
>>> query = "TATGGACCAAAGTCTCAAGCCATGTG"
>>> match = regex.search(reference, query, regex.BESTMATCH)
>>> print(match.fuzzy_counts)
(0, 0, 1)

Why it behaves in this way is still a mystery to me.
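To repeat this comparison without retyping the pattern, the fuzzy constraint can be built from its limits. The following is only a sketch: the helper names fuzzy_pattern and counts_for_limits are made up for illustration, and the (s, e) pairs are the ones tried in this thread. It assumes the third-party regex module is installed, and the counts it prints depend on the installed regex version.

```python
CORE = "TATGGGA[CT][GC]AAAG[CT]CT[AC]AA[GA]CCATGTG"
QUERY = "TATGGACCAAAGTCTCAAGCCATGTG"


def fuzzy_pattern(core, s, i, d, e):
    """Wrap `core` in a group with a fuzzy constraint (hypothetical helper)."""
    return "(%s){s<%d,i<%d,d<%d,e<%d}" % (core, s, i, d, e)


def counts_for_limits(pairs, i=3, d=3):
    """Return {(s, e): fuzzy_counts or None} for each (s, e) pair."""
    import regex  # third-party module; imported here so the helper above works without it

    results = {}
    for s, e in pairs:
        match = regex.search(fuzzy_pattern(CORE, s, i, d, e), QUERY, regex.BESTMATCH)
        # fuzzy_counts is (substitutions, insertions, deletions)
        results[(s, e)] = match.fuzzy_counts if match else None
    return results


if __name__ == "__main__":
    try:
        for limits, counts in counts_for_limits([(2, 4), (3, 4), (7, 8)]).items():
            print(limits, counts)
    except ImportError:
        print("regex module is not installed")
```

Printing the counts side by side makes it easier to see whether the reported substitution count changes with the limit alone.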

Solution 2:

This was caused by what looks to be a bug in the regex module's cost calculations. It was still present up until regex version 2015.10.05, but was fixed in the next version, 2015.10.22, as shown below:

$ sudo pip3 install regex==2015.10.05
Processing /root/.cache/pip/wheels/24/cb/ae/9653e30c8f801544a645e17d26fa6803aeaf76ad0482663c27/regex-2015.10.5-cp38-cp38-linux_x86_64.whl
Installing collected packages: regex
Successfully installed regex-2015.10.5
$ python3 -c 'import regex; reference = "(TATGGGA[CT][GC]AAAG[CT]CT[AC]AA[GA]CCATGTG){s<7,i<3,d<3,e<8}"; query = "TATGGACCAAAGTCTCAAGCCATGTG"; match = regex.search(reference, query, regex.BESTMATCH);print(match.fuzzy_counts)'
(5, 0, 1)
$ sudo pip3 install regex==2015.10.22
Processing /root/.cache/pip/wheels/60/f6/9a/23e723633e62a79064cb301c54a3b50482b8c690f86c9983ee/regex-2015.10.22-cp38-cp38-linux_x86_64.whl
Installing collected packages: regex
  Found existing installation: regex 2015.10.5
    Uninstalling regex-2015.10.5:
      Successfully uninstalled regex-2015.10.5
Successfully installed regex-2015.10.22
$ python3 -c 'import regex; reference = "(TATGGGA[CT][GC]AAAG[CT]CT[AC]AA[GA]CCATGTG){s<7,i<3,d<3,e<8}"; query = "TATGGACCAAAGTCTCAAGCCATGTG"; match = regex.search(reference, query, regex.BESTMATCH);print(match.fuzzy_counts)'
(0, 0, 1)

Given these dates, I infer that the commit that fixed the bug was https://bitbucket.org/mrabarnett/mrab-regex/commits/296c1daf86619039c6fe55868e7d861097d01aae, whose description reads:

Hg issue 161: Unexpected fuzzy match results

Fixed the bug and did some related tidying up.

The referenced bug is https://bitbucket.org/mrabarnett/mrab-regex/issues/161.
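One way to check whether an installed copy of regex is at least the release reported above as fixed is to compare its package version against 2015.10.22. This is a rough sketch: parse_version, has_fix and check_installed are hypothetical helper names, and it assumes regex keeps its date-style version numbers. It reads the version from the package metadata (Python 3.8+) instead of importing the module.

```python
from importlib import metadata

# First release that returned (0, 0, 1) in the transcript above.
FIXED_IN = (2015, 10, 22)


def parse_version(text):
    """Turn a date-style version such as '2015.10.22' into (2015, 10, 22)."""
    return tuple(int(part) for part in text.split(".")[:3])


def has_fix(version_text):
    """True if the given version is 2015.10.22 or later."""
    return parse_version(version_text) >= FIXED_IN


def check_installed():
    """Return True/False for the installed regex, or None if it is not installed."""
    try:
        installed = metadata.version("regex")
    except metadata.PackageNotFoundError:
        return None
    return has_fix(installed)


if __name__ == "__main__":
    print(check_installed())
```

A version check only says which release is installed; running the query from this page remains the direct test of the behaviour.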
