Python Regex Module Fuzzy Match: Substitution Count Not As Expected
Solution 1:
The issue seems to be related to the error limits set in the fuzzy constraint.
Tightening the substitution limit to s<3 lowers the counts reported in match.fuzzy_counts:
>>> import regex
>>> reference = "(TATGGGA[CT][GC]AAAG[CT]CT[AC]AA[GA]CCATGTG){s<3,i<3,d<3,e<4}"
>>> query = "TATGGACCAAAGTCTCAAGCCATGTG"
>>> match = regex.search(reference, query, regex.BESTMATCH)
>>> print(match.fuzzy_counts)
(1, 0, 1)
Reducing the allowed error for 's' even further returns the expected 's' count for this match:
>>> reference = "(TATGGGA[CT][GC]AAAG[CT]CT[AC]AA[GA]CCATGTG){s<2,i<3,d<3,e<4}"
>>> query = "TATGGACCAAAGTCTCAAGCCATGTG"
>>> match = regex.search(reference, query, regex.BESTMATCH)
>>> print(match.fuzzy_counts)
(0, 0, 1)
Why it behaves in this way is still a mystery to me.
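For reference, the counts one would expect here can be checked without the regex module at all. The sketch below is not how regex computes its costs. It is a plain edit-distance table over the pattern, written only as a sanity check. The helper names parse_pattern and min_edit_counts are made up for this illustration, and the code assumes the pattern contains only literal characters and simple [..] classes and that the whole query is aligned against the whole pattern.

```python
def parse_pattern(pattern):
    """Split a pattern of literals and [..] classes into a list of character sets."""
    items, i = [], 0
    while i < len(pattern):
        if pattern[i] == "[":
            end = pattern.index("]", i)
            items.append(set(pattern[i + 1:end]))
            i = end + 1
        else:
            items.append({pattern[i]})
            i += 1
    return items


def min_edit_counts(pattern, text):
    """Return (substitutions, insertions, deletions) of a cheapest full alignment.

    A deletion means a pattern item has no character in the text;
    an insertion means the text has an extra character.
    """
    items = parse_pattern(pattern)
    # Each cell is (total, subs, ins, dels); tuples compare by total first.
    # Row for zero pattern items: every text character is an insertion.
    prev = [(j, 0, j, 0) for j in range(len(text) + 1)]
    for i, allowed in enumerate(items, start=1):
        # Empty text: every pattern item so far is a deletion.
        cur = [(i, 0, 0, i)]
        for j, ch in enumerate(text, start=1):
            t, s, n, d = prev[j - 1]
            diag = (t, s, n, d) if ch in allowed else (t + 1, s + 1, n, d)
            t, s, n, d = cur[j - 1]
            ins = (t + 1, s, n + 1, d)
            t, s, n, d = prev[j]
            dele = (t + 1, s, n, d + 1)
            cur.append(min(diag, ins, dele))
        prev = cur
    _, subs, ins, dels = prev[-1]
    return (subs, ins, dels)


pattern = "TATGGGA[CT][GC]AAAG[CT]CT[AC]AA[GA]CCATGTG"
query = "TATGGACCAAAGTCTCAAGCCATGTG"
print(min_edit_counts(pattern, query))  # (0, 0, 1)
```

The pattern has 27 positions and the query has 26 characters, so a single deletion (one G missing from the GGG run) is the cheapest alignment, which agrees with the (0, 0, 1) result above.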
Solution 2:
This was caused by what looks to be a bug in the regex module's cost calculations. It was still present up until regex version 2015.10.05, but was fixed in the next version, 2015.10.22, as shown below:
$ sudo pip3 install regex==2015.10.05
Processing /root/.cache/pip/wheels/24/cb/ae/9653e30c8f801544a645e17d26fa6803aeaf76ad0482663c27/regex-2015.10.5-cp38-cp38-linux_x86_64.whl
Installing collected packages: regex
Successfully installed regex-2015.10.5
$ python3 -c 'import regex; reference = "(TATGGGA[CT][GC]AAAG[CT]CT[AC]AA[GA]CCATGTG){s<7,i<3,d<3,e<8}"; query = "TATGGACCAAAGTCTCAAGCCATGTG"; match = regex.search(reference, query, regex.BESTMATCH);print(match.fuzzy_counts)'
(5, 0, 1)
$ sudo pip3 install regex==2015.10.22
Processing /root/.cache/pip/wheels/60/f6/9a/23e723633e62a79064cb301c54a3b50482b8c690f86c9983ee/regex-2015.10.22-cp38-cp38-linux_x86_64.whl
Installing collected packages: regex
Found existing installation: regex 2015.10.5
Uninstalling regex-2015.10.5:
Successfully uninstalled regex-2015.10.5
Successfully installed regex-2015.10.22
$ python3 -c 'import regex; reference = "(TATGGGA[CT][GC]AAAG[CT]CT[AC]AA[GA]CCATGTG){s<7,i<3,d<3,e<8}"; query = "TATGGACCAAAGTCTCAAGCCATGTG"; match = regex.search(reference, query, regex.BESTMATCH);print(match.fuzzy_counts)'
(0, 0, 1)
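To check whether an existing environment already has the fixed release, something like the following could be used. It is only a sketch: parse_release and has_fuzzy_count_fix are hypothetical helper names, the cutoff is simply the 2015.10.22 release tested above, and it assumes release strings are purely numeric and dot-separated. It reads the installed release through importlib.metadata (Python 3.8+) because, as far as I know, regex.__version__ reports the module's internal version rather than the PyPI release.

```python
from importlib import metadata

# First release observed above to return (0, 0, 1) for this pattern.
FIXED_IN = (2015, 10, 22)


def parse_release(version):
    """Turn a release string such as '2015.10.05' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def has_fuzzy_count_fix(version):
    """True if the given release is at or after the release with the fix."""
    return parse_release(version) >= FIXED_IN


try:
    installed = metadata.version("regex")
except metadata.PackageNotFoundError:
    print("regex is not installed")
else:
    status = "includes" if has_fuzzy_count_fix(installed) else "predates"
    print(f"regex {installed} {status} the fix")
```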
Given these dates, I infer that the commit that fixed the bug was https://bitbucket.org/mrabarnett/mrab-regex/commits/296c1daf86619039c6fe55868e7d861097d01aae, with the description:
Hg issue 161: Unexpected fuzzy match results
Fixed the bug and did some related tidying up.
The referenced bug is https://bitbucket.org/mrabarnett/mrab-regex/issues/161.