Map Values From Another Dataframe Within Multiple Conditions
Solution 1:
Your code has two main flaws:
- Going by your description of the problem (below), whether or not df1['sal_date'] is between dte_from and dte_to is the necessary condition and thus should be checked first. The second step is returning the highest possible match. Since you want to force a 1:1 mapping, whether the match is >=80 doesn't matter; you simply return the highest one.
Looking to map highest matching row values from Dataframe2 to Dataframe1 using conditions. We also need to check df1['sal_date'] between df2['from'] and df['to'].
- Your code doesn't really return the row from df2 with the highest match percentage over 80%; it returns the last one. Every time the condition variable >= 80 is met, the current row in df1 is overwritten.
Also, the name of column 1 in df2 is inconsistent: in df2 it's called OR_score with a lowercase s, but in the code it's called OR_Score with a capital S.
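To make the second flaw concrete, here is a minimal sketch in plain Python. The scores are made-up values standing in for the match percentages of one df1 row against four df2 rows; nothing here comes from the original data.

```python
# Hypothetical match scores for one df1 row against four df2 rows, in order.
scores = [85, 95, 60, 82]

# Original logic: overwrite whenever the score clears the threshold,
# so the last qualifying row wins.
last_over_threshold = None
for position, score in enumerate(scores):
    if score >= 80:
        last_over_threshold = position

# Revised logic: overwrite only when the score beats the best seen so far,
# so the highest-scoring row wins.
best_position, highest_match = None, 0
for position, score in enumerate(scores):
    if score > highest_match:
        best_position, highest_match = position, score

print(last_over_threshold, best_position)
```

The first loop ends on the row that scored 82, while the second ends on the row that scored 95.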
I changed your code a little bit. I added highest_match, which keeps track of what the variable of the highest match was and only overwrites if the new match's variable is higher than the highest match. This resets for each row of df1.
I don't use >=, thus it keeps the first match if variable is equal. If you want to keep your >=80 condition, you can initialize highest_match = 80 (or 79 if a score of exactly 80 should still count, since the comparison is a strict >); however, this code won't warn you if, for some row of df1, no match >=80 is found, and the row thus just stays as it was.
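If you do keep a threshold, one way to notice rows that found no match is to have the selection step return nothing for them and collect those rows afterwards. This is only a sketch: pick_best and scores_per_row are hypothetical names, and the scores are invented.

```python
def pick_best(scores, threshold=80):
    """Return the position of the first highest score >= threshold, or None."""
    best_position, highest_match = None, None
    for position, score in enumerate(scores):
        if score < threshold:
            continue  # below the threshold, never a candidate
        # Strict > keeps the first of several equally good matches.
        if highest_match is None or score > highest_match:
            best_position, highest_match = position, score
    return best_position


# Hypothetical scores per df1 row index (candidates already date-filtered).
scores_per_row = {0: [70, 80, 80], 1: [10, 79]}

# Rows for which no candidate reached the threshold.
unmatched = [row_id for row_id, scores in scores_per_row.items()
             if pick_best(scores) is None]
print(unmatched)
```

You could then log the unmatched rows or handle them separately, instead of silently leaving them unchanged.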
The code also only proceeds if the date condition is met first.
from fuzzywuzzy import fuzz

for index, row in df1.iterrows():
    highest_match = 0  # best score seen so far for this df1 row
    for index2, config2 in df2.iterrows():
        # Necessary condition: sal_date must lie within [dte_from, dte_to]
        cond1 = row['sal_date'] <= config2['dte_to']
        cond2 = row['sal_date'] >= config2['dte_from']
        if cond1 and cond2:
            variable = fuzz.partial_ratio(row['id_number'], config2['identity_No'])
            # Strict > keeps the first of several equally good matches
            if variable > highest_match:
                # .loc writes into df1 directly and avoids chained assignment
                df1.loc[index, 'id_number'] = config2['identity_No']
                df1.loc[index, 'company_name'] = config2['comp_name']
                df1.loc[index, 'company_code'] = config2['comp_code']
                df1.loc[index, 'score'] = config2['OR_score']
                highest_match = variable
This code is not optimized for time complexity; it just does what you were trying to accomplish, or at least it produces your expected output. Adding the >=80 constraint might improve time, but then you'll need to add some logic for what should happen if no match is >=80.
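If the nested loops become too slow, one possible alternative is to pair the rows with a cross join, filter on the date range in one vectorized step, and then pick the best score per df1 row. The sketch below is an assumption-laden illustration, not a drop-in replacement: the sample data is invented, score is a stand-in built on the standard library's difflib so the block runs without fuzzywuzzy, and how="cross" requires pandas 1.2 or newer.

```python
from difflib import SequenceMatcher

import pandas as pd


def score(a, b):
    # Stand-in scorer (assumption); swap in fuzz.partial_ratio if you use fuzzywuzzy.
    return round(100 * SequenceMatcher(None, str(a), str(b)).ratio())


# Hypothetical sample data using the column names from the answer.
df1 = pd.DataFrame({
    "id_number": ["AB123", "XY999"],
    "sal_date": pd.to_datetime(["2021-03-15", "2021-08-01"]),
})
df2 = pd.DataFrame({
    "identity_No": ["AB123", "AB124", "XY999"],
    "comp_name": ["Acme", "Beta", "Gamma"],
    "dte_from": pd.to_datetime(["2021-01-01"] * 3),
    "dte_to": pd.to_datetime(["2021-06-30"] * 3),
})

# Pair every df1 row with every df2 row, keeping df1's index as a column.
pairs = df1.reset_index().merge(df2, how="cross")

# Keep only the pairs whose sal_date lies within [dte_from, dte_to].
in_range = pairs["sal_date"].between(pairs["dte_from"], pairs["dte_to"])
pairs = pairs[in_range].copy()

# Score the remaining pairs and keep the first best-scoring df2 row per df1 row.
pairs["match"] = [score(a, b) for a, b in zip(pairs["id_number"], pairs["identity_No"])]
best = pairs.loc[pairs.groupby("index")["match"].idxmax()].set_index("index")
print(best[["identity_No", "comp_name", "match"]])
```

Note that the cross join materializes len(df1) * len(df2) rows, so it trades memory for fewer Python-level iterations. df1 rows with no df2 row in the date range simply do not appear in best.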
Next time, please also include the code that creates the tables, not just the output. That makes recreating your problem much easier, and more people would be willing to help. Thanks.
EDIT:
If you want to keep rows with a missing sal_date, simply skip them:
import pandas as pd
from fuzzywuzzy import fuzz

for index, row in df1.iterrows():
    # Leave rows without a sal_date untouched
    if pd.isna(row['sal_date']):
        continue
    highest_match = 0  # best score seen so far for this df1 row
    for index2, config2 in df2.iterrows():
        # Necessary condition: sal_date must lie within [dte_from, dte_to]
        cond1 = row['sal_date'] <= config2['dte_to']
        cond2 = row['sal_date'] >= config2['dte_from']
        if cond1 and cond2:
            variable = fuzz.partial_ratio(row['id_number'], config2['identity_No'])
            # Strict > keeps the first of several equally good matches
            if variable > highest_match:
                # .loc writes into df1 directly and avoids chained assignment
                df1.loc[index, 'id_number'] = config2['identity_No']
                df1.loc[index, 'company_name'] = config2['comp_name']
                df1.loc[index, 'company_code'] = config2['comp_code']
                df1.loc[index, 'score'] = config2['OR_score']
                highest_match = variable