Map Values From Another Dataframe Within Multiple Conditions
Solution 1:
Your code has two main flaws:
- Going by your description of the problem (below), whether df1['sal_date'] is between dte_from and dte_to is the necessary condition and thus should be checked first. The second step is returning the highest possible match. Since you want to force a 1:1 mapping, whether the match is >= 80 doesn't matter; you simply return the highest one.
Looking to map highest matching row values from Dataframe2 to Dataframe1 using conditions. We also need to check df1['sal_date'] between df2['from'] and df['to'].
- Your code doesn't really return the row from df2 with the highest match percentage over 80%; it returns the last one. Every time the condition variable >= 80 is met, the current row in df1 is overwritten.
Also, the name of column 1 in df2 is inconsistent: in df2 it's called OR_score with a lowercase s, but in the code it's called OR_Score with a capital S.
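To make the second flaw concrete, here is a minimal sketch with made-up scores (the candidate names and numbers are hypothetical, not taken from your data). It compares "overwrite on every score >= 80" with "overwrite only when the score beats the best one so far":

```python
# Hypothetical (name, score) pairs for three candidate rows of df2, in iteration order.
candidates = [("A", 95), ("B", 82), ("C", 88)]

# Last-match-wins: every candidate scoring >= 80 overwrites the previous one.
last_match = None
for name, score in candidates:
    if score >= 80:
        last_match = name

# Best-match-wins: overwrite only when the score beats the best seen so far.
best_match, highest_match = None, 0
for name, score in candidates:
    if score > highest_match:
        best_match, highest_match = name, score

print(last_match, best_match)  # C A
```

The first loop ends on C simply because C comes last, even though A scored higher.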
I changed your code a little. I added highest_match, which keeps track of the score (variable) of the best match so far and only overwrites when a new match's score is higher. It resets for each row of df1.
I don't use >=, so the first match is kept when scores are equal. If you want to keep your >=80 condition, you can initialize highest_match = 80 (because the comparison is a strict >, a score of exactly 80 is then rejected; initialize to 79 to include it). However, this code won't warn you if no match >=80 is found for a row of df1, and that row then just stays as it was.
The code also only proceeds if the date condition is met first.
from fuzzywuzzy import fuzz

for index, row in df1.iterrows():
    highest_match = 0  # best score so far; resets for each row of df1
    for index2, config2 in df2.iterrows():
        # Date condition first: sal_date must fall within [dte_from, dte_to]
        cond1 = row['sal_date'] <= config2['dte_to']
        cond2 = row['sal_date'] >= config2['dte_from']
        if cond1 and cond2:
            variable = fuzz.partial_ratio(row['id_number'], config2['identity_No'])
            if variable > highest_match:  # strict >, so the first of equal scores wins
                # .loc writes to df1 directly and avoids chained assignment
                df1.loc[index, 'id_number'] = config2['identity_No']
                df1.loc[index, 'company_name'] = config2['comp_name']
                df1.loc[index, 'company_code'] = config2['comp_code']
                df1.loc[index, 'score'] = config2['OR_score']
                highest_match = variable
This code is not optimized for time complexity; it just does what you were trying to accomplish, or at least it produces your expected output. Adding the >=80 constraint might improve the running time, but then you'll need to add some logic for what should happen if no match is >=80.
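One possible way to handle the "no match >=80" case is sketched below, under a few assumptions. The function map_best_matches, the min_score parameter and the COLUMN_MAP dict are names I made up for illustration. The similarity helper is a standard-library stand-in and is not equivalent to fuzz.partial_ratio; pass your own scorer to get the original behaviour. The sketch also assumes df2 has a unique index, filters df2 by date with a boolean mask instead of an inner Python loop, and returns the unmatched rows so you can decide what to do with them:

```python
from difflib import SequenceMatcher

import pandas as pd

# Target column in df1 -> source column in df2 (names taken from the loop above).
COLUMN_MAP = {
    "id_number": "identity_No",
    "company_name": "comp_name",
    "company_code": "comp_code",
    "score": "OR_score",
}


def similarity(a, b):
    """Stand-in scorer on a 0-100 scale; NOT the same as fuzz.partial_ratio."""
    return round(100 * SequenceMatcher(None, str(a), str(b)).ratio())


def map_best_matches(df1, df2, scorer=similarity, min_score=80):
    """Return a mapped copy of df1 plus the index labels that found no match."""
    out = df1.copy()
    unmatched = []
    for idx, row in df1.iterrows():
        # Select the date-valid rows of df2 with one vectorized mask.
        mask = (df2["dte_from"] <= row["sal_date"]) & (df2["dte_to"] >= row["sal_date"])
        candidates = df2[mask]
        if candidates.empty:
            unmatched.append(idx)
            continue
        scores = pd.Series(
            [scorer(row["id_number"], value) for value in candidates["identity_No"]],
            index=candidates.index,
        )
        if scores.max() < min_score:
            unmatched.append(idx)  # report the row instead of silently skipping it
            continue
        best = candidates.loc[scores.idxmax()]  # first row wins on equal scores
        for target, source in COLUMN_MAP.items():
            out.at[idx, target] = best[source]
    return out, unmatched


# Hypothetical sample data, only to show the call.
df1 = pd.DataFrame({
    "id_number": ["AB1234", "ZZ9999"],
    "sal_date": pd.to_datetime(["2021-03-15", "2021-03-15"]),
    "company_name": ["", ""],
    "company_code": ["", ""],
    "score": [0.0, 0.0],
})
df2 = pd.DataFrame({
    "identity_No": ["AB1234", "AB1235", "AB1234"],
    "dte_from": pd.to_datetime(["2021-01-01", "2021-01-01", "2022-01-01"]),
    "dte_to": pd.to_datetime(["2021-12-31", "2021-12-31", "2022-12-31"]),
    "comp_name": ["Acme", "Bolt", "Late"],
    "comp_code": ["C1", "C2", "C3"],
    "OR_score": [0.9, 0.5, 0.1],
})
mapped, unmatched = map_best_matches(df1, df2)
```

In this sample the first row is mapped to the Acme row and the second row is returned in unmatched. The outer loop over df1 is still a Python loop, so this only trims the inner work; it is not a fully vectorized solution.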
Next time, please also add the code that creates the tables, not just the output. That makes recreating your problem much easier, and more people will be willing to help. Thanks.
EDIT:
If you want to keep rows with a missing sal_date, simply skip them:
import pandas as pd
from fuzzywuzzy import fuzz

for index, row in df1.iterrows():
    # Leave rows without a sal_date untouched
    if pd.isna(row['sal_date']):
        continue
    highest_match = 0  # best score so far; resets for each row of df1
    for index2, config2 in df2.iterrows():
        # Date condition first: sal_date must fall within [dte_from, dte_to]
        cond1 = row['sal_date'] <= config2['dte_to']
        cond2 = row['sal_date'] >= config2['dte_from']
        if cond1 and cond2:
            variable = fuzz.partial_ratio(row['id_number'], config2['identity_No'])
            if variable > highest_match:  # strict >, so the first of equal scores wins
                # .loc writes to df1 directly and avoids chained assignment
                df1.loc[index, 'id_number'] = config2['identity_No']
                df1.loc[index, 'company_name'] = config2['comp_name']
                df1.loc[index, 'company_code'] = config2['comp_code']
                df1.loc[index, 'score'] = config2['OR_score']
                highest_match = variable
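As a variation on the skip, you could split df1 up front with notna() instead of testing each row inside the loop. This is only a sketch: the sample frame and the variable names has_date, rows_to_match and rows_kept_as_is are hypothetical, and it assumes sal_date is a datetime column, where missing values are stored as NaT.

```python
import pandas as pd

# Hypothetical sample: the second row has no sal_date.
df1 = pd.DataFrame({
    "id_number": ["AB1234", "CD5678"],
    "sal_date": pd.to_datetime(["2021-03-15", None]),
})

has_date = df1["sal_date"].notna()
rows_to_match = df1[has_date]     # iterate over these with iterrows()
rows_kept_as_is = df1[~has_date]  # these stay unchanged

# A comparison against NaT is False, so a missing date can never
# satisfy the date-range condition anyway.
missing = df1.loc[1, "sal_date"]
in_range = missing >= pd.Timestamp("2021-01-01")
```

Because the filtered frame keeps the original index labels, df1.loc[index, ...] assignments inside the loop still target the right rows of df1.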