
Map Values From Another Dataframe Within Multiple Conditions

Looking to map highest matching row values from Dataframe2 to Dataframe1 using conditions. We also need to check df1['sal_date'] between df2['from'] and df['to'] . Want to compare

Solution 1:

Your code has two main flaws:

  1. Going by your description of the problem (quoted below), whether df1['sal_date'] lies between dte_from and dte_to is the necessary condition and should therefore be checked first. Returning the highest match is the second step. Since you want to force a 1:1 mapping, whether the match is >=80 doesn't matter; you simply return the highest one.

Looking to map highest matching row values from Dataframe2 to Dataframe1 using conditions. We also need to check df1['sal_date'] between df2['from'] and df['to'].

  2. Your code doesn't return the row from df2 with the highest match percentage over 80%; it returns the last one. Every time the condition variable>=80 is met, the current row in df1 is overwritten.

Also, the name for column 1 in df2 is inconsistent: in df2 it's called OR_score with a lowercase s, but in the code it's called OR_Score with a capital S.
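The second flaw is easier to see in isolation. The following is a minimal sketch in plain Python, not the original code: the scores list is made up for illustration and stands for the match percentages of one df1 row against three successive df2 rows.

```python
# Hypothetical match scores of one df1 row against three df2 rows (made-up values).
scores = [85, 95, 82]

# Flawed pattern: overwrite on every score >= 80, so the last qualifying row wins.
last_match = None
for i, s in enumerate(scores):
    if s >= 80:
        last_match = i

# Corrected pattern: overwrite only when the score beats the best one seen so far.
best_match, highest = None, 0
for i, s in enumerate(scores):
    if s > highest:
        best_match, highest = i, s

print(last_match, best_match)  # prints: 2 1
```

The first loop ends on row 2 (score 82) even though row 1 scored 95; the second loop keeps row 1.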

I changed your code a little bit. I added highest_match, which keeps track of the best score (variable) seen so far and only overwrites if the new match's score is higher. It resets for each row of df1.

I don't use >=, so it keeps the first match if variable is equal. If you want to keep your >=80 condition, you can initialize highest_match = 80 (or 79 if a score of exactly 80 should still count, since the comparison is strict). However, this code won't warn you if no match >=80 is found for a row of df1; that row then simply stays as it was.

The code also only computes the match if the date condition is met first.

from fuzzywuzzy import fuzz

for index, row in df1.iterrows():
    # Best score seen so far for this df1 row; reset on every row.
    highest_match = 0
    for index2, config2 in df2.iterrows():
        # Date condition first: sal_date must lie within [dte_from, dte_to].
        cond1 = row['sal_date'] <= config2['dte_to']
        cond2 = row['sal_date'] >= config2['dte_from']
        if cond1 and cond2:
            variable = fuzz.partial_ratio(row['id_number'], config2['identity_No'])
            # Strict > keeps the first match when scores tie.
            if variable > highest_match:
                # .loc writes into df1 itself and avoids chained assignment.
                df1.loc[index, 'id_number'] = config2['identity_No']
                df1.loc[index, 'company_name'] = config2['comp_name']
                df1.loc[index, 'company_code'] = config2['comp_code']
                df1.loc[index, 'score'] = config2['OR_score']
                highest_match = variable

This code is not optimized for time complexity; it just does what you were trying to accomplish, or at least it produces your expected output. Adding the >=80 constraint might improve run time, but then you'll need to add some logic for what should happen if no match is >=80.
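One possible way to handle both points is sketched below. It is not a drop-in replacement. It filters df2 by the date window in a single vectorized step and collects the df1 rows for which no candidate reaches the threshold. The names map_best_matches, similarity, COLUMN_MAP and threshold are hypothetical. The scorer uses difflib from the standard library as a stand-in so the sketch runs without extra packages; pass fuzz.partial_ratio as scorer to get the scoring used above. It also assumes df2 has a unique index.

```python
import difflib

import pandas as pd

# df2 column -> df1 column, following the column names used in the answer.
COLUMN_MAP = {
    'identity_No': 'id_number',
    'comp_name': 'company_name',
    'comp_code': 'company_code',
    'OR_score': 'score',
}


def similarity(a, b):
    # Stand-in scorer (assumption): 0-100 similarity from the standard library.
    return 100 * difflib.SequenceMatcher(None, str(a), str(b)).ratio()


def map_best_matches(df1, df2, threshold=80, scorer=similarity):
    """Copy the best-scoring, date-valid df2 row into each df1 row (in place).

    Returns the index labels of df1 rows that were left unchanged.
    """
    unmatched = []
    for idx, row in df1.iterrows():
        # Date condition first, applied to all df2 rows at once.
        # A missing sal_date compares False everywhere, so the row is skipped.
        in_window = df2[(df2['dte_from'] <= row['sal_date'])
                        & (df2['dte_to'] >= row['sal_date'])]
        if in_window.empty:
            unmatched.append(idx)
            continue
        scores = in_window['identity_No'].map(
            lambda value: scorer(row['id_number'], value))
        if scores.max() < threshold:
            unmatched.append(idx)
            continue
        # idxmax returns the first label with the maximum, so ties keep the first match.
        best = in_window.loc[scores.idxmax()]
        for src, dst in COLUMN_MAP.items():
            df1.loc[idx, dst] = best[src]
    return unmatched
```

The returned list makes it possible to log or inspect the rows that found no match instead of leaving them silently untouched. The scoring still runs once per candidate pair, so this only trims the work done on rows outside the date window.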

Please add your code of how the tables are created as well the next time and not just the output. That makes recreating your problem much easier and more people would be willing to help, thanks.

EDIT:

If you want to keep rows with missing sal_date, simply skip them:

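import pandas as pd
from fuzzywuzzy import fuzz

for index, row in df1.iterrows():
    # Leave rows without a sal_date untouched.
    if pd.isna(row['sal_date']):
        continue
    # Best score seen so far for this df1 row; reset on every row.
    highest_match = 0
    for index2, config2 in df2.iterrows():
        # Date condition first: sal_date must lie within [dte_from, dte_to].
        cond1 = row['sal_date'] <= config2['dte_to']
        cond2 = row['sal_date'] >= config2['dte_from']
        if cond1 and cond2:
            variable = fuzz.partial_ratio(row['id_number'], config2['identity_No'])
            # Strict > keeps the first match when scores tie.
            if variable > highest_match:
                # .loc writes into df1 itself and avoids chained assignment.
                df1.loc[index, 'id_number'] = config2['identity_No']
                df1.loc[index, 'company_name'] = config2['comp_name']
                df1.loc[index, 'company_code'] = config2['comp_code']
                df1.loc[index, 'score'] = config2['OR_score']
                highest_match = variable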
