
Sql To Pandas - Aggregation Over Partition Python

What is the best way to aggregate values over a partition in pandas, as SQL does with OVER (PARTITION BY ...)? SQL: select a.*, b.vol1 / sum(vol1) over ( partition by a.sale, a.d_id, a.month, a.p_id )

Solution 1:

The first step is the join, which mirrors the left join in your SQL code. The same four columns ('sale', 'd_id', 'month', 'p_id') appear in both the join and the window. In SQL, you can define a named window once at the end of the query and reuse it. In Python, you can store the column list in a variable and reuse it, which keeps the code clean. I also set these columns as the index, since the windowing step later groups on the same keys:

# key columns shared by the join and the window
index = ['sale', 'd_id', 'month', 'p_id']

df1 = df1.set_index(index)
df2 = df2.set_index(index)

# left join on the shared index
merged = df1.join(df2, how='left')
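To make the join step concrete, here is a minimal, self-contained sketch. The two frames and all their values are hypothetical sample data (the question's real df1 and df2 are not shown). It only assumes that df1 carries vol2 and df2 carries vol1, as the later steps suggest.

```python
import pandas as pd

index = ['sale', 'd_id', 'month', 'p_id']

# Hypothetical sample data: df1 holds vol2, df2 holds vol1
df1 = pd.DataFrame({
    'sale': [1, 1, 2],
    'd_id': [10, 10, 20],
    'month': [1, 1, 1],
    'p_id': [100, 100, 200],
    'vol2': [5.0, 7.0, 9.0],
})
df2 = pd.DataFrame({
    'sale': [1, 1],
    'd_id': [10, 10],
    'month': [1, 1],
    'p_id': [100, 100],
    'vol1': [2.0, 6.0],
})

# Left join on the shared key columns, used here as the index
merged = df1.set_index(index).join(df2.set_index(index), how='left')
print(merged)
```

Because the keys repeat on both sides, every matching df1 row is paired with every matching df2 row. Keys that exist only in df1 are kept, with vol1 set to NaN.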

Next, group by the index and compute the sum of vol1 per group. SQL's window function returns one value per row rather than one per group, and in pandas transform does the same: it broadcasts each group's sum back to every row of that group:

grouped = merged.groupby(index)
partitioned_sum = grouped.vol1.transform('sum')
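The difference between a plain aggregation and transform can be seen in a small sketch. The frame below and its single 'key' column are hypothetical and stand in for the four partition columns.

```python
import pandas as pd

# Hypothetical data: 'key' stands in for the partition columns
df = pd.DataFrame({
    'key': ['a', 'a', 'b'],
    'vol1': [1.0, 3.0, 5.0],
})

# Plain aggregation: one value per group (like GROUP BY)
per_group = df.groupby('key')['vol1'].sum()

# transform: one value per original row (like SUM(...) OVER (PARTITION BY ...))
per_row = df.groupby('key')['vol1'].transform('sum')

# Each row's share of its partition total
share = df['vol1'] / per_row
print(share)
```

Since per_row has the same index as df, it can be divided into the original column directly, with no merge back onto the frame.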

From here, we can create vol_r (each row's vol1 divided by its partition sum) and vol_t (vol_r multiplied by vol2) via the assign method, and drop the vol1 column:

(merged.assign(vol_r = merged.vol1.div(partitioned_sum), 
               vol_t = lambda df: df.vol_r.mul(df.vol2))
       .drop(columns='vol1')
       .reset_index()
)

     vol2     vol_r     vol_t
0  11.000  0.084653  0.931185
1  11.000  0.087070  0.957766
2  11.000  0.161611  1.777716
3  11.314  0.084653  0.957766
4  11.314  0.087070  0.985106
5  11.314  0.161611  1.828462
6  20.065  0.084653  1.698566
7  20.065  0.087070  1.747052
8  20.065  0.161611  3.242716

(The key columns sale, d_id, month and p_id hold the same values on every row; the source shows their digits run together as 258049.)
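For readers who prefer to keep the keys as ordinary columns, the same steps can be sketched with merge instead of an index join. This is an alternative illustration, not the method used above, and the frames a and b with their values are hypothetical sample data.

```python
import pandas as pd

keys = ['sale', 'd_id', 'month', 'p_id']

# Hypothetical input frames: a carries vol2, b carries vol1
a = pd.DataFrame({
    'sale': [1, 1], 'd_id': [10, 10], 'month': [1, 1], 'p_id': [100, 100],
    'vol2': [10.0, 20.0],
})
b = pd.DataFrame({
    'sale': [1, 1], 'd_id': [10, 10], 'month': [1, 1], 'p_id': [100, 100],
    'vol1': [1.0, 3.0],
})

result = (
    a.merge(b, on=keys, how='left')          # left join on the key columns
     .assign(
         # vol1 divided by its partition total, aligned row by row
         vol_r=lambda df: df['vol1'] / df.groupby(keys)['vol1'].transform('sum'),
         # later keyword arguments in assign can use columns created earlier
         vol_t=lambda df: df['vol_r'] * df['vol2'],
     )
     .drop(columns='vol1')
)
print(result)
```

The partition sum is taken over the merged rows, so when the keys repeat on both sides each vol1 value is counted once per matching row of a. The repeated vol_r values in the output above are consistent with this.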
