Sql To Pandas - Aggregation Over Partition Python
What is the best way to aggregate values over a partition in pandas, the way SQL's over (partition by ...) does? SQL: select a.*, b.vol1 / sum(vol1) over ( partition by a.sale, a.d_id, a.month, a.p_id )
Solution 1:
The first part is the join, which mirrors the left join in your SQL code. One thing I noticed is that four columns are repeated in your code, in both the join and the window: 'sale', 'd_id', 'month', 'p_id'. In SQL, you can define a named window at the end of the query and reuse it. In Python, you can store the column names in a variable and reuse that, which keeps the code clean. I also set these columns as the index, since they are needed again later for the windowing operation:
index = ['sale', 'd_id', 'month', 'p_id']
df1 = df1.set_index(index)
df2 = df2.set_index(index)
merged = df1.join(df2, how='left')
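To make the join step concrete, here is a minimal sketch on made-up data (the values in df1 and df2 below are hypothetical, not the asker's). It assumes both frames share the four key columns and shows that duplicate keys on both sides produce one row per matching pair, while keys missing from df2 are kept with NaN, as in a SQL left join:

```python
import pandas as pd

index = ['sale', 'd_id', 'month', 'p_id']

# Hypothetical sample frames; only the column names come from the question.
df1 = pd.DataFrame({'sale': [1, 1, 2], 'd_id': [10, 10, 20],
                    'month': [1, 1, 1], 'p_id': [5, 5, 7],
                    'vol2': [11.0, 20.0, 3.0]})
df2 = pd.DataFrame({'sale': [1, 1], 'd_id': [10, 10],
                    'month': [1, 1], 'p_id': [5, 5],
                    'vol1': [2.0, 6.0]})

# Index-based left join, as in the answer.
merged = df1.set_index(index).join(df2.set_index(index), how='left')

# Column-based equivalent, for comparison.
merged_alt = pd.merge(df1, df2, on=index, how='left')

print(merged)
```

The key (1, 10, 1, 5) appears twice in each frame, so it contributes four rows; the key (2, 20, 1, 7) has no match in df2 and keeps a single row with a missing vol1.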
Next, group by the index and compute the sum of vol1 per group. Since the aggregate has to be aligned to each row, as a SQL window function does, the pandas transform method is the right tool:
grouped = merged.groupby(index)
partitioned_sum = grouped.vol1.transform('sum')
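The difference between a plain aggregation and transform is easiest to see on a tiny frame. The sketch below uses a hypothetical single key column named key; sum() collapses each group to one row, whereas transform('sum') returns a value for every original row, which is what sum(...) over (partition by ...) does in SQL:

```python
import pandas as pd

# Hypothetical data: two rows in group 'a', one row in group 'b'.
df = pd.DataFrame({'key': ['a', 'a', 'b'], 'vol1': [2.0, 6.0, 5.0]})

# One row per group (like GROUP BY).
per_group = df.groupby('key')['vol1'].sum()

# One value per original row (like SUM(...) OVER (PARTITION BY key)).
per_row = df.groupby('key')['vol1'].transform('sum')

# Each row's share of its partition total.
share = df['vol1'] / per_row

print(per_row.tolist())  # [8.0, 8.0, 5.0]
print(share.tolist())    # [0.25, 0.75, 1.0]
```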
From here, we can create vol_r and vol_t via the assign method, and drop the vol1 column:
(merged.assign(vol_r = merged.vol1.div(partitioned_sum),
               vol_t = lambda df: df.vol_r.mul(df.vol2))
       .drop(columns='vol1')
       .reset_index()
)
   sale d_id month p_id    vol2     vol_r     vol_t
0                258049  11.000  0.084653  0.931185
1                258049  11.000  0.087070  0.957766
2                258049  11.000  0.161611  1.777716
3                258049  11.314  0.084653  0.957766
4                258049  11.314  0.087070  0.985106
5                258049  11.314  0.161611  1.828462
6                258049  20.065  0.084653  1.698566
7                258049  20.065  0.087070  1.747052
8                258049  20.065  0.161611  3.242716
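Putting the three steps together, the whole pipeline can be sketched as one function. The function name partition_share and the sample values are hypothetical; the steps (left join on the keys, transform('sum') over the partition, then assign and drop) follow the answer above. Note that, as in the answer, the partition sum is taken over the joined rows, so vol_r adds up to 1 within each partition:

```python
import pandas as pd

def partition_share(df1, df2, keys):
    """Left-join df2 onto df1 on `keys`, then add vol_r and vol_t.

    vol_r = vol1 / sum(vol1) over (partition by keys)
    vol_t = vol_r * vol2
    """
    merged = df1.set_index(keys).join(df2.set_index(keys), how='left')
    # Row-aligned partition total, computed on the joined rows.
    partitioned_sum = merged.groupby(keys)['vol1'].transform('sum')
    return (merged.assign(vol_r=merged['vol1'].div(partitioned_sum),
                          vol_t=lambda df: df['vol_r'].mul(df['vol2']))
                  .drop(columns='vol1')
                  .reset_index())

keys = ['sale', 'd_id', 'month', 'p_id']

# Hypothetical sample data with a single partition.
df1 = pd.DataFrame({'sale': [1, 1], 'd_id': [10, 10], 'month': [1, 1],
                    'p_id': [5, 5], 'vol2': [10.0, 20.0]})
df2 = pd.DataFrame({'sale': [1, 1], 'd_id': [10, 10], 'month': [1, 1],
                    'p_id': [5, 5], 'vol1': [1.0, 3.0]})

result = partition_share(df1, df2, keys)
print(result)
```

With these inputs the join yields four rows whose vol1 values total 8, so vol_r takes the values 0.125 and 0.375, each appearing once per vol2 value.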