
Aggregations Over Specific Columns Of A Large Dataframe, With Named Output

I am looking for a way to aggregate over a large dataframe, possibly using groupby. Each group would be based on either pre-specified columns or regex, and the aggregation should p

Solution 1:

This is not a groupby solution and it uses a loop, but I think it's nonetheless rather elegant: first build a sorted list of the unique (prefix, suffix) combinations found in the column names using a set, then compute each sum with filter:

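# Unique (prefix, suffix) pairs taken from the first and last parts of each column name
cols = sorted({(x.split('.')[0], x.split('.')[-1]) for x in df.columns})
for c0, c1 in cols:
    # Row-wise sum of all columns named '<prefix>.<number>.<suffix>'
    df[f'{c0}.SUM.{c1}'] = df.filter(regex=rf'{c0}\.\d+\.{c1}').sum(axis=1)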

Result:

            A.1.E  A.1.F  A.1.G  A.2.E  ...  B.SUM.G  C.SUM.E  C.SUM.F  C.SUM.G
2018-08-31    978    746    408    109  ...     4061     5413     4102     4908
2018-09-30    923    649    488    447  ...     5585     3634     3857     4228
2018-10-31    911    359    897    425  ...     5039     2961     5246     4126
2018-11-30     77    479    536    509  ...     4634     4325     2975     4249
2018-12-31    608    995    114    603  ...     5377     5277     4509     3499
2019-01-31    138    612    363    218  ...     4514     5088     4599     4835
2019-02-28    994    148    933    990  ...     3907     4310     3906     3552
2019-03-31    950    931    209    915  ...     4354     5877     4677     5557
2019-04-30    255    168    357    800  ...     5267     5200     3689     5001
2019-05-31    593    594    824    986  ...     4221     2108     4636     3606
2019-06-30    975    396    919    242  ...     3841     4787     4556     3141
2019-07-31    350    312    104    113  ...     4071     5073     4829     3717

If you want to have the result in a new DataFrame, just create an empty one and add the columns to it:

result = pd.DataFrame()
for c0, c1 in cols:
    result[f'{c0}.SUM.{c1}'] = df.filter(regex=rf'{c0}\.\d+\.{c1}').sum(axis=1)
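
For reference, here is a minimal, self-contained sketch of the loop approach on made-up data. The sample column names, the random values and the variable names `pairs` and `sums` are assumptions for illustration only, not part of the original question. It also anchors the regex and escapes the name parts with `re.escape`, and collects the sums in a dict before building the frame once, which may be preferable on a wide frame to adding columns one at a time.

```python
import re

import numpy as np
import pandas as pd

# Hypothetical sample data: column names follow '<prefix>.<number>.<suffix>'.
rng = np.random.default_rng(0)
columns = [f"{p}.{n}.{s}" for p in "AB" for n in (1, 2, 3) for s in "EF"]
index = pd.to_datetime(["2018-08-31", "2018-09-30", "2018-10-31"])
df = pd.DataFrame(
    rng.integers(0, 1000, size=(len(index), len(columns))),
    index=index,
    columns=columns,
)

# Unique (prefix, suffix) pairs, e.g. ('A', 'E'), ('A', 'F'), ...
pairs = sorted({(c.split(".")[0], c.split(".")[-1]) for c in df.columns})

# One row-wise sum per pair; the anchored, escaped regex only matches
# columns of the exact form '<prefix>.<digits>.<suffix>'.
sums = {
    f"{p}.SUM.{s}": df.filter(
        regex=rf"^{re.escape(p)}\.\d+\.{re.escape(s)}$"
    ).sum(axis=1)
    for p, s in pairs
}

# Build the result frame in one step from the dict of Series.
result = pd.DataFrame(sums)
print(result)
```

Here `result` has one column per (prefix, suffix) pair and keeps the original index.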

Update: a plain groupby with a function that maps each column name to its output name is even simpler in this particular case:

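def grouper(col):
    # Map e.g. 'A.1.E' to 'A.SUM.E' so columns sharing prefix and suffix fall into one group
    c = col.split('.')
    return f'{c[0]}.SUM.{c[-1]}'

# Note: groupby(..., axis=1) is deprecated since pandas 2.1; there, use df.T.groupby(grouper).sum().T
df.groupby(grouper, axis=1).sum()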
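
As a rough sketch of the same idea without the `axis=1` argument, the frame can be transposed, grouped by its row labels, and transposed back. The small sample frame below is invented for illustration, and this assumes all columns share one numeric dtype (transposing a mixed-dtype frame would upcast the values to object).

```python
import numpy as np
import pandas as pd


def grouper(col):
    # Map e.g. 'A.1.E' to 'A.SUM.E'
    parts = col.split(".")
    return f"{parts[0]}.SUM.{parts[-1]}"


# Hypothetical sample data with two rows and six columns.
df = pd.DataFrame(
    np.arange(12).reshape(2, 6),
    columns=["A.1.E", "A.2.E", "A.1.F", "A.2.F", "B.1.E", "B.2.E"],
)

# Transpose so the column names become the index, group the index labels
# with the function, sum each group, then transpose back.
grouped = df.T.groupby(grouper).sum().T
print(grouped)
```

The output columns come out sorted by group name, which matches the naming scheme of the loop version.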
