
Rolling Average With Window Size An Interval Of Column Values

I'm trying to calculate a rolling average on some incomplete data. I want to average values in column 2 across windows of size 1.0 of the value in column 1 (miles). I've tried .rol

Solution 1:

pandas' BaseIndexer is quite handy, although it takes a little bit of head-scratching to get it right.

In the following, I use np.searchsorted to quickly find the indices (start, end) of each window:

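import numpy as np
import pandas as pd
from pandas.api.indexers import BaseIndexer

class RangeWindow(BaseIndexer):
    def __init__(self, val, width):
        super().__init__()  # set the base attributes (e.g. window_size) that pandas may read
        self.val = val.values  # must be sorted ascending for np.searchsorted
        self.width = width

    # Note: pandas >= 1.5 requires `step` in this signature; drop it on older versions.
    def get_window_bounds(self, num_values, min_periods, center, closed, step=None):
        if min_periods is None:
            min_periods = 0
        if closed is None:
            closed = 'left'
        # window offsets relative to each value: centered, or starting at the value
        w = (-self.width/2, self.width/2) if center else (0, self.width)
        # the searchsorted side decides whether an endpoint is included
        side0 = 'left' if closed in ['left', 'both'] else 'right'
        side1 = 'right' if closed in ['right', 'both'] else 'left'
        ix0 = np.searchsorted(self.val, self.val + w[0], side=side0)
        ix1 = np.searchsorted(self.val, self.val + w[1], side=side1)
        ix1 = np.maximum(ix1, ix0 + min_periods)

        return ix0, ix1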

Some deluxe options: min_periods, center, and closed are implemented to follow what DataFrame.rolling specifies.
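To illustrate how the side argument of np.searchsorted maps to the closed option, here is a minimal sketch on made-up data (the array a below is hypothetical and chosen so that window endpoints land exactly on data points). It only counts how many points fall in each centered window, it is not part of the answer's class:

```python
import numpy as np

# Hypothetical sorted positions, spaced so that window edges hit data points exactly.
a = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
width = 1.0

# closed='both': [x - width/2, x + width/2], endpoints included
start_both = np.searchsorted(a, a - width / 2, side="left")
end_both = np.searchsorted(a, a + width / 2, side="right")

# closed='neither': (x - width/2, x + width/2), endpoints excluded
start_open = np.searchsorted(a, a - width / 2, side="right")
end_open = np.searchsorted(a, a + width / 2, side="left")

print(end_both - start_both)  # number of points per closed window
print(end_open - start_open)  # number of points per open window
```

With closed endpoints each interior window picks up its two neighbours; with open endpoints only the point itself remains.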

Application:

df = pd.DataFrame([
        [4.5, 10],
        [4.6, 11],
        [4.8, 9],
        [5.5, 6],
        [5.6, 6],
        [8.1, 10],
        [8.2, 13]
    ], columns='a b'.split())

df.b.rolling(RangeWindow(df.a, width=1.0), center=True, closed='both').mean()

# gives:
0    10.0
1    10.0
2    10.0
3     6.0
4     6.0
5    11.5
6    11.5
Name: b, dtype: float64
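As a sanity check on the numbers above, a brute-force reference can be written with boolean masks. This is only a sketch (the function name brute_force_centered_mean is made up here); it is O(n^2) and meant for verifying small cases, not for real data:

```python
import numpy as np

a = np.array([4.5, 4.6, 4.8, 5.5, 5.6, 8.1, 8.2])
b = np.array([10, 11, 9, 6, 6, 10, 13], dtype=float)
width = 1.0

def brute_force_centered_mean(a, b, width):
    """Mean of b over all rows whose a lies in [a_i - width/2, a_i + width/2]."""
    out = np.empty(len(a))
    for i, x in enumerate(a):
        mask = (a >= x - width / 2) & (a <= x + width / 2)
        out[i] = b[mask].mean()
    return out

result = brute_force_centered_mean(a, b, width)
print(result)
```

The values should agree with the centered, closed='both' rolling mean shown above.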

Timing:

df = pd.DataFrame(
    np.random.uniform(0, 1000, size=(1_000_000, 2)),
    columns='a b'.split(),
)
df = df.sort_values('a').reset_index(drop=True)


%%time
avg = df.b.rolling(RangeWindow(df.a, width=1.0)).mean()

CPU times: user 133 ms, sys: 3.58 ms, total: 136 ms
Wall time: 135 ms

Update on performance:

Following a comment from @anon01, I wondered whether one could go faster when the rolling involves large windows. It turns out I should have measured pandas' rolling mean and sum performance first... (Premature optimization, anyone?) See the end of this update for why.

Anyway, the idea was to do a cumsum just once, then take the difference of the elements indexed by the window endpoints:

# both below working on numpy arrays:
def fast_rolling_sum(a, b, width):
    z = np.concatenate(([0], np.cumsum(b)))
    ix0 = np.searchsorted(a, a - width/2, side='left')
    ix1 = np.searchsorted(a, a + width/2, side='right')
    return z[ix1] - z[ix0]

def fast_rolling_mean(a, b, width):
    z = np.concatenate(([0], np.cumsum(b)))
    ix0 = np.searchsorted(a, a - width/2, side='left')
    ix1 = np.searchsorted(a, a + width/2, side='right')
    return (z[ix1] - z[ix0]) / (ix1 - ix0)

With this (and the 1-million rows df above), I see:

%timeit fast_rolling_mean(df.a.values, df.b.values, width=100.0)
# 93.9 ms ± 335 µs per loop

versus:

%timeit df.rolling(RangeWindow(df.a, width=100.0), min_periods=1).mean()
# 248 ms ± 1.54 ms per loop

However!!! pandas is likely already doing such an optimization (it's a pretty obvious one): its timings don't increase with larger windows, which is why I said I should have checked first.
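To see why the cumsum difference gives a window sum, here is a small worked sketch on the 7-row example data. With a leading zero, z[j] - z[i] equals b[i:j].sum(), so each window costs two lookups regardless of its size:

```python
import numpy as np

a = np.array([4.5, 4.6, 4.8, 5.5, 5.6, 8.1, 8.2])
b = np.array([10.0, 11.0, 9.0, 6.0, 6.0, 10.0, 13.0])
width = 1.0

# Prefix sums with a leading 0: z[k] is the sum of the first k elements of b.
z = np.concatenate(([0.0], np.cumsum(b)))
ix0 = np.searchsorted(a, a - width / 2, side="left")
ix1 = np.searchsorted(a, a + width / 2, side="right")

window_sums = z[ix1] - z[ix0]
window_means = window_sums / (ix1 - ix0)

# Cross-check against explicit slicing of each window.
explicit = np.array([b[i:j].sum() for i, j in zip(ix0, ix1)])
print(window_sums, window_means)
```

One caveat worth keeping in mind: on long arrays of floats, a single running cumsum can accumulate rounding error, so the result may differ slightly from summing each window directly.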

Solution 2:

DataFrame.rolling and Series.rolling do allow for value-based windows if the index is a DatetimeIndex or TimedeltaIndex. You can use this to get close to the desired result:

df = df.set_index(pd.TimedeltaIndex(df[0]*1e9))
df["rolling_mean"] = df[1].rolling("1s").mean()
df = df.reset_index(drop=True)

output:

     0   1  rolling_mean
0  4.5  10     10.000000
1  4.6  11     10.500000
2  4.8   9     10.000000
3  5.5   6      8.666667
4  5.6   6      7.000000
5  8.1  10     10.000000
6  8.2  13     11.500000

Advantages: This is a three-line solution that should have great performance, leveraging the pandas datetime backend.

Disadvantages: This is definitely a hack, casting your miles column to time-delta seconds, and the average isn't centered (at the time of writing, center wasn't implemented for datetimelike and offset based windows; pandas 1.3 and later added support for centered datetime-like windows).

Overall: if you value performance and can live with a non-centered mean, this would be a great way to go with a comment or two.
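For reference, here is a self-contained sketch of the timedelta trick on the example data. The column names miles and value are assumptions made for this sketch (the snippet above uses columns 0 and 1), and the miles are scaled to integer nanoseconds first so that a width of 1.0 mile corresponds to exactly "1s":

```python
import pandas as pd

df = pd.DataFrame(
    {
        "miles": [4.5, 4.6, 4.8, 5.5, 5.6, 8.1, 8.2],
        "value": [10, 11, 9, 6, 6, 10, 13],
    }
)

# Treat 1 mile as 1 second: convert to integer nanoseconds to avoid float rounding.
ns = (df["miles"] * 1e9).round().astype("int64")
indexed = df.set_index(pd.to_timedelta(ns, unit="ns"))

# Offset-based window: each row averages the values in (t - 1s, t].
rolling_mean = indexed["value"].rolling("1s").mean()
df["rolling_mean"] = rolling_mean.to_numpy()
print(df)
```

The window here is trailing and closed on the right by default, which is why row 3 (5.5 miles) excludes the point at 4.5 and averages 11, 9 and 6.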
