Question Details

No question body available.

Tags

python pandas dataframe numpy

Answers (2)

March 21, 2026 Score: 3 Rep: 28,415 Quality: Medium Completeness: 80%

I need a vectorized way (or maybe something using numpy or numba) to do this.

I would personally suggest Numba - it's generally the easiest way to obtain a very specific behavior like this.

I would suggest an algorithm along these lines:

  1. Sort data by timestamp.
  2. Iterate over rows, keeping track of the last sensor observation for each sensor.
  3. If the observation is NaN, forward-fill it from the last observation for that sensor, provided that observation is recent enough.
  4. If not NaN, record it as the new last observation for that sensor.

This allows you to do the iterative step in one pass over the data.
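Before the Numba version below, the steps above can be sketched in plain Python. This is an illustrative sketch only (the function name, the dict-based bookkeeping, and the input format are my own, not the final implementation):

```python
import math

def ffill_with_gap_limit(rows, max_gap_s):
    """One pass over (sensor_id, timestamp_s, value) rows, pre-sorted by timestamp.

    Forward-fills NaN values per sensor, but only if the last real reading
    for that sensor is less than max_gap_s seconds old.
    """
    last_ts = {}   # sensor_id -> timestamp of last real reading
    last_val = {}  # sensor_id -> value of last real reading
    out = []
    for sensor, ts, val in rows:
        if math.isnan(val):
            # Fill only if this sensor has a recent enough real reading
            if sensor in last_ts and ts - last_ts[sensor] < max_gap_s:
                val = last_val[sensor]
        else:
            last_ts[sensor] = ts
            last_val[sensor] = val
        out.append(val)
    return out

rows = [(1, 0, 22.1), (1, 180, math.nan), (1, 600, math.nan), (1, 660, 23.0)]
print(ffill_with_gap_limit(rows, max_gap_s=5 * 60))
# → [22.1, 22.1, nan, 23.0]: the 180 s gap is filled, the 600 s gap is not
```

The Numba implementation below follows exactly this shape, but replaces the dicts with preallocated per-sensor arrays so the loop compiles to machine code.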

Code

import pandas as pd
import numpy as np
import numba as nb
import datetime

df = pd.DataFrame({
    'sensor_id': [1, 1, 1, 1, 2, 2],
    'timestamp': pd.to_datetime(['10:00', '10:03', '10:10', '10:11', '10:00', '10:01']),
    'temp': [22.1, np.nan, np.nan, 23.0, 19.5, np.nan]
})

def ffill_in_group_ignoring_gaps(df, max_gap_s):
    # Note: algorithm below assumes dataset is sorted by timestamp
    # Could skip this if you know this property already holds
    df = df.sort_values('timestamp')
    group_col, timestamp_col, x_col = df['sensor_id'], df['timestamp'], df['temp']
    # sensor_id might be sparsely populated, e.g. have sensor values of 1, 100, 101.
    # Make this dense with factorize()
    sensor_id_codes, sensor_id_uniques = pd.factorize(group_col)
    num_sensors = len(sensor_id_uniques)
    # Convert timestamp to seconds since epoch
    # You can use .astype('int64') directly if you know the unit of your timestamp
    timestamp_epoch = (timestamp_col - pd.Timestamp("1970-01-01")) // pd.Timedelta("1s")
    # Convert arguments to known types
    timestamp_epoch = timestamp_epoch.values.astype('int64')
    last_sensor_reading_ts = np.zeros(num_sensors, dtype='int64')
    last_sensor_reading_x = np.full(num_sensors, np.nan, dtype='float64')
    x_col = x_col.values.astype('float64')
    new_temp = ffill_in_group_ignoring_gaps_inner(
        sensor_id_codes, x_col, last_sensor_reading_ts, last_sensor_reading_x,
        timestamp_epoch, max_gap_s
    )
    df['temp'] = new_temp
    # Optionally you can use sort_index() to restore the original order
    return df

@nb.njit()
def ffill_in_group_ignoring_gaps_inner(
    sensor_id_codes, x_col, last_sensor_reading_ts, last_sensor_reading_x,
    timestamp_epoch, max_gap_s
):
    N = len(sensor_id_codes)
    ret = np.zeros(N, dtype='float64')
    for i in range(N):
        # Read inputs for this row
        sensor = sensor_id_codes[i]
        now = timestamp_epoch[i]
        x_val = x_col[i]
        last_reading_ts = last_sensor_reading_ts[sensor]
        ffill_val = last_sensor_reading_x[sensor]
        if not (now - last_reading_ts < max_gap_s):
            # If longer than max_gap_s, don't use last reading
            ffill_val = np.nan
        if np.isnan(x_val):
            # If x=nan, use the last reading.
            x_val = ffill_val
        else:
            # If x!=nan, then write to the last reading array, updating
            # x and timestamp
            last_sensor_reading_x[sensor] = x_val
            last_sensor_reading_ts[sensor] = now
        ret[i] = x_val
    return ret

print(ffill_in_group_ignoring_gaps(df.copy(), max_gap_s=5*60))

I find that this approach is roughly 20x faster, under the following assumptions:

  1. There are 50M rows of data, with 100 readings per sensor.
  2. You can't guarantee that input data is sorted by timestamp, so sort_values() is required. (If it were sorted by sensor_id then timestamp, this would also work.) This step costs 40% of the algorithm's time.
  3. You don't care about the output ordering of the rows. (There's a note about how to restore this with sort_index() in the code.)
  4. This does not assume that sensor_id is an int, or that it is densely populated. For example, you might have a sensor_id which is much bigger than the number of sensors. pd.factorize() addresses such cases, but it costs 15% of the time spent.
  5. I benchmarked under Pandas 2.2.3 and Numba 0.61.2.
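To see what pd.factorize() buys you in step 4: sparse IDs get mapped to dense codes 0..n-1 that can index directly into the per-sensor arrays. A small standalone example (the ID values here are made up):

```python
import pandas as pd

# Sparse, non-contiguous sensor IDs
sensor_ids = pd.Series([1, 100, 1, 101, 100])
codes, uniques = pd.factorize(sensor_ids)

print(codes)          # → [0 1 0 2 1]  dense codes, usable as array indices
print(list(uniques))  # → [1, 100, 101]  the original IDs, in order of first appearance
```

Indexing `last_sensor_reading_x` with `codes` then needs an array of only `len(uniques)` entries, regardless of how large the raw ID values are.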

Here is the benchmark I used.

Benchmark

def limit_ffill(group):
    return group['temp'].ffill(limit=1)

def ffill_slow(df):
    df = df.copy()
    df['temp'] = df.groupby('sensor_id').apply(limit_ffill).values
    return df

def gen_data(num_readings):
    # Source - https://stackoverflow.com/a/57520939
    # Posted by Fabian Bosler
    # Retrieved 2026-03-20, License - CC BY-SA 4.0
    readings_per_sensor = 100
    num_sensors = num_readings // readings_per_sensor
    N = num_sensors * readings_per_sensor
    sensors = np.tile(np.arange(num_sensors), readings_per_sensor)
    now = datetime.datetime.now().timestamp()
    timestamp = pd.to_datetime(np.linspace(now, now + 4 * 60 * readings_per_sensor, N, endpoint=False), unit='s')
    temp = np.random.rand(N)
    mask = np.random.rand(N) < 0.5
    temp[mask] = np.nan

    dft = pd.DataFrame({
        'sensor_id': sensors,
        'timestamp': timestamp,
        'temp': temp,
    })
    return dft

dft = gen_data(50_000_000)
%timeit -n 1 -r 1 ffill_in_group_ignoring_gaps(dft, max_gap_s=5*60)
%timeit -n 1 -r 1 ffill_slow(dft)

Results:

4.44 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
1min 28s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
March 21, 2026 Score: 1 Rep: 11,825 Quality: Medium Completeness: 60%

It is possible to do this using vectorized NumPy operations, but I'm not sure it's very efficient, and it involves storing some intermediate results in memory.

import datetime
import numpy as np
import pandas as pd

Generate test data

df = gen_data(1000)  # see @Nick ODell's answer for this function definition

Compute time deltas in seconds (avoids unit issues)

td = df['timestamp'].diff().dt.total_seconds()
td[0] = 0.0  # overwrite the initial nan

Compute elapsed time since last valid reading and zero when valid

null_mask = pd.isnull(df['temp']).to_numpy()
cumsum = np.cumsum(td)
reset = np.where(~null_mask, cumsum, 0)
reset = np.maximum.accumulate(reset)
time_deltas_after_valid = cumsum - reset

Compute regular forward-fill

df['temp_ffill'] = df['temp'].ffill()

Reset fill values beyond the time limit back to NaN

fill_limit_s = 50  # maximum duration of filling (s)
df.loc[time_deltas_after_valid > fill_limit_s, 'temp_ffill'] = np.nan

print(df.iloc[:15])

Output:

    sensor_id                     timestamp      temp  temp_ffill
0           0 2026-03-21 04:37:00.972776890  0.036642    0.036642
1           1 2026-03-21 04:37:24.972776890  0.575373    0.575373
2           2 2026-03-21 04:37:48.972776890       NaN    0.575373
3           3 2026-03-21 04:38:12.972776890       NaN    0.575373
4           4 2026-03-21 04:38:36.972776890       NaN         NaN
5           5 2026-03-21 04:39:00.972776890       NaN         NaN
6           6 2026-03-21 04:39:24.972776890  0.576749    0.576749
7           7 2026-03-21 04:39:48.972776890       NaN    0.576749
8           8 2026-03-21 04:40:12.972776890       NaN    0.576749
9           9 2026-03-21 04:40:36.972776890       NaN         NaN
10          0 2026-03-21 04:41:00.972776890       NaN         NaN
11          1 2026-03-21 04:41:24.972776890       NaN         NaN
12          2 2026-03-21 04:41:48.972776890       NaN         NaN
13          3 2026-03-21 04:42:12.972776890  0.801779    0.801779
14          4 2026-03-21 04:42:36.972776890       NaN    0.801779
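The core of this approach is the cumsum / maximum.accumulate trick: carrying the cumulative time of the most recent valid reading forward gives, by subtraction, the elapsed time since that reading. A minimal standalone illustration with toy numbers (not the answer's data):

```python
import numpy as np

td = np.array([0.0, 24.0, 24.0, 24.0])           # seconds between consecutive rows
null_mask = np.array([False, False, True, True])  # True where temp is NaN

cumsum = np.cumsum(td)                   # [0, 24, 48, 72] cumulative time
reset = np.where(~null_mask, cumsum, 0)  # [0, 24, 0, 0]   keep time only at valid rows
reset = np.maximum.accumulate(reset)     # [0, 24, 24, 24] carry last valid time forward
elapsed = cumsum - reset                 # [0, 0, 24, 48]  seconds since last valid reading
print(elapsed)
```

Rows where `elapsed` exceeds the fill limit are then masked back to NaN after the plain ffill, which is exactly what the `df.loc[...]` line above does.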