Question Details

No question body available.

Tags

python pandas dataframe numpy

Answers (2)

March 21, 2026 Score: 3 Rep: 28,415 Quality: Medium Completeness: 80%

I need a vectorized way (or maybe something using numpy or numba) to do this.

I would personally suggest Numba - it's generally the easiest way to obtain a very specific behavior like this.

I would suggest an algorithm along these lines:

  1. Sort data by timestamp.
  2. Iterate over rows, keeping track of the last sensor observation for each sensor.
  3. If the observation is NaN, forward-fill it from the last observation for that sensor, provided that observation is recent enough.
  4. If not NaN, record it as the new last observation for that sensor.

This allows you to do the iterative step in one pass over the data.
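Before the Numba version below, the steps above can be sketched in plain Python. This is an illustrative sketch only (the function name, the dict-based bookkeeping, and the input format are my own, not the final implementation):

```python
import math

def ffill_with_gap_limit(rows, max_gap_s):
    """One pass over (sensor_id, timestamp_s, value) rows, pre-sorted by timestamp.

    Forward-fills NaN values per sensor, but only if the last real reading
    for that sensor is less than max_gap_s seconds old.
    """
    last_ts = {}   # sensor_id -> timestamp of last real reading
    last_val = {}  # sensor_id -> value of last real reading
    out = []
    for sensor, ts, val in rows:
        if math.isnan(val):
            # Fill only if this sensor has a recent enough real reading
            if sensor in last_ts and ts - last_ts[sensor] < max_gap_s:
                val = last_val[sensor]
        else:
            last_ts[sensor] = ts
            last_val[sensor] = val
        out.append(val)
    return out

rows = [(1, 0, 22.1), (1, 180, math.nan), (1, 600, math.nan), (1, 660, 23.0)]
print(ffill_with_gap_limit(rows, max_gap_s=5 * 60))
# → [22.1, 22.1, nan, 23.0]: the 180 s gap is filled, the 600 s gap is not
```

The Numba implementation below follows exactly this shape, but replaces the dicts with preallocated per-sensor arrays so the loop compiles to machine code.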

Code

import pandas as pd
import numpy as np
import numba as nb
import datetime

df = pd.DataFrame({
    'sensor_id': [1, 1, 1, 1, 2, 2],
    'timestamp': pd.to_datetime(['10:00', '10:03', '10:10', '10:11', '10:00', '10:01']),
    'temp': [22.1, np.nan, np.nan, 23.0, 19.5, np.nan]
})

def ffill_in_group_ignoring_gaps(df, max_gap_s):
    # Note: algorithm below assumes dataset is sorted by timestamp
    # Could skip this if you know this property already holds
    df = df.sort_values('timestamp')
    group_col, timestamp_col, x_col = df['sensor_id'], df['timestamp'], df['temp']
    # sensor_id might be sparsely populated, e.g. have sensor values of 1, 100, 101.
    # Make this dense with factorize()
    sensor_id_codes, sensor_id_uniques = pd.factorize(group_col)
    num_sensors = len(sensor_id_uniques)
    # Convert timestamp to seconds since epoch
    # You can use .astype('int64') directly if you know the unit of your timestamp
    timestamp_epoch = (timestamp_col - pd.Timestamp("1970-01-01")) // pd.Timedelta("1s")
    # Convert arguments to known types
    timestamp_epoch = timestamp_epoch.values.astype('int64')
    last_sensor_reading_ts = np.zeros(num_sensors, dtype='int64')
    last_sensor_reading_x = np.full(num_sensors, np.nan, dtype='float64')
    x_col = x_col.values.astype('float64')
    new_temp = ffill_in_group_ignoring_gaps_inner(
        sensor_id_codes, x_col, last_sensor_reading_ts, last_sensor_reading_x,
        timestamp_epoch, max_gap_s
    )
    df['temp'] = new_temp
    # Optionally you can use sort_index() to restore the original order
    return df

@nb.njit()
def ffill_in_group_ignoring_gaps_inner(
    sensor_id_codes, x_col, last_sensor_reading_ts, last_sensor_reading_x,
    timestamp_epoch, max_gap_s
):
    N = len(sensor_id_codes)
    ret = np.zeros(N, dtype='float64')
    for i in range(N):
        # Read inputs for this row
        sensor = sensor_id_codes[i]
        now = timestamp_epoch[i]
        x_val = x_col[i]
        last_reading_ts = last_sensor_reading_ts[sensor]
        ffill_val = last_sensor_reading_x[sensor]
        if not (now - last_reading_ts < max_gap_s):
            # If longer than max_gap_s, don't use last reading
            ffill_val = np.nan
        if np.isnan(x_val):
            # If x=nan, use the last reading.
            x_val = ffill_val
        else:
            # If x!=nan, then write to the last reading array, updating
            # x and timestamp
            last_sensor_reading_x[sensor] = x_val
            last_sensor_reading_ts[sensor] = now
        ret[i] = x_val
    return ret

print(ffill_in_group_ignoring_gaps(df.copy(), max_gap_s=5*60))

I find that this approach is roughly 20x faster, under the following assumptions:

  1. There are 50M rows of data, with 100 readings per sensor.
  2. You can't guarantee that input data is sorted by timestamp, so sort_values() is required. (If it were sorted by sensor_id then timestamp, this would also work.) This step costs 40% of the algorithm's time.
  3. You don't care about the output ordering of the rows. (There's a note about how to restore this with sort_index() in the code.)
  4. This does not assume that sensor_id is an int, or that it is densely populated. For example, you might have a sensor_id which is much bigger than the number of sensors. pd.factorize() addresses such cases, but it costs 15% of the time spent.
  5. I benchmarked under Pandas 2.2.3 and Numba 0.61.2.
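To see what pd.factorize() buys you in step 4: sparse IDs get mapped to dense codes 0..n-1 that can index directly into the per-sensor arrays. A small standalone example (the ID values here are made up):

```python
import pandas as pd

# Sparse, non-contiguous sensor IDs
sensor_ids = pd.Series([1, 100, 1, 101, 100])
codes, uniques = pd.factorize(sensor_ids)

print(codes)          # → [0 1 0 2 1]  dense codes, usable as array indices
print(list(uniques))  # → [1, 100, 101]  the original IDs, in order of first appearance
```

Indexing `last_sensor_reading_x` with `codes` then needs an array of only `len(uniques)` entries, regardless of how large the raw ID values are.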

Here is the benchmark I used.

Benchmark

def limit_ffill(group):
    return group['temp'].ffill(limit=1)

def ffill_slow(df):
    df = df.copy()
    df['temp'] = df.groupby('sensor_id').apply(limit_ffill).values
    return df

def gen_data(num_readings):
    # Source - https://stackoverflow.com/a/57520939
    # Posted by Fabian Bosler
    # Retrieved 2026-03-20, License - CC BY-SA 4.0
    readings_per_sensor = 100
    num_sensors = num_readings // readings_per_sensor
    N = num_sensors * readings_per_sensor
    sensors = np.tile(np.arange(num_sensors), readings_per_sensor)
    now = datetime.datetime.now().timestamp()
    timestamp = pd.to_datetime(np.linspace(now, now + 4 * 60 * readings_per_sensor, N, endpoint=False), unit='s')
    temp = np.random.rand(N)
    mask = np.random.rand(N) < 0.5
    temp[mask] = np.nan

    dft = pd.DataFrame({
        'sensor_id': sensors,
        'timestamp': timestamp,
        'temp': temp,
    })
    return dft

dft = gen_data(50_000_000)
%timeit -n 1 -r 1 ffill_in_group_ignoring_gaps(dft, max_gap_s=5*60)
%timeit -n 1 -r 1 ffill_slow(dft)

Results:

4.44 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
1min 28s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
March 21, 2026 Score: 1 Rep: 11,825 Quality: Medium Completeness: 60%

It is possible to do this using vectorized NumPy operations, but I'm not sure it's very efficient, and it involves storing some intermediate results in memory.

import datetime
import numpy as np
import pandas as pd

Generate test data

df = gen_data(1000)  # see @Nick ODell's answer for this function definition

Compute time deltas in seconds (avoids unit issues)

td = df['timestamp'].diff().dt.total_seconds()
td[0] = 0.0  # overwrite the initial nan

Compute elapsed time since last valid reading and zero when valid

null_mask = pd.isnull(df['temp']).to_numpy()
cumsum = np.cumsum(td)
reset = np.where(~null_mask, cumsum, 0)
reset = np.maximum.accumulate(reset)
time_deltas_after_valid = cumsum - reset

Compute regular forward-fill

df['temp_ffill'] = df['temp'].ffill()

Reset fill values beyond the time limit back to NaN

fill_limit_s = 50  # maximum duration of filling (s)
df.loc[time_deltas_after_valid > fill_limit_s, 'temp_ffill'] = np.nan

print(df.iloc[:15])

Output:

    sensor_id                     timestamp      temp  temp_ffill
0           0 2026-03-21 04:37:00.972776890  0.036642    0.036642
1           1 2026-03-21 04:37:24.972776890  0.575373    0.575373
2           2 2026-03-21 04:37:48.972776890       NaN    0.575373
3           3 2026-03-21 04:38:12.972776890       NaN    0.575373
4           4 2026-03-21 04:38:36.972776890       NaN         NaN
5           5 2026-03-21 04:39:00.972776890       NaN         NaN
6           6 2026-03-21 04:39:24.972776890  0.576749    0.576749
7           7 2026-03-21 04:39:48.972776890       NaN    0.576749
8           8 2026-03-21 04:40:12.972776890       NaN    0.576749
9           9 2026-03-21 04:40:36.972776890       NaN         NaN
10          0 2026-03-21 04:41:00.972776890       NaN         NaN
11          1 2026-03-21 04:41:24.972776890       NaN         NaN
12          2 2026-03-21 04:41:48.972776890       NaN         NaN
13          3 2026-03-21 04:42:12.972776890  0.801779    0.801779
14          4 2026-03-21 04:42:36.972776890       NaN    0.801779
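The core of this approach is the cumsum / maximum.accumulate trick: carrying the cumulative time of the most recent valid reading forward gives, by subtraction, the elapsed time since that reading. A minimal standalone illustration with toy numbers (not the answer's data):

```python
import numpy as np

td = np.array([0.0, 24.0, 24.0, 24.0])           # seconds between consecutive rows
null_mask = np.array([False, False, True, True])  # True where temp is NaN

cumsum = np.cumsum(td)                   # [0, 24, 48, 72] cumulative time
reset = np.where(~null_mask, cumsum, 0)  # [0, 24, 0, 0]   keep time only at valid rows
reset = np.maximum.accumulate(reset)     # [0, 24, 24, 24] carry last valid time forward
elapsed = cumsum - reset                 # [0, 0, 24, 48]  seconds since last valid reading
print(elapsed)
```

Rows where `elapsed` exceeds the fill limit are then masked back to NaN after the plain ffill, which is exactly what the `df.loc[...]` line above does.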