## Pandas - Find longest stretch without Nan values

Question

I have a pandas dataframe "df", a sample of which is below:

``````   time  x
0  1     1
1  2     Nan
2  3     3
3  4     Nan
4  5     8
5  6     7
6  7     5
7  8     Nan
``````

The real frame is much bigger. I am trying to find the longest stretch of non NaN values in the "x" series, and print out the starting and ending index for this frame. Is this possible?

Thank You

Show source
2017-01-05 21:01 4 Answers

## Answers to Pandas - Find longest stretch without Nan values ( 4 )

1. So you can get the index values of the `NaN`'s in the following way:

``````import numpy as np

index = df['x'].index[df['x'].apply(np.isnan)]
df_index = df.index.values.tolist()
[df_index.index(indexValue) for indexValue in index]

>>> [0, 1, 3, 7]
``````

Then one solution would be to see the largest difference between subsequent index values and that would give you the longest stretch of non `NaN` values.

2. Maybe a faster way would be the following (given that you say you have a long dataframe, speed matters):

``````In [19]: df = pd.DataFrame({'time':[1,2,3,4,5,6,7,8],'x':[1,np.NAN,3,np.NAN,8,7,5,np.NAN]})

In [20]: index = df['x'].isnull()

In [21]: df[index].index.values
Out[21]: array([1, 3, 7])
``````
3. Here's a vectorized approach with NumPy tools -

``````a = df.x.values  # Extract out relevant column from dataframe as array
m = np.concatenate(( [True], np.isnan(a), [True] ))  # Mask
ss = np.flatnonzero(m[1:] != m[:-1]).reshape(-1,2)   # Start-stop limits
start,stop = ss[(ss[:,1] - ss[:,0]).argmax()]  # Get max interval, interval limits
``````

Sample run -

``````In [474]: a
Out[474]:
array([  1.,  nan,   3.,  nan,  nan,  nan,  nan,   8.,   7.,   5.,   2.,
5.,  nan,  nan])

In [475]: start, stop
Out[475]: (7, 12)
``````

The intervals are set such that the difference between each start and stop would give us the length of each interval. So, by `ending index` if you meant to get the last index of non-zero element, we need to subtract one from `stop`.

4. pandas

``````f = dict(
Start=pd.Series.first_valid_index,
Stop=pd.Series.last_valid_index,
Stretch='count'
)

agged = df.x.groupby(df.x.isnull().cumsum()).agg(f)
agged.loc[agged.Stretch.idxmax(), ['Start', 'Stop']].values

array([ 4.,  6.])
``````

numpy

``````def pir(x):
# pad with np.nan
x = np.append(np.nan, np.append(x, np.nan))
# find where null
w = np.where(np.isnan(x))[0]
# diff to find length of stretch
# argmax to find where largest stretch
a = np.diff(w).argmax()
# return original positions of boundary nulls
return w[[a, a + 1]] + np.array([0, -2])
``````

demo

``````pir(df.x.values)

array([4, 6])
``````

``````a = np.array([1, np.nan, 3, np.nan, np.nan, np.nan, np.nan, 8, 7, 5, 2, 5, np.nan, np.nan])
pir(a)

array([ 7, 11])
``````