pandas dataframe filtering row like groupby

Question

for example, I have a dataframe with these 2 columns a, and b:

a = [1,1,1,1,1,1,1,2,2,2,2,2,2,2,3,3,3,3,3,3,3]
b = [1,2,3,4,5,6,7,8,9,0,1,2,3,4,5,6,7,8,9,0,1]

I am expecting filtered dataframe: [5,6,7,2,3,4,9,0,1]

Without using the groupby function (because it is take too long time with a very large dataframe, it just not usable), how do I filter with the last 3 items from each group in col. a?


Show source
| numpy   | pandas   | performance   | dataframe   2017-01-04 15:01 2 Answers

Answers ( 2 )

  1. 2017-01-04 15:01

    You can use drop_duplicates first for last rows of groups, then get previous index values and last select by loc:

    a = [1,1,1,1,1,1,1,2,2,2,2,2,2,2,3,3,3,3,3,3,3]
    b = [1,2,3,4,5,6,7,8,9,0,1,2,3,4,5,6,7,8,9,0,1]
    df =  pd.DataFrame({'a': a, 'b':b})
    print (df)
        a  b
    0   1  1
    1   1  2
    2   1  3
    3   1  4
    4   1  5
    5   1  6
    6   1  7
    7   2  8
    8   2  9
    9   2  0
    10  2  1
    11  2  2
    12  2  3
    13  2  4
    14  3  5
    15  3  6
    16  3  7
    17  3  8
    18  3  9
    19  3  0
    20  3  1
    
    df1 = df.drop_duplicates('a',keep='last')
    print (df1)
        a  b
    6   1  7
    13  2  4
    20  3  1
    
    idx = sorted(df1.index.tolist() + (df1.index - 1).tolist() + (df1.index - 2).tolist())
    print (idx)
    [4, 5, 6, 11, 12, 13, 18, 19, 20]
    
    print (df.loc[idx])
        a  b
    4   1  5
    5   1  6
    6   1  7
    11  2  2
    12  2  3
    13  2  4
    18  3  9
    19  3  0
    20  3  1
    
  2. 2017-01-04 15:01

    Approach #1 : Here's a NumPy based approach -

    In [89]: a = np.array([1,1,1,1,1,1,1,2,2,2,2,2,2,2,3,3,3,3,3,3,3])
        ...: b = np.array([1,2,3,4,5,6,7,8,9,0,1,2,3,4,5,6,7,8,9,0,1])
        ...: 
    
    In [90]: idx = np.append(np.nonzero(a[1:] > a[:-1])[0], a.size-1)[:,None] - [2,1,0]
    
    In [91]: b[idx].ravel()
    Out[91]: array([5, 6, 7, 2, 3, 4, 9, 0, 1])
    

    If you are receiving those from the columns of a dataframe, df named a and b respectively, as the pre-processing step, we need to extract those as arrays, like so -

    a = df.a.values
    b = df.b.values
    

    Please note that this assumes at least three elements per group. For cases with lesser than 3 elems per group read on to the next approach.


    Approach #2 : With Scipy's binary dilation to create a mask for selecting elements off b -

    from scipy.ndimage.morphology import binary_dilation as imdilate
    def filter_lastN(a, b, N):
        mask = np.zeros(a.size,dtype=bool)
        mask[np.append(np.nonzero(a[1:] > a[:-1])[0],b.size-1)] = 1
        return b[imdilate(mask,np.ones(N),origin=(N-1)//2)]
    

    Sample run -

    In [198]: a
    Out[198]: array([1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3])
    
    In [199]: b
    Out[199]: array([5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1])
    
    In [200]: filter_lastN(a,b,3)
    Out[200]: array([5, 6, 7, 2, 3, 4, 9, 0, 1])
    
    In [201]: filter_lastN(a,b,5)
    Out[201]: array([5, 6, 7, 0, 1, 2, 3, 4, 7, 8, 9, 0, 1])
    
◀ Go back