NumPy: Create a dict() by grouping values in a column by another column values

Question

Suppose I have a 2-D NumPy array like the one below:

arr = numpy.array([[1,0], [1, 4.6], [2, 10.1], [2, 0], [2, 3.53]])
arr
Out[39]: 
array([[  1.  ,   0.  ],
       [  1.  ,   4.6 ],
       [  2.  ,  10.1 ],
       [  2.  ,   0.  ],
       [  2.  ,   3.53]])

What would be the fastest way to group the values in the second column by the values in the first column and build a dict from them? The desired output is:

{1: [0, 4.6], 2: [10.1, 0, 3.53]}

I currently use a loop, but the actual array has more than 1 million rows and the first column has more than 5000 unique values, so it is quite slow. I would prefer not to use pandas.


| numpy   | python   | arrays   | grouping   | dictionary   2017-04-16 23:04 4 Answers

Answers to NumPy: Create a dict() by grouping values in a column by another column values ( 4 )

  1. 2017-04-16 23:04

    You can do this without NumPy by using collections.defaultdict. In fact, based on the example you provided, you don't even need a NumPy array; a plain Python list is good enough. For example:

    from collections import defaultdict
    my_list = [[1,0], [1, 4.6], [2, 10.1], [2, 0], [2, 3.53]]
    
    my_dict = defaultdict(list)
    for key, value in my_list:
        my_dict[key].append(value)
    
        # if you want the values as float in the dict, use:
        #     my_dict[float(key)].append(float(value))
    

    The final content of my_dict will be:

    {1: [0, 4.6], 2: [10.1, 0, 3.53]}
    
  2. 2017-04-16 23:04

    You can use np.split:

    # sort the array by the first column if it isn't sorted already
    sort_arr = arr[arr[:, 0].argsort(), :]

    # split the array at the rows where the key changes and construct the dictionary
    split_arr = np.split(sort_arr, np.where(np.diff(sort_arr[:,0]))[0] + 1)
    
    {s[0,0]: s[:,1].tolist() for s in split_arr}
    # {1.0: [0.0, 4.6], 2.0: [10.1, 0.0, 3.53]}
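
    To make the boundary computation above easier to follow, here is a minimal sketch that prints nothing new but names each intermediate step. It reuses the array from the question. Two details are my own choices, not part of the answer: kind="stable" is passed to argsort so that values keep their original order within each group (the default sort makes no such guarantee), and .item() converts the NumPy scalar keys to plain Python floats.

```python
import numpy as np

arr = np.array([[1, 0], [1, 4.6], [2, 10.1], [2, 0], [2, 3.53]])

# A stable sort keeps rows with equal keys in their original relative order.
order = np.argsort(arr[:, 0], kind="stable")
sort_arr = arr[order]

# np.diff is non-zero exactly where the key changes between neighbouring rows.
key_steps = np.diff(sort_arr[:, 0])

# Adding 1 turns "row before the change" into "first row of the next group".
boundaries = np.flatnonzero(key_steps) + 1

# np.split cuts the sorted array at those row indices, one piece per key.
groups = np.split(sort_arr, boundaries)

result = {g[0, 0].item(): g[:, 1].tolist() for g in groups}
print(result)  # {1.0: [0.0, 4.6], 2.0: [10.1, 0.0, 3.53]}
```

    For this input the only key change is between rows 1 and 2, so boundaries holds the single index 2.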
    
  3. 2017-04-16 23:04

    Here's an approach -

    def create_dict(arr):
        a = arr[arr[:,0].argsort()] # sort by col-0 if not already sorted
        s0 = np.r_[0,np.flatnonzero(a[1:,0] > a[:-1,0])+1,a.shape[0]]
        ids = a[s0[:-1],0]
        return {ids[i]:a[s0[i]:s0[i+1],1].tolist() for i in range(len(s0)-1)}
    

    Sample run -

    In [64]: arr
    Out[64]: 
    array([[  2.  ,   0.  ],
           [  1.  ,   4.6 ],
           [  2.  ,  10.1 ],
           [  4.  ,   0.5 ],
           [  1.  ,   0.  ],
           [  4.  ,   0.23],
           [  2.  ,   3.53]])
    
    In [65]: create_dict(arr)
    Out[65]: {1.0: [4.6, 0.0], 2.0: [0.0, 10.1, 3.53], 4.0: [0.5, 0.23]}
    

    Runtime test

    Other approaches -

    # @Moinuddin Quadri's soln
    def defaultdict_based(arr):
        my_list  = arr.tolist()
        my_dict = defaultdict(list)
        for key, value in my_list:
            my_dict[key].append(value)
        return my_dict
    
    # @Psidom's soln
    def numpy_split_based(arr):
        sort_arr = arr[arr[:, 0].argsort(), :]
        split_arr = np.split(sort_arr, np.where(np.diff(sort_arr[:,0]))[0] + 1) 
        return {s[0,0]: s[:,1].tolist() for s in split_arr}
    

    Timings -

    # Create sample random array with the first col having 1000000 elems
    # with 5000 unique ones as stated in the question
    In [102]: arr = np.random.randint(0,5000,(1000000,2))
    
    In [103]: %timeit defaultdict_based(arr)
         ...: %timeit numpy_split_based(arr)
         ...: %timeit create_dict(arr)
         ...: 
    1 loops, best of 3: 634 ms per loop
    1 loops, best of 3: 270 ms per loop
    1 loops, best of 3: 260 ms per loop
    

    Bottlenecks for the approaches:

    With the defaultdict-based approach, the conversion to a list with .tolist() appears to be the heavy part (>50% of total runtime) -

    In [104]: %timeit arr.tolist()
    1 loops, best of 3: 372 ms per loop
    

    For the other two approaches, the sorting (if needed) at the start, along with the splitting/loop comprehension at the end, are the time-consuming portions. The sorting step accounts for roughly 50% of the total runtime -

    In [106]: %timeit arr[arr[:,0].argsort()]
    10 loops, best of 3: 140 ms per loop
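
    Since the sort is only needed when the first column is not already ordered, one possible variation is to check for that in a single linear pass and skip argsort when it is. The sketch below is an assumption-laden illustration, not something timed in this answer: the function name group_skip_sort is hypothetical, and whether the check pays off depends on how often the input really is sorted.

```python
import numpy as np

def group_skip_sort(arr):
    """Group column 1 by column 0, sorting only when column 0 is unsorted."""
    keys = arr[:, 0]
    if keys.size == 0:
        return {}
    # One linear pass: is any key smaller than the one before it?
    if np.any(keys[1:] < keys[:-1]):
        # Stable sort keeps the original order of values within each group.
        arr = arr[np.argsort(keys, kind="stable")]
        keys = arr[:, 0]
    # Start index of every group, plus the array length as a closing sentinel.
    starts = np.r_[0, np.flatnonzero(keys[1:] != keys[:-1]) + 1, keys.size]
    values = arr[:, 1]
    return {
        keys[lo].item(): values[lo:hi].tolist()
        for lo, hi in zip(starts[:-1], starts[1:])
    }

unsorted_arr = np.array([[2, 0], [1, 4.6], [2, 10.1], [4, 0.5],
                         [1, 0], [4, 0.23], [2, 3.53]])
print(group_skip_sort(unsorted_arr))
```

    The slicing logic is the same as in create_dict; only the conditional sort and the empty-input guard are added.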
    
  4. 2017-04-17 01:04

    Assuming that your first column is in sorted order, this will work.

    In [165]: d = {}
    
    In [166]: uniq, idx, idxinv, counts = np.unique(arr[:, 0], return_index=True, return_inverse=True, return_counts=True)
    
    In [167]: for u, start, n in zip(uniq, idx, counts):
         ...:     d[u] = arr[start:start + n, 1]  # rows start .. start+n-1 share key u
         ...: 
    
    In [168]: d
    Out[168]: {1.0: array([ 0. ,  4.6]), 2.0: array([ 10.1 ,   0.  ,   3.53])}
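
    The np.unique call above relies on the first column being sorted, because return_index gives the first occurrence of each key and return_counts gives the group size, so the group is one contiguous slice. If the column may be unsorted, a stable argsort can be applied first. The following is a minimal sketch under that assumption; the function name unique_based is hypothetical, and note that np.unique sorts internally as well, so this does some redundant work.

```python
import numpy as np

def unique_based(arr):
    """Group column 1 by column 0 using np.unique on a stably sorted copy."""
    # Stable sort so values keep their original order within each key.
    a = arr[np.argsort(arr[:, 0], kind="stable")]
    # first[i] is the row where key uniq[i] starts; counts[i] is its group size.
    uniq, first, counts = np.unique(a[:, 0], return_index=True, return_counts=True)
    return {
        u.item(): a[i:i + c, 1].tolist()
        for u, i, c in zip(uniq, first, counts)
    }

sample = np.array([[2, 0], [1, 4.6], [2, 10.1], [4, 0.5],
                   [1, 0], [4, 0.23], [2, 3.53]])
print(unique_based(sample))
```

    Here the values are converted with .tolist() to match the list-valued dict requested in the question; dropping that call keeps them as array slices instead.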
    
