Pandas: joining items with same index

Question

I have a pandas data frame that is a vector with a value and an index, say:

row1  10
row1  11
row2  9
row2  8

However, I want to create a 2x2 matrix from this, in which the row index is actually a header (column index). Like this:

row1  row2
10    9
11    8

What is the most efficient way of doing this? This example is a simplification, but I could be dealing with thousands of data points. Does pandas have a specific function for joining items with same index into a table?

Observation: all indexes would have the same number of entries.


Show source
| pandas   | python   | dataframe   | matrix   2016-12-22 00:12 4 Answers

Answers ( 4 )

  1. 2016-12-22 01:12

    You can create an id variable for each unique index and then pivot the table to wide format:

    df.assign(id = df.groupby([0]).cumcount()).set_index(['id', 0]).unstack(level=1)
    
    #      1
    #0  row1    row2
    #id     
    # 0   10       9
    # 1   11       8
    
  2. 2016-12-22 01:12

    assign another column to the index and unstack

    s.index = [s.groupby(level=0).cumcount(), s.index]
    s.unstack()
    
    0  row1  row2
    0    10     9
    1    11     8
    

    alternative numpy approach
    still slower (blah...)

    u, inv = np.unique(s.index.values, return_inverse=True)
    data = dict(zip(u, [s.values[g] for g in (np.arange(len(u))[:, None] == inv)]))
    pd.DataFrame(data)
    
  3. 2016-12-22 01:12

    Use groupby on the index to get a list of elements for each index, use to_dict to get a dictionary, then use the pd.DataFrame constructor:

    pd.DataFrame(df.groupby(level=0)['column_name'].apply(list).to_dict())
    

    If you have a Series, say s, instead of a DataFrame you don't need to supply a column name:

    pd.DataFrame(s.groupby(level=0).apply(list).to_dict())
    

    The resulting output:

       row1  row2
    0    10     9
    1    11     8
    

    Timings

    Using the following setup to produce larger sample data, assuming the input data is a DataFrame:

    n = 10**6
    df = pd.DataFrame(np.random.random(size=n), index=['row1', 'row2']*(n//2))
    
    def pir2(s):
        s.index = [s.groupby(level=0).cumcount(), s.index]
        return s.unstack()
    

    I get the following timings:

    %timeit pd.DataFrame(df.groupby(level=0)[0].apply(list).to_dict())
    1 loop, best of 3: 210 ms per loop
    
    %timeit pir2(df.copy())
    1 loop, best of 3: 486 ms per loop
    
    %timeit df.assign(id = df.groupby([0]).cumcount()).set_index(['id', 0]).unstack(level=1)
    1 loop, best of 3: 1.34 s per loop
    
  4. 2016-12-22 02:12

    And just a little bit faster using @root's example :)

    pd.DataFrame({name:group.values for name, group in df.groupby(level=0)[0]})
    

    Timings:

    %timeit pd.DataFrame({name:group.values for name, group in df.groupby(level=0)[0]})
    10 loops, best of 3: 73.6 ms per loop
    
    %timeit pd.DataFrame(df.groupby(level=0)[0].apply(list).to_dict())
    1 loop, best of 3: 249 ms per loop
    
◀ Go back