numpy: Compressing block matrix

Question

Consider a matrix M1 giving values for all combinations x,y. Consider a partition f(x)->X and a partition g(y)->Y. Furthermore consider an operation p(A) on a set A of numbers, e.g. max(A) or sum(A).

The mappings f,g can be used to create from M1 a block matrix M2 where all x that are mapped to the same X are adjacent, and the same for all y.

This matrix M2 has a block for each combination of the 'sets' X,Y.

Now I would like to condense this matrix M2 into another matrix M3 by applying p on each block separately. M3 has one value for each combination of X,Y.

Ideally, I would like to skip materializing M2 altogether and instead apply f and g on the fly.

What would be the most efficient way to perform such an operation, and would it be possible to use numpy or scipy for it?

Special case: Actually, in my case x and y are identical and there is only one function f applied to both of them. I only care about the part of M2 that is under the diagonal.


Tags: numpy, python, matrix, scipy, block · 2017-01-05 16:01

Answers (2)

  1. 2017-01-05 17:01

    The most straightforward way I can think of to do this, although perhaps not the most efficient (especially if your matrix is huge), is to convert your matrix to a one-dimensional array, and then have corresponding arrays for the partition group indices X and Y. You can then group by the partition group indices and finally restructure the matrix back into its original form.

    For example, if your matrix is

    >>> M1 = np.arange(25).reshape((5,5))
    >>> M1
    array([[ 0,  1,  2,  3,  4],
           [ 5,  6,  7,  8,  9],
           [10, 11, 12, 13, 14],
           [15, 16, 17, 18, 19],
           [20, 21, 22, 23, 24]])
    

    and your partitions are

    >>> def f(x):
    ...     return np.array([1,1,1,2,2])[x]
    >>> def g(y):
    ...     return np.array([3,4,4,4,5])[y]
    

    From that point, there are several ways to implement the reshaping and subsequent grouping. You can do it with Pandas, for instance, by constructing a DataFrame and using its stack() method to "stack" all the rows on top of each other in a single column, indexed by their original row and column indices.

    >>> st = pd.DataFrame(M1).stack().to_frame('M1')
    >>> st
         M1
    0 0   0
      1   1
      2   2
      3   3
      4   4
    1 0   5
    ...
    4 3  23
      4  24
    

    (I have truncated the output for readability, and I trust that you can evaluate the rest of these examples yourself if you want to see their output.) You can then add columns representing the partition group indices:

    >>> st['X'] = f(st.index.get_level_values(0))
    >>> st['Y'] = g(st.index.get_level_values(1))
    

    Then you can group by those indices and apply your aggregation function of choice.

    >>> stp = st.groupby(['X', 'Y']).agg(p)
    

    You will have to define p (or find an existing definition) such that it takes a one-dimensional Numpy array and returns a single number. If you want to use something like sum(), you can just use st.groupby(...).sum() because Pandas has built-in support for that and a few other standard functions, but agg is general and works for any reduction function p you can provide.
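As a concrete illustration of the steps above, here is the whole pipeline run end to end with p chosen as a plain max over each block (max is just one possible choice for p; the question leaves the reduction open):

```python
import numpy as np
import pandas as pd

M1 = np.arange(25).reshape((5, 5))
f = lambda x: np.array([1, 1, 1, 2, 2])[x]  # row partition from the example above
g = lambda y: np.array([3, 4, 4, 4, 5])[y]  # column partition from the example above

# Stack into one column indexed by (row, column), then attach group labels
st = pd.DataFrame(M1).stack().to_frame('M1')
st['X'] = f(st.index.get_level_values(0))
st['Y'] = g(st.index.get_level_values(1))

# p here is max over each block; any array -> scalar reduction would work
stp = st.groupby(['X', 'Y']).agg(lambda block: block.max())
M3 = stp.unstack().to_numpy()
```

With this partition, M3 comes out as the block-wise maxima `[[10, 13, 14], [20, 23, 24]]`.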

    Finally, the unstack() method will convert the DataFrame back into the properly 2D "matrix form", and then if you want you can use the to_numpy() method (called as_matrix() in older Pandas versions) to turn it back into a pure Numpy array.

    >>> M3 = stp.unstack().to_numpy()
    >>> M3
    array([[ 15,  63,  27],
           [ 35, 117,  43]])
    

    If you don't want to bring in Pandas, there are other libraries that do the same thing. You might look at numpy-groupies, for example, but as I write this it appears to be only compatible with Python 2. However I haven't found any library that does true two-dimensional grouping, which you might need if you are working with very large matrices, large enough that having an additional 2 or 3 copies of them would exhaust the available memory.
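For the special case where p is a sum, there is also a pure-NumPy route that avoids both Pandas and any extra library: flatten M1, assign every entry a combined group label, and let np.bincount do the grouped sum. This is a sketch of that idea (not part of the answer above, and it only works for sum-like reductions, since bincount accumulates weights):

```python
import numpy as np

M1 = np.arange(25).reshape((5, 5))
X = np.array([1, 1, 1, 2, 2])   # f applied to the row indices
Y = np.array([3, 4, 4, 4, 5])   # g applied to the column indices

# Relabel groups as 0..len-1 so they can index a dense result array
xs, xi = np.unique(X, return_inverse=True)
ys, yi = np.unique(Y, return_inverse=True)

# One flat label per entry of M1: row group * number of column groups + column group
labels = (xi[:, None] * len(ys) + yi[None, :]).ravel()

# Sum M1's entries per combined label, then reshape into the block grid
M3 = np.bincount(labels, weights=M1.ravel()).reshape(len(xs), len(ys))
```

On the running example this reproduces `[[15, 63, 27], [35, 117, 43]]`, and it never materializes M2 or per-block copies.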

  2. 2017-01-05 18:01

    Let M1 be a numpy n x m array. You can start by determining which partitions you have. The set constructor removes repeated entries, but orders them arbitrarily. I sort them just to have a well-defined ordering:

    xs = sorted(set(f(i) for i in range(n)))
    ys = sorted(set(g(i) for i in range(m)))
    

    To build a block matrix for each X,Y you can use numpy boolean indexing along with the grid-construction helper ix_ to select only rows and columns belonging to X and Y, respectively. Finally, apply p to the selected submatrix:

    from numpy import zeros, arange, ix_
    
    ii, jj = arange(n), arange(m)
    M3 = zeros((len(xs), len(ys)))
    
    for k, X in enumerate(xs):
        for l, Y in enumerate(ys):
            M3[k,l] = p(M1[ix_(f(ii) == X, g(jj) == Y)])
    

    The partitions f and g have to apply element-wise to numpy arrays for this to work. If they are written for scalars only, numpy.vectorize can be used to achieve this.
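    For instance, a partition defined by a plain dict lookup (a made-up scalar-only example) fails on arrays until it is wrapped:

    ```python
    import numpy as np

    # Scalar-only partition: dict lookup raises TypeError on an array argument
    mapping = {0: 1, 1: 1, 2: 1, 3: 2, 4: 2}

    def f_scalar(x):
        return mapping[x]

    f = np.vectorize(f_scalar)  # f now applies element-wise to numpy arrays
    groups = f(np.arange(5))    # array([1, 1, 1, 2, 2])
    ```

    Note that np.vectorize is a convenience wrapper around a Python loop, not a performance optimization.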

    To give an example:

    import numpy as np

    n = m = 5
    M1 = np.arange(25).reshape(5, 5)
    f = lambda x: x // 3        # f(ii) = [0, 0, 0, 1, 1]
    g = lambda x: (x + 2) // 3  # g(jj) = [0, 1, 1, 1, 2]
    p = np.sum

    # Resulting M3:
    # [[  15.,   63.,   27.],
    #  [  35.,  117.,   43.]]
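    For the special case in the question (x and y identical, a single partition f, and only the part of M2 below the diagonal mattering), one possible adaptation of the loop above is to blank out the diagonal and everything above it with NaN and use a NaN-aware reduction, so that masked entries never contribute to a block. A sketch under those assumptions:

    ```python
    import numpy as np

    n = 5
    M1 = np.arange(25).reshape(5, 5).astype(float)
    f = lambda x: x // 3   # single partition applied to rows and columns alike
    p = np.nansum          # NaN-aware reduction, so masked entries are ignored

    labels = f(np.arange(n))
    xs = np.unique(labels)

    # Hide the diagonal and everything above it
    M = M1.copy()
    M[np.triu_indices(n)] = np.nan

    M3 = np.full((len(xs), len(xs)), np.nan)
    for k, X in enumerate(xs):
        for l, Y in enumerate(xs):
            if Y <= X:  # only blocks on or below the block diagonal can hold entries
                M3[k, l] = p(M[np.ix_(labels == X, labels == Y)])
    ```

    With this f, the strictly-below-diagonal block sums come out as M3 = [[26, nan], [111, 23]]; the upper-right block stays NaN because it contains no below-diagonal entries.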
    