## numpy: Compressing block matrix

Question

Consider a matrix `M1` giving values for all combinations `x,y`. Consider a partition `f(x)->X` and a partition `g(y)->Y`. Furthermore, consider an operation `p(A)` on a set `A` of numbers, e.g. `max(A)` or `sum(A)`.

The mappings `f,g` can be used to create from `M1` a block matrix `M2` where all `x` that are mapped to the same `X` are adjacent, and the same for all `y`.

This matrix `M2` has a block for each combination of the 'sets' `X,Y`.

Now I would like to condense this matrix `M2` into another matrix `M3` by applying `p` on each block separately. `M3` has one value for each combination of `X,Y`.

Ideally, I would like to skip the transformation of `M1` into `M2`, applying `f` and `g` on the fly instead.

What would be the most efficient way to perform such an operation, and would it be possible to use `numpy` or `scipy` for it?

Special case: Actually, in my case `x` and `y` are identical and there is only one function `f` applied to both of them. I only care about the part of `M2` that is under the diagonal.


## Answers to numpy: Compressing block matrix (2)

1. The most straightforward way I can think of to do this, although perhaps not the most efficient (especially if your matrix is huge), is to convert your matrix to a one-dimensional array, and then have corresponding arrays for the partition group indices `X` and `Y`. You can then group by the partition group indices and finally restructure the matrix back into its original form.

For example, if your matrix is

``````
>>> import numpy as np
>>> M1 = np.arange(25).reshape((5,5))
>>> M1
array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19],
       [20, 21, 22, 23, 24]])
``````

``````
>>> def f(x):
...     return np.array([1,1,1,2,2])[x]
>>> def g(y):
...     return np.array([3,4,4,4,5])[y]
``````

From that point, there are several ways to implement the reshaping and subsequent grouping. You can do it with Pandas, for instance, by constructing a `DataFrame` and using its `stack()` method to "stack" all the rows on top of each other in a single column, indexed by their original row and column indices.

``````
>>> import pandas as pd
>>> st = pd.DataFrame(M1).stack().to_frame('M1')
>>> st
     M1
0 0   0
  1   1
  2   2
  3   3
  4   4
1 0   5
...
4 3  23
  4  24
``````

(I have truncated the output for readability, and I trust that you can evaluate the rest of these examples yourself if you want to see their output.) You can then add columns representing the partition group indices:

``````
>>> st['X'] = f(st.index.get_level_values(0))
>>> st['Y'] = g(st.index.get_level_values(1))
``````

Then you can group by those indices and apply your aggregation function of choice.

``````
>>> stp = st.groupby(['X', 'Y']).agg(p)
``````

You will have to define `p` (or find an existing definition) such that it takes a one-dimensional Numpy array and returns a single number. If you want to use something like `sum()`, you can just use `st.groupby(...).sum()` because Pandas has built-in support for that and a few other standard functions, but `agg` is general and works for any reduction function `p` you can provide.
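To make this concrete, here is a hypothetical continuation of the example, choosing `p` to be the block-wise maximum (any function that reduces a 1D array to a scalar works the same way):

```python
import numpy as np
import pandas as pd

M1 = np.arange(25).reshape(5, 5)
st = pd.DataFrame(M1).stack().to_frame('M1')
# Partition group labels for rows and columns, as in the example above
st['X'] = np.array([1, 1, 1, 2, 2])[st.index.get_level_values(0)]
st['Y'] = np.array([3, 4, 4, 4, 5])[st.index.get_level_values(1)]

# Any reduction can be passed to agg, e.g. the maximum of each block:
stp = st.groupby(['X', 'Y'])['M1'].agg(lambda a: a.max())
```

After `unstack()`, this yields the 2x3 matrix of block maxima instead of block sums.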

Finally, the `unstack()` method will convert the DataFrame back into the proper 2D "matrix form", and then if you want you can use `to_numpy()` (which replaces the deprecated `as_matrix()` in current Pandas versions) to turn it back into a pure Numpy array.

``````
>>> M3 = stp.unstack().to_numpy()
>>> M3
array([[ 15,  63,  27],
       [ 35, 117,  43]])
``````

If you don't want to bring in Pandas, there are other libraries that do the same thing. You might look at numpy-groupies, for example, but as I write this it appears to be compatible only with Python 2. However, I haven't found any library that does true two-dimensional grouping, which you might need if you are working with very large matrices, large enough that holding an additional 2 or 3 copies of them would exhaust the available memory.
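For `p = sum` specifically, there is a pure-NumPy route that never materializes `M2` or any grouped copy: scatter-add every entry of `M1` straight into its `(X, Y)` cell with `np.add.at`. This is a sketch using the example partitions above, hard-coded as label arrays:

```python
import numpy as np

M1 = np.arange(25).reshape(5, 5)
row_lab = np.array([1, 1, 1, 2, 2])  # f applied to the row indices
col_lab = np.array([3, 4, 4, 4, 5])  # g applied to the column indices

# Map the arbitrary labels to positions 0..len-1 on each axis
xs, xi = np.unique(row_lab, return_inverse=True)
ys, yi = np.unique(col_lab, return_inverse=True)

M3 = np.zeros((len(xs), len(ys)))
# Broadcast the label indices to M1's shape and accumulate in place
np.add.at(M3, (xi[:, None], yi[None, :]), M1)
```

The same pattern covers `p = max` via `np.maximum.at` on an array initialized to `-np.inf`.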

2. Let `M1` be a numpy `n` x `m` array. You can start by determining which partitions you have. The set constructor removes repeated entries, but orders them arbitrarily. I sort them just to have a well-defined ordering:

``````
xs = sorted(set(f(i) for i in range(n)))
ys = sorted(set(g(i) for i in range(m)))
``````

To build a block matrix for each `X,Y` you can use numpy boolean indexing along with the grid-construction helper `ix_` to select only rows and columns belonging to `X` and `Y`, respectively. Finally, apply `p` to the selected submatrix:

``````
from numpy import zeros, arange, ix_

ii, jj = arange(n), arange(m)
M3 = zeros((len(xs), len(ys)))

for k, X in enumerate(xs):
    for l, Y in enumerate(ys):
        M3[k, l] = p(M1[ix_(f(ii) == X, g(jj) == Y)])
``````

The partitions `f` and `g` have to apply element-wise to numpy arrays for this to work. If they only accept scalars, `numpy.vectorize` can be used to achieve this.
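For instance, if `f` and `g` are plain scalar functions, `numpy.vectorize` wraps them so they accept whole index arrays (a convenience wrapper, not a performance optimization):

```python
import numpy as np

# Scalar partition functions wrapped to work element-wise on arrays
f = np.vectorize(lambda x: x // 3)
g = np.vectorize(lambda y: (y + 2) // 3)

ii = np.arange(5)
print(f(ii))  # [0 0 0 1 1]
print(g(ii))  # [0 1 1 1 2]
```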

To give an example:

``````
import numpy as np

n = m = 5
M1 = np.arange(25).reshape(5, 5)
f = lambda x: x // 3        # f(ii) = [0, 0, 0, 1, 1]
g = lambda x: (x + 2) // 3  # g(jj) = [0, 1, 1, 1, 2]
p = np.sum

# Resulting M3:
# [[ 15.,  63.,  27.],
#  [ 35., 117.,  43.]]
``````
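For the special case mentioned in the question (a single partition `f` on both axes, with only the part below the diagonal of interest), the same loop can simply skip blocks above the block diagonal. The label values below are made up for illustration:

```python
import numpy as np

M1 = np.arange(25).reshape(5, 5)
labels = np.array([0, 0, 0, 1, 1])  # f applied to the indices 0..4
groups = np.unique(labels)

# Fill only blocks on or below the block diagonal; leave the rest as NaN
M3 = np.full((len(groups), len(groups)), np.nan)
for k, X in enumerate(groups):
    for l, Y in enumerate(groups):
        if k < l:
            continue
        M3[k, l] = M1[np.ix_(labels == X, labels == Y)].sum()
```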