Cannon’s Algorithm and the 2.5D Matrix Multiplication Algorithm

This is a summary of two popular distributed matrix multiplication algorithms: Cannon’s algorithm and the 2.5D algorithm. The 2.5D matrix multiplication algorithm can be described as an extension of the traditional Cannon’s algorithm. I am using the 2.5D algorithm to demonstrate the usability of Habanero Java’s ArrayView-based MPI APIs. I have included descriptions of the algorithms, pointers to resources, and my own notes on both.

Cannon’s algorithm takes as input two matrices, A and B. In the simplest case, both matrices have dimensions n by n, and there are p processors.

First, we break matrices A and B into p equal-size blocks, arranged in a $\sqrt{p} \times \sqrt{p}$ grid; each block has $n^2 / p$ elements. We use each block as our smallest unit of addition and multiplication, since block-level operations stand in for the scalar operations applied to the individual elements.

Now we regard A and B as $\sqrt{p} \times \sqrt{p}$ matrices of blocks. For block (i, j), the final result, assuming it is stored in a matrix P, is

$P_{i,j} = \sum_{k} A_{i,k} \, B_{k,j}$ (the dot product of block row $i$ of A with block column $j$ of B)

(Figure omitted: visualization of the block row of A and block column of B that accumulate into $P_{i,j}$.)
The calculation is done by accumulating all of the partial results into each block. The intuition is that, through a clever initial placement of the blocks, we can shift them so that, in the end, all the products needed for $P_{i,j}$ have been accumulated at $P_{i,j}$.

The exact algorithm is listed below (credit to a tutorial that captured my understanding):

  1. Initially $P_{i,j}$ begins with $\mathsf{A}_{i,j}$ and $\mathsf{B}_{i,j}$.
  2. Elements are moved from their initial positions to align them so that the correct submatrices are multiplied with one another. Note that submatrices on the diagonal don’t actually require alignment. Alignment is done by shifting the $i$-th row of $\mathsf{A}$ $i$ positions left and the $j$-th column of $\mathsf{B}$ $j$ positions up.
  3. Each process, $P_{i,j}$, multiplies its current submatrices and adds to a cumulative sum.
  4. The submatrices are shifted left and upwards.
  5. The above two steps are repeated through the remaining submatrices.

Note: step 2 is very important. It sets up the initial alignment so that, with all the subsequent shifting, every product needed for $P_{i,j}$ is computed at, and accumulated into, $P_{i,j}$.
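The steps above can be sketched as a small sequential simulation (plain Java, not the Habanero/MPI version; the grid and block sizes here are made up for illustration, and the shifts are plain array rotations standing in for message passing):

```java
// Sequential simulation of Cannon's algorithm on a q x q grid of b x b blocks.
public class CannonDemo {
    static final int Q = 3, B = 2, N = Q * B; // grid size, block size, matrix size

    // C += A * B for one b x b block pair (step 3 of the algorithm)
    static void mulAdd(double[][] a, double[][] b, double[][] c) {
        for (int i = 0; i < B; i++)
            for (int k = 0; k < B; k++)
                for (int j = 0; j < B; j++)
                    c[i][j] += a[i][k] * b[k][j];
    }

    // Cut an n x n matrix into a q x q grid of b x b blocks
    static double[][][][] split(double[][] m) {
        double[][][][] blk = new double[Q][Q][B][B];
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                blk[i / B][j / B][i % B][j % B] = m[i][j];
        return blk;
    }

    // Circular leftward shift: step 2 shifts row i by i; steps 4-5 shift by one
    static double[][][][] shiftLeft(double[][][][] blk, boolean initial) {
        double[][][][] out = new double[Q][Q][][];
        for (int bi = 0; bi < Q; bi++) {
            int s = initial ? bi : 1;
            for (int bj = 0; bj < Q; bj++)
                out[bi][bj] = blk[bi][(bj + s) % Q];
        }
        return out;
    }

    // Circular upward shift: step 2 shifts column j by j; steps 4-5 shift by one
    static double[][][][] shiftUp(double[][][][] blk, boolean initial) {
        double[][][][] out = new double[Q][Q][][];
        for (int bj = 0; bj < Q; bj++) {
            int s = initial ? bj : 1;
            for (int bi = 0; bi < Q; bi++)
                out[bi][bj] = blk[(bi + s) % Q][bj];
        }
        return out;
    }

    static boolean cannonMatchesDirect() {
        double[][] A = new double[N][N], Bm = new double[N][N];
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) { A[i][j] = i * N + j + 1; Bm[i][j] = 2 * A[i][j]; }

        double[][][][] blkA = shiftLeft(split(A), true);  // step 2: initial alignment
        double[][][][] blkB = shiftUp(split(Bm), true);
        double[][][][] blkC = new double[Q][Q][B][B];

        for (int step = 0; step < Q; step++) {            // steps 3-5
            for (int bi = 0; bi < Q; bi++)
                for (int bj = 0; bj < Q; bj++)
                    mulAdd(blkA[bi][bj], blkB[bi][bj], blkC[bi][bj]);
            blkA = shiftLeft(blkA, false);
            blkB = shiftUp(blkB, false);
        }

        // Every product needed for P(i,j) should have accumulated at P(i,j)
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                double ref = 0;
                for (int k = 0; k < N; k++) ref += A[i][k] * Bm[k][j];
                if (blkC[i / B][j / B][i % B][j % B] != ref) return false;
            }
        return true;
    }

    public static void main(String[] args) {
        System.out.println("Cannon matches direct multiply: " + cannonMatchesDirect());
    }
}
```

After the initial alignment, process (i, j) holds $A_{i,(i+j) \bmod q}$ and $B_{(i+j) \bmod q,\,j}$, so each one-step shift walks both operands through the same summation index $k$ in lockstep.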

Now, let’s move on to the 2.5D algorithm. The official reference is the paper by Edgar Solomonik and Prof. James Demmel (2.5DMatrixMult). Prof. Mellor-Crummey pointed me to it in a Comp 422 homework assignment.

The key idea behind the 2.5D algorithm is to use a third dimension to form a processor mesh. The size of the third dimension is specified by a constant $c$. The intuition is that we run Cannon’s algorithm with some replication along the third dimension: if multiple copies of the sub-matrices (blocks) are floating around, we can compute several of the block products in

$P_{i,j} = \sum_{k} A_{i,k} \, B_{k,j}$ (the dot product of block row $i$ of A with block column $j$ of B)

in parallel. (Cannon’s algorithm computes one block product of this sum at a time.)

Compared with Cannon’s algorithm, the 2.5D algorithm requires:

  • replication of the sub-matrices along the third dimension, where Cannon’s algorithm uses only a two-dimensional processor plane;
  • changes to the initial shifting that places the sub-matrices in the right positions, since it must take the third dimension into account;
  • a final reduction phase along the third dimension to accumulate the results back into the sub-matrices on the front plane.

The shifting scheme itself is unchanged (leftward circular shift for A, upward circular shift for B).
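These differences can be sketched sequentially with one element per process (a minimal sketch, assuming the front plane is $s \times s$ with $s = \sqrt{p/c}$ and that $c$ divides $s$; this is my own simplification, not the actual MPI implementation):

```java
// Sequential sketch of the 2.5D scheme: each of the c replication layers
// computes s/c of the k-terms of C[i][j] = sum_k A[i][k]*B[k][j], starting
// from a layer-dependent initial alignment; the loop over layers plays the
// role of the final reduction along the third dimension.
public class TwoPointFiveDDemo {
    static double[][] multiply25D(double[][] A, double[][] B, int c) {
        int s = A.length;              // front-plane grid is s x s
        int stepsPerLayer = s / c;     // assumes c divides s
        double[][] C = new double[s][s];
        for (int l = 0; l < c; l++) {                  // each replication layer
            for (int i = 0; i < s; i++)
                for (int j = 0; j < s; j++)
                    for (int step = 0; step < stepsPerLayer; step++) {
                        // layer l's initial alignment starts at an extra
                        // offset of l * s/c; shifting then advances k by one
                        int k = (i + j + l * stepsPerLayer + step) % s;
                        C[i][j] += A[i][k] * B[k][j];  // reduction folded in
                    }
        }
        return C;
    }

    static boolean matchesDirect(int s, int c) {
        double[][] A = new double[s][s], B = new double[s][s];
        for (int i = 0; i < s; i++)
            for (int j = 0; j < s; j++) { A[i][j] = i * s + j + 1; B[i][j] = 2 * A[i][j]; }
        double[][] C = multiply25D(A, B, c);
        for (int i = 0; i < s; i++)
            for (int j = 0; j < s; j++) {
                double ref = 0;
                for (int k = 0; k < s; k++) ref += A[i][k] * B[k][j];
                if (C[i][j] != ref) return false;
            }
        return true;
    }

    public static void main(String[] args) {
        // c = 1 degenerates to Cannon's algorithm; larger c splits the
        // k-summation across layers, trading memory for fewer shift steps
        System.out.println(matchesDirect(4, 1) && matchesDirect(4, 2) && matchesDirect(4, 4));
    }
}
```

The per-layer offset makes the $c$ layers cover disjoint ranges of the summation index $k$, which is exactly why the initial shifting must change while the one-step shift pattern stays the same.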

To work through examples that help in understanding the algorithm, let n be the dimension of the n x n input matrices A and B, p the total number of processors, and c the size of the third dimension (the replication factor).

  • n = 4, p = 8, c = 2
    • Tests the initial shifting and the final reduction in 2.5D. The algorithm returns the correct result while performing only the initial shift and the final reduction (no circular shifting in the intermediate steps).
  • n = 4, p = 4, c = 1
    • Tests the algorithm in 2D, without the final reduction and the third dimension, mostly exercising the intermediate shifts; it should shift one step.
    • I also tested at a larger scale: n = 6, p = 9, c = 1.
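As a sanity check on these configurations, the mesh arithmetic can be computed directly (a small sketch assuming an $s \times s \times c$ mesh with $s = \sqrt{p/c}$; the method names are my own):

```java
// Mesh arithmetic for the 2.5D test configurations: the front plane is
// s x s with s = sqrt(p/c), and each layer performs s/c multiply steps,
// i.e. s/c - 1 intermediate circular shifts after the initial alignment.
public class MeshParams {
    static int frontPlane(int p, int c) {
        return (int) Math.round(Math.sqrt((double) p / c));
    }
    static int stepsPerLayer(int p, int c) {
        return frontPlane(p, c) / c;
    }

    public static void main(String[] args) {
        // p=8, c=2: s=2, 1 step/layer -> only initial shift + final reduction
        System.out.println(frontPlane(8, 2) + " " + stepsPerLayer(8, 2));
        // p=4, c=1: s=2, 2 steps -> exactly one intermediate shift, no reduction
        System.out.println(frontPlane(4, 1) + " " + stepsPerLayer(4, 1));
        // p=9, c=1: s=3, 3 steps -> two intermediate shifts
        System.out.println(frontPlane(9, 1) + " " + stepsPerLayer(9, 1));
    }
}
```

This matches the behavior described above: with p = 8 and c = 2 each layer does a single multiply, so no intermediate circular shifts are needed, while with p = 4 and c = 1 the algorithm shifts exactly one step.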

I fill every element of each rank’s block of matrix A with that rank’s number, and form matrix B by doubling all the elements of A. With this scheme, if any rank misbehaves, the error shows up in the final result.
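A sketch of why this test-data scheme surfaces errors (plain Java with a made-up 2 x 2 rank grid; the corrupted block stands in for a misbehaving rank):

```java
// Test-data scheme: every element of rank r's block of A is r, and B = 2*A.
// Corrupting any one rank's block changes the product, so a misbehaving
// rank is visible in the final result.
public class TestDataDemo {
    static final int Q = 2, B = 2, N = Q * B;   // 2x2 grid of ranks, 2x2 blocks

    // Element (i,j) gets the id of the rank owning block (i/b, j/b)
    static double[][] rankMatrix(double scale) {
        double[][] m = new double[N][N];
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                m[i][j] = scale * ((i / B) * Q + j / B);
        return m;
    }

    static double[][] multiply(double[][] a, double[][] b) {
        double[][] c = new double[N][N];
        for (int i = 0; i < N; i++)
            for (int k = 0; k < N; k++)
                for (int j = 0; j < N; j++)
                    c[i][j] += a[i][k] * b[k][j];
        return c;
    }

    static boolean corruptionIsVisible() {
        double[][] A = rankMatrix(1), Bm = rankMatrix(2);  // B = 2 * A
        double[][] good = multiply(A, Bm);
        // Simulate rank 1 (block (0,1) of A) producing garbage values
        for (int i = 0; i < B; i++)
            for (int j = B; j < N; j++) A[i][j] = 99;
        double[][] bad = multiply(A, Bm);
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                if (good[i][j] != bad[i][j]) return true;  // error surfaced
        return false;
    }

    public static void main(String[] args) {
        System.out.println("corruption visible: " + corruptionIsVisible());
    }
}
```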

The command to run is shown below.

The documentation for the MPI Java API in the Open MPI suite is here.

mpirun -np 4 java -cp /Users/zhangyunming1990/Documents/Research/hj-lib-mpi/hj-lib/target/classes:/Users/zhangyunming1990/.m2/repository/org/open-mpi/openmpi/1.9.0/openmpi-1.9.0.jar edu.rice.hj.example.hjmpi.MatrixMultTwoAndHalfDAlgorithm2

-np specifies the number of processes to launch.

This entry was posted in Class, HPC.
