How does 3D convolution really work in CAFFE: a detailed analysis

This post summarizes how 3D convolution is implemented as a 2D matrix multiplication in CAFFE and other popular CNN implementations.

We start by analyzing the code in conv_layer, specifically the Forward_cpu code. Following the example of the first convolution layer, the operation is conceptually a 3D tensor convolution, but it is carried out as a 2D matrix multiplication. The important dimensions are M_, N_, K_.

Why are M_, N_, K_ important? In Forward_cpu, matrix A of the BLAS call is the weight matrix; it has M rows and K columns. B is the 3D input tensor linearized into a 2D matrix by the im2col operation; its dimensions are K by N. (See the documentation for A, B, and the BLAS call below.)

So the convolution becomes

A (M by K) * B (K by N), which yields C (M by N)
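In plain terms, each output element is a dot product between one row of the weight matrix and one column of the unrolled-input matrix: C[m][n] is the sum over k of A[m][k] * B[k][n]. A minimal naive sketch of that product for row-major matrices (not Caffe's actual code, which calls the optimized BLAS routine shown later):

// Naive row-major matrix product: C (M x N) = A (M x K) * B (K x N).
// This is what cblas_sgemm computes when alpha = 1 and beta = 0.
void naive_gemm(int M, int N, int K,
                const float* A, const float* B, float* C) {
  for (int m = 0; m < M; ++m) {
    for (int n = 0; n < N; ++n) {
      float sum = 0.f;
      for (int k = 0; k < K; ++k) {
        sum += A[m * K + k] * B[k * N + n];
      }
      C[m * N + n] = sum;
    }
  }
}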

The code that calculates M_, K_, and N_ is shown below:

// Prepare the matrix multiplication computation.
// Each input will be convolved as a single GEMM.
M_ = num_output_ / group_;
K_ = channels_ * kernel_h_ * kernel_w_ / group_;
N_ = height_out_ * width_out_;
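To make this concrete, here is a small sketch plugging in AlexNet-style conv1 numbers (assumed for illustration: 96 output filters, 3 input channels, 11 by 11 kernel, stride 4, no padding, 227 by 227 input, group 1):

#include <cstdio>

int main() {
  int num_output = 96, channels = 3, kernel = 11, stride = 4, pad = 0;
  int height = 227, width = 227, group = 1;
  // Output spatial size, using the same formula Caffe uses.
  int height_out = (height + 2 * pad - kernel) / stride + 1;  // 55
  int width_out  = (width  + 2 * pad - kernel) / stride + 1;  // 55
  int M = num_output / group;                  // 96
  int K = channels * kernel * kernel / group;  // 3 * 11 * 11 = 363
  int N = height_out * width_out;              // 55 * 55 = 3025
  printf("M=%d K=%d N=%d\n", M, K, N);         // prints M=96 K=363 N=3025
  return 0;
}

These are exactly the 96, 363, and 3025 that show up in the worked example at the end of this post.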

The Forward_cpu code in the convolution layer:

for (int n = 0; n < num_; ++n) {
  // im2col transformation: unroll input regions for filtering
  // into column matrix for multiplication.
  im2col_cpu(bottom_data + bottom[i]->offset(n), channels_, height_,
      width_, kernel_h_, kernel_w_, pad_h_, pad_w_, stride_h_, stride_w_,
      col_data);
  // Take inner products for groups.
  for (int g = 0; g < group_; ++g) {
    caffe_cpu_gemm<Dtype>(CblasNoTrans, CblasNoTrans, M_, N_, K_,
        (Dtype)1., weight + weight_offset * g, col_data + col_offset * g,
        (Dtype)0., top_data + (*top)[i]->offset(n) + top_offset * g);
  }
  // Add bias.
  if (bias_term_) {
    caffe_cpu_gemm<Dtype>(CblasNoTrans, CblasNoTrans, num_output_,
        N_, 1, (Dtype)1., this->blobs_[1]->cpu_data(),
        bias_multiplier_.cpu_data(),
        (Dtype)1., top_data + (*top)[i]->offset(n));
  }
}
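The im2col_cpu call above is what linearizes the 3D input into the K by N matrix B. A stripped-down sketch of the idea (same output layout as Caffe's im2col_cpu, but without its optimizations; data_im and data_col are assumed to be preallocated buffers):

// Unroll a (channels x height x width) image into a
// (channels * kernel_h * kernel_w) x (height_out * width_out) column matrix,
// so that the convolution becomes a single matrix product.
void im2col_simple(const float* data_im, int channels, int height, int width,
                   int kernel_h, int kernel_w, int pad_h, int pad_w,
                   int stride_h, int stride_w, float* data_col) {
  int height_out = (height + 2 * pad_h - kernel_h) / stride_h + 1;
  int width_out  = (width  + 2 * pad_w - kernel_w) / stride_w + 1;
  int N = height_out * width_out;  // columns of the unrolled matrix
  for (int c = 0; c < channels; ++c) {
    for (int kh = 0; kh < kernel_h; ++kh) {
      for (int kw = 0; kw < kernel_w; ++kw) {
        int row = (c * kernel_h + kh) * kernel_w + kw;  // 0 .. K-1
        for (int oh = 0; oh < height_out; ++oh) {
          for (int ow = 0; ow < width_out; ++ow) {
            int ih = oh * stride_h - pad_h + kh;
            int iw = ow * stride_w - pad_w + kw;
            int col = oh * width_out + ow;              // 0 .. N-1
            data_col[row * N + col] =
                (ih >= 0 && ih < height && iw >= 0 && iw < width)
                ? data_im[(c * height + ih) * width + iw]
                : 0.f;  // zero padding
          }
        }
      }
    }
  }
}

Each column of the result holds one receptive field (all channels of one kernel-sized window), so multiplying the weight matrix by it computes every filter response at every output location in one GEMM.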




The CAFFE function that makes the BLAS call:


template <>
void caffe_cpu_gemm<float>(const CBLAS_TRANSPOSE TransA,
    const CBLAS_TRANSPOSE TransB, const int M, const int N, const int K,
    const float alpha, const float* A, const float* B, const float beta,
    float* C) {
  int lda = (TransA == CblasNoTrans) ? K : M;
  int ldb = (TransB == CblasNoTrans) ? N : K;
  cblas_sgemm(CblasRowMajor, TransA, TransB, M, N, K, alpha, A, lda, B,
      ldb, beta, C, N);
}


BLAS function documentation (shown for cblas_dgemm; the single-precision cblas_sgemm used above takes the same arguments with float instead of double):

void cblas_dgemm(const enum CBLAS_ORDER Order, const enum CBLAS_TRANSPOSE TransA,
    const enum CBLAS_TRANSPOSE TransB, const int M, const int N, const int K,
    const double alpha, const double *A, const int lda, const double *B,
    const int ldb, const double beta, double *C, const int ldc);



Order: Specifies row-major (C) or column-major (Fortran) data ordering.

TransA: Specifies whether to transpose matrix A.

TransB: Specifies whether to transpose matrix B.

M: Number of rows in matrices A and C.

N: Number of columns in matrices B and C.

K: Number of columns in matrix A; number of rows in matrix B.

alpha: Scaling factor for the product of matrices A and B.

A: Matrix A.

lda: The size of the first dimension of matrix A; if you are passing a matrix A[m][n], the value should be m.

B: Matrix B.

ldb: The size of the first dimension of matrix B; if you are passing a matrix B[m][n], the value should be m.

beta: Scaling factor for matrix C.

C: Matrix C.

ldc: The size of the first dimension of matrix C; if you are passing a matrix C[m][n], the value should be m.
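Note that with CblasRowMajor, as in the Caffe wrapper above, the leading dimension of each matrix is its row stride, i.e. its number of columns, which is why the wrapper passes lda = K, ldb = N, and ldc = N for non-transposed inputs. A tiny standalone example (assumes a CBLAS implementation such as OpenBLAS or ATLAS is linked in):

#include <cblas.h>
#include <cstdio>

int main() {
  // A is 2x3, B is 3x2, C = A * B is 2x2, all stored row-major.
  float A[2 * 3] = {1, 2, 3,
                    4, 5, 6};
  float B[3 * 2] = {1, 0,
                    0, 1,
                    1, 1};
  float C[2 * 2] = {0};
  // lda/ldb/ldc are the numbers of columns of A, B, and C: 3, 2, 2.
  cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
              2 /*M*/, 2 /*N*/, 3 /*K*/,
              1.f, A, 3, B, 2, 0.f, C, 2);
  printf("%.0f %.0f\n%.0f %.0f\n", C[0], C[1], C[2], C[3]);
  // Expected output:
  // 4 5
  // 10 11
  return 0;
}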

So, for example, the first convolution layer is essentially a 96 by 363 matrix A (the weights) times a 363 by 3025 matrix B (the im2col output): M_ = 96, K_ = 3 * 11 * 11 = 363, N_ = 55 * 55 = 3025. The result is a 96 by 3025 matrix, corresponding to the 96 * 55 * 55 output 3D tensor.
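Because the GEMM output is row-major, no reshaping is needed afterwards: row o of the 96 by 3025 result already holds the 55 by 55 feature map of filter o. A one-line sketch of the index mapping (the 55 and 3025 are taken from the conv1 example above):

// Element (o, y, x) of the 96 x 55 x 55 output blob, read straight
// from the 96 x 3025 GEMM result C.
inline float conv1_output(const float* C, int o, int y, int x) {
  return C[(o * 55 + y) * 55 + x];  // o * 3025 + y * 55 + x
}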
