Here is a post summarizing my reading and research into applying “cache blocking” to SpMV (sparse matrix–dense vector multiplication).
First, I want to talk a bit about why you should care about cache blocking for SpMV. SpMV performance is largely bound by memory access latency, especially when the source vector is larger than the cache. By boosting temporal locality, cache blocking has the potential to significantly improve SpMV performance.
A series of works has come out of UC Berkeley in this space. I like to focus on a few good PhD theses, as they give the most detailed examples, and to trace the lines of work chronologically.
First, the basics of “cache blocking” for SpMV, drawn from one of the earlier papers on this topic, part of a series of papers from the early 2000s.
“Optimizing Sparse Matrix Vector Multiplication on SMPs”, by Eun-Jin Im and Katherine Yelick. This line of work focused on two techniques that improve cache performance: “matrix reordering” and “cache blocking”. It evaluated the feasibility of both, finding that cache blocking is useful only in limited cases.
Y = X*A (X is the sparse matrix, Y is the destination vector, A is the source vector).
It has a good introduction to what “cache blocking” is: you keep a portion of the source vector (of width c_cache) and a portion of the destination vector (of height r_cache) resident in the cache while processing the corresponding block of the matrix.
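To make the idea concrete, here is a minimal sketch of cache-blocked SpMV (Y = X*A) over a COO-style nonzero list. The function name, the COO input format, and the tiny default block sizes are my own illustrative choices, not from the paper; in practice r_cache and c_cache would be tuned so the touched slices of Y and A fit in cache.

```python
def blocked_spmv(nnz, n_rows, a, r_cache=2, c_cache=2):
    """Cache-blocked SpMV sketch: y = X @ a, with X given as (row, col, val) triples."""
    y = [0.0] * n_rows
    # Pre-partition nonzeros by (row block, column block), so each block's pass
    # touches only r_cache entries of y and c_cache entries of a.
    blocks = {}
    for (i, j, v) in nnz:
        key = (i // r_cache, j // c_cache)
        blocks.setdefault(key, []).append((i, j, v))
    # Process blocks in row-block-major order; within a block, the slices of
    # y and a being reused are small enough to stay cache-resident.
    for key in sorted(blocks):
        for (i, j, v) in blocks[key]:
            y[i] += v * a[j]
    return y
```

For example, a 4x4 matrix with nonzeros (0,0)=1, (0,3)=2, (2,1)=3, (3,3)=4 multiplied by a = [1, 2, 3, 4] gives y = [9, 0, 6, 16], matching an unblocked SpMV.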
It experimented with multiple parallel approaches and found cache blocking to work for a specific kind of matrix.
The results from this paper do not matter that much, as it is pretty old, but it is a good paper on why cache blocking could be helpful.
I found the details of implementing cache blocking for SpMV particularly helpful. Eun-Jin Im’s PhD thesis, “Optimizing the Performance of Sparse Matrix-Vector Multiplication”, is still a pretty good read, especially the pseudocode for cache blocking in chapter 4.
A framework is also proposed in “SPARSITY: Optimization Framework for Sparse Matrix Kernels” by Eun-Jin Im, Katherine Yelick, and Richard Vuduc, but the results do not seem to cover parallel performance. The claimed speedup is up to 2.2x.
SPARSITY focused on 1) cache blocking, along with other optimizations, 2) register blocking for more regular matrices with small dense blocks, and 3) multiplication by multiple vectors.
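Register blocking (item 2) stores the matrix as small dense blocks (a BCSR-style layout) and fully unrolls the per-block inner loop, so the block’s partial sums can stay in registers. Below is a sketch for 2x2 blocks; the function name, argument names, and flat `[v00, v01, v10, v11]` block layout are my own assumptions, not SPARSITY’s actual data structures.

```python
def bcsr_spmv_2x2(block_ptr, block_col, block_val, a):
    """Register-blocked SpMV sketch with 2x2 blocks (BCSR-style storage).

    block_ptr[bi]..block_ptr[bi+1] indexes the blocks of block-row bi;
    block_col[k] is the block-column index; block_val[k] is a flat 2x2 block.
    """
    n_block_rows = len(block_ptr) - 1
    y = [0.0] * (2 * n_block_rows)
    for bi in range(n_block_rows):
        # Two scalar accumulators play the role of registers.
        y0 = 0.0
        y1 = 0.0
        for k in range(block_ptr[bi], block_ptr[bi + 1]):
            j = 2 * block_col[k]
            v = block_val[k]  # flat 2x2 block: [v00, v01, v10, v11]
            # Fully unrolled 2x2 block-vector product.
            y0 += v[0] * a[j] + v[1] * a[j + 1]
            y1 += v[2] * a[j] + v[3] * a[j + 1]
        y[2 * bi] = y0
        y[2 * bi + 1] = y1
    return y
```

The trade-off is that any zeros inside a stored block are computed explicitly, which is why register blocking pays off mainly for matrices whose nonzeros naturally cluster into small dense blocks.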
In 2004, another paper from the BeBOP group (James Demmel and Katherine Yelick’s group), “When Cache Blocking of Sparse Matrix Vector Multiply Works and Why”, presented
1) a performance model for cache blocking on SpMV, and
2) the scenarios under which cache blocking is useful.
The scenarios are: a) the source vector A does not fit in cache, b) the destination vector Y does fit in cache, c) the non-zeros are distributed throughout the matrix, and d) the non-zero density is sufficiently high.
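These conditions can be distilled into a rough go/no-go heuristic. The sketch below is my own illustration, not the paper’s performance model: the cache size, element size, and density threshold are placeholder values, and condition c) (how spread out the nonzeros are) is simply assumed to hold, since checking it requires the actual sparsity pattern.

```python
def cache_blocking_likely_helps(n_rows, n_cols, nnz_count,
                                cache_bytes=1 << 20,  # assumed 1 MiB cache
                                elem_bytes=8,          # double precision
                                min_density=1e-4):     # illustrative threshold
    """Rough heuristic for conditions a), b), and d); c) is assumed true."""
    src_fits = n_cols * elem_bytes <= cache_bytes   # a) wants this False
    dst_fits = n_rows * elem_bytes <= cache_bytes   # b) wants this True
    density = nnz_count / (n_rows * n_cols)         # d) wants this high enough
    return (not src_fits) and dst_fits and density >= min_density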
Afterwards, around 2007, some good work was done by Samuel Williams and Richard Vuduc from Berkeley.
There is this good paper, “Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms”, from Supercomputing 2007 (SC07), by Samuel Williams and others.
They also discussed cache blocking in the paper, but again it seems focused on single-core cache blocking performance; they did not seem to focus on parallelism within the L3 cache.
In 2013, newer works, such as “Efficient Sparse Matrix-Vector Multiplication on x86-Based Many-Core Processors”, seem to focus on using various new sparse matrix formats to achieve some of the benefits of cache blocking. It is not entirely clear to me how, or how well, it works. Apparently, a large amount of work has shifted to using cache blocking for GPU performance.
One work worth pointing out, “Cache-aware Sparse Matrix Formats for Kepler GPU”, takes a new approach to cache optimization that still tries to reduce cache misses and improve temporal locality. The difference is that it is a two-phase process involving an intermediate data structure.
Phase 1) Divide the matrix into submatrices and the vector into subvectors, perform the multiplications on these sub-SpMV problems, and write the results to an intermediate data structure.
Phase 2) Merge the intermediate results into the final destination vector.
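The two phases above can be sketched as follows. This is my own minimal illustration of the general idea, not the paper’s GPU format: each (row block, column block) pair gets its own partial-result buffer in phase 1 (so different blocks can, in principle, run in parallel without sharing an accumulator), and phase 2 merges the buffers into the destination vector.

```python
def two_phase_spmv(nnz, n_rows, a, r_cache=2, c_cache=2):
    """Two-phase SpMV sketch: y = X @ a, with X given as (row, col, val) triples."""
    # Phase 1: accumulate per-block partial results into intermediate buffers.
    # Blocks never share a buffer, so this loop parallelizes without atomics.
    partials = {}
    for (i, j, v) in nnz:
        key = (i // r_cache, j // c_cache)
        buf = partials.setdefault(key, [0.0] * r_cache)
        buf[i % r_cache] += v * a[j]
    # Phase 2: merge the intermediate buffers into the destination vector.
    y = [0.0] * n_rows
    for (bi, _bj), buf in partials.items():
        for r, val in enumerate(buf):
            if bi * r_cache + r < n_rows:
                y[bi * r_cache + r] += val
    return y
```

On the same 4x4 example as before (nonzeros (0,0)=1, (0,3)=2, (2,1)=3, (3,3)=4 and a = [1, 2, 3, 4]) this produces y = [9, 0, 6, 16], identical to single-phase SpMV; the cost of the extra buffers is traded for the extra parallelism in phase 1.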
We also experimented with an approach for graph operations (which bear similarities to SpMV) that uses a two-phase scheme. We found that this approach generally exposes much greater parallelism than the original cache blocking techniques: “Optimizing Cache Performance for Graph Analytics”.