When counting flops is not enough: Optimization of data access viewed from programmers' perspective
Abstract
Time: 14:50 - 15:10Until recently, the exponential growth in processor performance as predicted by "Moore's Law'' enabled \textit{automatic speed-up} of compute-intensive applications. In the last last decade the industry has... [ view full abstract ]
Time: 14:50 - 15:10
Until recently, the exponential growth in processor performance as predicted by "Moore's Law'' enabled \textit{automatic speed-up} of compute-intensive applications. In the last last decade the industry has witnessed a shift from high-frequency single-CPU systems to ones that feature a large number of compute cores of modest processing power. Independent of whether programmers write code for sequential or parallel systems one of the fundamental challenges to achieving high performance is to do with getting data from slower memory to processors. The main purpose of this paper is to elucidate basic concepts of locality of reference to data in a hierarchical memory system. We have chosen (sequential) matrix-matrix multiplication (BLAS 3) as the representative application to illustrate the identification and exploitation of data locality. However, we depart from the usual textbook illustration of starting with triple-loop multiplication code and then showing data access pattern under different loop permutations. Instead, we express matrix-matrix multiplication in terms of elementary operations of inner-products, outer-products, and linear combinations. Triple-loop multiplication codes follow naturally thereof depicting pattern of data access such that the most convenient storage of matrix elements can be observed directly. The code is then extended to incorporate blocked multiplication to demonstrate temporal locality. As an example of structured problem we introduce storage by diagonals to perform banded matrix multiplication. From our experience, students find indexing in diagonal storage rather complicated. We use diagrams and derive indexing formulas to access the matrix elements by diagonals. The effect of memory hierarchy on performance is demonstrated by numerical experiments.
Authors
-
Shahadat Hossain
(University of Lethbridge)
-
Trond Steihaug
(University of Bergen)
Topic Area
Education in Computational Science and Engineering
Session
» Education in CSE - part III (14:50 - Tuesday, 24th October, 12th floor - Stratos)