
Why is Transposing a 512x512 Matrix Significantly Slower Than Transposing a 513x513 Matrix?

Why Transposing 512x512 Matrices is Slower Than Transposing 513x513

A peculiar pattern emerged after timing the transposition of matrices of varying sizes: transposing a matrix of dimension 2^n is noticeably more expensive than transposing one of dimension 2^n + 1. The gap becomes striking at dimension 512.

The code provided for the transposition operation is as follows:

#define SAMPLES 1000
#define MATSIZE 512

#include <time.h>
#include <iostream>
int mat[MATSIZE][MATSIZE];

// Note: both i and j sweep the full range, so each off-diagonal pair is
// swapped twice and the matrix ends up unchanged; the loop exists purely
// to exercise the row-wise mat[i][j] and column-wise mat[j][i] accesses.
void transpose()
{
   for ( int i = 0 ; i < MATSIZE ; i++ )
   for ( int j = 0 ; j < MATSIZE ; j++ )
   {
       int aux = mat[i][j];
       mat[i][j] = mat[j][i];
       mat[j][i] = aux;
   }
}

int main()
{
   //initialize matrix
   for ( int i = 0 ; i < MATSIZE ; i++ )
   for ( int j = 0 ; j < MATSIZE ; j++ )
       mat[i][j] = i+j;

   int t = clock();
   for ( int i = 0 ; i < SAMPLES ; i++ )
       transpose();
   int elapsed = clock() - t;

   // average per transpose, in clock() ticks
   std::cout << "Average for a matrix of " << MATSIZE << ": " << elapsed / SAMPLES << "\n";
}

Changing the MATSIZE macro changes the size of the matrix; one way to rebuild for different sizes without editing the source is sketched after the benchmarks. The following measurements illustrate the stark difference:

  • Size 512: Average 2.46 ms
  • Size 513: Average 0.75 ms
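
As a small aside, one way to switch sizes without editing the source each time is to guard the define so it can be overridden from the compiler command line. The #ifndef guard and the g++ invocation below are conveniences added here, not part of the original code:

// Guarded define: the hard-coded value becomes a default that can be
// overridden at build time, e.g. g++ -O2 -DMATSIZE=513 transpose.cpp
// (the file name is hypothetical).
#ifndef MATSIZE
#define MATSIZE 512
#endif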

Cache Contention and Critical Stride

The reason behind this anomaly lies in cache behavior and the concept of cache contention. Here's a breakdown:

  • A cache is organized into sets and lines. A given memory address maps to exactly one set, and within that set its data may occupy any of the set's lines. The total cache size equals the total number of lines multiplied by the line size.
  • The set a memory address maps to is computed as set = ( address / lineSize ) % numberOfSets.
  • When more addresses map to a set than the set has lines, cache conflicts arise: the least recently used line in the set is evicted to make room for the newly fetched data.
  • The critical stride is the distance in memory between two addresses that map to the same cache set: criticalStride = numberOfSets * lineSize. Addresses that are a multiple of the critical stride apart compete for the same set (the sketch after this list computes these set indices for 512- and 513-int rows).
  • Consider the example of a 64x64 int matrix with an 8 KB cache: a row occupies 64 * 4 = 256 bytes (assuming 4-byte ints), a power of two, so the critical stride is an exact multiple of the row size. Walking down a column during transposition therefore hits the same few sets over and over, evicting lines before they can be reused and forcing constant reloads.
  • With a 65x65 matrix, a row occupies 260 bytes, which no longer divides the critical stride evenly; successive column elements spread across many different sets, conflicts become rare, and performance improves.
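
To make the effect concrete, the following self-contained sketch applies the set = ( address / lineSize ) % numberOfSets formula to the addresses touched while walking down one column. The cache parameters (64-byte lines, 128 sets, hence an 8 KB critical stride) and 4-byte ints are illustrative assumptions, not a description of any particular CPU:

#include <cstddef>
#include <iostream>
#include <set>

// Illustrative cache parameters (assumptions for this sketch, not measured
// hardware): 64-byte lines and 128 sets give a critical stride of
// numberOfSets * lineSize = 8192 bytes.
constexpr std::size_t lineSize = 64;
constexpr std::size_t numberOfSets = 128;

// The set-index formula from the list above.
std::size_t cacheSet(std::size_t address)
{
    return (address / lineSize) % numberOfSets;
}

// Count how many distinct sets a walk down one column touches,
// for a given row length (in ints) and number of rows visited.
std::size_t distinctSets(std::size_t rowLength, std::size_t rowsTouched)
{
    std::set<std::size_t> sets;
    for (std::size_t i = 0; i < rowsTouched; ++i)
    {
        // Byte offset of mat[i][0] from mat[0][0], assuming 4-byte ints.
        std::size_t address = i * rowLength * sizeof(int);
        sets.insert(cacheSet(address));
    }
    return sets.size();
}

int main()
{
    // 512-int rows: the 2048-byte row stride divides the critical stride,
    // so the whole column crowds into a handful of sets.
    std::cout << "512-int rows: " << distinctSets(512, 128) << " distinct sets\n";

    // 513-int rows: the 2052-byte row stride drifts across sets,
    // so the column spreads out and conflicts become rare.
    std::cout << "513-int rows: " << distinctSets(513, 128) << " distinct sets\n";
}

With these assumed parameters (and 4-byte ints), the 512-int rows crowd into only four distinct sets while the 513-int rows spread over a few dozen, which is exactly the contention pattern described above; the real counts depend on the actual cache geometry.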

Thus, transposition becomes significantly slower for matrices whose dimension is a power of two (such as 512), because the power-of-two row stride funnels the column accesses into the same few cache sets and causes constant conflict evictions.
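
Since it is the row stride in memory, not the amount of work, that matters, a hedged way to check this against the question's own code is to keep the logical 512x512 transpose but pad each row by one int. The PAD macro below is an addition for illustration; how much it helps depends on the actual cache:

#define SAMPLES 1000
#define MATSIZE 512
#define PAD 1   // one extra int per row breaks the power-of-two row stride

#include <time.h>
#include <iostream>

// Same logical 512x512 matrix, but each row occupies 513 ints in memory,
// so the column accesses mat[j][i] no longer map to the same few cache sets.
int mat[MATSIZE][MATSIZE + PAD];

void transpose()
{
   for ( int i = 0 ; i < MATSIZE ; i++ )
   for ( int j = 0 ; j < MATSIZE ; j++ )
   {
       int aux = mat[i][j];
       mat[i][j] = mat[j][i];
       mat[j][i] = aux;
   }
}

int main()
{
   for ( int i = 0 ; i < MATSIZE ; i++ )
   for ( int j = 0 ; j < MATSIZE ; j++ )
       mat[i][j] = i+j;

   clock_t t = clock();
   for ( int i = 0 ; i < SAMPLES ; i++ )
       transpose();
   clock_t elapsed = clock() - t;

   // If the cache-contention explanation is right, this should time close to
   // the 513x513 run even though the logical dimension is still 512.
   std::cout << "Average for a padded matrix of " << MATSIZE << ": " << elapsed / SAMPLES << "\n";
}

This keeps the access pattern and the element count identical to the 512 case and changes only the row stride, isolating the cache effect; if the explanation above holds, the padded run should measure much closer to the 513 benchmark than to the 512 one.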
