Home >Backend Development >C++ >How to Efficiently Transpose a Matrix in C ?
How to Swiftly Transpose a Matrix in C ?
Problem:
Consider a substantial matrix with elements arranged as:
a b c d e f g h i j k l m n o p q r
The goal is to transpose this matrix, resulting in:
a g m b h n c I o d j p e k q f l r
Solution:
To transpose the matrix efficiently, consider the following approaches:
1. Naive Transpose:
void transpose(float *src, float *dst, const int N, const int M) { #pragma omp parallel for for(int n = 0; n<N*M; n++) { int i = n/N; int j = n%N; dst[n] = src[M*j + i]; } }
This straightforward method iterates through each element and copies it to the transposed position. However, it may suffer from cache misses due to unpredictable memory access patterns.
2. Transpose for Matrix Multiplication:
When performing matrix multiplication C = A*B, it can be advantageous to transpose B. This approach eliminates cache misses and significantly speeds up the computation.
transpose(B); for(int i=0; i<N; i++) { for(int j=0; j<K; j++) { float tmp = 0; for(int l=0; l<M; l++) { tmp += A[M*i+l]*B[K*j+l]; } C[K*i + j] = tmp; } } transpose(B);
3. Block Transpose Using Loop Blocking:
For large matrices, loop blocking offers exceptional performance. It divides the matrix into smaller blocks and transposes them independently.
void transpose_block(float *A, float *B, const int n, const int m, const int lda, const int ldb, const int block_size) { #pragma omp parallel for for(int i=0; i<n; i+=block_size) { for(int j=0; j<m; j+=block_size) { transpose_scalar_block(&A[i*lda +j], &B[j*ldb + i], lda, ldb, block_size); } } }
4. Transpose Using SSE Intrinsics:
This advanced technique leverages SSE intrinsics to achieve unparalleled speed. It efficiently transposes 4x4 blocks at a time using a single instruction.
void transpose4x4_SSE(float *A, float *B, const int lda, const int ldb) { __m128 row1 = _mm_load_ps(&A[0*lda]); __m128 row2 = _mm_load_ps(&A[1*lda]); __m128 row3 = _mm_load_ps(&A[2*lda]); __m128 row4 = _mm_load_ps(&A[3*lda]); _MM_TRANSPOSE4_PS(row1, row2, row3, row4); _mm_store_ps(&B[0*ldb], row1); _mm_store_ps(&B[1*ldb], row2); _mm_store_ps(&B[2*ldb], row3); _mm_store_ps(&B[3*ldb], row4); }
5. Loop Blocking with SSE:
Combining loop blocking with SSE intrinsics further enhances performance. This approach processes 4x4 blocks of the matrix efficiently.
void transpose_block_SSE4x4(float *A, float *B, const int n, const int m, const int lda, const int ldb ,const int block_size) { #pragma omp parallel for for(int i=0; i<n; i+=block_size) { for(int j=0; j<m; j+=block_size) { int max_i2 = i+block_size < n ? i + block_size : n; int max_j2 = j+block_size < m ? j + block_size : m; for(int i2=i; i2<max_i2; i2+=4) { for(int j2=j; j2<max_j2; j2+=4) { transpose4x4_SSE(&A[i2*lda +j2], &B[j2*ldb + i2], lda, ldb); } } } } }
The above is the detailed content of How to Efficiently Transpose a Matrix in C ?. For more information, please follow other related articles on the PHP Chinese website!