Why does changing a loop counter from `unsigned` to `uint64_t` significantly impact the performance of `_mm_popcnt_u64` on x86 CPUs, and how do compiler optimization and variable declaration affect it?
Exploring the unusual performance difference between `unsigned` and `uint64_t` loop counters with `_mm_popcnt_u64` on x86 CPUs
Introduction
I was looking for the fastest way to popcount large arrays of data. While doing so, I encountered a very strange behavior: changing the loop variable from `unsigned` to `uint64_t` caused a 50% performance drop on my PC.
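For readers who have not used the intrinsic: popcount simply counts the set bits in a word, and `_mm_popcnt_u64` maps to the hardware `popcnt` instruction on CPUs that support it. Below is a minimal sketch of the idea (the function name is my own, and the `__builtin_popcountll` alternative in the comment is only there for comparison; neither is part of the benchmark that follows):

#include <cstddef>
#include <cstdint>
#include <x86intrin.h>  // _mm_popcnt_u64; requires -mpopcnt or -march=native

// Count the set bits of every 64-bit word in a buffer.
uint64_t popcount_buffer(const uint64_t* buf, std::size_t n) {
    uint64_t total = 0;
    for (std::size_t i = 0; i < n; ++i)
        total += _mm_popcnt_u64(buf[i]);  // hardware popcount of one word
        // equivalently: total += __builtin_popcountll(buf[i]);
    return total;
}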
Benchmark
#include <iostream>
#include <chrono>
#include <cstdint>      // uint64_t
#include <cstdlib>      // atol, rand
#include <x86intrin.h>  // _mm_popcnt_u64

int main(int argc, char* argv[]) {
    using namespace std;
    if (argc != 2) {
        cerr << "usage: array_size in MB" << endl;
        return -1;
    }

    uint64_t size = atol(argv[1]) << 20;
    uint64_t* buffer = new uint64_t[size/8];
    char* charbuffer = reinterpret_cast<char*>(buffer);
    for (unsigned i = 0; i < size; ++i)
        charbuffer[i] = rand() % 256;

    uint64_t count, duration;
    chrono::time_point<chrono::system_clock> startP, endP;
    {
        startP = chrono::system_clock::now();
        count = 0;
        for (unsigned k = 0; k < 10000; k++) {
            // Tight unrolled loop with unsigned
            for (unsigned i = 0; i < size/8; i += 4) {
                count += _mm_popcnt_u64(buffer[i]);
                count += _mm_popcnt_u64(buffer[i+1]);
                count += _mm_popcnt_u64(buffer[i+2]);
                count += _mm_popcnt_u64(buffer[i+3]);
            }
        }
        endP = chrono::system_clock::now();
        duration = chrono::duration_cast<std::chrono::nanoseconds>(endP-startP).count();
        cout << "unsigned\t" << count << '\t' << (duration/1.0E9) << " sec \t"
             << (10000.0*size)/(duration) << " GB/s" << endl;
    }
    {
        startP = chrono::system_clock::now();
        count = 0;
        for (unsigned k = 0; k < 10000; k++) {
            // Tight unrolled loop with uint64_t
            for (uint64_t i = 0; i < size/8; i += 4) {
                count += _mm_popcnt_u64(buffer[i]);
                count += _mm_popcnt_u64(buffer[i+1]);
                count += _mm_popcnt_u64(buffer[i+2]);
                count += _mm_popcnt_u64(buffer[i+3]);
            }
        }
        endP = chrono::system_clock::now();
        duration = chrono::duration_cast<std::chrono::nanoseconds>(endP-startP).count();
        cout << "uint64_t\t" << count << '\t' << (duration/1.0E9) << " sec \t"
             << (10000.0*size)/(duration) << " GB/s" << endl;
    }

    delete[] buffer;  // allocated with new[], so release with delete[] rather than free()
}
As you can see, we create a buffer of random data of size x MB, where x is read from the command line. We then iterate over the buffer and perform the popcount using an unrolled version of the x86 popcount intrinsic. To get more accurate results, we repeat the popcount 10,000 times; the time we measure is therefore pure popcount time. In the first case, the inner loop variable is `unsigned`; in the second case, it is `uint64_t`. I thought this shouldn't make any difference, but the opposite turned out to be the case.
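A side note on the units: the printed GB/s figure is just bytes processed per nanosecond, since the code divides 10000 * size (bytes) by the duration in nanoseconds. For example (purely illustrative numbers, not a measurement), 10,000 passes over a 1 MB buffer (2^20 bytes) taking 0.5 s would print 10,000 × 1,048,576 / 5×10^8 ≈ 21 GB/s, with 1 GB taken as 10^9 bytes.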
The (absolutely crazy) results
I compiled it like this (g++ version: Ubuntu 4.8.2-19ubuntu1):
g++ -O3 -march=native -std=c++11 test.cpp -o test
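The array size in MB is passed on the command line (see the usage message in the code), so a 1 MB run is simply:

./test 1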
Then I ran the test on my Haswell Core i7-4770K CPU @ 3.50 GHz. Result for test 1 (so 1 MB of random data):
As you can see, the uint64_t version has only half the throughput of the unsigned version! The problem seems to be that different assembly is generated, but why? First, I thought it was a compiler bug, so I tried clang++ (Ubuntu clang version 3.4-1ubuntu3):
clang++ -O3 -march=native -std=c++11 test.cpp -o test
Result of test 1:
So we get almost the same result as before; still strange. But now it gets really weird: I replaced the buffer size that was read from the input with a compile-time constant of 1, so I changed:
uint64_t size = atol(argv[1]) << 20;
to:
uint64_t size = 1 << 20;
So the compiler now knows the buffer size at compile time. Maybe it can add some optimizations! Here are the numbers with g++:
Both versions are now equally fast. However, the unsigned version got even slower! It dropped from 26 GB/s to 20 GB/s, so replacing a non-constant value with a constant value led to a de-optimization. Seriously, I have no clue what is going on here! But now to clang++ with the new version:
uint64_t size = atol(argv[1]) << 20;
changed to:
uint64_t size = 1 << 20;
Result:
Wait, what happened? Now both versions have dropped to the slow speed of 15 GB/s. So replacing a non-constant value with a constant value even made the code slower in both cases for clang!
I asked a colleague with an Ivy Bridge CPU to compile my benchmark. He got similar results, so this does not seem to be specific to Haswell. And since two compilers produce strange results here, it does not seem to be a compiler bug either. We don't have an AMD CPU here, so we could only test with Intel.
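As an aside, the "different assembly is generated" observation above can be checked directly by stopping the compiler after code generation and comparing the inner loops of the two variants; for example (the output file name is just an illustration):

g++ -O3 -march=native -std=c++11 -S test.cpp -o test.s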
More craziness, please!
Take the first example (the one with atol(argv[1])) again and put a static in front of the variable, i.e.:
static uint64_t size = atol(argv[1]) << 20;
Here is the result with g++:
Yay, yet another alternative! We still have 32 GB/s with u32, but we managed to get the u64 version at least from 13 GB/s up to 20 GB/s! On my colleague's PC, the u64 version became even faster than the u32 version, yielding the fastest result of all. Unfortunately, this only works with g++; clang++ does not seem to care about static.
My question

Can anyone explain where these wild performance differences come from? Why does merely changing the loop counter from `unsigned` to `uint64_t`, replacing a runtime value with a compile-time constant, or adding `static`, change the performance of the same `_mm_popcnt_u64` loop so drastically?