Why does changing a loop counter from `unsigned` to `uint64_t` significantly impact the performance of `_mm_popcnt_u64` on x86 CPUs, and how does compiler optimization and variable declaration affect this performance difference?

Exploring the unusual performance difference between unsigned and uint64_t loop counters when using _mm_popcnt_u64 on x86 CPUs

Introduction

I was looking for the fastest way to popcount large arrays of data. In doing so, I encountered a very strange effect: changing the loop variable from unsigned to uint64_t made the performance drop by 50% on my PC.
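(For context, a hedged aside that is not part of the original post: population count simply counts the set bits of a word. A portable fallback without any intrinsic could look like the sketch below; it is of course far slower than the hardware POPCNT instruction used in the benchmark.)

#include <cstdint>
#include <cstdio>

// Kernighan's trick: x &= x - 1 clears the lowest set bit, so the loop
// body executes once per set bit.
unsigned popcount64_portable(uint64_t x) {
    unsigned n = 0;
    while (x) {
        x &= x - 1;
        ++n;
    }
    return n;
}

int main() {
    std::printf("%u\n", popcount64_portable(0xF0F0u)); // prints 8
}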

Benchmark

#include <iostream>
#include <chrono>
#include <cstdint>      // uint64_t
#include <cstdlib>      // atol, rand
#include <x86intrin.h>  // _mm_popcnt_u64

int main(int argc, char* argv[]) {

    using namespace std;
    if (argc != 2) {
       cerr << "usage: array_size in MB" << endl;
       return -1;
    }

    uint64_t size = atol(argv[1])<<20;
    uint64_t* buffer = new uint64_t[size/8];
    char* charbuffer = reinterpret_cast<char*>(buffer);
    for (unsigned i=0; i<size; ++i)
        charbuffer[i] = rand()%256;

    uint64_t count,duration;
    chrono::time_point<chrono::system_clock> startP,endP;
    {
        startP = chrono::system_clock::now();
        count = 0;
        for( unsigned k = 0; k < 10000; k++){
            // Tight unrolled loop with unsigned
            for (unsigned i=0; i<size/8; i+=4) {
                count += _mm_popcnt_u64(buffer[i]);
                count += _mm_popcnt_u64(buffer[i+1]);
                count += _mm_popcnt_u64(buffer[i+2]);
                count += _mm_popcnt_u64(buffer[i+3]);
            }
        }
        endP = chrono::system_clock::now();
        duration = chrono::duration_cast<std::chrono::nanoseconds>(endP-startP).count();
        cout << "unsigned\t" << count << '\t' << (duration/1.0E9) << " sec \t"
             << (10000.0*size)/(duration) << " GB/s" << endl;
    }
    {
        startP = chrono::system_clock::now();
        count=0;
        for( unsigned k = 0; k < 10000; k++){
            // Tight unrolled loop with uint64_t
            for (uint64_t i=0;i<size/8;i+=4) {
                count += _mm_popcnt_u64(buffer[i]);
                count += _mm_popcnt_u64(buffer[i+1]);
                count += _mm_popcnt_u64(buffer[i+2]);
                count += _mm_popcnt_u64(buffer[i+3]);
            }
        }
        endP = chrono::system_clock::now();
        duration = chrono::duration_cast<std::chrono::nanoseconds>(endP-startP).count();
        cout << "uint64_t\t"  << count << '\t' << (duration/1.0E9) << " sec \t"
             << (10000.0*size)/(duration) << " GB/s" << endl;
    }

    delete[] buffer;  // allocated with new[], so release with delete[] (not free)
}

As you can see, we create a buffer of x MB of random data, where x is read from the command line. Then we iterate over the buffer and perform the popcount with an unrolled version of the x86 popcount intrinsic. To get a more precise result, we run the popcount 10,000 times and measure the total time. In the first case the inner loop variable is unsigned, in the second case it is uint64_t. I thought this should make no difference, but it does.
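As a sanity check (my addition, not part of the original benchmark), the total can be cross-checked against the GCC/Clang builtin __builtin_popcountll, which under -mpopcnt (implied here by -march=native on this CPU) also compiles down to the POPCNT instruction:

#include <cstdint>
#include <cstdio>

// Hedged sketch: sum the popcounts of a word buffer with the builtin
// instead of the intrinsic; both must yield the same total.
uint64_t popcount_check(const uint64_t* buf, uint64_t words) {
    uint64_t count = 0;
    for (uint64_t i = 0; i < words; ++i)
        count += __builtin_popcountll(buf[i]);
    return count;
}

int main() {
    uint64_t data[4] = {0xFFu, 0x0Fu, 0x3u, 0x1u};
    std::printf("%llu\n", (unsigned long long)popcount_check(data, 4)); // prints 15
}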

The (absolutely crazy) results

I compiled it like this (g++ version: Ubuntu 4.8.2-19ubuntu1):

g++ -O3 -march=native -std=c++11 test.cpp -o test

Then I ran the test on my Haswell Core i7-4770K CPU @ 3.50 GHz. Result for test 1 (i.e., 1 MB of random data):

  • unsigned 41959360000 0.401554 sec 26.113 GB/s
  • uint64_t 41959360000 0.759822 sec 13.8003 GB/s

As you can see, the uint64_t version has only half the throughput of the unsigned version! The problem seems to be that different assembly is generated, but why? First, I suspected a compiler bug, so I tried clang++ (Ubuntu Clang version 3.4-1ubuntu3):

clang++ -O3 -march=native -std=c++11 test.cpp -o test

Result of test 1:

  • unsigned 41959360000 0.398293 sec 26.3267 GB/s
  • uint64_t 41959360000 0.680954 sec 15.3986 GB/s

So it is almost the same result, and still strange. But now it gets really weird. I replaced the buffer size that was read from the input with the constant 1, changing:

uint64_t size = atol(argv[1]) << 20;

to:

uint64_t size = 1 << 20;

So the compiler now knows the buffer size at compile time. Maybe it can add some optimizations! Here are the numbers for g++:

  • unsigned 41959360000 0.509156 sec 20.5944 GB/s
  • uint64_t 41959360000 0.508673 sec 20.6139 GB/s

Both versions are now equally fast. However, it got even slower than the unsigned version before! It dropped from 26 GB/s to 20 GB/s, so replacing a non-constant value with a constant led to a deoptimization. Seriously, I have no clue what is going on here! But now to clang++ with the new version:

uint64_t size = atol(argv[1]) << 20;

changed to:

uint64_t size = 1 << 20;

Result:

  • unsigned 41959360000 0.677009 sec 15.4884 GB/s
  • uint64_t 41959360000 0.676909 sec 15.4906 GB/s

Wait, what happened? Now both versions dropped to a mere 15 GB/s. So replacing a non-constant value with a constant even made the code slower in both cases for clang!
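A hedged aside (my addition, not part of the original post): since the suspicion is that the two inner loops compile to different assembly, one way to investigate is to dump the compiler output and compare the instructions around the popcount calls in the two loops, e.g.:

g++ -O3 -march=native -std=c++11 -S test.cpp -o test.s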

I asked a colleague who uses an Ivy Bridge CPU to compile my benchmark. He got similar results, so this does not seem to be specific to Haswell. And since two compilers produce strange results here, it does not seem to be a compiler bug either. We don't have an AMD CPU here, so we could only test with Intel.
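If you want to reproduce this on other machines (a hedged sketch, my addition), you can first check at run time that the CPU actually supports the POPCNT instruction, e.g. with the GCC/Clang builtin __builtin_cpu_supports:

#include <cstdio>

int main() {
    // __builtin_cpu_supports is a GCC/Clang builtin for x86 feature tests
    if (__builtin_cpu_supports("popcnt"))
        std::printf("POPCNT is supported\n");
    else
        std::printf("POPCNT is NOT supported\n");
}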

More craziness, please!

Take the first example (the one with atol(argv[1])) and put a static in front of the variable; i.e., the only change relative to the first listing is:

static uint64_t size = atol(argv[1]) << 20;

Here are the results with g++:

  • unsigned 41959360000 0.396728 sec 26.4306 GB/s
  • uint64_t 41959360000 0.509484 sec 20.5811 GB/s

Yay, yet another alternative! We still have the fast 26 GB/s with u32, but we managed to get the u64 version at least from 13 GB/s up to 20 GB/s! On my colleague's PC, the u64 version became even faster than the u32 version, yielding the fastest result of all. Unfortunately, this only works with g++; clang++ does not seem to care about static.
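As a closing experiment idea (my addition, not from the original post), one variation worth trying is to hoist the loop bound size/8 into a const local, so the compiler sees an obviously loop-invariant limit regardless of how size itself is declared; whether this changes the generated code is compiler-dependent:

#include <cstdint>
#include <cstdio>
#include <x86intrin.h>

// Hypothetical variation (my sketch): hoist the loop bound into a const
// local. Compile with -mpopcnt or -march=native for _mm_popcnt_u64.
uint64_t popcount_all(const uint64_t* buffer, uint64_t bytes) {
    const uint64_t limit = bytes / 8;   // number of 64-bit words
    uint64_t count = 0;
    for (uint64_t i = 0; i < limit; i += 4) {
        count += _mm_popcnt_u64(buffer[i]);
        count += _mm_popcnt_u64(buffer[i + 1]);
        count += _mm_popcnt_u64(buffer[i + 2]);
        count += _mm_popcnt_u64(buffer[i + 3]);
    }
    return count;
}

int main() {
    uint64_t data[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    std::printf("%llu\n", (unsigned long long)popcount_all(data, sizeof(data)));
}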

My question

Can you explain these results? Especially:

  • What is the difference between the u32 and u64 versions?
  • How can replacing a non-constant buffer size with a constant value trigger less optimal code?
  • How can inserting the static keyword make the u64 loop faster?
