


Exploring the unusual performance differences between u64 loop counters and _mm_popcnt_u64 on x86 CPUs
Introduction
I was looking for the fastest way to popcount large arrays of data when I ran into a very strange effect: changing the loop variable from unsigned to uint64_t caused a 50% performance drop on my PC.
Benchmark
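    #include <chrono>
    #include <cstdint>
    #include <cstdlib>
    #include <iostream>
    #include <x86intrin.h>

    int main(int argc, char* argv[]) {
        using namespace std;
        if (argc != 2) {
            cerr << "usage: array_size in MB" << endl;  // message text assumed
            return -1;
        }

        uint64_t size = atol(argv[1]) << 20;  // buffer size in bytes
        uint64_t* buffer = new uint64_t[size / 8];
        char* charbuffer = reinterpret_cast<char*>(buffer);
        for (unsigned i = 0; i < size; ++i)
            charbuffer[i] = rand() % 256;  // fill the buffer with random bytes

        uint64_t count, duration;
        chrono::time_point<chrono::system_clock> startP, endP;
        {
            startP = chrono::system_clock::now();
            count = 0;
            for (unsigned k = 0; k < 10000; k++) {
                // Tight unrolled loop with an unsigned counter (unroll factor of four assumed)
                for (unsigned i = 0; i < size / 8; i += 4) {
                    count += _mm_popcnt_u64(buffer[i]);
                    count += _mm_popcnt_u64(buffer[i + 1]);
                    count += _mm_popcnt_u64(buffer[i + 2]);
                    count += _mm_popcnt_u64(buffer[i + 3]);
                }
            }
            endP = chrono::system_clock::now();
            duration = chrono::duration_cast<chrono::nanoseconds>(endP - startP).count();
            // bytes per nanosecond equals GB/s
            cout << "unsigned\t" << count << '\t' << (duration / 1.0E9) << " sec \t"
                 << (10000.0 * size) / duration << " GB/s" << endl;
        }
        {
            startP = chrono::system_clock::now();
            count = 0;
            for (unsigned k = 0; k < 10000; k++) {
                // The same loop with a uint64_t counter
                for (uint64_t i = 0; i < size / 8; i += 4) {
                    count += _mm_popcnt_u64(buffer[i]);
                    count += _mm_popcnt_u64(buffer[i + 1]);
                    count += _mm_popcnt_u64(buffer[i + 2]);
                    count += _mm_popcnt_u64(buffer[i + 3]);
                }
            }
            endP = chrono::system_clock::now();
            duration = chrono::duration_cast<chrono::nanoseconds>(endP - startP).count();
            cout << "uint64_t\t" << count << '\t' << (duration / 1.0E9) << " sec \t"
                 << (10000.0 * size) / duration << " GB/s" << endl;
        }

        delete[] buffer;
    }

As you can see, we create a buffer of random data of size x MB, where x is read from the command line. We then iterate over the buffer and perform the popcount using an unrolled version of the x86 popcount intrinsic. To obtain more accurate results, we perform the popcount 10,000 times and measure the time it takes. In the first case the inner loop variable is unsigned; in the second case it is uint64_t. I thought this should make no difference, but it does.

(Absolutely crazy) results

I compiled it like this (g++ version: Ubuntu 4.8.2-19ubuntu1):

g++ -O3 -march=native -std=c++11 test.cpp -o test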
I ran the test on my Haswell Core i7-4770K CPU @ 3.50GHz. Results for test 1 (so 1 MB of random data):
- unsigned 41959360000 0.401554 sec 26.113 GB/s
- uint64_t 41959360000 0.759822 sec 13.8003 GB/s
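The throughput column can be cross-checked against the other figures. The sketch below assumes, as the reported numbers suggest, that x MB means x * 2^20 bytes, that the buffer is scanned 10,000 times, and that 1 GB is 10^9 bytes. The function names are hypothetical and only serve this illustration.

```cpp
#include <cstdint>

// Total bytes scanned: `passes` passes over a buffer of `megabytes` * 2^20 bytes
// (the 2^20 factor is an assumption inferred from the reported figures).
constexpr double bytes_processed(std::uint64_t megabytes, std::uint64_t passes) {
    return static_cast<double>(passes) * static_cast<double>(megabytes << 20);
}

// Throughput in GB/s (1 GB taken as 1e9 bytes) for a run lasting `seconds`.
constexpr double throughput_gb_per_s(std::uint64_t megabytes,
                                     std::uint64_t passes,
                                     double seconds) {
    return bytes_processed(megabytes, passes) / seconds / 1e9;
}
```

Under these assumptions, 1 MB scanned 10,000 times in 0.401554 s works out to about 26.11 GB/s, and 0.759822 s to about 13.80 GB/s, which agrees with the two result lines.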
As you can see, the throughput of the uint64_t version is only half that of the unsigned version! The problem seems to be that different assembly gets generated, but why? At first I suspected a compiler bug, so I tried clang++ (Ubuntu Clang version 3.4-1ubuntu3):
clang++ -O3 -march=native -std=c++11 test.cpp -o test
Results for test 1:
- unsigned 41959360000 0.398293 sec 26.3267 GB/s
- uint64_t 41959360000 0.680954 sec 15.3986 GB/s
So it is almost the same result, and still strange. But now it gets really weird. I replaced the buffer size that was read from the input with the constant 1, so I changed:
uint64_t size = atol(argv[1]) << 20;
to:
uint64_t size = 1 << 20;
So the compiler now knows the buffer size at compile time. Maybe it can add some optimizations! Here are the numbers for g++:
- unsigned 41959360000 0.509156 sec 20.5944 GB/s
- uint64_t 41959360000 0.508673 sec 20.6139 GB/s
Both versions are now equally fast. However, the unsigned version got even slower! It dropped from 26 GB/s to 20 GB/s, so replacing a non-constant value with a constant led to a deoptimization. Seriously, I have no clue what is going on here! But now to clang++ with the new version, again with
uint64_t size = atol(argv[1]) << 20;
changed to:
uint64_t size = 1 << 20;
Results:
- unsigned 41959360000 0.677009 sec 15.4884 GB/s
- uint64_t 41959360000 0.676909 sec 15.4906 GB/s
Wait, what happened? Now both versions have dropped to the slow figure of 15 GB/s. So with clang++, replacing a non-constant value with a constant even made both versions of the code slower!
I asked a colleague who uses an Ivy Bridge CPU to compile my benchmarks. He got similar results, so this doesn't seem to be unique to Haswell. Since two compilers produce strange results here, this doesn't seem to be a compiler bug either. Since we don't have an AMD CPU here, we can only use Intel for testing.
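To repeat the comparison on other machines without maintaining two copies of the loop, the counter type can be made a template parameter. This is only a sketch under stated assumptions: `popcount_pass` and `time_passes` are hypothetical names, the unroll factor of four is assumed, and `__builtin_popcountll` (a GCC/Clang builtin) stands in for `_mm_popcnt_u64` so the code builds without extra flags. Moving the loop into a function template may also change the generated code, so it is not guaranteed to reproduce the numbers above.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <limits>
#include <utility>
#include <vector>

// One pass over `words` 64-bit words, unrolled by four (assumed factor).
// Index is the loop-counter type under test, e.g. unsigned or std::uint64_t.
template <typename Index>
std::uint64_t popcount_pass(const std::uint64_t* buffer, Index words) {
    std::uint64_t count = 0;
    Index i = 0;
    for (; i + 4 <= words; i += 4) {
        count += __builtin_popcountll(buffer[i]);
        count += __builtin_popcountll(buffer[i + 1]);
        count += __builtin_popcountll(buffer[i + 2]);
        count += __builtin_popcountll(buffer[i + 3]);
    }
    for (; i < words; ++i) {  // tail when words is not a multiple of four
        count += __builtin_popcountll(buffer[i]);
    }
    return count;
}

// Runs `passes` passes with the given counter type and returns
// {total number of set bits, elapsed seconds}.
template <typename Index>
std::pair<std::uint64_t, double> time_passes(const std::vector<std::uint64_t>& buffer,
                                             unsigned passes) {
    // The word count must fit into the counter type being tested.
    assert(buffer.size() <= std::numeric_limits<Index>::max());
    const Index words = static_cast<Index>(buffer.size());

    const auto start = std::chrono::steady_clock::now();
    std::uint64_t count = 0;
    for (unsigned k = 0; k < passes; ++k) {
        count += popcount_pass<Index>(buffer.data(), words);
    }
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return {count, elapsed.count()};
}
```

Calling `time_passes<unsigned>(buffer, 10000)` and `time_passes<std::uint64_t>(buffer, 10000)` on the same buffer yields the two measurements to compare. Both calls should return the same bit count, which doubles as a correctness check.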
More craziness, please!
Using the first example (the one with atol(argv[1])), put a static in front of the variable, i.e.:
static uint64_t size = atol(argv[1]) << 20;

The rest of the benchmark is unchanged. Here are my results in g++:
- unsigned 41959360000 0.396728 sec 26.4306 GB/s
- uint64_t 41959360000 0.509484 sec 20.5811 GB/s
Yay, yet another alternative! We still have the fast 26 GB/s with u32, but we managed to get u64 at least from the 13 GB/s version to the 20 GB/s version! On my colleague's computer, the u64 version was even faster than the u32 version, giving the best result of all. Unfortunately this only works with g++; clang++ doesn't seem to care about static.
My question