Why does changing a loop counter from `unsigned` to `uint64_t` significantly impact the performance of `_mm_popcnt_u64` on x86 CPUs, and how does compiler optimization and variable declaration affect this performance difference?

Exploring the unusual performance difference between unsigned and uint64_t loop counters when using _mm_popcnt_u64 on x86 CPUs

Introduction

I was looking for the fastest way to popcount large arrays of data. In doing so, I encountered a very strange effect: changing the loop variable from unsigned to uint64_t made the performance drop by 50% on my PC.
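(For context, a hedged aside that is not part of the original post: population count simply counts the set bits of a word. A portable fallback without any intrinsic could look like the sketch below; it is of course far slower than the hardware POPCNT instruction used in the benchmark.)

#include <cstdint>
#include <cstdio>

// Kernighan's trick: x &= x - 1 clears the lowest set bit, so the loop
// body executes once per set bit.
unsigned popcount64_portable(uint64_t x) {
    unsigned n = 0;
    while (x) {
        x &= x - 1;
        ++n;
    }
    return n;
}

int main() {
    std::printf("%u\n", popcount64_portable(0xF0F0u)); // prints 8
}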

Benchmark

#include <iostream>
#include <chrono>
#include <cstdint>      // uint64_t
#include <cstdlib>      // atol, rand
#include <x86intrin.h>  // _mm_popcnt_u64

int main(int argc, char* argv[]) {

    using namespace std;
    if (argc != 2) {
       cerr << "usage: array_size in MB" << endl;
       return -1;
    }

    uint64_t size = atol(argv[1])<<20;
    uint64_t* buffer = new uint64_t[size/8];
    char* charbuffer = reinterpret_cast<char*>(buffer);
    for (unsigned i=0; i<size; ++i)
        charbuffer[i] = rand()%256;

    uint64_t count,duration;
    chrono::time_point<chrono::system_clock> startP,endP;
    {
        startP = chrono::system_clock::now();
        count = 0;
        for( unsigned k = 0; k < 10000; k++){
            // Tight unrolled loop with unsigned
            for (unsigned i=0; i<size/8; i+=4) {
                count += _mm_popcnt_u64(buffer[i]);
                count += _mm_popcnt_u64(buffer[i+1]);
                count += _mm_popcnt_u64(buffer[i+2]);
                count += _mm_popcnt_u64(buffer[i+3]);
            }
        }
        endP = chrono::system_clock::now();
        duration = chrono::duration_cast<std::chrono::nanoseconds>(endP-startP).count();
        cout << "unsigned\t" << count << '\t' << (duration/1.0E9) << " sec \t"
             << (10000.0*size)/(duration) << " GB/s" << endl;
    }
    {
        startP = chrono::system_clock::now();
        count=0;
        for( unsigned k = 0; k < 10000; k++){
            // Tight unrolled loop with uint64_t
            for (uint64_t i=0;i<size/8;i+=4) {
                count += _mm_popcnt_u64(buffer[i]);
                count += _mm_popcnt_u64(buffer[i+1]);
                count += _mm_popcnt_u64(buffer[i+2]);
                count += _mm_popcnt_u64(buffer[i+3]);
            }
        }
        endP = chrono::system_clock::now();
        duration = chrono::duration_cast<std::chrono::nanoseconds>(endP-startP).count();
        cout << "uint64_t\t"  << count << '\t' << (duration/1.0E9) << " sec \t"
             << (10000.0*size)/(duration) << " GB/s" << endl;
    }

    delete[] buffer;  // allocated with new[], so release with delete[] (not free)
}

As you can see, we create a buffer of x MB of random data, where x is read from the command line. Then we iterate over the buffer and perform the popcount with an unrolled version of the x86 popcount intrinsic. To get a more precise result, we run the popcount 10,000 times and measure the total time. In the first case the inner loop variable is unsigned, in the second case it is uint64_t. I thought this should make no difference, but it does.
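As a sanity check (my addition, not part of the original benchmark), the total can be cross-checked against the GCC/Clang builtin __builtin_popcountll, which under -mpopcnt (implied here by -march=native on this CPU) also compiles down to the POPCNT instruction:

#include <cstdint>
#include <cstdio>

// Hedged sketch: sum the popcounts of a word buffer with the builtin
// instead of the intrinsic; both must yield the same total.
uint64_t popcount_check(const uint64_t* buf, uint64_t words) {
    uint64_t count = 0;
    for (uint64_t i = 0; i < words; ++i)
        count += __builtin_popcountll(buf[i]);
    return count;
}

int main() {
    uint64_t data[4] = {0xFFu, 0x0Fu, 0x3u, 0x1u};
    std::printf("%llu\n", (unsigned long long)popcount_check(data, 4)); // prints 15
}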

The (absolutely crazy) results

I compiled it like this (g++ version: Ubuntu 4.8.2-19ubuntu1):

g++ -O3 -march=native -std=c++11 test.cpp -o test

Then I ran the test on my Haswell Core i7-4770K CPU @ 3.50 GHz. Result for test 1 (i.e., 1 MB of random data):

  • unsigned 41959360000 0.401554 sec 26.113 GB/s
  • uint64_t 41959360000 0.759822 sec 13.8003 GB/s

As you can see, the uint64_t version has only half the throughput of the unsigned version! The problem seems to be that different assembly is generated, but why? First, I suspected a compiler bug, so I tried clang++ (Ubuntu Clang version 3.4-1ubuntu3):

clang++ -O3 -march=native -std=c++11 test.cpp -o test

Result of test 1:

  • unsigned 41959360000 0.398293 sec 26.3267 GB/s
  • uint64_t 41959360000 0.680954 sec 15.3986 GB/s

So it is almost the same result, and still strange. But now it gets really weird. I replaced the buffer size that was read from the input with the constant 1, changing:

uint64_t size = atol(argv[1]) << 20;

to:

uint64_t size = 1 << 20;

So the compiler now knows the buffer size at compile time. Maybe it can add some optimizations! Here are the numbers for g++:

  • unsigned 41959360000 0.509156 sec 20.5944 GB/s
  • uint64_t 41959360000 0.508673 sec 20.6139 GB/s

Both versions are now equally fast. However, it got even slower than the unsigned version before! It dropped from 26 GB/s to 20 GB/s, so replacing a non-constant value with a constant led to a deoptimization. Seriously, I have no clue what is going on here! But now to clang++ with the new version:

uint64_t size = atol(argv[1]) << 20;

changed to:

uint64_t size = 1 << 20;

Result:

  • unsigned 41959360000 0.677009 sec 15.4884 GB/s
  • uint64_t 41959360000 0.676909 sec 15.4906 GB/s

Wait, what happened? Now both versions dropped to a mere 15 GB/s. So replacing a non-constant value with a constant even made the code slower in both cases for clang!
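A hedged aside (my addition, not part of the original post): since the suspicion is that the two inner loops compile to different assembly, one way to investigate is to dump the compiler output and compare the instructions around the popcount calls in the two loops, e.g.:

g++ -O3 -march=native -std=c++11 -S test.cpp -o test.s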

I asked a colleague who uses an Ivy Bridge CPU to compile my benchmark. He got similar results, so this does not seem to be specific to Haswell. And since two compilers produce strange results here, it does not seem to be a compiler bug either. We don't have an AMD CPU here, so we could only test with Intel.
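If you want to reproduce this on other machines (a hedged sketch, my addition), you can first check at run time that the CPU actually supports the POPCNT instruction, e.g. with the GCC/Clang builtin __builtin_cpu_supports:

#include <cstdio>

int main() {
    // __builtin_cpu_supports is a GCC/Clang builtin for x86 feature tests
    if (__builtin_cpu_supports("popcnt"))
        std::printf("POPCNT is supported\n");
    else
        std::printf("POPCNT is NOT supported\n");
}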

More craziness, please!

Take the first example (the one with atol(argv[1])) and put a static in front of the variable; i.e., the only change relative to the first listing is:

static uint64_t size = atol(argv[1]) << 20;

Here are the results with g++:

  • unsigned 41959360000 0.396728 sec 26.4306 GB/s
  • uint64_t 41959360000 0.509484 sec 20.5811 GB/s

Yay, yet another alternative! We still have the fast 26 GB/s with u32, but we managed to get the u64 version at least from 13 GB/s up to 20 GB/s! On my colleague's PC, the u64 version became even faster than the u32 version, yielding the fastest result of all. Unfortunately, this only works with g++; clang++ does not seem to care about static.
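As a closing experiment idea (my addition, not from the original post), one variation worth trying is to hoist the loop bound size/8 into a const local, so the compiler sees an obviously loop-invariant limit regardless of how size itself is declared; whether this changes the generated code is compiler-dependent:

#include <cstdint>
#include <cstdio>
#include <x86intrin.h>

// Hypothetical variation (my sketch): hoist the loop bound into a const
// local. Compile with -mpopcnt or -march=native for _mm_popcnt_u64.
uint64_t popcount_all(const uint64_t* buffer, uint64_t bytes) {
    const uint64_t limit = bytes / 8;   // number of 64-bit words
    uint64_t count = 0;
    for (uint64_t i = 0; i < limit; i += 4) {
        count += _mm_popcnt_u64(buffer[i]);
        count += _mm_popcnt_u64(buffer[i + 1]);
        count += _mm_popcnt_u64(buffer[i + 2]);
        count += _mm_popcnt_u64(buffer[i + 3]);
    }
    return count;
}

int main() {
    uint64_t data[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    std::printf("%llu\n", (unsigned long long)popcount_all(data, sizeof(data)));
}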

My question

Can you explain these results? Especially:

  • What is the difference between the u32 and u64 versions?
  • How can replacing a non-constant buffer size with a constant value trigger less optimal code?
  • How can inserting the static keyword make the u64 loop faster?
