Deoptimizing a program for the pipeline in Intel Sandybridge-family CPUs
Introduction
The task is to reduce the efficiency of a Monte-Carlo simulation program by exploiting the Intel Sandybridge processor architecture. This processor has an out-of-order pipeline with features like register renaming and store buffering, making it challenging to reduce instruction-level parallelism (ILP) and introduce hazards.
Program Analysis
The program is a Monte-Carlo simulation that calculates the price of European vanilla call and put options. The key components of the program are:
- A loop that iterates a specified number of times
- Gaussian random number generation
- Black-Scholes Option Pricing Formula
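For orientation, the three components above might fit together as in the following minimal sketch of an unpessimized baseline. It is not the original program: the function name `monte_carlo_price`, its parameters, the fixed seed, and the use of `std::mt19937_64` with `std::normal_distribution` are all assumptions made for illustration, and the terminal price is simulated under the usual Black-Scholes model assumptions.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <random>

// Hypothetical baseline (all names are illustrative, not from the original program).
// S: spot price, K: strike, r: risk-free rate, v: volatility, T: time to expiry in years.
double monte_carlo_price(std::uint64_t num_sims, double S, double K, double r,
                         double v, double T, bool is_call, std::uint64_t seed = 42) {
    std::mt19937_64 rng(seed);                         // assumed generator choice
    std::normal_distribution<double> gauss(0.0, 1.0);  // Gaussian random numbers

    const double drift = T * (r - 0.5 * v * v);
    const double vol = v * std::sqrt(T);

    double payoff_sum = 0.0;
    for (std::uint64_t i = 0; i < num_sims; ++i) {     // the main simulation loop
        const double S_T = S * std::exp(drift + vol * gauss(rng));
        payoff_sum += is_call ? std::max(S_T - K, 0.0)
                              : std::max(K - S_T, 0.0);
    }
    // Discount the average payoff back to today.
    return std::exp(-r * T) * payoff_sum / static_cast<double>(num_sims);
}
```

Calling `monte_carlo_price(200000, 100.0, 100.0, 0.05, 0.2, 1.0, true)` should give a value close to the closed-form Black-Scholes call price for those inputs. Each iteration is independent of the others, which is what an out-of-order core exploits and what the techniques in this article try to defeat.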
Deoptimization Techniques
The following techniques can be used to reduce program efficiency:
- False dependencies: Introduce unnecessary dependencies between instructions so that otherwise independent work can no longer execute in parallel.
- Memory bottlenecks: Cause cache misses and memory access delays by misaligning data or using non-contiguous memory access patterns.
- Delayed instructions: Use instructions with longer latencies and place them on the critical path so that later work must wait for their results.
- Less efficient operations: Use less efficient mathematical operations, such as division instead of multiplication.
- Branch mispredictions: Introduce unpredictable branches to cause pipeline flushes.
- Store-forwarding stalls: Follow a narrow store with a wider load of the same location, for example by XORing the high byte of a double in memory and then reloading the whole double.
- Instruction cache misses: Break up routines into small chunks to cause instruction cache misses.
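As an illustration of the store-forwarding item in the list above, the following sketch negates a double by flipping its sign bit through a one-byte store and then reloading all eight bytes. The function name `negate_via_byte_xor` is hypothetical, the code assumes a little-endian target such as x86, and `volatile` is used only to keep the compiler from folding the memory accesses away. Whether and how long the load stalls depends on the microarchitecture, so treat this as a sketch rather than a measured result.

```cpp
#include <cstddef>

// Hypothetical helper: negates x the slow way, through memory.
// Assumes little-endian layout, so the sign bit lives in the last byte.
double negate_via_byte_xor(double x) {
    volatile double slot = x;  // 8-byte store
    volatile unsigned char* bytes =
        reinterpret_cast<volatile unsigned char*>(&slot);

    const std::size_t top = sizeof(double) - 1;
    // 1-byte load, flip the sign bit, 1-byte store.
    bytes[top] = static_cast<unsigned char>(bytes[top] ^ 0x80u);

    // 8-byte load that overlaps the 1-byte store just issued; the store
    // buffer cannot satisfy it from a single earlier store.
    return slot;
}
```

A plain `-x` does the same job with a single register operation, which is why the byte-wise version is a plausible pessimization for any sign flip in the pricing loop.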
Specific Suggestions
Based on the above techniques, here are some specific suggestions to pessimize the program:
- Use std::atomic for loop counters and misalign them.
- Induce false sharing among non-atomic variables.
- Multi-thread with a single shared std::atomic loop counter.
- Rewrite expressions with associative/distributive equivalents to increase work.
- Use intrinsic functions carefully to avoid pipeline stalls.
- Use inline assembly to break up the uop cache.
- Use CPUID/RDTSC to time each iteration and induce serialization.
- Traverse arrays in non-contiguous order and use arrays with padding and misaligned elements.
- Use double precision instead of float where it increases latency, for example in division and square root.
- Force conversions from integer to float and back again.
- Disable compiler optimizations with -O0 and use -march=i386 for slower instructions (with GCC this also requires a 32-bit build via -m32).
- Set CPU affinity frequently to different CPUs.
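Two of the suggestions above, an atomic loop counter and forced integer/floating-point conversions, can be combined in a small sketch like the one below. The function name `pessimized_sum` and the choice of `std::uint64_t` are assumptions for illustration; the loop simply sums the integers from 0 to n-1 so that the result is easy to check. It is single-threaded to stay short, and sharing the same counter between several threads would add cache-line contention on top.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical example: sums 0..n-1 with deliberately wasteful steps.
double pessimized_sum(std::uint64_t n) {
    // Atomic counter: every increment is a sequentially consistent
    // read-modify-write instead of a plain register add.
    std::atomic<std::uint64_t> i{0};
    double sum = 0.0;

    while (i.load(std::memory_order_seq_cst) < n) {
        const std::uint64_t k = i.load(std::memory_order_seq_cst);

        // Forced round trip: integer -> double -> integer -> double.
        const double as_double = static_cast<double>(k);
        const std::int64_t back = static_cast<std::int64_t>(as_double);
        sum += static_cast<double>(back);

        i.fetch_add(1, std::memory_order_seq_cst);
    }
    return sum;
}
```

The result is identical to a plain loop over `k`, so the extra work can be presented as harmless while still lengthening the dependency chain of every iteration.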