
Optimizing performance at the x86 pipeline level

WBOY, 2024-01-03 18:32:26
Introduction: How should the CPU choose a path when it reaches a branch? If it picks the wrong one, every instruction fetched along that path has to be discarded, the pipeline flushed, and execution restarted from the correct address. The deeper the pipeline, the greater the cost.
Preface

The key to performance optimization is keeping the CPU well fed. For a programmer chasing the last bit of performance, understanding the internal mechanisms of the CPU is unavoidable. It is a gradual process that takes time, but it does not require going all the way down to digital circuits. Just as an expert CPU designer need not be proficient in software design, you do not need to be a CPU expert to write high-performance software.

CPUs are a precious gift from a small human elite to the general public. Although anyone can buy one off the shelf, they represent the cutting edge of human technology just as much as nuclear weapons, which cannot be bought at all. Even an x86 CPU expert can only speak in detail about his own specialty. We cannot hope to understand everything, but three parts are critical: the pipeline, the cache, and the instruction set. Of the three, the pipeline serves as the thread that ties the others together. So, continuing with the example from the previous article, let's look at the pipeline first.
Basic concepts

The main job of the CPU is to operate on data according to instructions, and that sentence already more or less explains what a pipeline is. I assume that anyone who clicked on this article knows at least something about the concept of a pipeline, so I do not want to open with a textbook-style wall of definitions; that would be putting the cart before the horse. The development of technology is just one form of the movement of contradictions, so this time we will introduce the components of the pipeline by following the historical evolution of the CPU.

Since Intel produced the first 8086 processor in 1978, CPUs have changed so much that those early processors now look like mere microcontrollers. Yet even a microcontroller that sells for a few cents on Taobao has something in common with today's i7. The 8086 has 14 registers that are still in use today: 4 general-purpose registers, 4 segment registers, 4 index and pointer registers, 1 flags register (FLAGS, later extended to EFLAGS) that records CPU status, and finally the instruction pointer register, which holds the address of the next instruction to execute. The instruction pointer is tied directly to how the pipeline operates, and its survival shows how stable the basic principles of the pipeline have been over time.

From then until now, every instruction the CPU executes goes through the same process. The CPU first fetches (Fetch) the instruction from the code segment at the address held in the instruction pointer, then decodes (Decode) it. After decoding comes the actual execution (Execute) stage, followed by the write-back (Write Back) stage, in which the result is written back to memory or a register and the instruction pointer is updated to point to the next instruction. This design matches ordinary human logic exactly.
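The four steps above can be sketched as a loop over a toy machine. This is only an illustration: the three-field text instruction format and the `mov`/`add`/`xor` opcodes are assumptions made up for the sketch, not real x86 encoding.

```python
def run(program, regs):
    """Execute a list of 'op dst src' strings on a dict of registers."""
    ip = 0                                   # instruction pointer
    while ip < len(program):
        raw = program[ip]                    # Fetch: read the instruction at ip
        op, dst, src = raw.split()           # Decode: opcode and operands
        if op == "mov":                      # Execute: compute the result
            result = regs[src]
        elif op == "add":
            result = regs[dst] + regs[src]
        elif op == "xor":
            result = regs[dst] ^ regs[src]
        else:
            raise ValueError(f"unknown opcode: {op}")
        regs[dst] = result                   # Write back: store the result
        ip += 1                              # ...and point at the next instruction
    return regs


# Three XORs in a row swap two registers without a temporary.
print(run(["xor a b", "xor b a", "xor a b"], {"a": 5, "b": 9}))
# prints {'a': 9, 'b': 5}
```

Each pass through the loop finishes one instruction completely before the next one is even fetched, which is exactly the one-at-a-time behavior that pipelining later improves on.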

At first, and most naturally, the CPU processed instructions strictly one after another: each instruction went through the whole process above before the next one started. The principal contradiction at the time was between software's growing demand for performance and the CPU's backward processing speed. Under the correct guidance of Moore's Law, CPU construction achieved historic results and the principal contradiction shifted: CPU execution speed gradually outpaced memory read and write speed. Fetching every instruction from memory became increasingly intolerable, so in 1982 an instruction cache was added to the processor.

As CPUs kept getting faster, a data cache was also added to the processor as a compromise between the two conflicting sides. But neither cache was a permanent solution, because the principal aspect of the contradiction was that the CPU itself was not running at saturation. So in 1989 the i486 introduced a five-stage pipeline. The idea is to absorb the CPU's excess capacity by stimulating domestic demand: instead of working on one instruction at a time, the CPU can have up to five instructions in flight at once, each in a different stage.

[Figure: the five-stage i486 pipeline (Fetch, D1, D2, Execute, Write Back)]

I don't know about you, but I have always found this picture hard to understand. Here is a simpler way to think about it: imagine each instruction as a product to be processed, moving along an assembly line with five workstations. Once the line is full, every workstation in the CPU always has work to do, which raises instruction throughput and therefore program performance.
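The assembly-line analogy can be turned into a rough cycle count. The sketch below assumes an idealized pipeline in which every stage takes exactly one machine cycle and nothing ever stalls; the function names are made up for the illustration.

```python
def sequential_cycles(n_instructions, n_stages):
    """Each instruction passes through every stage before the next one starts."""
    return n_instructions * n_stages


def pipelined_cycles(n_instructions, n_stages):
    """The first instruction needs n_stages cycles to fill the pipeline;
    after that, one instruction completes every cycle."""
    if n_instructions == 0:
        return 0
    return n_stages + n_instructions - 1


print(sequential_cycles(1000, 5), pipelined_cycles(1000, 5))
# prints 5000 1004
```

Under these assumptions the speedup approaches the number of stages as the instruction count grows. Real code falls short of this ideal because of stalls.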
Problems introduced by the pipeline

Suppose each line of the example code is simply abstracted into a single XOR instruction. Following the i486 pipeline diagram above, the first instruction enters the Fetch stage; when it moves on to D1, the second instruction enters Fetch. In the next machine cycle the first instruction moves to D2, the second to D1, and the third is fetched. So far everything is normal. In the following machine cycle, however, when the first instruction enters the Execute stage, the second instruction cannot follow it toward execution, because the final value of variable a that it needs is only available once the first instruction has finished executing. The second instruction therefore stalls in the pipeline until the first one completes, and the third instruction suffers the same fate while the second executes. When the pipeline stalls like this, the instructions are effectively back to executing one at a time, and the idle slots left in the pipeline are called pipeline "bubbles".

Clock cycle: also called the oscillation cycle; the reciprocal of the clock frequency and the smallest unit of time in the CPU.
Machine cycle: each stage of the pipeline performs one basic operation, and the time needed to complete a basic operation is a machine cycle.
Instruction cycle: the time needed to execute one instruction, generally made up of several machine cycles.
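The stall described above can be reproduced with a small scheduling sketch. It assumes a simplified model, not the real i486: every stage takes one machine cycle, and a result becomes usable only after its producer has left Write Back (no forwarding). The `(dst, srcs)` instruction format and the XOR-swap example are assumptions made for the illustration.

```python
STAGES = ["Fetch", "D1", "D2", "Execute", "WriteBack"]
EX, WB = 3, 4


def schedule(instructions):
    """instructions: list of (dst, srcs). Returns, per instruction, the
    cycle (starting at 1) in which it enters each of the five stages."""
    enter = []
    last_writer = {}                      # register -> index of its latest writer
    for i, (dst, srcs) in enumerate(instructions):
        producers = [last_writer[r] for r in srcs if r in last_writer]
        row = []
        for s in range(len(STAGES)):
            cycle = row[s - 1] + 1 if s > 0 else 1
            if i > 0:
                prev = enter[i - 1]
                # A stage frees up only when the previous instruction moves on.
                freed = prev[s + 1] if s < WB else prev[WB] + 1
                cycle = max(cycle, freed)
            if s == EX:
                # Wait until every producer has completed Write Back.
                for p in producers:
                    cycle = max(cycle, enter[p][WB] + 1)
            row.append(cycle)
        enter.append(row)
        last_writer[dst] = i
    return enter


xor_swap = [("a", ("a", "b")), ("b", ("b", "a")), ("a", ("a", "b"))]
for row in schedule(xor_swap):
    print(row)
# prints [1, 2, 3, 4, 5]
#        [2, 3, 4, 6, 7]
#        [3, 4, 6, 8, 9]
```

Three independent instructions would finish in cycle 7 under this model. The dependent XORs finish in cycle 9, and the two lost cycles are the bubbles.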

Data dependencies are not the only common source of bubbles. Instructions differ in how long they take to execute (their instruction cycle), so when a simple instruction follows a complex, long-running one, the simple instruction has to wait. And what happens when the program contains a branch such as an if? All of these situations keep the pipeline from running at full capacity, and performance drops accordingly.

Faced with a problem, people tend to introduce a more complex mechanism to solve it, and the multi-stage pipeline is one example. Complexity can reflect technical progress, but complexity is itself a new problem. Perhaps this is why contradictions never disappear and technology never stops advancing. Yet as the saying goes, "in the pursuit of learning, something is added every day; in the pursuit of the Tao, something is dropped every day." Ever more complex mechanisms will eventually give way to a major breakthrough, but perhaps that time has not come yet. To deal with bubbles, the processor adopted an even more complex solution: with the Pentium Pro in 1995, Intel added an out-of-order core (OOO core).

Out-of-order execution core (OOO core)

The idea behind out-of-order execution is actually simple: when the next instruction is blocked, find another instruction further down that can run instead. Getting this right, however, is quite complicated. The final result of the program must be identical to that of sequential execution, and every data dependency must be identified. To get the most out of it, instructions must not only execute in parallel but also be broken into finer-grained pieces, so that small units of work can slip into whatever gaps are available. This is where the concept of micro-operations (micro-ops, μ-ops) comes in: in the Decode stage of the pipeline, assembly instructions are broken down further, and the end product is a series of micro-operations.

[Figure: the μ-op processing flow after the introduction of the out-of-order core. Modules are colored to match the corresponding pipeline stages in the first figure.]

The Fetch stage changes little. The Decode stage can decode four instructions in parallel, and its end product is the μ-ops mentioned above. The Register Alias Table and Reorder Buffer that follow can be regarded as the preprocessing stage of the out-of-order core.

Micro-operations that execute in parallel or out of order are very likely to read and write the same register at the same time. Inside the processor, therefore, the original registers are "aliased" to internal registers that software engineers cannot see. Operations that originally targeted the same register are carried out on different temporary registers, so their reads and writes do not interfere with each other (note: this only works when the two operations have no data dependency between them). The operands of the affected micro-operations are rewritten to use the temporary alias registers. This is a space-for-time trade: the micro-operations are, in effect, translated to run on the alias registers.
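The aliasing step can be sketched as a small renaming pass. This is a simplified model rather than Intel's actual Register Alias Table: it assumes an unlimited supply of internal registers named `p0`, `p1`, ... and a made-up `(dst, srcs)` micro-op format.

```python
def rename(micro_ops):
    """micro_ops: list of (dst, srcs) using architectural register names.
    Returns the same ops rewritten onto fresh internal registers."""
    alias = {}                    # alias table: architectural -> internal register
    renamed = []
    for count, (dst, srcs) in enumerate(micro_ops):
        # Sources read whichever internal register currently holds the value;
        # registers never written in this sequence keep their original name.
        new_srcs = tuple(alias.get(r, r) for r in srcs)
        # Every write is given a brand-new internal register.
        phys = f"p{count}"
        alias[dst] = phys
        renamed.append((phys, new_srcs))
    return renamed


ops = [
    ("eax", ("ebx",)),   # op 0 writes eax
    ("ecx", ("eax",)),   # op 1 reads eax: a true dependency on op 0
    ("eax", ("edx",)),   # op 2 reuses eax for an unrelated value
]
print(rename(ops))
# prints [('p0', ('ebx',)), ('p1', ('p0',)), ('p2', ('edx',))]
```

After renaming, op 2 writes `p2` instead of the register op 1 is still reading, so it is free to run early. The true dependency of op 1 on op 0 is preserved through `p0`.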

The micro-operations then enter the Reorder Buffer. At this point they are ready, and they are placed in the Reservation Station (RS) to be executed in parallel. The diagram shows quite a few execution units (Port X), each of which handles a specific kind of task, such as loads (Load), stores (Store), or computation (ALU, SSE). Each micro-operation can execute as soon as the data it needs is ready. Long-running instructions and instructions with data dependencies are no faster in themselves, but the stalls they cause are offset by executing the instructions behind them in parallel and out of order (that is, early). Breaking the work up this way improves overall throughput.
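A rough timing model shows why starting later instructions early pays off. The sketch below is built on assumptions: one micro-op is decoded per cycle, there are unlimited execution ports, and the latencies are invented. It compares an in-order machine, where no micro-op may start before its predecessor, with an out-of-order one, where a micro-op starts as soon as its inputs are ready.

```python
def finish_time(ops, in_order):
    """ops: list of (latency, [indices of producer ops]).
    Returns the cycle in which the last micro-op finishes."""
    start, finish = [], []
    for i, (latency, deps) in enumerate(ops):
        # Cycle in which all input operands are available.
        ready = max((finish[d] for d in deps), default=0)
        if in_order and i > 0:
            earliest = start[i - 1] + 1   # may not overtake the previous op
        else:
            earliest = i                  # limited only by decode order
        start.append(max(earliest, ready))
        finish.append(start[i] + latency)
    return max(finish, default=0)


ops = [
    (10, []),    # op 0: a slow load
    (1, [0]),    # op 1: needs the loaded value
    (1, []),     # ops 2 to 5: independent work
    (1, []),
    (1, []),
    (1, []),
]
print(finish_time(ops, in_order=True), finish_time(ops, in_order=False))
# prints 15 11
```

The load and its dependent op take just as long in both cases. The saving comes entirely from running the four independent micro-ops while the load is still in flight.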

The magic of the out-of-order core is that it squeezes the most out of this mechanism while, seen from outside, instructions still appear to execute in order. The details are beyond the scope of this article. The out-of-order core is in fact so successful that even under heavy load it is still idle much of the time and far from saturated. So a second front end (Fetch and Decode) was added to feed μ-ops to the same core. To the system it looks like two processing cores. This is Hyper-Threading, and it is why a CPU with N physical cores shows up as 2N logical cores.

Out-of-order execution does not reproduce the effect of sequential execution in every respect, in particular the order in which memory accesses become visible to other cores. Sometimes programmers do need to insert memory barriers to guarantee the order of execution.

But every complex solution brings new problems, and this time the contradiction moved to the Fetch stage. How should the CPU choose a path when it reaches a branch? If it picks the wrong one, every instruction fetched along that path has to be discarded, the pipeline flushed, and execution restarted from the correct address. The deeper the pipeline, the greater the cost. Later articles will introduce some optimization techniques at the programming level.
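A back-of-the-envelope model makes the point about pipeline depth concrete. It assumes an otherwise ideal pipeline that retires one instruction per cycle and charges a fixed number of lost cycles per wrong guess. The function name and all the numbers below are made up for illustration.

```python
def total_cycles(n_instructions, n_mispredicts, flush_penalty):
    """Ideal one-instruction-per-cycle pipeline plus a fixed number of
    lost cycles for every wrongly guessed branch."""
    return n_instructions + n_mispredicts * flush_penalty


# Same program, same number of wrong guesses, two pipeline depths.
shallow = total_cycles(1_000_000, 10_000, flush_penalty=3)
deep = total_cycles(1_000_000, 10_000, flush_penalty=15)
print(shallow, deep)
# prints 1030000 1150000
```

In this model the penalty per wrong guess stands in for pipeline depth: the more stages that sit in front of the point where the branch is resolved, the more work is thrown away each time.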
About the author

Zhang Pan is a network engineer at Yunshan Networks who focuses on the development and performance optimization of x86 network software. He is deeply involved in organizations and communities such as ONF, OPNFV, and ONOS, and once served as vice chairman of the ONF testing working group.


Statement: This article is reproduced from linuxprobe.com. If there is any infringement, please contact admin@php.cn for removal.