Python 字节码优化问题

Question

问题背景： Python在执行的时候会加载每一个模块的PyCodeObject,其中这个对象就包含有opcode,也就是这个模块所有的指令集合，具体定义在源码目录的 /include/opcode.h 中定义了所有的指令集合，在执行的时候通过...

PHP中文网 · Answer

Modify the plan

I did a test. Based on python 2.7.3, I changed all the cases in the PyEval_EvalFrameEx function to labels, and then used the labels as values feature of gcc to form an array with the 118 opcodes used in it and the corresponding labels. :

static void *label_hash[256] = {NULL};
static int initialized = 0; 
if (!initialized)
{    
    #include "opcodes.c"
    #include "labels.c"
    int i, n_opcode = sizeof(opcode_list) / sizeof(*opcode_list);
    for (i = 0; i < n_opcode; i++) 
        label_hash[opcode_list[i]] = label_list[i];
    initialized = 1; 
}

Then change switch (opcodes) to

void *label = label_hash[opcode];
if (label == NULL)
    goto default_opcode;
goto *label;

And replace the break in each case one by one.

After compilation, it passed all tests (except test_gdb, which is the same as the unmodified version, without sys.pydebug), which means that the modification is correct.

Performance Test

The next question is performance...how to test this...I don’t have any good ideas, so I just found two pieces of code:

Direct loop 5kw times:

i = 50000000
while i > 0:
    i -= 1

Run 4 times before modification: [4.858, 4.851, 4.877, 4.850], remove the largest one, average 4.853s

After modification, run 4 times: [4.558, 4.546, 4.550, 4.560], remove the largest one, average 4.551s

Performance improvement (100% - (4.551 / 4.853)) = 6.22%

Recursive Fibonacci, calculate the 38th

def fib(n):
    return n if n <= 2 else fib(n - 2) + fib(n - 1)
print fib(37)

Before modification [6.227, 6.232, 6.311, 6.241], remove the largest one, average 6.233s

After modification [5.962, 5.945, 6.005, 6.037], removing the largest one, the average is 5.971s

Performance improvement (100% - (5.971 / 6.233)) = 4.20%

Conclusion

Taken together, this small change can indeed improve performance by about 5%. I don’t know how meaningful it is to you...

However, because it uses non-C standard features that are only supported by GCC, it is not convenient to transplant. According to this StackOverflow post, MSVC can support it to a certain extent, but it seems very tricky. ~~I wonder if anyone has made such an attempt during the development of Python. Maybe the official will prefer portability? Maybe I can make a post and ask if I have time.~~ According to @方泽图 (see the comments in his answer), Python 3.0+ introduced this optimization. See his answer for details.

PHP中文网 · Answer

The relatively sparse switch (meaning that the case values are relatively different) does need to be compared again and again to determine which branch should be entered. However, the switch in CPython's ceval.c is very dense, and the value difference between cases is 1. Therefore, popular compilers (gcc/msvc/llvm-clang) can convert this switch into a simple indirect branch. , take x86-64, linux, gas syntax as an example:

cmp $MAX_OP, %opcode
ja .handle_max_op
jmp *.op_dispatch_table(,%opcode,8)

If

is translated into C, it roughly means:

static void *op_dispatch_table[] = {
    &&handle_NOP,
    &&handle_LOAD_FAST,
    // etc etc...
};

if (opcode > MAX_OP) {
    goto *handle_max_op;
}
else {
    goto *op_dispatch_table[opcode];
}

So it’s not actually a case-by-case comparison like you said.

The optimization of Interpreter is very interesting. switch seems efficient, but in fact since the generated code will make all indirect branches in the same place (the branch target can be anywhere), the processor's branch prediction becomes useless.

Branch prediction is very important in a stack-based interpreter like CPython. This is because most OPCODEs are short, and opcode dispatch (that is, branch prediction) often takes up the majority of the time. In the commonly used Sandy Bridge processor, branch prediction failure is equivalent to 15 cycles (source), and IPC (CPU instructions that can be executed per cycle) can generally reach 3 when branch prediction is successful. In contrast, a small OPCODE like LOAD_FAST only uses less than 10 CPU instructions. In other words, the time spent on branch prediction can even account for 80% of the entire code running time.

Therefore, CPython uses two other optimization techniques, one is prediction of commonly used consecutive instructions, and the other is the use of indirect threading if the compiler supports it.

The prediction of continuous instructions refers to the fact that in Python, many instructions often appear in pairs (for example, if x < y then xxx else xxx will appear in COMPARE_OP -> POP_JUMP_IF_FALSE). In this case, the previous OPCODE does not need to rely on switch to execute the next OPCODE. It can jump to the next OPCODE by itself. All it needs to do is to check whether the latter OPCODE is what it wants. Two branches need to be added here, but one of the two branches is a conditional judgment and the other is a direct jump, so the processor's branch prediction can play a role here. In ceval.c, if you see the words PREDICT(...), it means there is prediction of continuous instructions.

indirect threading refers to placing the indirect branch at the end of each OPCODE processing. In this way, each OPCODE will obtain the processor's branch prediction for its next instruction. Although this is still indirect branch, since each OPCODE is predicted separately, this greatly improves the accuracy of the prediction. CPython2.7 does not use this optimization. CPython3+ will switch this option on and off depending on whether the compiler supports it.

CPython’s interpreter has just been optimized after years of polishing.