Background: When running a program, Python loads a PyCodeObject for each module. Among other things, that object holds the bytecode (opcodes), i.e. the complete instruction sequence for the module; the full opcode set is defined in Include/opcode.h in the source tree. At run time the interpreter walks this opcode stream to execute instruction after instruction; the bytecode itself is just a string built from those instructions. The execution core lives in Python/ceval.c: PyEval_EvalFrameEx() is the virtual machine's main loop, which loads the instruction stream and performs the computation.
Problem description: Inside PyEval_EvalFrameEx(), instruction decoding is likewise done with a standard state-machine model: one gigantic switch structure, something like this:
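(The snippet was omitted in the original post; as a stand-in, here is a self-contained toy with the same shape. The opcode set and handlers are made up for illustration; the real ones live in Include/opcode.h and ceval.c.)

#include <stdio.h>

/* Made-up miniature opcode set. */
enum { OP_PUSH1, OP_ADD, OP_PRINT, OP_HALT };

/* A minimal switch-based dispatch loop in the spirit of PyEval_EvalFrameEx(). */
static void run(const unsigned char *code)
{
    int stack[16], sp = 0;
    for (;;) {
        switch (*code++) {          /* fetch and decode the next opcode */
        case OP_PUSH1: stack[sp++] = 1;                 break;
        case OP_ADD:   sp--; stack[sp-1] += stack[sp];  break;
        case OP_PRINT: printf("%d\n", stack[sp-1]);     break;
        case OP_HALT:  return;
        default:       return;      /* unknown opcode */
        }
    }
}

int main(void)
{
    const unsigned char prog[] = { OP_PUSH1, OP_PUSH1, OP_ADD, OP_PRINT, OP_HALT };
    run(prog);                      /* prints 2 */
    return 0;
}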
In C, a switch statement (compiled naively) is executed by comparing the cases one by one, which would mean every instruction has to be compared from the top each time it is dispatched. The opcodes here are not evenly distributed, but suppose each instruction needs on average n comparisons to match, with n > 1; in practice n would be far greater than 1.
The actual question: Can this be optimized, and why didn't the authors do it? Since the opcodes are numbered anyway, instead of the switch state machine, could dispatch be done directly through something like a hashtable?
P.S.: If this question is naive, please feel free to point it out.
PHP中文网 2017-04-17 11:26:05
I did a test. Based on Python 2.7.3, I changed all the cases in the PyEval_EvalFrameEx function into labels, and then used GCC's labels-as-values feature to build an array pairing the 118 opcodes actually used with their corresponding labels:
static void *label_hash[256] = {NULL};  /* opcode number -> handler label address */
static int initialized = 0;
if (!initialized)
{
    /* opcodes.c defines opcode_list[], labels.c defines label_list[]
       (generated files pairing each opcode number with its &&label). */
#include "opcodes.c"
#include "labels.c"
    int i, n_opcode = sizeof(opcode_list) / sizeof(*opcode_list);
    for (i = 0; i < n_opcode; i++)
        label_hash[opcode_list[i]] = label_list[i];
    initialized = 1;
}
Then I replaced switch (opcode) with:
void *label = label_hash[opcode];
if (label == NULL)       /* opcode not in the table */
    goto default_opcode; /* fall back to the old default: handling */
goto *label;             /* computed goto straight to the handler */
And I replaced the break at the end of each case with the same dispatch sequence, one by one.
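(To make the shape of the change concrete: below is the toy interpreter from above, rewritten with GCC's labels-as-values in the spirit of this patch. The names and table layout are made up for illustration and are far simpler than the real ceval.c.)

#include <stdio.h>

enum { OP_PUSH1, OP_ADD, OP_PRINT, OP_HALT, N_OPS };

/* Computed-goto version of the toy dispatch loop (GCC extension). */
static void run(const unsigned char *code)
{
    /* opcode -> handler address, via GCC's "labels as values" (&&label). */
    static void *label_hash[N_OPS] = {
        &&handle_PUSH1, &&handle_ADD, &&handle_PRINT, &&handle_HALT
    };
    int stack[16], sp = 0;

#define DISPATCH() goto *label_hash[*code++]
    DISPATCH();
handle_PUSH1: stack[sp++] = 1;                 DISPATCH();
handle_ADD:   sp--; stack[sp-1] += stack[sp];  DISPATCH();
handle_PRINT: printf("%d\n", stack[sp-1]);     DISPATCH();
handle_HALT:  return;
}

int main(void)
{
    const unsigned char prog[] = { OP_PUSH1, OP_PUSH1, OP_ADD, OP_PRINT, OP_HALT };
    run(prog);                      /* prints 2 */
    return 0;
}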
After compiling, it passed all tests (except test_gdb, which fails the same way on the unmodified version, since sys.pydebug is absent), which suggests the modification is correct.
The next question is performance... How to test it? I didn't have any good ideas, so I just grabbed two pieces of code:
i = 50000000
while i > 0:
    i -= 1
Before the modification, 4 runs: [4.858, 4.851, 4.877, 4.850]; dropping the largest, the average is 4.853s.
After the modification, 4 runs: [4.558, 4.546, 4.550, 4.560]; dropping the largest, the average is 4.551s.
Performance improvement: 100% - (4.551 / 4.853) = 6.22%.
def fib(n):
    return n if n <= 2 else fib(n - 2) + fib(n - 1)
print fib(37)
Before the modification, 4 runs: [6.227, 6.232, 6.311, 6.241]; dropping the largest, the average is 6.233s.
After the modification, 4 runs: [5.962, 5.945, 6.005, 6.037]; dropping the largest, the average is 5.971s.
Performance improvement: 100% - (5.971 / 6.233) = 4.20%.
Taken together, this small change can indeed bring a performance improvement of about 5%. I don't know how meaningful that is to you...
However, since it relies on a non-standard C feature that only GCC supports, it is not easy to port. According to this StackOverflow post, MSVC can support it to a certain extent, but it looks very tricky. I wonder whether anyone attempted this during Python's development; maybe upstream prefers portability? Perhaps I'll make a post and ask when I have time. According to @方泽图 (see the comments under his answer), Python 3.0+ introduced this optimization; see his answer for details.
PHP中文网 2017-04-17 11:26:05
A relatively sparse switch (one whose case values are far apart) does indeed need comparison after comparison to decide which branch to enter. But the switch in CPython's ceval.c is very dense: adjacent case values differ by exactly 1. Popular compilers (gcc/msvc/llvm-clang) can therefore turn this switch into a single indirect branch. Taking x86-64, Linux, gas syntax as an example:
cmp $MAX_OP, %opcode
ja .handle_max_op
jmp *.op_dispatch_table(,%opcode,8)
Translated into C, this roughly means:
static void *op_dispatch_table[] = {
    &&handle_NOP,        /* GCC "labels as values": && takes a label's address */
    &&handle_LOAD_FAST,
    // etc etc...
};
if (opcode > MAX_OP) {
    goto handle_max_op;
}
else {
    goto *op_dispatch_table[opcode];  /* one indirect branch, no comparisons */
}
So it is not actually a case-by-case comparison, as you assumed.
Interpreter optimization is an interesting topic. The switch looks efficient, but since the generated code puts all the indirect branches in one place (while the branch target can be anywhere), the processor's branch prediction becomes useless.
Branch prediction is very important in a stack-based interpreter like CPython, because most opcodes are short and opcode dispatch (that is, the branch) often takes up the majority of the time. On the common Sandy Bridge processors, a branch misprediction costs about 15 cycles (source), while under successful prediction the IPC (CPU instructions executed per cycle) can generally reach 3. A small opcode like LOAD_FAST, by contrast, takes fewer than 10 CPU instructions. In other words, the time spent on mispredicted dispatch can account for up to 80% of the entire running time.
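(A back-of-the-envelope check of that figure: 10 instructions at an IPC of 3 is about 10 / 3 ≈ 3.3 cycles of useful work per opcode, while a mispredicted dispatch adds ~15 cycles of penalty, so the penalty fraction is 15 / (15 + 3.3) ≈ 82%.)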
Therefore, CPython uses two other optimization techniques: one is prediction of commonly paired consecutive instructions, and the other is indirect threading, when the compiler supports it.
Prediction of consecutive instructions refers to the fact that in Python many instructions tend to appear in pairs: for example, if x < y: ... else: ... compiles to COMPARE_OP followed by POP_JUMP_IF_FALSE. In such cases the first opcode does not need to go back through the switch to execute the next one; it can jump there by itself, and all it has to do is check that the following opcode is the one it expects. This adds two branches, but one is a conditional test and the other a direct jump, so the processor's branch prediction can do its job here. Wherever you see PREDICT(...) in ceval.c, prediction of consecutive instructions is at work.
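(Roughly how CPython 2.7's ceval.c spells this, simplified here; the real code also has a PREDICTED_WITH_ARG variant that additionally fetches the opcode's argument:)

#define PREDICT(op)   if (*next_instr == op) goto PRED_##op
#define PREDICTED(op) PRED_##op: next_instr++

/* The COMPARE_OP handler ends with PREDICT(POP_JUMP_IF_FALSE); the
   POP_JUMP_IF_FALSE handler begins with the matching PREDICTED label. */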
Indirect threading refers to placing the indirect branch at the end of each opcode handler. That way, each opcode gets the processor's branch prediction for its own next instruction. The branch is still indirect, but since each opcode is predicted separately, prediction accuracy improves greatly. CPython 2.7 does not use this optimization; CPython 3+ switches it on or off depending on whether the compiler supports it.
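(A simplified sketch of how CPython 3.x's ceval.c arranges this behind a configure check; the real DISPATCH() also tests the eval breaker, which is omitted here:)

#if USE_COMPUTED_GOTOS
#define TARGET(op)  TARGET_##op: case op:
#define DISPATCH()  goto *opcode_targets[*next_instr++]
#else
#define TARGET(op)  case op:
#define DISPATCH()  continue
#endif
/* Every handler is introduced with TARGET(SOME_OP) and ends in DISPATCH(),
   so each handler carries its own separately-predicted indirect branch. */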
After years of polishing, CPython's interpreter is already quite thoroughly optimized.