What is the pyc file structure of the python virtual machine?

王林
2023-05-27 21:01:58

    PYC file

    A pyc file is a bytecode file that Python generates when it compiles source code. It contains the compiled result of the source code together with related metadata, so that Python can load and execute the code faster.

    Unlike compiled languages, Python is an interpreted language: it does not compile source code directly into machine code. Before running the code, the Python interpreter first compiles the source code into bytecode and then interprets that bytecode. A .pyc file is the bytecode saved from this process.

    When a module is imported for the first time, the Python interpreter writes a corresponding .pyc file into a __pycache__ directory next to the source file, so that the module loads faster next time. (A script run directly, such as python hello.py, is compiled in memory and is not cached.) When the source file is modified and loaded again, the interpreter regenerates the .pyc file to update the cached bytecode.
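    As a small illustration of where the cached file ends up, the sketch below asks the standard library for the pyc path that belongs to a source file. The file name hello.py is only an example, and the exact tag in the result depends on the interpreter that runs it.

```python
import importlib.util
import sys

# cache_from_source() maps a source path to the pyc path CPython would use.
# "hello.py" is just an example name; the file does not need to exist.
pyc_path = importlib.util.cache_from_source("hello.py")
print(pyc_path)  # e.g. __pycache__/hello.cpython-310.pyc on CPython 3.10

# The tag in the middle of the file name identifies implementation and version.
cache_tag = sys.implementation.cache_tag
print(cache_tag)
```

    The version tag in the name is what lets several Python versions keep their pyc files side by side in the same __pycache__ directory.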

    Generate PYC file

    An ordinary Python file is first turned into bytecode by the compiler; the bytecode is then handed to the Python virtual machine, which executes it. The overall process is: source file (.py) -> compiler -> bytecode (.pyc) -> Python virtual machine.
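    The same pipeline can be reproduced by hand with built-in functions. This is a minimal sketch: the source string and the "<demo>" file name are made up for illustration, and the instructions printed by dis differ between Python versions.

```python
import dis

source = "x = 1 + 2\nprint(x)\n"

# Step 1: the compiler turns source text into a code object holding bytecode.
code = compile(source, "<demo>", "exec")

# Step 2: look at the bytecode (the exact instructions vary by Python version).
dis.dis(code)

# Step 3: the virtual machine executes the bytecode.
namespace = {}
exec(code, namespace)  # prints 3
```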

    We can use the compileall module directly to generate the pyc files for the source files:

    ➜  pvm ls
    demo.py  hello.py
    ➜  pvm python -m compileall .
    Listing '.'...
    Listing './.idea'...
    Listing './.idea/inspectionProfiles'...
    Compiling './demo.py'...
    Compiling './hello.py'...
    ➜  pvm ls
    __pycache__ demo.py     hello.py
    ➜  pvm ls __pycache__ 
    demo.cpython-310.pyc  hello.cpython-310.pyc

    The command python -m compileall . recursively scans the current directory for py files and generates the corresponding pyc files.
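    The same compilation can also be triggered from Python code. The sketch below assumes a throwaway temporary directory and writes a small hello.py into it, so that it does not touch any real project files.

```python
import compileall
import pathlib
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    src = pathlib.Path(tmp) / "hello.py"
    src.write_text("def f():\n    x = 1\n    return 2\n")

    # compile_dir() is the function behind "python -m compileall <dir>".
    ok = compileall.compile_dir(tmp, quiet=1)

    cache_dir = pathlib.Path(tmp) / "__pycache__"
    pyc_files = sorted(p.name for p in cache_dir.glob("*.pyc"))

print(ok, pyc_files)  # the pyc name carries the running interpreter's tag
```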

    PYC file layout


    The first part, the magic number, consists of two pieces: a 2-byte integer followed by the two characters carriage return and line feed. "\r\n" also occupies two bytes, making a total of four bytes. The 2-byte integer differs between Python versions. For example, Python 3.5 uses values such as 3351, and Python 3.9 used 3420, 3421, 3422, 3423, 3424 and so on over the course of its development releases.
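    The magic number of the running interpreter is available as importlib.util.MAGIC_NUMBER, which makes the two pieces easy to check. The helper magic_word below is a hypothetical name introduced only for this sketch.

```python
import importlib.util

magic = importlib.util.MAGIC_NUMBER  # 4 bytes: 2-byte integer + b"\r\n"
version_word = int.from_bytes(magic[:2], "little")
print(version_word, magic[2:])  # e.g. 3439 b'\r\n' on CPython 3.10


def magic_word(pyc_header: bytes) -> int:
    """Return the 2-byte version integer from the start of a pyc header."""
    if pyc_header[2:4] != b"\r\n":
        raise ValueError("not a pyc magic number")
    return int.from_bytes(pyc_header[:2], "little")


print(magic_word(bytes.fromhex("6f0d0d0a")))
```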

    The second part is a bit field. Its main purpose is to support reproducible (deterministic) compilation results; for ordinary timestamp-based pyc files, including those written by python3.9a2, all bits of this field are 0. See PEP 552 (Deterministic pycs) for details. This field does not exist in python2 or in early versions of python3 (python3.5 does not have it yet); it was introduced in Python 3.7.
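    To see the bit field take non-zero values, one can compile the same source with each invalidation mode that py_compile offers (Python 3.7 or later). This is a sketch under the assumption of a temporary demo.py; read_flags is a helper name made up for the example.

```python
import pathlib
import py_compile
import struct
import tempfile


def read_flags(pyc_path):
    """Return the 4-byte bit field that follows the magic number."""
    with open(pyc_path, "rb") as fp:
        header = fp.read(8)
    return struct.unpack("<I", header[4:8])[0]


flags_by_mode = {}
with tempfile.TemporaryDirectory() as tmp:
    src = pathlib.Path(tmp) / "demo.py"
    src.write_text("a = 100\n")
    for mode in py_compile.PycInvalidationMode:
        target = f"{src}.{mode.name}.pyc"
        out = py_compile.compile(str(src), cfile=target, invalidation_mode=mode)
        flags_by_mode[mode.name] = read_flags(out)

# Bit 0 marks a hash-based pyc, bit 1 asks the importer to check the source.
print(flags_by_mode)
```

    With the default timestamp mode the field stays 0, which matches the hexdump analyzed in this article.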

    The third part is the last modification time of the py source file, stored as a 4-byte timestamp. The fourth part is the size of the entire py source file, also stored in 4 bytes.

    The last part is also the most important part of the entire pyc file: the serialized data of a CodeObject object. We will carefully analyze the data related to this object later.
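    Putting the parts together, a timestamp-based pyc can be assembled and taken apart in memory. The sketch below is only an illustration of the layout, not how CPython itself writes the file; build_timestamp_pyc and split_pyc are hypothetical helpers, and the mtime value is an arbitrary example.

```python
import importlib.util
import marshal
import struct


def build_timestamp_pyc(code, mtime, source_size):
    """Magic number + bit field (0) + mtime + source size + marshalled code."""
    header = importlib.util.MAGIC_NUMBER + struct.pack(
        "<III", 0, mtime & 0xFFFFFFFF, source_size & 0xFFFFFFFF
    )
    return header + marshal.dumps(code)


def split_pyc(data):
    """Split pyc bytes back into (magic, flags, mtime, size, code object)."""
    flags, mtime, size = struct.unpack("<III", data[4:16])
    return data[:4], flags, mtime, size, marshal.loads(data[16:])


source = "def f():\n    x = 1\n    return 2\n"
code = compile(source, "hello.py", "exec")
pyc = build_timestamp_pyc(code, mtime=0x642148B9, source_size=len(source.encode()))

magic, flags, mtime, size, loaded = split_pyc(pyc)
print(flags, hex(mtime), size, loaded.co_name)
```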

    Let’s now analyze a pyc file in detail. The corresponding python code is:

    def f():
        x = 1
        return 2

    The hexadecimal form of the pyc file is as follows:

    ➜  __pycache__ hexdump -C hello.cpython-310.pyc
    00000000  6f 0d 0d 0a 00 00 00 00  b9 48 21 64 20 00 00 00  |o........H!d ...|
    00000010  e3 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
    00000020  00 02 00 00 00 40 00 00  00 73 0c 00 00 00 64 00  |.....@...s....d.|
    00000030  64 01 84 00 5a 00 64 02  53 00 29 03 63 00 00 00  |d...Z.d.S.).c...|
    00000040  00 00 00 00 00 00 00 00  00 01 00 00 00 01 00 00  |................|
    00000050  00 43 00 00 00 73 08 00  00 00 64 01 7d 00 64 02  |.C...s....d.}.d.|
    00000060  53 00 29 03 4e e9 01 00  00 00 e9 02 00 00 00 a9  |S.).N...........|
    00000070  00 29 01 da 01 78 72 03  00 00 00 72 03 00 00 00  |.)...xr....r....|
    00000080  fa 0a 2e 2f 68 65 6c 6c  6f 2e 70 79 da 01 66 01  |.../hello.py..f.|
    00000090  00 00 00 73 04 00 00 00  04 01 04 01 72 06 00 00  |...s........r...|
    000000a0  00 4e 29 01 72 06 00 00  00 72 03 00 00 00 72 03  |.N).r....r....r.|
    000000b0  00 00 00 72 03 00 00 00  72 05 00 00 00 da 08 3c  |...r....r......<|
    000000c0  6d 6f 64 75 6c 65 3e 01  00 00 00 73 02 00 00 00  |module>....s....|
    000000d0  0c 00                                             |..|
    000000d2

    Because the data is stored in little-endian byte order, the values above read as follows:

    • The first part, the magic number, is 0xa0d0d6f. Its low two bytes, 0x0d6f, are 3439, the magic number of Python 3.10.

    • The second part, the bit field, is 0x0.

    • The third part, the last modification time, is 0x642148b9.

    • The fourth part, the file size, is 0x20 bytes, which means hello.py is 32 bytes long.
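    These values can be checked directly against the first 16 bytes of the hexdump. The short sketch below copies those bytes by hand and decodes them with struct; the variable names are chosen only for this example.

```python
import struct
from datetime import datetime, timezone

# First 16 bytes of the hexdump above.
header = bytes.fromhex("6f0d0d0a 00000000 b9482164 20000000")

# "<" = little endian: 2-byte integer, 2 raw bytes, then three 4-byte integers.
version_word, crlf, flags, mtime, size = struct.unpack("<H2sIII", header)
print(version_word, crlf, flags, hex(mtime), size)

# Interpret the modification time as a Unix timestamp (shown here in UTC).
modified = datetime.fromtimestamp(mtime, tz=timezone.utc)
print(modified.isoformat())
```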

    The following is a small code snippet for reading the header meta information of the pyc file:

    import binascii
    import struct
    import time

    fname = "./__pycache__/hello.cpython-310.pyc"
    with open(fname, "rb") as f:
        # The header consists of four little-endian 4-byte fields.
        magic = struct.unpack('<L', f.read(4))[0]
        bit_field = f.read(4)
        moddate = struct.unpack('<L', f.read(4))[0]
        filesz = struct.unpack('<L', f.read(4))[0]

    print(f"bit field = {binascii.hexlify(bit_field)}")
    # The modification time is a Unix timestamp, shown here in local time.
    modtime = time.asctime(time.localtime(moddate))
    print("magic %s" % (hex(magic)))
    print("moddate (%s)" % (modtime))
    print("File Size %d" % filesz)

    The output of the above code is as follows:

    bit field = b'00000000'
    magic 0xa0d0d6f
    moddate (Mon Mar 27 15:41:45 2023)
    File Size 32

    For the details of how pyc files are read and written, see the source code of importlib/_bootstrap_external.py in the Python standard library.

    CODEOBJECT

    In CPython, a CodeObject is an object that contains the bytecode, constants, variables, positional parameters, keyword parameters and other information of a piece of Python code, plus some metadata used to run the code, such as the file name and line numbers.

    In CPython, when we execute a Python module or function, the interpreter first compiles its code into a CodeObject and then executes it. During compilation, the interpreter converts the Python code into bytecode and saves it in a CodeObject object. Thereafter, whenever we call that module or function, the interpreter uses the bytecode in the CodeObject to execute the code.
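    A code object can be inspected without touching any pyc file: every function carries one in its __code__ attribute, and compile() returns one for a module. The sketch below reuses the small function f from earlier in this article; the printed bytecode differs between Python versions.

```python
def f():
    x = 1
    return 2


code = f.__code__
print(type(code).__name__)  # code
print(code.co_name, code.co_argcount, code.co_varnames)
print(code.co_code.hex())   # raw bytecode, version dependent

# A module's code object stores the code objects of its functions as constants.
module_code = compile("def f():\n    x = 1\n    return 2\n", "hello.py", "exec")
nested = [c for c in module_code.co_consts if isinstance(c, type(module_code))]
nested_names = [c.co_name for c in nested]
print(nested_names)
```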

    CodeObject objects are immutable: once created, they cannot be modified. This is because the bytecode of Python code is immutable, and the CodeObject that contains this bytecode is therefore immutable as well.
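    The immutability is easy to observe. In the sketch below, assigning to a field fails, while the replace() method (available since Python 3.8) returns a modified copy and leaves the original untouched.

```python
def f():
    return 2


code = f.__code__

# Direct assignment to a field of a code object is rejected.
try:
    code.co_name = "g"
    mutated = True
except AttributeError:
    mutated = False

# replace() builds a new code object instead of changing the existing one.
renamed = code.replace(co_name="g")
print(mutated, code.co_name, renamed.co_name)  # False f g
```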

    This article introduces the main contents of the code object and briefly describes what each part does. Later articles will analyze the source code behind the code object and the detailed role of each field.

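    Now let's take the pyc file of pycdemo.py as an example. The source of pycdemo.py is as follows:

    if __name__ == '__main__':
        a = 100
        print(a)

    The following program loads the file pycdemo01.cpython-39.pyc (that is, the pyc file that corresponds to the source above) and uses marshal to read the code object stored inside it.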

    import binascii
    import dis
    import marshal
    import struct
    import time
    import types


    def print_metadata(fp):
        # The 16-byte header: magic number, bit field, mtime and source size.
        magic = struct.unpack('<l', fp.read(4))[0]
        print(f"magic number = {hex(magic)}")
        bit_field = struct.unpack('<l', fp.read(4))[0]
        print(f"bit field = {bit_field}")
        t = struct.unpack('<l', fp.read(4))[0]
        print(f"time = {time.asctime(time.localtime(t))}")
        file_size = struct.unpack('<l', fp.read(4))[0]
        print(f"file size = {file_size}")


    def show_code(code, indent=''):
        print("%scode" % indent)
        indent += '   '
        print("%sargcount %d" % (indent, code.co_argcount))
        print("%snlocals %d" % (indent, code.co_nlocals))
        print("%sstacksize %d" % (indent, code.co_stacksize))
        print("%sflags %04x" % (indent, code.co_flags))
        show_hex("code", code.co_code, indent=indent)
        dis.disassemble(code)
        print("%sconsts" % indent)
        for const in code.co_consts:
            # Nested code objects (functions, classes) are shown recursively.
            if isinstance(const, types.CodeType):
                show_code(const, indent + '   ')
            else:
                print("   %s%r" % (indent, const))
        print("%snames %r" % (indent, code.co_names))
        print("%svarnames %r" % (indent, code.co_varnames))
        print("%sfreevars %r" % (indent, code.co_freevars))
        print("%scellvars %r" % (indent, code.co_cellvars))
        print("%sfilename %r" % (indent, code.co_filename))
        print("%sname %r" % (indent, code.co_name))
        print("%sfirstlineno %d" % (indent, code.co_firstlineno))
        # co_lnotab matches Python 3.9; newer versions prefer co_lines().
        show_hex("lnotab", code.co_lnotab, indent=indent)


    def show_hex(label, h, indent):
        h = binascii.hexlify(h)
        if len(h) < 60:
            print("%s%s %s" % (indent, label, h))
        else:
            print("%s%s" % (indent, label))
            for i in range(0, len(h), 60):
                print("%s   %s" % (indent, h[i:i+60]))


    if __name__ == '__main__':
        # Run this with the same Python version that wrote the pyc (3.9 here).
        filename = "./__pycache__/pycdemo01.cpython-39.pyc"
        with open(filename, "rb") as fp:
            print_metadata(fp)
            code_object = marshal.load(fp)
            show_code(code_object)

    Running the program above produces the following output:

    magic number = 0xa0d0d61
    bit field = 0
    time = Tue Mar 28 02:40:20 2023
    file size = 54
    code
       argcount 0
       nlocals 0
       stacksize 2
       flags 0040
       code b'650064006b02721464015a01650265018301010064025300'
      3           0 LOAD_NAME                0 (__name__)
                  2 LOAD_CONST               0 ('__main__')
                  4 COMPARE_OP               2 (==)
                  6 POP_JUMP_IF_FALSE       20
      4           8 LOAD_CONST               1 (100)
                 10 STORE_NAME               1 (a)
      5          12 LOAD_NAME                2 (print)
                 14 LOAD_NAME                1 (a)
                 16 CALL_FUNCTION            1
                 18 POP_TOP
            >>   20 LOAD_CONST               2 (None)
                 22 RETURN_VALUE
       consts
          '__main__'
          100
          None
       names ('__name__', 'a', 'print')
       varnames ()
       freevars ()
       cellvars ()
       filename './pycdemo01.py'
       name '<module>'
       firstlineno 3
       lnotab b'08010401'
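    Because the marshal format is not guaranteed to be stable across Python versions, a loader like the one above is safer if it first compares the file's magic number with that of the running interpreter. The following is a minimal sketch of that check; load_code is a hypothetical helper, and the pyc is generated in a temporary directory so the example is self-contained.

```python
import importlib.util
import marshal
import pathlib
import py_compile
import tempfile
import types


def load_code(pyc_path):
    """Load the code object from a pyc, refusing files from other versions."""
    with open(pyc_path, "rb") as fp:
        header = fp.read(16)  # magic, bit field, then two more 4-byte fields
        if header[:4] != importlib.util.MAGIC_NUMBER:
            raise ValueError("pyc was written by a different Python version")
        code = marshal.load(fp)
    if not isinstance(code, types.CodeType):
        raise TypeError("pyc payload is not a code object")
    return code


with tempfile.TemporaryDirectory() as tmp:
    src = pathlib.Path(tmp) / "pycdemo.py"
    src.write_text("if __name__ == '__main__':\n    a = 100\n    print(a)\n")
    pyc = py_compile.compile(str(src), cfile=str(src) + "c")
    code = load_code(pyc)

print(code.co_name, code.co_names)
```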

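    The roles of the fields in the code object are as follows:

    • First, the notion of a code block: a code block is a small piece of Python code that is executed as a single unit. Common code blocks in Python include function bodies, class definitions and modules.

    • argcount: the number of arguments of a code block. It is only meaningful for function bodies, since functions may take arguments. pycdemo.py above is a module rather than a function, so the value is 0.

    • co_code: a byte sequence that stores the actual Python bytecode, which the Python virtual machine executes. It is not analyzed in detail in this article.

    • co_consts: a tuple of the constants used by the code, mainly string and numeric constants, such as '__main__' and 100 above.

    • co_filename: the file name of the corresponding source file.

    • co_firstlineno: the line number in the Python source file where the first line of this code appears. This field is very important when debugging.

    • co_flags: identifies the type of the code object. 0x0080 means the block is a coroutine, 0x0010 means the code object is nested, and so on.

    • co_lnotab: used to compute the source line number that corresponds to each bytecode instruction.

    • co_varnames: the names defined locally in a code object.

    • co_names: in contrast to co_varnames, the names that are used in the code object but are not its local variables, such as global names. At module level every name is looked up this way, which is why a appears under names in the output above.

    • co_nlocals: the number of local variables used in a code object.

    • co_stacksize: because the Python virtual machine is a stack machine, this value is the maximum stack depth the code needs.

    • co_cellvars, co_freevars: these two fields are mainly related to nested functions and closures.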
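    Several of these fields are easiest to understand on a live example. The sketch below uses a made-up closure (outer/inner) and a made-up coroutine function (coro) to show the flag bits 0x0080 and 0x0010 mentioned above, together with co_cellvars and co_freevars; the flag constants come from the inspect module.

```python
import inspect


def outer(n):
    def inner(k):
        return n + k  # n is captured from the enclosing function
    return inner


async def coro():
    return 1


inner = outer(1)

# co_flags: test individual bits with the constants from inspect.
is_coroutine = bool(coro.__code__.co_flags & inspect.CO_COROUTINE)  # 0x0080
is_nested = bool(inner.__code__.co_flags & inspect.CO_NESTED)       # 0x0010
print(is_coroutine, is_nested)  # True True

# co_cellvars: locals of outer that inner functions capture.
# co_freevars: names inner takes from the enclosing scope.
print(outer.__code__.co_cellvars, inner.__code__.co_freevars)  # ('n',) ('n',)

# co_varnames / co_nlocals / co_argcount describe inner's own locals.
print(inner.__code__.co_varnames, inner.__code__.co_nlocals, inner.__code__.co_argcount)
```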


    Statement:
    This article is reproduced from yisu.com. In case of infringement, please contact admin@php.cn for removal.