When we first came into contact with programming, the first small project most of us completed was "hello world". Within minutes of meeting a new language we can write its hello world. It looks like nothing more than a few letters, yet most people still cannot explain what happens inside the machine when this simple program runs. That is the subject of this article: the operating mechanism of a program.
How does the text "hello world" end up on the monitor? The code the CPU executes is certainly different from the code we write, so what does it look like? How does our source code turn into code the CPU can execute? Where does the code live while the program is running, and how is it organized? Where are the program's variables stored? How are function calls carried out? This article briefly discusses how a program works.
What the development platform hides
Every language has its own development platform, and most of our programs are born there. Turning source code into an executable file actually takes many steps and is quite complicated. Today's development platforms take care of all of them, which is convenient but also hides a lot of implementation detail. As a result, most programmers only write code, and the rest of the conversion work is done silently by the development platform.
As I understand it, the path from source code to a running program can, put simply, be divided into the following stages:
1. Translate the source code into machine language and organize the resulting machine code according to certain rules. Let's call the result file A for now.
2. Link file A with the files B that A needs in order to run (library functions, for example) to form file A+.
3. Load file A+ into memory and run it.
(Reference books and other materials list more steps than these; to keep things simple I summarize them as three.)
These are the key steps in producing and running an executable file, and none of them can be skipped. You can see how much the development platform has kept out of sight. The following sections clear away the fog and show what the platform really does.
Object file
There is a classic saying in the computer field:
"Any problem in computer science can be solved by another layer of indirection."
In other words, any problem in computer science can be solved by adding an intermediate layer.
For example, to convert A into B, you can first convert A into an intermediate file A+ and then convert A+ into the file B that we need. (Polya describes the same method in "How to Solve It": when solving a problem, you can simplify it by introducing an intermediate step.)
The path from source code to executable file can be understood the same way: the problem is solved by adding intermediate layers, one after another, between the two.
As mentioned above, the source program is first converted into the intermediate file A, and the intermediate file is then converted into the final file we need.
This is the general approach when processing files.
The more professional term for the file A mentioned above is object file. It is not an executable program; it has to be linked with other object files and then loaded before it can be executed. For a source program, the first thing the development platform does is translate it into machine language. The most important part of that translation is compilation, which many readers will know as turning source code into machine language (in the end, a sequence of binary codes). Compilation is an important topic, but it is not the focus of this article; if you are interested, you can look it up yourself.
Object file format:
Now let's look at how the object file mentioned above is organized, that is, its storage structure.
Origin:
Imagine you were the designer: how would you organize these binary codes? Just as the items on a desk are sorted and put in their places so that they are easy to manage, the translated binary codes are stored by category, with everything that represents code kept together and everything that represents data kept together. The binary file is thus divided into blocks, and each such block is called a segment (the ELF tools used below call them sections).
Standards:
As with many things in computer science, a standard was developed for this storage layout so that people and programs could interoperate, and COFF (Common Object File Format) was born. The object file formats of today's mainstream operating systems are closely related to it: PE on Windows is an extension of COFF, and ELF on Linux is organized along the same lines.
a.out:
a.out is the default output file name. In other words, if you do not give the output a name when compiling, a file named a.out is generated.
I won't go into why this name is used; if you are interested, you can look it up yourself.
A typical object file is laid out as follows. Real files differ in detail, but they are all derived from this basis.
ELF file header: the first part of the file. It contains basic information about the object file, such as the file version, the target machine type, and the program entry address.
Text segment: holds mainly the code of the program.
Data segment: holds the data of the program, such as variables.
Relocation segment:
The relocation segment covers text relocation and data relocation and holds the relocation information. Code usually refers to external functions or variables. Because these are only references, the functions and variables themselves are not in this object file, and their actual addresses have to be filled in before they can be used (this happens during linking). The relocation tables provide the information needed to find the places to patch and the addresses to patch in. With that in mind, text relocation and data relocation are simply the relocation entries for the code and for the data.
Symbol table: The symbol table records the symbols defined or referenced by the source code, such as function names and global variable names. For example, if the code contains the symbol "student", the symbol table has an entry for it, including the segment in which the symbol lives, its attributes (such as its type and size), and other related information.
The symbol table can be traced back to the compiler's own bookkeeping: as the compiler analyzes the code, starting with lexical analysis, it records each symbol and its attributes in a symbol table.
String table: It works alongside the symbol table and stores strings, such as the names of symbols and segments.
One more point: an object file is stored in binary; it is itself a binary file.
Real object files are more complicated than this model, but the idea is the same: the contents are stored by type, together with some sections that describe the object file and some that hold the information needed for linking.
Sections of a.out
Hello World
Seeing is believing, so let's study the object file produced by compiling hello world, written here in C.
Simple hello world source code:
So that there is something to put in the data segment, the program also defines the global variable "int a=5".
If you are using VC, click Run to see the result.
To see clearly how the program is processed internally, we compile it with GCC.
Run
gcc hello.c
Looking at the directory, there is a new file, a.out. Strictly speaking, this command compiles and links in one go, so a.out is already linked; gcc -c hello.c stops after compilation and leaves the unlinked object file hello.o. The listings discussed below appear to show the unlinked form, and the same tools work on both files.
What we want to do now is see what is inside a.out. Some readers may think of opening it in vim. I naively thought so too at the time, but a.out does not give up its secrets that easily: it is a binary file, and a text editor shows only gibberish. "Most of the problems we encounter have already been encountered and solved by our predecessors." Indeed, there is a very powerful tool called objdump, with which we can examine every detail of an object file. There is also a very useful tool called readelf, which is introduced later.
Both tools are generally included with Linux; you can look up their details yourself.
Note: The code here is compiled with GCC under Linux, and the object files are examined with objdump and readelf. I describe the result of every command, so if you have never used Linux you should still have no trouble following. I use Ubuntu, and it feels pretty good.
The following is the organizational structure of a.out (the starting address of each segment, its size, and so on).
The command to view it is objdump -h a.out
The result matches the object file format described above: the contents are stored by category, and the object file is divided into 6 sections.
From left to right, the first column (Idx Name) is the index and name of the segment, the second column (Size) is its size, VMA is the virtual memory address, LMA is the load memory address, and File off is the offset within the file, that is, the distance of the segment from the beginning of the file. The last column, Algn, is the alignment of the segment. The segment attributes are printed on the line below each entry; ignore them for now.
"text" segment: the code segment.
"data" segment: the data segment mentioned above. It holds the data from the source code, normally the initialized data.
"bss" segment: also a data segment, holding uninitialized data. Such data needs no space in the file, only a record of how much memory to reserve, so it is kept separately.
"rodata" segment: the read-only data segment; the data stored in it cannot be modified.
"comment" segment: stores compiler version information.
The remaining two sections have no practical significance for our discussion and are not introduced here. Just think of them as holding some linking, compilation, and loading information.
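To make the categories concrete, here is a minimal sketch of how a C compiler such as GCC typically distributes the parts of a program among these sections. The variable and function names are made up for illustration, and the exact placement can vary with the compiler and its options.

```c
#include <assert.h>
#include <string.h>

int initialized_counter = 5;            /* initialized data: typically "data" */
int zeroed_buffer[1024];                /* uninitialized data: typically "bss" */
const char greeting[] = "hello world";  /* read-only data: typically "rodata" */

/* The machine code of a function goes into "text". */
int sum_of_globals(void)
{
    /* Uninitialized globals are zero-filled when the program is loaded. */
    assert(zeroed_buffer[0] == 0);
    return initialized_counter + zeroed_buffer[0] + (int)strlen(greeting);
}
```

Compiling a file like this with gcc -c and running objdump -h on the result is a simple way to watch the section sizes change as variables are added or removed.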
Note:
The format shown here covers only the main parts of a real object file; some parts that exist in practice are not listed. If you are also using Linux, you can use objdump -x to list the segment contents in more detail.
A closer look at a.out
The part above used an example to describe the typical segments of an object file, mainly the information about each segment, such as its size and other attributes.
So what exactly is in these segments? What is actually stored in the "text" segment? Let's turn to objdump again.
objdump -s a.out
The -s option displays the contents of each segment in hexadecimal.
The result is as follows:
The contents of each segment are listed in hexadecimal. The output has two parts: the columns on the left are the hexadecimal representation, and the column on the right shows the same bytes as characters.
The most obvious item is "hello world" in the "rodata" read-only data segment. Sigh, it seems that "hello" was typed incorrectly in my program and an extra "w" was added at the end. Retaking the screenshots is troublesome, so please forgive me.
You can also look up the ASCII values of "hello world": the corresponding hexadecimal values are exactly the contents of the segment.
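The check can also be done in a few lines of C. The sketch below (the function name ascii_to_hex is chosen here for illustration) encodes a string as the two-digit hexadecimal values of its bytes, which is the form the hexadecimal dump uses:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Writes the bytes of `text` as two-digit hex values separated by spaces
 * into `out` (capacity `out_size`). Returns the number of bytes encoded. */
size_t ascii_to_hex(const char *text, char *out, size_t out_size)
{
    size_t n = strlen(text);
    assert(out_size >= 3 * n + 1);   /* room for "xx " per byte plus '\0' */
    out[0] = '\0';
    for (size_t i = 0; i < n; i++) {
        /* Go through unsigned char so bytes >= 0x80 are not sign-extended. */
        snprintf(out + 3 * i, out_size - 3 * i, "%02x%s",
                 (unsigned)(unsigned char)text[i], i + 1 < n ? " " : "");
    }
    return n;
}
```

For the string "hello world" this yields 68 65 6c 6c 6f 20 77 6f 72 6c 64, the byte sequence to look for in the "rodata" part of the dump.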
The "comment" segment, as mentioned above, holds compiler version information. Its content reads: the GCC compiler, followed by the version number.
a.out disassembly
Compilation first converts the source text into assembly form and then translates that into machine language (another intermediate layer). Having looked at a.out so closely, it is worth studying its assembly form as well.
objdump -d a.out lists the assembly form of the file. Only the main part is shown here, that is, the main function. In fact, a good deal of work is also done before main starts executing and after it finishes, namely setting up the execution environment and releasing the resources the program occupied.
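One way to observe this hidden work from C is sketched below. It relies on a GCC/Clang extension rather than standard C: functions marked with the constructor or destructor attribute are called by the startup and shutdown code around main. The function names are illustrative.

```c
#include <assert.h>
#include <stdio.h>

static int startup_ran = 0;

/* GCC/Clang extension: the startup code calls this before main(). */
__attribute__((constructor))
static void before_main(void)
{
    startup_ran = 1;
    puts("startup: running before main");
}

/* GCC/Clang extension: called after main() returns or exit() is called. */
__attribute__((destructor))
static void after_main(void)
{
    assert(startup_ran == 1);   /* the constructor must already have run */
    puts("cleanup: running after main");
}

/* Returns 1 once the constructor has run, which is already the case
 * by the time main() starts. */
int startup_has_run(void)
{
    return startup_ran;
}
```

Linked into a program whose main simply calls startup_has_run(), the first message appears before anything main prints and the second after main has returned.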
In the listing, the left side is the hexadecimal form of the code and the right side is the assembly form. Readers who are familiar with assembly should be able to understand most of it, so I won't go into details here.
a.out file header
When the object file format was introduced, the file header was mentioned; it contains basic information about the object file, such as the file version, the target machine type, and the program entry address.
The following is the format of the file header:
You can view it with readelf -h. (The file examined here is hello.o, which is compiled from the source file hello.c but not linked. Its header is mostly the same as that of a.out.)
The output has two columns: the left column is the attribute and the right column is its value. The first row is usually called the magic number and is followed by a series of numbers. I won't go into their meanings in detail; you can look them up yourself.
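For ELF files, the magic number begins with the byte 0x7f followed by the letters E, L, F. A minimal sketch of a check for it, with the helper name has_elf_magic chosen here for illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* The first four bytes of every ELF file: 0x7f followed by "ELF". */
static const unsigned char ELF_MAGIC[4] = { 0x7f, 'E', 'L', 'F' };

/* Returns 1 if `bytes` (of length `len`) starts with the ELF magic number. */
int has_elf_magic(const unsigned char *bytes, size_t len)
{
    assert(bytes != NULL);
    if (len < sizeof ELF_MAGIC)
        return 0;
    return memcmp(bytes, ELF_MAGIC, sizeof ELF_MAGIC) == 0;
}
```

Reading the first few bytes of hello.o or a.out with fread and passing them to this function should return 1, while a plain text file returns 0.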
The rest of the header is other information about the object file. Since it is not closely related to the questions we want to discuss, it is not covered here.
The content so far has used concrete examples to describe the internal organization of an object file. The object file is only an intermediate product on the way to the executable file, and how the program actually runs has not yet been discussed. How the object file is turned into an executable file, and how the executable file is executed, are the subject of the following sections.
A simple understanding of linking
In layman's terms, linking means putting several object files together.
If program A references a function defined in file B, then for A to execute normally, the code of that function in B has to be combined with the code of A. The process of merging A and B into one file is linking.
A dedicated program, called the linker, does this work. It takes a number of object files as input and combines them into one output file. These object files usually refer to each other's data and functions.
Above we saw the disassembly of hello world. That file had not been linked, which means that where it references an external function, the address of the function is not yet known:
As shown below:
In the listing, the call instruction calls the printf() function. Because printf() is not in this file at this point, its address cannot be determined, and the hexadecimal bytes "ff ff ff" stand in for it. After linking, this placeholder is replaced by the actual address of the function, because by then the function has been brought into the file.
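The same situation can be seen from the source side. The sketch below declares puts by hand instead of including stdio.h, to stress that this file holds only a reference to the function; print_greeting is a name chosen for illustration.

```c
#include <assert.h>

/* Only a declaration: this file contains no code for puts().
 * The compiler emits a call with a placeholder address plus a relocation
 * entry; the linker fills in the real address from the C library. */
extern int puts(const char *text);

/* Prints the greeting and returns the non-negative result of puts(). */
int print_greeting(void)
{
    int status = puts("hello world");
    assert(status >= 0);   /* puts reports failure with a negative value (EOF) */
    return status;
}
```

Compiling this file with gcc -c succeeds even though puts is defined nowhere in it; the reference is only resolved when the object file is linked.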
Classification of linking: Depending on when the data or functions that A depends on are combined with it, linking can be divided into static linking and dynamic linking.
Static linking:
The linking work is completed before the program is executed; the file cannot be executed until linking is finished. This has an obvious disadvantage, which library functions illustrate well. If both file A and file B use a certain library function, then after linking each of the two linked files contains its own copy of that function. When A and B run at the same time, there are two copies of the library function in memory, which wastes storage space. The waste becomes especially noticeable as the scale grows. Static linking is also hard to upgrade: when a library changes, every program that uses it has to be linked again. To solve these problems, many programs today use dynamic linking.
Dynamic linking: Unlike static linking, dynamic linking is performed when the program is executed, that is, when the program is loaded and run. In the example above, if both A and B use the library function Fun(), only one copy of Fun() needs to be in memory while A and B are running.
There is much more to know about linking, which will be discussed in a separate article. I won't go into details here.
A simple explanation of loading
We know that a program must be loaded into memory before it can run. On early machines the entire program was loaded into physical memory. Nowadays a virtual memory mechanism is generally used: each process has a complete address space of its own, which gives each process the impression that it has the whole memory to itself. A memory manager then maps the virtual addresses to actual physical memory addresses.
Accordingly, the addresses of a program can be divided into virtual addresses and physical addresses. The virtual address is the program's address in its virtual address space, and the physical address is the actual address at which it is loaded.
You may have noticed when viewing the segments above that, since the file has been neither linked nor loaded, the virtual address (VMA) and the load address (LMA) of every segment are 0.
The loading process can be understood like this: first virtual addresses are assigned to each part of the program, and then a mapping from the virtual addresses to physical addresses is established. The key part is this mapping from virtual addresses to physical addresses. Once the program is loaded, the CPU's program counter (PC) is set to the entry point of the code, and the program then executes instruction by instruction.
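The addresses a running program can observe are virtual addresses. As a rough illustration (the names are made up, and the printed values differ between systems and often between runs), the following function prints where its code, a global variable, a local variable and a heap block live:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int global_value = 5;   /* initialized global: data segment */

/* Prints the virtual addresses of code, global data, a stack variable and a
 * heap block to `out`. Returns the number of lines printed, or -1 if the
 * heap allocation fails. */
int print_virtual_addresses(FILE *out)
{
    assert(out != NULL);

    int local_value = 0;                            /* stack */
    int *heap_value = malloc(sizeof *heap_value);   /* heap */
    if (heap_value == NULL)
        return -1;

    /* The function pointer goes through uintptr_t because ISO C does not
     * define a direct conversion from a function pointer to void *. */
    fprintf(out, "code:  %p\n", (void *)(uintptr_t)print_virtual_addresses);
    fprintf(out, "data:  %p\n", (void *)&global_value);
    fprintf(out, "stack: %p\n", (void *)&local_value);
    fprintf(out, "heap:  %p\n", (void *)heap_value);

    free(heap_value);
    return 4;
}
```

Calling print_virtual_addresses(stdout) from main shows four addresses in quite different regions of the address space; none of them says anything about where the bytes sit in physical memory.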
The purpose of this article is to sort out how a program runs and what is hidden behind the execution of an executable file. Getting from source code to an executable file usually takes many intermediate steps, and each step produces an intermediate file. Today's integrated development environments hide these steps, and those of us who are used to them have gradually lost sight of these important technical details. This article only traces the main line of the process; each of the details could fill an article of its own.
I hope that after reading this article you will no longer think of "hello world" as just a simple exercise, and that you now have a clearer idea of what the operating mechanism of a program is and how it works.
The above is the detailed content of Talking about the program operating mechanism from hello world. For more information, please follow other related articles on the PHP Chinese website!