How Python identifies malware
To perform static malware analysis, you need to understand the Windows PE file format, which describes the structure of today's Windows program files, such as .exe, .dll, and .sys files, and defines how they store data. PE files contain x86 instructions, data such as images and text, and the metadata a program needs in order to run.
The PE format was originally designed to perform the following operations.
1) Tell Windows how to load a program into memory
The PE format describes which blocks of a file should be loaded into memory, and where. It also tells Windows where in the program code to start executing and which dynamically linked code libraries to load into memory.
2) Provide the running program with media (or resources) that may be used during execution
These resources can include strings, such as those shown in GUI dialog boxes or console output, as well as images or videos.
3) Provide security data, such as digital code signatures
Windows uses this security data to ensure that code comes from a trusted source.
The PE format accomplishes this by using the series of structures shown in Figure 1-1.
▲Figure 1-1 PE file format
As shown in Figure 1-1, the PE file format includes a series of headers that tell the operating system how to load the program into memory. It also includes a series of sections that contain the actual program data. Windows loads these sections into memory so that their offsets in memory correspond to where they appear on disk.
Let’s explore this file structure in more detail, starting with the PE header. We'll skip the discussion of the DOS header, which is a holdover from the 1980s Microsoft DOS operating system and exists solely for compatibility reasons.
1. PE header
As shown at the bottom of Figure 1-1, above the DOS header ❶ is the PE header ❷, which defines the program's general attributes, such as binary code, images, compressed data, and other program properties. It also tells us whether the program is designed for 32-bit or 64-bit systems.
The PE header provides basic but useful contextual information for malware analysts. For example, the header includes a timestamp field that gives the time at which the malware author compiled the file. Malware authors usually replace this field with a fake value, but sometimes they forget, and the real compile time is left in place.
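To make these header fields concrete, here is a minimal sketch that reads the target machine type and the compile timestamp directly from a file's bytes using Python's standard struct module. The function name read_coff_header and the tiny synthetic header built for the demonstration are illustrative assumptions, not taken from any sample discussed in this article; the field offsets follow the published PE/COFF layout.

```python
import struct
from datetime import datetime, timezone

# Machine values from the PE/COFF specification.
MACHINE_NAMES = {0x014C: "x86 (32-bit)", 0x8664: "x64 (64-bit)"}


def read_coff_header(data):
    """Return (machine_name, compile_time_utc) parsed from raw PE bytes."""
    if data[:2] != b"MZ":
        raise ValueError("missing DOS 'MZ' signature")
    # The DOS header stores the file offset of the PE header at 0x3C.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        raise ValueError("missing PE signature")
    # The COFF header follows the 4-byte signature:
    # Machine (2 bytes), NumberOfSections (2 bytes), TimeDateStamp (4 bytes).
    machine, _num_sections, timestamp = struct.unpack_from(
        "<HHI", data, pe_offset + 4
    )
    name = MACHINE_NAMES.get(machine, hex(machine))
    return name, datetime.fromtimestamp(timestamp, tz=timezone.utc)


# Synthetic 64-byte DOS header plus a minimal PE/COFF header (demo only).
dos = b"MZ" + b"\x00" * 58 + struct.pack("<I", 64)
coff = b"PE\x00\x00" + struct.pack("<HHI", 0x014C, 1, 0)
machine_name, compiled = read_coff_header(dos + coff)
print(machine_name, compiled.isoformat())
```

On a real file, the same function could be called with the bytes returned by open(path, "rb").read(). Keep in mind that the timestamp is only as trustworthy as the author's willingness to leave it alone.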
2. Optional headers
The optional header ❸ is, contrary to what its name implies, present in virtually all of today's PE executables. It defines the location of the program's entry point in the PE file, which is the first instruction that runs once the program is loaded.
It also defines the size of the data that Windows loads into memory when it loads the PE file, the Windows subsystem that the program targets (such as the Windows GUI or the Windows command line), and other high-level details about the program. Because the entry point tells reverse engineers where to start reverse engineering, the information in this header can be extremely valuable to them.
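As a rough illustration of what the optional header holds, the sketch below decodes the entry point, image size, and subsystem from the raw bytes of an optional header. The helper name summarize_optional_header and the demo values are made up for this example; the offsets used are the ones shared by the 32-bit (PE32) and 64-bit (PE32+) layouts in the PE/COFF specification.

```python
import struct

SUBSYSTEMS = {2: "Windows GUI", 3: "Windows console"}


def summarize_optional_header(opt):
    """Decode a few fields from the raw bytes of a PE optional header."""
    (magic,) = struct.unpack_from("<H", opt, 0)
    if magic not in (0x10B, 0x20B):
        raise ValueError("not a PE32 or PE32+ optional header")
    # AddressOfEntryPoint sits at offset 16 in both PE32 and PE32+.
    (entry_rva,) = struct.unpack_from("<I", opt, 16)
    # SizeOfImage (offset 56) and Subsystem (offset 68) also share offsets.
    (size_of_image,) = struct.unpack_from("<I", opt, 56)
    (subsystem,) = struct.unpack_from("<H", opt, 68)
    return {
        "format": "PE32" if magic == 0x10B else "PE32+",
        "entry_point_rva": entry_rva,
        "size_of_image": size_of_image,
        "subsystem": SUBSYSTEMS.get(subsystem, str(subsystem)),
    }


# Synthetic header bytes for demonstration only; the values are made up.
demo = bytearray(72)
struct.pack_into("<H", demo, 0, 0x10B)    # Magic: PE32
struct.pack_into("<I", demo, 16, 0x1000)  # AddressOfEntryPoint
struct.pack_into("<I", demo, 56, 0x5000)  # SizeOfImage
struct.pack_into("<H", demo, 68, 2)       # Subsystem: Windows GUI
summary = summarize_optional_header(bytes(demo))
print(summary)
```

The entry point is stored as an offset relative to the address at which the image is loaded, which is why the sketch labels it an RVA (relative virtual address).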
3. Section header
The section headers ❹ describe the data sections contained in the PE file. A section in a PE file is a chunk of data that is either mapped into memory when the operating system loads the program or contains instructions on how the program should be loaded into memory.
In other words, a section is a sequence of bytes on disk that either becomes a string of contiguous bytes in memory or tells the operating system about some aspect of the loading process.
The section headers also tell Windows what permissions to grant to each section, such as whether it should be readable, writable, or executable by the program when it runs. For example, .text sections containing x86 code are typically marked readable and executable but not writable, to prevent the program code from accidentally modifying itself during execution.
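The permission bits described above can be decoded with a few bitwise tests. The following is a small sketch, assuming the flag values from the PE/COFF specification; the function name section_permissions and the example values are illustrative only.

```python
# Section characteristic flags defined by the PE/COFF specification.
IMAGE_SCN_MEM_EXECUTE = 0x20000000
IMAGE_SCN_MEM_READ = 0x40000000
IMAGE_SCN_MEM_WRITE = 0x80000000


def section_permissions(characteristics):
    """Render a section's memory permissions as an 'rwx'-style string."""
    return "".join(
        letter if characteristics & flag else "-"
        for letter, flag in (
            ("r", IMAGE_SCN_MEM_READ),
            ("w", IMAGE_SCN_MEM_WRITE),
            ("x", IMAGE_SCN_MEM_EXECUTE),
        )
    )


# 0x60000020 is a typical value for a code section: code + execute + read.
text_perms = section_permissions(0x60000020)
print(text_perms)  # r-x
# A section that is both writable and executable deserves a closer look.
print(section_permissions(0xE0000020))  # rwx
```

When a file is parsed with the pefile library, each entry in pe.sections exposes this value as section.Characteristics, so the helper can be applied to it directly.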
Figure 1-1 depicts a number of sections, such as .text and .rsrc, which are mapped into memory when the PE file is executed. Other special sections, such as the .reloc section, are not mapped into memory, and we will discuss these as well. Let's explore the sections shown in Figure 1-1.
1) .text section
Every PE program contains at least one section of x86 code marked as executable in its section header; these sections are almost always named .text ❺.
2) .idata section
The .idata section ❻, also known as the import section, contains the import address table (IAT), which lists the dynamic link libraries and their functions. The IAT is one of the most important PE structures to examine during the initial analysis of a PE binary, because it reveals the library calls the program makes, and these calls may in turn reveal the malware's high-level functionality.
3) Data section
The data sections in the PE file structure can include the .rsrc, .data, and .rdata sections, which store items such as mouse cursor images, button icons, audio, and other media used by the program. For example, the .rsrc section in Figure 1-1 contains printable strings that the program uses to render text.
The information in the .rsrc (resources) section is very important to malware analysts because, by examining the printable strings, graphical images, and other assets in a PE file, they can gain important clues about the file's functionality.
In Section 03, you will learn how to use the icoutils toolkit (including icotool and wrestool) to extract graphic images from the resource section of a malware binary. Then, in Section 04, you will learn how to extract printable strings from malware resource sections.
4) .reloc section
The code of a PE binary is not position independent, which means that it will not execute correctly if it is moved from its intended memory location to a new one. The .reloc section ❽ solves this problem by allowing the code to be moved without breaking it.
If the code of a PE file has been moved, the .reloc section tells the Windows operating system to translate the memory addresses in that code so that it still runs correctly. These translations typically involve adding an offset to, or subtracting one from, a memory address.
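The address translation described above boils down to simple arithmetic. Here is a minimal sketch, assuming a 32-bit image; the function name relocate and the base addresses are hypothetical values chosen for illustration.

```python
def relocate(address, preferred_base, actual_base):
    """Adjust an absolute address for an image loaded at a different base."""
    delta = actual_base - preferred_base
    # Keep the result within 32 bits, as in a 32-bit PE image.
    return (address + delta) & 0xFFFFFFFF


# Hypothetical values: image built for 0x400000 but loaded at 0x10000000.
patched = relocate(0x00401234, preferred_base=0x00400000, actual_base=0x10000000)
print(hex(patched))  # 0x10001234
```

The .reloc section supplies the list of locations in the code where an adjustment like this has to be applied; the loader computes the difference between the intended and actual base once and adds it at each listed location.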
The Python module pefile, written and maintained by Ero Carrera, has become an industry-standard malware analysis library for parsing PE files. In this section, I will show you how to use pefile to parse ircbot.exe. Code Listing 1-1 assumes that ircbot.exe is already in your current working directory.
Enter the following command to install the pefile library so that we can import it in Python:
$ pip install pefile
Now, start Python as shown in Code Listing 1-1, import the pefile module, and then use pefile to open and parse the PE file ircbot.exe.
Code Listing 1-1 Load the pefile module and parse the PE file (ircbot.exe)
$ python
>>> import pefile
>>> pe = pefile.PE("ircbot.exe")
We instantiate pefile.PE, the core class implemented by the pefile module. It parses PE files so that we can examine their attributes. By calling the PE constructor, we load and parse the specified PE file, in this case ircbot.exe. Now that the file is loaded and parsed, run the code in Code Listing 1-2 to extract information from the pe object's fields for ircbot.exe.
Code Listing 1-2 Iterate through the various sections of the PE file and print information about them
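# Based on example code by Ero Carrera (author of the pefile library)
for section in pe.sections:
    # Name, load address, size in memory, and size of the raw data on disk
    print(section.Name, hex(section.VirtualAddress),
          hex(section.Misc_VirtualSize), section.SizeOfRawData)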
Code Listing 1-3 shows the contents of the printed output.
Code Listing 1-3 Use Python’s pefile module to extract section data from ircbot.exe
We extracted data from five different sections of the PE file: .text, .rdata, .data, .idata, and .reloc. The output is given as five tuples, one for each PE section extracted. The first entry on each line identifies the PE section. (You can ignore the series of \x00 null bytes, which are simply C-style string terminators.) The remaining fields tell us how much memory each section will take up once it is loaded and where in memory it will be found.
For example, 0x1000 ❶ is the base virtual memory address at which the section will be loaded; you can think of it as the section's base memory address. The 0x32830 ❷ in the virtual size field specifies the amount of memory the section requires once loaded. The 207360 ❸ in the third field (SizeOfRawData in Code Listing 1-2) is the amount of data the section will occupy within that block of memory.
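To show how these fields relate to one another, the sketch below tidies up one section tuple and compares its two sizes. The helper name describe_section is hypothetical, and pairing the three values quoted above with the name .text is an assumption made for the demonstration.

```python
def describe_section(name, virtual_address, virtual_size, raw_size):
    """Summarize one section tuple like those printed by the section listing."""
    # Section names are padded to 8 bytes with NUL characters.
    if isinstance(name, bytes):
        name = name.decode("ascii", errors="replace")
    clean_name = name.rstrip("\x00")
    return {
        "name": clean_name,
        "loaded_at": virtual_address,
        "memory_size": virtual_size,
        "raw_size": raw_size,
        # Raw data is padded to the file alignment, so it can exceed
        # the virtual size.
        "padding": max(raw_size - virtual_size, 0),
    }


# Values quoted in the text; pairing them with the name .text is an assumption.
info = describe_section(b".text\x00\x00\x00", 0x1000, 0x32830, 207360)
print(info["name"], hex(info["loaded_at"]), info["memory_size"], info["padding"])
```

A raw size far smaller than the virtual size can be worth noting during triage, since it means most of the section's content only comes into existence at run time.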
In addition to parsing a program's sections, we can use pefile to list the DLL files that a binary will load, as well as the functions it will call within those DLLs. We can do this by dumping the IAT of the PE file. Code Listing 1-4 shows how to use pefile to dump the IAT of ircbot.exe.
Code Listing 1-4 Extract import information from ircbot.exe
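$ python
>>> import pefile
>>> pe = pefile.PE("ircbot.exe")
>>> # Print each imported DLL, followed by the functions imported from it
>>> for entry in pe.DIRECTORY_ENTRY_IMPORT:
...     print(entry.dll)
...     for function in entry.imports:
...         print('\t', function.name)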
Code Listing 1-4 produces the output shown in Code Listing 1-5 (truncated for brevity).
Code Listing 1-5 The contents of the IAT table of ircbot.exe, which shows the library functions used by this malware
As Code Listing 1-5 shows, this output is valuable for malware analysis because it lists a rich array of functions that the malware declares and will reference.
For example, the first few lines of the output tell us that the malware will write to files using WriteFile ❶, open files using CreateFileA ❷, and create new processes using CreateProcessA ❸. Although this is fairly basic information about the malware, it is a starting point for understanding its behavior in more detail.
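One way to act on an import listing is to check it against a list of functions you care about. The sketch below does this with a hypothetical watch list built from the three API names mentioned above; the function name flag_imports and the demo import table are made up and do not reproduce the real output for ircbot.exe.

```python
# Hypothetical watch list: API names taken from the discussion above,
# paired with the behavior each one suggests.
WATCHED_APIS = {
    "WriteFile": "writes to a file",
    "CreateFileA": "opens or creates a file",
    "CreateProcessA": "starts a new process",
}


def flag_imports(imports):
    """Return (dll, function, meaning) for each import on the watch list.

    `imports` is an iterable of (dll_name, [function_names]) pairs.
    """
    hits = []
    for dll, functions in imports:
        for function in functions:
            if function in WATCHED_APIS:
                hits.append((dll, function, WATCHED_APIS[function]))
    return hits


# Made-up import table for demonstration; not the real output for ircbot.exe.
demo_imports = [("KERNEL32.DLL", ["GetLocalTime", "WriteFile", "CreateProcessA"])]
hits = flag_imports(demo_imports)
for dll, function, meaning in hits:
    print(f"{dll}: {function} ({meaning})")
```

To feed this from pefile under Python 3, note that entry.dll and function.name are bytes objects, and that function.name is None for functions imported by ordinal, so the names need to be decoded and filtered before they are passed in.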
To understand how malware may be designed to trick its targets, let's look at the icons contained in its .rsrc section. For example, malware binaries are often designed to masquerade as Word documents, game installers, PDF files, and so on by using the icons of commonly used software, in order to trick users into clicking them.
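You may also find images in malware that come from programs the attackers themselves are interested in, such as network attack tools and programs that attackers run to remotely control compromised machines.
Returning to our sample image analysis, you can find this malware sample, named fakepdfmalware.exe, in this article's data directory. The sample uses the Adobe Acrobat icon to trick users into thinking it is an Adobe Acrobat document, when in fact it is a malicious PE executable.
Before we use the Linux command-line tool wrestool to extract images from the fakepdfmalware.exe binary, we first need to create a directory to hold the images we will extract. Code Listing 1-6 shows how to do all of this.
Code Listing 1-6 Shell commands for extracting images from a malware sample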
$ mkdir images
$ wrestool -x fakepdfmalware.exe --output=images
$ icotool -x -o images images/*.ico
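We first use mkdir images to create a directory to hold the extracted images. Next, we use wrestool to extract (-x) the image resources from fakepdfmalware.exe into the images directory, and then use icotool to extract (-x) the icons stored in .ico format and write them as .png graphics to the images directory (-o), so that we can view them with standard image viewers.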
If wrestool is not installed on your system, you can download it here:
http://www.nongnu.org/icoutils/
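Once you have used wrestool and icotool to convert the images in the target executable to PNG format, you can open them in your favorite image viewer and see the Adobe Acrobat icon at various resolutions.
As the example here shows, extracting images and icons from PE files is relatively simple and can quickly reveal interesting and useful information about a malware binary. Similarly, we can easily extract printable strings from malware to obtain more information, which we will do next.
Strings are sequences of printable characters within a program binary. Malware analysts often rely on the strings in a malicious sample to get a quick sense of what may be going on inside it. These strings often contain things like HTTP and FTP commands that download web pages and files, or IP addresses and hostnames that tell you which addresses the malware connects to.
Sometimes even the language used to write the strings can hint at the country of origin of a malware binary, though this can be faked. You may even find text in a string that explains, in internet slang, the purpose of the malicious binary.
Strings can also reveal more technical information about a binary. For example, you may find information about the compiler used to create it, the programming language it was written in, embedded scripts or HTML, and so on.
Although malware authors can obfuscate, encrypt, and compress all of these traces, even advanced malware authors often leave at least some of them exposed, which makes close inspection of string dumps particularly important when analyzing malware.
1. Using the strings program
The standard way to view all the strings in a file is to use the command-line tool strings, with the following syntax: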
$ strings filepath | less
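This command prints all the strings in the file to the terminal, one per line. Appending | less prevents the strings from scrolling past on the terminal. By default, the strings command finds all printable strings with a minimum length of 4 bytes, but you can set a different minimum length and change the various other parameters listed in the command's manual page.
I recommend simply using the default minimum string length of 4, but you can change it with the -n option. For example, "strings -n 10 filepath" extracts only strings with a minimum length of 10 bytes.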
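For readers who want to do the same thing from Python, here is a simplified approximation of what the strings tool does for ASCII text, written with the standard re module. The function name extract_strings and the demo bytes are made up for illustration; the real tool supports more encodings and options than this sketch.

```python
import re


def extract_strings(data, min_length=4):
    """Return runs of printable ASCII characters at least min_length long."""
    # Bytes 0x20-0x7E are the printable ASCII characters; tab is included too.
    pattern = re.compile(rb"[\x20-\x7e\t]{%d,}" % min_length)
    return [match.decode("ascii") for match in pattern.findall(data)]


# Demo on made-up bytes rather than a real sample.
blob = b"\x00\x01GET /index.html\x00\xffabc\x00Server: demo\x00"
found = extract_strings(blob)
print(found)
```

Passing a larger min_length plays the same role as the -n option: it trades completeness for less noise, since short runs such as "abc" above are often just coincidental byte sequences.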
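2. Analyzing the string dump
Now that we have dumped the printable strings of a malware program, the challenge is to understand what they mean. For example, suppose we dump the strings from ircbot.exe, the sample we explored earlier in this article with the pefile library, into a file named ircbotstring.txt, as follows: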
$ strings ircbot.exe > ircbotstring.txt
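ircbotstring.txt contains thousands of lines of text, but some of them should stand out. For example, Code Listing 1-7 shows a run of lines extracted from the string dump that begin with the word DOWNLOAD.
Code Listing 1-7 String output showing that the malware can download attacker-specified files to the target machine
These lines indicate that ircbot.exe will attempt to download files specified by the attacker onto the target machine.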
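Picking such lines out of a dump with thousands of entries is easy to automate. The following is a small sketch; the function name lines_starting_with is hypothetical, and the demo lines are placeholders rather than real ircbot.exe output.

```python
def lines_starting_with(lines, keyword):
    """Yield stripped lines whose text begins with keyword (case-sensitive)."""
    for line in lines:
        line = line.strip()
        if line.startswith(keyword):
            yield line


# Made-up lines standing in for a string dump; not real ircbot.exe output.
sample_dump = [
    "kernel32.dll",
    "DOWNLOAD example-one",
    "  DOWNLOAD example-two",
    "GET",
]
matches = list(lines_starting_with(sample_dump, "DOWNLOAD"))
print(matches)
```

Because the function accepts any iterable of lines, an open file object for ircbotstring.txt could be passed in place of the list; opening it with errors="replace" avoids decoding failures on stray bytes.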
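Let's analyze another one. The string dump shown in Code Listing 1-8 indicates that ircbot.exe can act as a web server, listening on the target machine for connections from the attacker.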
Code Listing 1-8 String output showing that the malware has an HTTP server that the attacker can connect to
Code Listing 1-8 shows a variety of HTTP boilerplate strings that ircbot.exe uses to implement an HTTP server. This HTTP server likely allows an attacker to connect to the target machine over HTTP and issue commands, such as taking a screenshot of the victim's desktop and sending it back to the attacker.
We see evidence of HTTP functionality throughout the listing. For example, the GET method ❶ requests data from an internet resource. The line HTTP/1.0 200 OK ❷ is an HTTP string that returns status code 200, indicating that the HTTP transaction succeeded, and Server:myBot ❸ indicates that the name of the HTTP server is myBot, an HTTP server built into ircbot.exe.
All this information helps understand and block specific malware samples or malicious activity. For example, knowing that a malware sample has an HTTP server that outputs a specific string when you connect to it can allow you to scan your network to identify infected hosts.
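As a sketch of how such an indicator might be checked, the code below parses the Server header out of a raw HTTP response and compares it with the myBot name seen in the strings. The function names server_banner and looks_infected and the two sample responses are assumptions made for illustration; how the response bytes are collected from hosts on your network is left out.

```python
def server_banner(raw_response):
    """Return the Server header value from raw HTTP response bytes, or None."""
    head = raw_response.split(b"\r\n\r\n", 1)[0]
    for line in head.split(b"\r\n")[1:]:  # skip the status line
        name, sep, value = line.partition(b":")
        if sep and name.strip().lower() == b"server":
            return value.strip().decode("latin-1")
    return None


def looks_infected(raw_response, indicator="myBot"):
    """True if the response advertises the indicator as its server name."""
    return server_banner(raw_response) == indicator


# Synthetic responses for demonstration only.
suspicious = b"HTTP/1.0 200 OK\r\nServer:myBot\r\n\r\n<html></html>"
benign = b"HTTP/1.1 200 OK\r\nServer: nginx\r\n\r\n"
print(looks_infected(suspicious), looks_infected(benign))
```

An exact match keeps false positives low, at the cost of missing variants of the malware that change the server name.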