Exception information in the Linux kernel: Detailed explanation of Oops

王林
2024-02-10 12:00:28

An Oops is a special error message in the Linux kernel. The kernel prints it when it hits an exception that it did not expect but may be able to survive, such as a NULL pointer dereference, an illegal memory access, or a division by zero. An Oops usually means there is a bug in the kernel or in a driver, and it can leave the system unstable or crash it. This article introduces the format and content of an Oops, its common causes, and how to read one, using a small example and some practical precautions.

What is an Oops in Linux kernel development? In everyday English, "Oops" is what people say when they realize they have made a mistake. In the kernel it means much the same thing, except that the speaker is Linux. When a serious problem occurs, the kernel says to us apologetically: "Oops, I'm sorry, I messed up." Along with that message it prints the current register state, the stack contents, and the complete call trace, which help us locate the error.

Next, let's look at an example. To keep the focus on the Oops itself, the only thing this module does is trigger a NULL pointer dereference.

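#include <linux/init.h>   /* __init, __exit */
#include <linux/module.h> /* module_init, module_exit, MODULE_LICENSE */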

static int __init hello_init(void)
{
 int *p = 0;
 
 *p = 1; 
 return 0;
}

static void __exit hello_exit(void)
{
 return;
}

module_init(hello_init);
module_exit(hello_exit);

MODULE_LICENSE("GPL");

Obviously, the error is on line 8, where the module writes through the NULL pointer p.

Next, we compile this module and use insmod to insert it into the kernel space. As we expected, Oops appears.

[ 100.243737] BUG: unable to handle kernel NULL pointer dereference at (null)
[ 100.244985] IP: [] hello_init+0x5/0x11 [hello]
[ 100.262266] *pde = 00000000
[ 100.288395] Oops: 0002 [#1] SMP
[ 100.305468] last sysfs file: /sys/devices/virtual/sound/timer/uevent
[ 100.325955] Modules linked in: hello(+) vmblock vsock vmmemctl vmhgfs acpiphp snd_ens1371 gameport snd_ac97_codec ac97_bus snd_pcm_oss snd_mixer_oss snd_pcm snd_seq_dummy snd_seq_oss snd_seq_midi snd_rawmidi snd_seq_midi_event snd_seq snd_timer snd_seq_device ppdev psmouse serio_raw fbcon tileblit font bitblit softcursor snd parport_pc soundcore snd_page_alloc vmci i2c_piix4 vga16fb vgastate intel_agp agpgart shpchp lp parport floppy pcnet32 mii mptspi mptscsih mptbase scsi_transport_spi vmxnet
[ 100.472178]
[ 100.494931] Pid: 1586, comm: insmod Not tainted (2.6.32-21-generic #32-Ubuntu) VMware Virtual Platform
[ 100.540018] EIP: 0060:[] EFLAGS: 00010246 CPU: 0
[ 100.562844] EIP is at hello_init+0x5/0x11 [hello]
[ 100.584351] EAX: 00000000 EBX: ffffffffc ECX: f82cf040 EDX: 00000001
[ 100.609358] ESI: f82cf040 EDI: 00000000 EBP: f1b9ff5c ESP: f1b9ff5c
[ 100.631467] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
[ 100.657664] Process insmod (pid: 1586, ti=f1b9e000 task=f137b340 task.ti=f1b9e000)
[ 100.706083] Stack:
[ 100.731783] f1b9ff88 c0101131 f82cf040 c076d240 ffffffffc f82cf040 0072cff4 f82d2000
[ 100.759324] ffffffffc f82cf040 0072cff4 f1b9ffac c0182340 f19638f8 f137b340 f19638c0
[ 100.811396] 00000004 09cc9018 09cc9018 00020000 f1b9e000 c01033ec 09cc9018 00015324
[ 100.891922] Call Trace:
[ 100.916257] [] ? do_one_initcall+0x31/0x190
[ 100.943670] [] ? hello_init+0x0/0x11 [hello]
[ 100.970905] [] ? sys_init_module+0xb0/0x210
[ 100.995542] [] ? syscall_call+0x7/0xb
[ 101.024087] Code: 05 00 00 00 00 01 00 00 00 5d c3 00 00 00 00 00 00 00 00 00 00
[ 101.079592] EIP: [] hello_init+0x5/0x11 [hello] SS:ESP 0068:f1b9ff5c
[ 101.134682] CR2: 0000000000000000
[ 101.158929] ---[ end trace e294b69a66d752cb ]---

The Oops first describes what kind of bug this is, and then points out where it occurred: "IP: [] hello_init+0x5/0x11 [hello]". This reads as offset 0x5 into the function hello_init, whose total size is 0x11 bytes, inside the module hello.

To map that location back to the source code, we can use the auxiliary tool objdump, which disassembles object files. The command is as follows:

objdump -S hello.o

Below is the disassembly of hello.o. It is interleaved with the C source, which makes it very easy to read.

hello.o:     file format elf32-i386


Disassembly of section .init.text:

00000000 <hello_init>:
#include <linux/init.h>
#include <linux/module.h>

static int __init hello_init(void)
{
   0: 55                    push   %ebp
 int *p = 0;
 
 *p = 1;
 
 return 0;
}
   1: 31 c0                 xor    %eax,%eax
#include <linux/init.h>
#include <linux/module.h>

static int __init hello_init(void)
{
   3: 89 e5                 mov    %esp,%ebp
 int *p = 0;
 
 *p = 1;
   5: c7 05 00 00 00 00 01  movl   $0x1,0x0
   c: 00 00 00 
 
 return 0;
}
   f: 5d                    pop    %ebp
  10: c3                    ret    

Disassembly of section .exit.text:

00000000 <hello_exit>:

static void __exit hello_exit(void)
{
   0: 55                    push   %ebp
   1: 89 e5                 mov    %esp,%ebp
   3: e8 fc ff ff ff        call   4 <hello_exit+0x4>
 return;
}
   8: 5d                    pop    %ebp
   9: c3                    ret    

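Comparing this with the Oops, we can clearly see that the assembly code at the faulting location, hello_init+0x5, is:

   5: c7 05 00 00 00 00 01  movl   $0x1,0x0

This instruction stores the value 1 at address 0, which is of course an illegal operation.

We can also see the C code it corresponds to:

*p = 1;

Bingo! With the help of the Oops, we found the problem quickly.

Let's go back to the Oops above and see whether the Linux kernel has left us any other useful information.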

Oops: 0002 [#1]

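Here, 0002 is the Oops error code (a write fault that occurred in kernel space), and #1 is the Oops counter, meaning this is the first Oops since boot.

The error code is defined differently depending on the cause of the error. For the example in this article, the definitions below apply. If an Oops you encounter does not match them, it is best to look up the definition in the kernel source: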

* error_code:
* bit 0 == 0 means no page found, 1 means protection fault
* bit 1 == 0 means read, 1 means write
* bit 2 == 0 means kernel, 1 means user-mode
* bit 3 == 0 means data, 1 means instruction

Sometimes an Oops also prints Tainted information. It indicates the reasons why the kernel has been tainted. The flags are defined as follows:

1: ‘G’ if all modules loaded have a GPL or compatible license, ‘P’ if any proprietary module has been loaded. Modules without a MODULE_LICENSE or with a MODULE_LICENSE that is not recognised by insmod as GPL compatible are assumed to be proprietary.
2: ‘F’ if any module was force loaded by “insmod -f”, ‘ ‘ if all modules were loaded normally.
3: ‘S’ if the oops occurred on an SMP kernel running on hardware that hasn’t been certified as safe to run multiprocessor. Currently this occurs only on various Athlons that are not SMP capable.
4: ‘R’ if a module was force unloaded by “rmmod -f”, ‘ ‘ if all modules were unloaded normally.
5: ‘M’ if any processor has reported a Machine Check Exception, ‘ ‘ if no Machine Check Exceptions have occurred.
6: ‘B’ if a page-release function has found a bad page reference or some unexpected page flags.
7: ‘U’ if a user or user application specifically requested that the Tainted flag be set, ‘ ‘ otherwise.
8: ‘D’ if the kernel has died recently, i.e. there was an OOPS or BUG.
9: ‘A’ if the ACPI table has been overridden.
10: ‘W’ if a warning has previously been issued by the kernel. (Though some warnings may set more specific taint flags.)
11: ‘C’ if a staging driver has been loaded.
12: ‘I’ if the kernel is working around a severe bug in the platform firmware (BIOS or similar).

The Tainted information is mainly intended for kernel developers. A user who hits an Oops while using Linux can send its contents to the kernel developers for debugging, and the Tainted flags tell them roughly what environment the kernel was running in when the Oops occurred. If we are only debugging our own driver, this information is of little use.

The example in this article is very simple. The system kept running after the Oops, so we could read the complete message from dmesg. More often, however, the system goes down when an Oops occurs. In that case the error messages never reach a log file, and they are gone once the power is turned off. We can only record them some other way, such as copying them by hand or taking a photo of the screen.

There is a worse situation: if the Oops is too long, it does not fit on one screen. How can we view the complete content? The first method is to use the vga parameter in grub to select a higher resolution so that the screen can show more lines. Obviously, this only helps up to a point. The second method is to use two machines and send the Oops from the machine being debugged to the host's screen over a serial port. Most laptops no longer have serial ports, so this solution also has serious limitations. The third method is to use the kernel dump tool kdump to save the contents of memory and the CPU registers to a file when the Oops occurs, and then analyze the dump with gdb.

The problems you may run into while developing kernel drivers are many and strange, and the debugging methods are just as varied. An Oops is a hint the Linux kernel gives us, and we should make good use of it.

In this article we have looked at the principles and characteristics of the Oops in the Linux kernel and how it can be used to diagnose and debug kernel problems. A few basic principles apply: save the Oops information and analyze it, use symbol tables and source code to locate the problem, use module parameters and kernel parameters to adjust kernel behavior, and choose tools that fit the situation. An Oops reflects the state of the kernel at the moment of an exception, and reading it carefully helps us fix the bugs that affect the kernel's quality and stability. I hope this article is helpful to you.

This article is reproduced from lxlinux.net.