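A customer's Oracle 10.2.0.4 database had one instance crash unexpectedly, and the customer asked us to help find the cause. For a sudden crash like this, the first place we look is the database alert log. It shows that just before the crash the SMON process raised ORA-00600 [15709], and immediately afterwards the database logged the message "Fatal internal error happened while SMON was doing active transaction recovery." In other words, SMON hit an exception while recovering active transactions, and that ultimately brought the instance down. The alert log output is shown below: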
Fri Sep 26 10:53:35 2014
Errors in file /oracle/app/oracle/admin/wxyydb/bdump/wxyydb_smon_28997.trc:
ORA-00600: internal error code, arguments: [15709], [29], [1], [], [], [], [], []
ORA-30319: Message 30319 not found; product=RDBMS; facility=ORA
Fri Sep 26 10:53:55 2014
Fatal internal error happened while SMON was doing active transaction recovery.
Fri Sep 26 10:53:55 2014
Errors in file /oracle/app/oracle/admin/wxyydb/bdump/wxyydb_smon_28997.trc:
ORA-00600: internal error code, arguments: [15709], [29], [1], [], [], [], [], []
ORA-30319: Message 30319 not found; product=RDBMS; facility=ORA
SMON: terminating instance due to error 474
Termination issued to instance processes. Waiting for the processes to exit
Fri Sep 26 10:54:05 2014
Instance termination failed to kill one or more processes
Instance terminated by SMON, pid = 28997
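The first-pass check of the alert log can be partly automated. The following is a minimal sketch, not an Oracle tool: the function name `summarize_alert_log` and both regular expressions are illustrative assumptions, written only against the message formats visible in the excerpt above.

```python
import re

# Patterns are assumptions based solely on the alert log lines quoted above.
ORA600_RE = re.compile(r"ORA-00600: internal error code, arguments: \[(\d+)\]")
TERMINATE_RE = re.compile(r"(\w+): terminating instance due to error (\d+)")


def summarize_alert_log(text):
    """Collect distinct ORA-00600 first arguments and the terminating process, if any."""
    first_args = sorted(set(ORA600_RE.findall(text)))
    match = TERMINATE_RE.search(text)
    terminated_by = (match.group(1), int(match.group(2))) if match else None
    return {"ora600_first_args": first_args, "terminated_by": terminated_by}


# Lines copied from the alert log excerpt in this article.
SAMPLE = """\
Fri Sep 26 10:53:35 2014
Errors in file /oracle/app/oracle/admin/wxyydb/bdump/wxyydb_smon_28997.trc:
ORA-00600: internal error code, arguments: [15709], [29], [1], [], [], [], [], []
ORA-30319: Message 30319 not found; product=RDBMS; facility=ORA
Fri Sep 26 10:53:55 2014
Fatal internal error happened while SMON was doing active transaction recovery.
SMON: terminating instance due to error 474
"""

summary = summarize_alert_log(SAMPLE)
print(summary)
```

The first ORA-00600 argument is the value to search for on Oracle Support, and the terminating process tells you which trace file to open next.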
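Next, we look at the trace file wxyydb_smon_28997.trc. It shows that the SMON process repeatedly attempted parallel transaction recovery. During that recovery it hit the ORA-00600 error, and the exception in the underlying code ultimately crashed the instance.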
*** 2014-09-26 10:10:36.236
Parallel Transaction recovery caught error 30319
*** 2014-09-26 10:15:10.643
Parallel Transaction recovery caught exception 30319
*** 2014-09-26 10:15:21.816
Parallel Transaction recovery caught error 30319
*** 2014-09-26 10:19:51.707
Parallel Transaction recovery caught exception 30319
*** 2014-09-26 10:53:35.830
ksedmp: internal or fatal error
ORA-00600: internal error code, arguments: [15709], [29], [1], [], [], [], [], []
ORA-30319: Message 30319 not found; product=RDBMS; facility=ORA
----- Call Stack Trace -----
calling location / call type / entry point / argument values in hex (? means dubious value)
ksedst()+64 call ksedst1() 000000000 ? 000000001 ?
ksedmp()+2176 call ksedst() 000000000 ? C000000000000C9F ? 4000000004057F40 ? 000000000 ? 000000000 ? 000000000 ?
ksfdmp()+48 call ksedmp() 000000003 ?
kgeriv()+336 call ksfdmp() C000000000000695 ? 000000003 ? 40000000095185E0 ? 00000EC33 ? 000000000 ? 000000000 ? 000000000 ? 000000000 ?
kgeasi()+416 call kgeriv() 6000000000031770 ? 6000000000032828 ? 4000000001A504E0 ? 000000002 ? 9FFFFFFFFFFFA138 ?
$cold_kxfpqsrls()+1168 call kgeasi() 6000000000031770 ? 9FFFFFFFFD3D2290 ? 000003D5D ? 000000002 ? 000000002 ? 0000003E7 ? 000003D5D ? 9FFFFFFFFD3D22A0 ?
kxfpqrsod()+1104 call $cold_kxfpqsrls() C0000004FDF7A838 ? C0000004FDF74430 ? 000000004 ? 9FFFFFFFFFFFA200 ? C0000000000011AB ? 4000000003AA1250 ? 00000EDF5 ? 000000001 ?
kxfpdelqrefs()+640 call kxfpqrsod() C0000004FDF74430 ? 000000001 ? 60000000000B6300 ? C000000000000694 ? 4000000003DD14F0 ? 00000EE2D ? 60000000000C6708 ?
kxfpqsod_qc_sod()+2016 call kxfpdelqrefs() 00000003E ? 000000001 ? 60000000000B6300 ? C000000000001028 ? 40000000025DE5A0 ? 4000000001B1A110 ? 60000000000C2D04 ? 60000000000C2E90 ?
kxfpqsod()+816 call kxfpqsod_qc_sod() 000000010 ? 000000001 ? 9FFFFFFFFFFFA260 ? 60000000000B6300 ? 9FFFFFFFFFFFA7F0 ? C000000000001028 ? 40000000025DF810 ? 00000EE65 ?
ktprdestroy()+208 call kxfpqsod() C0000004FDF7A838 ? 000000001 ? 9FFFFFFFFFFFA810 ? 60000000000B6300 ? 9FFFFFFFFFFFAD90 ?
ktprbeg()+8272 call ktprdestroy() C000000000001026 ? 40000000025615B0 ? 000006E61 ? 000000000 ? 4000000001052E40 ? 000000000 ?
ktmmon()+10096 call ktprbeg() 9FFFFFFFFFFFBE70 ? 9FFFFFFFFFFFADA0 ? 60000000000B6300 ? 40000000028B75A0 ? 00000EF21 ? 9FFFFFFFFFFFADD8 ? 9FFFFFFFFFFFADE0 ?
ktmSmonMain()+64 call ktmmon() 9FFFFFFFFFFFD140 ?
ksbrdp()+2816 call ktmSmonMain() C000000100E1CA60 ? C000000000000FA5 ? 000007361 ? 4000000003B5AE10 ? C000000000000205 ? 400000000409DCD0 ?
opirip()+1136 call ksbrdp() 9FFFFFFFFFFFD150 ? 60000000000B6300 ? 9FFFFFFFFFFFDC90 ? 4000000002863EF0 ? 000004861 ? C000000000000B1D ? 60000000000318F0 ?
$cold_opidrv()+1408 call opirip() 9FFFFFFFFFFFEA70 ? 000000004 ? 9FFFFFFFFFFFF090 ? 9FFFFFFFFFFFDCA0 ? 60000000000B6300 ? C000000000000DA1 ?
sou2o()+336 call $cold_opidrv() 000000032 ? 9FFFFFFFFFFFF090 ? 60000000000C2C78 ?
$cold_opimai_real()+640 call sou2o() 9FFFFFFFFFFFF0B0 ? 000000032 ? 000000004 ? 9FFFFFFFFFFFF090 ?
main()+368 call $cold_opimai_real() 000000003 ? 000000000 ?
main_opd_entry()+80 call main() 000000003 ? 9FFFFFFFFFFFF598 ? 60000000000B6300 ? C000000000000004 ?
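Searching Oracle Support for ORA-00600 [15709], we found the document "SMON may fail with ORA-00600 [15709] Errors Crashing the Instance" (Doc ID 736348.1), whose error messages are essentially the same as ours. The document lists the call stack at the point of failure: kxfpqsrls <- kxfpqrsod <- kxfpdelqrefs <- kxfpqsod_qc_sod <- kxfpqsod <- ktprdestroy <- ktprbe <- ktmmon. The stack in our SMON trace is essentially the same. So the failure during recovery matches bug 6954722. If the patch for that bug is already installed and a similar problem still occurs, the likely cause is another, similar bug, 9233544. Oracle certainly has no shortage of bugs.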
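Comparing a trace against a published stack signature by eye is error-prone, so here is a minimal sketch of how the comparison could be scripted. Everything in it is an assumption for illustration: the names `caller_frames` and `matches_signature`, the regular expression, and the choice to compare by prefix (the signature as quoted lists `ktprbe`, while the trace shows `ktprbeg`). The sample is an abbreviated copy of the trace with the argument values dropped.

```python
import re

# Signature as quoted in the text from Doc ID 736348.1, innermost frame first.
KNOWN_SIGNATURE = [
    "kxfpqsrls", "kxfpqrsod", "kxfpdelqrefs", "kxfpqsod_qc_sod",
    "kxfpqsod", "ktprdestroy", "ktprbe", "ktmmon",
]

# A caller frame looks like "name()+offset call ..."; a "$cold_" prefix is ignored.
FRAME_RE = re.compile(r"(?:\$cold_)?([A-Za-z_]\w*)\(\)\+?\d*\s+call\b")


def caller_frames(trace_text):
    """Return the calling-function names in trace order (innermost first)."""
    return FRAME_RE.findall(trace_text)


def matches_signature(frames, signature):
    """True if signature occurs as a contiguous run in frames (prefix comparison)."""
    n = len(signature)
    for start in range(len(frames) - n + 1):
        window = frames[start:start + n]
        if all(frame.startswith(name) for frame, name in zip(window, signature)):
            return True
    return False


# Abbreviated from the SMON trace above; argument columns omitted.
SAMPLE_STACK = """\
kgeasi()+416 call kgeriv()
$cold_kxfpqsrls()+1168 call kgeasi()
kxfpqrsod()+1104 call $cold_kxfpqsrls()
kxfpdelqrefs()+640 call kxfpqrsod()
kxfpqsod_qc_sod()+2016 call kxfpdelqrefs()
kxfpqsod()+816 call kxfpqsod_qc_sod()
ktprdestroy()+208 call kxfpqsod()
ktprbeg()+8272 call ktprdestroy()
ktmmon()+10096 call ktprbeg()
ktmSmonMain()+64 call ktmmon()
"""

frames = caller_frames(SAMPLE_STACK)
print(matches_signature(frames, KNOWN_SIGNATURE))
```

Prefix comparison is deliberately lenient, so a positive result only says the stack is worth checking against the note; it does not confirm the bug.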
Bug 6954722 affects versions 9.2.0.8 and 10.2.0.4, and is fixed in 10.2.0.4.2, 10.2.0.5, 11.1.0.7 and 11.2.0.1. The ways to resolve bug 6954722 are:
1.Use the following workaround
Set fast_start_parallel_rollback=false and recovery_parallelism=0
OR
2.Apply one-off <
OR
3.Upgrade to fixed release 10.2.0.5, 11.1.0.7 or 11.2.0.1.
Bug 9233544 affects versions 10.2.0.4, 11.1.0.7 and 11.2.0.1, and is fixed in 11.2.0.3 and 12.1. The ways to resolve bug 9233544 are:
1.Apply patchset 11.2.0.3, in which Bug: 9233544 is fixed.
OR
2.Check if one-off Patch:9233544 is available for your release and platform here.
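We carefully checked the patches installed on the system and found that patch 6954722 was already applied, which points to bug 9233544 as the cause. The options are either to upgrade to 11.2.0.3 or to install the one-off patch 9233544. Upgrading to 11.2.0.3 is too large a change, so we advised the customer to consider installing the one-off patch instead.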
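The elimination reasoning used here can be written down as a small lookup. This is only a sketch: the affected-version lists come from the text of this article, while the `BUGS` layout, the function name `candidate_bugs`, exact string matching on the version, and the assumption that each one-off patch carries the same number as its bug are all illustrative choices.

```python
# Affected versions as stated in this article; the structure is hypothetical.
BUGS = {
    "6954722": {"affected": {"9.2.0.8", "10.2.0.4"}},
    "9233544": {"affected": {"10.2.0.4", "11.1.0.7", "11.2.0.1"}},
}


def candidate_bugs(version, installed_patches):
    """Bugs that affect this version and whose patch is not yet installed."""
    return sorted(
        bug
        for bug, info in BUGS.items()
        if version in info["affected"] and bug not in installed_patches
    )


# The customer's case: 10.2.0.4 with patch 6954722 already applied.
print(candidate_bugs("10.2.0.4", {"6954722"}))
```

With no patches installed on 10.2.0.4, both bugs remain candidates, which is why checking the patch inventory was the deciding step.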
Original article: ORA-00600: internal error code, arguments: [15709]. Thanks to the original author for sharing.