Home  >  Article  >  Database  >  内存参数设置不合理导致数据库HANG

内存参数设置不合理导致数据库HANG

WBOY
WBOYOriginal
2016-06-07 15:22:431291browse

内存参数设置不合理导致数据库HANG 现象: 2节点RAC,数据库忽然HANG住,重启一个实例后恢复正常。 分析: 故障时间段约为8:30-10:00,以下为alert报错: alert_crm2.log: Mon May 27 06:54:26 2013 SUCCESS:>Mon May 27 07:32:24 2013 Thread 2>ORA-07445:

内存参数设置不合理导致数据库HANG
现象:
2节点RAC,数据库忽然HANG住,重启一个实例后恢复正常。

分析:
故障时间段约为8:30-10:00,以下为alert报错:
alert_crm2.log:

Mon May 27 06:54:26 2013
SUCCESS:> Mon May 27 07:32:24 2013
Thread 2> ORA-07445: 出现异常错误: 核心转储 [kksMapCursor()+323] [SIGSEGV] [ADDR:0x8] [PC:0x763597B] [Address not mapped to object] []
ORA-03135: 连接失去联系
Mon May 27 09:54:56 2013
Errors> ORA-07445: 出现异常错误: 核心转储 [kksMapCursor()+323] [SIGSEGV] [ADDR:0x8] [PC:0x763597B] [Address not mapped to object] []
ORA-03135: 连接失去联系
Mon May 27 09:54:56 2013
Errors> ORA-07445: 出现异常错误: 核心转储 [kksMapCursor()+323] [SIGSEGV] [ADDR:0x8] [PC:0x763597B] [Address not mapped to object] []
ORA-03135: 连接失去联系
Mon May 27 09:54:56 2013
Errors> ORA-07445: 出现异常错误: 核心转储 [kksMapCursor()+323] [SIGSEGV] [ADDR:0x8] [PC:0x763597B] [Address not mapped to object] []
ORA-03135: 连接失去联系
Incident> USER (ospid: 15258): terminating the instance
Mon May 27 09:55:05 2013
ORA-1092 :> ORA-00600: internal error code, arguments: [723], [109464], [127072], [memory leak], [], [], [], [] Incident> loadavg : 69.72 40.04 27.44
memory> swap info: free = 0.00M alloc = 0.00M total = 0.00M
F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD
0 S> #0 0x0000003e2d6d50e7 in semop () from /lib64/libc.so.6
#1 0x000000000778a4f6> #7 0x0000000003b87b4a in kjdrchkdrm ()
#8 0x0000000003a38c5a>
Snap Id Snap Time Sessions Curs/Sess
--------- ------------------- -------- ---------
Begin Snap: 6261 27-May-13 09:00:40 404 7.5
End Snap: 6262 27-May-13 10:00:34 488 5.3
Elapsed: 59.90 (mins)
DB Time: 10,417.13 (mins)

Top 5 Timed Foreground Events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Avg
wait % DB
Event Waits Time(s) (ms)> gc current block 2-way 411,847 673 2 3.8 Cluster
gc>
*** 2013-05-27 07:26:41.101
Trace> (session) sid: 1645 ser: 1 trans: (nil), creator: 0x590fc76e0
flags: (0x51) USR/-> Dumping Session Wait History:
0:> wait_id=348204630 seq_num=17176 snap_id=1
wait> wait times (usecs) - max=infinite
wait> occurred after 228 microseconds of elapsed time
1:>




CRMM01_130527_0800.nmon.xlsx:

CPU Total CRMM01 User% Sys% Wait% Idle% CPU% CPUs

9:38:00 2.4 0.8 6.2 90.7 3.2 24
9:39:31 1.3 1 5.9 91.9 2.3 24
9:41:01 16 5 7.6 71.4 21 24
9:42:33 91.3 7.9 0.2 0.6 99.2 24 Time PID %CPU %Usr %Sys Size ResSet ResText ResData ShdLib MinorFault MajorFault Command
8:07:34 773 0.76 0 0.76 0 0 0 0 0 0 0> 8:07:34 774 36.91 0 36.91 0 0 0 0 0 0 0 kswapd1 1.54 0

Memory MB CRMM01>
Paging> Node1: swap increased after 8:00.

CRMM01_130527_0800.nmon.xlsx:
Paging> 8:03:03 589 10.81 kswapd0
8:06:03 589 1.68>
通过以上的日志分析,大致发现客户的DB在故障时间段存在一些问题:
1.内存资源紧张(a.lmd0在进行一些内存释放的操作;b.free> 2.空闲SWAP页面紧张,大量的page in/out 3.严重的shared> 4.实例1的lmd0在9:42-9:44HANG住(STALL),
还未完全理清的时候,客户的DB又出现了HANG住的情况,这次客户做了systemstate> HANG ANALYSIS:
instances (db_name.oracle_sid):>
Chains> Chain 1 Signature Hash: 0xb52ba8a9
[b] Chain 2 Signature: 'latch:> Chain 2 Signature Hash: 0x985d217a
[c] Chain 3 Signature: 'latch:> Chain 3 Signature Hash: 0xb52ba8a9

Chain 1:
-------------------------------------------------------------------------------
Oracle> p2: 'number'=0x101
p3: 'tries'=0x0
time> short stack: wait> time waited: 4.944027 secs p2: 'number'=0x101
p3: 'tries'=0x0
2.> time waited: 0.104395 secs p2: 'number'=0x101
p3: 'tries'=0x0
3.> time waited: 0.079024 secs p2: 'number'=0x101
p3: 'tries'=0x0
}
and> {
instance: 1 (crm.crm1)
os> p2: 'number'=0x115
p3: 'tries'=0x0
time> current sql:
short> time waited: 5.627769 secs p2: 'number'=0x101
p3: 'tries'=0x0
2.> time waited: 0.465190 secs p2: 'number'=0x101
p3: 'tries'=0x0
3.> time waited: 0.082002 secs p2: 'number'=0x101
p3: 'tries'=0x0
}
从DUMP信息看来,这次的情况跟上次类似,大量的latch: shared pool等待。
客户的DB配置情况:
物理内存24G,而MEMORY_TARGET设置为22G,感觉配置的非常不合理,客户的情况跟我之前处理过的一个CASE很像(ORA-609:疑似MEMORY_TARGET设置过大导致的宕机http://blog.csdn.net/zhou1862324/article/details/17288103),都是MEMORY_TARGET参数设置过大导致出现SWAP PAGE IN/OUT的情况,最终导致数据库HANG住或宕机。
之前的CASE发生在另一位客户的一套非常重要的生产库上,数据库屡次宕机客户苦不堪言,而客户接收了我的建议将MEMORY_TARGET调低到一个合理的值之后,类似的问题没有再发生了。
所以,对于这个案例,我给了客户2个建议:
1.减少 memory_target 和 memory_max_target,预留更多内存供OS使用,减少发生SWAP PAGE IN/OUT的可能性。
2.启用hugepages,hugepages本身就是锁定在内存中不能被SWAP的,但Hugepages与memory_target不兼容,所以需要禁用memory_target,设置sga_target和pga_aggregate_target。
关于HUGEPAGE,可以参考我转的一篇文章HugePages on Oracle Linux 64-bit(http://blog.csdn.net/zhou1862324/article/details/17540277)。

解决方法:
最终客户选择了调小memory_target 和 memory_max_target,问题未再出现。
Statement:
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn