首頁  >  文章  >  資料庫  >  11.2.0.3RAC(VCS)节点crash以及hang的问题分析

11.2.0.3RAC(VCS)节点crash以及hang的问题分析

WBOY
WBOY原創
2016-06-07 15:55:502745瀏覽

昨天某个客户的一套双节RAC其中一个节点crash,同时最后导致另外一个节点也hang住,只能shutdown abort. 且出现shutdown abort实例之后,还有部分进程无法通过kill -9 进行kill的情况。其中有lgwr,arch等进程. 首先我们来看下,在下午出现crash的节点的alert

昨天某个客户的一套双节RAC其中一个节点crash,同时最后导致另外一个节点也hang住,只能shutdown abort.
且出现shutdown abort实例之后,还有部分进程无法通过kill -9 进行kill的情况。其中有lgwr,arch等进程.

首先我们来看下,在下午出现crash的节点的alert log信息:

?
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 Tue Apr 22 17:16:04 2014 <code class="php plain">Deleted Oracle managed file /oa1/arch/AUTHORCL/archivelog/2014_03_14/o1_mf_2_4878_9l51y1cc_.arc <code class="php plain">Deleted Oracle managed file /oa1/arch/AUTHORCL/archivelog/2014_03_14/o1_mf_2_4879_9l529hc6_.arc <code class="php plain">Archived Log entry 10847 added <code class="php keyword">for <code class="php plain">thread 1 sequence 5314 ID 0xffffffffae21a60f dest 1: <code class="php plain">Tue Apr 22 17:25:05 2014 <code class="php plain">IPC Send timeout detected. Sender: ospid 27573 [oracle@xhdb-server3 (LMON)] <code class="php plain">Receiver: inst 2 binc 95439 ospid 13752 <code class="php plain">Communications reconfiguration: instance_number 2 <code class="php plain">Tue Apr 22 17:26:49 2014 <code class="php plain">LMON (ospid: 27573) has not called a wait <code class="php keyword">for <code class="php plain">89 secs. <code class="php plain">Errors in file /u01/app/oa_base/diag/rdbms/authorcl/authorcl1/trace/authorcl1_lmhb_27613.trc (incident=14129): <code class="php plain">ORA-29770: <code class="php keyword">global <code class="php plain">enqueue process LMON (OSID 27573) is hung <code class="php keyword">for <code class="php plain">more than 70 seconds <code class="php plain">Incident details in: /u01/app/oa_base/diag/rdbms/authorcl/authorcl1/incident/incdir_14129/authorcl1_lmhb_27613_i14129.trc <code class="php plain">Tue Apr 22 17:26:58 2014 <code class="php plain">Sweep [inc][14129]: completed <code class="php plain">Sweep [inc2][14129]: completed <code class="php plain">ERROR: Some process(s) is not making progress. <code class="php plain">LMHB (ospid: 27613) is terminating the instance. <code class="php plain">Please check LMHB trace file <code class="php keyword">for <code class="php plain">more details. <code class="php plain">Please also check the CPU load, I/O load <code class="php keyword">and <code class="php plain">other system properties <code class="php keyword">for <code class="php plain">anomalous behavior <code class="php plain">ERROR: Some process(s) is not making progress. <code class="php plain">Tue Apr 22 17:26:58 2014 <code class="php plain">System state dump requested by (instance=1, osid=27613 (LMHB)), summary=[abnormal instance termination]. <code class="php plain">LMHB (ospid: 27613): terminating the instance due to error 29770 <code class="php plain">System State dumped to trace file /u01/app/oa_base/diag/rdbms/authorcl/authorcl1/trace/authorcl1_diag_27561.trc <code class="php plain">Tue Apr 22 17:27:00 2014 <code class="php plain">ORA-1092 : opitsk aborting process <code class="php plain">Tue Apr 22 17:27:01 2014 <code class="php plain">License high water mark = 144 <code class="php plain">Tue Apr 22 17:27:08 2014 <code class="php plain">Termination issued to instance processes. Waiting <code class="php keyword">for <code class="php plain">the processes to <code class="php functions">exit <code class="php plain">Instance termination failed to kill one <code class="php keyword">or <code class="php plain">more processes <code class="php plain">Instance terminated by LMHB, pid = 27613 <code class="php plain">Tue Apr 22 17:27:15 2014 <code class="php plain">USER (ospid: 1378): terminating the instance <code class="php plain">Termination issued to instance processes. Waiting <code class="php keyword">for <code class="php plain">the processes to <code class="php functions">exit <code class="php plain">Tue Apr 22 17:27:25 2014 <code class="php plain">Instance termination failed to kill one <code class="php keyword">or <code class="php plain">more processes <code class="php plain">Instance terminated by USER, pid = 1378 <code class="php plain">Tue Apr 22 21:51:56 2014 <code class="php plain">Adjusting the <code class="php keyword">default <code class="php plain">value of parameter parallel_max_servers <code class="php plain">from 640 to 135 due to the value of parameter processes (150) <code class="php plain">Starting ORACLE instance (normal)

我们可以看到,最早在Apr 22 17:25:05 2014 时间点,即抛出LMON IPC send timeout的错误了。

Receiver: inst 2 binc 95439 ospid 13752 这里的receiver进程为节点2的13752进程,即节点2的LMON进程。

对于LMON进程,主要是监控RAC的GES信息,当然其作用不仅仅局限于此,其还负责检查集群中各个Node的健康
情况,当有节点出现故障是,负责进行reconfig以及GRD(global resource Directory)的恢复等等。我们知道
RAC的脑裂机制,如果IO fencing是Oracle本身来完成,也就是说由CLusterware来完成。那么Lmon进程检查到
实例级别出现脑裂时,会通知clusterware来进行脑裂操作,然而其并不会等待clusterware的处理结果。当等待
超过一定时间,那么LMON进程会自动触发IMR(instance membership recovery),这实际上也就是我们所说的
Instance membership reconfig。

其次,lmon进程主要通过2种心跳机制来检查判断集群节点的健康状态:
1) 网络心跳 (主要是通过ping进行检测)
2) 控制文件磁盘心跳,其实就是每个节点的ckpt进程每3s更新一次controlfile的机制。

所以这里大家可以看出,Lmon进程是需要操作controlfile的。否则无法判断第2点。
虽然从上面的错误来看,该实例是被LMHB进程给终止掉的,这里我们需要说明一下,LMBH进程的原理。
LMBH进程是Oracle 11R2版本引入的一个进程,该进程的作用的监控LMD,LMON,LCK,LMS等核心进程,防止这些Oracle

关键性后台进程spin或不被阻塞。该进程会定时的将监控的信息打印输出在trace文件中,便于我们进行诊断,

这也是11gR2一个亮点。当LMBH进程发现其他核心进程出现异常时,会尝试发起一些kill动作,如何有进程被阻塞的话。
如果一定时间内仍然无法解决,那么将触发保护,将实例强行终止掉,当然,这是为了保证RAC节点数据的完整性和一致性。

这里比较郁闷的是,这个diag的dump并没有产没 /u01/app/oa_base/diag/rdbms/authorcl/authorcl1/trace/authorcl1_diag_27561.trc

我们首先来看下Node1的Lmon进程的信息:

?
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 *** ACTION NAME:() 2014-04-22 17:26:56.052 <code class="php plain">*** 2014-04-22 17:26:49.401 <code class="php plain">============================== <code class="php plain">LMON (ospid: 27573) has not moved <code class="php keyword">for <code class="php plain">105 sec (1398158808.1398158703) <code class="php plain">kjfmGCR_HBCheckAll: LMON (ospid: 27573) has status 6 <code class="php plain">================================================== <code class="php plain">=== LMON (ospid: 27573) Heartbeat Report <code class="php plain">================================================== <code class="php plain">LMON (ospid: 27573) has no heartbeats <code class="php keyword">for <code class="php plain">105 sec. (threshold 70 sec) <code class="php spaces"><code class="php plain">: Not in wait; last wait ended 89 secs ago. -------------等待了89秒 <code class="php spaces"><code class="php plain">: last wait_id 165313538 at <code class="php string">'enq: CF - contention'<code class="php plain">. <code class="php plain">============================== <code class="php plain">Dumping PROCESS LMON (ospid: 27573) States <code class="php plain">============================== <code class="php plain">===[ System Load State ]=== <code class="php spaces"><code class="php plain">CPU Total 16 Core 16 Socket 16 <code class="php spaces"><code class="php plain">Load normal: Cur 988 Highmark 20480 (3.85 80.00) <code class="php plain">===[ Latch State ]=== <code class="php spaces"><code class="php plain">Not in Latch Get <code class="php plain">===[ Session State Object ]=== <code class="php spaces"><code class="php plain">---------------------------------------- <code class="php spaces"><code class="php plain">SO: 0x52daba340, type: 4, owner: 0x52f5d8330, flag: INIT/-/-/0x00 <code class="php keyword">if<code class="php plain">: 0x3 c: 0x3 <code class="php spaces"><code class="php plain">proc=0x52f5d8330, name=session, file=ksu.h LINE:12624 ID:, pg=0 <code class="php spaces"><code class="php plain">(session) sid: 1057 ser: 1 trans: 0x0, creator: 0x52f5d8330 <code class="php spaces"><code class="php plain">flags: (0x51) USR/- flags_idl: (0x1) BSY/-/-/-/-/- <code class="php spaces"><code class="php plain">flags2: (0x409) -/-/INC <code class="php spaces"><code class="php plain">DID: , short-term DID: <code class="php plain">txn branch: 0x0 <code class="php spaces"><code class="php plain">oct: 0, prv: 0, sql: 0x0, psql: 0x0, user: 0/SYS <code class="php spaces"><code class="php plain">ksuxds FALSE at location: 0 <code class="php spaces"><code class="php plain">service name: SYS<code class="php variable">$BACKGROUND <code class="php spaces"><code class="php plain">Current Wait Stack: <code class="php spaces"><code class="php plain">Not in wait; last wait ended 1 min 29 sec ago <code class="php plain">There are 1 sessions blocked by this session. <code class="php spaces"><code class="php plain">Dumping one waiter: <code class="php spaces"><code class="php plain">inst: 1, sid: 297, ser: 6347 <code class="php spaces"><code class="php plain">wait event: <code class="php string">'name-service call wait' <code class="php spaces"><code class="php plain">p1: <code class="php string">'waittime'<code class="php plain">=0x32 <code class="php spaces"><code class="php plain">p2: <code class="php string">''<code class="php plain">=0x0 <code class="php spaces"><code class="php plain">p3: <code class="php string">''<code class="php plain">=0x0 <code class="php spaces"><code class="php plain">row_wait_obj#: 4294967295, block#: 0, row#: 0, file# 0 <code class="php spaces"><code class="php plain">min_blocked_time: 0 secs, waiter_cache_ver: 30272 <code class="php spaces"><code class="php plain">Wait State: <code class="php spaces"><code class="php plain">fixed_waits=0 flags=0x20 boundary=0x0/-1 <code class="php spaces"><code class="php plain">Session Wait History: <code class="php spaces"><code class="php plain">elapsed time of 1 min 29 sec since last wait ---LMON进程等待enq: CF - contention,等待了1分29秒,即89秒 <code class="php spaces"><code class="php plain">0: waited <code class="php keyword">for <code class="php string">'enq: CF - contention' <code class="php spaces"><code class="php plain">name|mode=0x43460006, 0=0x0, operation=0x3 <code class="php spaces"><code class="php plain">wait_id=165313538 seq_num=35946 snap_id=1 <code class="php spaces"><code class="php plain">wait times: snap=1.027254 sec, exc=1.027254 sec, total=1.027254 sec <code class="php spaces"><code class="php plain">wait times: max=1.000000 sec <code class="php spaces"><code class="php plain">wait counts: calls=1 os=1 <code class="php spaces"><code class="php plain">occurred after 0.000109 sec of elapsed time <code class="php spaces"><code class="php plain">。。。。。。

如下是该进程的资源使用情况:

?
1 2 3 4 5 6 *** 2014-04-22 17:26:57.229 <code class="php plain">loadavg : 3.94 3.80 3.99 <code class="php plain">swap info: free_mem = 36949.53M rsv = 24548.22M <code class="php plain">alloc = 23576.62M avail = 45643.61M swap_free = 46615.21M <code class="php spaces"><code class="php plain">F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD <code class="php spaces"><code class="php plain">0 O oa 27573 1 6 79 20 ? 799187 Jan 23 ? 1589:29 ora_lmon_authorcl1

我们可以看到,系统在该时间点load并不高,Memory也很充足。

这里有一个问题,该节点LMON进程hung的原因是什么? 从日志分析来看,是由于无法获得enq: CF – contention。

我们知道ckpt 进程会定时更新操作controlfile,且就需要获得该enqueue。 所有这里我大胆的假设,是由于ckpt持有CF的latch

不释放,导致LMON进程无法获得. 根据这一点我搜mos 发现一个bug,可惜该bug说已经在11.2.0.3中fixed了。

Bug 10276173 LMON hang possible while trying to get access to CKPT progress record

该bug描述说,当在进行reconfig时,lmon会尝试去获得ckpt processes record,会等待enq: CF -contention,会导致hung.

根据文档来看,显然这跟我们的实际情况不符。
下面我们来结合Node2的日志进行综合分析:

?
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 Tue Apr 22 17:25:06 2014 <code class="php plain">IPC Send timeout detected. Receiver ospid 13752 [oracle@xhdb-server4 (LMON)] <code class="php plain">Tue Apr 22 17:26:59 2014 <code class="php plain">Dumping diagnostic data in directory=[cdmp_20140422172658], requested by (instance=1, osid=27613 (LMHB)), summary=[abnormal instance termination]. <code class="php plain">Tue Apr 22 17:29:22 2014 <code class="php plain">WARNING: aiowait timed out 1 times <code class="php plain">Tue Apr 22 17:29:53 2014 <code class="php plain">Errors in file /u01/app/oa_base/diag/rdbms/authorcl/authorcl2/trace/authorcl2_dia0_13750.trc (incident=3681): <code class="php plain">ORA-32701: Possible hangs up to hang ID=3 detected <code class="php plain">Incident details in: /u01/app/oa_base/diag/rdbms/authorcl/authorcl2/incident/incdir_3681/authorcl2_dia0_13750_i3681.trc <code class="php plain">Tue Apr 22 17:30:24 2014 <code class="php plain">DIA0 terminating blocker (ospid: 16818 sid: 578 ser#: 39069) of hang with ID = 3 <code class="php spaces"><code class="php plain">requested by master DIA0 process on instance 2 <code class="php spaces"><code class="php plain">Hang Resolution Reason: Automatic hang resolution was performed to free a <code class="php spaces"><code class="php plain">significant number of affected sessions. <code class="php spaces"><code class="php plain">by terminating session sid: 578 ospid: 16818 <code class="php plain">DIA0 successfully terminated session sid:578 ospid:16818 with status 31. <code class="php plain">DIA0 successfully resolved a LOCAL, HIGH confidence hang with ID=3. <code class="php plain">Tue Apr 22 17:30:33 2014 <code class="php plain">LMS2 (ospid: 13764) has detected no messaging activity from instance 1 <code class="php plain">LMS2 (ospid: 13764) issues an IMR to resolve the situation <code class="php plain">Please check LMS2 trace file <code class="php keyword">for <code class="php plain">more detail. <code class="php plain">Tue Apr 22 17:31:48 2014 <code class="php plain">LMD0 (ospid: 13754) has detected no messaging activity from instance 1 <code class="php plain">LMD0 (ospid: 13754) issues an IMR to resolve the situation <code class="php plain">Please check LMD0 trace file <code class="php keyword">for <code class="php plain">more detail. <code class="php plain">Tue Apr 22 17:32:03 2014 <code class="php plain">Errors in file /u01/app/oa_base/diag/rdbms/authorcl/authorcl2/trace/authorcl2_dia0_13750.trc (incident=3682): <code class="php plain">ORA-32701: Possible hangs up to hang ID=3 detected <code class="php plain">Incident details in: /u01/app/oa_base/diag/rdbms/authorcl/authorcl2/incident/incdir_3682/authorcl2_dia0_13750_i3682.trc <code class="php plain">Tue Apr 22 17:32:16 2014 <code class="php plain">IPC Send timeout detected. Sender: ospid 23666 [oracle@xhdb-server4 (TNS V1-V3)] <code class="php plain">Receiver: inst 1 binc 380222 ospid 27575 <code class="php plain">IPC Send timeout to 1.0 inc 20 <code class="php keyword">for <code class="php plain">msg type 12 from opid 154 <code class="php plain">IPC Send timeout: Terminating pid 154 osid 23666 <code class="php plain">Tue Apr 22 17:32:20 2014 <code class="php plain">IPC Send timeout detected. Sender: ospid 13746 [oracle@xhdb-server4 (PING)] <code class="php plain">Receiver: inst 1 binc 380164 ospid 27565 <code class="php plain">Tue Apr 22 17:32:34 2014 <code class="php plain">DIA0 terminating blocker (ospid: 16818 sid: 578 ser#: 39069) of hang with ID = 3 <code class="php spaces"><code class="php plain">requested by master DIA0 process on instance 2 <code class="php spaces"><code class="php plain">Hang Resolution Reason: Automatic hang resolution was performed to free a <code class="php spaces"><code class="php plain">significant number of affected sessions. <code class="php spaces"><code class="php plain">by terminating the process <code class="php plain">DIA0 successfully terminated process ospid:16818. <code class="php plain">DIA0 successfully resolved a LOCAL, HIGH confidence hang with ID=3. <code class="php plain">Tue Apr 22 17:32:35 2014 <code class="php plain">LMS1 (ospid: 13760) has detected no messaging activity from instance 1 <code class="php plain">LMS1 (ospid: 13760) issues an IMR to resolve the situation <code class="php plain">Please check LMS1 trace file <code class="php keyword">for <code class="php plain">more detail. <code class="php plain">Tue Apr 22 17:32:44 2014 <code class="php plain">IPC Send timeout detected. Sender: ospid 13754 [oracle@xhdb-server4 (LMD0)] <code class="php plain">Receiver: inst 1 binc 380222 ospid 27575 <code class="php plain">IPC Send timeout to 1.0 inc 20 <code class="php keyword">for <code class="php plain">msg type 65521 from opid 12 <code class="php plain">Tue Apr 22 17:33:11 2014 <code class="php plain">LMS0 (ospid: 13756) has detected no messaging activity from instance 1 <code class="php plain">LMS0 (ospid: 13756) issues an IMR to resolve the situation <code class="php plain">Please check LMS0 trace file <code class="php keyword">for <code class="php plain">more detail. <code class="php plain">Tue Apr 22 17:34:29 2014 <code class="php plain">IPC Send timeout detected. Sender: ospid 13764 [oracle@xhdb-server4 (LMS2)] <code class="php plain">Receiver: inst 1 binc 380309 ospid 27585 <code class="php plain">IPC Send timeout to 1.1 inc 20 <code class="php keyword">for <code class="php plain">msg type 65522 from opid 15 <code class="php plain">Tue Apr 22 17:36:31 2014

我们可以看到Node2 在Apr 22 17:26:59 2014 节点Node1的LMBH终止instance的信息了。然后在后面抛出hung的信息,
不过Oracle自动解决了hung的session。 下面我们来看下Node2上lmon进程的trace内容:

?
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 *** 2014-04-22 17:26:59.377 <code class="php plain">Process diagnostic dump <code class="php keyword">for <code class="php plain">oracle@xhdb-server4 (LMON), OS id=13752, <code class="php plain">pid: 11, proc_ser: 1, sid: 353, sess_ser: 1 <code class="php plain">------------------------------------------------------------------------------- <code class="php plain">current sql: <none> <code class="php plain">Current Wait Stack: <code class="php spaces"><code class="php plain">0: waiting <code class="php keyword">for <code class="php string">'control file sequential read' <code class="php spaces"><code class="php plain">file#=0x0, block#=0x23, blocks=0x1 <code class="php spaces"><code class="php plain">wait_id=272969233 seq_num=24337 snap_id=1 <code class="php spaces"><code class="php plain">wait times: snap=7 min 42 sec, exc=7 min 42 sec, total=7 min 42 sec ---已经等待了7分42秒 <code class="php spaces"><code class="php plain">wait times: max=infinite, heur=7 min 42 sec <code class="php spaces"><code class="php plain">wait counts: calls=0 os=0 <code class="php spaces"><code class="php plain">in_wait=1 iflags=0x5a0 <code class="php plain">There are 1 sessions blocked by this session. <code class="php plain">Dumping one waiter: <code class="php spaces"><code class="php plain">inst: 2, sid: 1092, ser: 49369 <code class="php spaces"><code class="php plain">wait event: <code class="php string">'name-service call wait' <code class="php spaces"><code class="php plain">p1: <code class="php string">'waittime'<code class="php plain">=0x32 <code class="php spaces"><code class="php plain">p2: <code class="php string">''<code class="php plain">=0x0 <code class="php spaces"><code class="php plain">p3: <code class="php string">''<code class="php plain">=0x0 <code class="php spaces"><code class="php plain">row_wait_obj#: 4294967295, block#: 0, row#: 0, file# 0 <code class="php spaces"><code class="php plain">min_blocked_time: 0 secs, waiter_cache_ver: 6248 <code class="php plain">Wait State: <code class="php spaces"><code class="php plain">fixed_waits=0 flags=0x22 boundary=0x0/-1 <code class="php plain">Session Wait History:

从lmon的trace信息我们可以看出,该进程正在等待control file sequential read,且已经等待了7分42秒。

根据trace的时间点,我们可以向前推进7分42秒,换句话讲,从17:19:18秒就开始等待了。

既然是controlfile的等待,那么我们就有必要来看下Node2节点上的ckpt进程在干什么了? 如下是ckpt进程的信息:

?
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 Redo thread mounted by this instance: 2 <code class="php plain">Oracle process number: 26 <code class="php plain">Unix process pid: 13788, image: oracle@xhdb-server4 (CKPT) <code class="php plain">*** 2014-04-22 17:26:59.882 <code class="php plain">*** SESSION ID:(833.1) 2014-04-22 17:26:59.882 <code class="php plain">*** 2014-04-22 17:26:59.882 <code class="php plain">Process diagnostic dump <code class="php keyword">for <code class="php plain">oracle@xhdb-server4 (CKPT), OS id=13788, <code class="php plain">pid: 26, proc_ser: 1, sid: 833, sess_ser: 1 <code class="php plain">------------------------------------------------------------------------------- <code class="php plain">current sql: <none> <code class="php plain">Current Wait Stack: <code class="php spaces"><code class="php plain">0: waiting <code class="php keyword">for <code class="php string">'control file sequential read' <code class="php spaces"><code class="php plain">file#=0x0, block#=0x1, blocks=0x1 <code class="php spaces"><code class="php plain">wait_id=14858985 seq_num=48092 snap_id=1 <code class="php spaces"><code class="php plain">wait times: snap=7 min 40 sec, exc=7 min 40 sec, total=7 min 40 sec ----等待了7分40秒 <code class="php spaces"><code class="php plain">wait times: max=infinite, heur=7 min 40 sec <code class="php spaces"><code class="php plain">wait counts: calls=0 os=0 <code class="php spaces"><code class="php plain">in_wait=1 iflags=0x5a0 <code class="php plain">There are 2 sessions blocked by this session. <code class="php plain">Dumping one waiter: <code class="php spaces"><code class="php plain">inst: 2, sid: 291, ser: 59157 <code class="php spaces"><code class="php plain">wait event: <code class="php string">'DFS lock handle' <code class="php spaces"><code class="php plain">p1: <code class="php string">'type|mode'<code class="php plain">=0x43490005 <code class="php spaces"><code class="php plain">p2: <code class="php string">'id1'<code class="php plain">=0xa <code class="php spaces"><code class="php plain">p3: <code class="php string">'id2'<code class="php plain">=0x2 <code class="php spaces"><code class="php plain">row_wait_obj#: 4294967295, block#: 0, row#: 0, file# 0 <code class="php spaces"><code class="php plain">min_blocked_time: 352 secs, waiter_cache_ver: 6248 <code class="php plain">Wait State: <code class="php spaces"><code class="php plain">fixed_waits=0 flags=0x22 boundary=0x0/-1

我们可以看到,Node2的ckpt进程等待control file sequential read,等待了7分40秒。同时大家还可以看到,ckpt
进程阻塞了2个进程,也就是说ckpt进程有2个waiter,其中一个waiter的信息是:sid:291,ser:59157
且该waiter进程的等待事件居然是DFS lock handle,这是一个比较危险的event。 这里我们还无法确认这个waiter是什么?
同时ckpt进程为啥等待这么长的时间 ?

大家知道11g引入的hung auto resolution,那么我们就来看下Node1上的diag的信息:

?
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 Unix process pid: 27571, image: oracle@xhdb-server3 (DIA0) <code class="php plain">*** 2014-04-22 17:22:01.536 <code class="php plain">*** SESSION ID:(961.1) 2014-04-22 17:22:01.536 <code class="php plain">*** CLIENT ID:() 2014-04-22 17:22:01.536 <code class="php plain">*** SERVICE NAME:(SYS<code class="php variable">$BACKGROUND<code class="php plain">) 2014-04-22 17:22:01.536 <code class="php plain">*** MODULE NAME:() 2014-04-22 17:22:01.536 <code class="php plain">*** ACTION NAME:() 2014-04-22 17:22:01.536 <code class="php plain">One <code class="php keyword">or <code class="php plain">more possible hangs have been detected on your system. <code class="php plain">These could be genuine hangs in which no further progress will <code class="php plain">be made without intervention, <code class="php keyword">or <code class="php plain">it may be very slow progress <code class="php plain">in the system due to high load. <code class="php plain">Previously detected <code class="php keyword">and <code class="php plain">output hangs are not displayed again. <code class="php plain">Instead, the <code class="php string">'Victim Information' <code class="php plain">section will indicate that <code class="php plain">the victim is from an <code class="php string">'Existing Hang' <code class="php plain">under the <code class="php string">'Previous Hang' <code class="php plain">column. <code class="php string">'Verified Hangs' <code class="php plain">below indicate one <code class="php keyword">or <code class="php plain">more hangs that were found <code class="php keyword">and <code class="php plain"> identify the <code class="php keyword">final <code class="php plain"> blocking session <code class="php keyword">and <code class="php plain"> instance on which <code class="php plain">they occurred. Since the current hang resolution state is <code class="php string">'PROCESS'<code class="php plain">, <code class="php plain">any hangs requiring session <code class="php keyword">or <code class="php plain">process termination will be <code class="php plain">automatically resolved. <code class="php plain">Any hangs with a <code class="php string">'Hang Resolution Action' <code class="php plain"> of <code class="php string">'Unresolvable' <code class="php plain">will be ignored. These types of hangs will either be resolved <code class="php plain">by another layer in the RDBMS <code class="php keyword">or <code class="php plain">cannot be resolved <code class="php keyword">as <code class="php plain">they may <code class="php keyword">require <code class="php plain"> external intervention. <code class="php plain">Deadlocks (also named cycles) are currently NOT resolved even <code class="php keyword">if <code class="php plain">hang resolution is enabled. The <code class="php string">'Hang Type' <code class="php plain">of DLCK in the <code class="php string">'Verified Hangs' <code class="php plain">output identifies these hangs. <code class="php plain">Below that are the complete hang chains from the time the hang <code class="php plain">was detected. <code class="php plain">The following information will assist Oracle Support Services <code class="php plain">in further analysis of the root cause of the hang. <code class="php plain">*** 2014-04-22 17:22:01.537 <code class="php plain">Verified Hangs in the System <code class="php spaces"><code class="php plain">Root Chain Total Hang <code class="php plain">Hang Hang Inst Root #hung #hung Hang Hang Resolution <code class="php plain">ID Type Status Num Sess Sess Sess Conf Span Action <code class="php plain">----- ---- -------- ---- ----- ----- ----- ------ ------ ------------------- <code class="php spaces"><code class="php plain">2 HANG VICSELTD 2 833 2 2 HIGH LOCAL IGNRD:InstKillNotA <code class="php plain">Hang Ignored Reason: Since instance termination is not allowed, automatic <code class="php spaces"><code class="php plain">hang resolution cannot be performed. <code class="php plain">inst# SessId Ser# OSPID PrcNm Event <code class="php spaces"><code class="php plain">----- ------ ----- --------- ----- ----- <code class="php spaces"><code class="php plain">2 291 59157 10646 M000 DFS lock handle ----大家注意这里的sid和ser#以及PrcNm <code class="php spaces"><code class="php plain">2 833 1 13788 CKPT control file sequential read

这里提到M000进程,大家应该知道这是跟AWR快照有关系的进程,该进程其实是被CKPT所阻塞,我们也可以来看下该进程
的情况,如下:

?
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 *** 2014-04-22 17:27:00.778 <code class="php plain">Process diagnostic dump <code class="php keyword">for <code class="php plain">oracle@xhdb-server4 (M000), OS id=10646, <code class="php plain">pid: 57, proc_ser: 143, sid: 291, sess_ser: 59157 <code class="php plain">------------------------------------------------------------------------------- <code class="php plain">current sql: <code class="php plain">select tablespace_id, rfno, allocated_space, file_size, file_maxsize, changescn_base, changescn_wrap, flag, <code class="php plain">inst_id from sys.ts$, GV<code class="php variable">$FILESPACE_USAGE <code class="php plain">where ts# = tablespace_id <code class="php keyword">and <code class="php plain">online$ != 3 <code class="php keyword">and <code class="php plain"> (changescn_wrap > PITRSCNWRP <code class="php keyword">or <code class="php plain"> (changescn_wrap = PITRSCNWRP <code class="php keyword">and <code class="php plain"> changescn_base >= PITRSCNBAS)) <code class="php keyword">and <code class="php plain"> inst_id != :inst <code class="php keyword">and <code class="php plain"> (changescn_wrap > :w <code class="php keyword">or <code class="php plain"> (changescn_wrap = :w <code class="php keyword">and <code class="php plain"> changescn_base >= :b)) <code class="php plain">Current Wait Stack: <code class="php spaces"><code class="php plain">0: waiting <code class="php keyword">for <code class="php string">'DFS lock handle' <code class="php spaces"><code class="php plain">type|mode=0x43490005, id1=0xa, id2=0x2 <code class="php spaces"><code class="php plain">wait_id=6 seq_num=7 snap_id=1 <code class="php spaces"><code class="php plain">wait times: snap=6 min 12 sec, exc=6 min 12 sec, total=6 min 12 sec <code class="php spaces"><code class="php plain">wait times: max=infinite, heur=6 min 12 sec <code class="php spaces"><code class="php plain">wait counts: calls=818 os=818 <code class="php spaces"><code class="php plain">in_wait=1 iflags=0x15a2 <code class="php plain">There is at least one session blocking this session. <code class="php spaces"><code class="php plain">Dumping 2 direct blocker(s): <code class="php spaces"><code class="php plain">inst: 2, sid: 833, ser: 1 <code class="php spaces"><code class="php plain">inst: 1, sid: 482, ser: 1 <code class="php spaces"><code class="php plain">Dumping <code class="php keyword">final <code class="php plain">blocker: <code class="php spaces"><code class="php plain">inst: 2, sid: 833, ser: 1 -----最终的blocker是833,也就是Node2节点的CKPT进程。 <code class="php plain">There are 1 sessions blocked by this session. <code class="php plain">Dumping one waiter: <code class="php spaces"><code class="php plain">inst: 1, sid: 581, ser: 36139 <code class="php spaces"><code class="php plain">wait event: <code class="php string">'DFS lock handle' <code class="php spaces"><code class="php plain">p1: <code class="php string">'type|mode'<code class="php plain">=0x43490005 <code class="php spaces"><code class="php plain">p2: <code class="php string">'id1'<code class="php plain">=0xa <code class="php spaces"><code class="php plain">p3: <code class="php string">'id2'<code class="php plain">=0x5

从这里看,root sess却是833,也就是我们Node2节点的CKPT进程。 到这里或许有人会说,问题的原因
应该很明确了,由于Node2 ckpt的异常,到这Node2 节点Lmon进程异常,由于需要和Node1的Lmon进程
进行通信,导致Node1 的lmon进程出现IPc send timeout的情况。

其实不然,到最后至始至终我们都没有完全弄清楚为何CKPT进程会等待这么长时间 ?

到这里我不得不怀疑IO的问题了,再回过头来分析Node1的diag trace时,发现一个有趣的事情:

?
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 *** 2014-04-22 17:24:08.363 <code class="php plain">Verified Hangs in the System <code class="php spaces"><code class="php plain">Root Chain Total Hang <code class="php plain">Hang Hang Inst Root #hung #hung Hang Hang Resolution <code class="php plain">ID Type Status Num Sess Sess Sess Conf Span Action <code class="php plain">----- ---- -------- ---- ----- ----- ----- ------ ------ ------------------- <code class="php spaces"><code class="php plain">7 HANG VICSELTD 2 801 2 3 HIGH LOCAL IGNRD:InstKillNotA <code class="php plain">Hang Ignored Reason: Since instance termination is not allowed, automatic <code class="php spaces"><code class="php plain">hang resolution cannot be performed. <code class="php plain">inst# SessId Ser# OSPID PrcNm Event <code class="php spaces"><code class="php plain">----- ------ ----- --------- ----- ----- <code class="php spaces"><code class="php plain">2 996 39917 6598 FG log file sync <code class="php spaces"><code class="php plain">2 801 1 13786 LGWR log file parallel write <code class="php plain">。。。。。省略部分内容 <code class="php plain">------------------------------------------------------------------------------- <code class="php plain">Chain 2: <code class="php plain">------------------------------------------------------------------------------- <code class="php spaces"><code class="php plain">Oracle session identified by: <code class="php spaces"><code class="php plain">{ <code class="php spaces"><code class="php plain">instance: 1 (authorcl.authorcl1) <code class="php spaces"><code class="php plain">os id: 22323 <code class="php spaces"><code class="php plain">process id: 64, oracle@xhdb-server3 <code class="php spaces"><code class="php plain">session id: 14 <code class="php spaces"><code class="php plain">session serial #: 29257 <code class="php spaces"><code class="php plain">} <code class="php spaces"><code class="php plain">is waiting <code class="php keyword">for <code class="php string">'log file sync' <code class="php plain">with wait info: <code class="php spaces"><code class="php plain">{ <code class="php spaces"><code class="php plain">p1: <code class="php string">'buffer#'<code class="php plain">=0x2ab <code class="php spaces"><code class="php plain">p2: <code class="php string">'sync scn'<code class="php plain">=0x1ed4d8be <code class="php spaces"><code class="php plain">time in wait: 4 min 18 sec <code class="php spaces"><code class="php plain">timeout after: never <code class="php spaces"><code class="php plain">wait id: 13287 <code class="php spaces"><code class="php plain">blocking: 0 sessions <code class="php spaces"><code class="php plain">wait history: <code class="php spaces"><code class="php plain">* time between current wait <code class="php keyword">and <code class="php plain">wait #1: 0.000388 sec <code class="php spaces"><code class="php plain">1. event: <code class="php string">'SQL*Net message from client' <code class="php spaces"><code class="php plain">time waited: 0.000486 sec <code class="php spaces"><code class="php plain">wait id: 13286 p1: <code class="php string">'driver id'<code class="php plain">=0x54435000 <code class="php spaces"><code class="php plain">p2: <code class="php string">'#bytes'<code class="php plain">=0x1 <code class="php spaces"><code class="php plain">* time between wait #1 <code class="php keyword">and <code class="php plain">#2: 0.000027 sec <code class="php spaces"><code class="php plain">2. event: <code class="php string">'SQL*Net message to client' <code class="php spaces"><code class="php plain">time waited: 0.000003 sec <code class="php spaces"><code class="php plain">wait id: 13285 p1: <code class="php string">'driver id'<code class="php plain">=0x54435000 <code class="php spaces"><code class="php plain">p2: <code class="php string">'#bytes'<code class="php plain">=0x1 <code class="php spaces"><code class="php plain">* time between wait #2 <code class="php keyword">and <code class="php plain">#3: 0.001494 sec <code class="php spaces"><code class="php plain">3. event: <code class="php string">'SQL*Net message from client' <code class="php spaces"><code class="php plain">time waited: 0.002089 sec <code class="php spaces"><code class="php plain">wait id: 13284 p1: <code class="php string">'driver id'<code class="php plain">=0x54435000 <code class="php spaces"><code class="php plain">p2: <code class="php string">'#bytes'<code class="php plain">=0x1 <code class="php spaces"><code class="php plain">} <code class="php spaces"><code class="php keyword">and <code class="php plain">is blocked by <code class="php spaces"><code class="php plain">=> Oracle session identified by: <code class="php spaces"><code class="php plain">{ <code class="php spaces"><code class="php plain">instance: 1 (authorcl.authorcl1) <code class="php spaces"><code class="php plain">os id: 27634 <code class="php spaces"><code class="php plain">process id: 20, oracle@xhdb-server3 (LGWR) <code class="php spaces"><code class="php plain">session id: 386 <code class="php spaces"><code class="php plain">session serial #: 1 <code class="php spaces"><code class="php plain">} <code class="php spaces"><code class="php plain">which is waiting <code class="php keyword">for <code class="php string">'log file parallel write' <code class="php plain">with wait info: <code class="php spaces"><code class="php plain">{ <code class="php spaces"><code class="php plain">p1: <code class="php string">'files'<code class="php plain">=0x1 <code class="php spaces"><code class="php plain">p2: <code class="php string">'blocks'<code class="php plain">=0x2 <code class="php spaces"><code class="php plain">p3: <code class="php string">'requests'<code class="php plain">=0x1 <code class="php spaces"><code class="php plain">time in wait: 4 min 32 sec -----等待了4分32秒 <code class="php spaces"><code class="php plain">timeout after: never <code class="php spaces"><code class="php plain">wait id: 51742574 <code class="php spaces"><code class="php plain">blocking: 17 sessions ------阻塞了17个Session <code class="php spaces"><code class="php plain">wait history: <code class="php spaces"><code class="php plain">* time between current wait <code class="php keyword">and <code class="php plain">wait #1: 0.000176 sec <code class="php spaces"><code class="php plain">1. event: <code class="php string">'rdbms ipc message' <code class="php spaces"><code class="php plain">time waited: 0.047194 sec <code class="php spaces"><code class="php plain">wait id: 51742573 p1: <code class="php string">'timeout'<code class="php plain">=0x22 <code class="php spaces"><code class="php plain">* time between wait #1 <code class="php keyword">and <code class="php plain">#2: 0.000399 sec <code class="php spaces"><code class="php plain">2. event: <code class="php string">'log file parallel write' <code class="php spaces"><code class="php plain">time waited: 0.004006 sec <code class="php spaces"><code class="php plain">wait id: 51742572 p1: <code class="php string">'files'<code class="php plain">=0x1 <code class="php spaces"><code class="php plain">p2: <code class="php string">'blocks'<code class="php plain">=0x2 <code class="php spaces"><code class="php plain">p3: <code class="php string">'requests'<code class="php plain">=0x1 <code class="php spaces"><code class="php plain">* time between wait #2 <code class="php keyword">and <code class="php plain">#3: 0.000318 sec <code class="php spaces"><code class="php plain">3. event: <code class="php string">'rdbms ipc message' <code class="php spaces"><code class="php plain">time waited: 2.654606 sec <code class="php spaces"><code class="php plain">wait id: 51742571 p1: <code class="php string">'timeout'<code class="php plain">=0x12c <code class="php spaces"><code class="php plain">} <code class="php plain">Chain 2 Signature: <code class="php string">'log file parallel write'<code class="php plain">'log file sync'

我们可以看到,Node1上,lgwr进程阻塞了17个会话,等待log file parallel write,一直持续了4分32秒。

如果将时间点2014-04-22 17:24:08,向前推进4分32秒,那么就是2014-04-22 17:19:36.

我们再来检查Node2的操作系统日志你会发现有意思的事情:

?
1 2 3 4 5 6 7 8 9 10 11 Apr 22 17:19:16 xhdb-server4 vxio: [ID 442312 kern.notice] NOTICE: VxVM VVR vxio V-5-0-209 Log overflow protection on rlink oa_zs_bj triggered DCM protection <code class="php plain">Apr 22 17:19:16 xhdb-server4 vxio: [ID 636438 kern.warning] WARNING: VxVM VVR vxio V-5-0-1905 Replication stopped <code class="php keyword">for <code class="php plain">RVG rvg_oa due to SRL overflow, DCM protection is triggered. To start replication, perform DCM resynchronization using <code class="php string">"vradmin resync" <code class="php plain">command. <code class="php plain">Apr 22 19:41:19 xhdb-server4 su: [ID 810491 auth.crit] <code class="php string">'su root' <code class="php plain">failed <code class="php keyword">for <code class="php plain">oa on /dev/pts/5 <code class="php plain">Apr 22 21:31:38 xhdb-server4 su: [ID 810491 auth.crit] <code class="php string">'su grid' <code class="php plain">failed <code class="php keyword">for <code class="php plain">root on /dev/pts/14 <code class="php plain">Apr 22 21:36:03 xhdb-server4 AgentFramework[5814]: [ID 702911 daemon.notice] VCS ERROR V-16-2-13067 Thread(4) Agent is calling clean <code class="php keyword">for <code class="php plain">resource(cssd_oaora) because the resource became OFFLINE unexpectedly, on its own. <code class="php plain">Apr 22 21:36:03 xhdb-server4 Had[5704]: [ID 702911 daemon.notice] VCS ERROR V-16-2-13067 (xhdb-server4) Agent is calling clean <code class="php keyword">for <code class="php plain">resource(cssd_oaora) because the resource became OFFLINE unexpectedly, on its own. <code class="php plain">Apr 22 21:36:06 xhdb-server4 AgentFramework[5814]: [ID 702911 daemon.notice] VCS ERROR V-16-2-13068 Thread(4) Resource(cssd_oaora) - clean completed successfully. <code class="php plain">Apr 22 21:40:51 xhdb-server4 reboot: [ID 662345 auth.crit] rebooted by root <code class="php plain">Apr 22 21:41:14 xhdb-server4 amf: [ID 451996 kern.notice] AMF NOTICE V-292-1-67 Signal received <code class="php keyword">while <code class="php plain">waiting <code class="php keyword">for <code class="php plain">event on reaper <code class="php string">'CFSMount'<code class="php plain">. Returning. <code class="php plain">Apr 22 21:41:14 xhdb-server4 amf: [ID 451996 kern.notice] AMF NOTICE V-292-1-67 Signal received <code class="php keyword">while <code class="php plain">waiting <code class="php keyword">for <code class="php plain">event on reaper <code class="php string">'CFSfsckd'<code class="php plain">. Returning. <code class="php plain">Apr 23 00:31:59 xhdb-server4 genunix: [ID 540533 kern.notice] ^MSunOS Release 5.10 Version Generic_147440-25 64-bit

我们可以看到,在17:19:16秒时vertias的VVR出现了异常。这也就是为什么我们在前面分析看到Node2在17:19:18时出现control file sequential read
等待的原因。虽然从vcs的日志什么信息:

?
1 2 3 4 5 6 2014/04/22 14:39:12 VCS INFO V-16-1-53504 VCS Engine Alive message!! <code class="php plain">2014/04/22 18:39:13 VCS INFO V-16-1-53504 VCS Engine Alive message!! <code class="php plain">2014/04/22 21:36:03 VCS ERROR V-16-2-13067 (xhdb-server4) Agent is calling clean <code class="php keyword">for <code class="php plain">resource(cssd_oaora) because the resource became OFFLINE unexpectedly, on its own. <code class="php plain">2014/04/22 21:36:06 VCS INFO V-16-2-13068 (xhdb-server4) Resource(cssd_oaora) - clean completed successfully. <code class="php plain">2014/04/22 21:36:09 VCS INFO V-16-1-10307 Resource cssd_oaora (Owner: Unspecified, Group: sg_oaora) is offline on xhdb-server4 (Not initiated by VCS) <code class="php plain">2014/04/22 21:36:09 VCS NOTICE V-16-1-10446 Group sg_oaora is offline on system xhdb-server4

所以,最后我的感觉是根本原因是vcs的问题。虽然vertias的工程师一直解释这里的Log overflow protection没有太大的关系。

针对这个问题,欢迎大家探讨。

补充:关于ora-29770导致实例crash的问题,Oracle确实有不少的bug,但是针对这个情况,目前没有发现符合的。如下是来自Oracle MOS的搜索结果:

Bug 11890804:LMHB crashes instance with ORA-29770 after long “control file sequential read” waits
Bug 8888434: LMHB crashes the instance with LMON waiting on controlfile read
Bug 11890804: LMHB TERMINATE INSTANCE WHEN LMON WAIT CHANGE FROM CF READ AFTER 60 SEC
Bug 13467673: CSS MISSCOUNT AND ALL ASM DOWN WITH ORA-29770 BY LMHB
Bug 13390052: KJFMGCR_HBCHECKALL MESSAGES ARE CONTINUOUSLY LOGGED IN LMHB TRACE FILE.
Bug 13322797: LMHB TERMINATES THE INSTANCE DUE TO ERROR 29770
Bug 13061883: LMHB IS TERMINATING THE INSTANCE DURING SHUTDOWN IMMEDIATE
Bug 12886605: ESSC: LMHB TERMINATE INSTANCE DUE TO 29770 – LMON WAIT ENQ: AM – DISK OFFLINE
Bug 12757321: LMHB TERMINATING THE INSTANCE DUE TO ERROR 29770
Bug 10296263: LMHB (OSPID: 15872): TERMINATING THE INSTANCE DUE TO ERROR 29770
Bug 10431752: SINGLE NODE RAC: LMHB TERMINATES INSTANCE DUE TO 29770
Bug 11656856: LMHB (OSPID: 27701): TERMINATING THE INSTANCE DUE TO ERROR 29770
Bug 10411143: INSTANCE CRASHES WITH IPC SEND TIMEOUT AND LMHB TERMINATES WITH ORA-29770
Bug 11704041: DATABASE INSTANCE CRASH BY LMHB PROCESS
Bug 10412545: ORA-29770 LMHB TERMINATE INSTANCE DUE TO VARIOUS LONG CSS WAIT
Bug 10147827: INSTANCE TERMINATED BY LMHB WITH ERROR ORA-29770
Bug 10016974: ORA-29770 LMD IS HUNG FOR MORE THAN 70 SECONDS AND LMHB TERMINATE INSTANCE
Bug 9376100: LMHB TERMINATING INSTANCE DUE ERROR 29770

陳述:
本文內容由網友自願投稿,版權歸原作者所有。本站不承擔相應的法律責任。如發現涉嫌抄襲或侵權的內容,請聯絡admin@php.cn