Heim  >  Artikel  >  Datenbank  >  11203RAC(asm)恢复一例

11203RAC(asm)恢复一例

WBOY
WBOYOriginal
2016-06-07 15:56:411229Durchsuche

前天某客户的11203 rac(asm)出现掉电,导致数据库无法启动,注意数据库是归档模式。可见是多么倒霉。据同事说开始是由于发redo和undo损坏导致无法启动,部分信息如下:? 1 2 3 4 5 6 7 8 9 Thu May 08 20:51:07 2014 Dumping diagnostic data in directory=[c

前天某客户的11203 rac(asm)出现掉电,导致数据库无法启动,注意数据库是归档模式。可见是多么倒霉。据同事说开始是由于发redo和undo损坏导致无法启动,部分信息如下: ?
1 2 3 4 5 6 7 8 9 Thu May 08 20:51:07 2014 <code class="php spaces"><code class="php plain">Dumping diagnostic data in directory=[cdmp_20140508205107], requested by (instance=1, osid=13828272), summary=[incident=77085]. <code class="php spaces"><code class="php plain">Abort recovery <code class="php keyword">for <code class="php plain">domain 0 <code class="php spaces"><code class="php plain">Aborting crash recovery due to error 354 <code class="php spaces"><code class="php plain">Errors in file /oracle/db/diag/rdbms/hiatmpdb/hiatmpdb1/trace/hiatmpdb1_ora_13828272.trc: <code class="php spaces"><code class="php plain">ORA-00354: corrupt redo log block header <code class="php spaces"><code class="php plain">ORA-00353: log corruption near block 67856 change 13820540000932 time 05/08/2014 13:12:44 <code class="php spaces"><code class="php plain">ORA-00312: online log 3 thread 2: <code class="php string">'+DATA/hiatmpdb/onlinelog/group_3.269.830653613' <code class="php spaces"><code class="php plain">ORA-00312: online log 3 thread 2: <code class="php string">'+DATA/hiatmpdb/onlinelog/group_3.268.830653613'

他做了一些recover database until cancel操作。甚至还使用了隐含参数,但是仍然无法open数据库,如下: SQL> alter system set “_allow_resetlogs_corruption”=true scope=spfile ; SQL> alter system set “_allow_error_simulation”=true scope=spfile ; open数据库时报undo存坏块,如下: \ 可以看到,在使用隐含参数进行open都仍然报undo存在坏块。本来我想进行不完全恢复,发现后面执行recover database using backup controlfile until cancel 居然报ora-16433错误,很明显,同事之前做过resetlogs了,解决这个错误只能重建undo,本想通过如下方式来重建controlfile的,发现居然不行: oradebug setmypid alter database backup controlfile to trace; 居然错误我记不住了。既然是报undo错误,那么首先的想法就是吧该undo坏块涉及的回滚段进行屏蔽。通过10046 event可以定位到问题回滚段,但是,11g的回滚段格式发生了变化,仅仅是这样还不够的,如下是 10046 event的跟踪信息: ?

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 ===================== <code class="plain spaces"><code class="plain plain">PARSING IN CURSOR #4574130432 len=142 dep=1 uid=0 oct=3 lid=0 tim=237352129855 hv=361892850 ad='700000160cd5178' sqlid='7bd391hat42zk' <code class="plain spaces"><code class="plain plain">select /*+ rule */ name,file#,block#,status$,user#,undosqn,xactsqn,scnbas,scnwrp,DECODE(inst#,0,NULL,inst#),ts#,spare1 from undo$ where us#=:1 <code class="plain spaces"><code class="plain plain">END OF STMT <code class="plain spaces"><code class="plain plain">PARSE #4574130432:c=9,e=14,p=0,cr=0,cu=0,mis=0,r=0,dep=1,og=3,plh=4258302260,tim=237352129854 <code class="plain spaces"><code class="plain plain">BINDS #4574130432: <code class="plain spaces"><code class="plain plain">Bind#0 <code class="plain spaces"><code class="plain plain">oacdty=02 mxl=22(22) mxlc=00 mal=00 scl=00 pre=00 <code class="plain spaces"><code class="plain plain">oacflg=08 fl2=0001 frm=00 csi=00 siz=24 off=0 <code class="plain spaces"><code class="plain plain">kxsbbbfp=110a3acb8 bln=22 avl=02 flg=05 <code class="plain spaces"><code class="plain plain">value=3 <code class="plain spaces"><code class="plain plain">EXEC #4574130432:c=47,e=88,p=0,cr=0,cu=0,mis=0,r=0,dep=1,og=3,plh=4258302260,tim=237352130029 <code class="plain spaces"><code class="plain plain">FETCH #4574130432:c=8,e=13,p=0,cr=2,cu=0,mis=0,r=1,dep=1,og=3,plh=4258302260,tim=237352130065 <code class="plain spaces"><code class="plain plain">STAT #4574130432 id=1 cnt=1 pid=0 pos=1 obj=15 op='TABLE ACCESS BY INDEX ROWID UNDO$ (cr=2 pr=0 pw=0 time=10 us)' <code class="plain spaces"><code class="plain plain">STAT #4574130432 id=2 cnt=1 pid=1 pos=1 obj=34 op='INDEX UNIQUE SCAN I_UNDO1 (cr=1 pr=0 pw=0 time=5 us)' <code class="plain spaces"><code class="plain plain">CLOSE #4574130432:c=3,e=5,dep=1,type=1,tim=237352130125 <code class="plain spaces"><code class="plain plain">PARSE #4574130432:c=5,e=9,p=0,cr=0,cu=0,mis=0,r=0,dep=1,og=3,plh=4258302260,tim=237352130158 <code class="plain spaces"><code class="plain plain">BINDS #4574130432: <code class="plain spaces"><code class="plain plain">Bind#0 <code class="plain spaces"><code class="plain plain">oacdty=02 mxl=22(22) mxlc=00 mal=00 scl=00 pre=00 <code class="plain spaces"><code class="plain plain">oacflg=08 fl2=0001 frm=00 csi=00 siz=24 off=0 <code class="plain spaces"><code class="plain plain">kxsbbbfp=110a3ab88 bln=22 avl=02 flg=05 <code class="plain spaces"><code class="plain plain">value=4 <code class="plain spaces"><code class="plain plain">EXEC #4574130432:c=44,e=71,p=0,cr=0,cu=0,mis=0,r=0,dep=1,og=3,plh=4258302260,tim=237352130278 <code class="plain spaces"><code class="plain plain">FETCH #4574130432:c=7,e=12,p=0,cr=2,cu=0,mis=0,r=1,dep=1,og=3,plh=4258302260,tim=237352130308 <code class="plain spaces"><code class="plain plain">CLOSE #4574130432:c=2,e=3,dep=1,type=3,tim=237352130335 <code class="plain spaces"><code class="plain plain">WAIT #4573319128: nam='db file sequential read' ela= 6947 file#=3 block#=176 blocks=1 obj#=0 tim=237352137334 <code class="plain spaces"><code class="plain plain">DDE rules only execution for: ORA 1110 <code class="plain spaces"><code class="plain plain">----- START Event Driven Actions Dump ---- <code class="plain spaces"><code class="plain plain">---- END Event Driven Actions Dump ---- <code class="plain spaces"><code class="plain plain">----- START DDE Actions Dump ----- <code class="plain spaces"><code class="plain plain">Executing SYNC actions <code class="plain spaces"><code class="plain plain">----- START DDE Action: 'DB_STRUCTURE_INTEGRITY_CHECK' (Async) ----- <code class="plain spaces"><code class="plain plain">Successfully dispatched <code class="plain spaces"><code class="plain plain">----- END DDE Action: 'DB_STRUCTURE_INTEGRITY_CHECK' (SUCCESS, 0 csec) ----- <code class="plain spaces"><code class="plain plain">Executing ASYNC actions <code class="plain spaces"><code class="plain plain">----- END DDE Actions Dump (total 0 csec) ----- <code class="plain spaces"><code class="plain plain">WAIT #4573319128: nam='control file sequential read' ela= 258 file#=0 block#=1 blocks=1 obj#=0 tim=237352138057 <code class="plain spaces"><code class="plain plain">WAIT #4573319128: nam='control file sequential read' ela= 205 file#=1 block#=1 blocks=1 obj#=0 tim=237352138319 <code class="plain spaces"><code class="plain plain">WAIT #4573319128: nam='control file sequential read' ela= 190 file#=0 block#=40 blocks=1 obj#=0 tim=237352138539 <code class="plain spaces"><code class="plain plain">WAIT #4573319128: nam='control file sequential read' ela= 251 file#=0 block#=42 blocks=1 obj#=0 tim=237352138818 <code class="plain spaces"><code class="plain plain">WAIT #4573319128: nam='control file sequential read' ela= 192 file#=0 block#=48 blocks=1 obj#=0 tim=237352139044 <code class="plain spaces"><code class="plain plain">WAIT #4573319128: nam='control file sequential read' ela= 255 file#=0 block#=113 blocks=1 obj#=0 tim=237352139328 <code class="plain spaces"><code class="plain plain">WAIT #4573319128: nam='KSV master wait' ela= 1 p1=0 p2=0 p3=0 obj#=0 tim=237352139400 <code class="plain spaces"><code class="plain plain">Byte offset to file# 3 block# 176 is unknown <code class="plain spaces"><code class="plain plain">Incident 115456 created, dump file: /oracle/db/diag/rdbms/hiatmpdb/hiatmpdb1/incident/incdir_115456/hiatmpdb1_ora_12583082_i115456.trc <code class="plain spaces"><code class="plain plain">ORA-01578: ORACLE data block corrupted (file # 3, block # 176) <code class="plain spaces"><code class="plain plain">ORA-01110: data file 3: '+DATA/hiatmpdb/datafile/undotbs1.264.830644315' <code class="plain plain">ORA-01578: ORACLE data block corrupted (file # 3, block # 176) <code class="plain spaces"><code class="plain plain">ORA-01110: data file 3: '+DATA/hiatmpdb/datafile/undotbs1.264.830644315' <code class="plain spaces"><code class="plain plain">ORA-01578: ORACLE data block corrupted (file # 3, block # 176) <code class="plain spaces"><code class="plain plain">ORA-01110: data file 3: '+DATA/hiatmpdb/datafile/undotbs1.264.830644315'
我们可以看到,在访问回滚段4的时候报错了,但是无法获得回滚段的时间戳。Oracle 11g中的回滚段名称的格式如下: _SYSSMUx_时间戳. 实际上,回滚段的信息都存在undo$基表中,我们只需要获得该基表的数据即可。11g中该基表的数据在file 1 block 225block中。因此只需要dd该block,然后strings+grep就行了。最后利用隐含参数_offline_rollback_segments=(_SYSSMUx$) 和_corrupted_rollback_segments=(_SYSSMUx$) 来屏蔽,讲数据库open。 open之后发现想drop问题回滚段居然报错,既然能open也就能够查询dba_rollback_segs试图了,最后发现还有部分回滚段状态也是异常的,因此通过类似这一点 方法来drop 回滚段: ?
1 2 3 4 5 alter session set <code class="php string">"_smu_debug_mode" <code class="php plain">= 4; <code class="php spaces"><code class="php plain">alter rollback segment <code class="php string">"_SYSSMU3_83481414$" <code class="php plain">offline; <code class="php spaces"><code class="php plain">drop rollback segment <code class="php string">"_SYSSMU3_83481414$" <code class="php plain">; <code class="php spaces"><code class="php plain">alter rollback segment <code class="php string">"_SYSSMU4_2115859630$" <code class="php plain">offline; <code class="php spaces"><code class="php plain">drop rollback segment <code class="php string">"_SYSSMU4_2115859630$" <code class="php plain">;
注意,这里如果不这样做的话,无法清理回滚段,你想切换undo 表空间也会报错的。这个问题搞完后,最后发现一个数据文件的坏块,这个坏块折腾了我很长时间,非常奇怪: ?
1 2 3 4 5 6 7 8 9 10 11 continued from file: /oracle/db/diag/rdbms/hiatmpdb/hiatmpdb2/trace/hiatmpdb2_ora_13959382.trc <code class="plain spaces"><code class="plain plain">ORA-01578: ORACLE 数据块损坏 (文件号 97, 块号 373505) <code class="plain spaces"><code class="plain plain">ORA-01110: 数据文件 97: '+DATA/hiatmpdb/datafile/hiatmpts_in06.dbf' <code class="plain plain">========= Dump for incident 197185 (ORA 1578) ======== <code class="plain plain">*** 2014-05-11 16:38:55.665 <code class="plain spaces"><code class="plain plain">dbkedDefDump(): Starting incident default dumps (flags=0x2, level=3, mask=0x0) <code class="plain spaces"><code class="plain plain">----- Current SQL Statement for this session (sql_id=47bt6vfv19g6z) ----- <code class="plain spaces"><code class="plain plain">select t.nid ,t.cpic1path,t.cpic2path from HIATMP.DETECT_SPEED_DATA_EHL_EXTEND t where t.cdevicecode like '%44900100000001%' and row <code class="plain spaces"><code class="plain plain">num
可以看到,97号文件存在一个坏块,我dump了一下该block,发现比较怪: ?
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ** 2014-05-11 18:51:21.074 <code class="plain spaces"><code class="plain plain">Start dump data blocks tsn: 6 file#:97 minblk 373505 maxblk 373505 <code class="plain spaces"><code class="plain plain">Block dump from cache: <code class="plain spaces"><code class="plain plain">Dump of buffer cache at level 4 for tsn=6 rdba=407220993 <code class="plain spaces"><code class="plain plain">BH (0x70000012ef08528) file#: 97 rdba: 0x1845b301 (97/373505) class: 8 ba: 0x70000012e5fe000 <code class="plain spaces"><code class="plain plain">set: 33 pool: 3 bsz: 8192 bsi: 0 sflg: 2 pwc: 1575,18 <code class="plain spaces"><code class="plain plain">dbwrid: 0 obj: 90762 objn: 90762 tsn: 6 afn: 97 hint: f <code class="plain spaces"><code class="plain plain">hash: [0x700000157972b00,0x700000157972b00] lru: [0x70000012ef09348,0x70000012ef091d0] <code class="plain spaces"><code class="plain plain">ckptq: [NULL] fileq: [NULL] objq: [0x70000012ef087b0,0x70000014153f8a0] objaq: [0x70000012ef087c0,0x70000014153f890] <code class="plain spaces"><code class="plain plain">st: SCURRENT md: NULL fpin: 'ktspfwh13: ktspGetNextL1ForScan' tch: 4 le: 0x700000043fd8780 <code class="plain spaces"><code class="plain plain">flags: auto_bmr_tried <code class="plain spaces"><code class="plain plain">LRBA: [0x0.0.0] LSCN: [0x0.0] HSCN: [0xffff.ffffffff] HSUB: [65535] <code class="plain spaces"><code class="plain plain">Block dump from disk: <code class="plain spaces"><code class="plain plain">buffer tsn: 6 rdba: 0x1845b301 (97/373505) <code class="plain spaces"><code class="plain plain">scn: 0x0c91.d8604ed8 seq: 0xff flg: 0x04 tail: 0x4ed845ff <code class="plain spaces"><code class="plain plain">frmt: 0x02 chkval: 0xcafb type: 0x45=NGLOB: Lob Extent Header <code class="plain spaces"><code class="plain plain">Hex dump of block: st=0, typ_found=1 <code class="plain spaces"><code class="plain plain">Dump of memory from 0x0000000110AEA800 to 0x0000000110AEC800 <code class="plain spaces"><code class="plain plain">110AEA800 45A20000 1845B301 D8604ED8 0C91FF04 [E....E...`N.....] <code class="plain spaces"><code class="plain plain">110AEA810 CAFB0000 B9138F29 9DC40000 05B6CCDA [.......)........]
大家可以看下这个坏块的type,居然是0×45,说这是一个lob extent header block。开始我还以为这个表存在lob字段,最后desc看了下表结构,根本没有lob字段。可见这个block是写乱了。大家知道处理坏块的方法无法就是10231 event,dbms_repair,以及dbms_rowid来处理。当时试了几种方法均不行。其中10231 event和dbms_repair本质上差不多,都是标记坏块,跳过多块读。而dbms_rowid则是根据坏块获取rowid,然后根据rowid来抢救数据。居然也不行,比较怪。最后我干脆创建一个空间,分配到该文件,然后delete掉数据,然后dd一个空块,修改掉rdba和obj id,然后直接dd替换。后记:后面让同事全库检查,还发现了20来个数据坏块,不过大多是Index,处理相对简单,我就不参与了。
Stellungnahme:
Der Inhalt dieses Artikels wird freiwillig von Internetnutzern beigesteuert und das Urheberrecht liegt beim ursprünglichen Autor. Diese Website übernimmt keine entsprechende rechtliche Verantwortung. Wenn Sie Inhalte finden, bei denen der Verdacht eines Plagiats oder einer Rechtsverletzung besteht, wenden Sie sich bitte an admin@php.cn