
Notes on Recovering a 1.4 TB ASM (RAC) Database After Disk Corruption

WBOY
2016-06-07 15:29:25


I spent two days this week helping a customer recover a 10.2.0.5 RAC (ASM) database of nearly 1.4 TB. The database crashed outright on March 4.

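As the log shows, the database first reported errors on the redo log and the control file. The root cause was that the diskgroup had been dismounted. The details are as follows: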

    Tue Mar 04 18:09:59 CST 2014
    Errors in file /home/oraprod/10.2.0/db/admin/xxxx/bdump/xxxx_lgwr_15943.trc:
    ORA-00345: redo log write error block 68145 count 5
    ORA-00312: online log 6 thread 2: '+DATA/xxxx/onlinelog/o2_t2_redo3.log'
    ORA-15078: ASM diskgroup was forcibly dismounted
    Tue Mar 04 18:09:59 CST 2014
    SUCCESS: diskgroup DATA was dismounted
    SUCCESS: diskgroup DATA was dismounted
    Tue Mar 04 18:10:00 CST 2014
    Errors in file /home/oraprod/10.2.0/db/admin/xxxx/bdump/xxxx_lmon_15892.trc:
    ORA-00202: control file: '+DATA/xxxx/controlfile/o1_mf_4g1zr1yo_.ctl'
    ORA-15078: ASM diskgroup was forcibly dismounted
    Tue Mar 04 18:10:00 CST 2014
    KCF: write/open error block=0x1f41e online=1
         file=31 +DATA/xxxx/datafile/apps_ts_queues.310.692585175
         error=15078 txt: ''
    Tue Mar 04 18:10:00 CST 2014
    KCF: write/open error block=0x47d5d online=1
         file=51 +DATA/xxx/datafile/apps_ts_tx_data.353.692593409
         error=15078 txt: ''
    Tue Mar 04 18:10:00 CST 2014
    Errors in file /home/oraprod/10.2.0/db/admin/xxxx/bdump/xxxx_dbw2_15939.trc:
    ORA-00202: control file: '+DATA/prod/controlfile/o1_mf_4g1zr1yo_.ctl'
    ORA-15078: ASM diskgroup was forcibly dismounted
    Tue Mar 04 18:10:00 CST 2014
    KCF: write/open error block=0x47d5b online=1
         file=51 +DATA/prod/datafile/apps_ts_tx_data.353.692593409
         error=15078 txt: ''
    Tue Mar 04 18:10:00 CST 2014

With the database instance down, let's look at the alert log of the ASM instance:

    Tue Mar 04 18:10:04 CST 2014
    NOTE: SMON starting instance recovery for group 1 (mounted)
    Tue Mar 04 18:10:04 CST 2014
    WARNING: IO Failed. au:0 diskname:/dev/raw/raw5
             rq:0x200000000207b518 buffer:0x200000000235c600 au_offset(bytes):0 iosz:4096 operation:0
             status:2
    WARNING: IO Failed. au:0 diskname:/dev/raw/raw5
             rq:0x200000000207b518 buffer:0x200000000235c600 au_offset(bytes):0 iosz:4096 operation:0
             status:2
    NOTE: F1X0 found on disk 0 fcn 0.160230519
    WARNING: IO Failed. au:33 diskname:/dev/raw/raw5
             rq:0x60000000002d64f0 buffer:0x400405df000 au_offset(bytes):0 iosz:4096 operation:0
             status:2
    WARNING: cache failed to read gn 1 fn 3 blk 10752 count 1 from disk 2
    ERROR: cache failed to read fn=3 blk=10752 from disk(s): 2
    ORA-15081: failed to submit an I/O operation to a disk
    NOTE: cache initiating offline of disk 2 group 1
    WARNING: process 12863 initiating offline of disk 2.2526420198 (DATA_0002) with mask 0x3 in group 1
    NOTE: PST update: grp = 1, dsk = 2, mode = 0x6
    Tue Mar 04 18:10:04 CST 2014
    ERROR: too many offline disks in PST (grp 1)
    Tue Mar 04 18:10:04 CST 2014
    ERROR: PST-initiated MANDATORY DISMOUNT of group DATA
    Tue Mar 04 18:10:04 CST 2014
    WARNING: Disk 2 in group 1 in mode: 0x7,state: 0x2 was taken offline
    Tue Mar 04 18:10:05 CST 2014
    NOTE: halting all I/Os to diskgroup DATA
    NOTE: active pin found: 0x0x40045bb0fd0
    Tue Mar 04 18:10:05 CST 2014
    Abort recovery for domain 1
    Tue Mar 04 18:10:05 CST 2014
    NOTE: cache dismounting group 1/0xD916EC16 (DATA)
    Tue Mar 04 18:10:06 CST 2014

As you can see, ASM raised ORA-15081, and before that error it reported I/O failures on one of the disks, /dev/raw/raw5.
Attentive readers will notice that after the I/O failures the disk was taken offline, and in the end the diskgroup could not be mounted.

In our tests, kfed read could not read the disk, and dd could not operate on it either. However, dd did work directly against the physical disk behind it.

The diskgroup could not be mounted, and the trace clearly points to a corrupted disk header:

    WARNING: cache read a corrupted block gn=1 dsk=2 blk=1 from disk 2
    OSM metadata block dump:
    kfbh.endian: 0 ; 0x000: 0x00
    kfbh.hard: 0 ; 0x001: 0x00
    kfbh.type: 0 ; 0x002: KFBTYP_INVALID
    kfbh.datfmt: 0 ; 0x003: 0x00
    kfbh.block.blk: 0 ; 0x004: T=0 NUMB=0x0
    kfbh.block.obj: 0 ; 0x008: TYPE=0x0 NUMB=0x0
    kfbh.check: 0 ; 0x00c: 0x00000000
    kfbh.fcn.base: 0 ; 0x010: 0x00000000
    kfbh.fcn.wrap: 0 ; 0x014: 0x00000000
    kfbh.spare1: 0 ; 0x018: 0x00000000
    kfbh.spare2: 0 ; 0x01c: 0x00000000
      CE: (0x0x400417ee4e0) group=1 (DATA) obj=2 (disk) blk=1
          hashFlags=0x0002 lid=0x0002 lruFlags=0x0000 bastCount=1
          redundancy=0x11 fileExtent=-2147483648 AUindex=0 blockIndex=1
          copy #0: disk=2 au=0
      BH: (0x0x40041795000) bnum=4586 type=reading state=reading chgSt=not modifying
          flags=0x00000000 pinmode=excl lockmode=share bf=0x0x40041400000
          kfbh_kfcbh.fcn_kfbh = 0.0 lowAba=655.8572 highAba=0.0
          last kfcbInitSlot return code=null cpkt lnk is null

As you may know, starting with version 10.2.0.5 Oracle ASM automatically backs up the ASM disk header, so if only the disk header had been damaged, recovery would have been easy. It was not that simple, though: checking with dd showed that the first several blocks of the disk had in fact been damaged.

In the end we used ODU to copy the datafiles directly from the disks to a file system and then started the database, which completed the recovery.

Note: during the recovery we found that ODU could not directly copy files such as test201402.dbf, yet a check of the ASM alias directory showed that it was intact, so ODU may still have a minor issue handling this case. We manually read out the AUs holding that metadata, matched the entries, and extracted all the remaining files, including the redo logs and the control file. The database then opened without problems.

I have to say, Brother Xiong's ODU is remarkably powerful and makes short work of all kinds of Oracle ASM database recovery cases!
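
The metadata block dump above lists the offset of every kfbh field, so the same check can be sketched against raw blocks read with dd. The following is a minimal Python sketch, not the procedure used in this recovery. The names `parse_kfbh`, `looks_wiped` and `describe` are hypothetical, the meaning of the endian byte (1 for little-endian, 0 for big-endian) and the name for type 1 are assumptions based on typical kfed output, and the 4096-byte block size is taken from the `iosz:4096` value in the ASM alert log.

```python
import struct
from typing import NamedTuple

KFBH_SIZE = 32      # last field in the dump sits at 0x01c and is 4 bytes wide
BLOCK_SIZE = 4096   # matches iosz:4096 in the ASM alert log

# 0 is taken from the dump above; 1 is an assumption based on typical kfed output.
KFBTYP_NAMES = {0: "KFBTYP_INVALID", 1: "KFBTYP_DISKHEAD"}


class Kfbh(NamedTuple):
    endian: int
    hard: int
    type: int
    datfmt: int
    block_blk: int   # raw 32-bit value at 0x004 (not split into T/NUMB)
    block_obj: int   # raw 32-bit value at 0x008 (not split into TYPE/NUMB)
    check: int
    fcn_base: int
    fcn_wrap: int
    spare1: int
    spare2: int


def parse_kfbh(block: bytes) -> Kfbh:
    """Decode the 32-byte block header using the offsets shown in the dump."""
    if len(block) < KFBH_SIZE:
        raise ValueError("need at least 32 bytes")
    # Assumption: endian byte 1 means little-endian, anything else big-endian.
    order = "<" if block[0] == 1 else ">"
    return Kfbh(*struct.unpack(order + "4B7I", block[:KFBH_SIZE]))


def looks_wiped(block: bytes) -> bool:
    """True when every byte of the block is zero."""
    return not any(block)


def describe(block: bytes) -> str:
    """One-line summary of a metadata block."""
    h = parse_kfbh(block)
    name = KFBTYP_NAMES.get(h.type, f"type {h.type}")
    return f"{name} blk={h.block_blk} obj={h.block_obj} zeroed={looks_wiped(block)}"


if __name__ == "__main__":
    # An all-zero block decodes the same way as the corrupted block in the trace.
    print(describe(bytes(BLOCK_SIZE)))
```

To apply it to real data, one would first copy the leading blocks of the physical disk to a file with dd and then pass each 4096-byte slice of that file to `describe`. A run of blocks reporting KFBTYP_INVALID with `zeroed=True` would match the situation described here, where more than the disk header was damaged.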