Understanding LGWR, Log File Sync Waits, and Commit Performance Problems
I. Overview:
1. How commit and log file sync work
2. Why log file sync waits take too long
3. How to measure where the problem lies
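II. Causes of log file sync waits
1. By default, committing a transaction has to wait on log file sync. This includes:
(1) User commits (the user commit statistics can be viewed through v$sesstat)
(2) DDL: mainly caused by the recursive transaction commits that DDL generates
(3) Recursive data dictionary DML operations
2. Rollbacks also cause log file sync waits
(1) User rollbacks: rollback operations issued by the user or by the application
(2) Transaction rollbacks: 1. internal rollbacks that Oracle performs after a failed operation; 2. space allocation or ASSM-related problems, long-running queries cancelled by the user, killed sessions, and so on.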
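The figure below shows the flow of a commit and the related log file sync wait:
Log file sync performance > disk IO speed
**** Most of the log file sync wait time is actually spent in log file parallel write, much as DBWR waits on db file parallel write
**** The rest of the log file sync wait time is spent on scheduling latency, IPC latency, and so on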
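Let's walk through the whole log file sync wait:
1. The foreground process posts LGWR and then goes to sleep
The log file sync wait starts counting at this point
On Unix platforms this call is implemented with semaphores
2. LGWR is woken up and gets a CPU time slice to do its work
LGWR issues the IO request
LGWR goes to sleep and waits on log file parallel write
3. When the IO call completes at the storage level, the OS wakes up LGWR
LGWR has to get a CPU time slice again
At this point it marks the log file parallel write wait as complete and posts the foreground process
4. The foreground process is woken up by LGWR, gets a CPU time slice, and marks the log file sync wait as complete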
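The four steps above can be sketched as a toy simulation. This is only an illustration of why log file sync always includes the log file parallel write time plus posting and scheduling overhead; it is not how Oracle is implemented. The names lgwr_post, foreground_post and SIMULATED_IO_SECONDS are made up for this sketch, and the 5 ms "IO" is an arbitrary figure.

```python
import threading
import time

# Hypothetical stand-in for the storage write; 5 ms is an arbitrary figure.
SIMULATED_IO_SECONDS = 0.005

lgwr_post = threading.Semaphore(0)        # foreground -> LGWR
foreground_post = threading.Semaphore(0)  # LGWR -> foreground
timings = {}


def lgwr():
    lgwr_post.acquire()                   # step 2: LGWR is woken up
    start = time.perf_counter()           # "log file parallel write" starts
    time.sleep(SIMULATED_IO_SECONDS)      # IO in flight, LGWR sleeps
    timings["log file parallel write"] = time.perf_counter() - start  # step 3
    foreground_post.release()             # post the foreground process


def foreground_commit():
    start = time.perf_counter()           # step 1: "log file sync" starts
    lgwr_post.release()                   # post LGWR, then go to sleep
    foreground_post.acquire()             # sleep until LGWR posts back
    timings["log file sync"] = time.perf_counter() - start  # step 4


def run_once():
    """Run one simulated commit and return both measured waits in seconds."""
    timings.clear()
    writer = threading.Thread(target=lgwr)
    writer.start()
    foreground_commit()
    writer.join()
    return dict(timings)


if __name__ == "__main__":
    for name, seconds in run_once().items():
        print(f"{name}: {seconds * 1000:.2f} ms")
```

In this sketch the gap between the two numbers is the semaphore and thread scheduling overhead, which plays the role of the scheduling and IPC latency mentioned earlier.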
Measuring LGWR speed with the snapper script:
---------------------------------------------------------------------------------
SID, USERNAME , TYPE, STATISTIC , DELTA, HDELTA/SEC, %TIME, GRAPH
---------------------------------------------------------------------------------
1096, (LGWR) , STAT, messages sent , 12 , 12,
1096, (LGWR) , STAT, messages received , 10 , 10,
1096, (LGWR) , STAT, background timeouts , 1 , 1,
1096, (LGWR) , STAT, physical write total IO requests , 40, 40,
1096, (LGWR) , STAT, physical write total multi block request, 38, 38,
1096, (LGWR) , STAT, physical write total bytes, 2884608 , 2.88M,
1096, (LGWR) , STAT, calls to kcmgcs , 20 , 20,
1096, (LGWR) , STAT, redo wastage , 4548 , 4.55k,
1096, (LGWR) , STAT, redo writes , 10 , 10,
1096, (LGWR) , STAT, redo blocks written , 2817 , 2.82k,
1096, (LGWR) , STAT, redo write time , 25 , 25,
1096, (LGWR) , WAIT, LGWR wait on LNS , 1040575 , 1.04s, 104.1%, |@@@@@@@@@@|
1096, (LGWR) , WAIT, log file parallel write , 273837 , 273.84ms, 27.4%,|@@@ |
1096, (LGWR) , WAIT, events in waitclass Other , 1035172 , 1.04s , 103.5%,|@@@@@@@@@@|
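As a rough reading aid, the sample above can be reduced to a few per-write averages. The sketch below simply hard-codes figures copied from that output; it assumes the WAIT deltas are in microseconds (which matches 273837 being displayed as 273.84ms), and the helper name per_write is made up for this example.

```python
# Figures copied from the snapper sample above (LGWR, SID 1096).
redo_writes = 10
redo_blocks_written = 2817
log_file_parallel_write_us = 273837   # assumed to be microseconds
physical_write_io_requests = 40
physical_write_bytes = 2884608


def per_write(total, writes):
    """Average of a cumulative figure per write; 0.0 if nothing was written."""
    return total / writes if writes else 0.0


avg_write_ms = per_write(log_file_parallel_write_us, redo_writes) / 1000
avg_blocks = per_write(redo_blocks_written, redo_writes)
avg_io_bytes = per_write(physical_write_bytes, physical_write_io_requests)

print(f"avg log file parallel write: {avg_write_ms:.2f} ms per redo write")
print(f"avg redo blocks per write:   {avg_blocks:.1f}")
print(f"avg bytes per IO request:    {avg_io_bytes:.0f}")
```

For this sample that works out to roughly 27.4 ms of log file parallel write per redo write, which is slow for a redo write and worth comparing against what the storage is expected to deliver.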
LGWR and Asynch IO
oracle@linux01:~$ strace -cp `pgrep -f lgwr`
Process 12457 attached - interrupt to quit
^CProcess 12457 detached
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- --------------
100.00 0.010000 263 38 3 semtimedop
0.00 0.000000 0 213 times
0.00 0.000000 0 8 getrusage
0.00 0.000000 0 701 gettimeofday
0.00 0.000000 0 41 io_getevents
0.00 0.000000 0 41 io_submit
0.00 0.000000 0 2 semop
0.00 0.000000 0 37 semctl
------ ----------- ----------- --------- --------- --------------
100.00 0.010000 1081 3 total
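To check whether LGWR is really using asynchronous IO, the summary above can be tallied by syscall name. The following is a minimal sketch that parses the strace -c text pasted into a string; the function name parse_strace_summary is made up here, and the parser only assumes the column layout shown above (call count in the fourth column, syscall name last).

```python
STRACE_SUMMARY = """\
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- --------------
100.00 0.010000 263 38 3 semtimedop
0.00 0.000000 0 213 times
0.00 0.000000 0 8 getrusage
0.00 0.000000 0 701 gettimeofday
0.00 0.000000 0 41 io_getevents
0.00 0.000000 0 41 io_submit
0.00 0.000000 0 2 semop
0.00 0.000000 0 37 semctl
------ ----------- ----------- --------- --------- --------------
100.00 0.010000 1081 3 total
"""


def parse_strace_summary(text):
    """Return {syscall name: number of calls} from `strace -c` output."""
    calls = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows have 5 fields, or 6 when the errors column is filled in.
        if len(fields) not in (5, 6):
            continue
        try:
            float(fields[0])          # data rows start with the % time value
        except ValueError:
            continue                  # header or separator line
        name = fields[-1]
        if name != "total":
            calls[name] = int(fields[3])
    return calls


calls = parse_strace_summary(STRACE_SUMMARY)
uses_aio = calls.get("io_submit", 0) > 0 and calls.get("io_getevents", 0) > 0
print("io_submit:", calls.get("io_submit", 0))
print("io_getevents:", calls.get("io_getevents", 0))
print("asynchronous IO in use:", uses_aio)
```

Here io_submit and io_getevents each appear 41 times, so the writes are going through the Linux AIO interface; the semtimedop, semop and semctl calls are the semaphore traffic used for posting between processes.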
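*** With AIO, io_getevents is where the log file parallel write wait event is measured
Latch tuning related to redo and commit
1. redo allocation latches: as the name suggests, the latches that protect space allocation when a process writes its redo into the log buffer
2. redo copy latches: the latches held while redo is copied from a private memory area into the log buffer. LGWR can only write the buffers to disk once the relevant redo has been fully copied into the log buffer; until then LGWR waits on the LGWR wait for redo copy event. The related tunable parameter is _log_simultaneous_copies
Wait events: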
1.log file sync
2.log file parallel write
3.log file single write
The related statistics can be obtained from v$sesstat and v$sysstat