Last week, one of our clients had a sudden slow down on all of their applications which is running on two node RAC environment
Below is the summary of the setup:
– Server and Storage: SunFire X4200 with LUNs on EMC CX300
– OS: RHEL 4.3 ES
– Oracle 10.2.0.3 (database and clusterware)
– Database Files, Flash Recovery Area, OCR, and Voting disk are located on OCFS2 filesystems
– Application: Forms and Reports (6i and also lower)
As per the DBA, the workload on the database was normal and there were no changes on the RAC nodes and on the applications. Hmm, I can’t really tell because I haven’t really looked into their workload so I don’t have past data to compare.
I first setup the OS Watcher on both nodes (with 60sec snapshot interval) to have an end-to-end view of the server performance while I’m doing the diagnosis on the database. It’s a cool tool and automatically gathers a lot of info when you start it (cpu, io, memory, network, process info), also this tool has a graphing facility which makes it faster to detect spikes and any trends.
Then, I looked into the database parameters and compared it with the RDA of the database I had before. This is just to check if there are any changes on the parameters that might cause the sudden slow down. Hmm…there was nothing new
Then, I queried on the V$SESSION
SQL> select count(*) from v$session; -- server1 COUNT(*) ---------- 385 SQL> select count(*) from v$session; -- server2 COUNT(*) ---------- 49 |
.
I was surprised! The session count on both servers show that there are more users connected on server1 (almost 89% of the total users). This could be because of
1) the clients having lower versions (< Sql*Plus 8.1 or OCI8, see Note 97926.1) that may not support TAF (FAILOVER_MODE) and Load Balancing (LOAD_BALANCE) or
2) they are using TNS entries explicitly connecting to server1
Since the RAC database has different applications connected to it, knowing which applications are on server1 and how many they are will be enough to determine which users on particular applications could be transferred to server2 to at least balance the number of users on both nodes, but doing this may still not solve the issue as I don’t have enough info that this is the root cause of the problem..
The query below shows that all the users don’t have FAILOVER capabilities which could be attributed by the two points I mentioned above
select distinct s.inst_id instance_id, s.failover_type failover_type, s.failover_method failover_method, s.failed_over failed_over from gv$session s; INSTANCE_ID FAILOVER_TYPE FAILOVER_METHOD FAILED_OVER ----------- -------------------- -------------------- -------------------- 1 NONE NONE NO 2 NONE NONE NO |
.
Below query shows the distinct applications/modules and their count which are all on server1. Also you’ll notice that most of them are inactive (some of these users should connect to server2 to have balance on both nodes)
SQL> select distinct inst_id, module, program, status, count(*) from gv$session group by inst_id, module, program, status order by 1,2,3,4; INST_ID MODULE PROGRAM STATUS COUNT(*) ---------- -------------------------------------------------- -------------------------------------------------- ---------- ---------- 1 F45RUN32.EXE F45RUN32.EXE INACTIVE 23 ... output snipped .... 1 xxxPgm xxxPgm INACTIVE 21 ... output snipped .... 1 c:\xxxwin95\BIN\F45RUN32.EXE L:\xxx\VS\FORMS\MAI c:\xxxwin95\BIN\F45RUN32.EXE L:\xxx\VS\FORMS\MAI INACTIVE 44 1 c:\xxxwin95\BIN\F45RUN32.EXE L:\MAIN.FMX c:\xxxwin95\BIN\F45RUN32.EXE L:\MAIN.FMX INACTIVE 31 1 c:\xxxwin95\BIN\F45RUN32.EXE w:\xxxMENU.fmx c:\xxxwin95\BIN\F45RUN32.EXE w:\xxxMENU.fmx INACTIVE 14 1 c:\xxxwin95\BIN\F45RUN32.EXE w:\xxxlogin.fmx c:\xxxwin95\BIN\F45RUN32.EXE w:\xxxlogin.fmx INACTIVE 96 ... output snipped .... 1 c:\xxxwin95\BIN\R25SRV32.exe c:\xxxwin95\BIN\R25SRV32.exe INACTIVE 54 ... output snipped .... 1 ifweb90.exe ifweb90.exe INACTIVE 11 |
.
Below info shows some data samples of DBA_HIST_ACTIVE_SESS_HISTORY which is graphed using Tanel Poder’s Perfsheet (could be found here and here) to clearly distinguish the distribution of the modules and number of users. Most of the users on server1 uses the “c:\xxxwin95\BIN\F45RUN32.EXE w:\xxxlogin.fmx”
Having the graph above may lead us to conclusion that the significant number of users on server1 attributes to the slow down of the transactions. But as per the DBA, there were no changes made and they were running the same transactions as before on server1 which has acceptable response time. Also the OS Watcher shows that the CPU utilization (peak is 60% and most of the time at 40%) and run queue was low and the disk IO utilization and service time were average (always 10ms below), and there were enough memory on the server (3GB free out of 12GB.. the SGA is 2GB) and no swapping.
Hmm…Drilling down on the wait interface and on per session level that is doing the important business operations will give us a definite conclusion on what is really the bottleneck on the database.
The graph below is another sample from DBA_HIST_ACTIVE_SESS_HISTORY that shows server1 (in blue box) is suffering from “gc cr block lost” and “gc cr multi block request” from 7am to 4pm. The “Metalink Doc ID: 563566.1 gc lost blocks diagnostics” indicates that it is a cluster problem which could be a problem on the network interconnect
To confirm the bottleneck shown from the sample of DBA_HIST_ACTIVE_SESS_HISTORY (above), the output of ADDM and AWR report from a peak period was analyzed (from 2-3 & 3-4 pm). Below are the output of ADDM and AWR reports:
ADDM output:
DETAILED ADDM REPORT FOR TASK 'TASK_137854' WITH ID 137854 ---------------------------------------------------------- Analysis Period: 28-MAY-2009 from 13:58:59 to 14:59:12 Database ID/Instance: 3967623528/2 Database/Instance Names: xxx Host Name: xxx Database Version: 10.2.0.3.0 Snapshot Range: from 15642 to 15643 Database Time: 3710 seconds Average Database Load: 1 active sessions ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ FINDING 1: 52% impact (1917 seconds) ------------------------------------ Cluster multi-block requests were consuming significant database time. RECOMMENDATION 1: SQL Tuning, 37% benefit (1389 seconds) ACTION: Run SQL Tuning Advisor on the SQL statement with SQL_ID "0h0fn2d19adtk". Look for an alternative plan that does not use object scans. RELEVANT OBJECT: SQL statement with SQL_ID 0h0fn2d19adtk ... output snipped ... RECOMMENDATION 2: SQL Tuning, 8.4% benefit (310 seconds) ACTION: Run SQL Tuning Advisor on the SQL statement with SQL_ID "8jd43xr4rp00u". Look for an alternative plan that does not use object scans. RELEVANT OBJECT: SQL statement with SQL_ID 8jd43xr4rp00u ... output snipped ... RECOMMENDATION 3: SQL Tuning, 4.6% benefit (172 seconds) ACTION: Run SQL Tuning Advisor on the SQL statement with SQL_ID "6wpgf3s0vzuks". Look for an alternative plan that does not use object scans. RELEVANT OBJECT: SQL statement with SQL_ID 6wpgf3s0vzuks ... output snipped ... SYMPTOMS THAT LED TO THE FINDING: SYMPTOM: Inter-instance messaging was consuming significant database time on this instance. (94% impact [3488 seconds]) SYMPTOM: Wait class "Cluster" was consuming significant database time. (94% impact [3488 seconds]) FINDING 2: 51% impact (1874 seconds) ------------------------------------ Global Cache Service Processes (LMSn) in other instances were not processing requests fast enough. RECOMMENDATION 1: DB Configuration, 51% benefit (1874 seconds) ACTION: Increase throughput of the Global Cache Service (LMSn) processes. Increase the number of Global Cache Service processes by increasing the value of the parameter "gcs_server_processes". Alternatively, if the host is CPU bound consider increasing the OS priority of the Global Cache Service processes. RATIONALE: The value of parameter "gcs_server_processes" was "2" during the analysis period. SYMPTOMS THAT LED TO THE FINDING: SYMPTOM: Inter-instance messaging was consuming significant database time on this instance. (94% impact [3488 seconds]) SYMPTOM: Wait class "Cluster" was consuming significant database time. (94% impact [3488 seconds]) FINDING 3: 36% impact (1330 seconds) ------------------------------------ Cluster communications that were retried due to lost blocks consumed significant database time. RECOMMENDATION 1: Host Configuration, 36% benefit (1330 seconds) ACTION: Check the configuration of the cluster interconnect. Check OS setup like adapter setting, firmware and driver release. Check that the OS's socket receive buffers are large enough to store an entire multiblock read. The value of parameter "db_file_multiblock_read_count" may be decreased as a workaround. RATIONALE: The instance was consuming 477 kilo bits per second of interconnect bandwidth. SYMPTOMS THAT LED TO THE FINDING: SYMPTOM: Inter-instance messaging was consuming significant database time on this instance. (94% impact [3488 seconds]) SYMPTOM: Wait class "Cluster" was consuming significant database time. (94% impact [3488 seconds]) ####################################################################### DETAILED ADDM REPORT FOR TASK 'TASK_137874' WITH ID 137874 ---------------------------------------------------------- Analysis Period: 28-MAY-2009 from 14:59:12 to 15:58:44 Database ID/Instance: 3967623528/2 Database/Instance Names: xxx Host Name: xxx Database Version: 10.2.0.3.0 Snapshot Range: from 15643 to 15644 Database Time: 1563 seconds Average Database Load: .4 active sessions ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ FINDING 1: 42% impact (654 seconds) ----------------------------------- Cluster communications that were retried due to lost blocks consumed significant database time. RECOMMENDATION 1: Host Configuration, 42% benefit (654 seconds) ACTION: Check the configuration of the cluster interconnect. Check OS setup like adapter setting, firmware and driver release. Check that the OS's socket receive buffers are large enough to store an entire multiblock read. The value of parameter "db_file_multiblock_read_count" may be decreased as a workaround. RATIONALE: The instance was consuming 134 kilo bits per second of interconnect bandwidth. SYMPTOMS THAT LED TO THE FINDING: SYMPTOM: Inter-instance messaging was consuming significant database time on this instance. (88% impact [1371 seconds]) SYMPTOM: Wait class "Cluster" was consuming significant database time. (88% impact [1372 seconds]) |
.
AWR output:
WORKLOAD REPOSITORY report for DB Name DB Id Instance Inst Num Release RAC Host ------------ ----------- ------------ -------- ----------- --- ------------ xxxxx 3967623528 xxxxxx 2 10.2.0.3.0 YES xxx Snap Id Snap Time Sessions Curs/Sess --------- ------------------- -------- --------- Begin Snap: 15642 28-May-09 13:58:58 57 27.0 End Snap: 15643 28-May-09 14:59:12 64 28.7 Elapsed: 60.23 (mins) DB Time: 61.83 (mins) Top 5 Timed Events Avg %Total ~~~~~~~~~~~~~~~~~~ wait Call Event Waits Time (s) (ms) Time Wait Class ------------------------------ ------------ ----------- ------ ------ ---------- gc cr multi block request 19,644 1,902 97 51.3 Cluster gc cr block lost 1,011 1,117 1104 30.1 Cluster enq: WF - contention 626 268 429 7.2 Other gc current block lost 186 203 1089 5.5 Cluster cr request retry 678 162 240 4.4 Other Per Wait Class ~~~~~~~~~~~~~~~~~~ Avg %Time Total Wait wait Waits Wait Class Waits -outs Time (s) (ms) /txn -------------------- ---------------- ------ ---------------- ------- --------- Cluster 34,450 7.0 3,385 98 6.8 Other 91,292 64.7 467 5 18.1 User I/O 26,510 .0 102 4 5.3 Concurrency 29,668 .0 54 2 5.9 System I/O 13,360 .0 11 1 2.6 Commit 1,313 .0 4 3 0.3 Application 2,374 .0 2 1 0.5 Network 129,774 .0 0 0 25.7 Configuration 20 .0 0 0 0.0 ####################################################################### WORKLOAD REPOSITORY report for DB Name DB Id Instance Inst Num Release RAC Host ------------ ----------- ------------ -------- ----------- --- ------------ xxxxx 3967623528 xxxxxx 2 10.2.0.3.0 YES xxx Snap Id Snap Time Sessions Curs/Sess --------- ------------------- -------- --------- Begin Snap: 15643 28-May-09 14:59:12 64 28.7 End Snap: 15644 28-May-09 15:58:43 64 29.6 Elapsed: 59.53 (mins) DB Time: 26.05 (mins) Top 5 Timed Events Avg %Total ~~~~~~~~~~~~~~~~~~ wait Call Event Waits Time (s) (ms) Time Wait Class ------------------------------ ------------ ----------- ------ ------ ---------- gc cr block lost 389 429 1103 27.5 Cluster gc cr multi block request 2,360 422 179 27.0 Cluster enq: WF - contention 510 212 416 13.6 Other gc current block lost 188 204 1084 13.0 Cluster cr request retry 326 181 554 11.6 Other Per Wait Class ~~~~~~~~~~~~~~~~~~ Avg %Time Total Wait wait Waits Wait Class Waits -outs Time (s) (ms) /txn -------------------- ---------------- ------ ---------------- ------- --------- Cluster 10,239 8.9 1,247 122 2.1 Other 153,082 71.9 468 3 31.0 Application 1,784 18.0 161 90 0.4 Concurrency 33,014 .0 53 2 6.7 User I/O 4,246 .0 27 6 0.9 System I/O 13,648 .0 14 1 2.8 Commit 1,546 .0 4 3 0.3 Network 138,333 .0 1 0 28.0 Configuration 16 .0 0 4 0.0 |
.
The only thing that is left to do is to drill down on the complaining users and check on the network performance and interconnect… unfortunately the users were going home and I can’t see any significant database activity. So I called it a night, and just analyzed my activity log and read on Metalink Doc ID 563566.1. Also I advised the DBA to divide the users across the nodes.
Next day afternoon, and it’s time to do the drill down. You can see on the graph below (a sample from GV$SESSION) that the users were already distributed, but still there is a performance problem. I told you this will not solve the issue ! :) it is just a plus that we manage to distribute the users :)
A query on GV$SESSION_LONGOPS will show users with large SQLs. Drilling down on the session statistics of these user’s will give detail on what are they waiting and how much time they spent on what resource. However, it would be a repetitive step to get all the sessions on the GV$SESSION_LONGOPS.
select * from ( select a.inst_id inst, a.START_TIME, b.osuser, b.module, a.sid, a.serial#, a.opname, b.event, a.target, a.sofar, a.totalwork, round(a.sofar/a.totalwork, 4)*100 pct, a.units, a.elapsed_seconds, a.time_remaining time_remaining_sec, round(a.time_remaining/60,2) min ,a.sql_hash_value -- ,message from gv$session_longops a, gv$session b where a.sid = b.sid and b.serial# = b.serial# and sofar < totalwork order by totalwork desc) INST START_TIME OSUSER MODULE SID SERIAL# OPNAME EVENT TARGET SOFAR TOTALWORK PCT UNITS ELAPSED_SECONDS TIME_REMAINING_SEC MIN SQL_HASH_VALUE ---- ----------------- ---------- --------------- ------ ------- ---------- --------------- ---------- ---------- ---------- ---------- ---------- --------------- ------------------ ---------- -------------- 2 20090529 18:36:38 oraprod oracle@xxxxx280 454 82 Table Scan gc cr multi blo xxxxxxxx.B 2205 25403 8.68 Blocks 343 3609 60.15 1857468103 R (TNS V1-V3) ck request ENMAS 2 20090529 18:30:39 oraprod oracle@xxxxx280 607 546 Table Scan gc cr multi blo xxxxxxxx.C 11062 15203 72.76 Blocks 708 265 4.42 1477847824 R (TNS V1-V3) ck request OVMAS 1 20090529 15:51:58 PC829 c:\xxxwin95\BIN 474 721 Table Scan gc cr request xxxxxxxx.C 11108 15203 73.06 Blocks 10317 3803 63.38 1636678807 \F45RUN32.EXE w OVMAS : \xxxlogin.fmx 1 20090529 17:27:49 PC1280 c:\xxxwin95\BIN 473 171 Table Scan gc cr request xxxxxx.DTT 336 12918 2.6 Blocks 4449 166599 2776.65 1117075250 \R25SRV32.exe _DVDTL 1 20090529 17:27:49 oracle sqlplus@server2 473 171 Table Scan PX Deq: Executi xxxxxx.DTT 336 12918 2.6 Blocks 4449 166599 2776.65 1117075250 .xxxx.xxx (TNS on Msg _DVDTL V1-V3) |
.
I already have an idea that my problem is on the cluster configuration. So I asked the DBA that we go up to the Accounting Department (where there are a lot of users complaining) and check on the desktop terminals of the users. We saw two users which has the application module on total hang. These users are PC1069 (with SID 601) and PC918 (with SID 483). So a query on the GV$SESSION_LONGOPS does not show these users but a query on GV$SESSION_WAIT shows that they are waiting on “gc cr request” event
select s.inst_id, w.sid, s.sql_hash_value, s.osuser, s.program, w.event, w.seconds_in_wait seconds_in_wait, w.state, w.p1, w.p2, w.p3 from gv$session s, gv$session_wait w where s.sid = w.sid and s.type = 'USER' and w.state = 'WAITING' and s.osuser in ('PC1069','PC918') order by 6 asc; INST_ID SID SQL_HASH_VALUE OSUSER PROGRAM EVENT SECONDS_IN_WAIT STATE P1 P2 P3 ---------- ------ -------------- ---------- ----------------------- ------------------ --------------- ----------------- ---------- ---------- ---------- 1 483 449494677 PC918 c:\xxxwin95\BIN\F45RUN3 SQL*Net message fr 3853 WAITING 675562835 1 0 2.EXE w:\xxxlogin.fmx om client 1 601 1874068660 PC1069 c:\xxxwin95\BIN\F45RUN3 gc cr request 6 WAITING 12 1173863 1 2.EXE w:\xxxlogin.fmx 1 483 449494677 PC918 c:\xxxwin95\BIN\F45RUN3 gc cr request 0 WAITING 7 1649989 1 2.EXE w:\xxxlogin.fmx |
.
Focusing on these two sessions (SID 601 and 483) and getting their session statistics will give us an idea if they are doing something, on a total hang, or waiting for something. Below query on V$SESSION_WAIT shows that the SEQ# is being incremented (a number that uniquely identifies this wait) which means the session (SID 601) is still waiting for something, the event “gc cr request”
BTW, I’m using SW script of Tanel Poder (downloadable here), and you’ll also see some usage of Snapper and USQL
SQL> @sw 601 SID STATE EVENT SEQ# SEC_IN_WAIT P1 P2 P3 P1TRANSL ------- ------- ---------------------------------------- ---------- ----------- ---------- ---------- ---------- ------------------------------------------ 601 WAITING gc cr request 27223 3 7 1647259 1 ... output snipped ... SQL> @sw 601 SID STATE EVENT SEQ# SEC_IN_WAIT P1 P2 P3 P1TRANSL ------- ------- ---------------------------------------- ---------- ----------- ---------- ---------- ---------- ------------------------------------------ 601 WAITING gc cr request 27225 0 7 1647259 1 ... output snipped ... SQL> @sw 601 SID STATE EVENT SEQ# SEC_IN_WAIT P1 P2 P3 P1TRANSL ------- ------- ---------------------------------------- ---------- ----------- ---------- ---------- ---------- ------------------------------------------ 601 WAITING gc cr request 27227 0 7 1647259 1 ... output snipped ... SQL> @sw 601 SID STATE EVENT SEQ# SEC_IN_WAIT P1 P2 P3 P1TRANSL ------- ------- ---------------------------------------- ---------- ----------- ---------- ---------- ---------- ------------------------------------------ 601 WAITING gc cr request 27229 7 7 1647259 1 |
.
A query on V$SESSTAT will show the performance/wait counters being incremented (SID 601), the output below was a snap of the session for 5 seconds period done 5 times. You’ll see that “gc current” waits are significant
SQL> @snapper out 5 5 601 -- Session Snapper v1.10 by Tanel Poder ( http://www.tanelpoder.com ) --------------------------------------------------------------------------------------------------------------------------------------------- HEAD, SID, SNAPSHOT START , SECONDS, TYPE, STATISTIC , DELTA, DELTA/SEC, HDELTA, HDELTA/SEC --------------------------------------------------------------------------------------------------------------------------------------------- DATA, 601, 20090529 18:48:36, 6, STAT, opened cursors current , -1, -0, -1, -.17 DATA, 601, 20090529 18:48:36, 6, STAT, recursive calls , 5, 1, 5, .83 DATA, 601, 20090529 18:48:36, 6, STAT, session logical reads , 24, 4, 24, 4 DATA, 601, 20090529 18:48:36, 6, STAT, cluster wait time , 492, 82, 492, 82 DATA, 601, 20090529 18:48:36, 6, STAT, user I/O wait time , 1, 0, 1, .17 DATA, 601, 20090529 18:48:36, 6, STAT, session pga memory , -65536, -10923, -65.54k, -10.92k DATA, 601, 20090529 18:48:36, 6, STAT, physical read total IO requests , 1, 0, 1, .17 DATA, 601, 20090529 18:48:36, 6, STAT, physical read total bytes , 8192, 1365, 8.19k, 1.37k DATA, 601, 20090529 18:48:36, 6, STAT, gcs messages sent , 3, 1, 3, .5 DATA, 601, 20090529 18:48:36, 6, STAT, db block gets , 6, 1, 6, 1 DATA, 601, 20090529 18:48:36, 6, STAT, db block gets from cache , 6, 1, 6, 1 DATA, 601, 20090529 18:48:36, 6, STAT, consistent gets , 18, 3, 18, 3 DATA, 601, 20090529 18:48:36, 6, STAT, consistent gets from cache , 18, 3, 18, 3 DATA, 601, 20090529 18:48:36, 6, STAT, consistent gets - examination , 8, 1, 8, 1.33 DATA, 601, 20090529 18:48:36, 6, STAT, physical reads , 1, 0, 1, .17 DATA, 601, 20090529 18:48:36, 6, STAT, physical reads cache , 1, 0, 1, .17 DATA, 601, 20090529 18:48:36, 6, STAT, physical read IO requests , 1, 0, 1, .17 DATA, 601, 20090529 18:48:36, 6, STAT, physical read bytes , 8192, 1365, 8.19k, 1.37k DATA, 601, 20090529 18:48:36, 6, STAT, db block changes , 3, 1, 3, .5 DATA, 601, 20090529 18:48:36, 6, STAT, free buffer requested , 3, 1, 3, .5 DATA, 601, 20090529 18:48:36, 6, STAT, calls to get snapshot scn: kcmgss , 5, 1, 5, .83 DATA, 601, 20090529 18:48:36, 6, STAT, redo entries , 2, 0, 2, .33 DATA, 601, 20090529 18:48:36, 6, STAT, redo size , 536, 89, 536, 89.33 DATA, 601, 20090529 18:48:36, 6, STAT, gc current blocks received , 1, 0, 1, .17 DATA, 601, 20090529 18:48:36, 6, STAT, gc remote grants , 1, 0, 1, .17 DATA, 601, 20090529 18:48:36, 6, STAT, undo change vector size , 84, 14, 84, 14 DATA, 601, 20090529 18:48:36, 6, STAT, no work - consistent read gets , 9, 2, 9, 1.5 DATA, 601, 20090529 18:48:36, 6, STAT, table fetch by rowid , 4, 1, 4, .67 DATA, 601, 20090529 18:48:36, 6, STAT, index scans kdiixs1 , 4, 1, 4, .67 DATA, 601, 20090529 18:48:36, 6, STAT, buffer is not pinned count , 8, 1, 8, 1.33 DATA, 601, 20090529 18:48:36, 6, STAT, execute count , 4, 1, 4, .67 DATA, 601, 20090529 18:48:36, 6, TIME, PL/SQL execution elapsed time , 139, 23, 139us, 23.17us DATA, 601, 20090529 18:48:36, 6, TIME, DB CPU , 10000, 1667, 10ms, 1.67ms DATA, 601, 20090529 18:48:36, 6, TIME, sql execute elapsed time , 27552125, 4592021, 27.55s, 4.59s DATA, 601, 20090529 18:48:36, 6, TIME, DB time , 27552133, 4592022, 27.55s, 4.59s DATA, 601, 20090529 18:48:36, 6, WAIT, db file sequential read , 8374, 1396, 8.37ms, 1.4ms DATA, 601, 20090529 18:48:36, 6, WAIT, gc current request , 9664835, 1610806, 9.66s, 1.61s -- End of snap 1 DATA, 601, 20090529 18:48:42, 5, STAT, opened cursors cumulative , 1, 0, 1, .2 DATA, 601, 20090529 18:48:42, 5, STAT, opened cursors current , -1, -0, -1, -.2 DATA, 601, 20090529 18:48:42, 5, STAT, recursive calls , 4, 1, 4, .8 DATA, 601, 20090529 18:48:42, 5, STAT, recursive cpu usage , 1, 0, 1, .2 DATA, 601, 20090529 18:48:42, 5, STAT, session logical reads , 42, 8, 42, 8.4 DATA, 601, 20090529 18:48:42, 5, STAT, cluster wait time , 419, 84, 419, 83.8 DATA, 601, 20090529 18:48:42, 5, STAT, user I/O wait time , 2, 0, 2, .4 DATA, 601, 20090529 18:48:42, 5, STAT, enqueue requests , 1, 0, 1, .2 DATA, 601, 20090529 18:48:42, 5, STAT, global enqueue gets sync , 1, 0, 1, .2 DATA, 601, 20090529 18:48:42, 5, STAT, physical read total IO requests , 2, 0, 2, .4 DATA, 601, 20090529 18:48:42, 5, STAT, physical read total bytes , 16384, 3277, 16.38k, 3.28k DATA, 601, 20090529 18:48:42, 5, STAT, gcs messages sent , 2, 0, 2, .4 DATA, 601, 20090529 18:48:42, 5, STAT, db block gets , 41, 8, 41, 8.2 DATA, 601, 20090529 18:48:42, 5, STAT, db block gets from cache , 41, 8, 41, 8.2 DATA, 601, 20090529 18:48:42, 5, STAT, consistent gets , 1, 0, 1, .2 DATA, 601, 20090529 18:48:42, 5, STAT, consistent gets from cache , 1, 0, 1, .2 DATA, 601, 20090529 18:48:42, 5, STAT, physical reads , 2, 0, 2, .4 DATA, 601, 20090529 18:48:42, 5, STAT, physical reads cache , 2, 0, 2, .4 DATA, 601, 20090529 18:48:42, 5, STAT, physical read IO requests , 2, 0, 2, .4 DATA, 601, 20090529 18:48:42, 5, STAT, physical read bytes , 16384, 3277, 16.38k, 3.28k DATA, 601, 20090529 18:48:42, 5, STAT, db block changes , 26, 5, 26, 5.2 DATA, 601, 20090529 18:48:42, 5, STAT, exchange deadlocks , 1, 0, 1, .2 DATA, 601, 20090529 18:48:42, 5, STAT, free buffer requested , 3, 1, 3, .6 DATA, 601, 20090529 18:48:42, 5, STAT, calls to get snapshot scn: kcmgss , 1, 0, 1, .2 DATA, 601, 20090529 18:48:42, 5, STAT, redo entries , 14, 3, 14, 2.8 DATA, 601, 20090529 18:48:42, 5, STAT, redo size , 3504, 701, 3.5k, 700.8 DATA, 601, 20090529 18:48:42, 5, STAT, redo subscn max counts , 1, 0, 1, .2 DATA, 601, 20090529 18:48:42, 5, STAT, gc local grants , 1, 0, 1, .2 DATA, 601, 20090529 18:48:42, 5, STAT, gc remote grants , 1, 0, 1, .2 DATA, 601, 20090529 18:48:42, 5, STAT, undo change vector size , 1140, 228, 1.14k, 228 DATA, 601, 20090529 18:48:42, 5, STAT, deferred (CURRENT) block cleanout applic, 4, 1, 4, .8 DATA, 601, 20090529 18:48:42, 5, STAT, parse count (total) , 1, 0, 1, .2 DATA, 601, 20090529 18:48:42, 5, STAT, execute count , 1, 0, 1, .2 DATA, 601, 20090529 18:48:42, 5, TIME, parse time elapsed , 41, 8, 41us, 8.2us DATA, 601, 20090529 18:48:42, 5, TIME, PL/SQL execution elapsed time , 125, 25, 125us, 25us DATA, 601, 20090529 18:48:42, 5, TIME, sql execute elapsed time , 5433501, 1086700, 5.43s, 1.09s DATA, 601, 20090529 18:48:42, 5, TIME, DB time , 5433501, 1086700, 5.43s, 1.09s DATA, 601, 20090529 18:48:42, 5, WAIT, db file sequential read , 17812, 3562, 17.81ms, 3.56ms DATA, 601, 20090529 18:48:42, 5, WAIT, gc current request , -4221489, -844298, -4.22s, -844.3ms DATA, 601, 20090529 18:48:42, 5, WAIT, gc current block lost , 6634963, 1326993, 6.63s, 1.33s DATA, 601, 20090529 18:48:42, 5, WAIT, gc current grant 2-way , 254, 51, 254us, 50.8us DATA, 601, 20090529 18:48:42, 5, WAIT, events in waitclass Other , 600, 120, 600us, 120us -- End of snap 2 DATA, 601, 20090529 18:48:47, 5, STAT, session logical reads , 3, 1, 3, .6 DATA, 601, 20090529 18:48:47, 5, STAT, cluster wait time , 503, 101, 503, 100.6 DATA, 601, 20090529 18:48:47, 5, STAT, gcs messages sent , 1, 0, 1, .2 DATA, 601, 20090529 18:48:47, 5, STAT, db block gets , 3, 1, 3, .6 DATA, 601, 20090529 18:48:47, 5, STAT, db block gets from cache , 3, 1, 3, .6 DATA, 601, 20090529 18:48:47, 5, STAT, exchange deadlocks , 1, 0, 1, .2 DATA, 601, 20090529 18:48:47, 5, STAT, free buffer requested , 1, 0, 1, .2 DATA, 601, 20090529 18:48:47, 5, WAIT, gc current request , 1221742, 244348, 1.22s, 244.35ms DATA, 601, 20090529 18:48:47, 5, WAIT, gc current block lost , 3813435, 762687, 3.81s, 762.69ms DATA, 601, 20090529 18:48:47, 5, WAIT, events in waitclass Other , 609, 122, 609us, 121.8us -- End of snap 3 DATA, 601, 20090529 18:48:52, 5, STAT, session logical reads , 3, 1, 3, .6 DATA, 601, 20090529 18:48:52, 5, STAT, cluster wait time , 524, 105, 524, 104.8 DATA, 601, 20090529 18:48:52, 5, STAT, gcs messages sent , 1, 0, 1, .2 DATA, 601, 20090529 18:48:52, 5, STAT, db block gets , 3, 1, 3, .6 DATA, 601, 20090529 18:48:52, 5, STAT, db block gets from cache , 3, 1, 3, .6 DATA, 601, 20090529 18:48:52, 5, STAT, exchange deadlocks , 1, 0, 1, .2 DATA, 601, 20090529 18:48:52, 5, STAT, free buffer requested , 1, 0, 1, .2 DATA, 601, 20090529 18:48:52, 5, TIME, sql execute elapsed time , 6613095, 1322619, 6.61s, 1.32s DATA, 601, 20090529 18:48:52, 5, TIME, DB time , 6613095, 1322619, 6.61s, 1.32s DATA, 601, 20090529 18:48:52, 5, WAIT, gc current request , 3000843, 600169, 3s, 600.17ms DATA, 601, 20090529 18:48:52, 5, WAIT, gc current block lost , 4018577, 803715, 4.02s, 803.72ms DATA, 601, 20090529 18:48:52, 5, WAIT, events in waitclass Other , 570, 114, 570us, 114us -- End of snap 4 DATA, 601, 20090529 18:48:57, 5, STAT, session logical reads , 3, 1, 3, .6 DATA, 601, 20090529 18:48:57, 5, STAT, cluster wait time , 523, 105, 523, 104.6 DATA, 601, 20090529 18:48:57, 5, STAT, gcs messages sent , 1, 0, 1, .2 DATA, 601, 20090529 18:48:57, 5, STAT, db block gets , 3, 1, 3, .6 DATA, 601, 20090529 18:48:57, 5, STAT, db block gets from cache , 3, 1, 3, .6 DATA, 601, 20090529 18:48:57, 5, STAT, exchange deadlocks , 1, 0, 1, .2 DATA, 601, 20090529 18:48:57, 5, STAT, free buffer requested , 1, 0, 1, .2 DATA, 601, 20090529 18:48:57, 5, TIME, sql execute elapsed time , 5230170, 1046034, 5.23s, 1.05s DATA, 601, 20090529 18:48:57, 5, TIME, DB time , 5230170, 1046034, 5.23s, 1.05s DATA, 601, 20090529 18:48:57, 5, WAIT, gc current request , -473, -95, -473us, -94.6us DATA, 601, 20090529 18:48:57, 5, WAIT, gc current block lost , 5230137, 1046027, 5.23s, 1.05s DATA, 601, 20090529 18:48:57, 5, WAIT, events in waitclass Other , 390, 78, 390us, 78us -- End of snap 5 PL/SQL procedure successfully completed. |
.
Below is the SQL being executed for SID 601 which is a batch INSERT operation that is just a normal operation as per the DBA
SQL> @usql 601 HASH_VALUE CH# SQL_TEXT ---------- ---- ------------------------------------------------------------------------------------------------------------------------------------------------------ 6143262 0 INSERT INTO xxx( BATxxx,POLxxx,SRCxxx,PAYxxx,DATxxx,DATxxx,AMxxxx,ACCxxx,NEWxxx,POLxxx,FUNxxx,Uxxxxx,Txxxx,Cxxxx,PRxxxx,Exxxx,xxxxID,POxxxx,Nxxxx,xxxAMT,xxxPRM,xxxMAV ,COLxxxx,NOPxxx ) VALUES ( :b1,:b2,:b3,:b4,:b5,:b6,:b7,:b8,:b9,:b10,:b11,:b12,:b13,:b14,:b15,:b16,:b17,:b18,:b19,:b20,:b21,:b22,:b23,:b24 ) CH# PARENT_HANDLE OBJECT_HANDLE PARSES H_PARSES EXECUTIONS FETCHES ROWS_PROCESSED LIOS PIOS SORTS CPU_MS ELA_MS USERS_EXECUTING ---- ---------------- ---------------- ---------- ---------- ---------- ---------- -------------- ---------- ---------- ---------- ---------- ---------- --------------- 0 00000000D0A5FC68 00000000D0A5FA40 10 1 2714 0 2714 300753 17731 0 12136.359 5899798.3 1 SQL hash value: 6143262 Cursor address: 00000000D0A5FC68 | Statement first parsed at: 2009-05-29/13:01:42 | 20903 seconds ago |
.
Below is the SQL being executed for SID 483 which is a SELECT operation that is just a normal operation as per the DBA
SQL> @usql 483 HASH_VALUE CH# SQL_TEXT ---------- ---- ------------------------------------------------------------------------------------------------------------------------------------------------------ 449494677 0 SELECT x.xxxPxD,F.xxxNMx,B.xxxPxD,x.xxxNMx,D.PxLMxD,D.PxLFRQ,x.xxVxMT,x.xxVSTx,x.xxVxFF,x.xxVDxx,SxM(G.xxVMxN) MxDPRxM FRxM RxPMxS x,BxNMxS B,xxxMxS x,PxLMxS D,xxVMxS x,xxxMxS F,xxVMxS G WHxRx x.PxLNxM = NVL(:b1,'X') xND x.RxLTYP = 'PxY' xND (x.RxLPRx = 100 xR x.RxLPRx xS NxLL ) xND x.xxxPxD = F.xxxPxD xND B.PxLNxM (+) = NVL(:b2,'X') xND B.RxLTYP (+) = 'xNS' xND B.xxVNxM = 1 xND (B.RxLPxR = 100 xR B.RxLPxR xS NxLL ) xND B.xxxPxD = x.xxxPxD (+) xND D.PxLNxM = NVL(:b3,'X') xND x.PxLNxM = NVL(:b4,'X') xND x.xxVNxM = 1 xND G.PxLNxM = NVL(:b5,'X') GRxxP BY x.xxxPxD,F.xxxNMx,B.xxxPxD,x.xxxNMx,D.PxLMxD,D.PxLFRQ,x.xxVxMT,x.xxVSTx,x.xxVxFF,x.xxVDxx CH# PARENT_HANDLE OBJECT_HANDLE PARSES H_PARSES EXECUTIONS FETCHES ROWS_PROCESSED LIOS PIOS SORTS CPU_MS ELA_MS USERS_EXECUTING ---- ---------------- ---------------- ---------- ---------- ---------- ---------- -------------- ---------- ---------- ---------- ---------- ---------- --------------- 0 00000000C2E54DD0 00000000C2E5E700 8 1 1047 2089 1043 58793 3905 2087 2676.616 11108807.7 1 |
.
A query on V$SESSTAT will show the performance/wait counters being incremented (SID 483), the output below was a snap of the session for 5 seconds period done 10 times. You’ll see that “gc cr” waits are significant
SQL> @snapper out 5 10 483 -- Session Snapper v1.10 by Tanel Poder ( http://www.tanelpoder.com ) --------------------------------------------------------------------------------------------------------------------------------------------- HEAD, SID, SNAPSHOT START , SECONDS, TYPE, STATISTIC , DELTA, DELTA/SEC, HDELTA, HDELTA/SEC --------------------------------------------------------------------------------------------------------------------------------------------- DATA, 483, 20090529 19:00:17, 5, STAT, cluster wait time , 527, 105, 527, 105.4 DATA, 483, 20090529 19:00:17, 5, STAT, gcs messages sent , 1, 0, 1, .2 DATA, 483, 20090529 19:00:17, 5, STAT, free buffer requested , 1, 0, 1, .2 DATA, 483, 20090529 19:00:17, 5, WAIT, gc cr request , -1220551, -244110, -1.22s, -244.11ms DATA, 483, 20090529 19:00:17, 5, WAIT, gc cr block lost , 6485830, 1297166, 6.49s, 1.3s DATA, 483, 20090529 19:00:17, 5, WAIT, events in waitclass Other , 530, 106, 530us, 106us -- End of snap 1 DATA, 483, 20090529 19:00:22, 5, STAT, session logical reads , 4, 1, 4, .8 DATA, 483, 20090529 19:00:22, 5, STAT, cluster wait time , 454, 91, 454, 90.8 DATA, 483, 20090529 19:00:22, 5, STAT, messages sent , 1, 0, 1, .2 DATA, 483, 20090529 19:00:22, 5, STAT, gcs messages sent , 2, 0, 2, .4 DATA, 483, 20090529 19:00:22, 5, STAT, consistent gets , 4, 1, 4, .8 DATA, 483, 20090529 19:00:22, 5, STAT, consistent gets from cache , 4, 1, 4, .8 DATA, 483, 20090529 19:00:22, 5, STAT, consistent gets - examination , 3, 1, 3, .6 DATA, 483, 20090529 19:00:22, 5, STAT, free buffer requested , 2, 0, 2, .4 DATA, 483, 20090529 19:00:22, 5, STAT, shared hash latch upgrades - no wait , 1, 0, 1, .2 DATA, 483, 20090529 19:00:22, 5, STAT, gc current blocks received , 1, 0, 1, .2 DATA, 483, 20090529 19:00:22, 5, STAT, no work - consistent read gets , 1, 0, 1, .2 DATA, 483, 20090529 19:00:22, 5, STAT, table fetch by rowid , 1, 0, 1, .2 DATA, 483, 20090529 19:00:22, 5, STAT, index fetch by key , 1, 0, 1, .2 DATA, 483, 20090529 19:00:22, 5, STAT, buffer is not pinned count , 2, 0, 2, .4 DATA, 483, 20090529 19:00:22, 5, WAIT, gc cr request , -8443706, -1688741, -8.44s, -1.69s DATA, 483, 20090529 19:00:22, 5, WAIT, gc cr block lost , 6986370, 1397274, 6.99s, 1.4s DATA, 483, 20090529 19:00:22, 5, WAIT, gc current block 2-way , 596, 119, 596us, 119.2us DATA, 483, 20090529 19:00:22, 5, WAIT, events in waitclass Other , 266, 53, 266us, 53.2us -- End of snap 2 DATA, 483, 20090529 19:00:27, 5, STAT, cluster wait time , 506, 101, 506, 101.2 DATA, 483, 20090529 19:00:27, 5, STAT, gcs messages sent , 1, 0, 1, .2 DATA, 483, 20090529 19:00:27, 5, STAT, free buffer requested , 1, 0, 1, .2 -- End of snap 3 DATA, 483, 20090529 19:00:32, 5, STAT, cluster wait time , 488, 98, 488, 97.6 DATA, 483, 20090529 19:00:32, 5, WAIT, gc cr request , 9888169, 1977634, 9.89s, 1.98s -- End of snap 4 DATA, 483, 20090529 19:00:37, 5, STAT, cluster wait time , 548, 110, 548, 109.6 DATA, 483, 20090529 19:00:37, 5, STAT, gcs messages sent , 1, 0, 1, .2 DATA, 483, 20090529 19:00:37, 5, STAT, free buffer requested , 1, 0, 1, .2 DATA, 483, 20090529 19:00:37, 5, WAIT, gc cr request , -4444128, -888826, -4.44s, -888.83ms DATA, 483, 20090529 19:00:37, 5, WAIT, gc cr block lost , 7920880, 1584176, 7.92s, 1.58s DATA, 483, 20090529 19:00:37, 5, WAIT, events in waitclass Other , 341, 68, 341us, 68.2us -- End of snap 5 DATA, 483, 20090529 19:00:42, 5, STAT, cluster wait time , 494, 99, 494, 98.8 DATA, 483, 20090529 19:00:42, 5, STAT, gcs messages sent , 1, 0, 1, .2 DATA, 483, 20090529 19:00:42, 5, STAT, free buffer requested , 1, 0, 1, .2 DATA, 483, 20090529 19:00:42, 5, WAIT, gc cr request , 883591, 176718, 883.59ms, 176.72ms DATA, 483, 20090529 19:00:42, 5, WAIT, gc cr block lost , 4055130, 811026, 4.06s, 811.03ms DATA, 483, 20090529 19:00:42, 5, WAIT, events in waitclass Other , 337, 67, 337us, 67.4us -- End of snap 6 DATA, 483, 20090529 19:00:47, 6, STAT, cluster wait time , 448, 75, 448, 74.67 DATA, 483, 20090529 19:00:47, 6, STAT, gcs messages sent , 1, 0, 1, .17 DATA, 483, 20090529 19:00:47, 6, STAT, free buffer requested , 1, 0, 1, .17 DATA, 483, 20090529 19:00:47, 6, WAIT, gc cr request , -2105578, -350930, -2.11s, -350.93ms DATA, 483, 20090529 19:00:47, 6, WAIT, gc cr block lost , 6590281, 1098380, 6.59s, 1.1s DATA, 483, 20090529 19:00:47, 6, WAIT, events in waitclass Other , 496, 83, 496us, 82.67us -- End of snap 7 DATA, 483, 20090529 19:00:53, 5, STAT, cluster wait time , 547, 109, 547, 109.4 DATA, 483, 20090529 19:00:53, 5, STAT, gcs messages sent , 1, 0, 1, .2 DATA, 483, 20090529 19:00:53, 5, STAT, free buffer requested , 1, 0, 1, .2 -- End of snap 8 DATA, 483, 20090529 19:00:58, 5, STAT, cluster wait time , 528, 106, 528, 105.6 DATA, 483, 20090529 19:00:58, 5, STAT, gcs messages sent , 1, 0, 1, .2 DATA, 483, 20090529 19:00:58, 5, STAT, free buffer requested , 1, 0, 1, .2 DATA, 483, 20090529 19:00:58, 5, WAIT, gc cr block lost , 5286330, 1057266, 5.29s, 1.06s DATA, 483, 20090529 19:00:58, 5, WAIT, events in waitclass Other , 636, 127, 636us, 127.2us -- End of snap 9 DATA, 483, 20090529 19:01:03, 5, STAT, cluster wait time , 489, 98, 489, 97.8 DATA, 483, 20090529 19:01:03, 5, WAIT, gc cr request , 10886698, 2177340, 10.89s, 2.18s -- End of snap 10 PL/SQL procedure successfully completed. |
.
Another graph of DBA_HIST_ACTIVE_SESS_HISTORY will show that the server is suffering from “gc cr block lost” and “gc cr multi block request”. Again, ADDM and AWR reports also support this graph
Hmm… I already have sufficient info that on the instance and session level, I’m getting waits that point me to diagnose and look into the network interconnect. Analyzing the “netstat -s” output from OS Watcher shows that there are many errors on the TCP and UDP packets
$ cat server1_netstat.dat | egrep -i "udpInOverflows|packet receive errors|fragments dropped|reassembles failed|fragments dropped after timeout" 34096 fragments dropped after timeout 306030 packet reassembles failed 15 packet receive errors 34096 fragments dropped after timeout 306268 packet reassembles failed 15 packet receive errors 34096 fragments dropped after timeout 306574 packet reassembles failed 15 packet receive errors 34096 fragments dropped after timeout 306930 packet reassembles failed 15 packet receive errors 34096 fragments dropped after timeout … output snipped … 15 packet receive errors 34096 fragments dropped after timeout 328087 packet reassembles failed 15 packet receive errors 34096 fragments dropped after timeout 328493 packet reassembles failed 15 packet receive errors 34096 fragments dropped after timeout 328889 packet reassembles failed 15 packet receive errors |
.
We quickly advised the network team to check on the interconnect switch for any issues and if possible replace it with a new unit to see if it solves the performance problem.
Then the DBA scheduled a downtime, and at first we tried restarting just the switch if it solves the problem. Then we make them run the applications again, but still there is a performance problem.
Then we shutdown again and the network team replaced the interconnect switch with a new unit. Then startup, and run the applications again…..
tentenenenen tenen!!!
…… they got fast :D !!!
Another check on the “netstat -s” shows no errors!
karao@karl:~/Desktop$ cat karlarao.dat | egrep -i "udpInOverflows|packet receive errors|fragments dropped|reassembles failed|fragments dropped after timeout" 0 packet receive errors 0 packet receive errors 0 packet receive errors 0 packet receive errors 0 packet receive errors 0 packet receive errors 0 packet receive errors 0 packet receive errors 0 packet receive errors 0 packet receive errors 0 packet receive errors |
.
Another graph of DBA_HIST_ACTIVE_SESS_HISTORY shows that the “gc block lost” waits are gone :D
Update:
————–
I was also able to share and present this performance scenario on one of our DBA round tables locally (at the Oracle office)..
– Performance Scenario – Sudden slow down on RAC (pdf) – Oracle Philippines DBA round table, 07/21/2009
