Oracle诊断案例:Job任务停止执行


摘要:

本文通过一次Oracle Job任务异常案例诊断,分析其原因及解决过程,从内部揭示Oracle Job任务调度及内部计时机制。

问题及环境

接到研发人员报告,数据库定时任务未正常执行,导致某些操作失败。

开始介入处理该事故.

系统环境:

SunOS DB 5.8 Generic_108528-21 sun4u sparc SUNW,Ultra-4 Oracle9i Enterprise Edition Release 9.2.0.3.0 - Production
解决过程

首先介入检查数据库任务

` $ sqlplus "/ as sysdba"
SQL*Plus: Release 9.2.0.3.0 - Production on Wed Nov 17 20:23:53 2004
Copyright (c) 1982, 2002, Oracle Corporation. All rights reserved.
Connected to:
Oracle9i Enterprise Edition Release 9.2.0.3.0 - Production
With the Partitioning, OLAP and Oracle Data Mining options
JServer Release 9.2.0.3.0 - Production
SQL> select job,last_date,last_sec,next_date,next_sec,broken,failures from
dba_jobs;
JOB LAST_DATE LAST_SEC NEXT_DATE NEXT_SEC B FAILURES
INTERVAL



31 16-NOV-04 01:00:02 17-NOV-04 01:00:00 N 0
trunc(sysdate+1)+1/24
27 16-NOV-04 00:00:04 17-NOV-04 00:00:00 N 0
TRUNC(SYSDATE) + 1
35 16-NOV-04 01:00:02 17-NOV-04 01:00:00 N 0
trunc(sysdate+1)+1/24
29 16-NOV-04 00:00:04 17-NOV-04 00:00:00 N 0
TRUNC(SYSDATE) + 1
30 01-NOV-04 06:00:01 01-DEC-04 06:00:00 N 0
trunc(add_months(sysdate,1),’MM’)+6/24
65 16-NOV-04 04:00:03 17-NOV-04 04:00:00 N 0
trunc(sysdate+1)+4/24
46 16-NOV-04 02:14:27 17-NOV-04 02:14:27 N 0
sysdate+1
66 16-NOV-04 03:00:02 17-NOV-04 18:14:49 N 0
trunc(sysdate+1)+3/24
8 rows selected. `
发现JOB任务是都没有正常执行,最早一个应该在17-NOV-04 01:00:00执行。但是没有执行。

建立测试JOB

` create or replace PROCEDURE pining

IS
BEGIN
NULL;
END;
/
variable jobno number;
variable instno number;
begin
select instance_number into :instno from v$instance;
dbms_job.submit(:jobno, ’pining;’, trunc(sysdate+1/288,’MI’),
’trunc(SYSDATE+1/288,’’MI’’)’, TRUE, :instno);
end;
/ `
发现同样的,不执行。

但是通过dbms_job.run(

  1<job>)执行没有任何问题。   
  2  
  3进行恢复尝试   
  4  
  5怀疑是CJQ0进程失效,首先设置JOB_QUEUE_PROCESSES为0,Oracle会杀掉CJQ0及相应job进程   
  6  
  7` SQL&gt; ALTER SYSTEM SET JOB_QUEUE_PROCESSES = 0; `   
  8等2~3分钟,重新设置   
  9  
 10` SQL&gt; ALTER SYSTEM SET JOB_QUEUE_PROCESSES = 5; `   
 11此时PMON会重起CJQ0进程   
 12  
 13` Thu Nov 18 11:59:50 2004   
 14  
 15ALTER SYSTEM SET job_queue_processes=0 SCOPE=MEMORY;   
 16Thu Nov 18 12:01:30 2004   
 17ALTER SYSTEM SET job_queue_processes=10 SCOPE=MEMORY;   
 18Thu Nov 18 12:01:30 2004   
 19Restarting dead background process CJQ0   
 20CJQ0 started with pid=8   
 21但是Job仍然不执行,而且在再次修改的时候,CJQ0直接死掉了。   
 22Thu Nov 18 13:52:05 2004   
 23ALTER SYSTEM SET job_queue_processes=0 SCOPE=MEMORY;   
 24Thu Nov 18 14:09:30 2004   
 25ALTER SYSTEM SET job_queue_processes=10 SCOPE=MEMORY;   
 26Thu Nov 18 14:10:27 2004   
 27ALTER SYSTEM SET job_queue_processes=0 SCOPE=MEMORY;   
 28Thu Nov 18 14:10:42 2004   
 29ALTER SYSTEM SET job_queue_processes=10 SCOPE=MEMORY;   
 30Thu Nov 18 14:31:07 2004   
 31ALTER SYSTEM SET job_queue_processes=0 SCOPE=MEMORY;   
 32Thu Nov 18 14:40:14 2004   
 33ALTER SYSTEM SET job_queue_processes=10 SCOPE=MEMORY;   
 34Thu Nov 18 14:40:28 2004   
 35ALTER SYSTEM SET job_queue_processes=0 SCOPE=MEMORY;   
 36Thu Nov 18 14:40:33 2004   
 37ALTER SYSTEM SET job_queue_processes=1 SCOPE=MEMORY;   
 38Thu Nov 18 14:40:40 2004   
 39ALTER SYSTEM SET job_queue_processes=10 SCOPE=MEMORY;   
 40Thu Nov 18 15:00:42 2004   
 41ALTER SYSTEM SET job_queue_processes=0 SCOPE=MEMORY;   
 42  
 43Thu Nov 18 15:01:36 2004   
 44ALTER SYSTEM SET job_queue_processes=15 SCOPE=MEMORY; `   
 45尝试重起数据库,这个必须在晚上进行:   
 46  
 47` PMON started with pid=2   
 48DBW0 started with pid=3   
 49LGWR started with pid=4   
 50CKPT started with pid=5   
 51SMON started with pid=6   
 52RECO started with pid=7   
 53CJQ0 started with pid=8   
 54QMN0 started with pid=9   
 55.... `   
 56CJQ0正常启动,但是Job仍然不执行。   
 57  
 58没办法了...   
 59  
 60继续研究...居然发现Oralce有这样一个bug   
 61  
 62` 1. Clear description of the problem encountered:   
 63slgcsf() / slgcs() on Solaris will stop incrementing after   
 64497 days 2 hrs 28 mins (approx) machine uptime.   
 65  
 662. Pertinent configuration information   
 67No special configuration other than long machine uptime. .   
 683. Indication of the frequency and predictability of the problem   
 69100% but only after 497 days.   
 704. Sequence of events leading to the problem   
 71If the gethrtime() OS call returns a value &gt; 42949672950000000   
 72nanoseconds then slgcs() stays at 0xffffffff. This can   
 73cause some problems in parts of the code which rely on   
 74slgcs() to keep moving.   
 75eg: In kkjssrh() does "now = slgcs(&amp;se)" and compares that   
 76to a previous timestamp. After 497 days uptime slgcs()   
 77keeps returning 0xffffffff so "now - kkjlsrt" will   
 78always return 0. .   
 795. Technical impact on the customer. Include persistent after effects.   
 80In this case DBMS JOBS stopped running after 497 days uptime.   
 81Other symptoms could occur in various places in the code. `   
 82好么,原来是计时器溢出了,一检查我的主机:   
 83  
 84` bash-2.03$ uptime   
 8510:00pm up 500 day(s), 14:57, 1 user, load average: 1.31, 1.09, 1.08   
 86bash-2.03$ date   
 87Fri Nov 19 22:00:14 CST 2004 `   
 88刚好到事发时是497天多一点.ft.   
 89  
 90安排重起主机系统..   
 91  
 92这个问题够郁闷的,NND,谁曾想Oracle这都成...   
 93  
 94Oracle最后声称:   
 95  
 96fix made it into 9.2.0.6 patchset   
 97  
 98在Solaris上的9206尚未发布...晕.   
 99  
100好了,就当是个经历吧,如果有问题非常不可思议的话,那么大胆怀疑Oracle吧,是Bug,可能就是Bug。   
101  
102重起以后问题解决,状态如下:   
103  
104` $ sqlplus "/ as sysdba"   
105SQL*Plus: Release 9.2.0.3.0 - Production on Fri Nov 26 09:21:21 2004   
106Copyright (c) 1982, 2002, Oracle Corporation. All rights reserved.   
107Connected to:   
108Oracle9i Enterprise Edition Release 9.2.0.3.0 - Production   
109With the Partitioning, OLAP and Oracle Data Mining options   
110JServer Release 9.2.0.3.0 - Production   
111SQL&gt; select job,last_date,last_sec,next_date,next_sec from user_jobs;   
112  
113JOB LAST_DATE LAST_SEC NEXT_DATE NEXT_SEC   
114---------- --------- ---------------- --------- ----------------   
11570 26-NOV-04 09:21:04 26-NOV-04 09:26:00   
116SQL&gt; /   
117JOB LAST_DATE LAST_SEC NEXT_DATE NEXT_SEC   
118---------- --------- ---------------- --------- ----------------   
11970 26-NOV-04 09:26:01 26-NOV-04 09:31:00   
120SQL&gt;   
121SQL&gt; select * from v$timer;   
122HSECS   
123----------   
1243388153   
125SQL&gt; select * from v$timer;   
126HSECS   
127----------   
1283388319   
129SQL&gt;   
130FAQ `</job>
Published At
Categories with 数据库类
comments powered by Disqus