Support Article
Agent fails when database becomes unavailable during failover
SA-14519
Summary
When PRPC is pointed at a database cluster in which one node fails over to another, any agent that executes during the failover window (usually a few seconds) enters an inconsistent state and must be shut down manually. To confirm this, open the Agents page of the SMA and check the "next run time" column, which is blank for the affected agent. The agent will not run again until it is manually stopped and restarted or the application server is bounced.
Error Messages
2015-09-12 16:46:46,916 [j2ee14_ws,maxpri=10]] [ STANDARD] [B727BE5F93BB4BDA00D0FC1A42BE498D1] [ Your_App:02.01.01] ( internal.access.DatabaseImpl) ERROR - There was a problem with the database when getting a list:
com.pega.pegarules.pub.database.DatabaseException: Database-General Problem encountered when getting connection for database pegarules 12514 66000 Listener refused the connection with the following error:
ORA-12514, TNS:listener does not currently know of service requested in connect descriptor
DSRA0010E: SQL State = 66000, Error Code = 12,514
From: (B727BE5F93BB4BDA00D0FC1A42BE498D1)
Caused by SQL Problems.
Problem #1, SQLState 66000, Error code 12514: java.sql.SQLException: Listener refused the connection with the following error:
ORA-12514, TNS:listener does not currently know of service requested in connect descriptor
DSRA0010E: SQL State = 66000, Error Code = 12,514
Steps to Reproduce
- Set up an agent to run every 5 seconds.
- Take the database down for 10 seconds.
- Notice that the next run time for the agent is blank and that the agent does not run again until it is restarted.
Root Cause
A defect or configuration issue in the operating environment.
PRPC expects datasource connections to be usable by the time the application server's connection manager assigns them. If a connection is not usable and encounters an exception, the agents may enter an inconsistent state, as in this scenario. From the application's perspective, this means that the database failover is not working correctly. The duration of the outage (3 seconds or otherwise) is immaterial.
The WebSphere application server has robust mechanisms to test datasource connections before they are assigned to an application resource (and before they are returned to the pool). Robust database connection management, security, and performance are among the “value adds” of an application server. Adding connection testing configuration is appropriate for this failover scenario and takes care of the failover issues observed. The key point is that the management code that ensures a successful failover should be kept in the application server, the JDBC driver, and the database engine itself. Adding this kind of connection management and testing logic to the Pega application would be a product enhancement that duplicates logic already present at other levels of the stack.
Resolution
Add connection testing and purging to the datasource configuration in the application server.
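To make "connection testing and purging" concrete, the following is a minimal conceptual sketch in Java of what a pool does when both are enabled. It is not WebSphere or PRPC code: the class names ValidatingPool and ValidatingPoolDemo, their methods, and the choice to discard every idle connection after one failed test are all assumptions made for illustration. In a real deployment this behavior is configured on the datasource in the application server rather than written by hand.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;
import java.util.function.Predicate;
import java.util.function.Supplier;

/**
 * Hypothetical illustration of "test before assign, purge on failure".
 * Every name in this class is invented for the sketch.
 */
final class ValidatingPool<T> {
    private final Deque<T> idle = new ArrayDeque<>();
    private final Supplier<T> factory;   // opens a brand-new connection
    private final Predicate<T> isUsable; // for java.sql.Connection this could wrap Connection.isValid(timeoutSeconds)
    private int purged = 0;

    ValidatingPool(Supplier<T> factory, Predicate<T> isUsable) {
        this.factory = factory;
        this.isUsable = isUsable;
    }

    /** Returns a connection to the idle pool without testing it. */
    synchronized void release(T connection) {
        idle.push(connection);
    }

    /** Hands out only connections that pass the test; stale ones are discarded. */
    synchronized T borrow() {
        while (!idle.isEmpty()) {
            T candidate = idle.pop();
            if (isUsable.test(candidate)) {
                return candidate;
            }
            // Assumption of this sketch: one failed test after a failover means the
            // remaining idle connections are stale too, so drop all of them.
            purged += 1 + idle.size();
            idle.clear();
        }
        T fresh = factory.get();
        if (!isUsable.test(fresh)) {
            throw new IllegalStateException("database still unavailable");
        }
        return fresh;
    }

    /** Number of connections discarded so far. */
    synchronized int purgedCount() {
        return purged;
    }
}

final class ValidatingPoolDemo {
    public static void main(String[] args) {
        // Strings stand in for real connections; this set marks the ones "killed" by a failover.
        Set<String> deadConnections = new HashSet<>();
        int[] opened = {0};
        ValidatingPool<String> pool = new ValidatingPool<>(
                () -> "conn-" + (++opened[0]),
                connection -> !deadConnections.contains(connection));

        String first = pool.borrow();
        pool.release(first);
        deadConnections.add(first); // simulate the failover
        System.out.println(pool.borrow() + ", purged=" + pool.purgedCount());
    }
}
```

The point of the sketch is that the caller (here, the agent) never receives a connection that fails the test, which is the expectation PRPC has of the application server's connection manager.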
Published October 1, 2015 - Updated October 8, 2020