Troubleshooting

Cluster Reliability in Unreliable Network Environment

Data Shaper Server instances must cooperate with each other to form a Cluster. If the connection between nodes doesn’t work at all, or is not configured, the Cluster can’t work properly. This chapter describes the behavior of Cluster nodes in an environment where the connection between nodes is unreliable.

Nodes use three channels to exchange status information and data:

  1. synchronous calls (via HTTP/HTTPS)
    Typically, NodeA requests some operation on NodeB, e.g. a job execution. HTTP/HTTPS is also used for streaming data between workers of a parallel execution.
  2. asynchronous messaging (TCP connection on port 7800 by default)
    Typically heart-beats or events, e.g. a job started or finished.
  3. shared database – each node must be able to create a DB connection
    Shared configuration data, execution history, etc.

The following scenarios are described below one by one; however, they may occur together:

  • NodeA Cannot Establish HTTP Connection to NodeB
  • NodeA Cannot Establish TCP Connection (Port 7800 by Default) to NodeB
  • NodeB is Killed or It Cannot Connect to the Database
  • Node Cannot Access the Sandboxes Home Directory
  • Auto-Resuming in Unreliable Network
  • Long-Term Network Malfunction May Cause Jobs to Hang

NodeA Cannot Establish HTTP Connection to NodeB

When an HTTP connection can’t be established between nodes, jobs which are delegated between nodes, or jobs running in parallel on multiple nodes, will fail. The error is visible in the Execution History. Each node periodically executes a check-task which checks the HTTP connection to the other nodes. If a problem is detected, one of the nodes is suspended, since the nodes can’t cooperate with each other.

Time-line describing the scenario:

  • 0s - the network connection between NodeA and NodeB goes down;
  • 0-40s - a check-task running on NodeA can’t establish an HTTP connection to NodeB; the check may last for up to 30s until it times out; there is no retry: if the connection fails even once, it is considered unreliable and the nodes can’t cooperate;
  • the status of NodeA or NodeB (the one with the shorter uptime) is changed to suspended.

The following configuration properties set the time intervals mentioned above:

  • cluster.node.check.checkMinInterval
    Periodicity of Cluster node checks, in milliseconds.
    Default: 20000
  • cluster.sync.connection.readTimeout
    An HTTP connection response timeout, in milliseconds.
    Default: 90000
  • cluster.sync.connection.connectTimeout
    The timeout for establishing an HTTP connection, in milliseconds.
    Default: 7000
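
For example, in an environment with a slow but otherwise functional network, the HTTP check can be made more tolerant by raising these timeouts in the Server configuration properties (shown here in simple key=value form). This is only an illustrative sketch, not a recommended setting:

  # allow up to 15 s to establish the HTTP connection (default is 7 s)
  cluster.sync.connection.connectTimeout=15000
  # allow up to 120 s for the HTTP response (default is 90 s)
  cluster.sync.connection.readTimeout=120000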

NodeA Cannot Establish TCP Connection (Port 7800 by Default) to NodeB

The TCP connection is used for asynchronous messaging. When NodeB can’t send or receive asynchronous messages, the other nodes aren’t notified about started/finished jobs, so a parent flow running on NodeA keeps waiting for the event from NodeB. The heart-beat is vital for meaningful load balancing; the same check-task mentioned above also checks the heart-beat from all Cluster nodes.

Time-line describing the scenario:

  • 0s - the network connection between NodeA and NodeB is down;
  • 60s - NodeA uses the last available NodeB heart-beat;
  • 0-40s - a check-task running on NodeA detects the missing heart-beat from NodeB;
  • the status of NodeA or NodeB (the one with shorter uptime) is changed to suspended.

The following configuration properties set the time intervals mentioned above:

  • cluster.node.check.checkMinInterval
    The periodicity of Cluster node checks, in milliseconds.
    Default: 40000
  • cluster.node.sendinfo.interval
    The periodicity of heart-beat messages, in milliseconds.
    Default: 2000
  • cluster.node.sendinfo.min_interval
    A heart-beat may occasionally be sent more often than specified by cluster.node.sendinfo.interval. This property specifies the minimum interval in milliseconds.
    Default: 500
  • cluster.node.remove.interval
    The maximum interval for missing a heart-beat, in milliseconds.
    Default: 50000
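
The heart-beat intervals should stay consistent with each other: cluster.node.remove.interval must be comfortably larger than cluster.node.sendinfo.interval, otherwise a healthy node could be treated as missing its heart-beat. The following sketch only illustrates such a consistent setup with a more tolerant removal interval; the values are not recommendations:

  # send a heart-beat every 2 s (default), but never more often than every 0.5 s (default)
  cluster.node.sendinfo.interval=2000
  cluster.node.sendinfo.min_interval=500
  # tolerate up to 90 s without a heart-beat (default is 50 s)
  cluster.node.remove.interval=90000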

NodeB is Killed or It Cannot Connect to the Database

Access to the database is vital for running jobs, running the scheduler and cooperating with other nodes. Touching the database is also used for the detection of a dead process. When the JVM process of NodeB is killed, it stops touching the database and the other nodes can detect it.

Time-line describing the scenario:

  • 0s-30s - the last touch on DB;
  • NodeB or its connection to the database is down;
  • 90s - NodeA sees the last touch.
  • 0-40s - a check-task running on NodeA detects an obsolete touch from NodeB;
  • the status of NodeB is changed to stopped and jobs running on NodeB are solved, which means that their status is changed to UNKNOWN and the event is dispatched among the Cluster nodes. The job result is considered an error.

The following configuration properties set the time intervals mentioned above:

  • cluster.node.touch.interval
    The periodicity of a database touch, in milliseconds.
    Default: 20000
  • cluster.node.touch.forced_stop.interval
    The interval during which the other nodes still accept the last touch, in milliseconds.
    Default: 60000
  • cluster.node.check.checkMinInterval
    The periodicity of Cluster node checks, in milliseconds.
    Default: 40000
  • cluster.node.touch.forced_stop.solve_running_jobs.enabled
    A boolean value which enables or disables the solving of running jobs mentioned above.
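
As a sketch of tuning this detection, the forced_stop interval must be noticeably longer than the touch interval, otherwise a node could be treated as dead just because a single touch was delayed. The values below are illustrative only:

  # touch the database every 20 s (default)
  cluster.node.touch.interval=20000
  # treat a node as dead when its last touch is older than 120 s (default is 60 s)
  cluster.node.touch.forced_stop.interval=120000
  # keep the solving of jobs running on a dead node enabled
  cluster.node.touch.forced_stop.solve_running_jobs.enabled=true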

Node Cannot Access the Sandboxes Home Directory

The sandboxes home directory is the place where shared sandboxes are located (configured by the sandboxes.home server property). The directory can be on a local or a network file system. If the directory is not accessible, the node cannot work correctly (e.g. jobs cannot be executed), so the affected node must be suspended to prevent jobs from being sent to it.

The suspended node can be resumed when the directory is accessible again; see the Auto-Resuming in Unreliable Network section.

Time-line describing the scenario:

  • the sandboxes home is connected to a remote file system;
  • the connection to the file system is lost;
  • a periodic check is executed, trying to access the directory;
  • if the check fails, the node is suspended.

The following configuration properties set the time intervals mentioned above:

  • sandboxes.home.check.checkMinInterval
    Periodicity of sandboxes home checks, in milliseconds.
    Default: 20000
  • sandboxes.home.check.filewrite.timeout
    The timeout for accessing the sandboxes home, in milliseconds.
    Default: 600000

"Be careful!"
Setting the timeout value too low might force the node under a heavy load to suspend even if the sandboxes home is actually available.
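
For illustration only, a node accessing the sandboxes home over a slower network file system might use a longer write timeout rather than a shorter one, in line with the warning above:

  # check the sandboxes home every 20 s (default)
  sandboxes.home.check.checkMinInterval=20000
  # allow up to 15 minutes for the test write before suspending the node (default is 10 minutes)
  sandboxes.home.check.filewrite.timeout=900000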

Auto-Resuming in Unreliable Network

In version 4.4, auto-resuming of suspended nodes was introduced.

Time-line describing the scenario:

  • NodeB is suspended after connection loss
  • 0s - NodeA successfully reestablishes the connection to NodeB;
  • 120s - NodeA changes the NodeB status to forced_resume;
  • NodeB attempts to resume itself if the maximum auto-resume count is not reached;
  • If the connection is lost again, the cycle repeats; if the maximum auto-resume count is exceeded, the node will remain suspended until the counter is reset, to prevent suspend-resume cycles.
  • 240m - the auto-resume counter is reset.

The following configuration properties set the time intervals mentioned above:

  • cluster.node.check.intervalBeforeAutoresume
    Time a node has to be accessible to be forcibly resumed, in milliseconds.
    Default: 120000
  • cluster.node.check.maxAutoresumeCount
    How many times a node may try to auto-resume itself.
    Default: 3
  • cluster.node.check.intervalResetAutoresumeCount
    Time before the auto-resume counter will be reset, in minutes.
    Default: 240
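
A sketch of a more tolerant auto-resume setup for an unstable network; the values are illustrative only. Note that intervalResetAutoresumeCount is given in minutes, unlike most of the other intervals:

  # the node must be reachable for 5 minutes before it is forcibly resumed (default is 2 minutes)
  cluster.node.check.intervalBeforeAutoresume=300000
  # allow up to 5 auto-resume attempts (default is 3)
  cluster.node.check.maxAutoresumeCount=5
  # reset the auto-resume counter after 8 hours (default is 4 hours)
  cluster.node.check.intervalResetAutoresumeCount=480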

Long-Term Network Malfunction May Cause Jobs to Hang

A master execution which runs child jobs on other Cluster nodes must be notified about status changes of its child jobs. When the asynchronous messaging doesn’t work, events from the child jobs aren’t delivered, so the parent jobs keep running. When the network starts working again, the child job events may be re-transmitted, so hung parent jobs may still finish. However, the network malfunction may last so long that the events can no longer be re-transmitted.

See the following time-line to consider a proper configuration:

  • job A running on NodeA executes job B running on NodeB;
  • the network between NodeA and NodeB goes down for some reason;
  • job B finishes and sends the finished event; however, it can’t be delivered to NodeA, so the event is stored in the sent events buffer;
  • since the network is down, heart-beats can’t be delivered either and HTTP connections may not be established, so the Cluster reacts as described in the sections above. Even though the nodes may be suspended, parent job A keeps waiting for the event from job B;
  • now, there are three possibilities:
    • The network finally starts working and, since all undelivered events are held in the sent events buffer, they are re-transmitted and finally delivered. Parent job A is notified and proceeds. It may still fail later, since some Cluster nodes may be suspended.
    • The network finally starts working, but the number of events sent during the malfunction has exceeded the size limit of the sent events buffer, so some messages are lost and won’t be re-transmitted. The buffer size limit should therefore be raised in an environment with an unreliable network. The default limit is 10,000 events; since each job execution produces at least 3 events (job started, phase finished, job finished), this is sufficient for roughly 3,000 simple job executions, and in general it depends on the number of job phases. Note that other events are also fired occasionally (configuration changes, suspending, resuming, cache invalidation), and that the messaging layer itself stores its own messages in the buffer, but their number is negligible (tens of messages per hour). The heart-beat is not stored in the buffer. There is also an inbound events buffer used as temporary storage, so that events can be delivered in the correct order when some events can’t be delivered at the moment. When a Cluster node is inaccessible, its inbound buffer is released after a timeout, which is set to 1 hour by default.
    • NodeB is restarted, so all undelivered events in the buffer are lost.

The following configuration properties set the time intervals mentioned above:

  • cluster.jgroups.protocol.NAKACK.gc_lag
    Limits the size of the sent events buffer. Note that each stored message takes 2 kB of heap memory.
    Default: 10000
  • cluster.jgroups.protocol.NAKACK.xmit_table_obsolete_member_timeout
    The timeout after which the inbound events buffer of an inaccessible Cluster node is released.
    Default: 1 hour
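
For example, to better survive longer outages, the sent events buffer can be enlarged; keep in mind the memory cost mentioned above (2 kB per stored message, i.e. about 100 MB for 50,000 events). The values below are illustrative only and assume that the inbound buffer timeout is given in milliseconds, like the other intervals:

  # keep up to 50,000 undelivered events (default is 10,000)
  cluster.jgroups.protocol.NAKACK.gc_lag=50000
  # release the inbound buffer of an inaccessible node after 2 hours (default is 1 hour)
  cluster.jgroups.protocol.NAKACK.xmit_table_obsolete_member_timeout=7200000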