Detection of unresponsive Engine Nodes

In a multi-node system, it is important to be able to detect nodes that have become unresponsive in order to recover any interrupted tasks.

Each EngineNode has a lastActive field, which is updated each time the EngineHealthBean runs. One of the engine nodes, the master node, regularly checks the lastActive fields of all nodes. If a node's lastActive value is older than the defined timeout, that node is tagged as NOT_RESPONDING.
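The check performed by the master node can be illustrated with a minimal sketch. This is not the actual EngineHealthBean code: the NodeRecord and UnresponsiveNodeCheck classes, the "ACTIVE" status name, and the method names are hypothetical. Only the lastActive field and the NOT_RESPONDING tag come from the description above.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for an EngineNode; only lastActive and the
// NOT_RESPONDING tag are taken from the description above.
class NodeRecord {
    final String name;
    Instant lastActive;
    String status = "ACTIVE"; // assumed name for the healthy state

    NodeRecord(String name, Instant lastActive) {
        this.name = name;
        this.lastActive = lastActive;
    }
}

class UnresponsiveNodeCheck {
    // What a node's health task does on every run: refresh its own timestamp.
    static void heartbeat(NodeRecord self, Instant now) {
        self.lastActive = now;
    }

    // What the master does: tag every node whose lastActive is older than the timeout.
    static List<NodeRecord> tagUnresponsive(List<NodeRecord> nodes, Instant now, Duration timeout) {
        List<NodeRecord> tagged = new ArrayList<>();
        for (NodeRecord node : nodes) {
            Duration silence = Duration.between(node.lastActive, now);
            if (silence.compareTo(timeout) > 0) {
                node.status = "NOT_RESPONDING";
                tagged.add(node);
            }
        }
        return tagged;
    }
}
```

Passing the current time in as a parameter, rather than reading the clock inside the method, keeps the check easy to test with fixed timestamps.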

If the master node fails, the other nodes will attempt to ‘steal’ the master flag once the master node is past its expiration time. The first node that succeeds in taking the master flag becomes the new master and takes over the checking of the lastActive fields.
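The takeover can be sketched as an atomic compare-and-set on the master flag, so that only one of several competing nodes wins. This is an in-memory illustration under assumptions, not the engine's implementation: the MasterClaim and MasterElection classes and their methods are hypothetical, and a real multi-node setup would need the flag in shared storage rather than in an AtomicReference.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical holder of the master flag: which node owns it and when it was last renewed.
final class MasterClaim {
    final String owner;
    final Instant renewedAt;

    MasterClaim(String owner, Instant renewedAt) {
        this.owner = owner;
        this.renewedAt = renewedAt;
    }
}

class MasterElection {
    private final AtomicReference<MasterClaim> claim;
    private final Duration expiration;

    MasterElection(MasterClaim initial, Duration expiration) {
        this.claim = new AtomicReference<>(initial);
        this.expiration = expiration;
    }

    // Returns true if `node` holds the master flag after this call.
    // The current master renews its claim; other nodes may only steal an expired one.
    boolean tryBecomeMaster(String node, Instant now) {
        MasterClaim current = claim.get();
        boolean ownsFlag = current.owner.equals(node);
        boolean expired = Duration.between(current.renewedAt, now).compareTo(expiration) > 0;
        if (!ownsFlag && !expired) {
            return false; // a live master exists, so do not steal
        }
        // compareAndSet succeeds only if nobody replaced the claim since we read it,
        // so at most one competing node takes the flag.
        return claim.compareAndSet(current, new MasterClaim(node, now));
    }

    String currentMaster() {
        return claim.get().owner;
    }
}
```

A node that loses the race simply sees tryBecomeMaster return false and can try again on its next run.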

If an EngineNode remains unresponsive for an extended period (by default 24 hours), it will be deleted.

Look at the net.democritus.workflow.EngineNodeConfig class to see which parameters can be modified.


Release Expander version Change
201712 implemented