Interrupt Recovery
If a workflow is interrupted mid-execution, it may leave the system in a hanging state:
- Unfinished tasks may leave the target instance in an interim state
- If an engine is interrupted before it can release its claims, those claims remain and keep the target instances locked.
The solution to this problem differs depending on whether or not claims are used for a given workflow.
Solution for non-claimable workflows: Startup recovery
If no claims are available to indicate whether a node is working on an instance, the only possible time to recover all interrupted tasks is at startup.
When the EAR is deployed, the net.democritus.wfe.EngineHealthBean class creates an EngineNode record with status 'Recovering' and a reference to the current hostname.
If an EngineNode record with the same hostname already exists, the record was not cleaned up properly during shutdown, which indicates an unexpected shutdown. In that case the status is set to 'Unexpected shutdown' and a log statement is written; no recovery is run and the engines are not started automatically.
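This registration check can be illustrated with a minimal sketch. The class name and the in-memory map standing in for the persisted EngineNode records are hypothetical, not the actual expander code.

```java
// Minimal sketch of the startup registration check, assuming a simple in-memory
// stand-in for the EngineNode records (hypothetical names, not the actual expander code).
import java.net.InetAddress;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class StartupRegistrationSketch {

    // hostname -> status, standing in for persisted EngineNode records
    private final Map<String, String> engineNodes = new ConcurrentHashMap<>();

    /** Registers this node at deployment time and reports whether recovery may run. */
    public boolean registerCurrentNode() throws Exception {
        String hostname = InetAddress.getLocalHost().getHostName();

        String previous = engineNodes.putIfAbsent(hostname, "Recovering");
        if (previous != null) {
            // A leftover record means the previous shutdown did not clean up properly:
            // mark the node, log it, and skip both recovery and the automatic engine start.
            engineNodes.put(hostname, "Unexpected shutdown");
            System.out.println("EngineNode '" + hostname + "' was not cleaned up; skipping recovery.");
            return false;
        }
        return true;
    }
}
```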
The EngineHealthBean then starts a recovery process that checks all workflows that are not claimable. For each workflow, it performs the following action:
- Find all instances that have a status corresponding to an interim state of a StateTask. If the task that was being executed was an atomicInternal task, the status is reverted to its beginState. Otherwise, the errorState is set (see the sketch below).
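As a rough illustration of this revert-or-fail rule, the following sketch uses hypothetical StateTask and Instance types; the real expander classes and state handling differ.

```java
// A sketch of the per-workflow recovery action described above, using hypothetical
// types for tasks and instances (the real expander classes differ).
import java.util.List;

public class StartupRecoverySketch {

    interface StateTask {
        String interimState();
        String beginState();
        String errorState();
        boolean isAtomicInternal();
    }

    interface Instance {
        String getStatus();
        void setStatus(String status);
    }

    /** Reverts or fails every instance stuck in the interim state of a task. */
    static void recover(List<StateTask> tasks, List<Instance> instances) {
        for (StateTask task : tasks) {
            for (Instance instance : instances) {
                if (!task.interimState().equals(instance.getStatus())) {
                    continue; // not stuck in this task's interim state
                }
                if (task.isAtomicInternal()) {
                    // atomicInternal tasks are safe to rerun, so roll back to the begin state
                    instance.setStatus(task.beginState());
                } else {
                    // other tasks may have partially executed, so set the error state
                    instance.setStatus(task.errorState());
                }
            }
        }
    }
}
```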
Before starting the recovery, the process attempts to claim itself as master. Only one node can be master at any time, which prevents multiple nodes from running the same recovery process.
After all workflows have been recovered, all registered EngineNode records are updated to status 'Ready'. After this, the EngineStarterBean will kick-start the EngineServices.
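The master claim can be thought of as an atomic compare-and-set: only the node that wins the race runs the recovery. The sketch below is purely illustrative and uses an in-memory reference, whereas the actual implementation presumably coordinates through the shared EngineNode records.

```java
// Illustrative master claim: only one node's compare-and-set succeeds.
import java.util.concurrent.atomic.AtomicReference;

public class MasterClaimSketch {

    // Empty string means "no master yet"; in practice this would live in shared storage.
    private final AtomicReference<String> master = new AtomicReference<>("");

    /** Returns true only for the single node that successfully claims mastership. */
    public boolean tryClaimMaster(String hostname) {
        return master.compareAndSet("", hostname);
    }
}
```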
Solution for claimable workflows: Check Engine Health task
For workflows that are claimable, a more robust recovery implementation is possible, so that one node can shut down and recover while other nodes are still running.
To resolve these issues, the 'CheckEngineHealth' task has been implemented, with the following concerns in mind:
- Other nodes may exist, which may have engines running
- Nodes can fail unexpectedly
- Nodes may still be working even though they haven't updated their lastActive state (e.g. because of resource problems)
Recovery in multi-node systems
To solve this problem within these constraints, the following solutions have been provided (sketched after this list):
- Claims have a timeout field that is set when claiming the instance. Expired claims are ignored.
- If an engine tries to modify an instance with an expired claim, it fails.
- When checking for instances that seem stuck in an interim state, only unclaimed instances are considered.
- Only one node can be master and run the 'CheckEngineHealth' task, which prevents two or more nodes from running the task simultaneously.
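The claim rules above can be sketched with a hypothetical Claim type that carries a node id and a timeout; the names and structure are assumptions, not the generated claim elements.

```java
// Sketch of the claim-timeout rules: expired claims are ignored when reading,
// and an engine holding an expired claim must not modify the instance.
import java.time.Instant;

public class ClaimSketch {

    static class Claim {
        final String nodeId;
        final Instant timeout; // set when the instance is claimed

        Claim(String nodeId, Instant timeout) {
            this.nodeId = nodeId;
            this.timeout = timeout;
        }

        boolean isExpired(Instant now) {
            return now.isAfter(timeout);
        }
    }

    /** An instance counts as claimed only while its claim has not expired. */
    static boolean isEffectivelyClaimed(Claim claim, Instant now) {
        return claim != null && !claim.isExpired(now);
    }

    /** An engine may only modify an instance while its own claim is still valid. */
    static void assertMayModify(Claim claim, String nodeId, Instant now) {
        if (claim == null || !claim.nodeId.equals(nodeId) || claim.isExpired(now)) {
            throw new IllegalStateException("Claim expired or missing; modification rejected");
        }
    }
}
```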
Actions performed by the Health task
The 'CheckEngineHealth' task performs the following three actions, in this order (a sketch follows the list):
- Find all engine nodes whose lastActive value is older than a certain threshold and mark them as 'Not responding'. Records older than a day are removed.
- For each claim element, clean up all claims whose timeout has expired.
- For each workflow, find all unclaimed instances that have a status corresponding to an interim state of a StateTask. If the task that was being executed was an atomicInternal task, the status is reverted to its beginState. Otherwise, the failedState is set.
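The first two steps can be sketched as follows, over in-memory stand-ins for the engine node and claim records. The thresholds and names are assumptions; the third step reuses the same revert-or-fail rule shown in the startup recovery sketch above.

```java
// Sketch of the first two CheckEngineHealth steps (hypothetical names and thresholds).
import java.time.Duration;
import java.time.Instant;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

public class CheckEngineHealthSketch {

    static final Duration NOT_RESPONDING_THRESHOLD = Duration.ofMinutes(5); // assumed value
    static final Duration REMOVAL_THRESHOLD = Duration.ofDays(1);

    static class EngineNode {
        String status;
        Instant lastActive;
    }

    // Step 1: mark stale nodes as 'Not responding' and remove records older than a day.
    static void checkNodes(List<EngineNode> nodes, Instant now) {
        Iterator<EngineNode> it = nodes.iterator();
        while (it.hasNext()) {
            EngineNode node = it.next();
            if (node.lastActive.isBefore(now.minus(REMOVAL_THRESHOLD))) {
                it.remove();
            } else if (node.lastActive.isBefore(now.minus(NOT_RESPONDING_THRESHOLD))) {
                node.status = "Not responding";
            }
        }
    }

    // Step 2: drop claims whose timeout has expired.
    static void cleanExpiredClaims(Map<String, Instant> claimTimeouts, Instant now) {
        claimTimeouts.values().removeIf(timeout -> now.isAfter(timeout));
    }

    // Step 3 (not repeated here): recover unclaimed instances stuck in an interim state,
    // using the same revert-or-fail rule as the startup recovery sketch.
}
```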