Errors and failures¶
A key point for running workflows in a high-throughput regime is having an automatic tool that deals with the most common errors that can show up during a DFT calculation, and such a tool is indeed present in Abiflows. This system, however, cannot be expected to solve every kind of problem that may occur. When running fireworks workflows you should therefore expect to encounter a few kinds of failures. In the following we distinguish between soft and hard failures. We call a failure soft if the Firework ends up in the FIZZLED state. Conversely, we consider a failure hard if the job was killed by some external action, leaving the Firework in the RUNNING state as a lost run. A third possibility is a lost run that is left in an inconsistent state and needs to be adjusted.
Here we provide a guide on how to deal with each of these kinds of failures.
Before proceeding to the following sections about error handling and dealing with failures specific to Abiflows, you might want to review the basic tutorial on failures in the fireworks documentation.
Error Handling¶
As mentioned above, Abiflows can try to solve some errors that might be encountered during a DFT calculation. The error messages produced by Abinit are encoded in a specific (YAML) format inside the output file, so they can be easily parsed and identified without relying on ad hoc analysis of free-form text messages.
The default approach is that, whenever an error is encountered and Abinit stops, Abiflows will try to fix the problem; by default it generates a detour that continues the calculation with the required modifications. This behaviour can be altered by setting the allow_local_restart keyword to True in the fw_manager.yaml options. The creation of a detour is usually preferable, since it allows the parallelization options to be adjusted in the restart (if autoparal is used) and reduces the chances of hitting the maximum wall time allowed for a job on the cluster.
The system will not try to fix an unlimited number of errors (or the same error an unlimited number of times). The maximum number of restarts for the same Firework is set to 10 by default and can be customized with the max_restarts keyword in the fw_manager.yaml configuration file.
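For reference, here is a minimal sketch of how these two options could appear in fw_manager.yaml; the fw_policy grouping and the values shown are assumptions, so check the configuration documentation of your Abiflows version for the exact layout:
fw_policy:
  allow_local_restart: false  # if true, restart in place instead of creating a detour
  max_restarts: 20            # limit on the number of restarts of the same Firework (default 10)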
The error handling relies on the event handlers implemented in Abipy and shares the same functionalities as the workflows implemented there. If you encounter an error that is not handled and for which an error handler could be implemented, you can try to develop your own handler and include it in Abipy or try to get in touch with the Abipy developers.
Soft failures¶
When Abinit produces an error that is not handled, or the maximum number of restarts is reached, the Firework will end up in the FIZZLED state. Technically this means that an exception was raised in the python code. This can come from Abiflows recognizing that it cannot fix the error, but it can also happen due to a partial failure of the system (e.g. a full disk will cause an I/O error while the system might still be able to update the state of the fireworks database) or even to some bug in the python code.
If the failure originates from a temporary problem of the system, like a node failure that causes the calculation to crash, simply rerunning the Firework should solve the issue. For this you can use the standard fireworks commands, e.g.
lpad rerun_fws -i xxx
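If several Fireworks fizzled, it may be convenient to first list them and then rerun them all in one go, e.g.:
# count and list the fireworks currently in the FIZZLED state
lpad get_fws -s FIZZLED -d count
lpad get_fws -s FIZZLED -d ids
# rerun all the fizzled fireworks at once
lpad rerun_fws -s FIZZLED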
If the Firework fizzled because the maximum number of restarts has been reached, you can inspect the calculation and decide whether it is worth allowing a few more restarts, i.e. whether there is a realistic chance of reaching the end of the calculation. In this case you need to increase the number of restarts allowed in the fw_manager.yaml file and simply rerun the Firework.
However, if the problem arose from an Abinit error that cannot be fixed, rerunning the Firework will simply lead to the same error. In this case you should probably consider the calculation as lost and either create a new workflow with different initial configurations or just discard the system. For an advanced user there might be an additional option: if you think that you can fix the problem by adjusting some abinit input variable, you can consider updating the document in the fireworks collection of the fireworks database corresponding to the failed Firework and modifying the data corresponding to the AbinitInput object.
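A possible way to do this without editing the database by hand is the standard lpad update_fws command, which updates the spec of a Firework. The dotted key used below is purely a placeholder: dump the spec first to find where the AbinitInput data actually lives in your Abiflows version.
# dump the full document of the failed firework to locate the serialized AbinitInput
lpad get_fws -i 123 -d all
# placeholder update: replace the dotted key with the actual path to the variable you want to change
lpad update_fws -i 123 -u '{"_tasks.0.abiinput.nstep": 100}'
# then rerun the firework
lpad rerun_fws -i 123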
Note
If you decide to discard a calculation it might be convenient to delete the corresponding workflow from the fireworks database, so that it does not show up again when you look for FIZZLED fireworks. This can be achieved by running:
lpad delete_wflows -i xxx
where xxx is the fw_id of the failed firework. This will just delete the workflow from the database; it is also possible to delete all the related folders from the file system using the --ldirs option, which is usually the best solution:
lpad delete_wflows -i xxx --ldirs
Hard failures¶
When your job is killed by the queue manager, be it because of some problem of the cluster or because your calculation exceeded the resources requested for the job (e.g. memory or wall time), the Firework will remain in the RUNNING state.
First of all you need to identify these lost runs. This can be done with the standard fireworks procedure, using the command
lpad detect_lostruns
This will provide a list of the ids of the lost fireworks (i.e. the fireworks that did not ping the database for a specified amount of time). If you are confident that all the lost jobs are due to a temporary problem or to a whim of the cluster you can just rerun all the lost Fireworks:
lpad detect_lostruns --rerun
If instead you suspect that there might be a problem with some of your jobs, the correct way to proceed is to go to the launch directory of the job and inspect the files produced by the queue manager. These usually contain information about the reason for the failure and will typically state explicitly whether the error comes from exceeded memory or wall time. If that is the case, simply resubmitting your job will probably end with the same outcome, and you should make sure that the job has enough resources when you rerun it.
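A possible way to locate the folders to inspect is sketched below; get_launchdir is available in recent fireworks versions, and the names and content of the queue manager files depend on your scheduler and qadapter settings:
# print the launch directory of a lost firework (fw_id 123 used as an example)
lpad get_launchdir 123
# alternatively, dump the full firework document, which also lists the launch directories
lpad get_fws -i 123 -d all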
To make sure that enough resources are available, you might consider creating a specific fireworker with additional resources and making the job run with that fireworker.
Alternatively, you might want to set or change the _queueadapter keyword in the spec of the Firework, as explained in the specific section of the fireworks manual (https://materialsproject.github.io/fireworks/queue_tutorial_pt2.html). If autoparal is enabled the _queueadapter key will already be present and you will need to update the appropriate keywords to increase the resources requested from the queue manager.
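For example, the wall time or the number of nodes could be raised by updating the _queueadapter entry of the spec with the standard lpad command; the key names below are only illustrative, since they must match the fields of your queue adapter template:
# hypothetical example: increase the resources requested for Firework 123
lpad update_fws -i 123 -u '{"_queueadapter.walltime": "24:00:00", "_queueadapter.nnodes": 2}'
# rerun the firework so that it is resubmitted with the new settings
lpad rerun_fws -i 123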
Database issues¶
An additional problem that could leave your job as a lost run is when the database becomes temporarily unavailable, due to a failure of the database server or to an issue in the connection. In this case it might be that the calculation completed successfully, but the results could not be updated in the database. If you end up in this situation the standard solution of simply rerunning the job is perfectly viable, but you will lose the computational time used for the first run. A better solution is to rerun the firework with the following command:
lpad rerun_fws -i xxx --task-level --previous-dir
This will rerun the Firework and will make sure that it reruns in the same folder as the first run. At this point Abiflows will notice that there is a completed output in the folder and use that one instead of running Abinit again.
Warning
Remember that there should be a completed Abinit output file in the folder: obviously a partial output cannot be recovered in any way. In addition, consider that the new launch of the Firework will only last a few seconds, and this is the time that will be registered in the fireworks (and in the results) database. Keep this in mind if you plan to keep statistics about the run time of your computations.
Inconsistent Fireworks¶
As mentioned above, there is one particular case in which your jobs might be identified by the lpad detect_lostruns command: when there is an inconsistency between the state of the Firework and the state of the Launch. The output will look like this:
2019-01-01 00:00:00,000 INFO Detected 0 lost FWs: []
2019-01-01 00:00:00,000 INFO Detected 2 inconsistent FWs: [123,124]
You can fix inconsistent FWs using the --refresh argument to the detect_lostruns command
This may happen when fireworks has problems refreshing the whole state of the Workflow and the dependencies of a Firework. A concrete example is a workflow with a large number of Fireworks that all have a common child, as is the case for the mrgddb step in a DFPT workflow.
This is only a small issue due to the particular configuration of the workflow and does not require the job to be rerun. To solve this you simply need to run the command:
lpad detect_lostruns --refresh
Depending on the size of the workflow and on the number of inconsistent Fireworks this may take a while, but it is not a computationally intensive operation.