Errors and failures

A key requirement for running workflows in a high-throughput regime is an automatic tool that deals with the most common errors that can show up during a DFT calculation, and such a tool is indeed present in Abiflows. This system, however, cannot be expected to solve every problem that may occur, so when running fireworks workflows you can still encounter failures. In the following we will distinguish between soft and hard failures. We will call a failure soft if the Firework ends up in the FIZZLED state. Conversely, we will call a failure hard if the job was killed by some external action, leaving the Firework in the RUNNING state as a lost run. A third kind of failure is a lost run that is in an inconsistent state and needs to be adjusted. Here we provide a guide on how to deal with these different kinds of failures.

Before proceeding to the following sections about error handling and dealing with failures specific to Abiflows, you might want to review the basic tutorial on failures in the fireworks documentation.

Error Handling

As mentioned above, Abiflows can try to fix some of the errors that might be encountered during a DFT calculation. The error messages produced by Abinit are encoded in a specific (YAML) format inside the output file, so they can be easily parsed and identified without relying on a fragile analysis of free-form text messages.

The default approach is that, whenever an error is encountered and Abinit stops, Abiflows will try to fix the problem; by default it generates a detour that continues the calculation with the required modifications. This can be altered by setting the allow_local_restart keyword to True in the fw_manager.yaml options. The creation of a detour is usually preferable, since it allows the parallelization options to be adapted in the restart (if autoparal is used) and reduces the chance of hitting the maximum wall time allowed for a job on the cluster.

The system will not try to fix an unlimited number of errors (or the same error an unlimited number of times). The maximum number of restarts for the same Firework is set to 10 by default and can be customized with the max_restarts keyword in the fw_manager.yaml configuration file.
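As an illustration, a minimal fw_manager.yaml snippet controlling this behaviour could look like the following. It assumes that these keywords live under the fw_policy section, which is the typical layout; check the Abiflows documentation of the FWTaskManager for the exact structure and defaults:

fw_policy:
  # restart in place instead of creating a detour (default: False)
  allow_local_restart: False
  # maximum number of automatic restarts of the same Firework (default: 10)
  max_restarts: 10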

The error handling relies on the event handlers implemented in Abipy and shares the same functionalities as the workflows implemented there. If you encounter an error that is not handled and for which a handler could be implemented, you can develop your own handler and contribute it to Abipy, or get in touch with the Abipy developers.

Soft failures

When Abinit produces an error that is not handled, or the maximum number of restarts is reached, the Firework will end up in the FIZZLED state. Technically this means that an exception was raised in the python code. This can come from Abiflows recognizing that it cannot fix the error, but it can also be due to a partial failure of the system (e.g. a full disk causes an I/O error, while the system is still able to update the state of the fireworks database) or even to a bug in the python code.

If the failure originates from a temporary problem of the system, such as a failing node that caused the calculation to crash, simply rerunning the Firework should solve the issue. For this you can use the standard fireworks commands, e.g.

lpad rerun_fws -i xxx
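If several Fireworks fizzled for the same transient reason, the standard FireWorks selectors can be used to rerun them all at once; for example, the following reruns every fizzled Firework (see lpad rerun_fws -h for the available filters):

lpad rerun_fws -s FIZZLED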

If the Firework fizzled because the maximum number of restarts has been reached, you should inspect the calculation and decide whether it is worth allowing a few more restarts, i.e. whether there is a realistic chance of reaching the end of the calculation. In that case you need to increase the number of restarts allowed in the fw_manager.yaml file and simply rerun the Firework.

However, if the problem arose from an Abinit error that cannot be fixed, rerunning the Firework will simply lead to the same error. In this case you should probably consider the calculation as lost and either create a new workflow with different initial configurations or just discard the system. Advanced users have an additional option: if you think that the problem can be fixed by adjusting some Abinit input variable, you can consider updating the document in the fireworks collection of the fireworks database corresponding to the failed Firework and modifying the data corresponding to the AbinitInput object.
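As a rough sketch of this advanced procedure, the serialized AbinitInput can be located and edited directly with pymongo. The connection details below are placeholders, and the location of the serialized input inside the document (assumed here to be under spec._tasks) may differ between Abiflows versions, so always inspect the document before modifying it:

from pymongo import MongoClient

# hypothetical connection details: adapt to your fireworks database
db = MongoClient("mongodb://db_host:27017").fireworks_db
fw_doc = db.fireworks.find_one({"fw_id": 123})  # fw_id of the failed Firework

# inspect the serialized task to locate the AbinitInput data
print(fw_doc["spec"]["_tasks"][0].keys())

# once the relevant key is identified, update it with $set; the path and
# value below are only an illustration and must be adapted
db.fireworks.update_one(
    {"fw_id": 123},
    {"$set": {"spec._tasks.0.abiinput.abi_args": [["nstep", 200]]}},
)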

Note

If you decide to discard a calculation it might be convenient to delete the corresponding workflow from the fireworks database, so that it does not show up again when you look for FIZZLED fireworks. This can be achieved by running:

lpad delete_wflows -i xxx

using the fw_id of the failed Firework. This only deletes the workflow from the database; it is also possible to delete all the related folders from the file system with the --ldirs option, which is usually the best solution:

lpad delete_wflows -i xxx --ldirs

Hard failures

When your job is killed by the queue manager, whether because of some problem of the cluster or because your calculation exceeded the resources requested for the job (e.g. memory or wall time), the Firework will remain in the RUNNING state. First of all you need to identify these lost runs. This can be done with the standard fireworks procedure, using the command

lpad detect_lostruns

This will provide a list of the ids of the lost fireworks (i.e. the fireworks that have not pinged the database for a specified amount of time). If you are confident that all the lost jobs are due to a temporary problem or to a whim of the cluster, you can just rerun all the lost Fireworks:

lpad detect_lostruns --rerun
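The threshold used to mark a run as lost can also be tuned by passing the expiration time in seconds; for example, to consider as lost only the fireworks that have not pinged the database in the last four hours (the exact default value is set by FireWorks, see lpad detect_lostruns -h):

lpad detect_lostruns --time 14400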

If instead you suspect that there might be a problem in some of your jobs, the correct way of proceeding is to go to the launch directory of the job and inspect the files produced by the queue manager. These usually contain information about the reason of the failure and will probably state explicitly whether the error comes from exceeded memory or wall time. If that is the case, simply resubmitting your job will probably end with the same outcome, so you should make sure that your job has enough resources when you rerun it.
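As a purely illustrative shortcut, assuming a SLURM-like scheduler that writes its messages to .out/.err files in the launch directory (the actual file names depend on your queue adapter configuration), a quick search can point out resource-related kills:

grep -iE "out of memory|oom|time limit|cancelled" *.out *.err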

For this you might consider creating a specific fireworker with additional resources and make the job run with that fireworker. Alternatively, you can set or change the _queueadapter keyword in the spec of the Firework, as explained in the specific section of the fireworks manual (https://materialsproject.github.io/fireworks/queue_tutorial_pt2.html). If autoparal is enabled, the _queueadapter key will already be present and you will need to update the appropriate keywords to increase the resources requested from the queue manager.
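One possible way to change these values from the command line is the update_fws command of FireWorks, which updates the spec of a Firework. The keys shown below (walltime, ntasks) are only examples and must match the fields of your queue adapter template (see lpad update_fws -h):

lpad update_fws -i xxx -u '{"_queueadapter": {"walltime": "24:00:00", "ntasks": 32}}'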

Database issues

An additional problem that can leave your job as a lost run is the database becoming temporarily unavailable, due to a failure of the database server or to an issue in the connection. In this case it might be that the calculation completed successfully, but the results could not be updated in the database. If you end up in this situation, the standard solution of simply rerunning the job is perfectly viable, but you will lose the computational time spent on the first run. A better solution is to rerun the firework with the following command:

lpad rerun_fws -i xxx --task-level --previous-dir

This will rerun the Firework and make sure that it runs in the same folder as the first launch. At that point Abiflows will notice that a completed output is already present in the folder and will use it instead of running Abinit again.

Warning

Remember that there should be a completed Abinit output file in the folder: obviously a partial output cannot be recovered in any way. In addition, consider that the new launch of the Firework will only last a few seconds and this is the time that will be registered in the fireworks (and in the results) database. Keep this in mind if you plan to collect statistics about the run times of your calculations.

Inconsistent Fireworks

As mentioned above, there is one particular case in which your jobs might be identified by the lpad detect_lostruns command: when there is an inconsistency between the state of the Firework and the state of its Launch. The output will look like this:

2019-01-01 00:00:00,000 INFO Detected 0 lost FWs: []
2019-01-01 00:00:00,000 INFO Detected 2 inconsistent FWs: [123,124]
You can fix inconsistent FWs using the --refresh argument to the detect_lostruns command

This may happen when fireworks has problems refreshing the whole state of the Workflow and the dependencies of a Firework. A concrete example where this can show up is a workflow with a large number of Fireworks that all share a common child, as is the case for the mrgddb step in a DFPT workflow.

This is only a small issue due to the particular configuration of the workflow and does not require the job to be rerun. To solve this you simply need to run the command:

lpad detect_lostruns --refresh

Depending on the size of the workflow and on the number of inconsistent Fireworks, this may take a while, but is not a computationally intensive operation.