TaskManager¶
Besides post-processing tools and a programmatic interface to generate input files, AbiPy also provides a pythonic API to execute small Abinit tasks directly or submit calculations on supercomputing clusters. This section discusses how to create the configuration files required to interface AbiPy with Abinit.
We assume that Abinit is already available on your machine and that you know how to configure your environment so that the operating system can load and execute Abinit. In other words, we assume that you know how to set the $PATH and $LD_LIBRARY_PATH ($DYLD_LIBRARY_PATH on Mac) environment variables, load modules with module load, run MPI applications with mpirun, etc.
Important
Please make sure that you can execute Abinit interactively with simple input files and that it works as expected before proceeding with the rest of the tutorial. It’s also a very good idea to run the Abinit test suite with the runtest.py script before running production calculations.
Tip
A pre-compiled sequential version of Abinit for Linux and macOS can be installed directly from the abinit conda channel with:
conda install abinit --channel abinit
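Once Abinit is installed, it is a good idea to verify that the shell can actually find the executables. A minimal check from the terminal might look like this (assuming abinit and, for MPI builds, mpirun are in $PATH; recent Abinit versions support the --version flag):
# Check that the Abinit executable is found and print its version
which abinit
abinit --version
# For parallel builds, also check that the MPI launcher is available
which mpirun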
How to configure the TaskManager¶
The TaskManager
takes care of task submission.
This includes the creation of the submission script,
the initialization of the environment as well as the optimization of the parallel algorithms
(number of MPI processes, number of OpenMP threads, automatic parallelization with Abinit autoparal
feature).
AbiPy obtains the information needed to create the correct TaskManager
for a specific cluster (personal computer)
from the manager.yml
configuration file.
The file is written in YAML, a human-readable data-serialization language commonly used for configuration files. Good introductions to the YAML syntax, reference cards and online validators are easy to find on the web.
By default, AbiPy looks for a manager.yml file first in the current working directory (i.e. the directory in which you execute your script) and then inside $HOME/.abinit/abipy. If no file is found, the code aborts immediately.
An important piece of information for the TaskManager is the type of queueing system available on the cluster, together with the list of queues and their specifications. In AbiPy, queueing systems (resource managers) are supported via qadapters. At the time of writing (Nov 21, 2024), AbiPy provides qadapters for the following resource managers: bluegene, moab, pbspro, sge, shell, slurm and torque.
Manager configuration files for typical cases are available inside ~abipy/data/managers.
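If one of these templates is close to your setup, a possible starting point is to copy it to the AbiPy configuration directory and then adapt it. A hedged sketch (the template name below is just an example; list the directory to see what is actually shipped with your AbiPy version):
# Inspect the templates shipped with AbiPy (~abipy stands for the abipy installation directory)
ls ~abipy/data/managers/
# Copy one of them to the per-user configuration directory and edit it
mkdir -p ~/.abinit/abipy
cp ~abipy/data/managers/<my_cluster>_manager.yml ~/.abinit/abipy/manager.yml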
We first discuss how to configure AbiPy on a personal computer and then we look at the more complicated case in which the calculation must be submitted to a queue.
TaskManager for a personal computer¶
Let’s start from the simplest case, i.e. a personal computer on which we can execute applications directly from the shell (qtype: shell). In this case, the configuration file is relatively simple because we can run Abinit directly without having to generate and submit a script to a resource manager. In its simplest form, the manager.yml file consists of a list of qadapters:
qadapters:
- # qadapter_0
- # qadapter_1
Each item in the qadapters list is essentially a YAML dictionary with the following sub-dictionaries:
queue: Dictionary with the name of the queue and optional parameters used to build and customize the header of the submission script.
job: Dictionary with the options used to prepare the environment before submitting the job.
limits: Dictionary with the constraints that must be fulfilled in order to run with this qadapter.
hardware: Dictionary with information on the hardware available on this particular queue. Used by the Abinit autoparal feature to optimize the parallel execution.
The qadapter is therefore responsible for all interactions with a specific queue management system (shell, Slurm, PBS, etc.), including handling all details of the queue script format as well as queue submission and management.
Note
Multiple qadapters are useful if you are running on a cluster with different queues, but we postpone the discussion of this rather technical point. For the time being, we use a manager.yml with a single qadapter.
A typical configuration file used on a laptop to run jobs via the shell is:
qadapters: # List of `qadapters` objects (just one in this simplified example)
- priority: 1
queue:
qtype: shell # "Submit" jobs via the shell.
qname: localhost # "Submit" to the localhost queue
# (it's a fake queue in this case)
job:
pre_run: "export PATH=$HOME/git_repos/abinit/build_gcc/src/98_main:$PATH"
mpi_runner: "mpirun"
limits:
timelimit: 1:00:00 # Time-limit for each task.
max_cores: 2 # Max number of cores that can be used by a single task.
hardware:
num_nodes: 1
sockets_per_node: 1
cores_per_socket: 2
mem_per_node: 4 Gb
The job section is the most critical one, in particular the pre_run option whose commands are executed by the shell script before invoking Abinit. In this example, Abinit is not installed in a standard location (the executable is not already in the path), hence the directory with the Abinit executables has to be prepended to the original $PATH variable. Change pre_run according to your Abinit installation and make sure that mpirun is also in $PATH. If you don’t use a parallel version of Abinit, just set mpi_runner: null (null is the YAML equivalent of the Python None). Note that this approach also allows you to switch safely between multiple Abinit versions.
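For reference, a minimal variant for a purely sequential Abinit build (no MPI) might look like the example below. This is just a sketch: the installation path is a placeholder and must be adapted to your machine.
qadapters:
- priority: 1
  queue:
    qtype: shell
    qname: localhost
  job:
    pre_run: "export PATH=$HOME/local/abinit/bin:$PATH"  # hypothetical installation directory
    mpi_runner: null   # sequential execution: no mpirun
  limits:
    timelimit: 0:30:0
    max_cores: 1       # a sequential binary uses a single core
  hardware:
    num_nodes: 1
    sockets_per_node: 1
    cores_per_socket: 1
    mem_per_node: 4 Gb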
Copy this example and change the entries in the hardware and limits sections according to your machine; in particular, make sure that max_cores is not greater than the number of physical cores available on your personal computer. Save the file in the current working directory and run the abicheck.py script provided by AbiPy. If everything is configured properly, you should see something like this in the terminal:
$ abicheck.py --no-colors
AbiPy Manager:
[Qadapter 0]
ShellAdapter:github
Hardware:
num_nodes: 1, sockets_per_node: 1, cores_per_socket: 2, mem_per_node 4096,
Qadapter selected: 0
self.info DATA TYPE INFORMATION:
REAL: Data type name: REAL(DP)
Kind value: 8
Precision: 15
Smallest nonnegligible quantity relative to 1: 0.22204460E-015
Smallest positive number: 0.22250739E-307
Largest representable number: 0.17976931E+309
INTEGER: Data type name: INTEGER(default)
Kind value: 4
Bit size: 32
Largest representable number: 2147483647
LOGICAL: Data type name: LOGICAL
Kind value: 4
CHARACTER: Data type name: CHARACTER Kind value: 1
==== Using MPI-2 specifications ====
MPI-IO support is ON
xmpi_tag_ub ................ 268435455
xmpi_bsize_ch .............. 1
xmpi_bsize_int ............. 4
xmpi_bsize_sp .............. 4
xmpi_bsize_dp .............. 8
xmpi_bsize_spc ............. 8
xmpi_bsize_dpc ............. 16
xmpio_bsize_frm ............ 4
xmpi_address_kind .......... 8
xmpi_offset_kind ........... 8
MPI_WTICK .................. 1.0000000000000001E-009
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
CPP options activated during the build:
CC_GNU CXX_GNU FC_GNU
HAVE_FC_ALLOCATABLE_DT... HAVE_FC_ASYNC HAVE_FC_BACKTRACE
HAVE_FC_COMMAND_ARGUMENT HAVE_FC_COMMAND_LINE HAVE_FC_CONTIGUOUS
HAVE_FC_CPUTIME HAVE_FC_EXIT HAVE_FC_FLUSH
HAVE_FC_GAMMA HAVE_FC_GETENV HAVE_FC_IEEE_ARITHMETIC
HAVE_FC_IEEE_EXCEPTIONS HAVE_FC_INT_QUAD HAVE_FC_IOMSG
HAVE_FC_ISO_C_BINDING HAVE_FC_ISO_FORTRAN_2008 HAVE_FC_LONG_LINES
HAVE_FC_MOVE_ALLOC HAVE_FC_ON_THE_FLY_SHAPE HAVE_FC_PRIVATE
HAVE_FC_PROTECTED HAVE_FC_SHIFTLR HAVE_FC_STREAM_IO
HAVE_FC_SYSTEM HAVE_FFTW3 HAVE_FORTRAN2003
HAVE_HDF5 HAVE_HDF5_MPI HAVE_LIBPAW_ABINIT
HAVE_LIBTETRA_ABINIT HAVE_LIBXC HAVE_MPI
HAVE_MPI2 HAVE_MPI2_INPLACE HAVE_MPI_IALLGATHER
HAVE_MPI_IALLREDUCE HAVE_MPI_IALLTOALL HAVE_MPI_IALLTOALLV
HAVE_MPI_IBCAST HAVE_MPI_IGATHERV HAVE_MPI_INTEGER16
HAVE_MPI_IO HAVE_MPI_TYPE_CREATE_S... HAVE_NETCDF
HAVE_NETCDF_FORTRAN HAVE_NETCDF_FORTRAN_MPI HAVE_NETCDF_MPI
HAVE_OS_LINUX HAVE_TIMER_ABINIT
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
=== Build Information ===
Version : 10.0.3
Build target : x86_64_linux_gnu13.3
Build date : 20241021
=== Compiler Suite ===
C compiler : gnu
C++ compiler : gnu13.3
Fortran compiler : gnu13.3
CFLAGS : -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protect ...
CXXFLAGS : -fvisibility-inlines-hidden -fmessage-length=0 -march=nocona -mtune ...
FCFLAGS : -g -ffree-line-length-none -fallow-argument-mismatch -fallow-argument-mismatch
FC_LDFLAGS :
=== Optimizations ===
Debug level : basic
Optimization level : standard
Architecture : intel_xeon
=== Multicore ===
Parallel build : yes
Parallel I/O : yes
openMP support :
GPU support :
=== Connectors / Fallbacks ===
LINALG flavor : netlib
FFT flavor : fftw3
HDF5 : yes
NetCDF : yes
NetCDF Fortran : yes
LibXC : yes
Wannier90 : no
=== Experimental features ===
Exports :
GW double-precision :
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Default optimizations:
-O2 -march=nocona -mtune=haswell
Optimizations for 43_ptgroups:
-O0
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Abinitbuild:
Abinit Build Information:
Abinit version: 10.0.3
MPI: True, MPI-IO: True, OpenMP: True
Netcdf: True
Abipy Scheduler:
PyFlowScheduler, Pid: 3343
Scheduler options:
{'weeks': 0, 'days': 0, 'hours': 0, 'minutes': 0, 'seconds': 5}
Installed packages:
Package Version
-------------- ----------
system Linux
python_version 3.11.8
numpy 1.26.4
scipy 1.14.1
netCDF4 1.7.2
apscheduler 3.10.4
pydispatch 2.0.7
ruamel.yaml 0.18.6
boken 3.6.1
panel 1.5.4
plotly 5.24.1
ase 3.23.0
phonopy 2.30.1
monty 2024.10.21
pymatgen 2024.11.13
abipy 0.9.8
Important Shell Variables:
['/usr/share/miniconda/envs/abipy/bin:/usr/share/miniconda/condabin:/usr/share/miniconda/condabin:/snap/bin:/home/runner/.local/bin:/opt/pipx_bin:/home/runner/.cargo/bin:/home/runner/.config/composer/vendor/bin:/usr/local/.ghcup/bin:/home/runner/.dotnet/tools:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/home/runner/.dotnet/tools',
'',
'']
Abipy requirements are properly configured
This message tells us that everything is in place and we can finally run our first calculation.
The directory ~abipy/data/runs contains python scripts that generate workflows for typical ab-initio calculations. Here we focus on the configuration of the manager and the execution of the flow, so we don’t discuss how to generate input files and create Flow objects in python. This topic is covered in more detail in our collection of jupyter notebooks.
Let’s start from the simplest example, i.e. the run_si_ebands.py script that generates a flow to compute the band structure of silicon at the Kohn-Sham level (a GS calculation to get the density followed by an NSCF run along a k-path in the first Brillouin zone). Cd to ~abipy/data/runs and execute run_si_ebands.py to generate the flow:
cd ~abipy/data/runs
./run_si_ebands.py
At this point, you should have a directory flow_si_ebands
with the following structure:
tree flow_si_ebands/
flow_si_ebands/
├── __AbinitFlow__.pickle
├── indata
├── outdata
├── tmpdata
└── w0
├── indata
├── outdata
├── t0
│ ├── indata
│ ├── job.sh
│ ├── outdata
│ ├── run.abi
│ ├── run.files
│ └── tmpdata
├── t1
│ ├── indata
│ ├── job.sh
│ ├── outdata
│ ├── run.abi
│ ├── run.files
│ └── tmpdata
└── tmpdata
15 directories, 7 files
w0/ is the directory containing the input files of the first work (well, we have only one work in our example). w0/t0/ and w0/t1/ contain the input files needed to run the SCF and the NSCF calculation, respectively.
You might have noticed that each task directory (w0/t0, w0/t1) presents the same structure:
run.abi: Abinit input file.
run.files: Abinit files file.
job.sh: Submission/shell script.
outdata: Directory with output data files.
indata: Directory with input data files.
tmpdata: Directory with temporary files.
Danger
__AbinitFlow__.pickle
is the pickle file used to save the status of the Flow. Don’t touch it!
The job.sh script has been generated by the TaskManager using the information provided by manager.yml. In this case it is a simple shell script that executes the code directly, since we are using qtype: shell. The script will get more complicated when we start to submit jobs on a cluster with a resource manager.
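To give a rough idea, with the shell manager shown above the generated job.sh is conceptually equivalent to the sketch below (hand-written for illustration; the actual script produced by AbiPy contains additional bookkeeping and the exact Abinit invocation may differ between versions):
#!/bin/bash
cd /path/to/flow_si_ebands/w0/t0                                 # enter the task directory
export PATH=$HOME/git_repos/abinit/build_gcc/src/98_main:$PATH   # pre_run from manager.yml
mpirun -n 2 abinit < run.files > run.log 2> run.err              # number of MPI procs chosen by autoparal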
We usually interact with the AbiPy flow via the abirun.py script whose syntax is:
abirun.py FLOWDIR command [options]
where FLOWDIR
is the directory containing the flow and command
defines the action to perform
(use abirun.py --help
to get the list of possible commands).
abirun.py
reconstructs the python Flow from the pickle file __AbinitFlow__.pickle
located in FLOWDIR
and invokes the methods of the object depending on the options passed via the command line.
Use:
abirun.py flow_si_ebands status
to get a summary with the status of the different tasks and:
abirun.py flow_si_ebands deps
to print the dependencies of the tasks in textual format.
<ScfTask, node_id=75244, workdir=flow_si_ebands/w0/t0>
<NscfTask, node_id=75245, workdir=flow_si_ebands/w0/t1>
+--<ScfTask, node_id=75244, workdir=flow_si_ebands/w0/t0>
Tip
Alternatively one can use abirun.py flow_si_ebands networkx
to visualize the connections with the networkx package.
In this case, we have a flow with one work (w0) that contains two tasks. The second task (w0/t1) depends on the first one, which is a ScfTask; more specifically, w0/t1 depends on the density file produced by w0/t0. This means that w0/t1 cannot be executed/submitted until the first task has completed. AbiPy is aware of this dependency and will use this information to manage the submission/execution of our flow.
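For the curious, this kind of dependency is declared when the flow is built in python. A minimal sketch, assuming two AbinitInput objects scf_input and nscf_input built elsewhere (method names follow the abipy.flowtk API and may differ slightly between AbiPy versions):
from abipy import flowtk

flow = flowtk.Flow(workdir="flow_si_ebands")
work = flowtk.Work()
scf_task = work.register_scf_task(scf_input)                  # becomes w0/t0
work.register_nscf_task(nscf_input, deps={scf_task: "DEN"})   # w0/t1 waits for the DEN file of w0/t0
flow.register_work(work)
flow.build_and_pickle_dump()   # write input files and __AbinitFlow__.pickle to disk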
There are two commands that can be used to launch tasks: single and rapid. The single command executes the first task in the flow that is in the READY state, that is, a task whose dependencies have been fulfilled. rapid, on the other hand, submits all tasks of the flow that are in the READY state. Let’s try to run the flow with the rapid command…
abirun.py flow_si_ebands rapid
Running on gmac2 -- system Darwin -- Python 2.7.12 -- abirun-0.1.0
Number of tasks launched: 1
Work #0: <BandStructureWork, node_id=75239, workdir=flow_si_ebands/w0>, Finalized=False
+--------+-------------+-----------------+--------------+------------+----------+-----------------+----------+-----------+
| Task | Status | Queue | MPI|Omp|Gb | Warn|Com | Class | Sub|Rest|Corr | Time | Node_ID |
+========+=============+=================+==============+============+==========+=================+==========+===========+
| w0_t0 | Submitted | 71573@localhost | 2| 1|2.0 | 1| 0 | ScfTask | (1, 0, 0) | 0:00:00Q | 75240 |
+--------+-------------+-----------------+--------------+------------+----------+-----------------+----------+-----------+
| w0_t1 | Initialized | None | 1| 1|2.0 | NA|NA | NscfTask | (0, 0, 0) | None | 75241 |
+--------+-------------+-----------------+--------------+------------+----------+-----------------+----------+-----------+
What’s happening here?
The rapid command tried to execute all tasks that are READY but, since the second task depends on the first one, only the first task gets submitted.
Note that the SCF task (w0_t0) has been submitted with 2 MPI processes. Before submitting the task, indeed, AbiPy invokes Abinit to get all the possible parallel configurations compatible with the limits specified by the user (e.g. max_cores), selects an “optimal” configuration according to some policy and then submits the task with the optimized parameters.
At this point, there’s no other task that can be executed, so the script exits and we have to wait for the SCF task before running the second part of the flow.
At each iteration, abirun.py prints a table with the status of the different tasks. The meaning of the columns is as follows:
Queue: String in the form JobID @ QueueName, where JobID is the process identifier if we are running via the shell, or the job id assigned by the resource manager (e.g. Slurm) if we are submitting to a queue.
MPI: Number of MPI processes used. This value is obtained automatically by calling Abinit in autoparal mode and cannot exceed max_ncpus.
OMP: Number of OpenMP threads.
Gb: Memory requested in Gb. Meaningless when qtype: shell.
Warn: Number of warning messages found in the log file.
Com: Number of comments found in the log file.
Sub: Number of submissions. It can be > 1 if AbiPy encounters a problem and resubmits the task with different parameters, without performing any operation that can change the physics of the system.
Rest: Number of restarts. AbiPy can restart the job if convergence has not been reached.
Corr: Number of corrections performed by AbiPy to fix runtime errors. These operations can change the physics of the system.
Time: Time spent in the queue (if the string ends with Q) or running time (if the string ends with R).
Node_ID: Node identifier used by AbiPy to identify each node of the flow.
Note
When the submission is done through the shell there’s almost no difference between job submission and job execution. The scenario is completely different if you are submitting jobs to a resource manager because the task will get a priority value and will enter the queue.
If you execute status again, you should see that the first task is completed. We can thus run rapid again to launch the NscfTask (abipy.flowtk.tasks.NscfTask). The second task won’t take long and, if you issue status again, you should see that the entire flow has completed successfully.
To understand what happened in more detail, use the history command to get the list of operations performed by AbiPy on each task.
abirun.py flow_si_ebands history
==============================================================================================================================
=================================== <ScfTask, node_id=75244, workdir=flow_si_ebands/w0/t0> ===================================
==============================================================================================================================
[Mon Mar 6 21:46:00 2017] Status changed to Ready. msg: Status set to Ready
[Mon Mar 6 21:46:00 2017] Setting input variables: {'max_ncpus': 2, 'autoparal': 1}
[Mon Mar 6 21:46:00 2017] Old values: {'max_ncpus': None, 'autoparal': None}
[Mon Mar 6 21:46:00 2017] Setting input variables: {'npband': 1, 'bandpp': 1, 'npimage': 1, 'npspinor': 1, 'npfft': 1, 'npkpt': 2}
[Mon Mar 6 21:46:00 2017] Old values: {'npband': None, 'npfft': None, 'npkpt': None, 'npimage': None, 'npspinor': None, 'bandpp': None}
[Mon Mar 6 21:46:00 2017] Status changed to Initialized. msg: finished autoparallel run
[Mon Mar 6 21:46:00 2017] Submitted with MPI=2, Omp=1, Memproc=2.0 [Gb] submitted to queue
[Mon Mar 6 21:46:15 2017] Task completed status set to ok based on abiout
[Mon Mar 6 21:46:15 2017] Finalized set to True
=============================================================================================================================
================================== <NscfTask, node_id=75245, workdir=flow_si_ebands/w0/t1> ==================================
=============================================================================================================================
[Mon Mar 6 21:46:15 2017] Status changed to Ready. msg: Status set to Ready
[Mon Mar 6 21:46:15 2017] Adding connecting vars {u'irdden': 1}
[Mon Mar 6 21:46:15 2017] Setting input variables: {u'irdden': 1}
[Mon Mar 6 21:46:15 2017] Old values: {u'irdden': None}
[Mon Mar 6 21:46:15 2017] Setting input variables: {'max_ncpus': 2, 'autoparal': 1}
[Mon Mar 6 21:46:15 2017] Old values: {'max_ncpus': None, 'autoparal': None}
[Mon Mar 6 21:46:15 2017] Setting input variables: {'npband': 1, 'bandpp': 1, 'npimage': 1, 'npspinor': 1, 'npfft': 1, 'npkpt': 2}
[Mon Mar 6 21:46:15 2017] Old values: {'npband': None, 'npfft': None, 'npkpt': None, 'npimage': None, 'npspinor': None, 'bandpp': None}
[Mon Mar 6 21:46:15 2017] Status changed to Initialized. msg: finished autoparallel run
[Mon Mar 6 21:46:15 2017] Submitted with MPI=2, Omp=1, Memproc=2.0 [Gb] submitted to queue
[Mon Mar 6 21:49:48 2017] Task completed status set to ok based on abiout
[Mon Mar 6 21:49:48 2017] Finalized set to True
A closer inspection of the logs reveals that, before submitting the first task, python executed Abinit in autoparal mode to get the list of possible parallel configurations and then submitted the calculation. At this point, AbiPy starts to look at the output files produced by the task to understand what’s happening. When the first task completes, the status of the second task is automatically changed to READY, the irdden input variable is added to the input file of the second task and a symbolic link to the DEN file produced by w0/t0 is created in the indata directory of w0/t1. Another autoparal run is executed for the NSCF calculation and the second task is finally submitted.
The command line interface is very flexible and sometimes it’s the only tool available. However, there are cases in which we would like to have a global view of what’s happening. The command:
$ abirun.py flow_si_ebands notebook
generates a jupyter notebook with pre-defined python code that can be executed to get a graphical representation of the status of our flow inside a web browser (requires jupyter, nbformat and, obviously, a web browser).
Expert users may want to use:
$ abirun.py flow_si_ebands ipython
to open the flow in the ipython shell to have direct access to the API provided by the flow.
Once manager.yml is properly configured, it is possible to use the AbiPy objects to invoke Abinit and perform useful operations. For example, one can use the abipy.abio.inputs.AbinitInput object to get the list of k-points in the IBZ, the list of independent DFPT perturbations, the possible parallel configurations reported by autoparal, etc.
This programmatic interface can be used in scripts to facilitate the creation of input files and workflows. For example, one can call Abinit to get the list of perturbations for each q-point in the IBZ and then generate automatically all the input files for DFPT calculations (this is actually the approach used to generate DFPT workflows in the AbiPy factory functions).
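As an illustration, the sketch below shows how such queries might look in practice. It assumes an AbinitInput object gs_input built elsewhere; the abiget_* method names reflect the abipy.abio.inputs.AbinitInput API but may vary slightly between AbiPy versions, and each call invokes Abinit behind the scenes (hence a working manager.yml is required):
# gs_input is an AbinitInput for a ground-state calculation, built elsewhere
ibz = gs_input.abiget_ibz()                              # k-points and weights in the IBZ
perts = gs_input.abiget_irred_phperts(qpt=(0, 0, 0))     # independent DFPT perturbations at Gamma
pconfs = gs_input.abiget_autoparal_pconfs(max_ncpus=4)   # parallel configurations from autoparal

print(ibz.points, ibz.weights)
print(perts)
print(pconfs)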
Note that manager.yml is also used to invoke other executables (anaddb, optic, mrgddb, etc.), thus creating a sort of interface between the python language and the Fortran executables. Thanks to this interface, one can perform relatively simple ab-initio calculations directly in AbiPy. For instance, one can open a DDB file in a jupyter notebook, call anaddb to compute the phonon frequencies and plot the DOS and the phonon band structure with matplotlib.
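A hedged sketch of this kind of interactive usage is shown below. The DDB filename is a placeholder and the anaget_* method name reflects the DdbFile API, which may differ slightly between AbiPy versions:
from abipy import abilab

with abilab.abiopen("out_DDB") as ddb:   # "out_DDB" is a placeholder for your DDB file
    # Call anaddb (through the TaskManager) to compute phonon bands and DOS
    phbst_file, phdos_file = ddb.anaget_phbst_and_phdos_files(ndivsm=10, nqsmall=10)
    phbst_file.phbands.plot()   # phonon band structure with matplotlib
    phdos_file.phdos.plot()     # phonon DOS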
Tip
abirun.py . doc_manager
gives the full documentation for the different entries of manager.yml
.
$ abirun.py . doc_manager
# TaskManager configuration file (YAML Format)
policy:
# Dictionary with options used to control the execution of the tasks.
qadapters:
# List of qadapters objects (mandatory)
- # qadapter_1
- # qadapter_2
##########################################
# Individual entries are documented below:
##########################################
policy:
autoparal: # (integer). 0 to disable the autoparal feature (DEFAULT: 1 i.e. autoparal is on)
condition: # condition used to filter the autoparal configurations (Mongodb-like syntax).
# DEFAULT: empty i.e. ignored.
vars_condition: # Condition used to filter the list of ABINIT variables reported by autoparal
# (Mongodb-like syntax). DEFAULT: empty i.e. ignored.
frozen_timeout: # A job is considered frozen and its status is set to ERROR if no change to
# the output file has been done for `frozen_timeout` seconds. Accepts int with seconds or
# string in slurm form i.e. days-hours:minutes:seconds. DEFAULT: 1 hour.
precedence: # Under development.
autoparal_priorities: # Under development.
qadapter:
# Dictionary with info on the hardware available on this queue.
hardware:
num_nodes: # Number of nodes available on this queue (integer, MANDATORY).
sockets_per_node: # Number of sockets per node (integer, MANDATORY).
cores_per_socket: # Number of cores per socket (integer, MANDATORY).
# The total number of cores available on this queue is
# `num_nodes * sockets_per_node * cores_per_socket`.
# Dictionary with the options used to prepare the enviroment before submitting the job
job:
setup: # List of commands (strings) executed before running (DEFAULT: empty)
omp_env: # Dictionary with OpenMP environment variables (DEFAULT: empty i.e. no OpenMP)
modules: # List of modules to be imported before running the code (DEFAULT: empty).
# NB: Error messages produced by module load are redirected to mods.err
shell_env: # Dictionary with shell environment variables.
mpi_runner: # MPI runner. Possible values in ["mpirun", "mpiexec", "srun", None]
# DEFAULT: None i.e. no mpirunner is used.
mpi_runner_options: # String with optional options passed to the `mpi_runner` e.g. "--bind-to None"
shell_runner: # Used for running small sequential jobs on the front-end. Set it to None
# if mpirun or mpiexec are not available on the fron-end. If not
# given, small sequential jobs are executed with `mpi_runner`.
shell_runner_options: # Similar to mpi_runner_options but for the runner used on the front-end.
pre_run: # List of commands (strings) executed before the run (DEFAULT: empty)
post_run: # List of commands (strings) executed after the run (DEFAULT: empty)
# dictionary with the name of the queue and other optional parameters
# used to build/customize the header of the submission script.
queue:
qtype: # String defining the qapapter type e.g. slurm, shell ...
qname: # Name of the submission queue (string, MANDATORY)
qparams: # Dictionary with values used to generate the header of the job script
# We use the *normalized* version of the options i.e dashes in the official name
# are replaced by underscores e.g. ``--mail-type`` becomes ``mail_type``
# See pymatgen.io.abinit.qadapters.py for the list of supported values.
# Use ``qverbatim`` to pass additional options that are not included in the template.
# dictionary with the constraints that must be fulfilled in order to run on this queue.
limits:
min_cores: # Minimum number of cores (integer, DEFAULT: 1)
max_cores: # Maximum number of cores (integer, MANDATORY). Hard limit to hint_cores:
# it's the limit beyond which the scheduler will not accept the job (MANDATORY).
hint_cores: # The limit used in the initial setup of jobs.
# Fix_Critical method may increase this number until max_cores is reached
min_mem_per_proc: # Minimum memory per MPI process in MB, units can be specified e.g. 1.4 GB
# (DEFAULT: hardware.mem_per_core)
max_mem_per_proc: # Maximum memory per MPI process in MB, units can be specified e.g. `1.4GB`
# (DEFAULT: hardware.mem_per_node)
timelimit: # Initial time-limit. Accepts time according to slurm-syntax i.e:
# "days-hours" or "days-hours:minutes" or "days-hours:minutes:seconds" or
# "minutes" or "minutes:seconds" or "hours:minutes:seconds",
timelimit_hard: # The hard time-limit for this queue. Same format as timelimit.
# Error handlers could try to submit jobs with increased timelimit
# up to timelimit_hard. If not specified, timelimit_hard == timelimit
condition: # MongoDB-like condition (DEFAULT: empty, i.e. not used)
allocation: # String defining the policy used to select the optimal number of CPUs.
# possible values are in ["nodes", "force_nodes", "shared"]
# "nodes" means that we should try to allocate entire nodes if possible.
# This is a soft limit, in the sense that the qadapter may use a configuration
# that does not fulfill this requirement. In case of failure, it will try to use the
# smallest number of nodes compatible with the optimal configuration.
# Use `force_nodes` to enforce entire nodes allocation.
# `shared` mode does not enforce any constraint (DEFAULT: shared).
max_num_launches: # Limit to the number of times a specific task can be restarted (integer, DEFAULT: 5)
limits_for_task_class: # Dictionary mapping Task class names to a dictionary with limits to be used
# for this particular Task. Example (mind white spaces):
#
# limits_for_task_class: {
# NscfTask: {min_cores: 1, max_cores: 10},
# KerangeTask: {min_cores: 1, max_cores: 1, max_mem_per_proc: 1 GB},
# }
qtype supported: ['bluegene', 'moab', 'pbspro', 'sge', 'shell', 'slurm', 'torque']
Use `abirun.py . manager slurm` to have the list of qparams for slurm.
How to configure the scheduler¶
In the previous example, we ran a simple band structure calculation for silicon in a few seconds on a laptop, but one might have more complicated flows requiring hours or even days to complete. For such cases, the single and rapid commands are not handy because we would have to monitor the evolution of the flow and re-run abirun.py whenever a new task becomes READY. In these cases, it is much easier to delegate all the repetitive work to a python scheduler, a process that runs in the background, submits tasks automatically and performs the actions required to complete the flow.
The parameters for the scheduler are declared in the YAML file scheduler.yml. Also in this case, AbiPy will look first in the working directory and then inside $HOME/.abinit/abipy. Create a scheduler.yml in the working directory by copying the example below:
seconds: 5 # number of seconds to wait.
#minutes: 0 # number of minutes to wait.
#hours: 0 # number of hours to wait.
This file tells the scheduler to wake up every 5 seconds, inspect the status of the tasks in the flow and perform the actions required to reach completion.
Important
Remember to set the time interval to a reasonable value. A small value increases the submission rate but it also increases the CPU load and the pressure on the hardware and on the resource manager. Too large a time interval, on the other hand, can have a detrimental effect on the throughput, especially if you are submitting many small jobs.
At this point, we are ready to run our first calculation with the scheduler. To make things more interesting, we execute a slightly more complicated flow that computes the G0W0 corrections to the direct band gap of silicon at the Gamma point. The flow consists of the following six tasks:
0: Ground-state calculation to get the density.
1: NSCF calculation with several empty states.
2: Calculation of the screening using the WFK file produced by task 1.
3-4-5: Evaluation of the self-energy matrix elements with different values of nband, using the WFK file produced by task 1 and the SCR file produced by task 2.
Generate the flow with:
./run_si_g0w0.py
and let the scheduler manage the submission with:
abirun.py flow_si_g0w0 scheduler
You should see the following output in the terminal:
abirun.py flow_si_g0w0 scheduler
Abipy Scheduler:
PyFlowScheduler, Pid: 72038
Scheduler options: {'seconds': 10, 'hours': 0, 'weeks': 0, 'minutes': 0, 'days': 0}
Pid is the process identifier associated with the scheduler (also saved in the _PyFlowScheduler.pid file).
Important
A _PyFlowScheduler.pid file in FLOWDIR means that there’s a scheduler running the flow. Note that there must be only one scheduler attached to a given flow.
As you can easily understand, the scheduler brings additional power to the AbiPy flow because it makes it possible to automate complicated ab-initio workflows with little effort: write a script that implements the flow in python and save it to disk, run it with abirun.py FLOWDIR scheduler and finally use the AbiPy/Pymatgen tools to analyze the final results. Even complicated convergence studies for G0W0 calculations can be implemented along these lines, as shown by this video.
The only problem is that, at a certain point, the flow may become too big or too computationally expensive to be executed on a personal computer, and we have to move to a supercomputing center. The next section discusses how to configure AbiPy to run on a cluster with a queue management system.
Tip
Use abirun.py . doc_scheduler
to get the full list of options supported by the scheduler.
$ abirun.py doc_scheduler
Options that can be specified in scheduler.yml:
weeks: number of weeks to wait (DEFAULT: 0).
days: number of days to wait (DEFAULT: 0).
hours: number of hours to wait (DEFAULT: 0).
minutes: number of minutes to wait (DEFAULT: 0).
seconds: number of seconds to wait (DEFAULT: 0).
mailto: The scheduler will send an email to `mailto` every `remindme_s` seconds.
(DEFAULT: None i.e. not used).
verbose: (int) verbosity level. (DEFAULT: 0)
use_dynamic_manager: "yes" if the |TaskManager| must be re-initialized from
file before launching the jobs. (DEFAULT: "no")
max_njobs_inqueue: Limit on the number of jobs that can be present in the queue. (DEFAULT: 200)
max_ncores_used: Maximum number of cores that can be used by the scheduler.
remindme_s: The scheduler will send an email to the user specified
by `mailto` every `remindme_s` seconds. (int, DEFAULT: 1 day).
max_num_pyexcs: The scheduler will exit if the number of python exceptions is > max_num_pyexcs
(int, DEFAULT: 0)
max_num_abierrs: The scheduler will exit if the number of errored tasks is > max_num_abierrs
(int, DEFAULT: 0)
safety_ratio: The scheduler will exits if the number of jobs launched becomes greater than
`safety_ratio` * total_number_of_tasks_in_flow. (int, DEFAULT: 5)
max_nlaunches: Maximum number of tasks launched in a single iteration of the scheduler.
(DEFAULT: -1 i.e. no limit)
debug: Debug level. Use 0 for production (int, DEFAULT: 0)
fix_qcritical: "yes" if the launcher should try to fix QCritical Errors (DEFAULT: "no")
rmflow: If "yes", the scheduler will remove the flow directory if the calculation
completed successfully. (DEFAULT: "no")
killjobs_if_errors: "yes" if the scheduler should try to kill all the running jobs
before exiting due to an error. (DEFAULT: "yes")
Configuring AbiPy on a cluster¶
In this section we discuss how to configure the manager to run flows on a cluster. The configuration depends on the specific queue management system (Slurm, PBS, etc.), hence we assume that you are already familiar with job submission and that you know the options that must be specified in the submission script in order to have your job accepted and executed by the management system (username, name of the queue, memory, …).
Let’s assume that our computing center uses Slurm and that our jobs must be submitted to the default_queue partition.
In the best case, the system administrator of our cluster already provides an Abinit module that can be loaded directly with module load before invoking the code (or you can build one yourself).
To make things a little bit more difficult, however, we assume that we had to compile our own version of Abinit inside the build directory ${HOME}/git_repos/abinit/build_impi, using the following two modules already installed by the system administrator:
compiler/intel/composerxe/2013_sp1.1.106
intelmpi
In this case, we have to be careful with the configuration of our environment because the Slurm submission script should load the modules and modify our $PATH so that our version of Abinit can be found.
A manager.yml
with a single qadapter
looks like:
qadapters:
- priority: 1
queue:
qtype: slurm
qname: default_queue
qparams: # Slurm options added to job.sh
mail_type: FAIL
mail_user: john@doe
job:
modules:
- compiler/intel/composerxe/2013_sp1.1.106
- intelmpi
shell_env:
PATH: ${HOME}/git_repos/abinit/build_impi/src/98_main:$PATH
pre_run:
- ulimit -s unlimited
mpi_runner: mpirun
limits:
timelimit: 0:20:0
max_cores: 16
min_mem_per_proc: 1Gb
hardware:
num_nodes: 120
sockets_per_node: 2
cores_per_socket: 8
mem_per_node: 64Gb
Tip
abirun.py FLOWDIR doc_manager script
prints to screen the submission script that will be generated by AbiPy at runtime.
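For the manager above, the generated script would conceptually resemble the sketch below (hand-written for illustration, not the literal output of AbiPy; the number of MPI processes and the memory are chosen at runtime by autoparal within the limits section):
#!/bin/bash
#SBATCH --partition=default_queue
#SBATCH --job-name=w0_t0
#SBATCH --ntasks=16                 # <= max_cores, actual value chosen by autoparal
#SBATCH --mem-per-cpu=1024          # from min_mem_per_proc
#SBATCH --time=0:20:00              # from limits.timelimit
#SBATCH --mail-type=FAIL
#SBATCH --mail-user=john@doe

module load compiler/intel/composerxe/2013_sp1.1.106
module load intelmpi
export PATH=${HOME}/git_repos/abinit/build_impi/src/98_main:$PATH

ulimit -s unlimited
mpirun -n 16 abinit < run.files > run.log 2> run.err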
Let’s discuss the different options in more detail, starting from the queue section:
qtype: String specifying the resource manager. This option tells AbiPy which qadapter to use to generate the submission scripts, how to submit them, how to kill jobs in the queue and how to interpret the other options passed by the user.
qname: Name of the submission queue (string, MANDATORY).
qparams: Dictionary with the parameters passed to the resource manager. We use the normalized version of the options, i.e. dashes in the official name of the parameter are replaced by underscores (e.g. --mail-type becomes mail_type). For the list of supported options use the doc_manager command. Use qverbatim to pass additional options that are not included in the template.
Note that we are not specifying the number of cores in qparams because AbiPy will find an appropriate value at run-time.
The job section is the most critical one because it defines how to configure the environment before executing the application and how to run the code. The modules entry specifies the list of modules to load, while shell_env allows us to modify the $PATH environment variable so that the OS can find our Abinit executable.
Important
Various resource managers will first execute your .bashrc before starting to load the new modules.
We also increase the size of the stack with ulimit before running the code, and we run Abinit with the mpirun provided by the modules.
The limits section defines the constraints that must be fulfilled in order to run on this queue, while hardware is a dictionary with info on the hardware available on this queue. Every job will have a timelimit of 20 minutes, cannot use more than max_cores cores, and the first submission will request 1 Gb of memory per MPI process. Note that the actual number of cores will be determined at runtime by calling Abinit in autoparal mode to get all the parallel configurations up to max_cores. If the job is killed due to insufficient memory, AbiPy will resubmit the task with increased resources and will stop when the maximum amount given by mem_per_node is reached.
Note that there are more advanced options supported by limits, and other options will be added as time goes by.
To get the complete list of options supported by the Slurm qadapter use:
$ abirun.py . doc_manager slurm
QPARAMS for slurm
#!/bin/bash
#SBATCH --partition=$${partition}
#SBATCH --job-name=$${job_name}
#SBATCH --nodes=$${nodes}
#SBATCH --total_tasks=$${total_tasks}
#SBATCH --ntasks=$${ntasks}
#SBATCH --ntasks-per-node=$${ntasks_per_node}
#SBATCH --cpus-per-task=$${cpus_per_task}
#####SBATCH --mem=$${mem}
#SBATCH --mem-per-cpu=$${mem_per_cpu}
#SBATCH --hint=$${hint}
#SBATCH --time=$${time}
#SBATCH --exclude=$${exclude_nodes}
#SBATCH --account=$${account}
#SBATCH --mail-user=$${mail_user}
#SBATCH --mail-type=$${mail_type}
#SBATCH --constraint=$${constraint}
#SBATCH --gres=$${gres}
#SBATCH --requeue=$${requeue}
#SBATCH --nodelist=$${nodelist}
#SBATCH --propagate=$${propagate}
#SBATCH --licenses=$${licenses}
#SBATCH --output=$${_qout_path}
#SBATCH --error=$${_qerr_path}
#SBATCH --qos=$${qos}
$${qverbatim}
Important
If you need to cancel all tasks that have been submitted to the resource manager, use:
abirun.py FLOWDIR cancel
Note that the script will ask for confirmation before killing all the jobs belonging to the flow.
Once you have a manager.yml properly configured for your cluster, you can start to use the scheduler to automate job submission. Very likely your flows will require hours or even days to complete and, in principle, you should maintain an active connection to the machine in order to keep your scheduler alive (if your session expires, all subprocesses launched within your terminal, including the python scheduler, will be automatically killed). Fortunately there is a standard Unix tool called nohup that comes to our rescue.
For long-running jobs, we strongly suggest starting the scheduler with:
nohup abirun.py FLOWDIR scheduler > sched.stdout 2> sched.stderr &
This command executes the scheduler in the background and redirects stdout and stderr to sched.stdout and sched.stderr, respectively. The process identifier of the scheduler is saved in the _PyFlowScheduler.pid file inside FLOWDIR; this file is removed automatically when the scheduler completes its execution. Thanks to the nohup command, we can close our session, let the scheduler work overnight and reconnect the day after to collect our data.
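To check later whether the scheduler is still alive, one can inspect the log and the pid file, e.g.:
# Last lines written by the scheduler
tail sched.stdout
# Process id of the scheduler (the file disappears when the scheduler exits)
cat FLOWDIR/_PyFlowScheduler.pid
# Non-zero exit status means the scheduler is no longer running
ps -p "$(cat FLOWDIR/_PyFlowScheduler.pid)"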
Important
Use abirun.py FLOWDIR cancel
to cancel the jobs of a flow that is being executed by
a scheduler. AbiPy will detect that there is a scheduler already attached to the flow
and will cancel the jobs of the flow and kill the scheduler as well.
Inspecting the Flow¶
abirun.py also provides tools to analyze the results of the flow at runtime. The simplest command is:
abirun.py FLOWDIR tail
which is analogous to the Unix tail command but a bit smarter, in the sense that abirun.py will only print to screen the final part of the output files of the tasks that are RUNNING.
If you have matplotlib installed, you may want to use:
$ abirun.py FLOWDIR inspect
Several AbiPy tasks, indeed, provide an inspect method producing matplotlib figures with data extracted from the output files. For example, a GsTask plots the evolution of the ground-state SCF cycle. The inspect command of abirun.py just loops over the tasks of the flow and calls the inspect method on each of them.
The command:
abirun.py FLOWDIR inputs
prints the input files of the different tasks (one can use --nids to select a subset of tasks or, alternatively, replace FLOWDIR with the FLOWDIR/w0/t0 syntax).
The command:
abirun.py FLOWDIR listext EXTENSION
prints a table with the nodes of the flow that have produced an Abinit output file with the given extension. Use e.g.:
abirun.py FLOWDIR listext GSR.nc
to list the nodes of the flow that have produced a GSR.nc file.
The command:
abirun.py FLOWDIR notebook
generates a jupyter notebook with pre-defined python code that can be executed to get a graphical representation of the status of the flow inside a web browser (requires jupyter, nbformat and, obviously, a web browser).
Expert users may want to use:
abirun.py FLOWDIR ipython
to open the flow in the ipython shell to have direct access to the API provided by the flow.
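Inside the ipython session, the flow object is directly accessible (typically as the variable flow), so one can call any method of the python API. A few hedged examples (attribute and method names from the AbiPy Flow/Task API, which may change between versions):
flow.show_status()      # same summary table printed by the `status` command
task = flow[0][1]       # second task of the first work, i.e. w0/t1
print(task.status)
print(task.input)       # the AbinitInput associated to the task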
Event handlers¶
An event handler is an action that is executed in response to a particular event. The AbiPy tasks are equipped with built-in event handlers that are executed to fix typical Abinit runtime errors.
To list the event handlers installed in a given flow use:
abirun.py FLOWDIR handlers
The --verbose
option produces a more detailed description of the action performed
by the event handlers.
abirun.py FLOWDIR handlers --verbose
List of event handlers installed:
event name = !DilatmxError
event documentation:
This Error occurs in variable cell calculations when the increase in the
unit cell volume is too large.
handler documentation:
Handle DilatmxError. Abinit produces a netcdf file with the last structure before aborting
The handler changes the structure in the input with the last configuration and modify the value of dilatmx.
event name = !TolSymError
event documentation:
Class of errors raised by Abinit when it cannot detect the symmetries of the system.
The handler assumes the structure makes sense and the error is just due to numerical inaccuracies.
We increase the value of tolsym in the input file (default 1-8) so that Abinit can find the space group
and re-symmetrize the input structure.
handler documentation:
Increase the value of tolsym in the input file.
event name = !MemanaError
event documentation:
Class of errors raised by the memory analyzer.
(the section that estimates the memory requirements from the input parameters).
handler documentation:
Set mem_test to 0 to bypass the memory check.
event name = !MemoryError
event documentation:
This error occurs when a checked allocation fails in Abinit
The only way to go is to increase memory
handler documentation:
Handle MemoryError. Increase the resources requirements
Note
New error handlers will be added in future versions of AbiPy/Abinit. Please let us know if you need handlers for errors commonly occurring in your calculations.
Troubleshooting¶
There are two abirun.py commands that are very useful, especially if something goes wrong: events and debug.
To print the Abinit events (Warnings, Errors, Comments) found in the log files of the different tasks use:
abirun.py FLOWDIR events
To analyze error files and log files for possible error messages, use:
abirun.py FLOWDIR debug
By default, these commands analyze the entire flow, so the output on the terminal can be very verbose. If you are interested in a particular task, e.g. w0/t1, use the syntax:
abirun.py FLOWDIR/w0/t1 events
To select all the tasks in a work directory, e.g. w0, use:
abirun.py FLOWDIR/w0 events
To select an arbitrary subset of nodes of the flow, use the syntax:
abirun.py FLOWDIR events --nids=12,13,16
where nids is a list of AbiPy node identifiers.
Tip
abirun.py events --help
is your best friend
$ abirun.py events --help
usage: abirun.py [flowdir] events [-h] [-v] [--no-colors] [--no-logo]
[--loglevel LOGLEVEL] [--remove-lock]
[-n NIDS | -w WSLICE | -S TASK_STATUS | -t TASK_CLASS]
options:
-h, --help show this help message and exit
-v, --verbose verbose, can be supplied multiple times to increase
verbosity.
--no-colors Disable ASCII colors.
--no-logo Disable AbiPy logo.
--loglevel LOGLEVEL Set the loglevel. Possible values: CRITICAL, ERROR
(default), WARNING, INFO, DEBUG.
--remove-lock Remove the lock on the pickle file used to save the
status of the flow.
-n NIDS, --nids NIDS Node identifier(s) used to select the task. Accept
single integer, comma-separated list of integers or
python slice. Use `status` command to get the node
ids. Examples: --nids=12 --nids=12,13,16 --nids=10:12
to select 10 and 11 (slice syntax), --nids=2:5:2 to
select 2,4.
-w WSLICE, --wslice WSLICE
Select the list of works to analyze (python syntax for
slices): Examples: --wslice=1 to select the second
workflow, --wslice=:3 for 0,1,2, --wslice=-1 for the
last workflow, --wslice::2 for even indices.
-S TASK_STATUS, --task-status TASK_STATUS
Select only the tasks with the given status. Default:
None i.e. ignored. Possible values: ['Initialized',
'Locked', 'Ready', 'Submitted', 'Running', 'Done',
'AbiCritical', 'QCritical', 'Unconverged', 'Error',
'Completed'].
-t TASK_CLASS, --task-class TASK_CLASS
Select only tasks with the given class e.g. `-t
NscfTask`.
To get information on the Abinit executable called by AbiPy, use:
abirun.py abibuild
or the verbose variant:
abirun.py abibuild --verbose
TODO: How to reset tasks
TaskPolicy¶
At this point, you may wonder why we need to specify all these parameters in the configuration file. The reason is that, before submitting a job to a resource manager, AbiPy uses the autoparal feature of ABINIT to get all the possible parallel configurations with ncpus <= max_cores. On the basis of these results, AbiPy selects the “optimal” one and changes the ABINIT input file and the submission script accordingly (this is a very useful feature, especially for calculations done with paral_kgb=1 that require the specification of npkpt, npfft, npband, etc.). If more than one qadapter is specified, AbiPy will first compute all the possible configurations and then select the “optimal” qadapter according to some kind of policy.
In some cases, you may want to enforce some constraint on the “optimal” configuration. For example, you may want to select only those configurations whose parallel efficiency is greater than 0.7 and whose total number of MPI processes is divisible by 4. One can easily enforce this constraint via the condition dictionary, whose syntax is similar to the one used in MongoDB:
policy:
autoparal: 1
max_ncpus: 10
condition: {$and: [ {"efficiency": {$gt: 0.7}}, {"tot_ncpus": {$divisible: 4}} ]}
The parallel efficiency is defined as $\epsilon = \dfrac{T_1}{N\,T_N}$ where $N$ is the number of MPI processes and $T_j$ is the wall time needed to complete the calculation with $j$ MPI processes. For a perfectly scaling implementation, $\epsilon$ is equal to one. The parallel speedup with $N$ processors is given by $S = T_1 / T_N$.
Note that autoparal = 1 will automatically change your job.sh script as well as the input file so that the job runs in parallel with the optimal configuration compatible with the constraints given by the user. For example, you can use paral_kgb = 1 in GS calculations and AbiPy will automatically set the values of npband, npfft, npkpt, … for you!
Note that if no configuration fulfills the given condition, AbiPy will use the optimal configuration that leads to the highest parallel speedup (not necessarily the most efficient one).
policy: This section governs the automatic parallelization of the run: in this case AbiPy will use the autoparal capabilities of Abinit to determine an optimal configuration with at most max_ncpus MPI processes. Setting autoparal to 0 disables the automatic parallelization. Other values of autoparal are not supported.