paral_kgb error?

charlesp · Post by **charlesp** » Fri Apr 24, 2015 8:00 am

Hi,

I have been running calculations on two different titanate perovskites, with supercells containing between 80 and 135 atoms.
I have a problem with parallelization. Everything runs fine with paral_kgb 0, but each time I use paral_kgb 1, I get the same error after a while:
--- !ERROR
message: |
abi_xpotrf, info=9 (or 3 sometimes)
src_file: abi_xorthonormalize.f90
src_line: 106
...

Would anyone know if it is a bug in ABINIT, or if it is a compilation problem, and how to solve it?

Thanks a lot!
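
A note for readers hitting the same message: the name abi_xpotrf suggests a wrapper around the LAPACK ?POTRF Cholesky routines (an assumption, not confirmed in this thread). In LAPACK, info = i > 0 means the leading minor of order i is not positive definite, which for an overlap matrix usually points to vectors that have become (nearly) linearly dependent. The pure-Python sketch below mimics that convention on a toy real-valued example; `cholesky_info` and `overlap` are hypothetical helper names, not ABINIT routines.

```python
import math


def overlap(vectors):
    """Gram (overlap) matrix S[i][j] = <v_i | v_j> for real vectors."""
    return [[sum(x * y for x, y in zip(u, v)) for v in vectors] for u in vectors]


def cholesky_info(a):
    """Lower Cholesky factor of a real symmetric matrix given as a list of lists.

    Returns (L, info) with the LAPACK-style convention: info == 0 on success,
    info == i > 0 if the leading minor of order i is not positive definite.
    """
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # Pivot = diagonal entry minus the squares already placed in row j of L.
        pivot = a[j][j] - sum(low[j][k] ** 2 for k in range(j))
        if pivot <= 0.0:
            return low, j + 1  # factorization cannot continue
        low[j][j] = math.sqrt(pivot)
        for i in range(j + 1, n):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = s / low[j][j]
    return low, 0


# The third vector is the sum of the first two, so the set is linearly dependent.
dependent = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, 0.0)]
_, info_dependent = cholesky_info(overlap(dependent))

independent = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
_, info_independent = cholesky_info(overlap(independent))
```

Here `info_dependent` is 3 (the failure shows up at the third vector) and `info_independent` is 0. By analogy, info=9 or info=3 in the error above would indicate which vector in the block being orthonormalized broke the factorization.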

jzwanzig · Post by **jzwanzig** » Tue May 05, 2015 6:16 pm

Hi, we would need to know all about your computer system, build, and input file to make progress on this question.

Joe

charlesp · Post by **charlesp** » Wed May 06, 2015 9:07 am

Hi,

I am computing on a supercomputer made of Intel 12-core (E5-2690 @ 2.6 GHz) nodes, each node having 64 GB of memory.
Regarding compilation, I did not compile it myself, but as far as I know, ABINIT was compiled with the Intel compiler (version 15), linked against the Intel MKL libraries, and parallelized with bullxmpi.

As for the input file, here is an example:

Code:


paral_kgb 1
npkpt 2
npband 10
npfft 6

chkprim 0
nline 8

#Cut-Offs
ecut 20.0
pawecutdg 30.0

#k-points
kptopt 1
ngkpt 2 2 2
occopt 3
tsmear 0.001
nband 800

#SCF cycle
toldff 2.0d-5
nstep 100

#optimization
ionmov 3
tolmxf 4.0d-04
ntime 200

prtden 0
ixc 7


#geometry
acell 7.36068155523333E+00 7.36068155523333E+00 7.36068155523333E+00
scalecart 3.0 3.0 3.0

natom 134
ntypat 3
znucl 82 22 8
typat 27*1 27*2 80*3

charge +2

nsppol 2
nspden 2
spinat 402*0
xred
# Pb atoms
0.000000000000       0.000000000000       0.000000000000
0.333333333333       0.000000000000       0.000000000000
0.666666666667       0.000000000000       0.000000000000
0.000000000000       0.333333333333       0.000000000000
0.333333333333       0.333333333333       0.000000000000
0.666666666667       0.333333333333       0.000000000000
0.000000000000       0.666666666667       0.000000000000
0.333333333333       0.666666666667       0.000000000000
0.666666666667       0.666666666667       0.000000000000
0.000000000000       0.000000000000       0.333333333333
0.333333333333       0.000000000000       0.333333333333
0.666666666667       0.000000000000       0.333333333333
0.000000000000       0.333333333333       0.333333333333
0.333333333333       0.333333333333       0.333333333333
0.666666666667       0.333333333333       0.333333333333
0.000000000000       0.666666666667       0.333333333333
0.333333333333       0.666666666667       0.333333333333
0.666666666667       0.666666666667       0.333333333333
0.000000000000       0.000000000000       0.666666666667
0.333333333333       0.000000000000       0.666666666667
0.666666666667       0.000000000000       0.666666666667
0.000000000000       0.333333333333       0.666666666667
0.333333333333       0.333333333333       0.666666666667
0.666666666667       0.333333333333       0.666666666667
0.000000000000       0.666666666667       0.666666666667
0.333333333333       0.666666666667       0.666666666667
0.666666666667       0.666666666667       0.666666666667
# Ti atoms
0.166666666667       0.166666666667       0.166666666667
0.500000000000       0.166666666667       0.166666666667
0.833333333333       0.166666666667       0.166666666667
0.166666666667       0.500000000000       0.166666666667
0.500000000000       0.500000000000       0.166666666667
0.833333333333       0.500000000000       0.166666666667
0.166666666667       0.833333333333       0.166666666667
0.500000000000       0.833333333333       0.166666666667
0.833333333333       0.833333333333       0.166666666667
0.166666666667       0.166666666667       0.500000000000
0.500000000000       0.166666666667       0.500000000000
0.833333333333       0.166666666667       0.500000000000
0.166666666667       0.500000000000       0.500000000000
0.500000000000       0.500000000000       0.500000000000
0.833333333333       0.500000000000       0.500000000000
0.166666666667       0.833333333333       0.500000000000
0.500000000000       0.833333333333       0.500000000000
0.833333333333       0.833333333333       0.500000000000
0.166666666667       0.166666666667       0.833333333333
0.500000000000       0.166666666667       0.833333333333
0.833333333333       0.166666666667       0.833333333333
0.166666666667       0.500000000000       0.833333333333
0.500000000000       0.500000000000       0.833333333333
0.833333333333       0.500000000000       0.833333333333
0.166666666667       0.833333333333       0.833333333333
0.500000000000       0.833333333333       0.833333333333
0.833333333333       0.833333333333       0.833333333333
# O1 atoms
0.166666666667       0.000000000000       0.166666666667
0.500000000000       0.000000000000       0.166666666667
0.833333333333       0.000000000000       0.166666666667
0.166666666667       0.333333333333       0.166666666667
0.500000000000       0.333333333333       0.166666666667
0.833333333333       0.333333333333       0.166666666667
0.166666666667       0.666666666667       0.166666666667
0.500000000000       0.666666666667       0.166666666667
0.833333333333       0.666666666667       0.166666666667
0.166666666667       0.000000000000       0.500000000000
0.500000000000       0.000000000000       0.500000000000
0.833333333333       0.000000000000       0.500000000000
0.166666666667       0.333333333333       0.500000000000
#0.500000000000       0.333333333333       0.500000000000
0.833333333333       0.333333333333       0.500000000000
0.166666666667       0.666666666667       0.500000000000
0.500000000000       0.666666666667       0.500000000000
0.833333333333       0.666666666667       0.500000000000
0.166666666667       0.000000000000       0.833333333333
0.500000000000       0.000000000000       0.833333333333
0.833333333333       0.000000000000       0.833333333333
0.166666666667       0.333333333333       0.833333333333
0.500000000000       0.333333333333       0.833333333333
0.833333333333       0.333333333333       0.833333333333
0.166666666667       0.666666666667       0.833333333333
0.500000000000       0.666666666667       0.833333333333
0.833333333333       0.666666666667       0.833333333333
# O2 atoms
0.000000000000       0.166666666667       0.166666666667
0.333333333333       0.166666666667       0.166666666667
0.666666666667       0.166666666667       0.166666666667
0.000000000000       0.500000000000       0.166666666667
0.333333333333       0.500000000000       0.166666666667
0.666666666667       0.500000000000       0.166666666667
0.000000000000       0.833333333333       0.166666666667
0.333333333333       0.833333333333       0.166666666667
0.666666666667       0.833333333333       0.166666666667
0.000000000000       0.166666666667       0.500000000000
0.333333333333       0.166666666667       0.500000000000
0.666666666667       0.166666666667       0.500000000000
0.000000000000       0.500000000000       0.500000000000
0.333333333333       0.500000000000       0.500000000000
0.666666666667       0.500000000000       0.500000000000
0.000000000000       0.833333333333       0.500000000000
0.333333333333       0.833333333333       0.500000000000
0.666666666667       0.833333333333       0.500000000000
0.000000000000       0.166666666667       0.833333333333
0.333333333333       0.166666666667       0.833333333333
0.666666666667       0.166666666667       0.833333333333
0.000000000000       0.500000000000       0.833333333333
0.333333333333       0.500000000000       0.833333333333
0.666666666667       0.500000000000       0.833333333333
0.000000000000       0.833333333333       0.833333333333
0.333333333333       0.833333333333       0.833333333333
0.666666666667       0.833333333333       0.833333333333
# O3 atoms
0.166666666667       0.166666666667       0.000000000000
0.500000000000       0.166666666667       0.000000000000
0.833333333333       0.166666666667       0.000000000000
0.166666666667       0.500000000000       0.000000000000
0.500000000000       0.500000000000       0.000000000000
0.833333333333       0.500000000000       0.000000000000
0.166666666667       0.833333333333       0.000000000000
0.500000000000       0.833333333333       0.000000000000
0.833333333333       0.833333333333       0.000000000000
0.166666666667       0.166666666667       0.333333333333
0.500000000000       0.166666666667       0.333333333333
0.833333333333       0.166666666667       0.333333333333
0.166666666667       0.500000000000       0.333333333333
0.500000000000       0.500000000000       0.333333333333
0.833333333333       0.500000000000       0.333333333333
0.166666666667       0.833333333333       0.333333333333
0.500000000000       0.833333333333       0.333333333333
0.833333333333       0.833333333333       0.333333333333
0.166666666667       0.166666666667       0.666666666667
0.500000000000       0.166666666667       0.666666666667
0.833333333333       0.166666666667       0.666666666667
0.166666666667       0.500000000000       0.666666666667
0.500000000000       0.500000000000       0.666666666667
0.833333333333       0.500000000000       0.666666666667
0.166666666667       0.833333333333       0.666666666667
0.500000000000       0.833333333333       0.666666666667
0.833333333333       0.833333333333       0.666666666667

Thanks a lot!
Charles.
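
A quick consistency check on the input above: ABINIT accepts `n*value` as shorthand for a value repeated n times, so the lengths of typat and spinat can be verified against natom before a long run is submitted. The sketch below is a minimal helper that handles only the `n*value` form; the name `expand` is hypothetical.

```python
def expand(text):
    """Expand ABINIT-style 'n*value' repetition tokens into a flat list of strings."""
    out = []
    for token in text.split():
        if "*" in token:
            count, value = token.split("*", 1)
            out.extend([value] * int(count))
        else:
            out.append(token)
    return out


natom = 134
typat = expand("27*1 27*2 80*3")
spinat = expand("402*0")

typat_ok = len(typat) == natom        # one type index per atom
spinat_ok = len(spinat) == 3 * natom  # three spin components per atom
```

Both checks pass for this input: 27 Pb + 27 Ti + 80 O = 134 atoms, matching the xred list, where one of the 81 oxygen positions is commented out.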

Jordan · Post by **Jordan** » Wed May 06, 2015 2:16 pm

Hi,

Intel 15 is not officially supported yet.

As far as I can tell, the KGB parallelization often has trouble with orthogonalization, even though it is efficient.
I would suggest compiling ABINIT with the --enable-zdot-bugfix option if it is linked against MKL.

It works without paral_kgb because the algorithm and part of the executed code are different.

Hope that helps

Jordan

charlesp · Post by **charlesp** » Thu May 07, 2015 2:32 pm

Hi,

This error actually also happens with the ABINIT version compiled on my laptop, with the GNU compilers, openmpi, fftw3, and ATLAS BLAS and LAPACK. I built this version of ABINIT following the tutorial at http://www.youtube.com/watch?v=DppLQ-KQA68.

So I still get this error when using paral_kgb 1, even when I use only 1 processor each for band and FFT parallelization and, for instance, 5 for npkpt.

Any idea?

Charles.

Jordan · Post by **Jordan** » Thu May 07, 2015 5:05 pm

I'm running some tests.

Can you tell me where the code crashes?
At which Broyden iteration?
At which SCF iteration?
Or does it crash at the very beginning?

Jordan

EDIT: I have been running your input file with JTH pseudopotentials on 8 CPUs (nkpt 2 npband 1 bandpp 2 npfft 4) and it is still running (18th SCF iteration of the 1st Broyden step).
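
The process counts quoted in this thread can be cross-checked with a few lines of Python. The sketch assumes that a paral_kgb 1 run uses npkpt × npspinor × npband × npfft MPI processes and that nband should be a multiple of npband × bandpp; both rules are my reading of the ABINIT documentation and worth verifying for your version. It also reads the `nkpt 2` in the EDIT above as two k-point processes (npkpt = 2), which is an assumption. The function name `check_kgb_grid` is hypothetical.

```python
def check_kgb_grid(npkpt, npband, npfft, nband, npspinor=1, bandpp=1):
    """Return (total MPI processes, whether nband divides evenly over band blocks)."""
    nproc = npkpt * npspinor * npband * npfft
    bands_ok = nband % (npband * bandpp) == 0
    return nproc, bands_ok


# 134-atom supercell as posted: npkpt 2, npband 10, npfft 6, nband 800
original_run = check_kgb_grid(2, 10, 6, 800)

# Same input on 8 CPUs: npband 1, bandpp 2, npfft 4, with npkpt assumed to be 2
test_run = check_kgb_grid(2, 1, 4, 800, bandpp=2)
```

Under these assumptions `original_run` is `(120, True)` and `test_run` is `(8, True)`, so both layouts are self-consistent. The two runs use different band distributions, however, so the 8-CPU test does not exercise exactly the same code path as the 120-process run.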

charlesp · Post by **charlesp** » Sat May 09, 2015 8:38 pm

It actually depends on the system. For most of them, the code completes a few Broyden iterations before crashing.
Here is a simpler input file where, on the contrary, the run does not complete any Broyden step and stops after 25 minutes at SCF step #9.

Code:

getxred -1
getcell -1

# Allocated Resources
# -------------------------------------------------------------------------------------------------
paral_kgb 1
npkpt 27
npband 4
npspinor 1
npfft 1
# -------------------------------------------------------------------------------------------------

ndtset 5

# Convergence Parameters
# -------------------------------------------------------------------------------------------------
# Plane-waves cut-offs :
ecut 30.0
pawecutdg: 30.0
pawecutdg+ 5.0
nline 10
nband 140
npulayit 14

# k-pt mesh
kptopt 1
ngkpt 6 6 6
occopt 3
tsmear 0.0001

# Convergence criteria
#SCF cycle
toldff 1.0d-6
nstep 200
# Geometry Optimization
ionmov 3
optcell 2
ecutsm 0.5
tolmxf 5.0d-5
ntime 200
dilatmx 1.15

prtden 0
prtwf 0
# Physics Parameters
# -------------------------------------------------------------------------------------------------
# DFT + U
usepawu 1
lpawu -1 2 -1
upawu 0.000000000 0.110247972 0.000000000 # 4eV for Gd f band, 3eV for Ti d band, see Appl. Phys. Lett., 105, 172402 (2014)
jpawu 0.000000000 0.000000000 0.000000000

# Cell geometry 
# -------------------------------------------------------------------------------------------------
# Atoms
natom 20
ntypat 3
znucl 64 22 8
typat 4*1 4*2 12*3

# Space Group #62
spgroup 62
spgaxor 1 # --> Configuration of the axes of Pnma type
spgorig 1
brvltt 1

# Magnetism
nsppol 2
nspden 2
#spinat 7.0 2*0 7.0 2*0 7.0 2*0 7.0 2*0 -1.0 2*0 -1.0 2*0 -1.0 2*0 -1.0 2*0 36*0 # Gd 7mu_b; Ti 1 mu_b --> AFM coupled
spinat 2*0 0.0 2*0 0.0 2*0 0.0 2*0 0.0 2*0 1.0 2*0 1.0 2*0 1.0 2*0 1.0 36*0

# The pc axes are along [110],[-110] and [001] of the orthorhombic axes
acell 1.0927882596E+01  1.4614453832E+01  1.0220501783E+01
angdeg 90.0 90.0 90.0
xred
# Gd atoms (0.0,0.0,0.0) in the pc cell
7.3064947075E-02  2.5000000000E-01  9.7718907754E-01
9.2693505293E-01  7.5000000000E-01  2.2810922462E-02
4.2693505293E-01  7.5000000000E-01  4.7718907754E-01
5.7306494707E-01  2.5000000000E-01  5.2281092246E-01
# Ti atoms (0.5,0.5,0.5) in the pc cell
5.0000000000E-01  0.0000000000E+00  0.0000000000E+00
-2.4884908616E-33 -2.7903770249E-33  5.0000000000E-01
5.0000000000E-01  5.0000000000E-01  0.0000000000E+00
-2.4884908616E-33  5.0000000000E-01  5.0000000000E-01
# O1 atoms (0.5,0.0,0.5) in the pc cell
1.8987146189E-01  9.4084057328E-01  1.9147676300E-01
8.1012853811E-01  5.9159426722E-02  8.0852323700E-01
8.1012853811E-01  4.4084057328E-01  8.0852323700E-01
1.8987146189E-01  5.5915942672E-01  1.9147676300E-01
# O2 atoms (0.0,0.5,0.5) in the pc cell
3.1012853811E-01  5.9159426722E-02  6.9147676300E-01
6.8987146189E-01  9.4084057328E-01  3.0852323700E-01
6.8987146189E-01  5.5915942672E-01  3.0852323700E-01
3.1012853811E-01  4.4084057328E-01  6.9147676300E-01
# O3 atoms (0.5,0.5,0.0) in the pc cell
9.5581685847E-01  2.5000000000E-01  3.7478305796E-01
4.5581685847E-01  2.5000000000E-01  1.2521694204E-01
5.4418314153E-01  7.5000000000E-01  8.7478305796E-01
4.4183141530E-02  7.5000000000E-01  6.2521694204E-01

# Exchange-Correlation functional
ixc 11 # to be used with PAW method

Thanks,
Charles.
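
A note on units in the DFT+U block above: ABINIT reads energies in Hartree unless a unit is given, so the 3 eV quoted in the comment for the Ti d band corresponds to the upawu entry 0.110247972. The conversion can be reproduced as follows; the constant is an approximate value of the Hartree energy in eV, and the function name `ev_to_hartree` is hypothetical.

```python
HARTREE_IN_EV = 27.211386  # approximate value of 1 Ha in eV


def ev_to_hartree(energy_ev):
    """Convert an energy from electronvolts to Hartree."""
    return energy_ev / HARTREE_IN_EV


u_ti = ev_to_hartree(3.0)  # close to the 0.110247972 used in the input
```

The comment in the input also mentions 4 eV for the Gd f band, but with lpawu -1 2 -1 and a zero first entry in upawu, no U is applied to Gd in this file as posted.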

charlesp · Post by **charlesp** » Sat May 09, 2015 8:42 pm

As for the bigger example that I gave earlier, it crashes after SCF iteration #8, Broyden step #5, after 10h of computing.
Charles.

Jordan · Post by **Jordan** » Mon May 11, 2015 10:45 am

I have not tried the smaller example, but the bigger one given earlier is still running on my computer and is at Broyden step 8, SCF iteration 8.
So it is most likely an architecture/compilation issue.

In any case, if you can perform a few Broyden steps, you can restart your calculation each time it crashes by taking the last coordinates and lattice vectors and setting them in your input file.
Meanwhile, you can work on improving the compilation of ABINIT.
You should perhaps run the full test suite on your installation to make sure that at least the tests pass. If they do not, you really should change the way ABINIT was compiled.

If you have any questions or issues regarding configuration or compilation, don't hesitate to ask.

Cheers,

Jordan

ABINIT Discussion Forums