Screening crashed at some q-points

thanusit
Posts: 70
Joined: Thu Jan 14, 2010 4:20 am

Post by thanusit » Mon Mar 12, 2018 8:55 am

Dear all

I have tried to carry out a GW calculation by following the procedure given in tutoparal/tmbt_x. The calculation has four steps: gs_DEN, WFK, screening, and sigma. The first two steps ran without problems. In the screening step, which I ran one q-point at a time, the runs crashed at 5 of the 35 q-points. The input file, the error message in "__ABI_MPIABORTFILE__", the relevant part of the log file, and the build information are given below. (A similar issue has been posted in this forum, but it has not been answered yet.)

I don't know where I went wrong. Any suggestions for solving or working around the problem would be greatly appreciated.

Best regards,
Thanusit

# input file

# Crystalline Ga4P3Ti
# Calculation of the GW-corrected band structure.

# Part-3: Screening Calculation 
 optdriver   3   # Screening calculation
   irdwfk    1
    symchi   1   # Use symmetries to speedup the BZ integration

 gwcalctyp   2   # GW calculation using numerical integration (contour deformation method)
    spmeth   1    # Enable the spectral method.
  nomegasf   100  # Number of points for the spectral function.
    gwpara   2   # Parallelization over bands
      awtr   1   # Take advantage of time-reversal. Mandatory when gwpara=2 is used.

   ecutwfn   12  # Cutoff for the wavefunctions.
   ecuteps   12  # Cutoff for the polarizability.
     nband   60  # Number of bands in the RPA expression (occupied + empty bands)
   inclvkb   2   # Correct treatment of the optical limit.

   nfreqim   5
   nfreqre   20
 freqremax   40 eV

   nqptdm   1
    qptdm 
#     1)     0.00000000E+00  0.00000000E+00  0.00000000E+00
#     2)     1.25000000E-01  0.00000000E+00  0.00000000E+00
#     3)     2.50000000E-01  0.00000000E+00  0.00000000E+00
#     4)     3.75000000E-01  0.00000000E+00  0.00000000E+00
#     5)     5.00000000E-01  0.00000000E+00  0.00000000E+00
#     6)     1.25000000E-01  1.25000000E-01  0.00000000E+00
#     7)     2.50000000E-01  1.25000000E-01  0.00000000E+00
#     8)     3.75000000E-01  1.25000000E-01  0.00000000E+00
#     9)     5.00000000E-01  1.25000000E-01  0.00000000E+00
#    10)     2.50000000E-01  2.50000000E-01  0.00000000E+00
#    11)     3.75000000E-01  2.50000000E-01  0.00000000E+00
#    12)     5.00000000E-01  2.50000000E-01  0.00000000E+00
#    13)     3.75000000E-01  3.75000000E-01  0.00000000E+00
#    14)     5.00000000E-01  3.75000000E-01  0.00000000E+00
#    15)     5.00000000E-01  5.00000000E-01  0.00000000E+00
#    16)     1.25000000E-01  1.25000000E-01  1.25000000E-01
#    17)     2.50000000E-01  1.25000000E-01  1.25000000E-01
#    18)     3.75000000E-01  1.25000000E-01  1.25000000E-01
#    19)     5.00000000E-01  1.25000000E-01  1.25000000E-01
             2.50000000E-01  2.50000000E-01  1.25000000E-01
#    21)     3.75000000E-01  2.50000000E-01  1.25000000E-01
#    22)     5.00000000E-01  2.50000000E-01  1.25000000E-01
#    23)     3.75000000E-01  3.75000000E-01  1.25000000E-01
#    24)     5.00000000E-01  3.75000000E-01  1.25000000E-01
#    25)     5.00000000E-01  5.00000000E-01  1.25000000E-01
#    26)     2.50000000E-01  2.50000000E-01  2.50000000E-01
#    27)     3.75000000E-01  2.50000000E-01  2.50000000E-01
#    28)     5.00000000E-01  2.50000000E-01  2.50000000E-01
#    29)     3.75000000E-01  3.75000000E-01  2.50000000E-01
#    30)     5.00000000E-01  3.75000000E-01  2.50000000E-01
#    31)     5.00000000E-01  5.00000000E-01  2.50000000E-01
#    32)     3.75000000E-01  3.75000000E-01  3.75000000E-01
#    33)     5.00000000E-01  3.75000000E-01  3.75000000E-01
#    34)     5.00000000E-01  5.00000000E-01  3.75000000E-01
#    35)     5.00000000E-01  5.00000000E-01  5.00000000E-01

########################################
#Common input variables

#Definition of the unit cell
   acell  3*1.0306603546E+01   # From structural optimization with optcell2, iscf7
                           # ionmove3, occopt3, ngkpt888, tsmear0.01, ecut28, nband54, tolmxf-9, toldfe-12
   rprim  1.0 0.0 0.0      # Cartesian components of primitive vectors
          0.0 1.0 0.0      # of the simple cubic lattice
          0.0 0.0 1.0   

#Definition of the atomic types, numbers, and positions
  ntypat  3           # There are three types of atom in a unit cell
   znucl  31 15 22    # the atomic number of Ga, P and Ti, respectively.
                      # The pseudopotentials mentioned in the "files" file must correspond
                      # to the types of atom.
   natom  8           # There are 4 Ga, 3 P and 1 Ti atoms.
   typat 1 1 1 1 3 2 2 2  # 1=Ga, 2=P, 3=Ti (doping position)
    xred               
           -1.7296466397E-02 -1.7296466397E-02 -1.7296466397E-02
            5.1729646640E-01  5.1729646640E-01 -1.7296466397E-02
            5.1729646640E-01 -1.7296466397E-02  5.1729646640E-01
           -1.7296466397E-02  5.1729646640E-01  5.1729646640E-01
            2.5000000000E-01  2.5000000000E-01  2.5000000000E-01
            2.5000000000E-01  7.5000000000E-01  7.5000000000E-01
            7.5000000000E-01  2.5000000000E-01  7.5000000000E-01
            7.5000000000E-01  7.5000000000E-01  2.5000000000E-01

#State occupation
  occopt  3     #metallic   
  tsmear  0.01

#Default k-points grid for k-centered wfk, scr, sigma calculations
    kptopt  1            # Option for the automatic generation of k points, taking
                          # into account the symmetry
     ngkpt  8 8 8         # Density of k point grids
   nshiftk  1
    shiftk  0.0 0.0 0.0  # This grid contains the Gamma point, which is the point at which
    istwfk *1            # For the GW computations, do not take advantage of the
                          # specificities of k points to reduce the number of components of the
                          # wavefunction.

#Include non-symmorphic operations
 symmorphi  1    # This just emphasizes the use of the default value.

#Exchange-correlation functional
       ixc  1    # corresponding to the pspnc used

#Definition of the planewave basis set
      ecut  32            # Maximal kinetic energy cut-off, in Hartree
    ecutsm  0.5
   dilatmx  1.2
                       
# add to conserve old < 6.7.2 behavior for calculating forces at each SCF step
 optforces 1

#data printing option
    prtvol 1
 


# Executing cmd: mpirun -n 7 abinit < Ga4P3Ti_lda_gw_band-3.files >& log
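
For running the screening one q-point at a time, the nqptdm/qptdm lines can be generated rather than uncommented by hand. The following is a minimal sketch, assuming the 35 q-points are exactly the points (a, b, c) with 0.5 >= a >= b >= c >= 0 on the 8x8x8 grid, which matches the commented list in the input above for this cubic cell. The helper names `qpoint_list` and `qptdm_block` are hypothetical, and this is not a general irreducible-wedge generator.

```python
def qpoint_list(ndiv=8):
    """Enumerate (a, b, c) with 0.5 >= a >= b >= c >= 0 on an ndiv grid.

    The loop order (c slowest, a fastest) reproduces the numbering of the
    commented q-point list in the input file. This is an assumption that
    holds for the list in the post, not a general symmetry analysis.
    """
    vals = [i / ndiv for i in range(ndiv // 2 + 1)]  # 0, 1/8, ..., 1/2
    pts = []
    for c in vals:
        for b in vals:
            if b < c:
                continue
            for a in vals:
                if a < b:
                    continue
                pts.append((a, b, c))
    return pts


def qptdm_block(q):
    """Return the two input lines that select a single q-point."""
    return "   nqptdm   1\n    qptdm   {:.8E}  {:.8E}  {:.8E}\n".format(*q)


qpts = qpoint_list()
for iq, q in enumerate(qpts, start=1):
    print(f"# q-point {iq}")
    print(qptdm_block(q))
```

Each printed block can be pasted into a copy of the input in place of the nqptdm/qptdm section; q-point 20 in this enumeration is the one left active in the input above.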

#Error message in __ABI_MPIABORTFILE__ (similar errors, with different values of my_wr, occurred in the other failed runs)

--- !BUG
src_file: m_chi0.F90
src_line: 2037
mpi_rank: 6
message: |

     Indices out of boundary
      my_wl = 2 iomegal = 1
      my_wr = 100 iomegar = 2
...



#log file (see attachment)
#part of the log file that seems to relate to the error message above

=== Info on the real frequency mesh for spectral method ===
  maximum frequency =   30.453 [eV]
  nomegasf =   100
  domegasf =  0.31369 [eV]
 Using linear mesh for Im chi0
 my_wl and my_wr: 4 73
 memory required per spectral point:    67.8693 [Mb]
 memory required by sf_chi0:     4.6395 [Gb]

--- !WARNING
src_file: cchi0.F90
src_line: 398
message: |
    Memory required for sf_chi0 is larger than 2.0 Gb!
...
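
As a sanity check on the log excerpt above, the reported sf_chi0 memory is consistent with the per-point memory multiplied by the number of spectral points in the range my_wl..my_wr. The sketch below assumes the range is inclusive and that 1 Gb = 1024 Mb; both are assumptions inferred from the numbers in the log, not taken from the ABINIT source, and the function name is hypothetical.

```python
def sf_chi0_memory_gb(mb_per_point, my_wl, my_wr):
    """Estimate the sf_chi0 memory in Gb from the values printed in the log.

    Assumes the spectral points my_wl..my_wr are counted inclusively and
    that 1 Gb = 1024 Mb (assumptions, checked only against this log).
    """
    npoints = my_wr - my_wl + 1
    return mb_per_point * npoints / 1024.0


# Values copied from the log excerpt: 67.8693 Mb per point, my_wl=4, my_wr=73
mem = sf_chi0_memory_gb(67.8693, 4, 73)
print(f"{mem:.4f} Gb")  # prints 4.6395 Gb, matching the log
```

With 70 spectral points held by this process, the estimate reproduces the 4.6395 Gb that triggers the warning.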


#Build version and platform

- Abinit-8.6.3
- Intel Xeon under Ubuntu 16.04.4 LTS (GNU/Linux 4.4.0-112-generic x86_64) with gcc-5.4.0 and openmpi-1.10.3 (locally built)
- locally built external libs: netcdf-4.3.1 (with hdf5-1.8.12), fftw-3.3.4, atlas-3.10.1 (with lapack-3.5.0), gsl-1.16, levmar-2.5


#build config.ac

prefix="/data2/thanusit/local/apps/abinit-8.6.3"
enable_exports="yes"
enable_64bit_flags="yes"
enable_gw_dpc="yes"
enable_mpi="yes"
enable_mpi_io="yes"
enable_bse_unpacked="yes"
with_mpi_prefix="/data2/thanusit/local/apps/openmpi-1.10.3"
with_trio_flavor="netcdf"
with_netcdf_incs="-I/data2/thanusit/local/apps/netcdfmpi/include"
with_netcdf_libs="-L/data2/thanusit/local/apps/netcdfmpi/lib -lnetcdf -lnetcdff"
with_fft_flavor="fftw3"
with_fft_incs="-I/data2/thanusit/local/apps/fftw-3.3.4/include"
with_fft_libs="-L/data2/thanusit/local/apps/fftw-3.3.4/lib -lfftw3 -lfftw3f"
#with_fft_incs="-I/usr/include"
#with_fft_libs="-L/usr/lib/x86_64-linux-gnu -lfftw3 -lfftw3f"
with_linalg_flavor="atlas"
with_linalg_incs="-I/data2/thanusit/local/apps/atlas/include -I/data2/thanusit/local/apps/atlas/include/atlas"
with_linalg_libs="-L/data2/thanusit/local/apps/atlas/lib -ltatlas -L/usr/lib -lblacs-openmpi -lblacsCinit-openmpi -lblacsF77init-openmpi"
#with_linalg_incs="-I/usr/include"
#with_linalg_libs="-L/usr/lib -llapack -lf77blas -lcblas -latlas"
with_algo_flavor="levmar"
with_algo_incs="-I/data2/thanusit/local/apps/levmar-2.5/include"
with_algo_libs="-L/data2/thanusit/local/apps/levmar-2.5/lib -llevmar"
with_math_flavor="gsl"
with_math_incs="-I/data2/thanusit/local/apps/gsl-1.16/include/gsl"
with_math_libs="-L/data2/thanusit/local/apps/gsl-1.16/lib -lgsl -lgslcblas -lm"
with_dft_flavor="atompaw+bigdft+libxc+wannier90"


# config.log (see attachment)

# Test results (../abinit-8.6.3/tests/runtest.py -n2 -j1 -t0)

failed    passed    succeeded    skipped    disabled
13        93        637          167        0
tot_etime = 49759.87
run_etime = 49611.25
no_pythreads = 1
no_MPI = 2
[MPI setup]
mpirun_np = mpirun -np

Attachments:
log.in: the log file from running the input (210.2 KiB)
config-log.in: the config.log file of the built abinit-8.6.3 (180.8 KiB)

ebousquet
Posts: 469
Joined: Tue Apr 19, 2011 11:13 am
Location: University of Liege, Belgium

Re: Screening crashed at some q-points

Post by ebousquet » Tue Mar 13, 2018 10:48 am

Dear Thanusit,
I don't know the GW part of Abinit, but 13 of your automatic tests failed. This means that your build of Abinit breaks some tests, so it might not be safe to use, especially if the failing tests are the GW ones... Could you check which 13 automatic tests failed?
Best wishes,
Eric

Post by thanusit » Thu Mar 15, 2018 3:51 pm

Dear Eric and all

Thank you for your reply.

Among the 13 failed tests, one is indeed related to GW calculations: v67mbpt_t19. A close look at the test report and the output file t19.out shows that the RPA energy obtained in dataset #3 is clearly inconsistent with the value in the reference file (../Ref/t19.out), as shown below. Note that the 1st, 2nd, 3rd, and 4th rows of each group correspond to datasets #2, #3, #4, and #5, respectively.

(Test suite with np =2)
/Test_suite_full_-n2-j1-t0_13-failed_93-passed_637-succeeded/v67mbpt_t19$ grep "RPA" t19.out
 RPA energy [Ha] :     -0.32125626
 RPA energy [Ha] :     -0.00001626
 RPA energy [Ha] :     -0.41250684
 RPA energy [Ha] :     -0.21616125

(../Ref/t19.out)
grep "RPA" /data2/thanusit/src_build/abinit-8.6.3/tests/v67mbpt/Refs/t19.out
 RPA energy [Ha] :     -0.32125626
 RPA energy [Ha] :     -0.32125626
 RPA energy [Ha] :     -0.41250684
 RPA energy [Ha] :     -0.21616125


However, when I executed ../abinit-8.6.3/tests/runtest.py v67mbpt with only 1 process (-n1 -j1 -t0), 3 tests in the set passed and the rest, including v67mbpt_t19, all succeeded. The RPA energies also all agree with those of the reference file to within 1E-7, as shown below.

 grep "RPA" ../Test_suite_v67mbpt_-n1_-j1_t0_0-failed/v67mbpt_t19/t19.out
 RPA energy [Ha] :     -0.32125627
 RPA energy [Ha] :     -0.32125627
 RPA energy [Ha] :     -0.41250685
 RPA energy [Ha] :     -0.21616125


In addition, I ran v67mbpt_t19 directly with mpirun using various numbers of processes: -n = 1, 2, 3, 4, 5, 6, 10. The outputs, listed below, show that the RPA energy in dataset #3 is consistent with the reference file only when n-proc=1; otherwise it deviates from the reference, with a value that depends on n-proc (n-proc=2 and 4 happen to give the same value).

/Test_v67mbpt_t19$ grep "RPA" v67mbpt_t19_np1/t19.out
 RPA energy [Ha] :     -0.32125627
 RPA energy [Ha] :     -0.32125627
 RPA energy [Ha] :     -0.41250685
 RPA energy [Ha] :     -0.21616125

/Test_v67mbpt_t19$ grep "RPA" v67mbpt_t19_np2/t19.out
 RPA energy [Ha] :     -0.32125626
 RPA energy [Ha] :     -0.00001626
 RPA energy [Ha] :     -0.41250684
 RPA energy [Ha] :     -0.21616125

/Test_v67mbpt_t19$ grep "RPA" v67mbpt_t19_np3/t19.out
 RPA energy [Ha] :     -0.32125627
 RPA energy [Ha] :     -0.34181169
 RPA energy [Ha] :     -0.41250684
 RPA energy [Ha] :     -0.21616125

/Test_v67mbpt_t19$ grep "RPA" v67mbpt_t19_np4/t19.out
 RPA energy [Ha] :     -0.32125626
 RPA energy [Ha] :     -0.00001626
 RPA energy [Ha] :     -0.41250684
 RPA energy [Ha] :     -0.21616125

/Test_v67mbpt_t19_np1-10$ grep "RPA" v67mbpt_t19_np5/t19.out
 RPA energy [Ha] :     -0.32125626
 RPA energy [Ha] :     -0.53340146
 RPA energy [Ha] :     -0.41250684
 RPA energy [Ha] :     -0.21616125

/Test_v67mbpt_t19_np1-10$ grep "RPA" v67mbpt_t19_np6/t19.out
 RPA energy [Ha] :     -0.32125626
 RPA energy [Ha] :     -0.02057168
 RPA energy [Ha] :     -0.41250684
 RPA energy [Ha] :     -0.21616125

/Test_v67mbpt_t19$ grep "RPA" v67mbpt_t19_np10/t19.out
 RPA energy [Ha] :     -0.32125626
 RPA energy [Ha] :     -0.23271688
 RPA energy [Ha] :     -0.41250684
 RPA energy [Ha] :     -0.21616125
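
The comparison in the listings above can be automated. The following is a minimal sketch, assuming each output contains lines of the form "RPA energy [Ha] : value" in dataset order (#2 to #5, as in the post); the function names are hypothetical and the sample strings are copied from the np2 and reference listings.

```python
import re

# Matches lines such as " RPA energy [Ha] :     -0.32125626"
RPA_RE = re.compile(r"RPA energy \[Ha\]\s*:\s*(-?\d+\.\d+)")


def rpa_energies(text):
    """Return the RPA energies found in the text, in order of appearance."""
    return [float(m) for m in RPA_RE.findall(text)]


def mismatches(run, ref, tol=1e-7, first_dataset=2):
    """List (dataset, run_value, ref_value) where the values differ by more than tol."""
    return [
        (first_dataset + i, a, b)
        for i, (a, b) in enumerate(zip(run, ref))
        if abs(a - b) > tol
    ]


REF = """
 RPA energy [Ha] :     -0.32125626
 RPA energy [Ha] :     -0.32125626
 RPA energy [Ha] :     -0.41250684
 RPA energy [Ha] :     -0.21616125
"""

NP2 = """
 RPA energy [Ha] :     -0.32125626
 RPA energy [Ha] :     -0.00001626
 RPA energy [Ha] :     -0.41250684
 RPA energy [Ha] :     -0.21616125
"""

for dataset, got, expected in mismatches(rpa_energies(NP2), rpa_energies(REF)):
    print(f"dataset #{dataset}: got {got:.8f}, reference {expected:.8f}")
```

In practice the two strings would be read from the t19.out files of the run and of the reference directory; with the data above, only dataset #3 is flagged.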



I don't know what these results suggest, and I'm not sure whether they are related to the crashed runs described above. Is it still safe to use this build of abinit-8.6.3 for production, especially for GW work? I hope to get some advice.

Kind regards,
Thanusit

Post by ebousquet » Fri Mar 16, 2018 4:00 pm

Dear Thanusit,
There is something wrong with the parallel compilation.
Could you add the following in your config file to be sure no extra flags are used:
enable_optim="standard"

Can't you install a more recent version of openmpi?

All the best,
Eric

Post by thanusit » Mon Mar 19, 2018 8:01 am

Dear Eric and all

1. Regarding the failed test t19 of v67mbpt, I tried adding enable_optim="standard" to the config options. Runtest of v67mbpt with -n2 -j1 -t0 still failed with exactly the same result as before. I will try a newer version of openmpi to see whether that helps.

2. For the crashed screening runs, the problem went away when I set nband=56 for n-proc=7. The idea was simply to make n-proc divide nband. I'm not sure whether this is the proper solution to the problem.
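
To illustrate the divisibility point, here is a minimal sketch of how 60 and 56 bands would split over 7 processes. It assumes a simple block distribution in which leftover bands go to the lowest ranks; this is an illustrative assumption, not necessarily how ABINIT distributes bands with gwpara=2, and both function names are hypothetical.

```python
def bands_per_rank(nband, nproc):
    """Block distribution: the first (nband % nproc) ranks get one extra band.

    Illustrative assumption only, not taken from the ABINIT source.
    """
    base, extra = divmod(nband, nproc)
    return [base + (1 if rank < extra else 0) for rank in range(nproc)]


def nearest_divisible(nband, nproc):
    """Return the closest values at or below and at or above nband that nproc divides."""
    lower = (nband // nproc) * nproc
    upper = lower if lower == nband else lower + nproc
    return lower, upper


print(bands_per_rank(60, 7))     # [9, 9, 9, 9, 8, 8, 8]  uneven load
print(bands_per_rank(56, 7))     # [8, 8, 8, 8, 8, 8, 8]  even load
print(nearest_divisible(60, 7))  # (56, 63)
```

For nband=60 and n-proc=7 the nearest evenly divisible choices are 56 and 63, and 56 is the value that worked here.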

Kind regards
Thanusit

Post by ebousquet » Mon Mar 19, 2018 11:33 pm

thanusit wrote: 1. Regarding the failed test t19 of v67mbpt, I tried adding enable_optim="standard" to the config options. Runtest of v67mbpt with -n2 -j1 -t0 still failed with exactly the same result as before. I will try a newer version of openmpi to see whether that helps.

OK, I have to look a bit further then, because at a quick look I don't see what could be wrong in your compilation. Sorry for the openmpi version, you have a version that is OK; I went too fast in reading your post and read 1.1 instead of 1.10...

thanusit wrote: 2. For the crashed screening runs, the problem went away when I set nband=56 for n-proc=7. The idea was simply to make n-proc divide nband. I'm not sure whether this is the proper solution to the problem.

OK, good. It is always better to adapt the number of bands to the number of CPUs/k-points.

Post by thanusit » Thu Mar 22, 2018 2:43 am

Dear Eric and all
Eric wrote :
Sorry for the openmpi version, you have a version that is OK...

Thank you for the update. I did try building abinit-8.6.3 with openmpi-2.1.3 anyway, and still got the same failed test results for v67mbpt.

For the screening calculations, the _SCR files for each q-point were obtained and successfully merged with the "mrgscr" utility. The merged _SCR file is now being used as input for a sigma calculation, which is running at the moment. I hope these are good signs that things are working. I will wait to see whether the sigma run completes successfully.

Kind regards,
Thanusit
