Improve efficiency in bulk water calculation

Daniel_M · Post by **Daniel_M** » Fri Dec 01, 2017 1:00 pm

Dear all,

I am running some tests on a bulk water system (128 H2O molecules, cubic cell with a = 15.66 angstrom, structure obtained from classical MD at 300 K). I am only interested in single-point calculations of the energy and forces at the PBE level, at the Gamma point only, so as far as I understand the only way to improve the efficiency is to tune the parallelization over bands and FFT. I am running the tests on a Tier-0 supercomputer, but my calculations seem very slow compared to the same calculation with other codes (e.g. CP2K, Siesta...) on the same machine, so I suspect something is quite wrong with my setup or compilation.
First, the output of abinit -b is this:

 DATA TYPE INFORMATION: 
 REAL:      Data type name: REAL(DP) 
            Kind value:      8
            Precision:      15
            Smallest nonnegligible quantity relative to 1:  0.22204460E-15
            Smallest positive number:                       0.22250739-307
            Largest representable number:                   0.17976931+309
 INTEGER:   Data type name: INTEGER(default) 
            Kind value: 4
            Bit size:   32
            Largest representable number: 2147483647
 LOGICAL:   Data type name: LOGICAL 
            Kind value: 4
 CHARACTER: Data type name: CHARACTER             Kind value: 1
  ==== Using MPI-2 specifications ==== 
  MPI-IO support is ON
  xmpi_tag_ub ................   2147483647
  xmpi_bsize_ch ..............            1
  xmpi_bsize_int .............            4
  xmpi_bsize_sp ..............            4
  xmpi_bsize_dp ..............            8
  xmpi_bsize_spc .............            8
  xmpi_bsize_dpc .............           16
  xmpio_bsize_frm ............            4
  xmpi_address_kind ..........            8
  xmpi_offset_kind ...........            8
  MPI_WTICK ..................   1.000000000000000E-006

 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 CPP options activated during the build:

                  CC_INTEL                 CXX_INTEL                  FC_INTEL
 
 HAVE_FC_ALLOCATABLE_DT...             HAVE_FC_ASYNC  HAVE_FC_COMMAND_ARGUMENT
 
      HAVE_FC_COMMAND_LINE        HAVE_FC_CONTIGUOUS           HAVE_FC_CPUTIME
 
             HAVE_FC_ETIME              HAVE_FC_EXIT             HAVE_FC_FLUSH
 
             HAVE_FC_GAMMA            HAVE_FC_GETENV            HAVE_FC_GETPID
 
   HAVE_FC_IEEE_EXCEPTIONS             HAVE_FC_IOMSG     HAVE_FC_ISO_C_BINDING
 
  HAVE_FC_ISO_FORTRAN_2008        HAVE_FC_LONG_LINES        HAVE_FC_MOVE_ALLOC
 
           HAVE_FC_PRIVATE         HAVE_FC_PROTECTED         HAVE_FC_STREAM_IO
 
            HAVE_FC_SYSTEM                  HAVE_FFT        HAVE_FFT_FFTW3_MKL
 
              HAVE_FFT_MPI           HAVE_FFT_SERIAL        HAVE_LIBPAW_ABINIT
 
      HAVE_LIBTETRA_ABINIT               HAVE_LINALG         HAVE_LINALG_AXPBY
 
        HAVE_LINALG_GEMM3M  HAVE_LINALG_MKL_IMATCOPY   HAVE_LINALG_MKL_OMATADD
 
  HAVE_LINALG_MKL_OMATCOPY   HAVE_LINALG_MKL_THREADS        HAVE_LINALG_SERIAL
 
                  HAVE_MPI                 HAVE_MPI2       HAVE_MPI_IALLREDUCE
 
        HAVE_MPI_IALLTOALL       HAVE_MPI_IALLTOALLV        HAVE_MPI_INTEGER16
 
               HAVE_MPI_IO HAVE_MPI_TYPE_CREATE_S...             HAVE_OS_LINUX
 
         HAVE_TIMER_ABINIT              USE_MACROAVE                            
 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

 
 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 
 === Build Information === 
  Version       : 8.6.1
  Build target  : x86_64_linux_intel17.0
  Build date    : 20171130
 
 === Compiler Suite === 
  C compiler       : intel17.0
  C++ compiler     : intel17.0
  Fortran compiler : intel17.0
  CFLAGS           : -mkl
  CXXFLAGS         :  -g -O2 -vec-report0
  FCFLAGS          : -mkl
  FC_LDFLAGS       : 
 
 === Optimizations === 
  Debug level        : basic
  Optimization level : standard
  Architecture       : intel_xeon
 
 === Multicore === 
  Parallel build : yes
  Parallel I/O   : yes
  openMP support : no
  GPU support    : no
 
 === Connectors / Fallbacks === 
  Connectors on : yes
  Fallbacks on  : yes
  DFT flavor    : none
  FFT flavor    : fftw3-mkl
  LINALG flavor : mkl
  MATH flavor   : none
  TIMER flavor  : abinit
  TRIO flavor   : none
 
 === Experimental features === 
  Bindings            : @enable_bindings@
  Exports             : no
  GW double-precision : no
 
 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 Default optimizations:
   --- None ---
 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

And the input (irrelevant parts abbreviated) is:

natom 384
ntypat 2
znucl 8 1
typat 1 2 2 ....
ionmov 0

nstep 100
toldfe 1.0d-5

ecut 150 Ry

kptopt 0
nkpt 1
kpt 0 0 0
nband 528
paral_kgb 1
npband 48
npfft 6
fftalg 401

chksymbreak 0
acell 3*15.6626780696 angstrom
nsym 1
symrel 1 0 0 0 1 0 0 0 1
xangst  ...

Now, some observations:

- When I try using "fftalg 312", the calculation stops with "FFTW3 support not activated". This should be fixed by including the flags HAVE_FFT_FFTW3 and/or HAVE_FFT_FFTW3_THREADS when compiling, right?

- I tried "autoparal 3", but it seems better to set npband and npfft manually. It also seems obvious that they must satisfy npband*npfft = total number of cores, at least in this calculation where no k-points or spin are involved. Am I right, or am I missing something obvious?

- You may think that ecut is too high. This is because I am testing the convergence of the forces with respect to the cutoff, which is not yet reached at 150 Ry. From what I know of some other codes, the convergence can be improved by applying some smoothing to the density for the xc calculation (e.g. see the Appendix in Jonchiere et al., JCP 135, 154503 (2011)). Is something similar implemented in ABINIT?
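
On the npband*npfft question above, here is a minimal Python sketch that lists the candidate splits for the input shown (nband 528, npband 48 * npfft 6 = 288 cores). It assumes that only band and FFT parallelization are in play and that nband should be a multiple of npband; the function name `kgb_splits` is hypothetical and is not part of ABINIT. It says nothing about which split is actually fastest, which still has to be timed on the machine.

```python
def kgb_splits(ncores, nband):
    """List (npband, npfft) pairs with npband * npfft == ncores.

    Assumption: nband must be a multiple of npband, so only those
    divisors of ncores that also divide nband are kept.
    """
    splits = []
    for npband in range(1, ncores + 1):
        if ncores % npband == 0 and nband % npband == 0:
            splits.append((npband, ncores // npband))
    return splits


# Values taken from the input above: 48 * 6 = 288 cores, nband 528.
for npband, npfft in kgb_splits(288, 528):
    print(f"npband {npband:3d}   npfft {npfft:3d}   bands per band-process {528 // npband}")
```

The pair (48, 6) used in the input is one of the splits this enumerates; the others could serve as a starting list for timing tests.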

I will be very thankful for any comment that may help me improve this calculation. Thanks a lot,
D.

Annelinde · Post by **Annelinde** » Thu Jul 26, 2018 11:12 am

Hi,

I don't have a solution, but I have also noticed poor parallel scaling for nanoribbons (many electronic bands, few k-points) on Tier-1 and Tier-2 machines. I concluded that k-point parallelisation must be the most efficient one in ABINIT, and that when you can't make use of it, your calculation is terribly slow. I don't know why this is.

with kind regards,

Annelinde
