Problem with cuda
Posted: Thu Jul 10, 2014 8:14 am
by sheng
Hi, I am configuring Abinit 7.6.4 using gcc 4.9.0 with cuda 6.0. Below is the configuration I use:
Code: Select all
enable_64bit_flags="yes"
enable_mpi="yes"
enable_mpi_io="yes"
enable_openmp="yes"
with_mpi_prefix="/usr/estools/openmpi/1.8.1/gcc4.9.0"
with_netcdf_incs="-I/usr/estools/netcdf/4.3.2/include"
with_netcdf_libs="-L/usr/estools/netcdf/4.3.2/lib -lnetcdf -lnetcdff"
with_etsf_io_incs="-I/usr/estools/etsf_io/1.0.4/include/gcc"
with_etsf_io_libs="-L/usr/estools/etsf_io/1.0.4/lib -letsf_io_low_level -letsf_io_utils -letsf_io"
with_fft_flavor="fftw3-mpi"
with_fft_incs="-I/usr/estools/fftw/3.3.4/include"
with_fft_libs="-L/usr/estools/fftw/3.3.4/lib -lfftw3 -lfftw3f -lfftw3_mpi -lfftw3f_mpi"
with_linalg_flavor="atlas+magma+scalapack"
with_linalg_incs="-I/usr/estools/ScaLAPACK/include -I/usr/estools/magma/1.5.0-beta2/include -I/usr/estools/ATLAS/3.10.1/include"
with_linalg_libs="-L/usr/estools/ScaLAPACK/lib -lscalapack -L/usr/estools/magma/1.5.0-beta2/lib -lmagma -L/usr/estools/ATLAS/3.10.1/lib -llapack -lf77blas -lcblas -latlas"
with_algo_incs="-I/usr/estools/levmar/2.6/include"
with_algo_libs="-L/usr/estools/levmar/2.6/lib -llevmar -L/usr/estools/ATLAS/3.10.1/lib -llapack -lf77blas -lcblas -latlas"
with_math_incs="-I/usr/estools/gsl/1.16/include"
with_math_libs="-L/usr/estools/gsl/1.16/lib -lgsl -lgslcblas"
with_atompaw_bins="/usr/estools/atompaw/4.0.0.8/bin"
with_atompaw_incs="-I/usr/estools/atompaw/4.0.0.8/include"
with_atompaw_libs="-L/usr/estools/atompaw/4.0.0.8/lib -latompaw"
with_bigdft_incs="-I/usr/estools/bigdft/1.7.1/include"
with_bigdft_libs="-L/usr/estools/libarchive/3.1.2/lib -larchive -L/usr/estools/bigdft/1.7.1/lib -lyaml -ls_gpu -lbigdft-1 -labinit -L/usr/estools/etsf_io/1.0.4/lib -letsf_io_low_level -letsf_io_utils -letsf_io -L/usr/estools/netcdf/4.3.2/lib -lnetcdf -lnetcdff -L/usr/local/cuda-6.0/lib64 -lcublas -lcufft -lcudart"
with_libxc_incs="-I/usr/estools/libxc/2.0.2/include"
with_libxc_libs="-L/usr/estools/libxc/2.0.2/lib -lxc"
with_wannier90_bins="/usr/estools/wannier90/2.0.0/bin"
with_wannier90_incs="-I/usr/estools/wannier90/2.0.0/include"
with_wannier90_libs="-L/usr/estools/wannier90/2.0.0/lib -lwannier"
with_dft_flavor="atompaw+bigdft+libxc+wannier90"
with_trio_flavor="netcdf+etsf_io"
with_algo_flavor="levmar"
with_math_flavor="gsl"
enable_gw_dpc="yes"
enable_gpu="yes"
with_gpu_flavor="cuda-double"
with_gpu_incs="-I/usr/local/cuda-6.0/include"
with_gpu_libs="-L/usr/local/cuda-6.0/lib64 -lcublas -lcufft -lcudart"
The configure step completes successfully. However, I encounter errors in abinit-7.6.4/src/15_gpu_toolbox/ during the make process:
Code: Select all
../../../abinit-7.6.4/src/15_gpu_toolbox/dev_spec.cu(19): warning: function "prt_dev_info" was declared but never referenced
../../../abinit-7.6.4/src/incs/cuda_header.h(58): error: identifier "sprintf" is undefined
1 error detected in the compilation of "/tmp/tmpxft_00000751_00000000-6_gpu_linalg.cpp1.ii".
../../../abinit-7.6.4/src/incs/cuda_header.h(58): error: identifier "sprintf" is undefined
../../../abinit-7.6.4/src/15_gpu_toolbox/timing_cuda.cu(42): error: identifier "printf" is undefined
../../../abinit-7.6.4/src/15_gpu_toolbox/timing_cuda.cu(57): error: identifier "printf" is undefined
3 errors detected in the compilation of "/tmp/tmpxft_00000758_00000000-6_timing_cuda.cpp1.ii".
make[4]: *** [gpu_linalg.o] Error 2
make[4]: *** Waiting for unfinished jobs....
make[4]: *** [timing_cuda.o] Error 2
../../../abinit-7.6.4/src/15_gpu_toolbox/dev_spec.cu(19): warning: function "prt_dev_info" was declared but never referenced
make[4]: Leaving directory `/home/sheng/Desktop/program/Abinit/abinit-7.6.4-build/src/15_gpu_toolbox'
make[3]: *** [all-recursive] Error 1
make[3]: Leaving directory `/home/sheng/Desktop/program/Abinit/abinit-7.6.4-build/src'
make[2]: *** [all-recursive] Error 1
make[2]: Leaving directory `/home/sheng/Desktop/program/Abinit/abinit-7.6.4-build'
make[1]: *** [all] Error 2
make[1]: Leaving directory `/home/sheng/Desktop/program/Abinit/abinit-7.6.4-build'
make: *** [multi] Error 2
Did I do anything wrong?
Re: Problem with cuda
Posted: Sat Jul 12, 2014 11:32 am
by sheng
I have solved the problem above simply by adding the header include
#include <stdio.h>
to abinit-7.6.4/src/incs/cuda_header.h and abinit-7.6.4/src/15_gpu_toolbox/timing_cuda.cu.
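For reference, the change is a single line near the top of each of those files (a sketch from memory; the surrounding lines may differ in the actual source):
Code: Select all
/* abinit-7.6.4/src/incs/cuda_header.h and src/15_gpu_toolbox/timing_cuda.cu */
#include <stdio.h>   /* declares printf/sprintf used by the CUDA helper code */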
However new errors appear:
Code: Select all
../../../abinit-7.6.4/src/43_wvl_wrappers/wvl_descr_atoms_set.F90:79.20:
wvl%atoms%geocode = 'P'
1
Error: 'geocode' at (1) is not a member of the 'atoms_data' structure
../../../abinit-7.6.4/src/43_wvl_wrappers/wvl_descr_atoms_set.F90:81.20:
wvl%atoms%geocode = 'F'
1
Error: 'geocode' at (1) is not a member of the 'atoms_data' structure
../../../abinit-7.6.4/src/43_wvl_wrappers/wvl_descr_atoms_set.F90:83.20:
wvl%atoms%geocode = 'S'
1
Error: 'geocode' at (1) is not a member of the 'atoms_data' structure
../../../abinit-7.6.4/src/43_wvl_wrappers/wvl_descr_atoms_set.F90:85.22:
write(wvl%atoms%units, "(A)") "Bohr"
1
Error: Syntax error in WRITE statement at (1)
../../../abinit-7.6.4/src/43_wvl_wrappers/wvl_descr_atoms_set.F90:91.28:
write(wvl%atoms%atomnames(itype), "(A,I2)") "At. type", itype
1
Error: Syntax error in WRITE statement at (1)
../../../abinit-7.6.4/src/43_wvl_wrappers/wvl_descr_atoms_set.F90:93.16:
wvl%atoms%alat1 = acell(1)
1
Error: 'alat1' at (1) is not a member of the 'atoms_data' structure
../../../abinit-7.6.4/src/43_wvl_wrappers/wvl_descr_atoms_set.F90:94.16:
wvl%atoms%alat2 = acell(2)
1
Error: 'alat2' at (1) is not a member of the 'atoms_data' structure
../../../abinit-7.6.4/src/43_wvl_wrappers/wvl_descr_atoms_set.F90:95.16:
wvl%atoms%alat3 = acell(3)
1
Error: 'alat3' at (1) is not a member of the 'atoms_data' structure
../../../abinit-7.6.4/src/43_wvl_wrappers/wvl_descr_atoms_set.F90:96.17:
wvl%atoms%iatype = typat
1
Error: 'iatype' at (1) is not a member of the 'atoms_data' structure
../../../abinit-7.6.4/src/43_wvl_wrappers/wvl_descr_atoms_set.F90:110.14:
wvl%atoms%sym%symObj = 0
1
Error: 'sym' at (1) is not a member of the 'atoms_data' structure
../../../abinit-7.6.4/src/43_wvl_wrappers/wvl_descr_atoms_set.F90:111.22:
nullify(wvl%atoms%sym%irrzon)
1
Error: 'sym' at (1) is not a member of the 'atoms_data' structure
../../../abinit-7.6.4/src/43_wvl_wrappers/wvl_descr_atoms_set.F90:112.22:
nullify(wvl%atoms%sym%phnons)
1
Error: 'sym' at (1) is not a member of the 'atoms_data' structure
../../../abinit-7.6.4/src/43_wvl_wrappers/nullify_wvl_data.F90:54.23:
& nullify_diis_objects, nullify_wfn_metadata, nullify_p2pcomms,&
1
Error: Symbol 'nullify_wfn_metadata' referenced at (1) not found in module 'bigdft_api'
../../../abinit-7.6.4/src/43_wvl_wrappers/nullify_wvl_data.F90:108.43:
call nullify_wfn_metadata(wvl%wfs%ks%wfnmd)
1
Error: 'wfnmd' at (1) is not a member of the 'dft_wavefunction' structure
../../../abinit-7.6.4/src/43_wvl_wrappers/nullify_wvl_data.F90:110.39:
call nullify_p2pcomms(wvl%wfs%ks%comon)
1
Error: 'comon' at (1) is not a member of the 'dft_wavefunction' structure
../../../abinit-7.6.4/src/43_wvl_wrappers/nullify_wvl_data.F90:116.39:
call nullify_p2pcomms(wvl%wfs%ks%comrp)
1
Error: 'comrp' at (1) is not a member of the 'dft_wavefunction' structure
../../../abinit-7.6.4/src/43_wvl_wrappers/nullify_wvl_data.F90:118.39:
call nullify_p2pcomms(wvl%wfs%ks%comsr)
1
Error: 'comsr' at (1) is not a member of the 'dft_wavefunction' structure
make[4]: *** [nullify_wvl_data.o] Error 1
make[4]: *** Waiting for unfinished jobs....
make[4]: *** [wvl_descr_atoms_set.o] Error 1
../../../abinit-7.6.4/src/43_wvl_wrappers/wvl_denspot_set.F90:115.66:
call denspot_communications(me,nproc,ixc,nsppol,wvl%atoms%geocode,"NONE",den%d
1
Error: 'geocode' at (1) is not a member of the 'atoms_data' structure
../../../abinit-7.6.4/src/43_wvl_wrappers/wvl_denspot_set.F90:127.27:
den%symObj = wvl%atoms%sym%symObj
1
Error: 'sym' at (1) is not a member of the 'atoms_data' structure
../../../abinit-7.6.4/src/43_wvl_wrappers/wvl_descr_atoms_set_sym.F90:98.14:
wvl%atoms%sym%symObj = -1
1
Error: 'sym' at (1) is not a member of the 'atoms_data' structure
../../../abinit-7.6.4/src/43_wvl_wrappers/wvl_descr_atoms_set_sym.F90:99.22:
nullify(wvl%atoms%sym%irrzon)
1
Error: 'sym' at (1) is not a member of the 'atoms_data' structure
../../../abinit-7.6.4/src/43_wvl_wrappers/wvl_descr_atoms_set_sym.F90:100.22:
nullify(wvl%atoms%sym%phnons)
1
Error: 'sym' at (1) is not a member of the 'atoms_data' structure
../../../abinit-7.6.4/src/43_wvl_wrappers/wvl_descr_atoms_set_sym.F90:103.40:
call symmetry_set_n_sym(wvl%atoms%sym%symObj, nsym, symrel, tnons, symafm, e
1
Error: 'sym' at (1) is not a member of the 'atoms_data' structure
../../../abinit-7.6.4/src/43_wvl_wrappers/wvl_descr_atoms_set_sym.F90:105.14:
wvl%atoms%sym%irrzon => irrzon
1
Error: 'sym' at (1) is not a member of the 'atoms_data' structure
../../../abinit-7.6.4/src/43_wvl_wrappers/wvl_descr_atoms_set_sym.F90:106.14:
wvl%atoms%sym%phnons => phnons
1
Error: 'sym' at (1) is not a member of the 'atoms_data' structure
make[4]: *** [wvl_denspot_set.o] Error 1
make[4]: *** [wvl_descr_atoms_set_sym.o] Error 1
../../../abinit-7.6.4/src/43_wvl_wrappers/wvl_denspot_free.F90:42.5:
use defs_wvltypes
1
Warning: Although not referenced, 'generic interface 'memocc'' has ambiguous interfaces at (1)
../../../abinit-7.6.4/src/43_wvl_wrappers/wvl_denspot_free.F90:77.9:
nullify(den%denspot%pkernel)
1
Error: Non-POINTER in pointer association context (pointer assignment) at (1)
../../../abinit-7.6.4/src/43_wvl_wrappers/wvl_denspot_free.F90:78.9:
nullify(den%denspot%pkernelseq)
1
Error: Non-POINTER in pointer association context (pointer assignment) at (1)
make[4]: *** [wvl_denspot_free.o] Error 1
make[4]: Leaving directory `/home/sheng/Desktop/program/Abinit/abinit-7.6.4-build/src/43_wvl_wrappers'
make[3]: *** [all-recursive] Error 1
make[3]: Leaving directory `/home/sheng/Desktop/program/Abinit/abinit-7.6.4-build/src'
make[2]: *** [all-recursive] Error 1
make[2]: Leaving directory `/home/sheng/Desktop/program/Abinit/abinit-7.6.4-build'
make[1]: *** [all] Error 2
make[1]: Leaving directory `/home/sheng/Desktop/program/Abinit/abinit-7.6.4-build'
make: *** [multi] Error 2
Any help is appreciated.
Re: Problem with cuda
Posted: Mon Jul 14, 2014 10:20 am
by sheng
The above errors seem to be avoided in the latest Abinit 7.8.1, only to end with another error:
Code: Select all
../../../abinit-7.8.1/src/98_main/mrgscr.F90:2386.51:
!$OMP PRIVATE(ig1,ig2,refval,imfval,phase)
1
Error: Object 'one' is not a variable at (1)
Update: the error seems to be associated with levmar. I can finish building Abinit by dropping levmar support.
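Concretely, dropping levmar just means leaving the levmar entries out of the configuration file posted above (a sketch of the relevant lines only):
Code: Select all
# levmar support removed:
#   with_algo_flavor="levmar"
#   with_algo_incs="-I/usr/estools/levmar/2.6/include"
#   with_algo_libs="-L/usr/estools/levmar/2.6/lib -llevmar -L/usr/estools/ATLAS/3.10.1/lib -llapack -lf77blas -lcblas -latlas"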
Without levmar support, I have built Abinit and tried to run the test ./runtests.py -t0 -j 4 fast.
As usual, some errors come out again. All test sections fail, and I list the error for one of them below:
Code: Select all
[fast][t16][np=1]: fldiff.pl fatal error:
The diff analysis cannot be done: the number of lines to be analysed differ.
File /home/sheng/Desktop/program/Abinit/abinit-7.8.1/tests/fast/Refs/t16.out: 103 lines, 18 ignored
File /home/sheng/Desktop/program/Abinit/abinit-7.8.1-build/tests/Test_suite/fast_t03-t05-t06-t07-t08-t09-t11-t12-t14-t16/t16.out: 105 lines, 21 ignored
#0 0x7FC144227387
#1 0x1216CFD in __m_errors_MOD_msg_hndl
#2 0xC4E901 in chkinp_
#3 0x411A3B in MAIN__ at abinit.F90:416
chkinp : ERROR -
When GPU is in use (use_gpu_cuda=1), ngfft(4:6) must be equal to ngfft(1:3) !
Action: suppress ngfft in input file or change it.
--- !ERROR
message: |
Checking consistency of input data against itself gave 1 inconsistencies.
src_file: chkinp.F90
src_line: 3003
...
Any help is greatly appreciated.
Re: Problem with cuda
Posted: Mon Jul 14, 2014 10:18 pm
by jbeuken
Hi,
Code: Select all
chkinp : ERROR -
When GPU is in use (use_gpu_cuda=1), ngfft(4:6) must be equal to ngfft(1:3) !
Action: suppress ngfft in input file or change it.
Have you tried anything along the lines of the suggested action?
Another general remark: you use a lot of options that are not officially tested on our test farm,
for example:
- levmar (not tested nightly)
- magma 1.5 beta (we test cuda with magma 1.2.1)
- GPU + scalapack + OMP (we don't use GPU and OMP together)
- the test suite does not cover the scalapack part of the code, so scalapack itself is not tested; only the compilation is checked
- cuda 6 (we use CUDA 4.2 and CUDA 5)
- gcc 4.9.0 (ok, it seems mature, but it's a "zero" version)
- there are only 4 GPU tests run in the testsuite on our test farm (for example, the "fast" tests are not tested on the CUDA bot because the references are different…)
(see: http://buildbot.abinit.org/builders/bud ... gs/summary )
I can understand that you are trying to get a "fast" ABINIT version for production, but it's a very complicated configuration…
regards
jmb
Re: Problem with cuda
Posted: Tue Jul 15, 2014 11:00 pm
by roginovicci
Is it possible to talk about a fast configuration when the code is compiled with GNU C? I've found that the PGI or Intel compilers produce a performance boost of up to 70%. That was a couple of years ago, though.
I'm not sure the topic starter uses a Linux platform, but based on my experience the best Linux distros for abinit are RedHat/CentOS or Debian. I've found some problems using abinit with Arch Linux (the same problems could occur in Fedora or Ubuntu) because these are, in fact, testing distros. Anyway, here is the part of my script that works quite well with cuda.
Code: Select all
export NVCC="$CUDA/bin/nvcc"
export FC=$MPICH2/bin/mpiifort
export CC=$MPICH2/bin/mpiicc
export CXX=$MPICH2/bin/mpiicpc
./configure --prefix=/opt/abinit --enable-mpi --enable-mpi-io \
--with-linalg-incs="-I$INTEL_COMP/mkl/include" \
--with-fft-flavor="fftw3-mkl" --with-fft-incs="-I$INTEL_COMP/mkl/include/fftw" \
--with-fft-libs="-L$INTEL_COMP/mkl/lib/intel64 -lfftw3xf_intel" \
--with-linalg-flavor="mkl+magma" \
--with-linalg-libs="-L$INTEL_COMP/mkl/lib/intel64 -L/opt/magma/lib -lmagma -Wl,--start-group -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -Wl,--start-group -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -Wl,--end-group -liomp5 -lpthread -limf -lsvml -lirc -lmkl_sequential -lmkl_blacs_intelmpi_lp64 -lmkl_scalapack_lp64 " \
--with-linalg-incs="-I/opt/magma/include" \
--with-dft-flavor="atompaw+bigdft+libxc+wannier90" \
--enable-gpu --with-gpu-flavor=cuda-double --with-gpu-libs="-L$CUDA/lib64 -lcublas -lcufft -lcudart" \
--with-gpu-incs="-I$CUDA/include"
Can't help with levmar and OMP though, sorry.
Re: Problem with cuda
Posted: Wed Jul 16, 2014 5:50 pm
by Jordan
Hi,
I have exactly one of the problems above.
I compiled abinit-7.8.2 with cuda 4.2 and magma 1.2.1 (intel14 + MKL11 + openmpi-1.5.4). When I run the gpu tests (runtests.py gpu), test gpu_t01 succeeds but the others fail with the fft problem mentioned above (dimensions 1:3 disagree with 4:6).
ngfft is not set in the input file, so abinit should be able to manage it correctly (I never use it anyway since abinit does it perfectly for me).
I've sent a mail to Marc and I'm waiting for his reply.
I would like to debug the code, but I don't have any GUI debugger on the cluster, so it will take time until I find the bug and try to correct it.
Jordan
Re: Problem with cuda
Posted: Thu Jul 17, 2014 4:48 pm
by sheng
Thanks for all the great replies; I am indeed using Linux. Unfortunately I am out of town now. Further attempts will be made without levmar and openmp, and I will report the results as soon as possible.
If the fft problem appears again, any idea on how to enforce ngfft(4:6) to be equal to ngfft(1:3)? I am under the impression that the parameter ngfft is an array of only three numbers.
Thank you.
Re: Problem with cuda
Posted: Fri Jul 18, 2014 4:41 pm
by Jordan
I may have found what causes the gpu tests/runs to fail.
When generating ngfft, abinit checks two things at the end: if we use cuda, then ngfft(4:6)=ngfft(1:3); and then, if we use FFTW3 (or DFTI), it updates ngfft(4:5) to ngfft(1:2)+1, but only if ngfft(1:2) is even.
If you look at test gpu_t02, ngfft(1:2)=[12,12], so at the end of the routine ngfft(4:5)=[13,13].
I am not an expert in FFT, so what I would suggest is to force fftalg to 112 instead of 312 in the input file, to avoid that last if condition.
I tried adding fftalg 112 to gpu_t02/t02.in and it works.
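In other words, the workaround is a single extra line in the test input (a sketch; everything else in gpu_t02/t02.in stays unchanged):
Code: Select all
# workaround for the ngfft(4:6)/ngfft(1:3) check when the GPU is enabled
fftalg 112     # in-house FFT instead of FFTW3 (312), so ngfft(4:5) is not padded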
I'll try to find the answer regarding the use of FFTW3 with cuda....
Cheers,
Jordan
Re: Problem with cuda
Posted: Sat Jul 19, 2014 4:54 pm
by sheng
I have recompiled abinit 7.8.1 without openmp and levmar, and have downgraded magma and cuda to versions 1.2.1 and 5.0 respectively.
Following Jordan's advice, I modified the parameter fftalg manually in the input files of the gpu tests, according to the values shown in the reference output files. Here is the output:
Code: Select all
[sheng@theory3 tests]$ ../../abinit-7.8.1/tests/runtests.py -t0 gpu
FortranCompiler: gfortran None
../../abinit-7.8.1/tests/runtests.py:279: UserWarning: Cannot find timeout executable at: /usr/bin/timeout
warn("Cannot find timeout executable at: %s" % build_env.path_of_bin("timeout"))
Test_suite directory already exists! Old files will be removed
Running ntests = 4, MPI_nprocs = 1, py_nthreads = 1...
[gpu][t01][np=1]: failed: relative error 2.512e-08 > 9e-09
[gpu][t02][np=1]: succeeded
[gpu][t03][np=1]: failed: relative error 0.7398 > 0.08
[gpu][t04][np=1]: failed: erroneous lines 177 > 0
Test suite completed in 56.63 s (average time for test = 13.88 s, stdev = 13.23 s)
failed: 3, succeeded: 1, passed: 0, skipped: 0, disabled: 0
Suite failed passed succeeded skipped disabled run_etime tot_etime
gpu 3 0 1 0 0 55.51 56.05
Test suite results in HTML format are available in Test_suite/suite_report.html
The errors are larger than the tests' tolerances, but at least the tests can be run, though I don't know about the error in the fourth test.
By the way, through some experimenting I found that fftw cannot be used together with Cuda (when fftalg=312). For any calculation using Cuda I have to change fftalg to either 112 or 401. What puzzles me is that fftalg should be of no importance when cuda is enabled, according to the documentation of Abinit.
The question now is (I cannot find it in the tutorial):
What is the command I should use when running the cuda-enabled abinit? Is it just the normal serial abinit command, with abinit automatically distributing the workload among the gpu cores, or should I use mpirun -np N, where N is the number of allocated gpu cores?
And how can I confirm that Abinit is indeed using Cuda, apart from examining the parameter use_gpu_cuda in the output file?
Thank you.
Re: Problem with cuda
Posted: Sun Jul 20, 2014 11:21 pm
by jbeuken
Hi,
What is the command I should use when I am using the cuda-enabled abinit?
In our test farm, the bot has 4 Tesla C1060 GPUs and 2 quad-core Xeons;
we start the GPU tests with "mpirun -np=1 ..." (and I think that we use only one card).
And how can I confirm that Abinit is indeed using Cuda apart from examining the parameter use_gpu_cuda in the output file?
In the stdout, we find the following (it's not proof that the GPU is actually being used, but it's a good start):
Code: Select all
setdevice_cuda : COMMENT -
GPU 0 has been properly initialized, continuing...
________________________________________________________________________________
________________________ Graphic Card Properties _______________________________
Device 0 : Tesla C1060
Revision number: 1.3
Total amount of global memory: 4095.8 Mbytes
Clock rate: 1.3 GHz
Max GFLOP: 311 GFP
Total constant memory: 65536 bytes
Shared memory per block: 16384 bytes
Number of registers per block: 16384
________________________________________________________________________________
my 5¢
Re: Problem with cuda
Posted: Mon Jul 21, 2014 5:17 am
by sheng
Thanks jbeuken for your quick reply. It would be great if anyone could enlighten me on the following:
1. Since the command given is 'mpirun -np 1 ...' and only one card is used, can I assume that the variable N in 'mpirun -np N ...' refers to the number of gpu cards instead of the number of cores?
2. Actually I am now reading the GSPW tutorial, where parameters such as paral_kgb, autoparal, max_ncpus, npband, npfft and npkpt are used. Do these parameters still apply when the gpu is enabled? In other words, does the KGB parallelization scheme work on the gpu, and do we have to define the number of processors (or gpu cores) needed at each level of the KGB parallelization as described in the tutorial? Or is it handled automatically by abinit on the gpu?
3. I am doing a gpu calculation with a Quadro K600, which has a stated maximum of over 300 GFLOPS. However, the graphic card section in the log file shows that the card is operating at only 7 GFLOPS. The time I need for the tutorial tgspw_02.files is about 2500 secs, which is even longer than the serial calculation, which takes only about 1800 secs. I have tried on 3 workstations equipped with the same card model, with a similar amount of time consumed. What could possibly have gone wrong?
Thanks all, this sure is a responsive forum.
Re: Problem with cuda
Posted: Mon Jul 21, 2014 9:35 pm
by Jordan
sheng wrote:1. Since the command given is 'mpirun -np 1 ...' and only one card is used, can I assume that the variable N in 'mpirun -np N ...' refers to the number of gpu cards instead of number of cores?
I am doing some tests with the GPU right now. What is sure is that "mpirun -n X" means X MPI processes, not X GPUs (-n is equivalent to -np).
Each MPI process will try to allocate a GPU. If it succeeds, it will use it; otherwise it just uses the CPU version. Therefore you should not use more MPI processes than the number of GPUs you have (this is what Marc Torrent recommended to me, and I confirm it).
sheng wrote:2. Actually I am now reading the GSPW tutorial, where parameters such as paral_kgb, autoparal, max_ncpus, npband, npfft and npkpt are used. Do these parameters still apply when the gpu is enabled? In other words, does the KGB parallelization scheme work on the gpu, and do we have to define the number of processors (or gpu cores) needed at each level of the KGB parallelization as described in the tutorial? Or is it handled automatically by abinit on the gpu?
Even if I use the GPU, I am still using the KGB parallelization. From what I understand, the parallelization is still valid, and instead of doing the calculation for a given k-point/band group/fft group on the CPU, everything (as much as possible) is offloaded onto the GPU. Note that for good scaling you need quite a big system.
sheng wrote:3. I am doing a gpu calculation with a Quadro K600, which has a stated maximum of over 300 GFLOPS. However, the graphic card section in the log file shows that the card is operating at only 7 GFLOPS. The time I need for the tutorial tgspw_02.files is about 2500 secs, which is even longer than the serial calculation, which takes only about 1800 secs. I have tried on 3 workstations equipped with the same card model, with a similar amount of time consumed. What could possibly have gone wrong?
I'll do the test and let you know. Maybe the frequency of your GPU is reduced if it has nothing to do.
EDIT: I did the test. Abinit reports
Code: Select all
________________________________________________________________________________
________________________ Graphic Card Properties _______________________________
Device 0 : Tesla M2090
Revision number: 2.0
Total amount of global memory: 5375.4 Mbytes
Clock rate: 1.3 GHz
Max GFLOP: 166 GFP
Total constant memory: 65536 bytes
Shared memory per block: 49152 bytes
Number of registers per block: 32768
________________________________________________________________________________
And the calculation on 1GPU is
Code: Select all
overall_cpu_time: 786.0
overall_wall_time: 806.9
Note that Nvidia quotes more than 1 TFLOPS for this card, while Abinit reports only 166 GFLOPS. I don't know how this value is obtained, though.
Re: Problem with cuda
Posted: Mon Jul 21, 2014 9:43 pm
by Jordan
sheng wrote:I have recompiled abinit 7.8.1 without openmp and levmar, and have downgraded magma and cuda to versions 1.2.1 and 5.0 respectively.
The errors are larger than the tests' tolerances, but at least the tests can be run, though I don't know about the error in the fourth test.
It means the output file has 177 lines that differ from the reference (regardless of the tolerance), and no differing lines are allowed.
We only run the GPU tests on one bot, so we don't know much about the tolerances we should allow. I guess it is very GPU dependent.
sheng wrote:By the way, through some experimenting I found that fftw cannot be used together with Cuda (when fftalg=312). For any calculation using Cuda I have to change fftalg to either 112 or 401. What puzzles me is that fftalg should be of no importance when cuda is enabled, according to the documentation of Abinit.
Yes, you have to do it as I suggested, but it has no impact since the FFT is done by the GPU with cuFFT (as far as I understand). This is a tiny bug and we are working on it. Hopefully it will be corrected in the next release.
The question now is (I cannot find it in the tutorial):
What is the command I should use when running the cuda-enabled abinit? Is it just the normal serial abinit command, with abinit automatically distributing the workload among the gpu cores, or should I use mpirun -np N, where N is the number of allocated gpu cores?
And how can I confirm that Abinit is indeed using Cuda, apart from examining the parameter use_gpu_cuda in the output file?
If you have compiled Abinit with GPU support AND a GPU is available at runtime AND the calculation can be ported onto the GPU, then Abinit automatically uses the GPU.
Use mpirun -n/-np N for N MPI processes; each one will try to allocate a GPU. The best choice is N = the number of GPUs you want to use (each MPI process uses one GPU).
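For example, on a machine with two GPUs a run would look something like this (hypothetical file names):
Code: Select all
# two MPI processes, each one binding one GPU
mpirun -np 2 abinit < mycalc.files > mycalc.log 2>&1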
Jordan
Re: Problem with cuda
Posted: Tue Jul 22, 2014 4:49 am
by sheng
Thanks Jordan for your quick information.
Even if I use the GPU, I am still using the KGB parallelization. From what I understand, the parallelization is still valid and instead of doing the calculation for the given K-point/band group/fft group on the CPU, everything (as much as possible) is deported onto the GPU. Note that for a good scaling you need quite a big system.
Does it mean that, unlike on the cpu, the distribution of the calculation over k-points/band groups/fft is automatically handled on the gpu (meaning I don't have to care about the variables max_ncpus, npband, npfft or npkpt anymore)?
Maybe the frequency of your GPU is reduced if it has nothing to do.
The report from abinit shows:
Code: Select all
________________________________________________________________________________
________________________ Graphic Card Properties _______________________________
Device 0 : Quadro K600
Revision number: 3.0
Total amount of global memory: 1023.3 Mbytes
Clock rate: 0.9 GHz
Max GFLOP: 7 GFP
Total constant memory: 65536 bytes
Shared memory per block: 49152 bytes
Number of registers per block: 65536
Code: Select all
overall_cpu_time: 3335.0
overall_wall_time: 3356.0
The Max GFLOP is unreasonably low, as shown above, and the time taken is over 3300 secs, significantly longer than a serial calculation without the gpu. Since the calculation is on 107 gold atoms, I consider the scaling should be good enough to compensate for the overhead. The gpu timing obtained would defeat the whole purpose of running cuda.
One more thing to ask, though: is the timing of a gpu cuda run comparable to that of the conventional parallel cpu mpi runs of abinit?
Re: Problem with cuda [SOLVED]
Posted: Tue Jul 22, 2014 4:07 pm
by Jordan
sheng wrote:Does it mean that, unlike on the cpu, the distribution of the calculation over k-points/band groups/fft is automatically handled on the gpu (meaning I don't have to care about the variables max_ncpus, npband, npfft or npkpt anymore)?
No, according to my understanding you still have to define those parameters, but instead of using a large number of cpus for npband you may want to reduce it.
For example, if you have 5 GPUs and 10 kpts with 200 bands, what I do is set npkpt to 5 and npband to 1, with bandpp at 2 or 4, whereas for a CPU calculation you would want npkpt 5, npband 10, bandpp 2, npfft 1 for 50 cpus. In this example, the calculation on 50 cpus would be a little more than twice as fast (I hope).
If you have 10 GPUs and 1000 bands, you may consider setting npband to 2 to split the band load. So the parallelization is the same, but with "small numbers". I did not do benchmarks, but I think GPUs should handle larger values of bandpp more easily.
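Written as input variables, the two settings from the example above would look roughly like this (a sketch using the hypothetical counts given above, not a tested input):
Code: Select all
# GPU run: 5 MPI processes, one GPU each
paral_kgb 1
npkpt 5   npband 1   npfft 1   bandpp 4

# CPU run: 50 MPI processes
paral_kgb 1
npkpt 5   npband 10   npfft 1   bandpp 2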
BTW, I never use the max_ncpus variable and always define the np* variables myself.
The report from abinit shows:
Code: Select all
________________________________________________________________________________
________________________ Graphic Card Properties _______________________________
Device 0 : Quadro K600
Revision number: 3.0
Total amount of global memory: 1023.3 Mbytes
Clock rate: 0.9 GHz
Max GFLOP: 7 GFP
Total constant memory: 65536 bytes
Shared memory per block: 49152 bytes
Number of registers per block: 65536
According to Google, your Quadro K600 is a "graphics card", meaning it has outputs for displays. I don't know your configuration, but can you confirm that you use your GPU only for the calculations and not for your screen/desktop? If you only have one GPU for both, then that might be the reason: you should have a dedicated GPU for calculations. Otherwise I have no clue. What driver are you using? The latest one from the Nvidia website?
Since the calculation is on 107 gold atoms, I consider the scaling should be good enough to compensate for the overhead. The gpu timing obtained would defeat the whole purpose of running cuda.
I agree with you. I did not run this case on the CPU to get the GPU speed-up, but after running some other cases with 32 atoms, I still get good timings (6 GPUs are faster than 24 CPUs).
The timing reported by Abinit is the wall time used by the CPUs, so if the GPU does a task in 10 s instead of 100 s on the CPU, you get a wall/cpu time of 10 s, since from beginning to end the CPU has run for 10 s. The timings are thus comparable between GPU and CPU runs.
Re: Problem with cuda
Posted: Tue Jul 22, 2014 9:03 pm
by jbeuken
Hi everybody,
Thank you Jordan for your relevant answers…
And, effectively, "mpirun -np=1" only refers to cores, and
in OUR GPU TESTS FROM OUR TESTSUITE there is only one core with one GPU card,
which is still managed at the magma level!
Then, in production, with correct parameters (paral_kgb, autoparal, max_ncpus, npband, npfft, ...),
you can potentially use all the resources of your "machine"...
I know that there are some teams using ABINIT that manage to use several multicore CPUs with many GPU cards in production, with a good speedup…
BUT it's not so easy…
The good news is that some ABINIT developers are working hard to optimize and facilitate parallelism (core/CPU/node/GPU/MIC/...) in ABINIT: a little patience…
my 50¢
jmb
Re: Problem with cuda
Posted: Wed Jul 23, 2014 5:28 am
by sheng
Thanks all.
My Quadro K600 is indeed used for both display and computing. I have tried to get rid of the display by booting into command-line mode, but I get the same low Max GFLOP. Maybe this is purely related to the card itself; I have submitted a question on the Nvidia forum. Thank you all for your great information and patience with me.
Re: Problem with cuda
Posted: Wed Jul 23, 2014 5:25 pm
by sheng
Sorry to bump this thread again, but I have read that Nvidia has stripped double-precision floating-point performance from all Kepler-based Quadros, concentrating instead on single precision.
In light of that, I changed the configuration to with_gpu_flavor="cuda-single" instead of cuda-double. However, Abinit demands the double-precision version when I try to run it.
Code: Select all
Input variables use_gpu_cuda is on but abinit hasn't been built
with (double precision) gpu mode enabled !
Action : change the input variable use_gpu_cuda
or re-compile ABINIT with double-precision Cuda enabled.
Does that mean that Abinit only works with the double-precision version?
Re: Problem with cuda
Posted: Thu Jul 24, 2014 6:40 pm
by Jordan
sheng wrote:
Does that mean that Abinit only works with the double-precision version?
I don't really know, but in the code, file src/57_iovars/chkinp.F90, there is
Code: Select all
2698 #ifndef HAVE_GPU_CUDA_DP
2699 write(message,'(10a)') ch10,&
2700 & ' invars0: ERROR -',ch10,&
2701 & ' Input variables use_gpu_cuda is on but abinit hasn''t been built',ch10,&
2702 & ' with gpu mode in DOUBLE PRECISION enabled !',ch10,&
2703 & ' Action : change the input variable use_gpu_cuda',ch10,&
2704 & ' or re-compile ABINIT with double precision Cuda enabled.'
2705 call wrtout(std_out,message,'COLL')
2706 ierr=ierr+1
2707 #endif
You can find the same code in 57_iovars/invars0.F90 (look for HAVE_GPU_CUDA_DP)
So the error is triggered any time you don't use double precision.
You may try to comment out those 2 parts and compile again in single precision.
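A minimal way to do that (an untested sketch; the preprocessor guard is kept and only the Fortran body is neutralized with comment characters):
Code: Select all
#ifndef HAVE_GPU_CUDA_DP
! block disabled locally to allow a single-precision CUDA build -- use at your own risk
!  write(message,'(10a)') ch10,&
!  & ' invars0: ERROR -',ch10,&
!  & ' Input variables use_gpu_cuda is on but abinit hasn''t been built',ch10,&
!  & ' with gpu mode in DOUBLE PRECISION enabled !',ch10,&
!  & ' Action : change the input variable use_gpu_cuda',ch10,&
!  & ' or re-compile ABINIT with double precision Cuda enabled.'
!  call wrtout(std_out,message,'COLL')
!  ierr=ierr+1
#endif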
A quick look in the source file makes me think that the code would run with a precision of 1e-7 instead of the 1e-12 of DP mode.
You should check your results carefully if that works, before going to production of course.
If the single-precision feature has been disabled, there might be a reason for it... but what reason... I don't know.
Jordan
EDIT: BTW, if you don't want to use your K600 for your display, you need another GPU for it (or the IGP on recent Intel cpus). You can also try to completely unplug your display and access the computer via ssh. If that still does not deactivate the GPU, remove all X11 services at the boot stage.