Description and Building of the QCD Benchmark
=============================================

Description
===========
The QCD benchmark is, unlike the other benchmarks in the PRACE
application benchmark suite, not a full application but a set of 7
kernels which are representative of some of the most compute-intensive
parts of QCD calculations for different architectures.
The QCD benchmark suite consists of three main parts: part_cpu is
based on 5 kernels taken from major QCD codes, while the applications
contained in part_1 and part_2 are suitable for benchmarking HPC
machines equipped with accelerators such as NVIDIA GPUs or Intel Xeon
Phi processors. In the following, the building instructions for the
part_cpu part are described; see the corresponding subdirectories for
descriptions of the other parts.
PART_CPU
========
Test Cases
----------
Kernel A is derived from BQCD (Berlin Quantum ChromoDynamics program),
a hybrid Monte-Carlo code that simulates Quantum Chromodynamics with
dynamical standard Wilson fermions.
four-dimensional regular grid with periodic boundary conditions. The
kernel is a standard conjugate gradient solver with even/odd
pre-conditioning. Lattice size is 32^2 x 64^2.
Kernel B is derived from SU3_AHiggs, a lattice quantum chromodynamics
(QCD) code intended for computing the conditions of the Early
Universe. Instead of "full QCD", the code applies an effective field
theory, which is valid at high temperatures. In the effective theory,
the lattice is 3D. Lattice size is 256^3.
Kernel C has lattice size 8^4. Note that Kernel C can only be run in a
weak scaling mode, where each CPU stores the same local lattice size,
regardless of the number of CPUs. Ideal scaling for this kernel
therefore corresponds to constant execution time, and performance per
peak TFlop/s is simply the reciprocal of the execution time.
Kernel D consists of the core matrix-vector multiplication routine for
standard Wilson fermions. The lattice size is 64^4.
Kernel E consists of a full conjugate gradient solution using Wilson
fermions. Lattice size is 64^4.
Building the QCD Benchmark in the JuBE Framework
================================================
......
Running the QCD Benchmarks contained in ./part_cpu in the JuBE Framework
==========================================================================
Unpack the QCD_Source_TestCaseA.tar.gz into a directory of your
choice.
The folder ./part_cpu contains the following subfolders:
PABS/
applications/
bench/
......
Matter consists of atoms, which in turn consist of nuclei and electrons. The nuclei consist of
neutrons and protons, which comprise quarks bound together by gluons.
The QCD benchmark comprises several different implementations.
They can be found in the sub-directories `part_1`, `part_2` and `part_cpu`.
Chip                 Time (s)
Intel KNL Xeon Phi   9.72E+01
NVIDIA P100 GPU      5.60E+01
**********************************************************************
PRACE 5IP - Results in seconds for V = 8x64x64x64 (see White Paper for more):

Nodes  Irene KNL  Irene SKL  Juwels  Marconi-KNL  MareNostrum  PizDaint  Davide  Frioul  Deep    Mont-Blanc 3
1      148.68     219.6      182.49  133.38       186.40       53.73     53.4    151     656.41  206.17
2      79.35      114.22     91.83   186.14       94.63        32.38     113     86.9    432.93  93.48
4      48.07      58.11      46.58   287.17       47.22        19.13     21.4    52.7    277.67  49.95
8      28.42      32.09      25.37   533.49       25.86        12.78     14.8    36.5    189.83  25.19
16     17.08      14.35      11.77   1365.72      11.64        9.20      10.1    17.8    119.14  12.55
32     10.56      7.28       5.43    2441.29      5.59         6.35      6.94    15.6
64     9.01       4.18       2.65    2.65         6.41         11.7
128    5.08       1.39       2.48    5.95
256    1.38       5.84
512    0.89
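#KNL (Intel Xeon Phi) configuration file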
CFLAGS = $(DEFINES) -DARCH=0 -DVVL=8 -DAoS -O3 -xMIC-AVX512 -std=c99 -fma -align -finline-functions
LDFLAGS = -lm -qopenmp
CC=mpicc
TARGETCC=mpicc
TARGETCFLAGS=-x c -qopenmp $(CFLAGS)
#CPU (Intel Skylake) configuration file
CFLAGS = $(DEFINES) -std=c99 -O3 -axCORE-AVX512 -mtune=skylake-avx512 -qopenmp -DARCH=0
LDFLAGS = -lm -qopenmp
CC=mpicc
TARGETCC=mpicc
TARGETCFLAGS=-x c -qopenmp $(CFLAGS) -DVVL=4 -DAoSoA
#CPU (ARM Cortex-A72) configuration file
CFLAGS = $(DEFINES) -std=gnu89 -O2 -DARCH=0 -Ofast -march=armv8-a -mcpu=cortex-a72 -fomit-frame-pointer
LDFLAGS = -lm -fopenmp
CC=mpicc
TARGETCC=mpicc
TARGETCFLAGS=-x c -fopenmp $(CFLAGS) -DVVL=4 -DAoSoA
#lattice
nx #NX#
ny #NY#
nz #NZ#
nt #NT#
totnodes #PX# #PY# #PZ# #PT#
#wilson
mass_wilson 0.01
#max iterations
max_cg_iters 1000
#etc
verbose 1
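
For illustration, a filled-in input file generated from this template (assuming a 64x64x64x8 global lattice decomposed over an 8x1x1x1 MPI task grid) would read:

#lattice
nx 64
ny 64
nz 64
nt 8
totnodes 8 1 1 1
#wilson
mass_wilson 0.01
#max iterations
max_cg_iters 1000
#etc
verbose 1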
##prepare Kernel E
##
##
##
##
nx=$1
ny=$2
nz=$3
nt=$4
px=$5
py=$6
pz=$7
pt=$8
folder=$9
echo creating input file in $folder
sed 's/#NX#/'${nx}'/g' kernel_E.input_template > test
mv test kernel_E.input_tmp
sed 's/#NY#/'${ny}'/g' kernel_E.input_tmp > test
mv test kernel_E.input_tmp
sed 's/#NZ#/'${nz}'/g' kernel_E.input_tmp > test
mv test kernel_E.input_tmp
sed 's/#NT#/'${nt}'/g' kernel_E.input_tmp > test
mv test kernel_E.input_tmp
sed 's/#PX#/'${px}'/g' kernel_E.input_tmp > test
mv test kernel_E.input_tmp
sed 's/#PY#/'${py}'/g' kernel_E.input_tmp > test
mv test kernel_E.input_tmp
sed 's/#PZ#/'${pz}'/g' kernel_E.input_tmp > test
mv test kernel_E.input_tmp
sed 's/#PT#/'${pt}'/g' kernel_E.input_tmp > test
mv test $folder/kernel_E.input
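
A typical invocation, matching the call in the run script below (arguments: nx ny nz nt px py pz pt folder; the folder name is only an example):

./prepare_kernelE_input.sh 64 64 64 8 8 1 1 1 N1_NtaskpN8_64x64x64x8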
##
##
##
##
##
time=$1
nodes=$2
n=$3
g=$4
omp=$5
perm=$6
src=$7
folder=$8
echo Creating submit-script in $folder
cp submit_job_part1.sh.template ${folder}/.
cd $folder
sed 's/#NODES#/'${nodes}'/g' submit_job_part1.sh.template > test
mv test submit_job_part1.temp
sed 's/#NTASK#/'${n}'/g' submit_job_part1.temp > test
mv test submit_job_part1.temp
sed 's/#TASKPERNODE#/'${g}'/g' submit_job_part1.temp > test
mv test submit_job_part1.temp
sed 's/#OMPTHREADS#/'${omp}'/g' submit_job_part1.temp > test
mv test submit_job_part1.temp
sed 's/#TIME#/'${time}'/g' submit_job_part1.temp > test
mv test submit_job_part1.temp
wrc=$(pwd)
echo $wrc
sed 's #WRC# '${wrc}' g' submit_job_part1.temp > test
mv test ${src}
if [ $perm -eq 1 ];then
chmod +x $src
fi
rm submit_job_part1.temp
rm submit_job_part1.sh.template
cd ..
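
A typical invocation, matching the call in the run script below (arguments: walltime, nodes, total tasks, tasks per node, OMP threads, permission flag, target script name, folder; values are an example):

./prepare_submit_job_part1.sh '01:30:00' 1 8 8 6 1 submit_job_part1_N8.sh N1_NtaskpN8_64x64x64x8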
##
## RUN - Part 1
##
## Before starting this job-script, replace "SUBMIT" with the submission command of the local queuing system.
## Additionally, in the script submit_job the execution command has to be adjusted to the local machine.
##
##
## Script for Part 1 of the UEABS benchmark suite
##
#!/bin/bash
EXE=/gpfs/projects/pr1ehq00/bench/part1/bench
## Set scaling-mode: Strong or Weak
sca_mode="Strong"
#sca_mode="Weak"
mode="Analysis"
#mode="Run"
## sbatch_on=1
exe_perm=1 ## use chmod to allow execution of submit_job_Nx_Gx.sh
## lattice size (size strong 1)
gx=64
gt=8
g=8 ##MPItaskperNODE
openmp=6 ##OMP
## lattice size (size strong 2) - there is no other testcase yet
#gx=64
#gt=128
## lattice size (size weak 1)
#gx=48
#gt=24
## use a smaller lattice size for weak-scaling mode, e.g. gx=24 gt=24
##
lt=$gt
lx=$gx
ly=$gx
lz=$gx
#for n in 1 2 4 8 16 32 64 128 256 512 1024 2048 4096 8192 16384; do
for n in 8 16 32 64 128 256 512 1024 2048 4096 8192 16384; do
#for n in 8;do
px=$n
py=1
pz=1
pt=1
if [ $n -eq 16 ];then
py=2
px=8
fi
if [ $n -eq 32 ];then
py=4
px=8
fi
if [ $n -eq 64 ];then
pt=2
py=4
px=8
fi
if [ $n -eq 128 ];then
pz=2
py=8
px=8
fi
if [ $n -eq 256 ];then
pz=4
py=8
px=8
fi
if [ $n -eq 512 ];then
pz=8
py=8
px=8
fi
if [ $n -eq 1024 ];then
pz=8
py=8
px=8
pt=2
fi
if [ $n -eq 2048 ];then
pz=8
py=8
px=16
pt=2
fi
if [ $n -eq 4096 ];then
pz=8
py=16
px=16
pt=2
fi
if [ $n -eq 8192 ];then
pz=16
py=16
px=16
pt=2
fi
if [ $n -eq 16384 ];then
pz=16
py=16
px=16
pt=2
fi
if [ $sca_mode = "Strong" ];then
lt1=$((gt/pt))
lx1=$((gx/px))
ly1=$((gx/py))
lz1=$((gx/pz))
else
lt1=$lt
lx1=$lx
ly1=$ly
lz1=$lz
lt=$((gt*pt))
lx=$((gx*px))
ly=$((gx*py))
lz=$((gx*pz))
fi
node=$((n/g))
name=${sca_mode}_part1_${px}x${py}x${pz}x${pt}_${lx}x${ly}x${lz}x${lt}_${n}
folder=N${node}_NtaskpN${g}_${lx}x${ly}x${lz}x${lt}
if [ $mode != "Analysis" ];then
echo $name
mkdir $folder
submitscript=submit_job_part1_N${n}.sh
./prepare_submit_job_part1.sh '01:30:00' ${node} ${n} ${g} ${openmp} ${exe_perm} ${submitscript} ${folder}
./prepare_kernelE_input.sh ${lx} ${ly} ${lz} ${lt} ${px} ${py} ${pz} ${pt} $folder
cd $folder
echo sbatch $submitscript $EXE $name
sbatch ./$submitscript $EXE $name
sleep 1
cd ..
## Scan the output and save the timing data in Part1_$mode.log
else
echo $name >> Part1_$mode.log
grep "sec" $folder/$name >> Part1_$mode.log
fi
done
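
In strong-scaling mode the script assumes that each process-grid dimension divides the global lattice evenly. A minimal sanity check, not part of the original script, could be inserted before the submit step:

## check that the process grid divides the global lattice evenly
## (assumed helper, not in the original script)
check_div () {
  if [ $(($1 % $2)) -ne 0 ]; then
    echo "warning: $2 does not divide $1 evenly" >&2
  fi
}
check_div $gx $px
check_div $gx $py
check_div $gx $pz
check_div $gt $pt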
##
## RUN - Part 1
##
## Before starting this job-script, replace "SUBMIT" with the submission command of the local queuing system.
## Additionally, in the script submit_job the execution command has to be adjusted to the local machine.
##
##
## Script for Part 1 of the UEABS benchmark suite
##
#!/bin/bash
EXE=/gpfs/projects/pr1ehq00/bench/part1/bench
## Set scaling-mode: Strong or Weak
sca_mode="Strong"
#sca_mode="Weak"
## mode="Analysis"
mode="Run"
## sbatch_on=1
exe_perm=1 ## use chmod to allow execution of submit_job_Nx_Gx.sh
## lattice size (size strong 1)
gx=64
gt=8
g=8 ##MPItaskperNODE
openmp=6 ##OMP
## lattice size (size strong 2) - there is no other testcase yet
#gx=64
#gt=128
## lattice size (size weak 1)
#gx=48
#gt=24
## use a smaller lattice size for weak-scaling mode, e.g. gx=24 gt=24
##
lt=$gt
lx=$gx
ly=$gx
lz=$gx
#for n in 1 2 4 8 16 32 64 128 256 512 1024 2048 4096 8192 16384; do
for n in 8 16 32 64 128 256 512 1024 2048 4096 8192 16384; do
#for n in 8;do
px=$n
py=1
pz=1
pt=1
if [ $n -eq 16 ];then
py=2
px=8
fi
if [ $n -eq 32 ];then
py=4
px=8
fi
if [ $n -eq 64 ];then
pt=2
py=4
px=8
fi
if [ $n -eq 128 ];then
pz=2
py=8
px=8
fi
if [ $n -eq 256 ];then
pz=4
py=8
px=8
fi
if [ $n -eq 512 ];then
pz=8
py=8
px=8
fi
if [ $n -eq 1024 ];then
pz=8
py=8
px=8
pt=2
fi
if [ $n -eq 2048 ];then
pz=8
py=8
px=16
pt=2
fi
if [ $n -eq 4096 ];then
pz=8
py=16
px=16
pt=2
fi
if [ $n -eq 8192 ];then
pz=16
py=16
px=16
pt=2
fi
if [ $n -eq 16384 ];then
pz=16
py=16
px=16
pt=2
fi
if [ $sca_mode = "Strong" ];then
lt1=$((gt/pt))
lx1=$((gx/px))
ly1=$((gx/py))
lz1=$((gx/pz))
else
lt1=$lt
lx1=$lx
ly1=$ly
lz1=$lz
lt=$((gt*pt))
lx=$((gx*px))
ly=$((gx*py))
lz=$((gx*pz))
fi
node=$((n/g))
name=${sca_mode}_part1_${px}x${py}x${pz}x${pt}_${lx}x${ly}x${lz}x${lt}_${n}
folder=N${node}_NtaskpN${g}_${lx}x${ly}x${lz}x${lt}
if [ $mode != "Analysis" ];then
echo $name
mkdir $folder
submitscript=submit_job_part1_N${n}.sh
./prepare_submit_job_part1.sh '01:30:00' ${node} ${n} ${g} ${openmp} ${exe_perm} ${submitscript} ${folder}
./prepare_kernelE_input.sh ${lx} ${ly} ${lz} ${lt} ${px} ${py} ${pz} ${pt} $folder
cd $folder
echo sbatch $submitscript $EXE $name
sbatch ./$submitscript $EXE $name
sleep 1
cd ..
## Scan the output and save the timing data in Part1_$mode.log
else
echo $name >> Part1_$mode.log
grep "sec" $folder/$name >> Part1_$mode.log
fi
done
#!/bin/bash
#SBATCH --job-name=TheM1
#SBATCH --workdir=#WRC#
#SBATCH --output=mpi_%j_#NODES#.out
#SBATCH --error=mpi_%j_#NODES#.err
#SBATCH --ntasks=#NTASK#
#SBATCH --time=#TIME#
#SBATCH --constraint=highmem
#SBATCH --nodes=#NODES#
#SBATCH --ntasks-per-node=#TASKPERNODE#
#SBATCH --cpus-per-task=1
#SBATCH --exclusive
#set -e
#export OMP_NUM_THREADS=#OMPTHREADS#
#export KMP_AFFINITY=compact,1,0,granularity=fine,verbose
#export KMP_HW_SUBSET=1T
export OMP_NUM_THREADS=#OMPTHREADS#
export KMP_AFFINITY=balanced,granularity=fine
export I_MPI_PIN=1
export I_MPI_PIN_DOMAIN=6
module load intel/2018.4 impi/2018.4
module load hdf5
EXE=$1
name=$2
echo "mpirun -n #NTASK# $EXE "
mpirun -n #NTASK# $EXE > $name
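
After prepare_submit_job_part1.sh has filled in the placeholders (here for 8 tasks on one node; the output name is the one generated by the run script), the final launch line reads, for example:

mpirun -n 8 /gpfs/projects/pr1ehq00/bench/part1/bench > Strong_part1_8x1x1x1_64x64x64x8_8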
#lattice
nx #NX#
ny #NY#
nz #NZ#
nt #NT#
totnodes #PX# #PY# #PZ# #PT#
#wilson
mass_wilson 0.01
#max iterations
max_cg_iters 1000
#etc
verbose 1
##prepare Kernel E
##
##
##
##
nx=$1
ny=$2
nz=$3
nt=$4
px=$5
py=$6
pz=$7
pt=$8
folder=$9
echo creating input file in $folder
sed 's/#NX#/'${nx}'/g' kernel_E.input_template > test
mv test kernel_E.input_tmp
sed 's/#NY#/'${ny}'/g' kernel_E.input_tmp > test
mv test kernel_E.input_tmp
sed 's/#NZ#/'${nz}'/g' kernel_E.input_tmp > test
mv test kernel_E.input_tmp
sed 's/#NT#/'${nt}'/g' kernel_E.input_tmp > test
mv test kernel_E.input_tmp
sed 's/#PX#/'${px}'/g' kernel_E.input_tmp > test
mv test kernel_E.input_tmp
sed 's/#PY#/'${py}'/g' kernel_E.input_tmp > test
mv test kernel_E.input_tmp
sed 's/#PZ#/'${pz}'/g' kernel_E.input_tmp > test
mv test kernel_E.input_tmp
sed 's/#PT#/'${pt}'/g' kernel_E.input_tmp > test
mv test $folder/kernel_E.input
##
##
##
##
##
time=$1
nodes=$2
n=$3
g=$4
omp=$5
cpuptask=$6
perm=$7
src=$8
folder=$9
exe=${10}
name=${11}
echo Creating submit-script in $folder
cp submit_job_part1.sh.template ${folder}/.
cd $folder
sed 's/#NODES#/'${nodes}'/g' submit_job_part1.sh.template > test
mv test submit_job_part1.temp
sed 's/#NTASK#/'${n}'/g' submit_job_part1.temp > test
mv test submit_job_part1.temp
sed 's/#TASKPERNODE#/'${g}'/g' submit_job_part1.temp > test
mv test submit_job_part1.temp
sed 's/#OMPTHREADS#/'${omp}'/g' submit_job_part1.temp > test
mv test submit_job_part1.temp
sed 's/#CPUPTASK#/'${cpuptask}'/g' submit_job_part1.temp > test
mv test submit_job_part1.temp
sed 's/#TIME#/'${time}'/g' submit_job_part1.temp > test
mv test submit_job_part1.temp
sed 's #EXE# '${exe}' g' submit_job_part1.temp > test
mv test submit_job_part1.temp
sed 's/#NAME#/'${name}'/g' submit_job_part1.temp > test
mv test submit_job_part1.temp
wrc=$(pwd)
echo $wrc
sed 's #WRC# '${wrc}' g' submit_job_part1.temp > test
mv test ${src}
if [ $perm -eq 1 ];then
chmod +x $src
fi
rm submit_job_part1.temp
rm submit_job_part1.sh.template
cd ..
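
A typical invocation of this extended version, matching the call in the run script below (arguments: walltime, nodes, total tasks, tasks per node, OMP threads, CPUs per task, permission flag, target script name, folder, executable, output name; values are an example):

./prepare_submit_job_part1.sh '01:30:00' 1 8 8 6 6 1 submit_job_part1_N8.sh N1_NtaskpN8_64x64x64x8 /ccc/cont005/home/unicy/finkenrj/run/part1/bench Strong_part1_8x1x1x1_64x64x64x8_8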
##
## RUN - Part 1
##
## Before starting this job-script, replace "SUBMIT" with the submission command of the local queuing system.
## Additionally, in the script submit_job the execution command has to be adjusted to the local machine.
##
##
## Script for Part 1 of the UEABS benchmark suite
##
#!/bin/bash
EXE=/ccc/cont005/home/unicy/finkenrj/run/part1/bench
## Set scaling-mode: Strong or Weak
sca_mode="Strong"
#sca_mode="OneNode"
#sca_mode="Weak"
mode="Analysis"
##mode="Run"
## sbatch_on=1
exe_perm=1 ## use chmod to allow execution of submit_job_Nx_Gx.sh
## lattice size (size strong 1)
gx=64
gy=64
gz=64
gt=8
g=8 ##MPItaskperNODE
openmp=6 ##OMP
cpuptask=6 ## CPUPERTASK
## lattice size (size strong 2) - there is no other testcase yet
#gx=64
#gt=128
## lattice size (size weak 1)
#gx=48
#gt=24
## use a smaller lattice size for weak-scaling mode, e.g. gx=24 gt=24
##
#gy=$gx
#gz=$gx
lt=$gt
lx=$gx
ly=$gy
lz=$gz
#for n in 1 2 4 8 16 32 64 128 256 512 1024 2048 4096 8192 16384; do
#for n in 8; do
for n in 8 16 32 64 128 256 512 1024 2048 4096 8192 16384 32768 65536 131072 262144 524288 1048576; do
#for n in 8;do
px=$n
py=1
pz=1
pt=1
if [ $n -eq 16 ];then
py=2
px=8
fi
if [ $n -eq 32 ];then
py=4
px=8
fi
if [ $n -eq 64 ];then
pt=2
py=4
px=8
fi
if [ $n -eq 128 ];then
pz=2
py=8
px=8
fi
if [ $n -eq 256 ];then
pz=4
py=8
px=8
fi
if [ $n -eq 512 ];then
pz=8
py=8
px=8
fi
if [ $n -eq 1024 ];then
pz=8
py=8
px=8
pt=2
fi
if [ $n -eq 2048 ];then
pz=8
py=8
px=16
pt=2
fi
if [ $n -eq 4096 ];then
pz=8
py=16
px=16
pt=2
fi
if [ $n -eq 8192 ];then
pz=16
py=16
px=16
pt=2
fi
if [ $n -eq 16384 ];then
pz=16
py=16
px=16
pt=4
fi
if [ $n -eq 32768 ];then
px=16
py=16
pz=16
pt=8
fi
if [ $n -eq 65536 ];then
px=16
py=16
pz=16
pt=16
fi
if [ $n -eq 131072 ];then
px=16
py=16
pz=16
pt=32
fi
if [ $n -eq 262144 ];then
px=16
py=32
pz=32
pt=16
fi
if [ $n -eq 524288 ];then
px=32
py=32
pz=32
pt=16
fi
if [ $n -eq 1048576 ];then
px=32
py=32
pz=32
pt=32
fi
if [ $sca_mode = "Strong" ];then
lt1=$((gt/pt))
lx1=$((gx/px))
ly1=$((gy/py))
lz1=$((gz/pz))
elif [ $sca_mode = "OneNode" ]; then
lx1=$((gx*px))
ly1=$((gy*py))
lz1=$((gz*pz))
lt1=$((gt*pt/g))
n=$g
lx=$((gx*px))
ly=$((gy*py))
lz=$((gz*pz))
lt=$((gt*pt))
px=1
py=1
pz=1
pt=$g
else
lt1=$lt
lx1=$lx
ly1=$ly
lz1=$lz
lt=$((gt*pt))
lx=$((gx*px))
ly=$((gy*py))
lz=$((gz*pz))
fi
node=$((n/g))
name=${sca_mode}_part1_${px}x${py}x${pz}x${pt}_${lx}x${ly}x${lz}x${lt}_${n}
folder=N${node}_NtaskpN${g}_${lx}x${ly}x${lz}x${lt}
if [ $mode != "Analysis" ];then
echo $name
mkdir $folder
submitscript=submit_job_part1_N${n}.sh
./prepare_submit_job_part1.sh '01:30:00' ${node} ${n} ${g} ${openmp} ${cpuptask} ${exe_perm} ${submitscript} ${folder} ${EXE} ${name}
./prepare_kernelE_input.sh ${lx} ${ly} ${lz} ${lt} ${px} ${py} ${pz} ${pt} $folder
cd $folder
echo sbatch $submitscript
ccc_msub ./$submitscript
sleep 1
cd ..
## Scan the output and save the timing data in Part1_$mode.log
else
echo $name >> Part1_$mode.log
grep "sec" $folder/$name >> Part1_$mode.log
fi
done
##
## RUN - Part 1
##
## Before starting this job-script, replace "SUBMIT" with the submission command of the local queuing system.
## Additionally, in the script submit_job the execution command has to be adjusted to the local machine.
##
##
## Script for Part 1 of the UEABS benchmark suite
##
#!/bin/bash
EXE=/ccc/cont005/home/unicy/finkenrj/run/part1/bench
## Set scaling-mode: Strong or Weak
sca_mode="Strong"
#sca_mode="OneNode"
#sca_mode="Weak"
## mode="Analysis"
mode="Run"
## sbatch_on=1
exe_perm=1 ## use chmod to allow execution of submit_job_Nx_Gx.sh
## lattice size (size strong 1)
gx=64
gy=64
gz=64
gt=8
g=8 ##MPItaskperNODE
openmp=6 ##OMP
cpuptask=6 ## CPUPERTASK
## lattice size (size strong 2) - there is no other testcase yet
#gx=64
#gt=128
## lattice size (size weak 1)
#gx=48
#gt=24
## use a smaller lattice size for weak-scaling mode, e.g. gx=24 gt=24
##
#gy=$gx
#gz=$gx
lt=$gt
lx=$gx
ly=$gy
lz=$gz
#for n in 1 2 4 8 16 32 64 128 256 512 1024 2048 4096 8192 16384; do
#for n in 8; do
for n in 8 16 32 64 128 256 512 1024 2048 4096 8192 16384 32768 65536 131072 262144 524288 1048576; do
#for n in 8;do
px=$n
py=1
pz=1
pt=1
if [ $n -eq 16 ];then
py=2
px=8
fi
if [ $n -eq 32 ];then
py=4
px=8
fi
if [ $n -eq 64 ];then
pt=2
py=4
px=8
fi
if [ $n -eq 128 ];then
pz=2
py=8
px=8
fi
if [ $n -eq 256 ];then
pz=4
py=8
px=8
fi
if [ $n -eq 512 ];then
pz=8
py=8
px=8
fi
if [ $n -eq 1024 ];then
pz=8
py=8
px=8
pt=2
fi
if [ $n -eq 2048 ];then
pz=8
py=8
px=16
pt=2
fi
if [ $n -eq 4096 ];then
pz=8
py=16
px=16
pt=2
fi
if [ $n -eq 8192 ];then
pz=16
py=16
px=16
pt=2
fi
if [ $n -eq 16384 ];then
pz=16
py=16
px=16
pt=4
fi
if [ $n -eq 32768 ];then
px=16
py=16
pz=16
pt=8
fi
if [ $n -eq 65536 ];then
px=16
py=16
pz=16
pt=16
fi
if [ $n -eq 131072 ];then
px=16
py=16
pz=16
pt=32
fi
if [ $n -eq 262144 ];then
px=16
py=32
pz=32
pt=16
fi
if [ $n -eq 524288 ];then
px=32
py=32
pz=32
pt=16
fi
if [ $n -eq 1048576 ];then
px=32
py=32
pz=32
pt=32
fi
if [ $sca_mode = "Strong" ];then
lt1=$((gt/pt))
lx1=$((gx/px))
ly1=$((gy/py))
lz1=$((gz/pz))
elif [ $sca_mode = "OneNode" ]; then
lx1=$((gx*px))
ly1=$((gy*py))
lz1=$((gz*pz))
lt1=$((gt*pt/g))
n=$g
lx=$((gx*px))
ly=$((gy*py))
lz=$((gz*pz))
lt=$((gt*pt))
px=1
py=1
pz=1
pt=$g
else
lt1=$lt
lx1=$lx
ly1=$ly
lz1=$lz
lt=$((gt*pt))
lx=$((gx*px))
ly=$((gy*py))
lz=$((gz*pz))
fi
node=$((n/g))
name=${sca_mode}_part1_${px}x${py}x${pz}x${pt}_${lx}x${ly}x${lz}x${lt}_${n}
folder=N${node}_NtaskpN${g}_${lx}x${ly}x${lz}x${lt}
if [ $mode != "Analysis" ];then
echo $name
mkdir $folder
submitscript=submit_job_part1_N${n}.sh
./prepare_submit_job_part1.sh '01:30:00' ${node} ${n} ${g} ${openmp} ${cpuptask} ${exe_perm} ${submitscript} ${folder} ${EXE} ${name}
./prepare_kernelE_input.sh ${lx} ${ly} ${lz} ${lt} ${px} ${py} ${pz} ${pt} $folder
cd $folder
echo sbatch $submitscript
ccc_msub ./$submitscript
sleep 1
cd ..
## Scan the output and save the timing data in Part1_$mode.log
else
echo $name >> Part1_$mode.log
grep "sec" $folder/$name >> Part1_$mode.log
fi
done
#! /bin/bash
#MSUB -r Test1 # Request name
#MSUB -n #NTASK# # Number of tasks
#MSUB -c #OMPTHREADS# # Number of threads per task
#MSUB -N #NODES# # Number of nodes
#MSUB -T 1800 # Elapsed time limit in seconds
#MSUB -o bench_out_%I.o # Standard output; %I is the job ID
#MSUB -e bench_out_%I.e # Error output; %I is the job ID
#MSUB -q skylake # Partition; see ccc_mpinfo (e.g. skylake or knl)
#MSUB -A pa4564 # Project ID
set -x
cd ${BRIDGE_MSUB_PWD}
export OMP_NUM_THREADS=#OMPTHREADS#
export BRIDGE_MSUB_NCORE=#CPUPTASK# # number of requested cores per process
#module unload feature/openmpi/net/auto feature/openmpi/mpi_compiler/intel mpi/openmpi/2.0.4
#module unload mpi/openmpi/2.0.4
#module unload .tuning/openmpi/2.0.4
#module unload feature/openmpi/net/auto feature/openmpi/mpi_compiler/intel mpi/openmpi/2.0.4
#module load mpi/intelmpi/2018.0.3.222
#module load python3
echo "mpirun -n #NTASK# #EXE#"
ccc_mprun -n #NTASK# -N #NODES# #EXE# > #NAME#
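
After substitution (again for 8 tasks on one node), the launch line becomes, for example:

ccc_mprun -n 8 -N 1 /ccc/cont005/home/unicy/finkenrj/run/part1/bench > Strong_part1_8x1x1x1_64x64x64x8_8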
and for QPhix:

``` shell
git clone https://github.com/JeffersonLab/qphix
```

Note that for running on Skylake chips it is recommended to use the
develop branch of QPhix, which needs additional packages
like qdp++ (Status 04/2019).
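
For example, assuming the clone above, the develop branch is selected with:

``` shell
cd qphix
git checkout develop
```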
#### 2.1 Compile
or for KNL's,
by using the previously set variable `QMP_INSTALL_DIR`, which links to the install folder
of QMP. The executable `time_clov_noqdp` can now be found in the subfolder `./qphix/test`.
Note that for the develop branch the package qdp++ has to be compiled.
QDP++ can be configured using (here for a Skylake chip):
``` shell
./configure --with-qmp=$QMP_INSTALL_DIR --enable-parallel-arch=parscalar CC=mpiicc CFLAGS="-xCORE-AVX512 -mtune=skylake-avx512 -std=c99" CXX=mpiicpc CXXFLAGS="-axCORE-AVX512 -mtune=skylake-avx512 -std=c++14 -qopenmp" --enable-openmp --host=x86_64-linux-gnu --build=none-none-none --prefix=$QDPXX_INSTALL_DIR
```
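
The build and installation then typically follow the usual autotools sequence (assumed here; not spelled out in the original instructions):

``` shell
make -j 8      # build QDP++
make install   # install into the --prefix path ($QDPXX_INSTALL_DIR)
```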
Now the QPhix executable can be compiled using:
``` shell
cmake -DQDPXX_DIR=$QDPXX_INSTALL_DIR -DQMP_DIR=$QMP_INSTALL_DIR -Disa=avx512 -Dparallel_arch=parscalar -Dhost_cxx=mpiicpc -Dhost_cxxflags="-std=c++17 -O3 -axCORE-AVX512 -mtune=skylake-avx512" -Dtm_clover=ON -Dtwisted_mass=ON -Dtesting=ON -DCMAKE_CXX_COMPILER=mpiicpc -DCMAKE_CXX_FLAGS="-std=c++17 -O3 -axCORE-AVX512 -mtune=skylake-avx512" -DCMAKE_C_COMPILER=mpiicc -DCMAKE_C_FLAGS="-std=c99 -O3 -axCORE-AVX512 -mtune=skylake-avx512" ..
```
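
The trailing `..` indicates that the command is issued from a separate build directory inside the qphix tree; the build itself is then started with, e.g.:

``` shell
make -j 8
```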
The executable `time_clov_noqdp` can then be found in the subfolder `./qphix/test`.
##### 2.1.1 Example compilation on PRACE machines
......