Description and Building of the QCD Benchmark
=============================================

Description
===========
The QCD benchmark is, unlike the other benchmarks in the PRACE
application benchmark suite, not a full application but a set of 7
kernels which are representative of some of the most compute-intensive
parts of QCD calculations for different architectures.
The QCD benchmark suite consists of three main parts: part_cpu is
based on 5 kernels taken from major QCD codes, while the applications
contained in part_1 and part_2 are suitable for benchmarking HPC
machines equipped with accelerators such as NVIDIA GPUs or Intel Xeon
Phi processors. In the following, the building instructions for the
part_cpu part are described; see the corresponding subdirectories for
descriptions of the other parts.
PART_CPU
========
Test Cases
----------
Kernel A is derived from BQCD (Berlin Quantum ChromoDynamics program),
a hybrid Monte-Carlo code that simulates Quantum Chromodynamics with
dynamical standard Wilson fermions.
four-dimensional regular grid with periodic boundary conditions. The
kernel is a standard conjugate gradient solver with even/odd
pre-conditioning. Lattice size is 32^2 x 64^2.
Kernel B is derived from SU3_AHiggs, a lattice quantum chromodynamics
(QCD) code intended for computing the conditions of the Early
Universe. Instead of "full QCD", the code applies an effective field
theory, which is valid at high temperatures. In the effective theory,
the lattice is 3D. Lattice size is 256^3.
Kernel C has lattice size 8^4. Note that Kernel C can only be run in a
weak scaling mode, where each CPU stores the same local lattice size,
regardless of the number of CPUs. Ideal scaling for this kernel
therefore corresponds to constant execution time, and performance per
peak TFlop/s is simply the reciprocal of the execution time.
Kernel D consists of the core matrix-vector multiplication routine for
standard Wilson fermions. The lattice size is 64^4.
Kernel E consists of a full conjugate gradient solution using Wilson
fermions. Lattice size is 64^4.
Building the QCD Benchmark in the JuBE Framework
================================================
......
Running the QCD Benchmarks contained in ./part_cpu in the JuBE Framework
==========================================================================
Unpack the QCD_Source_TestCaseA.tar.gz into a directory of your
choice.
The folder ./part_cpu contains the following subfolders:
PABS/
applications/
bench/
......
Matter consists of atoms, which in turn consist of nuclei and electrons. The nuclei consist of
neutrons and protons, which comprise quarks bound together by gluons.
The QCD benchmark comprises several different implementations.
They can be found in the sub-directories `part_1`, `part_2` and `part_cpu`.
Chip                 Time (s)
Intel KNL Xeon Phi   9.72E+01
NVIDIA P100 GPU      5.60E+01
**********************************************************************
PRACE 5IP - Results in seconds for V = 8x64x64x64 (see White Paper for more):

Nodes  Irene KNL  Irene SKL  Juwels  Marconi-KNL  MareNostrum  PizDaint  Davide  Frioul  Deep    Mont-Blanc 3
1      148.68     219.6      182.49  133.38       186.40       53.73     53.4    151     656.41  206.17
2      79.35      114.22     91.83   186.14       94.63        32.38     113     86.9    432.93  93.48
4      48.07      58.11      46.58   287.17       47.22        19.13     21.4    52.7    277.67  49.95
8      28.42      32.09      25.37   533.49       25.86        12.78     14.8    36.5    189.83  25.19
16     17.08      14.35      11.77   1365.72      11.64        9.20      10.1    17.8    119.14  12.55
32     10.56      7.28       5.43    2441.29      5.59         6.35      6.94    15.6
64     9.01       4.18       2.65    2.65         6.41         11.7
128    5.08       1.39       2.48    5.95
256    1.38       5.84
512    0.89
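#KNL (Intel Xeon Phi) configuration file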
CFLAGS = $(DEFINES) -DARCH=0 -DVVL=8 -DAoS -O3 -xMIC-AVX512 -std=c99 -fma -align -finline-functions
LDFLAGS = -lm -qopenmp
CC=mpicc
TARGETCC=mpicc
TARGETCFLAGS=-x c -qopenmp $(CFLAGS)
#CPU (Intel Skylake) configuration file
CFLAGS = $(DEFINES) -std=c99 -O3 -axCORE-AVX512 -mtune=skylake-avx512 -qopenmp -DARCH=0
LDFLAGS = -lm -qopenmp
CC=mpicc
TARGETCC=mpicc
TARGETCFLAGS=-x c -qopenmp $(CFLAGS) -DVVL=4 -DAoSoA
#CPU (ARM Cortex-A72) configuration file
CFLAGS = $(DEFINES) -std=gnu89 -O2 -DARCH=0 -Ofast -march=armv8-a -mcpu=cortex-a72 -fomit-frame-pointer
LDFLAGS = -lm -fopenmp
CC=mpicc
TARGETCC=mpicc
TARGETCFLAGS=-x c -fopenmp $(CFLAGS) -DVVL=4 -DAoSoA
#lattice
nx #NX#
ny #NY#
nz #NZ#
nt #NT#
totnodes #PX# #PY# #PZ# #PT#
#wilson
mass_wilson 0.01
#max iterations
max_cg_iters 1000
#etc
verbose 1
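
For illustration, a filled-in input file generated from this template (assuming a 64x64x64x8 global lattice decomposed over an 8x1x1x1 MPI task grid) would read:

#lattice
nx 64
ny 64
nz 64
nt 8
totnodes 8 1 1 1
#wilson
mass_wilson 0.01
#max iterations
max_cg_iters 1000
#etc
verbose 1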
##prepare Kernel E
##
##
##
##
nx=$1
ny=$2
nz=$3
nt=$4
px=$5
py=$6
pz=$7
pt=$8
folder=$9
echo creating input file in $folder
sed 's/#NX#/'${nx}'/g' kernel_E.input_template > test
mv test kernel_E.input_tmp
sed 's/#NY#/'${ny}'/g' kernel_E.input_tmp > test
mv test kernel_E.input_tmp
sed 's/#NZ#/'${nz}'/g' kernel_E.input_tmp > test
mv test kernel_E.input_tmp
sed 's/#NT#/'${nt}'/g' kernel_E.input_tmp > test
mv test kernel_E.input_tmp
sed 's/#PX#/'${px}'/g' kernel_E.input_tmp > test
mv test kernel_E.input_tmp
sed 's/#PY#/'${py}'/g' kernel_E.input_tmp > test
mv test kernel_E.input_tmp
sed 's/#PZ#/'${pz}'/g' kernel_E.input_tmp > test
mv test kernel_E.input_tmp
sed 's/#PT#/'${pt}'/g' kernel_E.input_tmp > test
mv test $folder/kernel_E.input
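
A typical invocation, matching the call in the run script below (arguments: nx ny nz nt px py pz pt folder; the folder name is only an example):

./prepare_kernelE_input.sh 64 64 64 8 8 1 1 1 N1_NtaskpN8_64x64x64x8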
##
##
##
##
##
time=$1
nodes=$2
n=$3
g=$4
omp=$5
perm=$6
src=$7
folder=$8
echo Creating submit-script in $folder
cp submit_job_part1.sh.template ${folder}/.
cd $folder
sed 's/#NODES#/'${nodes}'/g' submit_job_part1.sh.template > test
mv test submit_job_part1.temp
sed 's/#NTASK#/'${n}'/g' submit_job_part1.temp > test
mv test submit_job_part1.temp
sed 's/#TASKPERNODE#/'${g}'/g' submit_job_part1.temp > test
mv test submit_job_part1.temp
sed 's/#OMPTHREADS#/'${omp}'/g' submit_job_part1.temp > test
mv test submit_job_part1.temp
sed 's/#TIME#/'${time}'/g' submit_job_part1.temp > test
mv test submit_job_part1.temp
wrc=$(pwd)
echo $wrc
sed 's #WRC# '${wrc}' g' submit_job_part1.temp > test
mv test ${src}
if [ $perm -eq 1 ];then
chmod +x $src
fi
rm submit_job_part1.temp
rm submit_job_part1.sh.template
cd ..
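
A typical invocation, matching the call in the run script below (arguments: walltime, nodes, total tasks, tasks per node, OMP threads, permission flag, target script name, folder; values are an example):

./prepare_submit_job_part1.sh '01:30:00' 1 8 8 6 1 submit_job_part1_N8.sh N1_NtaskpN8_64x64x64x8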
##
## RUN - Part 1
##
## Before starting this job-script, replace "SUBMIT" with the submission command of the local queuing system.
## Additionally, in the script submit_job the execution command has to be adjusted to the local machine.
##
##
## Script for Part 1 of the UEABS benchmark suite
##
#!/bin/bash
EXE=/gpfs/projects/pr1ehq00/bench/part1/bench
## Set scaling-mode: Strong or Weak
sca_mode="Strong"
#sca_mode="Weak"
mode="Analysis"
#mode="Run"
## sbatch_on=1
exe_perm=1 ## use chmod to allow execution of submit_job_Nx_Gx.sh
## lattice size (size strong 1)
gx=64
gt=8
g=8 ##MPItaskperNODE
openmp=6 ##OMP
## lattice size (size strong 2) - there is no other testcase yet
#gx=64
#gt=128
## lattice size (size weak 1)
#gx=48
#gt=24
## use a smaller lattice size for weak-scaling mode, e.g. gx=24 gt=24
##
lt=$gt
lx=$gx
ly=$gx
lz=$gx
#for n in 1 2 4 8 16 32 64 128 256 512 1024 2048 4096 8192 16384; do
for n in 8 16 32 64 128 256 512 1024 2048 4096 8192 16384; do
#for n in 8;do
px=$n
py=1
pz=1
pt=1
if [ $n -eq 16 ];then
py=2
px=8
fi
if [ $n -eq 32 ];then
py=4
px=8
fi
if [ $n -eq 64 ];then
pt=2
py=4
px=8
fi
if [ $n -eq 128 ];then
pz=2
py=8
px=8
fi
if [ $n -eq 256 ];then
pz=4
py=8
px=8
fi
if [ $n -eq 512 ];then
pz=8
py=8
px=8
fi
if [ $n -eq 1024 ];then
pz=8
py=8
px=8
pt=2
fi
if [ $n -eq 2048 ];then
pz=8
py=8
px=16
pt=2
fi
if [ $n -eq 4096 ];then
pz=8
py=16
px=16
pt=2
fi
if [ $n -eq 8192 ];then
pz=16
py=16
px=16
pt=2
fi
if [ $n -eq 16384 ];then
pz=16
py=16
px=16
pt=2
fi
if [ $sca_mode = "Strong" ];then
lt1=$((gt/pt))
lx1=$((gx/px))
ly1=$((gx/py))
lz1=$((gx/pz))
else
lt1=$lt
lx1=$lx
ly1=$ly
lz1=$lz
lt=$((gt*pt))
lx=$((gx*px))
ly=$((gx*py))
lz=$((gx*pz))
fi
node=$((n/g))
name=${sca_mode}_part1_${px}x${py}x${pz}x${pt}_${lx}x${ly}x${lz}x${lt}_${n}
folder=N${node}_NtaskpN${g}_${lx}x${ly}x${lz}x${lt}
if [ $mode != "Analysis" ];then
echo $name
mkdir $folder
submitscript=submit_job_part1_N${n}.sh
./prepare_submit_job_part1.sh '01:30:00' ${node} ${n} ${g} ${openmp} ${exe_perm} ${submitscript} ${folder}
./prepare_kernelE_input.sh ${lx} ${ly} ${lz} ${lt} ${px} ${py} ${pz} ${pt} $folder
cd $folder
echo sbatch $submitscript $EXE $name
sbatch ./$submitscript $EXE $name
sleep 1
cd ..
## Scan the output and save the timing data in Part1_$mode.log
else
echo $name >> Part1_$mode.log
grep "sec" $folder/$name >> Part1_$mode.log
fi
done
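
In strong-scaling mode the script assumes that each process-grid dimension divides the global lattice evenly. A minimal sanity check, not part of the original script, could be inserted before the submit step:

## check that the process grid divides the global lattice evenly
## (assumed helper, not in the original script)
check_div () {
  if [ $(($1 % $2)) -ne 0 ]; then
    echo "warning: $2 does not divide $1 evenly" >&2
  fi
}
check_div $gx $px
check_div $gx $py
check_div $gx $pz
check_div $gt $pt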
##
## RUN - Part 1
##
## Before starting this job-script, replace "SUBMIT" with the submission command of the local queuing system.
## Additionally, in the script submit_job the execution command has to be adjusted to the local machine.
##
##
## Script for Part 1 of the UEABS benchmark suite
##
#!/bin/bash
EXE=/gpfs/projects/pr1ehq00/bench/part1/bench
## Set scaling-mode: Strong or Weak
sca_mode="Strong"
#sca_mode="Weak"
## mode="Analysis"
mode="Run"
## sbatch_on=1
exe_perm=1 ## use chmod to allow execution of submit_job_Nx_Gx.sh
## lattice size (size strong 1)
gx=64
gt=8
g=8 ##MPItaskperNODE
openmp=6 ##OMP
## lattice size (size strong 2) - there is no other testcase yet
#gx=64
#gt=128
## lattice size (size weak 1)
#gx=48
#gt=24
## use a smaller lattice size for weak-scaling mode, e.g. gx=24 gt=24
##
lt=$gt
lx=$gx
ly=$gx
lz=$gx
#for n in 1 2 4 8 16 32 64 128 256 512 1024 2048 4096 8192 16384; do
for n in 8 16 32 64 128 256 512 1024 2048 4096 8192 16384; do
#for n in 8;do
px=$n
py=1
pz=1
pt=1
if [ $n -eq 16 ];then
py=2
px=8
fi
if [ $n -eq 32 ];then
py=4
px=8
fi
if [ $n -eq 64 ];then
pt=2
py=4
px=8
fi
if [ $n -eq 128 ];then
pz=2
py=8
px=8
fi
if [ $n -eq 256 ];then
pz=4
py=8
px=8
fi
if [ $n -eq 512 ];then
pz=8
py=8
px=8
fi
if [ $n -eq 1024 ];then
pz=8
py=8
px=8
pt=2
fi
if [ $n -eq 2048 ];then
pz=8
py=8
px=16
pt=2
fi
if [ $n -eq 4096 ];then
pz=8
py=16
px=16
pt=2
fi
if [ $n -eq 8192 ];then
pz=16
py=16
px=16
pt=2
fi
if [ $n -eq 16384 ];then
pz=16
py=16
px=16
pt=2
fi
if [ $sca_mode = "Strong" ];then
lt1=$((gt/pt))
lx1=$((gx/px))
ly1=$((gx/py))
lz1=$((gx/pz))
else
lt1=$lt
lx1=$lx
ly1=$ly
lz1=$lz
lt=$((gt*pt))
lx=$((gx*px))
ly=$((gx*py))
lz=$((gx*pz))
fi
node=$((n/g))
name=${sca_mode}_part1_${px}x${py}x${pz}x${pt}_${lx}x${ly}x${lz}x${lt}_${n}
folder=N${node}_NtaskpN${g}_${lx}x${ly}x${lz}x${lt}
if [ $mode != "Analysis" ];then
echo $name
mkdir $folder
submitscript=submit_job_part1_N${n}.sh
./prepare_submit_job_part1.sh '01:30:00' ${node} ${n} ${g} ${openmp} ${exe_perm} ${submitscript} ${folder}
./prepare_kernelE_input.sh ${lx} ${ly} ${lz} ${lt} ${px} ${py} ${pz} ${pt} $folder
cd $folder
echo sbatch $submitscript $EXE $name
sbatch ./$submitscript $EXE $name
sleep 1
cd ..
## Scan the output and save the timing data in Part1_$mode.log
else
echo $name >> Part1_$mode.log
grep "sec" $folder/$name >> Part1_$mode.log
fi
done
#!/bin/bash
#SBATCH --job-name=TheM1
#SBATCH --workdir=#WRC#
#SBATCH --output=mpi_%j_#NODES#.out
#SBATCH --error=mpi_%j_#NODES#.err
#SBATCH --ntasks=#NTASK#
#SBATCH --time=#TIME#
#SBATCH --constraint=highmem
#SBATCH --nodes=#NODES#
#SBATCH --ntasks-per-node=#TASKPERNODE#
#SBATCH --cpus-per-task=1
#SBATCH --exclusive
#set -e
#export OMP_NUM_THREADS=#OMPTHREADS#
#export KMP_AFFINITY=compact,1,0,granularity=fine,verbose
#export KMP_HW_SUBSET=1T
export OMP_NUM_THREADS=#OMPTHREADS#
export KMP_AFFINITY=balanced,granularity=fine
export I_MPI_PIN=1
export I_MPI_PIN_DOMAIN=6
module load intel/2018.4 impi/2018.4
module load hdf5
EXE=$1
name=$2
echo "mpirun -n #NTASK# $EXE "
mpirun -n #NTASK# $EXE > $name
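
After prepare_submit_job_part1.sh has filled in the placeholders (here for 8 tasks on one node; the output name is the one generated by the run script), the final launch line reads, for example:

mpirun -n 8 /gpfs/projects/pr1ehq00/bench/part1/bench > Strong_part1_8x1x1x1_64x64x64x8_8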
#lattice
nx #NX#
ny #NY#
nz #NZ#
nt #NT#
totnodes #PX# #PY# #PZ# #PT#
#wilson
mass_wilson 0.01
#max iterations
max_cg_iters 1000
#etc
verbose 1
##prepare Kernel E
##
##
##
##
nx=$1
ny=$2
nz=$3
nt=$4
px=$5
py=$6
pz=$7
pt=$8
folder=$9
echo creating input file in $folder
sed 's/#NX#/'${nx}'/g' kernel_E.input_template > test
mv test kernel_E.input_tmp
sed 's/#NY#/'${ny}'/g' kernel_E.input_tmp > test
mv test kernel_E.input_tmp
sed 's/#NZ#/'${nz}'/g' kernel_E.input_tmp > test
mv test kernel_E.input_tmp
sed 's/#NT#/'${nt}'/g' kernel_E.input_tmp > test
mv test kernel_E.input_tmp
sed 's/#PX#/'${px}'/g' kernel_E.input_tmp > test
mv test kernel_E.input_tmp
sed 's/#PY#/'${py}'/g' kernel_E.input_tmp > test
mv test kernel_E.input_tmp
sed 's/#PZ#/'${pz}'/g' kernel_E.input_tmp > test
mv test kernel_E.input_tmp
sed 's/#PT#/'${pt}'/g' kernel_E.input_tmp > test
mv test $folder/kernel_E.input
##
##
##
##
##
time=$1
nodes=$2
n=$3
g=$4
omp=$5
cpuptask=$6
perm=$7
src=$8
folder=$9
exe=${10}
name=${11}
echo Creating submit-script in $folder
cp submit_job_part1.sh.template ${folder}/.
cd $folder
sed 's/#NODES#/'${nodes}'/g' submit_job_part1.sh.template > test
mv test submit_job_part1.temp
sed 's/#NTASK#/'${n}'/g' submit_job_part1.temp > test
mv test submit_job_part1.temp
sed 's/#TASKPERNODE#/'${g}'/g' submit_job_part1.temp > test
mv test submit_job_part1.temp
sed 's/#OMPTHREADS#/'${omp}'/g' submit_job_part1.temp > test
mv test submit_job_part1.temp
sed 's/#CPUPTASK#/'${cpuptask}'/g' submit_job_part1.temp > test
mv test submit_job_part1.temp
sed 's/#TIME#/'${time}'/g' submit_job_part1.temp > test
mv test submit_job_part1.temp
sed 's #EXE# '${exe}' g' submit_job_part1.temp > test
mv test submit_job_part1.temp
sed 's/#NAME#/'${name}'/g' submit_job_part1.temp > test
mv test submit_job_part1.temp
wrc=$(pwd)
echo $wrc
sed 's #WRC# '${wrc}' g' submit_job_part1.temp > test
mv test ${src}
if [ $perm -eq 1 ];then
chmod +x $src
fi
rm submit_job_part1.temp
rm submit_job_part1.sh.template
cd ..
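
A typical invocation of this extended version, matching the call in the run script below (arguments: walltime, nodes, total tasks, tasks per node, OMP threads, CPUs per task, permission flag, target script name, folder, executable, output name; values are an example):

./prepare_submit_job_part1.sh '01:30:00' 1 8 8 6 6 1 submit_job_part1_N8.sh N1_NtaskpN8_64x64x64x8 /ccc/cont005/home/unicy/finkenrj/run/part1/bench Strong_part1_8x1x1x1_64x64x64x8_8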
##
## RUN - Part 1
##
## Before starting this job-script, replace "SUBMIT" with the submission command of the local queuing system.
## Additionally, in the script submit_job the execution command has to be adjusted to the local machine.
##
##
## Script for Part 1 of the UEABS benchmark suite
##
#!/bin/bash
EXE=/ccc/cont005/home/unicy/finkenrj/run/part1/bench
## Set scaling-mode: Strong or Weak
sca_mode="Strong"
#sca_mode="OneNode"
#sca_mode="Weak"
mode="Analysis"
##mode="Run"
## sbatch_on=1
exe_perm=1 ## use chmod to allow execution of submit_job_Nx_Gx.sh
## lattice size (size strong 1)
gx=64
gy=64
gz=64
gt=8
g=8 ##MPItaskperNODE
openmp=6 ##OMP
cpuptask=6 ## CPUPERTASK
## lattice size (size strong 2) - there is no other testcase yet
#gx=64
#gt=128
## lattice size (size weak 1)
#gx=48
#gt=24
## use a smaller lattice size for weak-scaling mode, e.g. gx=24 gt=24
##
#gy=$gx
#gz=$gx
lt=$gt
lx=$gx
ly=$gy
lz=$gz
#for n in 1 2 4 8 16 32 64 128 256 512 1024 2048 4096 8192 16384; do
#for n in 8; do
for n in 8 16 32 64 128 256 512 1024 2048 4096 8192 16384 32768 65536 131072 262144 524288 1048576; do
#for n in 8;do
px=$n
py=1
pz=1
pt=1
if [ $n -eq 16 ];then
py=2
px=8
fi
if [ $n -eq 32 ];then
py=4
px=8
fi
if [ $n -eq 64 ];then
pt=2
py=4
px=8
fi
if [ $n -eq 128 ];then
pz=2
py=8
px=8
fi
if [ $n -eq 256 ];then
pz=4
py=8
px=8
fi
if [ $n -eq 512 ];then
pz=8
py=8
px=8
fi
if [ $n -eq 1024 ];then
pz=8
py=8
px=8
pt=2
fi
if [ $n -eq 2048 ];then
pz=8
py=8
px=16
pt=2
fi
if [ $n -eq 4096 ];then
pz=8
py=16
px=16
pt=2
fi
if [ $n -eq 8192 ];then
pz=16
py=16
px=16
pt=2
fi
if [ $n -eq 16384 ];then
pz=16
py=16
px=16
pt=4
fi
if [ $n -eq 32768 ];then
px=16
py=16
pz=16
pt=8
fi
if [ $n -eq 65536 ];then
px=16
py=16
pz=16
pt=16
fi
if [ $n -eq 131072 ];then
px=16
py=16
pz=16
pt=32
fi
if [ $n -eq 262144 ];then
px=16
py=32
pz=32
pt=16
fi
if [ $n -eq 524288 ];then
px=32
py=32
pz=32
pt=16
fi
if [ $n -eq 1048576 ];then
px=32
py=32
pz=32
pt=32
fi
if [ $sca_mode = "Strong" ];then
lt1=$((gt/pt))
lx1=$((gx/px))
ly1=$((gy/py))
lz1=$((gz/pz))
elif [ $sca_mode = "OneNode" ]; then
lx1=$((gx*px))
ly1=$((gy*py))
lz1=$((gz*pz))
lt1=$((gt*pt/g))
n=$g
lx=$((gx*px))
ly=$((gy*py))
lz=$((gz*pz))
lt=$((gt*pt))
px=1
py=1
pz=1
pt=$g
else
lt1=$lt
lx1=$lx
ly1=$ly
lz1=$lz
lt=$((gt*pt))
lx=$((gx*px))
ly=$((gy*py))
lz=$((gz*pz))
fi
node=$((n/g))
name=${sca_mode}_part1_${px}x${py}x${pz}x${pt}_${lx}x${ly}x${lz}x${lt}_${n}
folder=N${node}_NtaskpN${g}_${lx}x${ly}x${lz}x${lt}
if [ $mode != "Analysis" ];then
echo $name
mkdir $folder
submitscript=submit_job_part1_N${n}.sh
./prepare_submit_job_part1.sh '01:30:00' ${node} ${n} ${g} ${openmp} ${cpuptask} ${exe_perm} ${submitscript} ${folder} ${EXE} ${name}
./prepare_kernelE_input.sh ${lx} ${ly} ${lz} ${lt} ${px} ${py} ${pz} ${pt} $folder
cd $folder
echo sbatch $submitscript
ccc_msub ./$submitscript
sleep 1
cd ..
## Scan the output and save the timing data in Part1_$mode.log
else
echo $name >> Part1_$mode.log
grep "sec" $folder/$name >> Part1_$mode.log
fi
done
##
## RUN - Part 1
##
## Before starting this job-script, replace "SUBMIT" with the submission command of the local queuing system.
## Additionally, in the script submit_job the execution command has to be adjusted to the local machine.
##
##
## Script for Part 1 of the UEABS benchmark suite
##
#!/bin/bash
EXE=/ccc/cont005/home/unicy/finkenrj/run/part1/bench
## Set scaling-mode: Strong or Weak
sca_mode="Strong"
#sca_mode="OneNode"
#sca_mode="Weak"
## mode="Analysis"
mode="Run"
## sbatch_on=1
exe_perm=1 ## use chmod to allow execution of submit_job_Nx_Gx.sh
## lattice size (size strong 1)
gx=64
gy=64
gz=64
gt=8
g=8 ##MPItaskperNODE
openmp=6 ##OMP
cpuptask=6 ## CPUPERTASK
## lattice size (size strong 2) - there is no other testcase yet
#gx=64
#gt=128
## lattice size (size weak 1)
#gx=48
#gt=24
## use a smaller lattice size for weak-scaling mode, e.g. gx=24 gt=24
##
#gy=$gx
#gz=$gx
lt=$gt
lx=$gx
ly=$gy
lz=$gz
#for n in 1 2 4 8 16 32 64 128 256 512 1024 2048 4096 8192 16384; do
#for n in 8; do
for n in 8 16 32 64 128 256 512 1024 2048 4096 8192 16384 32768 65536 131072 262144 524288 1048576; do
#for n in 8;do
px=$n
py=1
pz=1
pt=1
if [ $n -eq 16 ];then
py=2
px=8
fi
if [ $n -eq 32 ];then
py=4
px=8
fi
if [ $n -eq 64 ];then
pt=2
py=4
px=8
fi
if [ $n -eq 128 ];then
pz=2
py=8
px=8
fi
if [ $n -eq 256 ];then
pz=4
py=8
px=8
fi
if [ $n -eq 512 ];then
pz=8
py=8
px=8
fi
if [ $n -eq 1024 ];then
pz=8
py=8
px=8
pt=2
fi
if [ $n -eq 2048 ];then
pz=8
py=8
px=16
pt=2
fi
if [ $n -eq 4096 ];then
pz=8
py=16
px=16
pt=2
fi
if [ $n -eq 8192 ];then
pz=16
py=16
px=16
pt=2
fi
if [ $n -eq 16384 ];then
pz=16
py=16
px=16
pt=4
fi
if [ $n -eq 32768 ];then
px=16
py=16
pz=16
pt=8
fi
if [ $n -eq 65536 ];then
px=16
py=16
pz=16
pt=16
fi
if [ $n -eq 131072 ];then
px=16
py=16
pz=16
pt=32
fi
if [ $n -eq 262144 ];then
px=16
py=32
pz=32
pt=16
fi
if [ $n -eq 524288 ];then
px=32
py=32
pz=32
pt=16
fi
if [ $n -eq 1048576 ];then
px=32
py=32
pz=32
pt=32
fi
if [ $sca_mode = "Strong" ];then
lt1=$((gt/pt))
lx1=$((gx/px))
ly1=$((gy/py))
lz1=$((gz/pz))
elif [ $sca_mode = "OneNode" ]; then
lx1=$((gx*px))
ly1=$((gy*py))
lz1=$((gz*pz))
lt1=$((gt*pt/g))
n=$g
lx=$((gx*px))
ly=$((gy*py))
lz=$((gz*pz))
lt=$((gt*pt))
px=1
py=1
pz=1
pt=$g
else
lt1=$lt
lx1=$lx
ly1=$ly
lz1=$lz
lt=$((gt*pt))
lx=$((gx*px))
ly=$((gy*py))
lz=$((gz*pz))
fi
node=$((n/g))
name=${sca_mode}_part1_${px}x${py}x${pz}x${pt}_${lx}x${ly}x${lz}x${lt}_${n}
folder=N${node}_NtaskpN${g}_${lx}x${ly}x${lz}x${lt}
if [ $mode != "Analysis" ];then
echo $name
mkdir $folder
submitscript=submit_job_part1_N${n}.sh
./prepare_submit_job_part1.sh '01:30:00' ${node} ${n} ${g} ${openmp} ${cpuptask} ${exe_perm} ${submitscript} ${folder} ${EXE} ${name}
./prepare_kernelE_input.sh ${lx} ${ly} ${lz} ${lt} ${px} ${py} ${pz} ${pt} $folder
cd $folder
echo sbatch $submitscript
ccc_msub ./$submitscript
sleep 1
cd ..
## Scan the output and save the timing data in Part1_$mode.log
else
echo $name >> Part1_$mode.log
grep "sec" $folder/$name >> Part1_$mode.log
fi
done
#! /bin/bash
#MSUB -r Test1 # Request name
#MSUB -n #NTASK# # Number of tasks
#MSUB -c #OMPTHREADS# # Number of threads per task
#MSUB -N #NODES# # Number of nodes
#MSUB -T 1800 # Elapsed time limit in seconds
#MSUB -o bench_out_%I.o # Standard output; %I is the job ID
#MSUB -e bench_out_%I.e # Error output; %I is the job ID
#MSUB -q skylake # Partition; see ccc_mpinfo (e.g. skylake or knl)
#MSUB -A pa4564 # Project ID
set -x
cd ${BRIDGE_MSUB_PWD}
export OMP_NUM_THREADS=#OMPTHREADS#
export BRIDGE_MSUB_NCORE=#CPUPTASK# # number of requested cores per process
#module unload feature/openmpi/net/auto feature/openmpi/mpi_compiler/intel mpi/openmpi/2.0.4
#module unload mpi/openmpi/2.0.4
#module unload .tuning/openmpi/2.0.4
#module unload feature/openmpi/net/auto feature/openmpi/mpi_compiler/intel mpi/openmpi/2.0.4
#module load mpi/intelmpi/2018.0.3.222
#module load python3
echo "mpirun -n #NTASK# #EXE#"
ccc_mprun -n #NTASK# -N #NODES# #EXE# > #NAME#
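
After substitution (again for 8 tasks on one node), the launch line becomes, for example:

ccc_mprun -n 8 -N 1 /ccc/cont005/home/unicy/finkenrj/run/part1/bench > Strong_part1_8x1x1x1_64x64x64x8_8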
and for QPhix:

``` shell
git clone https://github.com/JeffersonLab/qphix
```

Note that for running on Skylake chips it is recommended to use the
develop branch of QPhix, which needs additional packages
like qdp++ (Status 04/2019).
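
For example, assuming the clone above, the develop branch is selected with:

``` shell
cd qphix
git checkout develop
```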
#### 2.1 Compile
or for KNL's,
by using the previously set variable `QMP_INSTALL_DIR`, which links to the install folder
of QMP. The executable `time_clov_noqdp` can now be found in the subfolder `./qphix/test`.
Note that for the develop branch the package qdp++ has to be compiled.
QDP++ can be configured using (here for a Skylake chip):
``` shell
./configure --with-qmp=$QMP_INSTALL_DIR --enable-parallel-arch=parscalar CC=mpiicc CFLAGS="-xCORE-AVX512 -mtune=skylake-avx512 -std=c99" CXX=mpiicpc CXXFLAGS="-axCORE-AVX512 -mtune=skylake-avx512 -std=c++14 -qopenmp" --enable-openmp --host=x86_64-linux-gnu --build=none-none-none --prefix=$QDPXX_INSTALL_DIR
```
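
The build and installation then typically follow the usual autotools sequence (assumed here; not spelled out in the original instructions):

``` shell
make -j 8      # build QDP++
make install   # install into the --prefix path ($QDPXX_INSTALL_DIR)
```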
Now the QPhix executable can be compiled using:
``` shell
cmake -DQDPXX_DIR=$QDPXX_INSTALL_DIR -DQMP_DIR=$QMP_INSTALL_DIR -Disa=avx512 -Dparallel_arch=parscalar -Dhost_cxx=mpiicpc -Dhost_cxxflags="-std=c++17 -O3 -axCORE-AVX512 -mtune=skylake-avx512" -Dtm_clover=ON -Dtwisted_mass=ON -Dtesting=ON -DCMAKE_CXX_COMPILER=mpiicpc -DCMAKE_CXX_FLAGS="-std=c++17 -O3 -axCORE-AVX512 -mtune=skylake-avx512" -DCMAKE_C_COMPILER=mpiicc -DCMAKE_C_FLAGS="-std=c99 -O3 -axCORE-AVX512 -mtune=skylake-avx512" ..
```
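
The trailing `..` indicates that the command is issued from a separate build directory inside the qphix tree; the build itself is then started with, e.g.:

``` shell
make -j 8
```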
The executable `time_clov_noqdp` can then be found in the subfolder `./qphix/test`.
##### 2.1.1 Example compilation on PRACE machines
......