The lattice quantum chromodynamics (LQCD) benchmark is a compilation
of up to five LQCD kernels (three at the moment). The kernels are:

label                      : kernel_A
short label                : KA
kernel origin              : Berlin Quantum ChromoDynamics program (BQCD), 
                             DEISA benchmark suite
kernel contact person      : Hinnerk Stueben
kernel code status         : 2008/08/25
problem size parameter     : KA_N{X,Y,Z,T}, 4D lattice
problem run time parameter : KA_MAXITER, iteration steps
other needed parameter     : KA_P{X,Y,Z,T}, distribution of processes in 4D
                             KA_LIBCOMM, see readme section
                             KA_LIBCLOVER, see readme section 
                             KA_LIBD, see readme section
notes                      : 


label                      : kernel_B
short label                : KB
kernel origin              : University of Oulu, Finland
                             DEISA benchmark suite
kernel contact person      : Kari Rummukainen
kernel code status         : 2008/08/22
problem size parameter     : KB_NX, x component of the 3D grid
                             KB_NY, y component of the 3D grid
                             KB_NZ, z component of the 3D grid
problem run time parameter : KB_MAXITER, iteration steps
other needed parameter     : 
notes                      : number of processes needs to be a power of 2


label                      : kernel_C
short label                : KC
kernel origin              : private communication
                             
kernel contact person      : Jacob Finkenrath
kernel code status         : 2016/08/24
problem size parameter     : KC_L{X,Y,Z,T}, local size of the 4D grid in {x,y,z,t}-direction
problem run time parameter : 
other needed parameter     : KC_P{X,Y,Z,T}, number of processes in {x,y,z,t}-direction
notes                      : 


label                      : kernel_D
short label                : KD
kernel origin              : private communication
                             
kernel contact person      : Jacob Finkenrath
kernel code status         : 2016/08/24
problem size parameter     : K_L{X,Y,Z,T}, size of the 4D grid in {x,y,z,t}-direction
problem run time parameter : 
other needed parameter     : K_N{X,Y,Z,T}, number of processes in {x,y,z,t}-direction
notes                      : 


label                      : kernel_E
short label                : KE
kernel origin              : private communication
                             
kernel contact person      : Stefan Krieg
kernel code status         : 2008/11/10
problem size parameter     : K_L{X,Y,Z,T}, size of the 4D grid in {x,y,z,t}-direction
problem run time parameter : 
other needed parameter     : K_N{X,Y,Z,T}, number of processes in {x,y,z,t}-direction
notes                      : 


######################################################################
kernel_A README

-----------
BQCD readme
-----------



Note: all base information taken from the
BQCD document; updated with JuBE and new ported platforms

Subdirectories in src:
~~~~~~~~~~~~~~~~~~~~

clover	  routines for the clover improvement

comm	  communication routines

d	  multiplication of a vector with "D-slash"

modules   (some) Fortran90 modules

platform  Makefiles and service routines for various platforms


General remarks
~~~~~~~~~~~~~~

BQCD has been ported to various platforms (see platform/Makefile-*.var): 

# Makefile-altix.var - settings on SGI-Altix 3700 and SGI-Altix 4700
# Makefile-bgl.var - settings on BlueGene/L
# Makefile-cray.var - settings on Cray T3E and Cray XT4
# Makefile-hitachi-omp.var - settings on Hitachi SR8000 
# Makefile-hitachi.var - settings on Hitachi SR8000 (pure MPI version)
# Makefile-hp.var - settings for HP-UX Fortran Compiler
# Makefile-ibm.var - settings on IBM
# Makefile-intel.var - settings for Intel Fortran Compiler
# Makefile-nec.var - settings on NEC SX-8
# Makefile-sun.var - settings on Sun

The corresponding files 

    platform/service-*.F90

contain interfaces to service routines / system calls.

Not all of these files have been used recently.  They are kept as a
starting point.

A "Makefile.var" and a "service.F90" that work correctly with your
system have to be provided in the source directory.
The contents of these files are explained in:

   platform/Makefile-EXPLAINED.var
   platform/service-EXPLAINED.F90

"gmake prep-<platform>" will create symbolic links accordingly:  

berni1> gmake prep-ibm
gmake prep PLATFORM=ibm
rm -f Makefile.var service.F90
ln -s platform/Makefile-ibm.var Makefile.var
ln -s platform/service-ibm.F90 service.F90



Resource requirements
~~~~~~~~~~~~~~~~~~~~

The resource requirements are approximately:

benchmark  lattice      total memory  size of output  execution time
---------------------------------------------------------------------------
MPP        48*48*48*96  497 GByte     4    GByte      268.2 s at 758.52 GFlop/s
SMP        24*24*24*48   37 GByte     0.25 GByte       44.4 s at 608.96 GFlop/s


Standard porting
~~~~~~~~~~~~~~~

*** make

The Makefiles use the macro $(MAKE) and the "include" statement.  Some
of the Makefile-*.var files are quite standard, some require GNU make.

"make fast" can be used for a parallel "make".

"make fast" builds the binary "bqcd."

Without "make fast" one has to enter:

   make Modules
   make libs
   make bqcd

JuBE porting
~~~~~~~~~~~
For the Altix, change the following lines in the execution file "bensh":

   the first line:
      #!/usr/local/bin/perl -w
   to:
      #!/usr/bin/perl -w

   line 1235:
      $cmd="cp -rp $srcdir/$file $dir/src/";
   to:
      $cmd="cp -rp $srcdir/* $dir/src/";


*** ANSI C preprocessor

The C preprocessor is needed for building the source.  The C
preprocessor must be able to handle the string concatenation macro "##".

Recent versions of the GNU C preprocessor do not work because they
refuse to process the Fortran90 separator "%".


*** Service routines and "stderr"

Service routines are needed for aborting, for measuring CPU time, for
getting arguments from the command line, etc.  The corresponding
routines have to be inserted in the file service.F90.

It is assumed that Fortran unit 0 is pre-connected to stderr. If this
is not the case on your machine you should re-#define STDERR in "defs.h".

For the time measurements it is important to use a time function with
high resolution in the function "sekunden".



*** Message passing / Communication library

Originally the communication was programmed with the shmem library on
a Cray T3E.

Now MPI is mainly used.  There is also a single-CPU version (which
needs no communication library) and a combination of shmem for the
most time-consuming part and MPI for the rest.

See $(LIBCOMM) in platform/Makefile-EXPLAINED.var and "Hints on
optimisation" below.


*** OpenMP

In addition to setting your compiler's OpenMP option you have to add 
"-D_OPENMP" in "Makefile.var":

   MYFLAGS = ... -D_OPENMP



Verification
~~~~~~~~~~~

*** Random numbers

Correctness of random numbers can be checked by:

   make the_ranf_test

The test is done by comparison with reference output.  On most
platforms there is no difference.  However, on Intel "diff"
usually reports differences in the last digit of the floating point
representation of the random numbers; the integer representations
match exactly, eg:

<      1                    4711      0.5499394951912783
---
>      1                    4711      0.5499394951912784


*** Arguments from the command line

Try option -V:

berni1> ./bqcd -V
 This is bqcd benchmark2
    input format:     4
    conf info format: 3
    MAX_TEMPER:       50
    real kind:        8
    version of D:     2
    D3: buffer vol:   0
    communication:   single_pe + OpenMP



*** BQCD

To check that BQCD works correctly, execute the following sequence
of commands:

berni1> cd work
berni1> ../bqcd input.TEST > out.TEST
berni1> grep ' %[fim][atc]' out.TEST > out.tmp
berni1> grep ' %[fim][atc]' out.TEST.reference | diff - out.tmp
18c18
<   %fa   -1    1   0.4319366404   1.0173348431      43     407      38
---
>   %fa   -1    1   0.4319366404   1.0173348433      43     407      38

The test can be run for any domain decomposition and any number of
threads.  In all cases the results should agree.  Floating point numbers
might differ in the last digit as shown above.
(In total 20 lines containing floating point numbers are compared.)
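This comparison can be automated with a small helper (a sketch, not part of
BQCD) that requires labels and integers to match exactly while allowing
floating point fields to differ within a relative tolerance:

```shell
# compare_bqcd_output REF NEW
# REF, NEW: files holding the grep'ed "%" lines of the reference run and
# the new run; returns 0 if all fields agree (floating point fields within
# a relative tolerance of 1e-9), non-zero otherwise
compare_bqcd_output() {
    paste "$1" "$2" | awk '
    {
        n = NF / 2
        for (i = 1; i <= n; i++) {
            a = $i; b = $(i + n)
            if (a == b) continue                # exact match
            if (a + 0 != a || b + 0 != b) {     # non-numeric fields must match exactly
                print "mismatch in line " NR ": " a " vs " b; bad = 1; continue
            }
            d = a - b; if (d < 0) d = -d        # absolute difference
            m = a;     if (m < 0) m = -m
            if (d > 1e-9 * (m + 1e-30)) {       # relative tolerance exceeded
                print "mismatch in line " NR ": " a " vs " b; bad = 1
            }
        }
    }
    END { exit bad }'
}
```

For example: grep ' %[fim][atc]' out.TEST.reference > ref.tmp;
grep ' %[fim][atc]' out.TEST > out.tmp; compare_bqcd_output ref.tmp out.tmp.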


*** Check sums

BQCD writes restart files in the working directory.  The extension of
the file containing information on the run is ".info".  It contains
check sums of the binary data files (the example was run after the
test run):

berni1> tail -6 bqcd.000.1.info
 >BeginCheckSum
 bqcd.000.1.00.u 286125633 24576
 bqcd.000.1.01.u 804770858 24576
 bqcd.000.1.02.u 657813015 24576
 bqcd.000.1.03.u 3802083338 24576
 >EndCheckSum

These check sums should be identical to check sums calculated by the
"cksum" command:

berni1> cksum bqcd.000.1.*.u | awk '{print $3, $1, $2}'
bqcd.000.1.00.u 286125633 24576
bqcd.000.1.01.u 804770858 24576
bqcd.000.1.02.u 657813015 24576
bqcd.000.1.03.u 3802083338 24576
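The comparison can be scripted with a small helper like the following
(an assumed convenience wrapper, not part of BQCD), run in the working
directory where the restart files were written:

```shell
# verify_checksums INFO_FILE
# compares the check sums recorded between >BeginCheckSum and
# >EndCheckSum in the .info file with those computed by "cksum";
# the binary data files must be in the current directory
verify_checksums() {
    # extract "file sum size" triples from the .info file
    awk '/>BeginCheckSum/ {f = 1; next}
         />EndCheckSum/   {f = 0}
         f                {print $1, $2, $3}' "$1" | sort > recorded.tmp
    # recompute the check sums with cksum, reordered to "file sum size"
    awk '{print $1}' recorded.tmp | xargs cksum \
        | awk '{print $3, $1, $2}' | sort > computed.tmp
    if diff recorded.tmp computed.tmp; then
        echo "check sums OK"
    else
        echo "check sums DIFFER"
    fi
    rm -f recorded.tmp computed.tmp
}
```

Usage: verify_checksums bqcd.000.1.info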



Structure of the input
~~~~~~~~~~~~~~~~~~~~~

run 204                               names of restart files will contain "run"
                                      can be set to 0

lattice  24 24 24 48                  lattice size, can e.g. be modified for
                                      weak scaling analysis

processes 1  2  2  4                  number of MPI processes per direction
                                      (1 1 1 1 in the pure OpenMP case)

boundary_conditions_fermions 1 1 1 -1 do not change

beta  5                               do not change
kappa 0.13                            do not change
csw   2.3327                          do not change

hmc_test            0                 do not change
hmc_model           C                 do not change
hmc_rho             0.1               do not change
hmc_trajectory_length 0.2             do not change
hmc_steps           10                can be lowered -> shorter execution time
hmc_accept_first    1                 do not change
hmc_m_scale         3                 do not change

start_configuration cold              do not change
start_random        default           do not change

mc_steps            1                 do not change
mc_total_steps      100               do not change

solver_rest                  1e-99    do not change
solver_maxiter               100      can be lowered -> shorter execution time
solver_ignore_no_convergence 2        do not change (CG will not converge,
                                      the numbers of iterations per call
                                      will be exactly "solver_maxiter")
solver_mre_vectors           7        do not change





Hints on optimisation
~~~~~~~~~~~~~~~~~~~~

Before starting any optimisation one should find the fastest variant
in the existing code.  There are two libraries to look at: $(LIBD) and
$(LIBCOMM).



*** LIBCOMM ("communication", directory: comm)

There are the following variants:

lib_single_pe.a:  Single CPU version (PE: "processing element").

lib_mpi.a:        MPI version.

lib_shmempi.a:    shmem for nearest neighbour communication, MPI for the rest.


*** Caveat

Not all combinations of LIBD and LIBCOMM have been implemented.

The following combinations should work (lib_mpi.a always works):

LIBD	  LIBCOMM
--------------------------------------------------
libd.a    lib_single_pe.a lib_mpi.a
libd2.a   lib_single_pe.a lib_mpi.a lib_shmempi.a
libd3.a   lib_mpi.a
libd21.a  lib_single_pe.a lib_mpi.a lib_shmempi.a



Rules for time measurements
~~~~~~~~~~~~~~~~~~~~~~~~~~

In "Makefile.var" "-DTIMING" must always be set:

   MYFLAGS = -DTIMING ...

All time measurements (TIMING_START() ... TIMING_STOP()) must be kept.
There is one exception: If you restructure routines d() and d_dag()
it might occur that the current regions of time measurements (which
are per direction) do not make sense.  (For example, this would occur
when combining loops from more than one direction.)

In that case, please report in addition the best measurement obtained
with the existing code.



######################################################################
kernel_B README

This is the README file for the SU3_AHiggs application benchmark,
distributed with the DEISA Benchmark Suite:
http://www.deisa.eu/science/benchmarking/

Last modified by the DEISA Benchmark Team on 2008-08-22.



-----------------
SU3_AHiggs readme
-----------------



Contents
--------

1 General description
2 Code structure
3 Parallelisation
4 Building
5 Execution
6 Verification
7 Input data
8 Output data


1 General description
=====================

SU3_AHiggs is a lattice quantum chromodynamics (QCD) code intended for
computing the conditions of the Early Universe. Instead of the "full QCD", the
code applies an effective field theory, which is valid at high
temperatures. In the effective theory, the lattice is 3D. For this reason,
SU3_AHiggs stresses different parts of the architecture than the conventional
QCD applications using 4D lattices.

SU3_AHiggs has roots in the MILC code, but it is heavily rewritten by
Prof. Kari Rummukainen (University of Oulu, Finland). The code is written
solely in C and it uses MPI communications. No external libraries are needed
to run the program.

The directory SU3/src contains several closely related QCD programs: 

   * SU3_4D
   * SU3_AHiggs
   * SU3_Gauge

In the DEISA benchmarks, only the code SU3_AHiggs is used. If you find errors
in any of the files in the SU3 package, please contact benchmarking@deisa.eu.


2 Code structure
================

In SU3_AHiggs, the spacetime is discretised and replaced with a 3D cubic
lattice. Every lattice vertex contains a 3 x 3 traceless Hermitian
matrix. From each vertex, in turn, there are six edges to nearest-neighbour
vertices. Edges are 3 x 3 unitary matrices.

The aim of the SU3_AHiggs computation is to generate lattice configurations
from the microcanonical distribution, which is the statistical equilibrium
state of the system. The program uses heat-bath and over-relaxation algorithms
to update lattice vertices and links. The computation starts from a random
initial configuration.

The main function of SU3_AHiggs is in the file su3h_n/control.c. After the
initial setup, main calls the function runthis, which in turn calls other
functions in the SU3 package. If the dataset is sufficiently large, most of
the computing time is spent on lattice updates (functions updategauge and
updatehiggs in files su3h_n/updategauge.c and su3h_n/updatehiggs.c). If the
dataset is too small, in turn, the computation becomes communication
bound. MPI routines are not called directly but with customised communication
functions defined in generic/comdefs.h and generic/com_mpi.c.


3 Parallelisation
=================

SU3_AHiggs uses a 3D domain decomposition method for parallelisation. Each MPI
task communicates with six neighbouring tasks only. The communication routines
are defined in the files generic/comdefs.h and generic/com_mpi.c. The most
important routines are:

   * start_get()
     
     This function starts asynchronous sends and receives required to gather 
     neighbouring lattice vertices and links. The call graph looks like this: 

         start_get()  -->  start_gather()  -->  MPI_Irecv(), MPI_Isend()
     
   * wait_get()

     This function waits for receives to finish, ensuring that the data has
     actually arrived. The call graph looks like this:

         wait_get()  -->  wait_gather()   -->  MPI_Wait()

With a 32^3 lattice, the program performs well up to 256 processes. With a
256^3 lattice, the speedup is almost linear with the number of processes. The
highest processor number tested so far is 2048. The lattice size and the
number of iterations are controlled by four user-adjustable parameters.


4 Building
==========

To build SU3_AHiggs with the JUBE tool on a new architecture (NEWARCH), do the
following steps:

   1) Create a new top-level XML file for the new architecture
      (bench-NEWARCH.xml). In this task, you can use the already existing
      files as a starting point: bench-Cray-XT4-Louhi.xml,
      bench-IBM-SP4-Jump.xml, and bench-SGI-Altix-HLRB2.xml. Normally you have
      to change the values of $nodes and $taskspernode only.

   2) Edit compile.xml: Create a new section <compile cname="NEWARCH">, where
      NEWARCH is the same as in the file
      DEISA_BENCH/platform/platform.xml. Substitute values in the new compile
      section with those proper for the new architecture. Normally you need to
      change #CFLAGS# and #LDFLAGS#. Possibly you want to change #CC# and
      #MPI_CC# also.

   3) Run the compile step within the benchmark "test": Edit bench-NEWARCH.xml
      and make sure that you have:

         <benchmark name="test" active="1">
             <compile     cname="$platform" version="new" />
             ...
         </benchmark>

      Then run: perl ../../bench/jube -debug bench-NEWARCH.xml

If the compile step fails, go to the directory where JUBE has run the compile
step:

    tmp/SU3_NEWARCH_test_i000001/.../src

Then try to run the command make manually. Analyze the error and try to fix it
by modifying the file Makefile.defs. After the problem is solved, edit the file
compile.xml accordingly. If you cannot solve the problem just by editing
compile.xml, please contact benchmarking@deisa.eu.


5 Execution
===========

To run SU3_AHiggs with the JUBE tool, do the following steps: 

   1) Before running the benchmarks you need an execute script template, such
      as:

         DEISA_BENCH/platform/Cray-XT4-Louhi/cray_qsub.job.in

   2) Edit execute.xml: Create a new section <execute cname="NEWARCH">, and
      match the values in the new section with the execute script template.

   3) Run a benchmark: Select a benchmark by setting active="1" in the file 
      bench-NEWARCH.xml. Then run: perl ../../bench/jube -submit
      bench-NEWARCH.xml

To run SU3_AHiggs manually (without JUBE), do the following steps: 

   1) Copy the SU3_AHiggs executable to a directory that is accessible from
      compute nodes. The name of the SU3_AHiggs executable is:

         src/su3h_n/su3_ahiggs

   2) Copy the input files beta, parameters, and status to the same
      directory. In the directory input, there are several sets of input files
      available:

         input/lat_256/*   (256^3 lattice, 100 iterations)
         input/lat_32/*    (32^3 lattice, 10000 iterations)
         input/test/*      (32^3 lattice, 100 iterations)

   3) Start the program with a MPI launcher available in your system, for
      example:

         aprun -n 8 ./su3_ahiggs

      The test benchmark takes approximately 10 seconds with 8 processor
      cores. Other benchmarks run longer: approximately 1 minute with 1024
      cores.

Important: The number of tasks in su3_ahiggs must be a power of 2. Otherwise
the program cannot lay out the lattice, and the execution stops.
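A quick sanity check before submitting a job (a minimal sketch using POSIX
shell arithmetic):

```shell
# is_power_of_two N
# exit status 0 if N is a positive power of two, non-zero otherwise;
# a power of two has exactly one bit set, so N & (N - 1) is zero
is_power_of_two() {
    [ "$1" -gt 0 ] && [ $(( $1 & ($1 - 1) )) -eq 0 ]
}
```

For example: is_power_of_two 8 && aprun -n 8 ./su3_ahiggs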


6 Verification
==============

JUBE verifies benchmark results automatically as part of the result analysis
step. In SU3_AHiggs, the verification cannot be done directly by comparing
benchmark results with reference results. The reason for this is that the
results are very sensitive to compiler optimizations and to the number of MPI
tasks. This can make the results appear very different from the reference
results. Everything can still be all right, as long as the results are
statistically the same.

Therefore SU3_AHiggs uses a statistical comparison test (Student's t-test) to
verify benchmark results. The significance level is chosen to be 1e-4 (correct
results are rejected once in every 10000 runs, on average). The first 50
iterations are not included in the comparison. The reference results are found
at:

    reference/lat_256/higgs.out  (256^3 lattice, 100 iterations)
    reference/lat_32/higgs.out   (32^3 lattice, 10000 iterations)
    reference/test/higgs.out     (32^3 lattice, 100 iterations)

These files contain the Higgs field at each iteration for a given lattice
size.

To verify benchmark results manually (without JUBE), do the following steps:

   1) Copy the executable src/aa/aa to the directory SU3/run. 

   2) Run the following command in the directory SU3:

         perl run/check_results_su3.pl output.xml stdout.log stderr.log \ 
         $RUNDIR reference/lat_256

      The environment variable $RUNDIR should point to the directory where 
      SU3_AHiggs has been executed. 

   3) If the benchmark results are correct, the file output.xml includes the
      following lines: 

         <parm name="vcheck" value="1" type="bool" unit="" />
         <parm name="vcomment" value="ok" type="string" unit=""/>

      If not, the same lines look like this: 

         <parm name="vcheck" value="0" type="bool" unit="" />
         <parm name="vcomment" value="not ok" type="string" unit=""/>


7 Input data
============

Input data for SU3_AHiggs consist of three short ASCII files containing
simulation parameters related to temperature, lattice size, iterations, etc.

For example, the files related to the test benchmark look like this: 

input/test/beta:

    betag   12
    x       0.06
    y       0.69025056

input/test/parameters: 

    nx      32
    ny      32
    nz      32
    micro steps     4
    n_measurement   1
    n_correlation   10000
    w_correlation   100000
    n_save          -1000
    blocking levels 1
    level 0         1
    level 1         1

input/test/status:

    restart                          0
    n_iteration                      100
    n_thermal                        0
    seed                             479817384
    run status                       
    iteration                        
    time: gauge                      
    time: higgs                      
    time: rest                       

It is easy to create new datasets by changing the lattice size (variables nx,
ny, and nz), number of iterations (n_iteration), and seed number for the
random number generator (seed). The duration of a simulation is roughly
proportional to:

    nx * ny * nz * n_iteration  
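As an illustration of this rule of thumb, the cost of a dataset relative to
the test dataset (32^3 lattice, 100 iterations) can be estimated as follows
(a sketch; the constant factors cancel):

```shell
# relative_cost NX NY NZ N_ITERATION
# cost of a dataset relative to the test dataset (32^3, 100 iterations)
relative_cost() {
    awk -v nx="$1" -v ny="$2" -v nz="$3" -v it="$4" \
        'BEGIN { print (nx * ny * nz * it) / (32 * 32 * 32 * 100) }'
}
```

For example, relative_cost 256 256 256 100 prints 512, i.e. the large
dataset is roughly 512 times more expensive than the test dataset.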

SU3_AHiggs has currently three datasets: 

    test    32^3 lattice, 100 iterations
    small   32^3 lattice, 10000 iterations  (artificial dataset) 
    large   256^3 lattice, 100 iterations   (real research dataset)

The test dataset is designed to help porting to new architectures. The small
dataset, in turn, is designed for benchmarking purposes. With it, benchmark
timings depend strongly on the interconnect speed.


8 Output data
=============

During the benchmarks, SU3_AHiggs writes its results to the following files: 

    correl
    measure
    status

Note that the file named status is both an input and an output file; SU3_AHiggs
modifies it during the computation. The file measure is a binary file that
contains simulation results at each iteration. Its contents can be read with
the tool named aa available in the directory src/aa.

The benchmark timings are written to the standard output. JUBE reads them
automatically as part of the analysis step. To get benchmark timings manually,
grep for "total time in seconds" in the standard output.
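A hypothetical helper for collecting the timing values from a saved standard
output (assuming the timing value is the last field on those lines):

```shell
# get_timings FILE
# print the timing values found on the "total time in seconds" lines
get_timings() {
    grep "total time in seconds" "$1" | awk '{print $NF}'
}
```

Usage: get_timings stdout.log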



######################################################################
kernel_C README


This document is a short guide to getting started and running the speed
tests. For more detailed information see README.extended.


PROGRAMS

The benchmark programs are provided in source form and must be
compiled by the user on the machine that is to be tested.

In addition, the openQCD-1.4 package is needed. A tar file of the
source code can be obtained from

http://luscher.web.cern.ch/luscher/openQCD/

and should be extracted at the same directory level as this package.

PROGRAM FEATURES

All programs parallelize in 0,1,2,3 or 4 dimensions, depending on what is
specified at compilation time. They are highly optimized for machines with
current Intel or AMD processors, but will run correctly on any system that
complies with the ISO C89 (formerly ANSI C) and the MPI 1.2 standards.

For the purpose of testing and code development, the programs can also
be run on a desktop or laptop computer. All that is needed is a
compliant C compiler and a local MPI installation such as Open MPI.


DOCUMENTATION

The simulation program has a modular form, with strict prototyping and a
minimal use of external variables. Each program file contains a small number
of externally accessible functions whose functionality is described at the top
of the file.

The data layout is explained in various README files and detailed instructions
are given on how to run the main programs. A set of further documentation
files are included in the doc directory, where the normalization conventions,
the chosen algorithms and other important program elements are described.


COMPILATION

The compilation of the programs requires an ISO C89 compliant compiler and a
compatible MPI installation that complies with the MPI standard 1.2 (or later).

In the main and devel directories, a GNU-style Makefile is included which
compiles and links the programs (just type "make" to compile everything; "make
clean" removes the files generated by "make"). The compiler options can be set
by editing the CFLAGS line in the Makefiles.

The Makefiles assume that the following environment variables are set:

  GCC             GNU C compiler command [Example: /usr/bin/gcc].

  MPI_HOME        MPI home directory [Example: /usr/lib64/mpi/gcc/openmpi].
                  The mpicc command used is the one in $MPI_HOME/bin and
                  the MPI libraries are expected in $MPI_HOME/lib.

  MPI_INCLUDE     Directory where the mpi.h file is to be found.

All programs are then compiled using the $MPI_HOME/bin/mpicc command. The
compiler options that can be set in the CFLAGS line depend on which C compiler
the mpicc command invokes (the GCC compiler command is only used to resolve
the dependencies on the include files).
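As a sketch, a minimal set-up on a system with GCC and Open MPI might look
like this (all paths are hypothetical placeholders and must be adapted to
your installation):

```shell
# example environment for building; adapt the paths to your system
export GCC=/usr/bin/gcc
export MPI_HOME=/usr/lib64/mpi/gcc/openmpi
export MPI_INCLUDE=$MPI_HOME/include

make          # compile and link everything
make clean    # remove the files generated by "make"
```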


SSE/AVX ACCELERATION

Current Intel and AMD processors are able to perform arithmetic operations on
short vectors of floating-point numbers in just one or two machine cycles,
using SSE and/or AVX instructions. The arithmetic performed by these
instructions fully complies with the IEEE 754 standard.

Many programs in the module directories include SSE and AVX inline-assembly
code. On 64bit systems, and if the GNU or Intel C compiler is used, the code
can be activated by setting the compiler flags -Dx64 and -DAVX, respectively.
In addition, SSE prefetch instructions will be used if one of the following
options is specified:

  -DP4     Assume that prefetch instructions fetch 128 bytes at a time
           (Pentium 4 and related Xeons).

  -DPM     Assume that prefetch instructions fetch 64 bytes at a time
           (Athlon, Opteron, Pentium M, Core, Core 2 and related Xeons).

  -DP3     Assume that prefetch instructions fetch 32 bytes at a time
           (Pentium III).

These options have an effect only if -Dx64 or -DAVX is set. The option
-DAVX implies -Dx64.

On recent x86-64 machines with AMD Opteron or Intel Xeon processors, for
example, the recommended compiler flags are

    -std=c89 -O -mno-avx -DAVX -DPM

For older machines that do not support the AVX instruction set, the
recommended flags are

    -std=c89 -O -mno-avx -Dx64 -DPM

More aggressive optimization levels such as -O2 and -O3 tend to have little
effect on the execution speed of the programs, but the risk of generating
wrong code is higher.

AVX instructions and the option -mno-avx may not be known to old versions
of the compilers, in which case one is limited to SSE accelerations with
option string -std=c89 -O -Dx64 -DPM.


DEBUGGING FLAGS

For troubleshooting and parameter tuning, it may be helpful to switch on some
debugging flags at compilation time. The simulation program then prints a
detailed report to the log file on the progress made in the specified
subprograms.

The available flags are:

-DCGNE_DBG         CGNE solver.

-DFGCR_DBG         GCR solver.

-DFGCR4VD_DBG      GCR solver for the little Dirac equation.

-DMSCG_DBG         MSCG solver.

-DDFL_MODES_DBG    Deflation subspace generation.

-DMDINT_DBG        Integration of the molecular-dynamics equations.

-DRWRAT_DBG        Computation of the rational function reweighting
                   factor.


RUNNING A SIMULATION

The simulation programs reside in the directory "main". For each program,
there is a README file in this directory which describes the program
functionality and its parameters.

Running a simulation for the first time requires its parameters to be chosen,
which tends to be a non-trivial task. The syntax of the input parameter files
and the meaning of the various parameters is described in some detail in
main/README.infiles and doc/parms.pdf. Examples of valid parameter files are
contained in the directory main/examples.


EXPORTED FIELD FORMAT

The field configurations generated in the course of a simulation are written
to disk in a machine-independent format (see modules/misc/archive.c).
Independently of the machine endianness, the fields are written in little
endian format. A byte-reordering is therefore not required when machines with
different endianness are used for the simulation and the physics analysis.


AUTHORS

The initial release of the openQCD package was written by Martin Lüscher and
Stefan Schaefer. Support for Schrödinger functional boundary conditions was
added by John Bulava. Several modules were taken over from the DD-HMC program
tree, which includes contributions from Luigi Del Debbio, Leonardo Giusti,
Björn Leder and Filippo Palombi.


ACKNOWLEDGEMENTS

In the course of the development of the openQCD code, many people suggested
corrections and improvements or tested preliminary versions of the programs.
The authors are particularly grateful to Isabel Campos, Dalibor Djukanovic,
Georg Engel, Leonardo Giusti, Björn Leder, Carlos Pena and Hubert Simma for
their communications and help.


LICENSE

The software may be used under the terms of the GNU General Public Licence
(GPL).


BUG REPORTS

If a bug is discovered, please send a report to <j.finkenrath@cyi.ac.cy>.


ALTERNATIVE PACKAGES AND COMPLEMENTARY PROGRAMS

There is a publicly available BG/Q version of openQCD that takes advantage of
the machine-specific features of IBM BlueGene/Q computers. The version is
available at <http://hpc.desy.de/simlab/codes/>.

The openQCD programs currently do not support reweighting in the quark
masses, but a module providing this functionality can be downloaded from
<http://www-ai.math.uni-wuppertal.de/~leder/mrw/>.

Previously generated gauge-field configurations are often used as initial
configuration for a new run. If the old and new lattices or boundary
conditions are not the same, the old configuration may however need to be
adapted, using a field conversion tool such as the one available at
<http://hpc.desy.de/simlab/codes/>, before the new run is started.

######################################################################
kernel_D README

Important compiler defines XXX are (-DXXX)
MPI -> switch on parallelisation
PARALLELXYZT -> 4-dimensional parallelisation
PARALLELXYT  -> 3-dim
PARALLELXT   -> 2-dim
PARALLELT    -> 1-dim
SSE2         -> SSE2 inline assembly (to be used with one of the two following)
P4           -> pentium 4
OPTERON      -> opteron
_GAUGE_COPY  -> non-strided memory access for gauge fields, but more memory required
BGL          -> Blue Gene /L
BGP          -> Blue Gene /P, to be used on top of BGL

If none of them are used, you will get a serial version of the program.

The local lattice size in the case of the one-dimensional
parallelisation is controlled by the parameters in the file
benchmark.input:

T = 32
L = 16

which will give a 32 x 16^3 global lattice.

NrXProcs = 2

needs to be set only in the case of a parallel compilation; it sets
the number of processes in x-direction. The same holds for NrYProcs
and NrZProcs. The number of processes in t-direction is computed from
NrX|Y|ZProcs and the total number of processes. You only need to make
sure that all of this is consistent with the lattice size.

The package size of the data that is sent and received is
192 * (1/2) * L^3 Byte in the case of the one-dimensional parallelisation.
In the case of the two-dimensional parallelisation it is
192 * (1/2) * ((L*L*L/N_PROC_X) + (T*L*L)) Byte.
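As a quick sanity check, these formulas can be evaluated with shell
arithmetic (a sketch; arguments as in benchmark.input, sizes in bytes):

```shell
# package_size_1d L
# one-dimensional parallelisation: 192 * (1/2) * L^3 bytes
package_size_1d() {
    echo $(( 192 * $1 * $1 * $1 / 2 ))
}

# package_size_2d L N_PROC_X T
# two-dimensional parallelisation: 192 * (1/2) * (L^3/N_PROC_X + T*L^2) bytes
package_size_2d() {
    echo $(( 192 * ($1 * $1 * $1 / $2 + $3 * $1 * $1) / 2 ))
}
```

For example, package_size_1d 16 gives 393216 Byte, in agreement with the
example output below.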

A run of the benchmark takes about one minute.

The output of the program looks something like this (T=2, L=16):

The number of processes is 12 
The local lattice size is 2 x 16 ^3 
total time 4.681349e+00 sec, Variance of the time 6.314982e-03 sec 

 (297 Mflops [64 bit arithmetic])

communication switched off 
 (577 Mflops [64 bit arithmetic])

The size of the package is 393216 Byte 
The bandwidth is 84.49 + 84.49   MB/sec


If you use the serial version, the parts that depend on the parallel
setup will of course be missing.


Compilation examples (you need a C compiler supporting the C99 standard;
otherwise you may need to define inline, restrict, etc. to nothing):

in general (gcc)
gcc -std=c99 -I. -I./ -I.. -o benchmark -D_GAUGE_COPY -O Hopping_Matrix.c mpi_init.c geometry_eo.c test/check_xchange.c test/check_geometry.c boundary.c start.c ranlxd.c init_gauge_field.c init_geometry_indices.c init_moment_field.c init_spinor_field.c read_input.c benchmark.c update_backward_gauge.c D_psi.c ranlxs.c -lm

gcc and OPTERON (64 Bit architecture):
gcc -std=c99 -I. -I./ -I.. -o benchmark -DOPTERON -DSSE2 -mfpmath=387 -fomit-frame-pointer -ffloat-store -D_GAUGE_COPY  -O Hopping_Matrix.c mpi_init.c geometry_eo.c test/check_xchange.c test/check_geometry.c boundary.c start.c ranlxd.c init_gauge_field.c init_geometry_indices.c init_moment_field.c init_spinor_field.c read_input.c benchmark.c update_backward_gauge.c D_psi.c ranlxs.c -lm

gcc and pentium4:
gcc -std=c99 -I. -I./ -I.. -o benchmark -DSSE2 -DP4 -march=pentium4  -malign-double -fomit-frame-pointer -ffloat-store -D_GAUGE_COPY  -O Hopping_Matrix.c mpi_init.c geometry_eo.c test/check_xchange.c test/check_geometry.c boundary.c start.c ranlxd.c init_gauge_field.c init_geometry_indices.c init_moment_field.c init_spinor_field.c read_input.c benchmark.c update_backward_gauge.c D_psi.c ranlxs.c -lm

mpicc (gcc) general, four dimensional parallelisation:
mpicc -std=c99 -I. -I./ -I.. -o benchmark -O3  -DMPI -DPARALLELXYZT -D_GAUGE_COPY  -O Hopping_Matrix.c Hopping_Matrix_nocom.c xchange_deri.c xchange_field.c xchange_gauge.c xchange_halffield.c xchange_lexicfield.c  mpi_init.c geometry_eo.c test/check_xchange.c test/check_geometry.c boundary.c start.c ranlxd.c init_gauge_field.c init_geometry_indices.c init_moment_field.c init_spinor_field.c read_input.c benchmark.c update_backward_gauge.c D_psi.c ranlxs.c init_dirac_halfspinor.c -lm


######################################################################
kernel_E README

none