========================================================================
README file for PRACE Accelerator Benchmark Code PFARM (stage EXDIG)
========================================================================
Author: Andrew Sunderland (andrew.sunderland@stfc.ac.uk).

The code download should contain the following directories:

pfarm/cpu: Source files and example scripts for running on CPUs (using serial or threaded LAPACK or MKL)
pfarm/gpu: Source files and example scripts for running on CPU/GPU nodes (using serial or threaded LAPACK or MKL and MAGMA)
benchmark/lib: Library files used by the build (the static XDR library file)
benchmark/src_xdr: XDR library source files
benchmark/data: Data files for the benchmark test cases (created and downloaded separately; see below)

* Download benchmark data files
Create data directories:
$> cd pfarm
$> mkdir data
$> cd data
$> mkdir test_case_1_atom
$> mkdir test_case_2_mol


Copy files phzin.ctl, XJTARMOM and HXJ030 to the test_case_1_atom directory from:
https://repository.prace-ri.eu/ueabs/PFARM/2.2/test_case_1_atom/phzin.ctl 
https://repository.prace-ri.eu/ueabs/PFARM/2.2/test_case_1_atom/XJTARMOM 
https://repository.prace-ri.eu/ueabs/PFARM/2.2/test_case_1_atom/HXJ030

Copy files phzin.ctl and H to the test_case_2_mol directory from:
https://repository.prace-ri.eu/ueabs/PFARM/2.2/test_case_2_mol/phzin.ctl 
https://repository.prace-ri.eu/ueabs/PFARM/2.2/test_case_2_mol/H 
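The directory creation and copy steps above can be scripted. A minimal sketch, assuming the 2.2 repository URLs above are still valid and that the script is run from the pfarm directory:

```shell
#!/bin/sh
# Sketch: create the data directories and print the curl commands needed
# to fetch the five benchmark data files. Pipe the output through `sh`
# to actually perform the downloads.
BASE=https://repository.prace-ri.eu/ueabs/PFARM/2.2
mkdir -p data/test_case_1_atom data/test_case_2_mol
for f in phzin.ctl XJTARMOM HXJ030; do
  echo "curl -fLo data/test_case_1_atom/$f $BASE/test_case_1_atom/$f"
done
for f in phzin.ctl H; do
  echo "curl -fLo data/test_case_2_mol/$f $BASE/test_case_2_mol/$f"
done
```

Saved as e.g. fetch_data.sh, `./fetch_data.sh | sh` downloads all five files into the directories created above.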

The code uses the eXternal Data Representation (XDR) library for cross-platform
compatibility of unformatted data files. The XDR source files are provided with this code bundle
and can also be obtained from various sources, including:
http://meteora.ucsd.edu/~pierce/fxdr_home_page.html
http://people.redhat.com/rjones/portablexdr/

* Install XDR
Build XDR library: 
update DEFS file for your compiler and environment
$> cd src_xdr
$> make
(ignore warnings about float/double type mismatches in xdr_rmat64.c - these are not relevant for this benchmark)
The validity of the XDR library can be tested by running test_xdr:
$> ./test_xdr
The rpc headers may not be available for XDR on the target platform, leading to compilation errors of the form:
cannot open source file "rpc/rpc.h"
  #include <rpc/rpc.h>

In this case, use the make include file DEFS_Intel_rpc instead.
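Whether the classic rpc headers are present can be checked before building. A quick sketch, assuming the system C compiler is available as `cc`:

```shell
# Try to preprocess an include of rpc/rpc.h; success means the classic
# Sun RPC headers are available and the default DEFS file can be used.
if echo '#include <rpc/rpc.h>' | cc -E - >/dev/null 2>&1; then
  echo "rpc headers found: the default DEFS should work"
else
  echo "rpc headers missing: use DEFS_Intel_rpc (and link with -ltirpc)"
fi
```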


* Install CPU version (MPI and OpenMP)
$> cd cpu
Update the DEFS file for your setup, ensuring that you link to a LAPACK or MKL library (or equivalent).
This is usually done by compiling with e.g. -mkl=parallel (Intel compiler) or by loading the appropriate library modules.

** To install the atomic version of the code (recommended as the default benchmark)
$> cd src_mpi_omp_atom
$> make

** To install the molecular version of the code
$> cd src_mpi_omp_mol
$> make
The -ltirpc option for 'STATIC_LIBS' in 'DEFS' should only be included when the XDR library was built using 'DEFS_Intel_rpc'.

* Install GPU version (MPI / OpenMP / MAGMA / CUDA)
Set the MAGMADIR, CUDADIR environment variables to point to MAGMA and CUDA installations.
The numerical library MAGMA may be provided through the modules system of the platform.
Please check target platform user guides for linking instructions.

$> module load magma
If MAGMA is unavailable via a module, it may need to be installed manually (see below)
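A sketch of the environment setup; the paths below are examples only and must be adjusted for the target system, and the module name may also differ:

```shell
# Example paths only - point these at the actual installations.
export MAGMADIR=/opt/magma-2.2.0
export CUDADIR=/usr/local/cuda
# Load MAGMA via the module system where one is provided; ignore the
# failure on systems without environment modules.
module load magma 2>/dev/null || true
echo "MAGMADIR=$MAGMADIR CUDADIR=$CUDADIR"
```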
$> cd gpu

Update DEFS file for your setup

** To install the atomic version of the code (recommended as the default benchmark)
$> cd src_mpi_gpu_atom
$> make

** To install the molecular version of the code
$> cd src_mpi_gpu_mol
$> make
The -ltirpc option for 'STATIC_LIBS' in 'DEFS' should only be included when the XDR library was built using 'DEFS_Intel_rpc'.

----------------------------------------------------------------------------
* Installing MAGMA (GPU version only)
Download MAGMA (version magma-2.2.0 at the time of writing) from http://icl.utk.edu/magma/
Install MAGMA: modify the make.inc file to indicate your C/C++ compiler and
Fortran compiler, and to specify where CUDA, CPU BLAS and LAPACK are
installed on your system. Refer to the MAGMA documentation for further details.
----------------------------------------------------------------------------


----------------------------------------------------------------------------
* Running PFARM
=================

For the atomic dataset, the program requires the following input files,
located in data/test_case_1_atom:
phzin.ctl
XJTARMOM
HXJ030

For the molecular dataset, the program requires the following input files,
located in data/test_case_2_mol:
phzin.ctl
H

It is recommended that the settings in the input file phzin.ctl are left unchanged for the benchmark runs;
problem size, runtime etc. are better controlled via the environment variables listed below.

To set up run directories with the correct executables and datafiles, bash script files are provided:
cpu/setup_run_cpu_atom.scr
cpu/setup_run_cpu_mol.scr
gpu/setup_run_gpu_atom.scr
gpu/setup_run_gpu_mol.scr

Example job submission scripts for the CPU / GPU and atomic / molecular cases are provided in the directories
cpu/example_job_scripts
gpu/example_job_scripts

* Run-time environment variable settings

It is recommended that the RMX (PFARM) specific environment variables are set to the values used in
the example scripts, as these provide a suitably sized, physically meaningful benchmark. However, a
guide follows for users who wish to experiment with the settings.

The following environment variables, which can be set e.g. inside the job script, allow the dimensions of
the H sector matrix and the number of sectors to be changed easily when undertaking benchmarks.
They can be adapted by the user to suit benchmark load requirements, e.g. short vs long runs.
Each MPI task picks up a sector calculation, which is then distributed amongst the available threads per node (for CPU and KNL)
or offloaded (for GPU). The maximum number of MPI tasks for a region calculation should not exceed the number of sectors specified.
There is no limit on the number of threads, though for efficient performance on current hardware it is generally recommended
to use between 16 and 64 threads per MPI task. The distribution of sectors among MPI tasks is a simple round-robin.
 
RMX_NGPU : refers to the number of shared GPUs per node (only for RMX_MAGMA_GPU)
RMX_NSECT_FINE : sets the number of sectors for the Fine region (e.g. 16 for smaller runs, 256 for larger-scale runs).
The molecular case is limited to a maximum of 512 sectors for this benchmark.
RMX_NSECT_COARSE : sets the number of sectors for the Coarse region (e.g. 16 for smaller runs, 256 for larger-scale runs).
The molecular case is limited to a maximum of 512 sectors for this benchmark.
RMX_NL_FINE : sets the number of basis functions for the Fine region sector calculations
(this will determine the size of the sector Hamiltonian matrix for the Fine region calculations). 
RMX_NL_COARSE : sets the number of basis functions for the Coarse region sector calculations
(this will determine the size of the sector Hamiltonian matrix for the Coarse region calculations). 
Hint: To aid ideal scaling across nodes, the number of MPI tasks in the job script should ideally be a factor of RMX_NSECT_FINE.

For representative test cases: 
RMX_NL_FINE should take values in the range 6:25
RMX_NL_COARSE should take values in the range 5:10 

For accuracy reasons, RMX_NL_FINE should always be greater than RMX_NL_COARSE.
The following value pairs for RMX_NL_FINE and RMX_NL_COARSE provide representative calculations:

12,6
14,8
16,10
18,10
20,10
25,10
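As an illustration, a small representative run could be configured as follows; the values are drawn from the ranges and pairs above, and the RMX_NGPU value is an example that applies to the GPU build only:

```shell
# Small benchmark configuration: 16 sectors per region, NL pair 12,6.
export RMX_NSECT_FINE=16
export RMX_NSECT_COARSE=16
export RMX_NL_FINE=12      # must be greater than RMX_NL_COARSE
export RMX_NL_COARSE=6
export RMX_NGPU=4          # GPU build only: shared GPUs per node (example value)
# Choose an MPI task count that is a factor of RMX_NSECT_FINE, e.g. 8 or 16.
echo "sectors=$RMX_NSECT_FINE/$RMX_NSECT_COARSE NL=$RMX_NL_FINE/$RMX_NL_COARSE"
```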

If the RMX_NSECT and RMX_NL environment variables are not set, the benchmark code falls back to calculated default values, giving:
RMX_NSECT_FINE=5
RMX_NSECT_COARSE=20
RMX_NL_FINE=12
RMX_NL_COARSE=6

* Results
For the atomic case:
1 AMPF output file will be generated for each fine-region sector
1 AMPC output file will be generated for each coarse-region sector
All AMPF output files will be the same size (in bytes), as will all AMPC output files.

For the molecular case:
1 AMPF output file will be generated for each MPI task
1 AMPC output file will be generated for each MPI task

The Hamiltonian matrix dimension will be output along
with the wallclock time taken by each individual DSYEVD (eigensolver) call.

Performance is measured in wallclock time and is displayed
on the screen or in the output log at the end of the run.

** Validation of Results

For the atomic dataset runs, run the atomic problem configuration supplied in the 'example_job_scripts' directory.
From the results directory issue the command:

awk '/Sector 16/ && /eigenvalues/' <stdout.filename>

replacing <stdout.filename> with the stdout file produced by the run.

The output should match the values below.

 Mesh 1, Sector 16: first five eigenvalues =     -4329.7161    -4170.9100    -4157.3112    -4100.9751    -4082.1108
 Mesh 1, Sector 16: final five eigenvalues =      4100.9751     4157.3114     4170.9125     4329.7178     4370.5405
 Mesh 2, Sector 16: first five eigenvalues =      -313.6307     -301.0096     -298.8824     -293.3929     -290.6190
 Mesh 2, Sector 16: final five eigenvalues =       290.6190      293.3929      298.8824      301.0102      313.6307

For the molecular dataset runs, run the molecular problem configuration supplied in the 'example_job_scripts' directory.
From the results directory issue the command:

awk '/Sector 64/ && /eigenvalues/' <stdout.filename>

replacing <stdout.filename> with the stdout file produced by the run.

The output should match the values below.

 Mesh 1, Sector 64: first five eigenvalues =     -3850.8443    -3593.9843    -3483.8338    -3466.7307    -3465.7194
 Mesh 1, Sector 64: final five eigenvalues =      3465.7194     3466.7307     3483.8338     3593.9855     3850.8443
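The comparison can be scripted. A sketch for the atomic case; the sample log written below only illustrates the line format produced by the benchmark, and in a real check the awk filter would be applied to the actual stdout file and the reference file would hold the expected values listed above:

```shell
# Write a two-line sample in the format produced by the benchmark run.
cat > sample_stdout.txt <<'EOF'
 Mesh 1, Sector 16: first five eigenvalues =     -4329.7161    -4170.9100    -4157.3112    -4100.9751    -4082.1108
 Mesh 1, Sector 16: final five eigenvalues =      4100.9751     4157.3114     4170.9125     4329.7178     4370.5405
EOF
# Extract the eigenvalue lines for Sector 16, as in the awk command above.
awk '/Sector 16/ && /eigenvalues/' sample_stdout.txt > observed.txt
# Compare against a saved reference copy of the expected values
# (here the sample itself stands in for the reference).
cp sample_stdout.txt reference_atom.txt
diff observed.txt reference_atom.txt && echo "VALIDATION PASSED"
```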

----------------------------------------------------------------------------
