# Alya - Large-Scale Computational Mechanics

Alya is a simulation code for high-performance computational mechanics. Alya solves coupled multiphysics problems using high-performance computing techniques for distributed- and shared-memory supercomputers, together with vectorization and optimization at the node level.

Homepage: https://www.bsc.es/research-development/research-areas/engineering-simulations/alya-high-performance-computational

Alya is available to collaborating projects, and a specific version is distributed as part of the PRACE Unified European Applications Benchmark Suite (http://www.prace-ri.eu/ueabs/#ALYA).

## Building Alya for GPU accelerators

The library currently supports four solvers: GMRES, Deflated Conjugate Gradient, Conjugate Gradient, and Pipelined Conjugate Gradient. The only preconditioner supported at the moment is `DIAGONAL`.

Keywords to select the solvers:

```shell
NINJA GMRES        : GGMR
NINJA Deflated CG  : GDECG
NINJA CG           : GCG
NINJA Pipelined CG : GPCG
PRECONDITIONER     : DIAGONAL
```

All other options are the same as for the CPU-based solvers.

### GPGPU Building

This version was tested with the Intel Compilers 2017.1, bullxmpi-1.2.9.1, and NVIDIA CUDA 7.5. Ensure that the wrappers `mpif90` and `mpicc` point to the correct binaries and that `$CUDA_HOME` is set. Alya can be used with MPI alone or with hybrid MPI-OpenMP parallelism; the standard execution mode relies on MPI only.

- Uncompress the source (the Metis dependency and the Alya build options are configured in the steps below):

  ```shell
  tar xvf alya-prace-acc.tar.bz2
  ```

- Edit the file `Alya/Thirdparties/metis-4.0/Makefile.in` to select the compiler and target platform. Uncomment the relevant lines and add optimization flags, e.g.

  ```shell
  OPTFLAGS = -O3 -xCORE-AVX2
  ```

- Then build Metis4:

  ```shell
  $ cd Alya/Executables/unix
  $ make metis4
  ```

- For Alya there are several example configurations; copy one, e.g.
for the Intel compilers:

  ```shell
  $ cp configure.in/config_ifort.in config.in
  ```

- Edit `config.in`: add the corresponding platform optimization flags to `FCFLAGS`, e.g.

  ```shell
  FCFLAGS = -module $O -c -xCORE-AVX2
  ```

- MPI: no changes to the configure file are necessary. By default metis4 and 4-byte integers are used.

- MPI-hybrid (with OpenMP): uncomment the following lines (use `-fopenmp` instead of `-qopenmp` for the GCC compilers):

  ```shell
  CSALYA   := $(CSALYA) -qopenmp
  EXTRALIB := $(EXTRALIB) -qopenmp
  ```

- Configure and build Alya (`-x` builds the release version; `-g` the debug version, which additionally requires uncommenting the debug and checking flags in `config.in`):

  ```shell
  ./configure -x nastin parall
  make NINJA=1 -j num_processors
  ```

### GPGPU Usage

Each problem needs a `GPUconfig.dat`. A sample is available in `Alya/Thirdparties/ninja` and needs to be copied into the work directory. A README file in the same location provides further information.

- Extract the small one-node test case and configure it to use the GPU solvers:

  ```shell
  $ tar xvf cavity1_hexa_med.tar.bz2 && cd cavity1_hexa_med
  $ cp ../Alya/Thirdparties/ninja/GPUconfig.dat .
  ```

- To use the GPU, replace `GMRES` with `GGMR` and `DEFLATED_CG` with `GDECG`, both in `cavity1_hexa.nsi.dat`.

- Edit the job script `job.sh` to point to your `Alya.x` (compiled with the MPI options) and submit the calculation to the batch system:

  ```shell
  sbatch job.sh
  ```

  Alternatively, execute directly:

  ```shell
  OMP_NUM_THREADS=4 mpirun -np 16 Alya.x cavity1_hexa
  ```

## Building Alya for Intel Xeon Phi Knights Landing (KNL)

The Xeon Phi version of Alya currently relies on compiler-assisted optimization for AVX-512. Porting of performance-critical kernels to the new assembly instructions is underway. There will not be a version for the first-generation Xeon Phi Knights Corner coprocessors.

### KNL Building

This version was tested with the Intel Compilers 2017.1 and Intel MPI 2017.1.
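Before configuring, it can help to verify that the MPI compiler wrappers are actually on the `PATH`. A minimal, site-agnostic sketch (only the wrapper names come from this guide; the loop and messages are illustrative):

```shell
# Check that the MPI compiler wrappers are resolvable; print their locations.
for tool in mpif90 mpicc; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool -> $(command -v "$tool")"
  else
    echo "$tool not found in PATH"
  fi
done
```

On a correctly set-up system both wrappers should resolve to the intended MPI installation; running `mpif90 --version` can additionally confirm which Fortran compiler is being wrapped.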
Ensure that the wrappers `mpif90` and `mpicc` point to the correct binaries. Alya can be used with MPI alone or with hybrid MPI-OpenMP parallelism; the standard execution mode relies on MPI only.

- Uncompress the source (the Metis dependency and the Alya build options are configured in the steps below):

  ```shell
  tar xvf alya-prace-acc.tar.bz2
  ```

- Edit the file `Alya/Thirdparties/metis-4.0/Makefile.in` to select the compiler and target platform. Uncomment the relevant lines and add optimization flags, e.g.

  ```shell
  OPTFLAGS = -O3 -xMIC-AVX512
  ```

- Then build Metis4:

  ```shell
  $ cd Alya/Executables/unix
  $ make metis4
  ```

- For Alya there are several example configurations; copy one, e.g. for the Intel compilers:

  ```shell
  $ cp configure.in/config_ifort.in config.in
  ```

- Edit `config.in`: add the corresponding platform optimization flags to `FCFLAGS`, e.g.

  ```shell
  FCFLAGS = -module $O -c -xMIC-AVX512
  ```

- MPI: no changes to the configure file are necessary. By default metis4 and 4-byte integers are used.

- MPI-hybrid (with OpenMP): uncomment the following lines (use `-fopenmp` instead of `-qopenmp` for the GCC compilers):

  ```shell
  CSALYA   := $(CSALYA) -qopenmp
  EXTRALIB := $(EXTRALIB) -qopenmp
  ```

- Configure and build Alya (`-x` builds the release version; `-g` the debug version, which additionally requires uncommenting the debug and checking flags in `config.in`):

  ```shell
  ./configure -x nastin parall
  make -j num_processors
  ```

### KNL Usage

- Extract the small one-node test case:

  ```shell
  $ tar xvf cavity1_hexa_med.tar.bz2 && cd cavity1_hexa_med
  $ cp ../Alya/Thirdparties/ninja/GPUconfig.dat .
  ```

- Edit the job script `job.sh` to point to your `Alya.x` (compiled with the MPI options) and submit the calculation to the batch system:

  ```shell
  sbatch job.sh
  ```

  Alternatively, execute directly:

  ```shell
  OMP_NUM_THREADS=4 mpirun -np 16 Alya.x cavity1_hexa
  ```

## Remarks

If the number of elements is too low for a scalability analysis, Alya includes a mesh multiplication technique.
This tool is activated through an input option in the `ker.dat` file: the number of mesh multiplication levels to apply (0 meaning no mesh multiplication). At each multiplication level the number of elements is multiplied by 8, so a very large mesh can be generated automatically in order to study the scalability of the code on different architectures. Note that the mesh multiplication is carried out in parallel and thus should not noticeably impact the duration of the simulation process.
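Since each level multiplies the element count by 8, the size of the multiplied mesh can be estimated in advance. A small sketch (the 100,000-element base mesh is a hypothetical example, not taken from the test case):

```shell
# Estimate element counts after successive mesh multiplication levels
# (x8 per level); the base size of 100000 is a made-up example.
elements=100000
for level in 0 1 2 3; do
  echo "level $level: $elements elements"
  elements=$((elements * 8))
done
# level 3 yields 51200000 elements
```

Three levels already turn a modest mesh into tens of millions of elements, which is why a low starting element count is not an obstacle to a scalability study.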