## Test Case C

This test case aims to stress the underlying hardware with high-resolution images and a large deep neural network (DNN). For example, to run on a tier-0 cluster using 256 nodes (each with 2 CPU sockets), first allocate the compute resources and then launch:

```
mpirun -np 512 python dg_train.py -f output_bw_2048.hdf5 --arch EfficientNetB7 --epoches 10 --noise 0.3  --batch-size 1
```
Here, `--batch-size 1` is used because the DNN is so large that it requires up to 160 GB of memory. This requirement exceeds the memory of any currently available GPU, so in principle the training should be performed on the CPU. On some large-memory nodes, a `--batch-size` of 2 or even 4 might be possible.
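
This README does not show how `dg_train.py` selects its device, but as a rough sketch, assuming the script is built on TensorFlow/Keras (suggested by the `EfficientNetB7` architecture), CPU-only training could be forced by hiding the GPUs before the model is built:

```python
# Sketch only: assumes dg_train.py uses TensorFlow/Keras; not taken from the script itself.
import tensorflow as tf

# Hide all GPUs so the model and its activations are placed in host RAM,
# where the ~160 GB working set can fit. Exporting CUDA_VISIBLE_DEVICES=""
# in the job script would have the same effect.
tf.config.set_visible_devices([], "GPU")

# EfficientNetB7 is available in Keras applications; weights=None avoids a download.
model = tf.keras.applications.EfficientNetB7(weights=None)
```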

As a workaround, it is still possible to perform the training on the GPUs by using CUDA's unified memory, although this will likely introduce a performance penalty. As an example, to allocate 5x the GPU memory capacity and carry out the training on 256 nodes (each with 4 GPUs), the command line would be
```
mpirun -np 1024 python dg_train.py -f output_bw_2048.hdf5 --arch EfficientNetB7 --epoches 10 --noise 0.3  --batch-size 1 --gpu-mem-frac 5
```
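
How `--gpu-mem-frac` is handled inside `dg_train.py` is not shown here. As a minimal sketch, assuming a TensorFlow backend, one common way to enable unified-memory oversubscription is to set the per-process GPU memory fraction above 1.0, with `gpu_mem_frac` standing in for the parsed flag value:

```python
# Sketch only: assumes a TensorFlow backend; gpu_mem_frac stands in for the
# value parsed from --gpu-mem-frac (5 in the example above).
import tensorflow as tf

gpu_mem_frac = 5.0

# In TensorFlow, a per-process GPU memory fraction greater than 1.0 switches
# the allocator to CUDA unified memory, letting the GPU oversubscribe its own
# memory and spill the excess into host RAM.
config = tf.compat.v1.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = gpu_mem_frac
tf.compat.v1.keras.backend.set_session(tf.compat.v1.Session(config=config))
```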
Please note that the memory requirement in excess of the GPU memory is backed by host memory. In this example, each GPU is assumed to have 32 GB of memory, so the host memory on each node is expected to absorb (160 - 32) * 4 = 512 GB. Please choose `--gpu-mem-frac` according to the actual hardware specs to ensure that the model has enough memory to operate.
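
The following back-of-envelope check ties the numbers above together (illustrative values only; substitute the specs of the actual machine):

```python
# Illustrative arithmetic only: a 160 GB model footprint, 32 GB GPUs, and
# 4 GPUs per node are the values assumed in the example above.
model_mem_gb = 160        # estimated memory footprint per rank
gpu_mem_gb = 32           # memory per GPU
gpus_per_node = 4

# Smallest --gpu-mem-frac that covers the full footprint via unified memory.
min_gpu_mem_frac = model_mem_gb / gpu_mem_gb                    # 5.0
# Host memory per node that backs the overflow.
host_backing_gb = (model_mem_gb - gpu_mem_gb) * gpus_per_node   # 512 GB

print(f"--gpu-mem-frac >= {min_gpu_mem_frac}, host backing ~ {host_backing_gb} GB per node")
```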