GPU Machine

The following contains information on the setup of the small GPU machine presently located in the server room on the third floor. The specifications are:

  • 1 x NVIDIA Tesla K40c (12 GB).
  • 2 x NVIDIA Tesla M2090 (5 GB each).
  • CPU: Intel Xeon E5-1650 v2 @ 3.90 GHz (6 cores).
  • Motherboard: ASUS P9X79-WS-SYS.
  • Chipset: Intel Xeon E7 v2/Xeon.
  • RAM: 4 x 8192 MB DDR3-1866 MHz Samsung.
  • OS: Debian (Linux kernel 3.16.7).
  • Storage: 1 x 2 TB disk, partitioned with around 500 GB for the root filesystem (/) and the rest for user files (/home).

There is no module system, though a number of packages, in various versions, have been installed:

FFTW: Versions 2 and 3.3 (dev + double + long + quad + single) have been installed through apt-get, and version 3.3.4 is in /opt/fftw-3.3.4/ as a set of static libraries in its lib and .lib subdirectories.
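For example, to link a C program against the static FFTW 3.3.4 build (a minimal sketch, assuming the headers are in /opt/fftw-3.3.4/include and the double-precision library libfftw3.a sits in the lib subdirectory; my_fft.c is a placeholder file name):

$ gcc my_fft.c -I/opt/fftw-3.3.4/include -L/opt/fftw-3.3.4/lib -lfftw3 -lm -o my_fft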

LAPACK: Version 3.5.0 (+dev) has been installed through apt-get.

BLAS: Version 1.2.20110419 (+dev) has been installed through apt-get.

ABINIT: Version 7.8.2 has been installed through apt-get.

CUDA: Version 7.5 has been installed to /usr/local/cuda-7.5/ including CUFFT, CUBLAS, CUDART, CURAND and CUSPARSE.
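To compile against this toolkit, add it to your environment and build with nvcc (a sketch, assuming the standard toolkit layout under /usr/local/cuda-7.5/; my_kernel.cu is a placeholder source file):

$ export PATH=/usr/local/cuda-7.5/bin:$PATH
$ export LD_LIBRARY_PATH=/usr/local/cuda-7.5/lib64:$LD_LIBRARY_PATH
$ nvcc my_kernel.cu -lcublas -o my_kernel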

SLURM: Version 15.08.8 is installed.

SPRKKR: The binaries for version 6.3 are located in /usr/local/bin with the source code and make.inc required for building in /opt/sprkkr6.3.

On Boot (to be automated): The GPUs should be set to exclusive-process compute mode so that they cannot take multiple jobs simultaneously. This has to be done each time the machine is booted with (as root):

$ nvidia-smi -i N -c EXCLUSIVE_PROCESS

where N is the index of the GPU (which can be found by running nvidia-smi with no arguments). This command should be run for each GPU.
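Since this step is to be automated, a loop over the GPU indices could be added to a boot script such as /etc/rc.local (a sketch only; it assumes the installed driver's nvidia-smi supports the --query-gpu option):

$ for i in $(nvidia-smi --query-gpu=index --format=csv,noheader); do nvidia-smi -i "$i" -c EXCLUSIVE_PROCESS; done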

Also, the daemons for the SLURM queuing system should be started (as root):

$ /etc/init.d/munge start
$ /etc/init.d/slurmctld start
$ /etc/init.d/slurmd start
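To avoid starting these by hand after every reboot, the init scripts could be registered to run at boot with Debian's update-rc.d (a sketch; it assumes the munge, slurmctld and slurmd scripts above carry the LSB headers that update-rc.d needs):

$ update-rc.d munge defaults
$ update-rc.d slurmctld defaults
$ update-rc.d slurmd defaults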

Submitting jobs and queues: All of the queue configuration is specified in two files (/etc/slurm-llnl/slurm.conf and /etc/slurm-llnl/gres.conf). As the head node is also the compute node, the maximum amount of memory that can be requested is 28 GB, leaving 4 GB for the operating system; likewise, only 4 CPUs have been allocated to running jobs. Both limits can be adjusted in slurm.conf. If requesting a GPU (note that 1 CPU is allocated per GPU job by default, though it is possible to request more), it must be specified explicitly in the job script using the --gres (generic resource) flag.
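The relevant lines in those two files look roughly like the following (a sketch only, not the machine's actual configuration; HOSTNAME and the /dev/nvidia* device assignments are placeholders):

# /etc/slurm-llnl/gres.conf
Name=gpu Type=k40 File=/dev/nvidia0
Name=gpu Type=m2090 File=/dev/nvidia1
Name=gpu Type=m2090 File=/dev/nvidia2

# /etc/slurm-llnl/slurm.conf (node and partition definitions)
NodeName=HOSTNAME CPUs=4 RealMemory=28000 Gres=gpu:k40:1,gpu:m2090:2 State=UNKNOWN
PartitionName=defq Nodes=HOSTNAME Default=YES MaxTime=5-00:00:00 State=UP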

To request a GPU, include in your preamble: #SBATCH --gres=gpu:GPU_TYPE:1
where GPU_TYPE is either m2090 or k40. The final 1 can be replaced with 2 if you want both m2090 cards.

The maximum run time is currently 5 days, but this can be adjusted. The preamble for a generic job script is quite straightforward:

#!/bin/bash
#SBATCH --job-name=JOBNAME
#%A = JOB ID, %a = TASK ID
#SBATCH --output=o.OUTPUTFILE.%A_%a
#Run time
#SBATCH --time=5-0
#Partition/queue
#SBATCH --partition=defq
#Generic resource (for an m2090, replace k40 with m2090)
#SBATCH --gres=gpu:k40:1
#SBATCH --ntasks=1
#Number of CPUs you want to request
#SBATCH --cpus-per-task=1
#Run 10 tasks with IDs 1-10
#SBATCH --array=1-10:1
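The script can then be submitted and monitored in the usual way (job.sh is a placeholder file name):

$ sbatch job.sh
$ squeue -u $USER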

There are no separate compute nodes and only one disk, so there is no need to use a scratch space or copy any files. There are also some extra notes in /opt/slurm/notes.

Interactive queue: To request an interactive session, e.g. for GPU development, use the following command (this example requests an m2090):

$ srun -N 1 -p defq --ntasks-per-node=1 --gres=gpu:m2090:1 --pty bash
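Once the session starts, the allocated card can be checked from inside it (assuming the gres configuration lists the GPU device files, SLURM exports CUDA_VISIBLE_DEVICES for the job):

$ echo $CUDA_VISIBLE_DEVICES
$ nvidia-smi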

