Maths GPU clusters and servers
There are currently a number of GPU facilities available to Maths users. Some are used by individual research groups and will not be detailed here, but there are four generally accessible facilities: two separate clusters plus two stand-alone GPU servers. Each cluster has its own host server; nvidia1.ma and nvidia2.ma are server blades installed in the same chassis and share a single PCI-express expansion chassis that physically accommodates the GPU cards. The two clusters are identical except for the GPU cards installed:
- nvidia1: three Tesla M2090 GPU cards, each with 512 CUDA cores and 6 GB of memory
- nvidia2: two Tesla K20m GPU cards, each with 2496 CUDA cores and 5 GB of memory
2.4 TB of local disk storage is available on each server for user home directories; in addition, the Maths data storage servers silos 1-4 and calculus are mounted on each GPU server via NFS. Both servers run the Ubuntu 16.04 LTS Linux operating system and have the nVidia CUDA version 9 drivers and Toolkit installed.
- nvidia3: a stand-alone GPU server fitted with two nVidia K40 GPU cards, each with 2880 CUDA cores and 12 GB of memory, plus 2 TB of local storage and access to the networked storage facilities. With Ubuntu 18.04, the latest CUDA 11 software and support for OpenCL, this facility is considerably more up to date than nvidia1 or nvidia2 and is better suited to experimental, leading-edge applications.
- nvidia4: the latest addition to the GPU family, this server has eight nVidia GeForce RTX 2080 Ti GPU cards, 1.5 TB of memory and 22 TB of local storage, and is otherwise set up much the same as nvidia3.
Programs you run on the clusters may be either pre-compiled binaries built and linked on another compatible GPU system, or programs you have written yourself (or built from source code given to you by others) as a CUDA source file compiled with the nvcc compiler. By convention, CUDA source files have the suffix .cu but may contain a mix of C, C++ and CUDA statements; nvcc uses the system's gcc compiler to generate non-GPU object code when necessary, switching automatically to the nVidia PTX compiler for GPU object code. A minimal sketch of such a source file is shown below.
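As an illustrative sketch only (nothing here is installed on the servers - the file name hello.cu, the kernel and the array size are arbitrary choices), a CUDA source file mixing host C++ with a GPU kernel might look like this:

#include <cstdio>
#include <cuda_runtime.h>

// GPU kernel: each thread doubles one element of the array
__global__ void double_elements(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;
}

int main(void)
{
    const int n = 256;
    float host[n];
    for (int i = 0; i < n; i++)
        host[i] = (float)i;

    // allocate device memory and copy the data onto the GPU
    float *dev;
    cudaMalloc(&dev, n * sizeof(float));
    cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);

    // launch one block of 256 threads, then copy the results back
    double_elements<<<1, 256>>>(dev, n);
    cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);

    printf("host[2] = %f\n", host[2]);   /* expect 4.0 */
    return 0;
}

Compiling and running it is then simply:

nvcc -o hello hello.cu
./hello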
Getting started
- Access to the GPU clusters and the stand-alone servers is remote, via ssh. To begin with, you need an account on one or more of the host servers - simply email Andy Thomas requesting an account. Once this is set up, the account details will be mailed to you; the password is randomly generated and you are strongly encouraged to change it when you log in for the first time, using the 'passwd' utility and following the prompts.
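- For example, if the username mailed to you were jbloggs (a placeholder), a first login to nvidia1 followed by a password change would look like this:

ssh jbloggs@nvidia1.ma
passwd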
- Before you start writing and compiling your own CUDA programs, you might want to have a look at some examples; you'll find a comprehensive selection of ready-to-compile programs in /usr/local/cuda/samples. A script called cuda-install-samples-11.0.sh is provided for you on nvidia3 and nvidia4 (cuda-install-samples-9.0.sh on nvidia1 and nvidia2) to make a writable copy of these read-only examples in your own home directory so that you can compile and run your own versions - here's an example of its use:
cuda-install-samples-11.0.sh ~/my_samples
- will copy the entire set of examples to a directory called my_samples/NVIDIA_CUDA-11.0_Samples in your home directory. Once you have done this, you can explore the examples and if you want to build and run the binary, just change into the directory containing your chosen example and type 'make'. For example, deviceQuery is a useful utility that displays the characteristics of each GPU card attached to the server so to compile and run your own copy of this, do the following:
cd ~/my_samples/NVIDIA_CUDA-11.0_Samples/1_Utilities/deviceQuery
make
./deviceQuery
- The utility should report that it has found three GPUs on nvidia1 (two in the case of nvidia2 and nvidia3, eight for nvidia4) and provide a detailed listing of the features of each of them.
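- If you would rather query this information from within your own program than by running deviceQuery, the CUDA runtime API provides cudaGetDeviceCount() and cudaGetDeviceProperties(); here is a minimal sketch (the file name gpuinfo.cu is an arbitrary choice):

#include <cstdio>
#include <cuda_runtime.h>

int main(void)
{
    int count = 0;
    cudaGetDeviceCount(&count);   // number of GPUs visible to this process

    printf("found %d GPU(s)\n", count);
    for (int i = 0; i < count; i++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("GPU %d: %s, %d multiprocessors, %zu bytes of memory\n",
               i, prop.name, prop.multiProcessorCount, prop.totalGlobalMem);
    }
    return 0;
}

- Compile and run it with 'nvcc -o gpuinfo gpuinfo.cu' followed by './gpuinfo'; on nvidia1 it should count three GPUs, matching the deviceQuery output.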
- nvcc does have a man page on the server but it's not very useful since it just lists the main nVidia CUDA utilities with very little information on their usage. You'll find a selection of nVidia documentation in PDF format right here on this server and you can also access nVidia's own online documentation for full information on the CUDA Toolkit.
Checking the status of the GPUs
- If you want to find out what all the GPU cards are doing, use the nvidia-smi utility. Typing 'nvidia-smi' with no parameters produces a summary table of their status; on nvidia2, for example, this might show both GPUs fully loaded although only using about 20% of the total available memory. The PIDs and names of the processes running on the host server are also listed, and normal Linux utilities such as 'ps ax' can be used to find further information on these.
- Typing 'nvidia-smi -q' produces a very detailed status report for all GPUs in the system, but this can be limited to a given GPU of interest with the -i N option, where N is the GPU identifier (0, 1 or 2 for nvidia1, 0 or 1 for nvidia2, and so on). For example, the command
nvidia-smi -q -i 1
- will show the full information for GPU 1 only. Unlike most other nVidia CUDA programs, nvidia-smi has extensive man page documentation although many of the available options are reserved for the root user since they affect the operation of the GPU card.
Are disk quotas imposed on the GPU cluster servers?
- No, but as with all Maths systems, disk usage is continuously monitored and those who have used a large proportion of the available home directory storage will be asked to move data to one of the silo storage servers, delete unwanted data, etc.
Is user data on the cluster servers backed up?
- Yes, all four servers are mirrored daily to our onsite backup servers which in turn are mirrored to the Maths offsite servers in Milton Keynes and Slough.
What about job scheduling and fair usage controls?
- Job queueing and resource management are not currently used on the GPU clusters or the stand-alone servers because, unlike the Maths compute cluster in the past, fair usage and contention for resources have not been a problem with the GPU facilities. It is also very difficult to implement traditional HPC-style cluster job management on GPU cards because there is no low-level interface to the core and memory resources within a given GPU card, although it is possible to control the use of entire GPU cards, as shown below. With the present small-scale clusters used by a small group of regular users, it is currently not worth implementing any form of job control.
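- That said, if you want to confine your own jobs to particular cards, the CUDA runtime honours the CUDA_VISIBLE_DEVICES environment variable; for example (myprog being a placeholder for your own binary):

CUDA_VISIBLE_DEVICES=1 ./myprog

- runs myprog with only GPU 1 visible to it; inside the program, that card appears as device 0. Note that this is a voluntary convention agreed among users rather than enforced job control.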
About the GPU clusters
- The host servers nvidia1 and nvidia2 are blade servers fitted into a Dell C6100 chassis, with each server separately connected via iPASS links to a Dell C410x PCI-express expansion chassis which is capable of housing up to 16 GPU cards. The chassis is configured so that 8 GPU card bays connect to one server and the other 8 bays to the other server, although not all of the bays are populated with GPU cards. The servers each have two 2.67 GHz quad-core Xeon CPUs and 72 GB of main memory.
- nvidia3 is a SuperMicro GR1027GR-72R2 GPU server that can accommodate up to 3 double-width GPU cards, although only two are fitted at present. Two 2.5 GHz quad-core CPUs are fitted and 64 GB of memory is available.
- nvidia4 is a large Tyan FT77DB7109 server fitted with eight nVidia GeForce RTX 2080 Ti GPUs, two 16-core 2.8 GHz Xeon CPUs, 1.5 TB of memory and 14 hard disks, two of which are fast SAS disks arranged as a mirrored pair for the system while the other 12 form an MDRAID pool, with one disk reserved as a 'hot spare'.