Choosing a Cluster Configuration
ISC High Performance hosts the annual Student Cluster Competition (SCC), for which students build a cluster and compete in different application categories for the highest performance under a power budget. At SCC 2019 in Frankfurt, Germany, the power limit was set at 3 kW. This blog post covers the proposal for our cluster architecture.
Performance vs Power Draw
For the SCC, the team considered two prize categories: the overall prize and the award for the highest Linpack performance. While the former is awarded based on the weighted performance across all presented benchmarks and applications, the latter is evaluated solely on a single benchmark. Choosing a cluster architecture optimised for the announced set of applications is a prerequisite for high performance results at the competition.
The cluster comprises CPU, memory, storage, motherboard, accelerators (GPU), network cards and a cooling system (fans vs. liquid cooling). From these groups the team identified CPU and GPU as the factors with the largest impact on performance and power consumption. Peripherals like memory, motherboard, storage and cooling set a baseline for power consumption, as we assume a rather small, constant power draw from these. The author assumes the following power draw:
- RAM: 5 Watt
- Motherboard: 50 Watt
- Storage (SSD): 5 Watt
- Liquid Cooling: 150 Watt
Liquid cooling has the potential to improve power efficiency. Previous years' teams gained increased performance per Watt with a liquid cooling system drawing around 150 Watt at maximum. Besides the performance gain, however, liquid cooling systems bear the risk of spilling liquid, damaging components and endangering the cluster's operability as a whole. Deploying liquid cooling with the latest processors also requires specialised accessories and potentially adaptations of the chassis. Eventually, the team together with the sponsor decided on a fan-cooled chassis instead of liquid cooling. The team's sponsor Boston Ltd. provided eight CPU nodes and one GPU node. The GPU node was an NVIDIA DGX-1 server carrying eight NVIDIA Tesla V100 GPU with 32 GB of memory each, connected via 8-way NVLink on a single node.
Power draw
To choose the number and composition of CPU and GPU, the author compares the processors' thermal design power (TDP), assuming TDP as the power draw at peak performance. Figure 3.1 visualises the trade-offs between Intel Xeon Platinum 8180 (CPU) and NVIDIA Tesla V100 (GPU) quantities. Idle and maximum power consumption of the processors offered to the team are presented in Table 3.2.
Estimated idle and maximum power consumption:
Processor | Idle | Max |
---|---|---|
Intel Xeon Platinum 8180 | 10W | 205W |
Intel Xeon Gold 6126 | 10W | 125W |
Intel Xeon Gold 6140 | 10W | 140W |
NVIDIA Tesla V100 | 36W | 300W |
The higher the cluster performance within the power budget, the better. Scenario (12 CPU, 8 GPU) stays within the power limit of 3000 Watt while still maintaining budget headroom for other components such as RAM, motherboard and fan cooling. Scenario (12 CPU, 4 GPU) delivers too little performance, and additional CPU and GPU could be added to the configuration. Scenarios (14 CPU, 8 GPU), (12 CPU, 10 GPU) and (14 CPU, 9 GPU) exceed the power budget of 3000 Watt. Figure 3.1 does not take other components' power consumption into account but visualises hardware configurations based on idle and peak power draw. As an alternative to considering only idle and peak operation, an over-specified system running at less than 100% utilisation might deliver better performance per Watt.
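To make the comparison concrete, the sketch below recomputes these scenarios from the idle and maximum values in Table 3.2. It mirrors the figure's two bar types (GPU at maximum with CPU idle, and CPU at maximum with GPU idle) and, like the figure, ignores other components; it is an illustration rather than part of the team's tooling.

```bash
#!/usr/bin/env bash
# Back-of-the-envelope check of the Figure 3.1 scenarios, using the idle and
# maximum values from Table 3.2. Like the figure, it ignores other components.
CPU_IDLE=10; CPU_MAX=205   # Intel Xeon Platinum 8180, Watt
GPU_IDLE=36; GPU_MAX=300   # NVIDIA Tesla V100, Watt

scenario() {
  local cpus=$1 gpus=$2
  local gpu_peak=$(( gpus * GPU_MAX + cpus * CPU_IDLE ))  # GPU_max, CPU_idle
  local cpu_peak=$(( cpus * CPU_MAX + gpus * GPU_IDLE ))  # CPU_max, GPU_idle
  echo "${cpus} CPU, ${gpus} GPU: ${gpu_peak} W / ${cpu_peak} W"
}

scenario 12 8    # 2520 W / 2748 W - both cases stay below 3000 W
scenario 12 4    # below the budget, but leaves performance on the table
scenario 14 8    # CPU-peak case reaches 3158 W and exceeds the budget
scenario 12 10   # GPU-peak case reaches 3120 W and exceeds the budget
scenario 14 9
```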
Figure 3.1 shows the processor combinations and their power consumption. The x-axis names combinations of Intel Xeon Platinum 8180 (CPU) and NVIDIA Tesla V100 (GPU) quantities. The non-dashed bars (`GPU_max CPU_idle`) represent GPU running at maximum performance with maximum power consumption and CPU running idle. The dashed bars (`CPU_max GPU_idle`) represent CPU running at maximum performance and maximum power consumption with GPU running idle.
Table 3.3 lists the cluster's final mix of processors used at the competition. Based on the above considerations, the hardware sponsor recommended using the listed set of available CPU. Table 3.4 shows the remaining component specifications besides the CPU. The final cluster contains nine nodes, 18 CPU and eight GPU.
Nodes | Processor | Cores | Microarch. | Instruction set | Freq. (GHz) | TDP (W) |
---|---|---|---|---|---|---|
3 | Xeon Gold 6126 | 12 | Skylake | AVX-512, AVX2, AVX, SSE4.2 | 2.6 | 125 |
4 | Xeon Gold 6248 | 20 | Cascade Lake | AVX-512, AVX2, AVX, SSE4.2 | 2.5 | 150 |
1 | Xeon Platinum 8176 | 28 | Skylake | AVX-512, AVX2, AVX, SSE4.2 | 2.1 | 165 |
1 | Xeon E5-2698 v4 | 20 | Broadwell | AVX2 | 2.2 | 135 |
All CPU contained in the cluster are Intel Xeon processors, but the mix of processor types comes at the cost of varying core counts, frequencies, performance and power consumption. The supported instruction sets are identical for the Skylake and Cascade Lake generations, but Broadwell only supports AVX2 while the former support up to AVX-512. When compiling libraries and software packages like SWIFT and OpenFOAM, team members must therefore choose the lower performing AVX2 target if optimised vectorisation is required and the code is to run across all nodes. Alternatively, the GPU node has to be left out when running these applications so that the code can be compiled for the higher performing AVX-512 instruction set. Leaving out the Broadwell node (the GPU node) for program deployment, however, wastes power budget that could be spent on run performance. Controlling the different CPU frequencies with the Linux tool `cpupower` is possible, but working on heterogeneous processors across the cluster leads to unbalanced performance, as do the varying processor specifications in general.
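As an illustration of this trade-off, the snippet below shows how one might build a common AVX2 binary versus a Skylake/Cascade Lake-only AVX-512 binary, and how `cpupower` can limit a node's frequency. The compiler flags and the frequency value are examples chosen here, not the team's documented settings, and real packages like SWIFT use their own build systems.

```bash
# Common baseline: an AVX2 build runs on every node, including the
# Broadwell-based GPU node (illustrative single-file compile only).
icc -O3 -xCORE-AVX2   -o app_avx2   app.c
# GCC equivalent: gcc -O3 -march=broadwell

# AVX-512 build: only usable on the Skylake/Cascade Lake CPU nodes.
icc -O3 -xCORE-AVX512 -o app_avx512 app.c
# GCC equivalent: gcc -O3 -march=skylake-avx512

# Cap a node's CPU frequency to rein in power draw (2.2 GHz is an example).
cpupower frequency-set --max 2.2GHz
cpupower frequency-info    # verify the active governor and frequency limits
```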
Cluster components and their specification:
Component | Specification |
---|---|
Memory | 128 GB per node, 256 GB in the DGX-1 Server |
Storage | 1 TB in the headnode, 1 TB in the DGX-1 Server |
Networking | Mellanox 36-port 100 Gbps InfiniBand switch |
Accelerator | 8x NVIDIA Tesla V100 (32 GB VRAM) in an NVIDIA DGX-1 Server, 8-way NVLink |
Cooling | Fan-cooled system chassis |
Software considerations
For the OS we will use CentOS Linux 7. The cluster's software stack is set up to support the announced benchmarks and applications. Beyond the required compilers, tools and libraries, the team maintains broad support for common HPC applications. This software stack, and the experience with it gained during preparation, lets us adapt easily to changing requirements such as the secret application.
The team tested different compilers, and overall the Intel compilers demonstrated better performance than any other compiler. In general we prefer open source because of the community support and the wide-ranging documentation and discussions that are publicly accessible, but because of their performance the Intel compilers are our primary choice. Nevertheless, we also consider other compilers depending on availability and performance, including the GNU and PGI compilers.
Our software stack includes different MPI implementations, including but not limited to Open MPI and MPICH; we are currently also experimenting with MVAPICH2. Regarding MPI implementations we acknowledge several restrictions: as recommended by HPCAC, we will avoid Open MPI versions 1.10.3 to 1.10.6 due to a known timer bug. Horovod users experienced problems with Open MPI 3.1.3, and it is recommended to either downgrade to Open MPI 3.1.2 or upgrade to Open MPI 4.0.0; the team will avoid the affected versions. For CP2K, the team is currently experimenting with Open MPI v3 and v4.
To control and track power consumption the team uses IPMI 2.0, the cpupower and cpupower frequency-set tools, and the NVIDIA System Management Interface for the GPU (a monitoring sketch follows the list below). Additional libraries and software packages include:
- RapidCFD and cufflink for OpenFOAM
- ScaLAPACK, FFTW3, libxsmm and MKL for CP2K
- HDF5, ParMETIS, libNUMA and GSL for SWIFT
- Anaconda, NVIDIA GPU drivers, the CUDA Toolkit, cuDNN and NCCL2 for the AI application (TensorFlow/Horovod)
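A minimal monitoring sketch along these lines is shown below; it assumes the node's BMC supports DCMI power readings via `ipmitool` and is meant as an illustration rather than the team's exact procedure.

```bash
# Whole-node power draw from the BMC (requires DCMI support).
ipmitool dcmi power reading

# Per-core frequency limits, governor and idle-state statistics.
cpupower frequency-info
cpupower monitor

# GPU power draw and configured limits.
nvidia-smi -q -d POWER
```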
Controlling GPU performance
The performance of TensorFlow and Horovod for deep learning relies strongly on the cluster's GPU, making it imperative to investigate how GPU performance can be maximised while balancing power consumption. At the same time, the Student Cluster Competition includes applications that are not GPU-enabled, making it equally necessary to examine how to decrease the GPU's power consumption when running idle. Previous EPCC student teams were able to turn off error-correcting code (ECC) memory in order to lower GPU idle power [@ManosFarsarakis]. Decreasing idle GPU power consumption means more power is available for CPU-focused applications.
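Toggling ECC is done through the NVIDIA System Management Interface introduced below; a minimal, hedged example is shown here. Disabling ECC only takes effect after a reboot, and whether it is worthwhile depends on the applications' tolerance for memory errors.

```bash
# Disable ECC on GPU 0 to shave off idle power (takes effect after a reboot).
nvidia-smi -i 0 -e 0

# Re-enable ECC before GPU applications that need it are run again.
nvidia-smi -i 0 -e 1
```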
The NVIDIA System Management Interface (`nvidia-smi`) provides monitoring information for NVIDIA's Tesla devices and presents the data in plain text or XML format. Table 3.2 lists the idle state power draw of NVIDIA Tesla V100 GPU. These tests were conducted while a test system was available to the team: an NVIDIA DGX-Station carrying four NVIDIA Tesla V100 GPU. The devices idle at 34 degrees Celsius and 36 Watt; their maximum hardware power cap is 300 Watt.
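For reference, a few typical queries are shown below; `-q` selects the full query output, `-x` switches it to XML, `-d` narrows it to one section and `-i` targets a single device (device 0 here is only an example).

```bash
nvidia-smi                     # compact overview of all devices
nvidia-smi -q                  # full plain-text query output
nvidia-smi -q -x               # the same information as XML
nvidia-smi -q -i 0 -d POWER    # power section for device 0 only
```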
`nvidia-smi` also provides several management operations for changing device state and operation modes that adjust the balance between performance and power:
- Maximum Performance Mode operates the device at its peak thermal design power (TDP) level, thus 300 Watt for the NVIDIA Tesla V100 (see section 3.2), accelerating applications that rely on the fastest computational speed and highest available data throughput [@NvidiaVoltaArchWhitepaper].
- In Maximum Efficiency Mode the device is run at the optimal power efficiency available through the `nvidia-smi` interface (see Max-Q) [@NvidiaVoltaArchWhitepaper].
- In Persistence Mode the NVIDIA driver remains loaded even when no active client exists. This mode minimises the driver load latency associated with launching dependent applications such as CUDA programs.
- GPU Operation Mode (GOM) can disable device features in favour of reduced power consumption and optimised GPU throughput. GOM offers three operation levels: All On enables all features and runs the device at full speed; Compute is optimised for running compute tasks only, with no graphics operations allowed; Low Double Precision targets graphics applications that do not require high-bandwidth double precision.
- Max-Q defines a configuration that delivers the best performance per Watt for a given workload.

Running in Persistence Mode, a user can set power consumption caps for individual devices or a group of devices.
First, persistence mode has to be enabled by issuing `nvidia-smi -pm 1`; to turn persistence mode off again, use `nvidia-smi -pm 0`. With persistence mode enabled, a maximum power consumption of 180 Watt can be set on all detected devices with `nvidia-smi -pl 180`. To target specific devices, use the flag `-i device_id`, e.g. `nvidia-smi -i 0 -pl 180` sets the first of several devices in a node to a power cap of 180 Watt.
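Put together, a small helper along these lines could apply a uniform cap across all GPUs in a node; the 180 Watt value and the script itself are illustrative, not the team's competition settings.

```bash
#!/usr/bin/env bash
# Enable persistence mode and apply a uniform power cap to every GPU.
set -euo pipefail
CAP_WATT=${1:-180}   # illustrative default cap

nvidia-smi -pm 1     # keep the driver loaded between jobs

# List device indices and cap each one individually.
for id in $(nvidia-smi --query-gpu=index --format=csv,noheader); do
  nvidia-smi -i "$id" -pl "$CAP_WATT"
done
```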
GPU checks and power draw monitoring
The weekend before the competition, the cluster was shipped from Boston Limited's headquarters in St Albans near London to the conference venue in Frankfurt. The nodes arrived pre-assembled and only needed to be hung in the chassis, wired and powered on. CPU-only test runs by other team members on Sunday, the setup day before the competition, provided more insight into the overall cluster's power draw. The setup with nine nodes operated at 1.6 kW at idle. Other teams' idle power draw, displayed as live graphs in a Grafana dashboard, ranged from 0.5 to 1.2 kW. Competitors' architectures included pure CPU clusters and setups with older-generation GPU; most teams included accelerators for GPU-enabled applications and benchmarks.
Checking the GPU devices together with the other team members who used the GPU, HPL was run first and produced very low results below 10 TFlop/s. Re-runs raised memory errors, and checking the `nvidia-smi` interface confirmed the error: `Error on MPI rank 3: memcpy, illegal access`. As depicted in Figure 3.2, GPU #3 was not running properly. The GPU system was shut down, detached and the affected device investigated. A screw had come loose, preventing a full device connection; carefully tightening it resolved the issue, and the system was brought back and ready for operation.
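When a single device misbehaves like this, a quick device-specific query can help confirm the diagnosis; the commands below are generic checks, not a record of the exact ones run at the competition.

```bash
# Inspect the suspect device (index 3) for error counters and temperature.
nvidia-smi -q -i 3 -d ECC,TEMPERATURE

# NVIDIA driver errors (Xid messages) also show up in the kernel log.
dmesg | grep -i xid
```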
While other team members tested their applications, the author checked on the GPU node's performance and power draw. Last year's evaluation metric for the AI application was peak throughput in images per second, i.e. the number of input samples that could be processed. The 2019 rules for the AI application instead evaluated the intersection-over-union (IoU) score of predictions from a neural network for semantic image segmentation. At the competition, the evaluation focused on inference; the resource- and power-intensive network training could be done during preparation, prior to the competition. The submission at the competition comprised the application's inference output on unseen data that the team was provided with on the day of the competition.
To control power draw during the AI application run on GPU, the author selected only a single GPU device for inference. Figure 3.3 shows a test run on the GPU with ID 0; persistence mode is enabled and the power cap is set to its maximum of 300 Watt, while GPU devices 1 to 7 run idle. The run takes ca. 80 seconds. While the other GPU idle at very low power consumption under 50 Watt each, the GPU device the inference job is assigned to hits its peak power draw of 300 Watt and ca. 250 Watt multiple times during the computation of convolutions. Combining the GPU's peak power consumption with the peak power consumption of one of the Intel Xeon E5-2698 v4 CPU, the overall cluster power draw stayed below 2.2 kW on the third day of the competition. The data feeding the GPU is read on a single CPU, although the GPU node carries two Broadwell processors.
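One way to restrict a job to a single device is to hide the other GPUs from the process via the `CUDA_VISIBLE_DEVICES` environment variable; the script name below is a placeholder, not the team's actual inference entry point.

```bash
# Expose only GPU 0 to the inference process; devices 1-7 stay idle.
CUDA_VISIBLE_DEVICES=0 python run_inference.py   # script name is illustrative
```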
The data in Figure 3.3 was captured with `nvidia-smi -q -d POWER`, with one second between reading points; it is likely the GPU hit 300 Watt multiple times within milliseconds. Exceeding the GPU's TDP of 300 Watt was prevented by setting a hard power cap of 300 Watt through the `nvidia-smi` interface.
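A sampling loop of this kind could look like the sketch below, which logs the power reading of device 0 once per second; the exact invocation used for Figure 3.3 is not reproduced here.

```bash
# Log the power draw of GPU 0 once per second, with timestamps.
# nvidia-smi also offers a built-in loop: nvidia-smi -q -i 0 -d POWER -l 1
while true; do
  echo -n "$(date +%T) "
  nvidia-smi -q -i 0 -d POWER | grep -m1 "Power Draw"
  sleep 1
done
```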
Conclusions
The competition's main challenge is its power limit. Recent multi-core processors are well optimised to reach high performance at good power efficiency. The key to successful runs at the SCC is to find application configurations that balance performance and power consumption. Tuning these configurations inevitably included a multitude of evaluation runs; our evaluation was based on correctness, performance metrics (e.g. Flop/s) and particularly the cluster's overall power draw.
Test systems and the cluster's architecture changed before the competition days. By scripting as many build steps as possible, we automated the transition from a test to a production system; the build and staging scripts serve both as automation and as documentation. On the second day we added a kill script that ends processes when the cluster's power draw approaches 3 kW, which protected us from penalties for crossing the power budget.
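A guard of this kind could be structured as sketched below; the power-reading command, the threshold and the process being terminated are assumptions for illustration, not the team's actual script.

```bash
#!/usr/bin/env bash
# Emergency guard: end application processes if power draw nears the budget.
# The DCMI reading, the threshold and the process name are illustrative
# assumptions; in a multi-node cluster the per-node readings would be summed.
THRESHOLD_WATT=2900

while true; do
  watts=$(ipmitool dcmi power reading | awk '/Instantaneous power reading/ {print $4}')
  if [ "${watts:-0}" -ge "$THRESHOLD_WATT" ]; then
    echo "$(date +%T): ${watts} W - killing application processes"
    pkill -f xhpl || true   # example target: a running HPL benchmark
  fi
  sleep 1
done
```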
Having decided on the cluster's architecture and its components, our setup still held potential for further power reduction measures. In our multi-node system, latency-sensitive applications were accelerated through InfiniBand. Only the headnode required an internet connection, making it possible to remove unnecessary Ethernet network cards from individual nodes, further reducing the cluster's overall power consumption. We successfully deployed our system in the Student Cluster Competition and made use of high-performing accelerators and InfiniBand networking.