Computing cluster for the University of Rzeszow

MEGATEL, leader of the MEGATEL/OPTEAM Rzeszow consortium, has supplied the Interdisciplinary Centre for Computer Modelling at the University of Rzeszow with a very interesting cluster. It is interesting because it brings together the latest trends in computing technology.

First, despite a relatively small budget, the cluster will age relatively slowly, thanks to the space reserved for GPU computational accelerators. Second, it will serve both batch and interactive calculations, for experienced and novice users alike. Third, the cluster uses a multi-node, fault-tolerant, remotely accessible HA NFS file system.

The solution allows up to 120 accelerator cards to be installed in the 40 nodes (3 cards per node). The computing power of the cluster, which was 7.5 TFLOPS at the date of purchase, may be increased to at least 150 TFLOPS over its service life.
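The expansion figures can be checked with a few lines of arithmetic. The sketch below is only an illustration: it assumes that the 7.5 TFLOPS quoted at purchase comes from the CPUs alone and that peak performance simply adds up across cards, neither of which the text states explicitly. The function names are hypothetical.

```python
# Figures quoted in the article.
NODES = 40
CARDS_PER_NODE = 3
BASE_TFLOPS = 7.5       # peak at the date of purchase (assumed CPU-only)
TARGET_TFLOPS = 150.0   # stated lower bound after full expansion


def total_cards(nodes: int, cards_per_node: int) -> int:
    """Number of accelerator cards the cluster can hold."""
    return nodes * cards_per_node


def required_tflops_per_card(base: float, target: float, cards: int) -> float:
    """Peak each card must contribute, assuming peaks add linearly."""
    return (target - base) / cards


cards = total_cards(NODES, CARDS_PER_NODE)
per_card = required_tflops_per_card(BASE_TFLOPS, TARGET_TFLOPS, cards)

print(f"accelerator slots: {cards}")
print(f"peak needed per card: {per_card:.4f} TFLOPS")
```

Under these assumptions, each of the 120 cards would need to contribute a little under 1.2 TFLOPS of peak performance for the cluster to reach the 150 TFLOPS mark.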

The cluster is built from HP SL250 G8 servers, each with 3 PCIe 3.0 x16 slots that communicate with both processors of the node without blocking. Each node has an adequate supply of electrical power and cooling while remaining tolerant of power supply and fan failures. The PCIe slots are full-height, full-length and "thick" (double-width), that is, sized to accept any PCIe GPU card.



The key feature of this project is that choosing nodes ready for later expansion with GPU cards had practically no effect on the purchase price of the cluster. Its users have complete freedom to choose among GPU accelerators from INTEL, ATI and NVIDIA, both those available now and those released over the next three years. For testing purposes, 4 NVIDIA Tesla K20 5GB cards were used.

Unusually for the market, this cluster runs under the MS Windows HPC system. This allows the use of both a queuing system and interactive applications such as ANSYS (calculation of fluid flow). In addition, the cluster is divided into two parts, which can be connected and separated again in either software or hardware. This architecture prevents inexperienced users from interfering with more experienced ones, and vice versa.

Resilience of the remote file system was achieved with a solution based on the open source DRBD software and HP SL4540 servers. SL4540 servers are specialized machines that can accommodate up to 60 drives in 4U of rack height. In addition to standard RAID 1 to 6 protection, the disk controllers of these servers have the unique ability to use high-speed SSDs as cache memory, which significantly speeds up disk operations.

A two-node symmetrical disk cluster operating in ACTIVE-PASSIVE mode was created for the project. The active node serves clients and at the same time replicates data in the background to the passive node. The whole is controlled by a mechanism implemented with the HEARTBEAT software, which, in the event of a failure of the active node, redirects traffic to its mirror image on the passive node.
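The active-passive idea can be illustrated with a small simulation. This is a minimal sketch and not the actual configuration: it uses neither the DRBD nor the HEARTBEAT interfaces, an in-memory dictionary stands in for block-level replication, and every name in it (StorageNode, ActivePassivePair, max_missed, the node and file names) is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class StorageNode:
    name: str
    alive: bool = True
    blocks: dict = field(default_factory=dict)  # stand-in for the disk contents


class ActivePassivePair:
    """Toy model: one node serves clients, the other holds a mirror."""

    def __init__(self, active: StorageNode, passive: StorageNode, max_missed: int = 3):
        self.active = active
        self.passive = passive
        self.max_missed = max_missed  # missed heartbeats tolerated before failover
        self.missed = 0

    def write(self, key: str, data: bytes) -> None:
        """Serve a client write on the active node and mirror it to the passive one."""
        if not self.active.alive:
            raise RuntimeError("active node is down; failover has not happened yet")
        self.active.blocks[key] = data
        if self.passive.alive:
            self.passive.blocks[key] = data  # replication to the mirror

    def read(self, key: str) -> bytes:
        return self.active.blocks[key]

    def heartbeat_tick(self) -> bool:
        """One monitoring interval. Returns True if the roles were swapped."""
        if self.active.alive:
            self.missed = 0
            return False
        self.missed += 1
        if self.missed >= self.max_missed and self.passive.alive:
            # Redirect traffic: the mirror becomes the node that serves clients.
            self.active, self.passive = self.passive, self.active
            self.missed = 0
            return True
        return False


pair = ActivePassivePair(StorageNode("node-a"), StorageNode("node-b"))
pair.write("results.dat", b"42")
pair.active.alive = False  # simulated power loss on the active node
failed_over = any(pair.heartbeat_tick() for _ in range(pair.max_missed))
print(failed_over, pair.active.name, pair.read("results.dat"))
```

In this toy model the data written before the simulated power loss remains readable after the roles are swapped, because every write had already been mirrored to the passive node.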

The disk array created this way was subjected to so-called stress tests, including an unexpected power loss and restoration on the active node during file operations. In all tests, no data was lost, and client sessions connected to the disk system were not interrupted.

Obviously, we do not wish such stress on our colleagues from Rzeszów - we wish them successful calculations only!

Technical data of the cluster:

Total computational power: 7.5 TFLOPS; number of nodes: 40; two-processor nodes connected via an INFINIBAND network; disk resource capacity: 18 TB. Budget: PLN 1.4 million; date of commissioning: March 2014.