AI Safety at UCLA and ACM AI at UCLA recently built out a cluster of three servers, currently hosting a total of 14 NVIDIA 3090s. The GPUs alone are worth a blog post of their own, but this post will focus on the networking arrangement between these systems.

Overall layout

First, let’s get a rundown of the network topology for this mini-cluster. There are a total of four computers on the network:

  • sullivan: The login node and service host for the cluster. All SSH connections must be made through sullivan, and all services that the other servers rely on, such as NFS and LDAP, run on sullivan.
  • temescal: The first of the GPU servers. Currently hosts 4 3090s and is powered by a Threadripper PRO 5955WX.
  • serrano: The second of the GPU servers. Hosts 5 3090s and is powered by an EPYC 7402P.
  • ynez: The third and final GPU server. Also hosts 5 3090s using an EPYC 7402P.

Each computer comes with onboard 10GBASE-T networking, except sullivan, which only has 2.5GBASE-T and will need to be supplemented with an add-in NIC (e.g., an Intel X540). These are all connected through a 10G switch (TP-LINK TL-SX105), which also provides the connection to the rest of the internet. However, for synchronizing large jobs across the GPU servers, this link does not provide high enough bandwidth or, more importantly, low enough latency. Thus we look to Remote Direct Memory Access (RDMA) to enable these jobs.
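To get a feel for the latency side of that claim, a crude round-trip probe over the 10G link can be written with nothing but the Python standard library. The sketch below is a rough illustration, not our actual tooling; the port number is an arbitrary placeholder:

```python
"""Crude TCP round-trip latency probe between two cluster nodes.

Run `python rtt.py server` on one node and `python rtt.py client <host>`
on another.
"""
import socket
import sys
import time

PORT = 5001       # arbitrary unused port (assumption)
N_PINGS = 1000

def server() -> None:
    with socket.create_server(("", PORT)) as srv:
        conn, _ = srv.accept()
        with conn:
            while data := conn.recv(1):
                conn.sendall(data)  # echo each byte straight back

def client(host: str) -> None:
    with socket.create_connection((host, PORT)) as sock:
        # Disable Nagle batching so each 1-byte ping goes out immediately.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        start = time.perf_counter()
        for _ in range(N_PINGS):
            sock.sendall(b"x")
            sock.recv(1)  # wait for the echo before the next ping
        elapsed = time.perf_counter() - start
    print(f"mean round trip: {elapsed / N_PINGS * 1e6:.1f} us")

if __name__ == "__main__":
    if sys.argv[1] == "server":
        server()
    else:
        client(sys.argv[2])
```

A probe like this measures the full kernel TCP/IP path; RDMA exists precisely to cut that path out by letting the NIC write directly into the remote host's memory.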

RDMA

For our purposes, the two implementations of RDMA we are interested in are InfiniBand (IB) and RDMA over Converged Ethernet (RoCE). There are many other RDMA solutions, but these two seem the most achievable for our small, non-enterprise use case, mainly because cheap Mellanox (now NVIDIA) NICs supporting these features can be found on sites like eBay. Here is a nice list of Mellanox part numbers and their capabilities.

serrano and ynez each have an MCX314A-BCCT installed, whereas temescal has an MCX456A-ECAT. The choice of these cards locks us into using RoCE, because while the MCX456A-ECAT supports VPI (that is, both Ethernet and InfiniBand), the MCX314A-BCCT only supports the Ethernet protocol. This seems unfortunate, because the InfiniBand stack allows for slightly lower overall latencies than RoCE, but Meta Engineering found that both RoCE and InfiniBand clusters were successful in their Llama 3 training runs. Additionally, if we find the extra latency of RoCE to be an issue, upgrading serrano and ynez to MCX456A-ECATs should be relatively simple, and comes with the added bonus of increasing link speeds from 40Gbps to 100Gbps!
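On the software side, RoCE shouldn't require application changes: PyTorch's NCCL backend speaks RoCE through the same verbs interface it uses for InfiniBand, steered by a few environment variables. A minimal sketch of a two-node sanity check, launched with torchrun, might look like the following; the HCA name, GID index, and interface name are assumptions about our hardware, not verified values:

```python
import os
import torch
import torch.distributed as dist

# NCCL knobs for RoCE (values are assumptions for our cluster): the
# ConnectX-3 Pro cards show up under the mlx4 driver, ConnectX-4 under mlx5.
os.environ.setdefault("NCCL_IB_HCA", "mlx4_0")       # RDMA device NCCL should use
os.environ.setdefault("NCCL_IB_GID_INDEX", "3")      # GID index for RoCE v2 (commonly 3)
os.environ.setdefault("NCCL_SOCKET_IFNAME", "eth0")  # interface for NCCL's bootstrap TCP

# Rank and world size come from the launcher (e.g. torchrun) via env vars.
dist.init_process_group(backend="nccl")
torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", 0)))

# A single all-reduce exercises the RoCE path between nodes.
x = torch.ones(1024, device="cuda")
dist.all_reduce(x)
print(f"rank {dist.get_rank()}: all_reduce ok, x[0] = {x[0].item()}")
dist.destroy_process_group()
```

Setting NCCL_DEBUG=INFO when running a job like this prints which transport NCCL actually selected, which is a quick way to confirm traffic is going over the verbs path rather than falling back to plain TCP.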

RDMA support is provided by the MLNX-OFED package, now distributed by NVIDIA. This library has since been subsumed by NVIDIA's DOCA-OFED, which mainly focuses on NVIDIA's new BlueField NICs but maintains backwards compatibility with the older Mellanox cards (to a point).
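Once the OFED stack is installed, the cards should be visible through the verbs API. The pyverbs bindings that ship with rdma-core offer a quick way to check this from Python; the sketch below assumes that API shape, and device names and port attributes will vary per machine:

```python
# Assumes the pyverbs bindings from rdma-core (or MLNX-OFED) are installed.
from pyverbs.device import Context, get_device_list

for dev in get_device_list():
    name = dev.name.decode()
    with Context(name=name) as ctx:
        # Port numbers are 1-based; link_layer distinguishes RoCE ports
        # (Ethernet) from native InfiniBand ports.
        attr = ctx.query_port(1)
        print(f"{name}: state={attr.state} link_layer={attr.link_layer}")
```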

To be continued…

Stay tuned for more details. We are currently working on getting OpenMPI set up!