Initial notes on cluster networking
AI Safety at UCLA and ACM AI at UCLA recently built out a cluster of three servers, hosting a current total of 14 NVIDIA 3090s. Just considering the GPUs is worth a blog post on its own, but this post will focus on the networking arrangement between these systems.
Overall layout
First, let’s get a rundown of the network topology for this mini-cluster. There are a total of four computers on the network:
- sullivan: The login node and also service host for the cluster. All ssh connections must be made through- sullivanand all services that other servers rely on, such as nfs and ldap, run on- sullivan.
- temescal: The first of the GPU servers. Currently hosts 4 3090s and is powered by Threadripper PRO 5955WX.
- serrano: The second of the GPU servers. Is host to 5 3090s, powered by an EPYC 7402P.
- ynez: The third and final GPU server. Also hosts 5 3090s using an EPYC 7402P.
Each computer comes with onboard 10GBASE-T networking, except sullivan, which
only has 2.5GBASE-T, and will need to be supplemented with an add-in NIC (eg.
Intel X540). These are connected through a 10G switch (TP-LINK TL-SX105), which
also provides the connection to the rest of the internet. However, for
synchronizing large jobs between each of the GPU servers, this link does not
provide high enough bandwidth, or more importantly, low enough latency. Thus
we look to Remote Direct Memory Access (RDMA) to enable these jobs.
RDMA
For our purposes, the two implementations of RDMA we are interested in are InfiniBand (IB) and RDMA over Converged Ethernet (RoCE). There are many other RDMA solutions, but these two seem the most acchievable for our small, non-enterprise usecase. This main because cheap Mellanox (now NVIDIA) NICs supporting these features can be found on sites like eBay. Here is a nice list of Mellanox part numbers and their capabilities.
serrano and ynez each have an MCX314A-BCCT installed, whereas temescal has
an MCX456A-ECAT. The choice of these cards locks us in to using RoCE, because
while the MCX456-ECAT supports VPI (that is, Ethernet and InfiniBand), the
MCX314A-BCCT only supports the Ethernet protocol. This seems unfortunate, because
the InfiniBand stack allows for slightly lower overall latencies than RoCE, but
Meta Engineering found
that RocE and InfiniBand clusters were successful in their Llama 3 training runs.
Additonally, if we find the extra latency in RoCE to be an issue, upgrading
serrano and ynez to the MCX456A-ECAT should be relatively simple, and comes
with the added bonus of increasing speeds from 40Gbps to 100Gbps!
RDMA support is provided by the MLNX-OFED package, now provided by NVIDIA. It
appears this library has been subsumed, however, by NVIDIA’s DOCA-OFED library,
which mainly focues on NVIDIA’s new BlueField NICs, but maintains backwards
compatability with the older Mellanox Cards (to a point).
To be continued…
Stay tuned for more details. Currently working on getting OpenMPI setup!