image
At Clussys, we're on a mission to revolutionize networking with the cutting-edge Compute Express Link (CXL) and CXL Switches with unparalleled scalability.

Clussys > NUPA: RDMA and TCP/IP over CXL and PCIe Fabric
2024 March 26

NUPA: RDMA and TCP/IP over CXL and PCIe Fabric

As Clussys, unlike traditional Ethernet network infrastructures, we will completely abandon the MAC network model and adopt a network model based on CXL controllers. We are particularly focused on small-scale, high-performance, low-latency cluster applications. Our goal is to interconnect 100 to 1000 GPUs, CPUs, and SSDs via high-speed PCIe/CXL bus networks, allowing these computational devices to operate as if they were one single device. This is the core area of current computational performance expansion, rather than focusing on interconnecting 10,000 or even 100,000 devices as in traditional cloud computing. If needed, we can scale up using Ethernet/IB to PCIe/CXL network conversion, but this is not our primary focus currently.

In small-scale clusters, such as the Nvidia-defined superpod clusters, we are particularly concerned with high-speed, low-latency communication between devices. Just like traditional networks, the key elements for improving communication performance include hardware control as much as possible, streamlined data flow (zero copy), and excellent flow control. To simplify the model, we compare network protocols and aim to achieve data transmission by leveraging existing hardware features as much as possible.

protocols

In the figure above, we can see that traditional network card devices introduce approximately 1-3us of additional latency. Additionally, data needs to be constantly copied between user space, kernel space, and devices. While various optimization techniques and technology modules can simplify this data movement, the results are not yet satisfactory, and we need something better.

In a CXL network, data movement can directly utilize the DMA of the CXL controller without the need for additional network cards or DSPs (this also applies to PCIe networks). Furthermore, the cache protocol provided by the CXL controller allows for faster and more timely data responses between different devices. Clussys, through the design of a unique yet universal bus interconnect network that leverages the performance of CXL controllers, efficiently moves data from one CPU to another, and from one GPU to another. The end-to-end latency is only 600ns, significantly reducing latency by over three times compared to RDMA networks.

To expedite ecosystem development and gather feedback from priority engineers, we are eager to share our achievements and research progress with the broader developer community. In this repository, we will gradually unveil our advancements. The structure of this repository is as follows:

li@ubuntu:~/workspace/project/demo$ tree .
demo/
├── imgs
│   └── protocols.jpg
├── ip
│   ├── cxl
│   │   ├── nupanet.ko
│   │   └── README.md
│   └── pcie
│       ├── nupanet.ko
│       └── README.md
├── rdma
│   ├── cxl
│   │   ├── nupa_net.ko
│   │   ├── rdma_cxl.ko
│   │   └── README.md
│   └── pcie
│       ├── ib_uverbs.ko
│       ├── nupa_net.ko
│       ├── rdma_nupa.ko
│       └── README.md
├── README.md
├── uec
│   └── README.md
└── vms
    ├── cxl
    │   ├── node1.sh
    │   ├── node2.sh
    │   └── README.md
    └── pcie
        ├── node1.sh
        ├── node2.sh
        └── README.md

To accelerate ecosystem development and incorporate the opinions and suggestions of priority engineers, we are eager to share our achievements and R&D progress with developers. In this repository, we will gradually open up our progress. The structure of this repository is as follows:

IP Folder

In the ip folder, we will build netdevice devices on the CXL network and introduce the capability of DMA technology for data transmission. Given the characteristics of CXL and PCIe networks, we aim to simplify protocol processing in the netdevice flow, removing unnecessary verification and inspection processes. We will provide applications for both CXL and PCIe scenarios.

RDMA Folder

In the rdma folder, we will directly build RDMA devices on the CXL network and re-adapt the verbs interface. This is to enable applications to adapt to the future arrival of CXL networks without the need for modifications. Similarly, we will offer both CXL and PCIe modes.

UEC Folder

UEC? This is for the future, but since the future has not arrived yet, let’s reserve a spot for UEC.

         -----------------------------------------------
        |       |          |  Latency     |  Cloud      |
 App    |  HPC  |  LLM     |  Senstive    |  Native     |
        |  APP  |  APP     |  APP         |  APP        |
         ------------------------------------------------
        |                         |                     |
Protocol|    RDMA       ----------     TCP/UDP/IP       |
        |              |                                |
         ------------------------------------------------
        |              |                                |
 HW     |     IB       |              MAC               |
        |              |                                |
         -----------------------------------------------
               Current Application and Networking

with the help of UEC, the community aiming to do like this:

         -----------------------------------------------
        |       |          |  Latency     |  Cloud      |
  App   |  HPC  |  LLM     |  Senstive    |  Native     |
        |  APP  |  APP     |  APP         |  APP        |
         ------------------------------------------------
        |             |              |                  |
Protocl |    RDMA     |     UEC      |    TCP/UDP/IP    |
        |             |              |                  |
         ------------------------------------------------
        |         |   PCIe    |                         |
 HW     |     IB  |   CXL     |          MAC            |
        |         |   UCIe    |                         |
         -----------------------------------------------
                       ^                       
                     Clussys

      The future of UEC Networking, with where Clussys sits

However UEC is still in its very early stage. Clussys will keep a close eye on its development.

VMS Folder

Finally, our main attraction, the vms folder, our software development environment. As the name suggests, this is a development and verification environment based on virtual machines. Of course, in virtual machines, performance expectations should be moderated. The goal is to demonstrate the credibility of the technology. Within the vms directory, we have built two environments for CXL and PCIe. It is worth mentioning that CXL is a rapidly evolving technology with high kernel requirements. Therefore, we have selected the 6.3 kernel series. While new kernels offer new features, stability needs collective validation. For PCIe, a long-standing technology, we have opted for the 5.15 kernel. In each virtual machine environment, you can run scprits for usage and experience. Additionally, a readme file is available for reference.

Finally, we gave our architecture a name, NUPA :-)

The project will be hosted on GitHub, accessible at https://github.com/Clussys/nupa. For those interested in a test binary file, please direct your inquiries to info@clussys.com.

Github repo

https://github.com/Clussys/nupa