UnifabriX uses CXL to improve HPC performance

CXL promises to reinvent the way computer systems are built. It runs over PCIe and can expand the memory of individual CPUs, but its biggest draw is network-arbitrated memory pools that can allocate somewhat higher-latency memory on demand to CPUs or to software-defined virtual machines. CXL-based products will start appearing on the market in 2023.

CXL looks set to reinvent data centers, but the benefits of higher-latency memory for high-performance computing (HPC) applications were not obvious, at least until UnifabriX demonstrated the bandwidth and capacity benefits of its CXL-based smart memory node at the 2022 Super Computer Conference (SC22). A just-released video shows the UnifabriX memory and storage demonstrations for HPC applications.

UnifabriX says the product is based on its Resource Processing Unit (RPU), which is embedded in the CXL Smart Memory Node shown below. This is a 2U rack-mounted server with serviceable EDSFF E3 media bays, containing up to 64TB of capacity in DDR5/DDR4 memory and NVMe SSDs.

The company says the product is compatible with CXL 1.1 and 2.0 and runs on PCIe Gen5. It is also said to be CXL 3.0 ready and to support both PCIe Gen5 and CXL expansion, as well as NVMe SSD access through CXL memory. The product is intended for use in bare-metal and virtualized environments across a wide range of applications, including HPC, AI and databases.
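
For context on how such a device appears to software: on a Linux host with the kernel's CXL driver stack loaded, CXL memory devices are typically enumerated under /sys/bus/cxl/devices. The short sketch below simply lists those entries. It is a generic illustration of host-side CXL discovery, not UnifabriX-specific tooling, and the exact sysfs layout depends on the kernel version and installed drivers.

```c
/* Minimal sketch: enumerate CXL devices registered by the Linux CXL driver
 * stack under /sys/bus/cxl/devices. Generic illustration only; the sysfs
 * layout varies by kernel version. */
#include <dirent.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *path = "/sys/bus/cxl/devices";
    DIR *dir = opendir(path);
    if (dir == NULL) {
        fprintf(stderr, "No CXL devices found (is the CXL driver loaded?)\n");
        return 1;
    }

    struct dirent *entry;
    while ((entry = readdir(dir)) != NULL) {
        if (strcmp(entry->d_name, ".") == 0 || strcmp(entry->d_name, "..") == 0)
            continue;
        /* Entries such as mem0, decoder0.0 or root0 correspond to CXL
         * memory devices, HDM decoders and root ports. */
        printf("%s/%s\n", path, entry->d_name);
    }
    closedir(dir);
    return 0;
}
```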

As with other CXL products, the memory node offers expanded memory, but it can also provide higher performance. Notably, at SC22 the memory node was used to run the HPCG performance benchmark, with the results compared against the same benchmark run without the memory node. The results are shown below.

For the conventional HPCG benchmark, performance initially increases approximately linearly with the number of CPU cores. However, it levels off at about 50 cores, with no further improvement as more cores are added: even when 100 cores are available, only about 50 are effectively used, because no additional memory bandwidth is available.
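
The saturation behavior follows from a simple roofline-style argument: HPCG is dominated by sparse, low-arithmetic-intensity operations, so achieved performance is capped at the available memory bandwidth divided by the bytes moved per floating-point operation. The sketch below illustrates this with illustrative numbers only; the per-core rate, local bandwidth, extra CXL bandwidth and bytes-per-flop values are assumptions chosen to show the shape of the curve, not figures from the SC22 demonstration.

```c
/* Roofline-style sketch of why an HPCG-like workload stops scaling with
 * cores once memory bandwidth is saturated, and how extra CXL bandwidth
 * raises the ceiling. All numbers are illustrative assumptions. */
#include <stdio.h>

int main(void)
{
    const double gflops_per_core  = 1.0;    /* assumed per-core HPCG compute rate */
    const double local_bw_gbs     = 300.0;  /* assumed local DDR bandwidth, GB/s */
    const double cxl_extra_bw_gbs = 80.0;   /* assumed extra bandwidth via CXL, GB/s */
    const double bytes_per_flop   = 6.0;    /* assumed HPCG-like arithmetic intensity */

    for (int cores = 10; cores <= 100; cores += 10) {
        double compute_cap  = cores * gflops_per_core;
        double bw_cap_local = local_bw_gbs / bytes_per_flop;
        double bw_cap_cxl   = (local_bw_gbs + cxl_extra_bw_gbs) / bytes_per_flop;

        /* Achieved rate is the lower of the compute and bandwidth ceilings. */
        double local_only = compute_cap < bw_cap_local ? compute_cap : bw_cap_local;
        double with_cxl   = compute_cap < bw_cap_cxl   ? compute_cap : bw_cap_cxl;

        printf("%3d cores: local-only %6.1f GFLOP/s, with CXL %6.1f GFLOP/s\n",
               cores, local_only, with_cxl);
    }
    return 0;
}
```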

If the memory node is added to provide CXL memory in addition to the memory directly attached to the CPU, performance continues to scale with core count. The memory node improves overall HPCG performance by moving lower-priority data from CPU-local memory to CXL-attached memory; this prevents saturation of the local memory and allows performance to keep scaling as processor cores are added. As shown above, the memory node improved HPCG benchmark performance by more than 26%.
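
On Linux hosts, CXL-attached memory is commonly exposed as a CPU-less NUMA node, so the general idea of tiering can be sketched in application code with libnuma, as below. The node numbers and buffer sizes are assumptions, and this is not how the UnifabriX product implements its data placement; the sketch only shows the principle of keeping bandwidth-critical data in local DRAM while parking lower-priority data in CXL memory.

```c
/* Minimal software-tiering sketch, assuming CXL-attached memory appears as a
 * CPU-less NUMA node (node 1 here is an assumption; check `numactl -H`).
 * Hot, bandwidth-critical data stays in local DRAM; colder data goes to the
 * CXL node. Build with: gcc tier.c -lnuma */
#include <numa.h>
#include <stdio.h>

#define LOCAL_NODE 0   /* assumed: direct-attached DDR */
#define CXL_NODE   1   /* assumed: CXL memory node */

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }

    size_t hot_bytes  = 1UL << 30;  /* 1 GiB of frequently touched data */
    size_t cold_bytes = 4UL << 30;  /* 4 GiB of lower-priority data */

    double *hot  = numa_alloc_onnode(hot_bytes, LOCAL_NODE);
    double *cold = numa_alloc_onnode(cold_bytes, CXL_NODE);
    if (hot == NULL || cold == NULL) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }

    /* ... run the bandwidth-hungry kernel against `hot`, keep `cold`
     * (checkpoints, rarely used operands) on the CXL node ... */

    numa_free(hot, hot_bytes);
    numa_free(cold, cold_bytes);
    return 0;
}
```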

The company has worked closely with Intel on its CXL solution, and Intel cites these results, as well as other third-party tests, in its recent product brief on Infrastructure Processing Units (IPUs), "Intel Agilex FPGA Accelerators bring improved TCO, performance and flexibility to 4th Gen Intel Xeon platforms."

In addition to providing memory capacity and bandwidth improvements, the memory node can also provide NVMe SSD access through CXL. The company says it plans to deliver memory, storage and networking through the CXL/PCIe interface, hence the name UnifabriX. With networking included, its boxes could replace top-of-rack (TOR) solutions as well as provide memory and storage access.

The UnifabriX memory node, built around the company's Resource Processing Unit (RPU), provides a path to overcoming direct-attached DRAM bandwidth limitations in HPC applications by using shared CXL memory.
