NextSilicon’s Maverick-2 locks onto bottleneck code
- NextSilicon has developed a novel chip that adapts its hardware to accelerate high-performance computing applications.
- The Maverick-2 is claimed to have up to 4x the processing performance per watt of graphics processing units (GPUs) and 20x that of high-performance general-purpose processors (CPUs).
After years of work, the start-up NextSilicon has detailed its Maverick-2, what it claims is a new class of accelerator chip.
A key complement to the chip is NextSilicon’s software, which parses the high-performance computing application before mapping it onto the Maverick-2.
“CPUs and GPUs treat all the code equally,” says Brandon Draeger, vice president of marketing at NextSilicon. “Our approach looks at the most important, critical part of the high-performance computing application and we focus on accelerating that.”
With the unveiling of the Maverick-2, NextSilicon has exited its secrecy period.
Founded in 2017, the start-up has raised $303 million in funding and has 300 staff. The company is opening two design centres—in Serbia and Switzerland—with a third planned for India. The bulk of the company’s staff is located in Israel.
High-performance computing and AI
High-performance computing simulates complex physical processes such as drug design and weather forecasting. Such computations require high-precision calculations and use 32-bit or 64-bit floating-point arithmetic. In contrast, artificial intelligence (AI) workloads have more defined computational needs, and can use 16-bit and shorter floating-point formats. Using these shorter data formats results in greater parallelism per clock cycle.
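The link between shorter formats and parallelism can be sketched with simple arithmetic: a datapath of fixed width holds more values when each value is narrower. The 512-bit width below is a hypothetical figure chosen for illustration, not a Maverick-2 specification.

```python
# Hypothetical datapath width in bits; not a Maverick-2 figure.
DATAPATH_BITS = 512


def lanes(format_bits: int, datapath_bits: int = DATAPATH_BITS) -> int:
    """Return how many values of the given width fit in one datapath-wide operation."""
    return datapath_bits // format_bits


# Compare the formats typical of HPC (64- and 32-bit) with those used in AI.
for name, bits in [("FP64", 64), ("FP32", 32), ("FP16", 16), ("FP8", 8)]:
    print(f"{name}: {lanes(bits)} values per operation")
```

On this assumed width, moving from 64-bit to 16-bit values quadruples the number of values handled per operation.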
Using NextSilicon’s software, a high-performance computing workload written in languages and programming models such as C/C++, Fortran, OpenMP, or Kokkos is profiled to identify critical flows. These are code sections that run most frequently and benefit from acceleration.
“We look at the most critical part of the high-performance computing application and focus on accelerating that,” says Draeger.
This is an example of the Pareto principle: a small subset of the code (the principle’s 20 per cent) accounts for most (80 per cent) of the run time. The goal is to accelerate these most essential code segments.
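What accelerating the critical code can deliver overall may be estimated with Amdahl's law, a standard rule of thumb rather than anything NextSilicon has published. The sketch below takes the 80 per cent run-time share from the Pareto framing; the acceleration factors of 10 and 100 are purely illustrative.

```python
def amdahl_speedup(accelerated_fraction: float, factor: float) -> float:
    """Overall speed-up when a fraction of run time is sped up by `factor` (Amdahl's law)."""
    return 1.0 / ((1.0 - accelerated_fraction) + accelerated_fraction / factor)


# 80 per cent of run time sits in the critical code, per the Pareto framing.
# The factors of 10 and 100 are illustrative, not NextSilicon figures.
for factor in (10, 100):
    print(f"{factor}x on the hot code -> {amdahl_speedup(0.8, factor):.2f}x overall")
```

However large the factor, the overall gain in this simple model stays below 5x, because the remaining 20 per cent of the run time is untouched.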
The Maverick-2
These code flows are mapped onto the Maverick-2 processor and replicated hundreds or thousands of times, depending on their complexity and the on-chip resources available.
However, this is just the first step. “We run telemetry with the application,” says Draeger. “So, when the chip first runs, the telemetry helps us to size and identify the most likely codes.” The application’s mapping onto the hardware is then refined as more telemetry data is collected, further improving performance.
“In the blink of an eye, it can reconfigure what is being replicated and how many times,” says Draeger. “The more it runs, the better it gets.”
The time taken to reconfigure is a small fraction of the overall run time (see diagram). “A single high-performance computing simulation can run for weeks,” says Draeger. “And if something significant changes within the application, the software can help improve performance or power efficiency.”
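The idea of telemetry-driven replication can be illustrated with a toy model. NextSilicon has not disclosed how its mapping works, so everything here is an assumption made for illustration: the `Flow` record, the resource "units", the budget of 1,000 units, and the proportional allocation rule.

```python
from dataclasses import dataclass


@dataclass
class Flow:
    name: str
    time_pct: int  # share of run time seen in telemetry, in per cent (hypothetical)
    cost: int      # on-chip resource units one replica needs (hypothetical)


def plan_replicas(flows: list[Flow], budget: int) -> dict[str, int]:
    """Split a resource budget across flows in proportion to their run-time share."""
    total_pct = sum(f.time_pct for f in flows)
    plan = {}
    for f in flows:
        flow_budget = budget * f.time_pct // total_pct  # units granted to this flow
        plan[f.name] = flow_budget // f.cost            # whole replicas that fit
    return plan


# First telemetry sample: the stencil flow dominates the run time.
first = [Flow("stencil", 60, 4), Flow("reduce", 30, 2), Flow("io", 10, 10)]
print(plan_replicas(first, budget=1000))

# Later sample: the workload has shifted, so the plan is recomputed.
later = [Flow("stencil", 30, 4), Flow("reduce", 60, 2), Flow("io", 10, 10)]
print(plan_replicas(later, budget=1000))
```

In this toy model the stencil flow drops from 150 replicas to 75 once telemetry shows the reduction flow has become the hot spot, which is the kind of rebalancing the article describes.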
NextSilicon says its software saves developers months of effort when porting applications onto a high-performance computing accelerator.
NextSilicon describes the Maverick-2 as a new processor class, which it calls an Intelligent Compute Accelerator (ICA). Unlike a CPU or GPU, it differentiates the code and decides what is best to speed up. The configurable hardware of the Maverick-2 is thus more akin to a field-programmable gate array (FPGA). But unlike an FPGA, the Maverick-2’s hardware adapts on the fly.
Functional blocks and specifications
The Maverick-2 is implemented using a 5nm CMOS process and is based on a dataflow architecture. Its input-output (I/O) includes 16 lanes of PCI Express (PCIe 5.0) and a 100 Gigabit Ethernet interface. The device features 32 embedded cores in addition to the main silicon logic onto which the flows are mapped. The chip’s die is surrounded by four stacks of high-bandwidth memory (HBM3E), providing 96 gigabytes (GB) of high-speed storage.
NextSilicon is also developing a dual-die design - two Maverick-2s combined - with the OCP Accelerator Module (OAM) form factor in mind. The OAM variant, arriving in 2025, will use HBM3E memory for an overall memory capacity of 192 gigabytes (GB) (see diagram).
The Open Compute Project (OCP), the open-source industry organisation, has developed an open-source Universal Base Board (UBB) specification that hosts up to eight such OAMs or, in this case, dual-die Maverick-2 modules. NextSilicon is aiming to use the OAM dual-die design for larger multi-rack platforms.
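The memory figures quoted above can be tallied with a few lines of arithmetic. This is a back-of-envelope sketch: it assumes the four HBM3E stacks are of equal capacity and that a fully populated baseboard carries eight dual-die modules, neither of which NextSilicon has confirmed.

```python
HBM_STACKS_PER_DIE = 4   # from the article
MEMORY_PER_DIE_GB = 96   # from the article
DIES_PER_OAM = 2         # dual-die OAM variant
OAMS_PER_BASEBOARD = 8   # upper limit quoted for the baseboard

# Assumption: capacity is split evenly across the stacks.
per_stack_gb = MEMORY_PER_DIE_GB // HBM_STACKS_PER_DIE
per_oam_gb = MEMORY_PER_DIE_GB * DIES_PER_OAM
# Assumption: every baseboard slot holds a dual-die module.
per_baseboard_gb = per_oam_gb * OAMS_PER_BASEBOARD

print(f"Per HBM3E stack: {per_stack_gb} GB")
print(f"Per dual-die OAM: {per_oam_gb} GB")
print(f"Per fully populated baseboard: {per_baseboard_gb} GB")
```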
The start-up says it will reveal the devices’ floating-point operations per second (FLOPS) processing performance and more details about the chip’s architecture in 2025.
Partners
NextSilicon has been working with vendor Penguin Solutions to deliver systems that integrate its PCI Express modules based on its first silicon, the Maverick-1, a proof-of-concept design. Sandia National Laboratories led a consortium of US labs, including Lawrence Livermore National Laboratory and Los Alamos National Laboratory, in trialling the first design.
“We’re currently sampling dozens of customers across national labs and commercial environments. That’s been our focus,” says Draeger. “We have early-adopter programs that will be available at the start of 2025 with Dell Technologies and Penguin Solutions, where customers can get engaged with an evaluation system.”
Volume production is expected by mid-2025.
Next steps
AI and high-performance computing are seen as two disparate disciplines, but Draeger says AI is starting to interact with the latter in exciting ways.
Customers may pre-process data sets using machine-learning techniques before running a high-performance computing simulation. This is referred to as data cleansing.
A second approach is the application of machine-learning to the simulation’s results for post-processing analysis. Here, the simulation results are used to improve AI models that aim to approximate what a simulation is doing, to deliver results deemed ‘good enough’. Weather forecasting is one application example.
An emerging approach is to run small AI models in parallel with the high-performance simulation. “It offers a lot of promise for longer-running simulations that can take weeks, to ensure that the simulation is on track,” says Draeger.
Customers welcome anything that speeds up the results or provides guidance while the calculations are taking place.
NextSilicon is focussing on high-performance computing (HPC) but is eyeing data centre computing.
“We’re starting with HPC because that market has many unique requirements,” says Draeger. “If we can deliver performance benefits to high-performance computing customers then AI is quite a bit simpler.”
There is a need for alternative accelerator chips that are flexible, power efficient, and can adapt in whatever direction a customer's applications or workloads take them, says Draeger.
NextSilicon is betting that its mix of software and self-optimising hardware will become increasingly important as computational needs evolve.