Startup Boosts Scale-Up to 1000+ GPUs in a Single Domain

Delos Data wants to enable practical scale-up domains of 1000+ GPUs in flexible topology designs.
PALO ALTO, Calif. — Startup Delos Data wants to enable GPU scale-up domains above 1000 GPUs via its cluster management software stack and a new server design. The overall aim is to provide flexible topology options that can be tailored for specific AI inference workloads at scale, reducing cost and power per token by improving GPU utilization.
The industry’s shift from training to inference workloads necessitates thinking differently about a lot of things, Delos Data CEO Ed Doe told EE Times.
“Training was seen as similar to HPC,” he said. “The workloads run over weeks or months; there are similarities, but there are important differences as well. Distributed inference is nanosecond latency sensitive.”
Inference workloads also need to be always-on and non-stop, Doe said, and to truly disaggregate, inference clusters need a level of modularity beyond what full-rack GPU systems can offer. This requires additional components in the interconnect architecture to ensure it is robust and resilient.
“A lot of GPU manufacturers build boxes and systems, prescribing a lot of different [parameters]—we very much advocate for modular architectures,” Doe said. “Disaggregation can mean many things. We’re thinking about it in the historical way—physical disaggregation, where you don’t have to put everything in the same physical box.”
Delos’ platform, Nonstop AI, is a disaggregated server design with a software stack for at-scale AI inference, which is intended to bring the benefits of scale-up connectivity to systems that previously required scale-out networking. (In scale-out systems, data has to traverse multiple, slower links to get to other GPUs, but with scale-up, GPUs are connected directly together, so the latency is shorter, and consistency is better.)
Server design
Delos’ server design, produced with a Taiwanese OEM partner, brings scale-up connectivity to the front panel via nine OSFPs per GPU (or other type of accelerator), offering 72× 200 Gb/s ports per server. These servers can be connected with copper cables or optical fiber (or any other type of cable), via an Ethernet or circuit switch (or any other type of switch).
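As a rough illustration of what those figures imply, the arithmetic can be sketched as below. Only the nine OSFPs per GPU, the 72 ports, and the 200 Gb/s rate come from the article; the GPU count per server is inferred by dividing 72 by nine, and treating each OSFP as a single 200 Gb/s port is an assumption, as are the function names.

```python
import math

# Figures quoted in the article.
OSFP_PER_GPU = 9
PORTS_PER_SERVER = 72
PORT_GBPS = 200


def server_port_budget(osfp_per_gpu=OSFP_PER_GPU,
                       ports_per_server=PORTS_PER_SERVER,
                       port_gbps=PORT_GBPS):
    """Back-of-envelope front-panel bandwidth for one server.

    Assumption (not stated in the article): each OSFP counts as one
    200 Gb/s port, so GPUs per server = ports / OSFPs per GPU.
    """
    gpus_per_server = ports_per_server // osfp_per_gpu
    return {
        "gpus_per_server": gpus_per_server,
        "per_gpu_gbps": osfp_per_gpu * port_gbps,
        "per_server_gbps": ports_per_server * port_gbps,
    }


def servers_for_domain(gpu_count, gpus_per_server):
    """Number of servers needed to reach a scale-up domain of gpu_count GPUs."""
    return math.ceil(gpu_count / gpus_per_server)


budget = server_port_budget()
servers_1000 = servers_for_domain(1000, budget["gpus_per_server"])
print(budget, servers_1000)
```

Under these assumptions, a 1000-GPU domain would span 125 such servers, each exposing 14.4 Tb/s of front-panel scale-up bandwidth.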
This enables scale-up domains to be huge—1000 GPUs is practical, Delos Data CTO Dan Daly told EE Times, but 10,000 is certainly possible.
“When you can change the topology, even with just one switch, you can put 100,000 GPUs in a single scale-up domain,” Daly said. “[We can] leverage the scale-out ecosystem of OSFP cables and cages and modularity and vendor choice, rather than, ‘this is what you get because it came with the rack’.”
Topologies for inference and training are different. For example, Google uses a 3D torus for training clusters and Boardfly topology for inference clusters. (Source: Delos Data)
With a few notable exceptions (e.g., Google’s large TPU clusters), scale-up domains have generally been limited to fewer than 100 GPUs until now (Nvidia’s NVLink is limited to 72 GPUs), but there are arguments for larger scale-up domains, including faster inference.
“The reason this hasn’t been done very often is these cables are kind of flaky,” Daly said. “They wiggle around, they pop out. And the switch is no longer in the rack; it’s somewhere else, in another row, in another location. It could die, it could update its software asynchronously—so there are new failure modes.”
Delos’ Mosaic stack demo. Traces on the right-hand side show a temporary performance dip due to unplugging a cable. (Source: Delos Data)
Delos demonstrated its Mosaic software stack at GTC in March, showing off a crucial ability to fail gracefully. Scale-up networks have lots of parallel paths, which allows for a good level of resiliency, but this has to be managed by software. In the Delos demo, pulling any cable out means a temporary dip in productivity while data is re-routed via different paths. Mosaic tracks performance to make sure full productivity is reached on the new route. This would work in a similar way if a GPU or accelerator fails, Daly said, made easier by having a larger scale-up domain for access to more GPUs.
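The general idea of re-routing around a pulled cable can be sketched with a toy example. This is not Mosaic's algorithm, which the article does not describe; it is a minimal illustration that uses breadth-first search over a hypothetical four-GPU ring, and every name in it is invented for the example.

```python
from collections import deque


def shortest_path(links, src, dst):
    """Breadth-first search over undirected links; returns a node list or None."""
    adjacency = {}
    for a, b in links:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    previous = {src: None}  # also serves as the visited set
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            # Walk the predecessor chain back to the source.
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in sorted(adjacency.get(node, ())):
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None  # no surviving path


# Hypothetical topology: four GPUs cabled in a ring, so every pair has two paths.
links = {("gpu0", "gpu1"), ("gpu1", "gpu2"), ("gpu2", "gpu3"), ("gpu3", "gpu0")}

path_before = shortest_path(links, "gpu0", "gpu1")

# Simulate unplugging the direct cable between gpu0 and gpu1.
links_after = links - {("gpu0", "gpu1")}
path_after = shortest_path(links_after, "gpu0", "gpu1")

print(path_before)
print(path_after)
```

In this toy ring, traffic still reaches its destination after the cable is pulled, but over three hops instead of one. A production stack would also have to detect the failure, rebalance load, and confirm that throughput recovers.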
More flexibility in terms of topology will also enable heterogeneous clusters, whether that’s different GPU types or other types of AI accelerators—there are many possibilities, Doe said.
Inference clusters are the target application, but this architecture would also work well for training and HPC, or anywhere large amounts of data need to be moved, Daly said, though it’s primarily validated and tested for inference systems.
“The reality is that the world is becoming much more optimized around the end workload,” Doe said. “You have to understand what matters to that workload, and the optimal way to do that, as opposed to prescribing a specific GPU or interconnect or cable or topology.”
Delos has deployments with early access customers, with broader availability planned for the fourth quarter of 2026.
_Sally Ward-Foxton covers AI for EETimes.com and EETimes Europe magazine. Sally has spent the last 18 years writing about the electronics industry from London. She has written for Electronic Design, ECN, Electronic Specifier: Design, Components in Electronics, and many more news publications. She holds a master's degree in Electrical and Electronic Engineering from the University of Cambridge._
