New Server Hopes to Break Through AI's "Memory Wall"

Overview

AI hardware startup Majestic Labs is building a new server called Prometheus with up to 128 terabytes of memory, over 60 times more than Nvidia's DGX B300, to address what the industry calls the memory wall in large language model inference. The company uses a DRAM-centric architecture with a proprietary copper-cable memory interface and custom aggregation chips. Prometheus pairs this memory with a custom AI processor called Ignite and is intended to support common frameworks without code changes.

Key Takeaways

Majestic Labs is developing an AI server, Prometheus, with up to 128 terabytes of memory.
That is over 60 times more memory than Nvidia's DGX B300 server.
The design goes all in on DRAM, specifically LPDDR6, in a unified architecture.
Prometheus uses a custom AI processor called Ignite, with 12 Ignite chips per server.
It will support PyTorch, vLLM and OpenAI's Triton without requiring code modifications.

Stats & Key Facts

Prometheus offers up to 128 terabytes of memory
Over 60 times more memory than Nvidia's DGX B300
Memory bandwidth up to 25.6 terabytes per second
Proprietary memory interface effective up to a meter
12 Ignite chips per Prometheus server
Up to four servers per rack
Up to 120 kilowatts per rack

The Memory Wall

Memory is the key constraint on LLMs.

›Memory is arguably the most serious constraint on modern AI large language models.
›Token generation is an inherently memory-bound task.
›The severity of the bottleneck grows with model size.

According to one influential paper cited in the article, the rate at which models output text is limited by how quickly data can be read in from memory. This creates a memory wall that holds back LLM inference performance, and Majestic Labs is attacking it directly by building far more memory into each server.
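The bound described above can be illustrated with a back-of-the-envelope estimate. The sketch below assumes each decode step streams every model weight from memory exactly once, and it ignores KV-cache traffic and compute time. The function name, the 1-trillion-parameter model, and the 8-bit weights are hypothetical; only the 25.6 terabytes per second figure comes from Majestic's stated peak bandwidth.

```python
TB = 1e12  # decimal terabyte; an assumption about how the quoted figures are measured


def max_decode_steps_per_second(bandwidth_bytes_per_s, param_count, bytes_per_param):
    """Upper bound on decode steps per second for a memory-bound model.

    Assumes each step reads every weight once and nothing else limits speed.
    """
    weight_bytes = param_count * bytes_per_param
    return bandwidth_bytes_per_s / weight_bytes


# Hypothetical model: 1 trillion parameters stored at 1 byte each (8-bit weights).
steps = max_decode_steps_per_second(25.6 * TB, 1e12, 1)
print(f"At most {steps:.1f} decode steps per second per sequence")
```

Under these assumptions, doubling the parameter count or the bytes per parameter halves the bound, which is why the bottleneck becomes more severe as models grow.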

Majestic's Bet

More memory is the company's edge.

›Prometheus offers up to 128 terabytes of memory.
›That is over 60 times more than Nvidia's DGX B300.
›Co-founder Sha Rabii believes the increase will give the company an edge.

Rabii acknowledges that Nvidia has done a phenomenal job creating a system that can scale out, but argues that the approach becomes less economical as models grow because it ends up over-provisioning compute while starving models of memory.
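A rough capacity calculation shows what the headline figures imply. This is a sketch only: it assumes decimal terabytes, reads "over 60 times more" as a ratio of at least 60, and counts model weights alone. The helper name and the 16-bit precision are hypothetical choices for illustration.

```python
TB = 1e12  # decimal terabyte (assumption)

PROMETHEUS_CAPACITY_BYTES = 128 * TB  # maximum configuration quoted by Majestic


def max_params_that_fit(capacity_bytes, bytes_per_param):
    """Largest parameter count whose weights alone fit in the given capacity."""
    return int(capacity_bytes // bytes_per_param)


# A ratio of at least 60 implies the comparison system holds about 128/60 TB or less.
implied_baseline_tb = 128 / 60
print(f"Implied baseline: about {implied_baseline_tb:.2f} TB or less")

# Hypothetical 16-bit weights (2 bytes per parameter).
params = max_params_that_fit(PROMETHEUS_CAPACITY_BYTES, 2)
print(f"Weights-only ceiling at 16-bit: {params / 1e12:.0f} trillion parameters")
```

Real deployments would fit fewer parameters than this ceiling, since activations, caches, and runtime overhead also consume memory.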

DRAM-Centric Architecture

The design differs fundamentally from competitors.

›Nvidia servers use fast high-bandwidth memory for model weights plus a larger, slower DRAM pool.
›Majestic goes all in on DRAM, specifically LPDDR6, in a unified architecture.
›Most memory interfaces operate only over a few millimeters, which limits how much memory can be attached to a processor.

To solve the distance limit, Majestic uses a proprietary memory interface built from miniature copper cables that is effective up to a meter. It is paired with custom memory aggregation chips that sit next to memory modules and coordinate memory across the server, fanning out to many commodity DRAM chips. The design offers memory bandwidth up to 25.6 terabytes per second.

The Ignite Processor

More memory needs paired acceleration.

›Ignite is a custom AI processing unit that serves as the server's compute engine.
›The Prometheus server contains 12 Ignite chips.
›Ignite combines ARM application cores with RISC-V vector and tensor cores on a single die.

The ARM cores act as an on-chip host processor to orchestrate the AI model, while the RISC-V cores carry out the actual LLM processing. The result is a single chip that handles multiple aspects of LLM inference without handing off between processors, with all cores sharing the same memory space.

Software Compatibility

Majestic aims to reduce adoption friction.

›Prometheus will support PyTorch, vLLM and OpenAI's Triton.
›It will not require code modifications.
›Existing models compatible with these frameworks can run as-is.

Rabii acknowledges that software matters because many AI frameworks are already entrenched, and says the company is trying to reduce friction in both the physical and software aspects of customer adoption.

Server Design and Pricing

The hardware follows an open standard.

›The server is Open Compute Project-compliant.
›Up to four servers fit in a rack, with power draw up to 120 kilowatts per rack.
›Heat is managed with cold-plate liquid cooling.

The memory design is modular, so customers can buy servers configured with less than the maximum 128 TB of memory. Majestic Labs had not yet revealed specific metrics for Prometheus' compute performance at the time of writing.
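The per-server and per-rack figures above combine into rack-level ceilings. The sketch below is illustrative arithmetic only: the `RackConfig` class is hypothetical, every default is an "up to" maximum from the text, and the even power split is an assumption that ignores switches, pumps, and other shared equipment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RackConfig:
    servers: int = 4                 # "up to four servers" per rack
    memory_tb_per_server: int = 128  # maximum memory configuration
    chips_per_server: int = 12       # Ignite chips per Prometheus server
    rack_power_kw: float = 120.0     # "up to 120 kilowatts" per rack

    def total_memory_tb(self) -> int:
        return self.servers * self.memory_tb_per_server

    def total_chips(self) -> int:
        return self.servers * self.chips_per_server

    def power_per_server_kw(self) -> float:
        # Assumes the rack budget is divided evenly among the servers.
        return self.rack_power_kw / self.servers


rack = RackConfig()
print(rack.total_memory_tb(), "TB,", rack.total_chips(), "chips,",
      rack.power_per_server_kw(), "kW per server")
```

Because the memory is modular, a smaller configuration can be modeled by overriding one field, for example `RackConfig(memory_tb_per_server=64)`, where 64 is a hypothetical value rather than a published option.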

Frequently Asked Questions

What is Prometheus?

Prometheus is an AI server being developed by Majestic Labs with up to 128 terabytes of memory, over 60 times more than Nvidia's DGX B300.

What is the memory wall?

It is the bottleneck in which LLM token generation is limited by how quickly data can be read from memory, a constraint that grows with model size.

How does Majestic's architecture differ?

It goes all in on DRAM, specifically LPDDR6, in a unified architecture, using a proprietary copper-cable interface effective up to a meter and custom memory aggregation chips.

What is Ignite?

Ignite is Majestic's custom AI processing unit; each Prometheus server has 12 of them, combining ARM application cores with RISC-V vector and tensor cores on a single die.

Which software frameworks does Prometheus support?

It will support PyTorch, vLLM and OpenAI's Triton without requiring code modifications, so compatible existing models can run as-is.

Majestic Labs is betting that a DRAM-centric, high-memory server can ease the LLM memory wall that constrains inference.