HBM vs GDDR: How High-Bandwidth Memory Breaks Through the "Memory Wall" Bottleneck in AI Training and Inference

Updated: 06/10/2026 05:33

In the trillion-parameter AI race, GPU computing power may be in the spotlight, but a more hidden component is quietly becoming the industry’s strategic high ground—High Bandwidth Memory (HBM). If a GPU is like a supercharged engine with thousands of cylinders, then HBM is the fuel system that keeps data flowing. No matter how powerful the engine, it can only idle if the fuel supply can’t keep up.

Industry consensus is shifting: the bottleneck for AI computing power increasingly lies not in the computational units themselves but in data transfer efficiency. By commonly cited estimates, data movement can account for 60%-80% of total system energy consumption in traditional computing architectures, and in bandwidth-bound inference scenarios GPU compute units can sit idle as much as 99% of the time. The key limiting factor behind both figures is memory bandwidth.

Leveraging 3D stacking and Through-Silicon Via (TSV) technology, HBM achieves far greater bandwidth and energy efficiency per unit area than conventional memory, making it a standard feature in AI accelerators from NVIDIA, AMD, Google, and other industry giants.

Technical Principles: How HBM Reshapes the Data Channel Between GPU and Memory

From "Flat Racetrack" to "Vertical Elevator"

HBM isn’t a new storage medium; it’s a set of interface and packaging specifications that define "how to interconnect DRAM at extremely high bandwidth." Its core technology stack breaks down into three layers:

3D Stacking — Multiple layers of DRAM chips are vertically stacked (mainstream configurations are currently 8 to 12 layers, with HBM4 advancing to 16 layers), multiplying storage density and parallel channel count within the same physical footprint.

Through-Silicon Via (TSV): Microscopic holes, just 5-10 microns in diameter, are etched through each DRAM die and filled with conductive material to create vertical channels, enabling tens of thousands of interconnections between layers. This stands in stark contrast to traditional PCB wiring, where trace lengths are measured in centimeters. TSV signal paths are compressed to the micron scale, dramatically reducing signal attenuation and the energy spent driving each bit.

Silicon Interposer — HBM stacks connect to a silicon interposer via micro bumps, which then links to GPU/CPU chips over extremely short distances, forming a unified packaging module. The entire structure uses advanced 2.5D packaging technologies like CoWoS for high-density integration.

The architecture’s breakthrough is in bus width. A single HBM stack, up to and including HBM3E, exposes a 1024-bit interface, and HBM4 doubles this to 2048 bits. For example, SK hynix’s mass-produced HBM3E stack delivers 24GB capacity and bandwidth exceeding 1TB/s. By comparison, a traditional GDDR chip exposes just 32 bits (or 384 bits in a multi-chip flagship configuration). A single HBM stack is therefore 32 times wider than a single GDDR chip, and GDDR has to make up the difference with much higher per-pin speeds.

HBM’s fundamental design philosophy is "wide and slow"—it achieves total bandwidth through massive parallel channels, each running at relatively low frequency, resulting in significantly better energy efficiency than high-frequency designs. GDDR, on the other hand, follows a "narrow and fast" logic—squeezing bandwidth from a few channels by ramping up operating frequency. These two approaches suit entirely different application scenarios: HBM pursues maximum throughput, while GDDR balances throughput and cost.
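The "wide and slow" versus "narrow and fast" contrast can be made concrete with the usual peak-bandwidth formula: bus width times per-pin data rate, divided by eight. The sketch below is illustrative only. The per-pin rates (9.6 Gb/s for HBM3E, 21 Gb/s for GDDR6X) are assumed values chosen to match the bandwidth figures quoted in this article, not numbers stated in it.

```python
def peak_bandwidth_gbs(bus_width_bits: int, pin_rate_gbps: float) -> float:
    """Peak bandwidth in GB/s = bus width (bits) x per-pin rate (Gb/s) / 8."""
    return bus_width_bits * pin_rate_gbps / 8


# Assumed per-pin data rates (illustrative, not taken from the article).
HBM3E_PIN_GBPS = 9.6
GDDR6X_PIN_GBPS = 21.0

hbm3e_stack = peak_bandwidth_gbs(1024, HBM3E_PIN_GBPS)   # one stack: wide, slow pins
gddr6x_chip = peak_bandwidth_gbs(32, GDDR6X_PIN_GBPS)    # one chip: narrow, fast pins
gddr6x_card = peak_bandwidth_gbs(384, GDDR6X_PIN_GBPS)   # twelve chips on a 384-bit bus

print(f"HBM3E stack : {hbm3e_stack:7.1f} GB/s")
print(f"GDDR6X chip : {gddr6x_chip:7.1f} GB/s")
print(f"GDDR6X card : {gddr6x_card:7.1f} GB/s")
```

Under these assumptions a single HBM3E stack lands near 1.2TB/s while running each pin at less than half the GDDR6X rate, and it takes a full 384-bit GDDR6X card to reach about 1TB/s.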

HBM vs GDDR6: The Battle of "Wide and Slow" vs "Narrow and Fast"

Both HBM and GDDR6 belong to the DRAM memory family, serving as data access channels for GPUs, but they differ fundamentally in design goals, performance characteristics, and cost structure.

Bandwidth: HBM3E delivers up to 1.2TB/s per stack, with next-generation HBM4 expected to surpass 2.0TB/s, and an accelerator typically carries several stacks. GDDR6X maxes out at about 1TB/s for an entire card, already approaching physical limits in flagship products. HBM is also markedly superior in energy efficiency per unit bandwidth, which translates directly into quantifiable operational cost advantages in large-scale AI data center deployments.

Power and Latency: Thanks to TSV’s ultra-short vertical paths, HBM consumes about 30% less power than GDDR5. Latency is a different story. Both HBM and GDDR are DRAM, so access latency for both sits in the nanosecond range (tens to hundreds of nanoseconds), not microseconds. The shorter signal path mainly saves I/O energy rather than access time, and HBM’s random access latency can even be slightly higher than GDDR’s. For large-scale parallel streaming access, the typical mode for AI training and inference, throughput rather than latency is the critical bottleneck.

Cost: This is HBM’s most obvious drawback. Industry data shows HBM costs over $25 per GB, while GDDR6 is only about $5-8 per GB. HBM can account for 60%-80% of total high-end GPU costs. GDDR6 actually offers better cost-per-bandwidth performance—when absolute peak bandwidth isn’t required, GDDR6 is clearly more cost-effective.
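The cost-per-bandwidth claim can be checked with rough arithmetic. The sketch below uses the article's own figures for HBM3E (24GB, 1.2TB/s, $25 per GB) and the midpoint of the quoted GDDR6 price range ($6.50 per GB). The GDDR6 chip size and speed (2GB, 64GB/s, i.e. a 32-bit chip at 16 Gb/s per pin) are assumptions added for illustration, and packaging, interposer, and board costs are ignored.

```python
def cost_per_tbs(capacity_gb: float, price_per_gb: float, bandwidth_gbs: float) -> float:
    """Memory cost in dollars per TB/s of peak bandwidth for one device."""
    device_cost = capacity_gb * price_per_gb
    return device_cost / (bandwidth_gbs / 1000)


# HBM3E stack: capacity, bandwidth, and price per GB as quoted in the article.
hbm_cost = cost_per_tbs(capacity_gb=24, price_per_gb=25.0, bandwidth_gbs=1200)

# GDDR6 chip: price is the midpoint of the article's $5-8 range;
# 2 GB and 64 GB/s per chip are assumed values.
gddr6_cost = cost_per_tbs(capacity_gb=2, price_per_gb=6.5, bandwidth_gbs=64)

print(f"HBM3E : ${hbm_cost:,.0f} per TB/s")
print(f"GDDR6 : ${gddr6_cost:,.0f} per TB/s")
```

With these inputs GDDR6 comes out at roughly $200 per TB/s against roughly $500 for HBM3E, which is consistent with the point that GDDR6 is the cheaper way to buy bandwidth when the absolute peak is not needed.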

In summary, choosing between HBM and GDDR is fundamentally a trade-off between performance boundaries and cost constraints. HBM is essential for scenarios where "a certain bandwidth threshold must be met to operate"—such as inference on trillion-parameter models. Below this bandwidth, the system simply won’t function effectively. GDDR6, meanwhile, serves scenarios where "acceptable performance at minimum cost" is the priority, such as deploying small to medium models (7B-13B parameters).

The two are not substitutes, but parallel technical routes for different needs. Yet in AI training and large-scale inference, HBM’s advantages are steadily pushing GDDR out of the core arena.

The "Memory Wall" Dilemma: Why HBM Demand Grows Exponentially with Larger AI Models

To understand the explosive growth in HBM demand, we need to revisit a fundamental bottleneck in AI computing—the "Memory Wall."

The Widening Gap Between Compute and Bandwidth Growth

Over the past thirty years, processor performance has doubled every 18-24 months in line with Moore’s Law, but memory bandwidth has lagged behind. Research on AI and the memory wall shows that peak hardware compute has grown about 3x every two years, while memory bandwidth has increased only about 1.6x, and interconnect bandwidth even less. Each hardware generation therefore widens the gap: compute units gain capability faster than memory can feed them.
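Compounding the two growth rates quoted above shows how quickly the gap opens. This is a simple extrapolation that assumes both rates stay constant, which is an idealization rather than a forecast.

```python
def growth_over(years: float, factor_per_two_years: float) -> float:
    """Cumulative growth after `years`, given a fixed multiplier per two years."""
    return factor_per_two_years ** (years / 2)


COMPUTE_FACTOR = 3.0     # about 3x every two years (from the text)
BANDWIDTH_FACTOR = 1.6   # about 1.6x every two years (from the text)

for years in (2, 6, 10):
    compute = growth_over(years, COMPUTE_FACTOR)
    bandwidth = growth_over(years, BANDWIDTH_FACTOR)
    print(f"{years:2d} years: compute x{compute:6.1f}, "
          f"bandwidth x{bandwidth:5.1f}, gap x{compute / bandwidth:5.1f}")
```

After ten years at these rates, compute has grown 243x but bandwidth only about 10.5x, so each byte per second of memory bandwidth has to feed roughly 23 times more compute.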

This contradiction is especially acute in inference. Training relies on matrix multiplication (GEMM), which has high compute density: arithmetic intensity can exceed 100 FLOPs/byte. Inference, however, centers on matrix-vector multiplication (GEMV), with arithmetic intensity often below 2 FLOPs/byte. The lower the arithmetic intensity, the more system performance depends on memory bandwidth rather than compute power. This is the memory wall in practice.
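The GEMM versus GEMV contrast can be sketched with a simple operation count and the roofline model, in which attainable performance is the lesser of peak compute and bandwidth times arithmetic intensity. The accelerator figures used here (1000 TFLOP/s peak, 3 TB/s bandwidth) are hypothetical round numbers, and the byte count assumes each FP16 matrix is moved exactly once with no cache reuse.

```python
def gemm_intensity(m: int, k: int, n: int, bytes_per_elem: int = 2) -> float:
    """FLOPs per byte for C[m,n] = A[m,k] @ B[k,n], each matrix moved once."""
    flops = 2 * m * k * n                       # one multiply and one add per term
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved


def attainable_tflops(intensity: float, peak_tflops: float, bandwidth_tbs: float) -> float:
    """Roofline model: capped by compute or by bandwidth x arithmetic intensity."""
    return min(peak_tflops, intensity * bandwidth_tbs)


# Hypothetical accelerator (assumed values, not from the article).
PEAK_TFLOPS = 1000.0
BANDWIDTH_TBS = 3.0

gemm = gemm_intensity(4096, 4096, 4096)   # training-style matrix-matrix product
gemv = gemm_intensity(4096, 4096, 1)      # decode-style matrix-vector product

for name, ai in (("GEMM", gemm), ("GEMV", gemv)):
    perf = attainable_tflops(ai, PEAK_TFLOPS, BANDWIDTH_TBS)
    print(f"{name}: {ai:8.2f} FLOPs/byte -> {perf:7.1f} TFLOP/s "
          f"({100 * perf / PEAK_TFLOPS:.1f}% of peak)")
```

Under these assumptions the square GEMM reaches the compute roof, while the GEMV sits at about 1 FLOP/byte and can use well under 1% of peak compute, however fast the arithmetic units are.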

The "Transfer Burden" of Large Model Inference

The basic process of large model inference is: for every generated token, all model parameters must be loaded from memory to the compute core (this holds for a dense model serving a single request, with no batching to share the load). Take the Llama 3 70B model as an example: at FP16 precision, weights total about 140GB. Each token generated requires moving all 140GB of parameters. To ensure a smooth experience of generating 30 tokens per second, the bandwidth between HBM and the compute core must support roughly 4.2TB of transfers per second.

This demand already exceeds the limits of current mainstream hardware. NVIDIA’s H100 SXM5 offers 3.35TB/s of HBM bandwidth, which is below the 4.2TB/s target, and its 80GB of HBM cannot hold 140GB of weights in the first place. In other words, even a top-tier AI accelerator cannot serve a 70B parameter model at this speed from a single card. As models scale to hundreds of billions, trillions, and beyond, required bandwidth will grow at least linearly with parameter count.
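The arithmetic behind these figures is short enough to write out. The sketch below assumes a dense model decoded one request at a time, with every weight read once per token and no KV-cache or activation traffic counted, and it uses decimal units (1TB = 10^12 bytes). Function names are illustrative.

```python
def weight_bytes(params: float, bytes_per_param: float) -> float:
    """Total size of the model weights in bytes."""
    return params * bytes_per_param


def required_bandwidth_tbs(model_bytes: float, tokens_per_s: float) -> float:
    """Bandwidth (TB/s) needed if every token reads all weights once."""
    return model_bytes * tokens_per_s / 1e12


def max_tokens_per_s(model_bytes: float, bandwidth_tbs: float) -> float:
    """Upper bound on decode speed for a given memory bandwidth."""
    return bandwidth_tbs * 1e12 / model_bytes


model = weight_bytes(params=70e9, bytes_per_param=2)    # 70B parameters at FP16
need = required_bandwidth_tbs(model, tokens_per_s=30)
ceiling = max_tokens_per_s(model, bandwidth_tbs=3.35)   # H100 SXM5 figure from the text

print(f"Weights           : {model / 1e9:.0f} GB")
print(f"Needed for 30 t/s : {need:.2f} TB/s")
print(f"Ceiling at 3.35   : {ceiling:.1f} tokens/s")
```

At 3.35TB/s the bandwidth-bound ceiling is just under 24 tokens per second, short of the 30 tokens per second target, before any other overhead is considered.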

Dual Constraints: Capacity and Bandwidth

Memory capacity is another critical factor. If a model’s total parameter size exceeds a single GPU’s HBM capacity, the model must be split across multiple GPUs for parallel operation, an approach known as model parallelism, of which tensor parallelism is a common form. But splitting introduces a new bottleneck: frequent communication of intermediate results between GPUs, which can ultimately drag down overall efficiency.

Thus, HBM’s value lies in two dimensions: bandwidth determines single-card inference speed and minimum latency, while capacity decides whether a model fits on a single card, how many cards are needed, and the cost of cross-card communication.
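The capacity side of this trade-off can be estimated in a few lines. The sketch below is a rough sizing aid under stated assumptions: 80GB of HBM per GPU and an 80% usable fraction (leaving headroom for KV cache and activations) are assumed values, and the throughput figure is an ideal bound that ignores the cross-card communication cost described above.

```python
import math


def gpus_needed(model_gb: float, hbm_gb_per_gpu: float, usable_fraction: float = 0.8) -> int:
    """Minimum number of GPUs whose usable HBM can hold the model weights."""
    usable_gb = hbm_gb_per_gpu * usable_fraction
    return math.ceil(model_gb / usable_gb)


def ideal_tokens_per_s(model_gb: float, n_gpus: int, bandwidth_tbs_per_gpu: float) -> float:
    """Bandwidth-bound decode ceiling with weights sharded evenly, no comms cost."""
    aggregate_gbs = n_gpus * bandwidth_tbs_per_gpu * 1000
    return aggregate_gbs / model_gb


MODEL_GB = 140        # 70B parameters at FP16, as in the text
HBM_GB = 80           # assumed per-GPU HBM capacity
BANDWIDTH_TBS = 3.35  # per-GPU HBM bandwidth quoted in the text

n = gpus_needed(MODEL_GB, HBM_GB)
print(f"GPUs needed          : {n}")
print(f"Ideal decode ceiling : {ideal_tokens_per_s(MODEL_GB, n, BANDWIDTH_TBS):.1f} tokens/s")
```

With these assumptions the model needs three cards rather than two once headroom is reserved, and the real throughput will sit below the ideal ceiling by however much the inter-GPU communication costs.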

Industry direction is clear: HBM is shifting from "premium option" to "standard configuration" for AI computing power. TrendForce data shows HBM demand will grow over 130% year-over-year in 2025, and continue rising by more than 70% in 2026. HBM has moved from a supporting role in graphics processing to an irreplaceable core component in the AI compute chain.

Industry-Wide Impact: From Technical Choices to Market Supply-Demand Imbalance

Market Expansion

HBM’s market growth is outpacing early forecasts from most institutions. SEMI China data projects the HBM market will grow 58% to $54.6 billion by 2026, nearly 40% of the total DRAM market. Micron estimates HBM’s TAM (Total Addressable Market) will grow at a compound annual rate of about 40%, from $35 billion in 2025 to $100 billion in 2028—surpassing the entire DRAM market size in 2024.
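The compound annual growth rate implied by the two TAM endpoints can be checked directly. This is plain arithmetic on the figures quoted above, treating 2025 to 2028 as three annual periods.

```python
def cagr(start: float, end: float, years: int) -> float:
    """Compound annual growth rate between two values over `years` periods."""
    return (end / start) ** (1 / years) - 1


rate = cagr(start=35.0, end=100.0, years=3)   # $35B in 2025 to $100B in 2028
print(f"Implied CAGR: {rate:.1%}")
```

The result is a little under 42% per year, consistent with the "about 40%" figure.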

Rigid Supply Constraints

But surging demand is clashing with rigid supply-side capacity. SEMI data shows that although Samsung, SK hynix, and Micron have shifted 70% of new/adjustable capacity toward HBM production, overall HBM capacity shortfall remains at 50%-60%.

The bottleneck stems from the high barriers to HBM manufacturing. Production requires advanced DRAM process technology (leading vendors are now at the 1β, or 1-beta, node), plus TSV etching, micro bump bonding, wafer-level packaging, and other advanced packaging technologies. TSMC’s CoWoS packaging capacity, the core platform for integrating HBM and GPUs, is projected to expand to over 125,000 wafers per month by late 2026, up about 79% year-over-year, but is still expected to fall short of order demand from NVIDIA, AMD, Broadcom, and others.

Supply Chain Risks and Price Transmission

Capacity shortages are reflected directly in pricing. HBM3E prices rose 5%-10% during 2025. More importantly, as the three major manufacturers shift capacity to HBM, consumer DDR memory supply shrinks, with prices expected to keep rising through late 2026. HBM shortages are impacting the broader memory industry by squeezing out capacity.

In June 2026, Jensen Huang confirmed that SK hynix, Samsung, and Micron have all passed certification and begun volume shipments of HBM4 chips, with Samsung leading the industry by starting HBM4 mass production in February 2026. Yet even with all three giants expanding simultaneously, the supply-demand gap for HBM is expected to remain at about 50% through 2026. Achieving supply-demand balance in the short term remains difficult. The pace of upstream expansion, packaging capacity bottlenecks, and rapidly growing downstream AI compute demand together create a dynamic but persistently tight supply-demand landscape.

Conclusion

From fundamental technological innovation, to rigid dependence in AI compute scenarios, to supply-demand imbalance across the entire industry chain, HBM has evolved from a branch of memory technology into the core battleground of AI infrastructure competition.

HBM’s irreplaceability in AI training and inference stems from a basic computing principle: once model parameter size crosses a certain threshold, bandwidth is no longer an "optimization," but an "enabler"—below the threshold, the system simply won’t run effectively. GDDR6 may have a cost advantage, but its narrow-channel, high-frequency architecture can’t match the bandwidth ceiling and energy efficiency needed for trillion-parameter models. This structural difference means HBM and GDDR are not simply competitors, but layered solutions for different requirements in the AI compute core.

Looking ahead, ongoing mass production of HBM4 (with single-stack bandwidth expected to exceed 2TB/s), maturation of 16-layer stacking, and new packaging technologies like hybrid bonding will further push HBM’s performance ceiling. However, it’s worth noting that companies like Huawei are exploring algorithmic optimizations to reduce reliance on HBM, and alternatives such as SRAM and compute-in-memory architectures are advancing in parallel. Whether HBM can maintain its lead through technological iterations, and whether its supply bottlenecks can be eased in future expansion cycles, will be among the most important variables to watch in the AI compute industry over the next several years.
