CUDA vs Tensor Cores: NVIDIA GPU Secrets

Prologue: "Never Running AI on CPU Again"

When the ChatGPT craze inspired me to run Llama-2 locally, I had a rude awakening.

M1 MacBook (CPU only):

>>> "Tell me about quantum physics"
[10s later] "Quantum"
[10s later] "physics"

10 seconds per word.

Then I borrowed an old RTX 3060 (CUDA + Tensor Cores):

>>> "Tell me about quantum physics"
[instantly] "Quantum physics is the study of very small particles..."
[full paragraph in 1 second]

100x difference.

"Why? Why so fast on GPU only?"

Why I Studied This

That's when GPU architecture fascinated me.

My questions:

Why use gaming GPUs for AI?
AMD GPUs exist — why only NVIDIA?
What are "CUDA cores" and "Tensor cores"?

The answers pointed to two pieces of NVIDIA's strategy: a software platform (CUDA) built years before the AI boom, and hardware (Tensor cores) added specifically for it.

What Confused Me

CPU cores vs GPU cores — what's actually different?
Is CUDA software or hardware?
If Tensor cores exist, do we still need CUDA cores?

Most importantly: "Why not AMD GPUs?"

The Aha Moment: "8 PhDs vs 5,000 Kids"

A senior engineer explained it like this:

CPU cores (8): 8 PhDs

Solve complex research papers (single-thread work) blazingly fast

But there are only 8, so simple repetitive tasks are slow

GPU cores (5,000): 5,000 elementary school students

Can't solve complex papers

But 5,000 simultaneous additions are faster than any PhD squad

AI training is essentially:

1 + 1 = ?
2 + 3 = ?
4 + 5 = ?
... (1 billion times)

8 PhDs solving these sequentially vs 5,000 kids solving simultaneously.

Obviously the latter wins.

After hearing this, it finally clicked why GPUs are perfect for AI. CPUs are optimized for completing one complex task quickly. GPUs are optimized for processing thousands of simple tasks simultaneously. AI training falls squarely into the latter category.

1. CUDA Cores: Army of Ants

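CUDA (Compute Unified Device Architecture) is NVIDIA's parallel computing platform and programming model, so CUDA itself is software. "CUDA cores" is the name NVIDIA gives to the general-purpose arithmetic units inside its GPUs, which is the hardware that CUDA programs run on.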

Spec Comparison

GPU Model	CUDA Cores	Price	Use
RTX 3060	3,584	$329	Gaming + light AI
RTX 3090	10,496	$1,499	Gaming + deep learning
A100 (datacenter)	6,912	$10,000+	Large-scale AI

You might wonder why the A100 has fewer CUDA cores than the RTX 3090. The answer: A100's Tensor core performance, memory bandwidth, and NVLink interconnect are far superior for AI workloads. Raw CUDA core count isn't the whole story.

NVIDIA's Strategy: CUDA Ecosystem Dominance

AMD also makes GPUs (Radeon). But AMD GPUs are rarely used for AI.

Reason: CUDA

NVIDIA has been distributing CUDA development tools for free since 2006.

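// CUDA code example
__global__ void addKernel(int *c, const int *a, const int *b, int n) {
    // Global index: block offset plus position within the block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        c[i] = a[i] + b[i];  // Each thread handles one element
    }
}

The key here is the __global__ keyword: it defines a function that runs on the GPU but is launched from the CPU. Each thread computes its own global index from blockIdx.x, blockDim.x, and threadIdx.x, so thousands of threads can each process their own element concurrently. The i < n check keeps threads in the last block from writing past the end of the arrays.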
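To make the indexing concrete, here is a minimal pure-Python emulation of a 1-D kernel launch. It is only a sketch: the launch helper and the block size of 4 are hypothetical choices for illustration, and a real GPU runs these threads in parallel rather than in a loop.

```python
def launch(kernel, grid_dim, block_dim, *args):
    """Emulate a 1-D CUDA launch by calling the kernel once per (block, thread) pair."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, block_dim, thread_idx, *args)


def add_kernel(block_idx, block_dim, thread_idx, c, a, b, n):
    # Same formula a CUDA kernel would use: blockIdx.x * blockDim.x + threadIdx.x
    i = block_idx * block_dim + thread_idx
    if i < n:  # Guard: the last block may contain threads past the end of the array
        c[i] = a[i] + b[i]


n = 10
a = list(range(n))
b = [10 * x for x in a]
c = [0] * n

block_dim = 4                                # threads per block (hypothetical choice)
grid_dim = (n + block_dim - 1) // block_dim  # ceil(n / block_dim) = 3 blocks
launch(add_kernel, grid_dim, block_dim, c, a, b, n)
print(c)
```

With 3 blocks of 4 threads there are 12 threads for 10 elements, which is exactly why the bounds check matters.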

All major AI frameworks (TensorFlow, PyTorch) were built with CUDA as their primary GPU backend. AMD's rival platform ROCm arrived in 2016, ten years later, and it still supports fewer GPUs and libraries.

Result: CUDA = the de facto GPU AI standard.

This is NVIDIA's most powerful moat. Even if AMD matches hardware performance, the software ecosystem is locked into CUDA, making it extremely difficult to switch.

2. Tensor Cores: Matrix Calculation Monsters

In 2017, NVIDIA added Tensor Cores to the Volta architecture.

Why Were They Needed?

AI's fundamental operation is Matrix Multiplication.

# Deep learning's core operation
output = weights @ inputs  # Matrix multiplication

A single fully connected layer in an image recognition model might involve:

Weights: 4,096 × 4,096 matrix
Inputs: 4,096 × 1 vector
Computation: ~16.8 million multiply-add operations (4,096 × 4,096)

CUDA cores can do this, but they work stitch by stitch — each core performs one floating-point multiply-add (FMA) per clock cycle.
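The operation count is easy to check. The sketch below is an illustration, not how any framework implements it: it runs a tiny matrix-vector product while counting multiply-adds, then applies the same rows × columns rule to the 4,096 × 4,096 case.

```python
def matvec(weights, vector):
    """Multiply a (rows x cols) matrix by a vector, counting multiply-adds."""
    fma_count = 0
    result = []
    for row in weights:
        acc = 0.0
        for w, x in zip(row, vector):
            acc += w * x  # one multiply-add
            fma_count += 1
        result.append(acc)
    return result, fma_count


# Small case that can be checked by hand
out, fmas = matvec([[1, 2], [3, 4]], [10, 20])
print(out, fmas)  # [50.0, 110.0] 4

# The 4,096 x 4,096 layer, counted without actually running it
rows = cols = 4096
layer_fmas = rows * cols
print(f"{layer_fmas:,}")  # 16,777,216
```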

Tensor Core Magic

Tensor cores multiply small matrix tiles in a single operation. On Volta, each Tensor core works on 4×4 tiles.

CUDA core:
1 multiply-add per clock

Tensor core (Volta):
One 4×4 by 4×4 tile product per clock
→ 64 multiply-adds per clock, versus 1 for a CUDA core

This is possible because Tensor cores natively support MMA (Matrix Multiply-Accumulate) operations at the hardware level: D = A × B + C, where A, B, C, and D are matrices, all processed in a single instruction.
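To show what that one instruction computes, here is a pure-Python sketch of D = A × B + C on 4×4 tiles, plus a larger matrix multiply built only from that tile operation. The function names are hypothetical, and real Tensor cores do this in hardware with reduced-precision inputs. This version only mirrors the arithmetic.

```python
TILE = 4


def mma_4x4(a, b, c):
    """D = A x B + C for 4x4 tiles: 4 * 4 * 4 = 64 multiply-adds in total."""
    return [
        [c[i][j] + sum(a[i][k] * b[k][j] for k in range(TILE)) for j in range(TILE)]
        for i in range(TILE)
    ]


def tile(m, r, c):
    """Extract the 4x4 tile whose top-left corner is at (r, c)."""
    return [row[c:c + TILE] for row in m[r:r + TILE]]


def tiled_matmul(a, b):
    """Multiply square matrices (size a multiple of 4) using only mma_4x4."""
    n = len(a)
    out = [[0] * n for _ in range(n)]
    mma_calls = 0
    for r in range(0, n, TILE):
        for c in range(0, n, TILE):
            acc = [[0] * TILE for _ in range(TILE)]
            for k in range(0, n, TILE):
                # Accumulate one tile product into the running result
                acc = mma_4x4(tile(a, r, k), tile(b, k, c), acc)
                mma_calls += 1
            for i in range(TILE):
                out[r + i][c:c + TILE] = acc[i]
    return out, mma_calls


a = [[i * 8 + j for j in range(8)] for i in range(8)]
identity = [[1 if i == j else 0 for j in range(8)] for i in range(8)]
product, calls = tiled_matmul(a, identity)
print(product == a, calls)  # True 8
```

An 8×8 product needs (8/4)³ = 8 tile operations, each standing in for 64 scalar multiply-adds.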

Tensor Core Evolution

Tensor cores have evolved through each GPU generation:

1st Gen - Volta (V100): FP16 matrix multiply with FP32 accumulation.
2nd Gen - Turing (RTX 20 series): Added INT8, INT4 support. Improved inference performance.
3rd Gen - Ampere (A100, RTX 30): Introduced structured sparsity, which skips zeroed weights for up to 2x throughput. New TF32 (19-bit) format lets existing FP32 code use Tensor cores without changes.
4th Gen - Hopper (H100) and Ada Lovelace (RTX 40): Added FP8 (8-bit) precision. Hopper's Transformer Engine switches between FP8 and FP16 automatically during LLM training for up to 6x speedup (NVIDIA's figure).
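The sparsity feature is worth a closer look. On Ampere it relies on a 2:4 pattern (a detail added here, not from the list above): in every group of 4 weights, at most 2 are non-zero. The sketch below shows one simple way to prune weights into that pattern by keeping the two largest magnitudes per group. It is an illustration only, and prune_2_of_4 is a hypothetical helper, not a library function.

```python
def prune_2_of_4(row):
    """Keep the 2 largest-magnitude values in every group of 4, zero the rest."""
    pruned = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # Indices of the two largest magnitudes within this group
        keep = sorted(range(len(group)), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return pruned


weights = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.25, 0.01]
sparse = prune_2_of_4(weights)
print(sparse)  # [0.9, 0.0, 0.0, -0.7, 0.0, 0.3, -0.25, 0.0]
```

Half the multiplications now involve a known zero, which is what lets the hardware skip them.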

This hardware evolution is what made monsters like GPT-4 possible.

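Real-World Speed Difference: RTX 2080 vs RTX 3060

When I ran Stable Diffusion (AI image generation):

RTX 2080 (Turing, 2nd-gen Tensor cores, 8GB)

512×512 image: 22 seconds

RTX 3060 (Ampere, 3rd-gen Tensor cores, 12GB)

512×512 image: 8 seconds (2.75x faster)

Both cards have Tensor cores and a broadly similar CUDA core count (2,944 vs 3,584), so this is not a clean with/without comparison. A gap this large more likely comes from whether the software actually ran in FP16 on the Tensor cores, along with the newer Tensor core generation and the extra VRAM on the 3060. Either way, the lesson holds: for AI workloads, what matters is whether Tensor cores are engaged, not the CUDA core count.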

3. FP16 vs FP32: Precision vs Speed

FP32 (32-bit floating point)

pi = 3.1415927  # About 7 significant digits

The default for graphics and most numerical code. Scientific and physics simulations that need more precision go up to FP64.

FP16 (16-bit floating point)

pi = 3.140625  # About 3 significant digits, but sufficient

For AI, this level of precision is more than enough.

"Cat probability: 99.123456%"
vs
"Cat probability: 99.12%"

→ Both result: "Cat"
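You can see the rounding for yourself without a GPU. This sketch uses Python's standard struct module, whose "e" and "f" formats are IEEE half and single precision, to round values to FP16 and FP32. The helper names are my own.

```python
import math
import struct


def to_fp16(x):
    """Round a Python float (FP64) to the nearest FP16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x):
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


print(to_fp32(math.pi))  # close to 3.1415927, about 7 digits survive
print(to_fp16(math.pi))  # 3.140625

# The classification example: the probability changes slightly, the answer does not
cat_prob = 0.99123456
print(to_fp16(cat_prob) > 0.5, to_fp32(cat_prob) > 0.5)  # True True
```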

The Strategy: Mixed Precision Training

FP16 calculation → 2x+ speed, half the memory

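The key technique is Mixed Precision Training: forward and backward passes use FP16 for speed, while a master copy of the weights is kept and updated in FP32 for accuracy. Because very small FP16 gradients can underflow to zero, frameworks also scale the loss up before the backward pass and scale the gradients back down before the update. You get the best of both worlds: speed and precision.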
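Why keep the weights in FP32 at all? A small sketch makes it clear, again using the struct module to emulate FP16 and FP32 rounding on the CPU. The update size of 1e-4 is a hypothetical value chosen for illustration. Near 1.0, FP16 values are about 0.001 apart, so an update that small is rounded away every single step.

```python
import struct


def to_fp16(x):
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x):
    return struct.unpack("<f", struct.pack("<f", x))[0]


update = 1e-4            # small per-step weight update (hypothetical value)
w_fp16 = to_fp16(1.0)    # weight stored and updated in FP16 only
w_master = to_fp32(1.0)  # FP32 master copy, as in mixed precision training

for _ in range(1000):
    w_fp16 = to_fp16(w_fp16 + update)      # rounds back to 1.0 every step
    w_master = to_fp32(w_master + update)  # accumulates normally

print(w_fp16)              # 1.0
print(round(w_master, 3))  # 1.1
```

After 1,000 steps the FP16-only weight has not moved, while the FP32 master copy has learned what it should.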

Precision	CUDA Core	Tensor Core
FP32	10 TFLOPS	-
FP16	20 TFLOPS	80 TFLOPS
FP8 (latest)	-	160 TFLOPS

For the RTX 4090:

FP32: ~80 TFLOPS
FP16 (Tensor): ~330 TFLOPS (4x faster)

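What does TFLOPS mean? One TFLOPS = one trillion floating-point operations per second. The RTX 4090's 330 TFLOPS of FP16 Tensor performance means up to 330 trillion individual multiplies and additions every second, at peak. Real workloads land below that because memory bandwidth often becomes the bottleneck, but this raw throughput is a large part of why AI training is so fast on GPUs.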
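As a rough feel for these numbers, the sketch below divides an operation count by peak throughput to get a best-case runtime. It uses the approximate RTX 4090 figures quoted above and the 4,096 × 4,096 layer from earlier. Treat the results as lower bounds, since they ignore memory traffic entirely.

```python
def ideal_seconds(flops, tflops):
    """Lower bound on runtime: total operations divided by peak throughput."""
    return flops / (tflops * 1e12)


# One 4,096 x 4,096 matrix-vector product: a multiply and an add per weight
flops = 2 * 4096 * 4096

t_fp32 = ideal_seconds(flops, 80)     # ~80 TFLOPS FP32 (figure from the text)
t_tensor = ideal_seconds(flops, 330)  # ~330 TFLOPS FP16 Tensor (figure from the text)

print(f"Operations: {flops:,}")
print(f"FP32:   {t_fp32 * 1e6:.2f} microseconds")
print(f"Tensor: {t_tensor * 1e6:.2f} microseconds")
```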

4. Hands-On: Running AI on Your GPU

Setup

# Check the NVIDIA driver and GPU
nvidia-smi

# Check the CUDA toolkit (only needed if you compile CUDA code yourself)
nvcc --version

# Install PyTorch with CUDA support
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118

Using Tensor Cores in Code

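import torch
import torch.nn as nn

# Stand-in network so the snippet runs; swap in your own model
model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU()).cuda().eval()
x = torch.randn(1, 3, 512, 512, device="cuda")

# Option A: pure FP16 inference (.half() casts weights and inputs)
with torch.no_grad():
    output = model.half()(x.half())

# Option B: automatic mixed precision with FP32 weights
model.float()
with torch.no_grad(), torch.autocast(device_type="cuda", dtype=torch.float16):
    output = model(x)

Calling .half() converts the model's weights from FP32 to FP16, which makes its matrix multiplications and convolutions eligible for Tensor cores. torch.autocast() instead leaves the weights in FP32 and casts per operation: matmul-like operations run in FP16, while numerically sensitive ones such as softmax and reductions stay in FP32. Use one or the other; combining them is redundant.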

Benchmark Results

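import time
import torch

# Assumes model_fp32 / input_fp32 and model_fp16 / input_fp16 already exist on the GPU
def bench(model, x, iters=100):
    with torch.no_grad():
        for _ in range(10):       # Warm-up runs, not timed
            model(x)
        torch.cuda.synchronize()  # GPU calls are asynchronous; wait before timing
        start = time.perf_counter()
        for _ in range(iters):
            model(x)
        torch.cuda.synchronize()  # Wait for all queued work to finish
    return time.perf_counter() - start

# FP32
print(f"FP32: {bench(model_fp32, input_fp32):.2f}s")
# Output: FP32: 12.34s

# FP16 (Tensor cores engaged)
print(f"FP16: {bench(model_fp16, input_fp16):.2f}s")
# Output: FP16: 4.21s (2.9x faster)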

One line of code change (.half()) yields a nearly 3x speedup. That's the power of Tensor cores.

5. Why Not AMD?

AMD has developed similar technology (Matrix Cores). But here's the reality:

Software Ecosystem Gap

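# NVIDIA
import torch
torch.cuda.is_available()  # True with the standard CUDA build

# AMD (ROCm build of PyTorch, mainly Linux)
import torch
torch.cuda.is_available()  # Also True: ROCm builds expose the same torch.cuda API through HIP
torch.version.hip          # Version string on ROCm builds, None on CUDA builds
# The friction is elsewhere: a shorter list of supported GPUs, and CUDA-only
# libraries or custom kernels that have no ROCm port
# (torch_directml, a separate install, is Microsoft's DirectML backend for Windows, not ROCm)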

Community Size

CUDA tutorials: Millions
ROCm tutorials: Thousands

When you hit a problem with CUDA, Stack Overflow has the answer in 5 minutes. With ROCm, you might be filing a GitHub issue and waiting days for a response.

Enterprise Reality

OpenAI and Meta train their models largely on NVIDIA A100/H100 GPUs. Google also uses its own TPUs, but still offers NVIDIA GPUs on GCP. Cloud providers (AWS, GCP, Azure) predominantly offer NVIDIA GPU instances.

AMD's MI250/MI300 datacenter GPUs are emerging, and ROCm is improving rapidly. But catching up to 15 years of accumulated CUDA ecosystem in a short time remains an enormous challenge.

6. GPU Selection Guide

Gaming Only

Focus on CUDA core count
RTX 4060 Ti (4,352 cores)

Light AI (Stable Diffusion, LLM Inference)

Tensor cores essential
RTX 3060 (12GB VRAM): the 12GB of VRAM is crucial, because 8GB cards cannot fit larger AI models
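A quick way to sanity-check whether a model fits is to multiply its parameter count by the bytes per parameter. The sketch below does that for a hypothetical 7-billion-parameter model. It counts weights only, so actual usage will be higher once activations and caches are included.

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}


def weight_gib(num_params, precision):
    """Memory needed for the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[precision] / 2**30


for precision in BYTES_PER_PARAM:
    print(f"7B params @ {precision}: {weight_gib(7e9, precision):.1f} GiB")
```

This is why lower-precision formats matter so much on consumer cards: each halving of the bytes per parameter halves the VRAM the weights need.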

Deep Learning Training

Tensor cores + 24GB+ VRAM
RTX 3090 / 4090

Enterprise-Scale AI

NVIDIA A100 / H100 (datacenter)
Key: linking multiple GPUs via NVLink for combined compute power

Final Thought: "NVIDIA's First-Mover Advantage"

Why NVIDIA dominates the AI era:

2006: Free CUDA distribution → ecosystem lock-in
2017: Tensor cores → AI-specific hardware
2020+: TF32 and sparsity (Ampere), then FP8 and the Transformer Engine (Hopper, 2022) → continuous optimization

AMD can't catch up easily because it's not just about hardware specs. The developer ecosystem — libraries, tutorials, community knowledge, enterprise integrations — must follow. NVIDIA has a 15-year head start on all of that.

When I first ran AI on a GPU, the experience was unforgettable.

"This is why everyone wants NVIDIA."