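CPU vs GPU: 1 Einstein vs 10,000 Elementary School Students (Complete Guide)

Prologue - "Why won't deep learning run on my i9 laptop?"

When I first got into deep learning, I had a baffling experience. I had splurged on a laptop with an i9 CPU and tried training a model on it. The CPU is the brain of the computer, so I assumed a top-of-the-line chip would obviously be fast. Instead, the estimated time came up as "720 hours." Was I supposed to wait a month?

Flustered, I asked a senior in my lab, who laughed and lent me a server. What made that server different was not its CPU but its GPU, a gaming graphics card. To my surprise, the job my laptop said would take a month finished on that server in 30 minutes.

I still remember the shock. Why would a graphics card built for games handle this kind of math more than a thousand times faster than a CPU? I spent weeks wrestling with hardware documentation to answer that question, and it came down to this: the difference between "one fast genius" and "ten thousand slow soldiers."

This post is a write-up of what I learned.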

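1. Struggle and Frustration - Why did my expensive CPU lose?

At first I felt cheated. The i9 always sat near the top of the benchmarks, and its single-core performance was outstanding. Yet for training a deep learning model it was hopeless. What was the problem?

What I missed

The issue was what kind of work needs to be fast. A CPU is designed to finish one complex task quickly. Branch prediction, out-of-order execution, deep pipelines, and large caches all exist to push a single stream of instructions through as fast as possible.

Deep learning training, by contrast, repeats simple operations hundreds of millions of times or more. Nearly all of it is matrix multiplication: multiply, add, multiply, add. For that kind of work, ten thousand ordinary workers are far more efficient than one genius.

Once this sank in, my view of computer architecture changed completely.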

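2. Core Concept - Einstein vs an army of elementary school students

The first time I heard this analogy, everything became clear.

CPU: 4 Professor Einsteins

Ability: calculus, quantum mechanics, philosophy, complex logical reasoning
Trait: each one works on one problem at a time, but extremely fast
Weakness: hand them 1 million additions and, genius or not, they have to grind through them one after another

GPU: 10,000 elementary school students

Ability: no calculus; only simple arithmetic such as addition and subtraction
Trait: individually slow and not very clever, but overwhelming in number
Strength: hand them 1 million additions and the 10,000 of them take 100 each and finish almost immediately

Why do games and AI use the same hardware?

At first I did not get it. What could games and artificial intelligence possibly have in common? Then it clicked: both are "simple grunt work."

A 4K monitor has about 8.3 million pixels (3840 × 2160). Updating each pixel's RGB value is not complicated calculus. It is the same simple operation, something like R += 10, G -= 5, repeated for every pixel. The matrix multiplication in AI is the same: millions of multiplications and additions.

For this kind of simple repetitive work, ten thousand minimum-wage part-timers (the GPU) are far more efficient than a highly paid Einstein (the CPU). The analogy landed perfectly for me.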

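3. The Evolution of the GPU - From gaming add-on to the heart of AI

Here is a timeline of how the GPU got to where it is.

1996 - 3Dfx Voodoo (the ancestor of 3D acceleration)

One of the first widely successful add-in cards to take 3D rendering work off the CPU. It existed purely for games such as Quake and Tomb Raider. Drawing triangles was its one trick.

1999 - NVIDIA GeForce 256 (the term "GPU" goes mainstream)

The first consumer card to handle Transform & Lighting in hardware. NVIDIA marketed it as "the world's first GPU," which is what made the term stick. At this point it was still a graphics-only card.

2006 - GeForce 8800 GTX & CUDA (the start of the revolution)

NVIDIA's first unified shader architecture arrived. More important was the announcement of CUDA alongside it.

CUDA grew out of the idea "let's make the GPU do math in C, not just graphics." It made GPGPU (general-purpose computing on GPUs) practical, and scientists began using GPUs like small supercomputers. If I had to pick one moment, this was the GPU's real turning point.

2012 - AlexNet (the AI big bang)

Professor Geoffrey Hinton's team trained a deep learning model on two GPUs and won the ImageNet image recognition competition by a wide margin. After that, the idea that GPUs are essential for AI spread worldwide. This paper is why I started studying GPUs myself.

2018 - RTX 2080 (ray tracing)

It shipped with RT Cores for tracing the path of light and Tensor Cores for AI math. Tensor Cores had first appeared in the data center V100 in 2017, and this generation brought them to consumer cards. Hardware acceleration started to specialize, with graphics and AI each getting dedicated units.

2022 - H100 (the AI-only monster)

Games? Not its concern. A data center GPU designed for AI workloads, priced in the tens of millions of won. The GPU is now less a graphics device than the heart of AI.

Putting this timeline together, I realized the GPU had not so much "evolved" as been "reborn."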

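4. Digging into the Architecture - Moore's Law is dead

Why did parallel processing suddenly become so important? To answer that, you need to understand Moore's Law.

Intel co-founder Gordon Moore observed that the number of transistors on a chip doubles roughly every two years; the popular "performance doubles every 18 months" version is a later paraphrase. From the 1970s through the early 2000s, shrinking transistors also allowed higher clock speeds, so CPUs kept getting faster.

Around 2005, however, CPU clock speeds stopped improving. The cause was a physical limit, above all heat (the thermal wall). Pushing the clock higher made chips draw more power and produce more heat than could be removed.

The engineers' choice - "one fast core" → "many slower cores"

If you cannot make a CPU core faster, put several of them together. That was the start of the multi-core era, and the GPU pushed the idea to its extreme.

Architecture comparison

With that background, the difference in design philosophy is clear.

CPU: latency-oriented (low latency)

Large cache hierarchy (L1/L2/L3)
Complex control unit
Branch predictor
Out-of-order execution
Goal: "fast response," finishing a single task as quickly as possible

GPU: throughput-oriented (high throughput)

Minimal cache and control logic per core
The freed-up die area is packed with ALUs (arithmetic units)
Warp schedulers that keep thousands of threads in flight
Goal: "bulk processing," getting a large number of tasks done per unit of time

Once I took this difference in, I understood why CPUs and GPUs can differ by a factor of hundreds on the same problem.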

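5. Difference in Execution Model: SISD vs SIMD

It is clearer in code.

CPU: SISD (Single Instruction Single Data)

By default, one instruction processes one piece of data. Modern CPUs do have SIMD instructions such as AVX, but they are narrow compared with a GPU.

# CPU style - sequential execution
import time

def cpu_add(n):
    a = [1] * n
    b = [2] * n
    c = [0] * n

    # A single thread walks through all n elements one by one
    start = time.perf_counter()
    for i in range(n):
        c[i] = a[i] + b[i]

    print(f"CPU time: {time.perf_counter() - start:.3f} s")

# With n = 1,000,000 this takes a noticeable fraction of a second in pure Python.
# Much of that is interpreter overhead, but the one-at-a-time structure is the point.
cpu_add(1_000_000)

The first time I ran this I thought, "what is slow about this?" Then I saw the GPU version, and it was a shock.

GPU: SIMD (Single Instruction Multiple Data)

Issue one instruction ("add!") and thousands of threads execute it at the same time, each on its own piece of data.

// GPU style: parallel execution (CUDA C++)
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// __global__ marks a function (kernel) that runs on the GPU
__global__ void gpu_add(const int *a, const int *b, int *c, int n) {
    // threadIdx.x: index of this thread within its block
    // blockIdx.x: index of this block within the grid
    // blockDim.x: number of threads per block

    // Unique global index of this thread (my position on the whole field)
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Handle only my own element; the check stops surplus threads in the last block
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}

/*
CPU: a for-loop handles the elements one at a time
GPU: "Everyone, take positions 0 through 999,999!" -> "Go!" (executed in parallel)

NVIDIA calls this model SIMT (Single Instruction, Multiple Threads).
*/

// Host code (runs on the CPU)
int main() {
    int n = 1000000;
    size_t bytes = n * sizeof(int);

    // Host buffers
    int *h_a = (int *)malloc(bytes), *h_b = (int *)malloc(bytes), *h_c = (int *)malloc(bytes);
    for (int i = 0; i < n; i++) { h_a[i] = 1; h_b[i] = 2; }

    // Allocate GPU memory and copy the inputs over
    int *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // Kernel launch: 977 blocks of 1024 threads (rounded up to cover n)
    int threadsPerBlock = 1024;
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
    gpu_add<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_c, n);

    // Copy the result back (this call waits for the kernel to finish)
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %d, c[n-1] = %d\n", h_c[0], h_c[n - 1]);

    // 1,000,448 threads were launched; the hardware runs them in warps of 32,
    // as many at a time as fit on the GPU. No loop in the kernel.
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}

I will never forget the thrill of first understanding this code. There is no loop in the kernel. Every element gets its own thread, and the hardware runs those threads in parallel. That is the essence of the GPU.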

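6. A Closer Look at the CUDA Memory Hierarchy

The hardest part of GPU programming is memory management. It took me quite a while to understand it.

Like a CPU, a GPU has a memory hierarchy.

1. Global Memory (VRAM)

Largest but slowest (plays a role similar to RAM)
Accessible by all threads
Latency of hundreds of clock cycles
Example: the RTX 4090 has 24GB of GDDR6X

2. Shared Memory

Very fast on-chip memory shared by the threads within one thread block
Acts like an L1 cache, but the developer controls it directly
Latency of tens of clock cycles
Size: typically up to 48KB per block by default; newer architectures allow more with an opt-in

3. Registers

Private to each thread
Fastest of all
Limited per thread; using too many reduces how many threads can run at once

The core of CUDA optimization is "cut down on slow Global Memory accesses and make use of Shared Memory." The standard technique for this is tiling.

// Matrix multiplication using Shared Memory (simplified, square N x N matrices)
#define TILE_SIZE 16

// Launch with dim3 block(TILE_SIZE, TILE_SIZE) and a grid that covers the N x N output
__global__ void matmul_shared(const float *A, const float *B, float *C, int N) {
    // Shared Memory (shared by all threads in the block)
    __shared__ float As[TILE_SIZE][TILE_SIZE];
    __shared__ float Bs[TILE_SIZE][TILE_SIZE];

    int tx = threadIdx.x, ty = threadIdx.y;
    int row = blockIdx.y * TILE_SIZE + ty;   // output row this thread computes
    int col = blockIdx.x * TILE_SIZE + tx;   // output column this thread computes
    float sum = 0.0f;

    int numTiles = (N + TILE_SIZE - 1) / TILE_SIZE;
    for (int tile = 0; tile < numTiles; tile++) {
        // Global Memory -> Shared Memory (pad with zeros past the matrix edge)
        int aCol = tile * TILE_SIZE + tx;
        int bRow = tile * TILE_SIZE + ty;
        As[ty][tx] = (row < N && aCol < N) ? A[row * N + aCol] : 0.0f;
        Bs[ty][tx] = (bRow < N && col < N) ? B[bRow * N + col] : 0.0f;

        // Synchronize: wait until every thread has finished copying
        __syncthreads();

        // Compute from Shared Memory (far lower latency than Global Memory)
        for (int k = 0; k < TILE_SIZE; k++) {
            sum += As[ty][k] * Bs[k][tx];
        }

        // Synchronize again before the tiles are overwritten
        __syncthreads();
    }

    if (row < N && col < N) {
        C[row * N + col] = sum;
    }
}

When I understood this pattern, I accepted that GPU programming is, in the end, a memory game.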

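7. The Graphics API Wars: DirectX vs Vulkan vs Metal vs CUDA

To give a GPU work, you need a language (an API). Each API has its own philosophy.

DirectX (Microsoft)

The standard for Windows games
Convenient and powerful
Xbox uses it too
Downside: Windows only

OpenGL

The older cross-platform standard
Dated design with high draw call overhead
Still found in legacy code, but new projects mostly pick Vulkan, DirectX 12, or Metal

Vulkan

The next-generation standard
Hard to learn, but much lower CPU overhead
Widely used on Android and Linux
Strong multithreading support

Metal (Apple)

Mac/iPhone only
Optimized for Apple Silicon (M1/M2/M3)
Swift/Objective-C API
Closed, but performs well

CUDA (NVIDIA)

Built for compute, not games
The de facto standard in AI/deep learning
An extension of C++, so (relatively) easy to learn
Downside: runs only on NVIDIA GPUs

Compute Shader

In the old days, computing on a GPU meant disguising your numbers as pixel colors and drawing a picture with them. It was a painful hack.

With compute shaders, you can use the GPU as a pure calculator, independent of the graphics pipeline. Unity and Unreal Engine use this feature to run flocking simulations with tens of thousands of fish.

To sum up this part: the GPU is no longer just for graphics. It is a general-purpose parallel computer.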

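8. How Generative AI Works

How does AI image generation such as Stable Diffusion actually run? It took me a while to understand this.

The denoising process

Start from random noise: the image begins like a picture covered in sand
Gradual refinement: the noise is removed a little at a time until the picture emerges
50 iterations: this process is repeated for 50 steps
U-Net computation: each removal step costs billions of multiply-add operations

On a CPU, producing one image takes on the order of 10 minutes; on a GPU, about 5 seconds. The exact numbers depend heavily on the hardware and settings.

The magic of Tensor Cores

The Tensor Cores in the RTX series are built for low-precision matrix multiplication such as FP16 (half precision). Most of the math in AI training and inference does not need full 32-bit precision; 16-bit inputs are usually enough, typically with results accumulated at higher precision. Tensor Cores accelerate this at the hardware level.

# Stable Diffusion's core loop (heavily simplified, runnable toy)
import random

def unet(latent, timestep, text_embedding):
    # Stand-in for the real U-Net, which predicts the noise present in `latent`
    return [0.1 * x for x in latent]

latent = [random.gauss(0.0, 1.0) for _ in range(16)]  # start from pure random noise
text_embedding = None      # placeholder for the encoded prompt
scheduler_scale = 1.0      # placeholder for the scheduler's per-step scaling

for timestep in reversed(range(50)):  # 50 denoising steps, noisiest first
    # U-Net forward pass
    noise_pred = unet(latent, timestep, text_embedding)

    # Subtract the predicted noise
    latent = [x - scheduler_scale * n for x, n in zip(latent, noise_pred)]

# In the real model each step is billions of multiply-adds,
# which is impractically slow without a GPU

This is the power of FP16 and Tensor Cores. To sum up: generative AI at today's scale would not be practical without GPUs.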

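9. Gaming (GeForce) vs Professional (Quadro/RTX A-Series)

"Why does a Quadro cost 5 times more when it is the same chip?" The first time I was asked this, I did not know either. It turns out there are reasons.

GeForce (gaming)

Purpose: good frame rates are all that matters
Trait: an occasional broken texture or calculation error is tolerable (nobody dies if you clip through a wall in a game)
Price: relatively cheap (around 1 million won)

Quadro / RTX A-Series (professional)

Purpose: must never be wrong (architectural design, automotive simulation, medical imaging)
Traits:
1. ECC memory: corrects memory bit errors on its own (a bit flip while designing a spacecraft is a disaster)
2. Precision: some professional and data center parts offer much higher FP64 (double precision) throughput, though many current workstation cards do not
3. Drivers: certified for CAD/3D tools (Maya, SolidWorks, Catia)
4. Warranty: 5-year warranty, 24/7 technical support
Price: expensive (5 million won to tens of millions of won)

Conclusion: tools that earn money are expensive. Tools for a hobby are cheap. That is how the market works.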

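10. In Practice - When to use the CPU and when to use the GPU

Here is how to choose in practice.

When to use the CPU

Complex branching logic (if-else hell)
Sequential dependencies (the next step needs the previous result)
Small datasets (the parallelization overhead outweighs the gain)
I/O-bound work (file reads/writes, networking)

Examples: web servers, database queries, running the OS

When to use the GPU

Independent operations over large amounts of data (no element depends on another)
Matrix operations (deep learning, linear algebra)
Graphics rendering (obviously)
Cryptocurrency mining (repeated simple hashing)

Examples: AI training/inference, video encoding, physics simulation

Tips

# gpu_kernel and simple_calc are placeholders, not real library calls
def simple_calc(x):
    return x * 2

def gpu_kernel(batch):
    # Stand-in: a real version would copy `batch` to the GPU and launch a kernel
    return [x * 2 for x in batch]

data = list(range(1_000_000))

# Bad: one GPU launch per tiny piece of work (overhead dominates)
for i in range(100):
    gpu_kernel(data[i:i + 1])

# Good: CPU for small tasks
results = [simple_calc(data[i]) for i in range(100)]

# Good: GPU for massive tasks, one launch for all 1 million items
results = gpu_kernel(data)

The lesson: the choice of tool depends on the nature of the work.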

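11. Wrap-up - Right Tool for the Right Job

"Which is better, the CPU or the GPU?" is the wrong question. It is like asking "which is better, a Ferrari or a bus?"

For a date, the Ferrari (CPU) is better. For a school trip, the bus (GPU) is better.

As a developer, you need to work out what kind of code you have.

Lots of complex branching and sequential logic? → Optimize for the CPU (improve the algorithm)
Lots of independent operations over large data? → Accelerate on the GPU (CUDA, Metal, Compute Shader)

A computer does not work alone. The best performance comes when the genius professor (CPU) and the tens of thousands of students (GPU) work together.

In the end it came down to this: hardware is only a tool, and the real skill is putting the right one in the right place.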

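Summary comparison table

Characteristic	CPU (Central Processing Unit)	GPU (Graphics Processing Unit)
Core count	Few (4 to 64)	Very many (thousands to tens of thousands)
Per-core performance	Very high (genius)	Low (elementary school student)
Primary goal	Low latency (fast response)	High throughput (bulk processing)
Control logic	Complex (branch prediction, out-of-order execution)	Simple
Cache memory	Large (L1/L2/L3)	Small (relies on shared memory)
Instruction model	SISD (or limited SIMD)	SIMT (Single Instruction Multiple Threads)
Best at	Running the OS, web servers, complex logic	Graphics rendering, deep learning, cryptocurrency mining
Analogy	Ferrari, Einstein	Bus, army of elementary school students
Price (consumer)	300,000 to 2,000,000 won	500,000 to 3,000,000 won
Price (server)	1,000,000 to 10,000,000 won	5,000,000 to 50,000,000 won (H100)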

CPU vs GPU: 1 Einstein vs 10,000 Students — My Learning Journey

Prologue: The Day My i9 Laptop Failed Me

When I first tried deep learning, I made an expensive mistake. I bought a top-tier i9 laptop, thinking, "The CPU is the brain of the computer, so the best CPU must be the fastest."

I launched my first training job. The estimated time: 720 hours. A full month.

Confused and frustrated, I asked a senior researcher. He laughed and lent me a server. It had a gaming graphics card — a GPU. The same task that would've taken 30 days on my laptop finished in 30 minutes.

I was shocked. Why was a card designed for playing video games thousands of times better at math than my expensive CPU?

That question haunted me for weeks. I dug into hardware documentation, benchmark tests, and architecture papers. And then it clicked — the difference between "one genius" and "ten thousand workers."

This post is my attempt to share that revelation.

1. The Struggle: Why Did My Expensive CPU Lose?

At first, I felt betrayed. My i9 processor was always at the top of benchmarks. Single-core performance? Unmatched. But when it came to deep learning, it was completely powerless.

What went wrong?

What I Missed

The problem wasn't speed. It was what kind of speed mattered.

CPUs are designed to finish one complex task as fast as possible. Branch prediction, out-of-order execution, deep pipelines, and large caches all exist to push a single stream of instructions through quickly.

But deep learning training is about repeating simple operations billions of times. Matrix multiplication. Multiply, add, multiply, add, multiply, add... For this kind of work, one genius is far less efficient than ten thousand average workers.

When I finally understood this, my entire view of computer architecture changed.

2. The Core Concept: Einstein vs An Army of Students

This analogy made everything clear.

CPU: 4 Einstein Professors

Ability: Calculus, quantum physics, philosophy, complex logic
Strength: Solves one problem at a time at light speed
Weakness: If you give them 1 million simple addition problems, even a genius has to solve them one by one

GPU: 10,000 Elementary School Students

Ability: Can't do calculus. Only addition and subtraction
Strength: Individually slow and simple, but overwhelming in numbers
Power: Give them 1 million addition problems, and 10,000 kids each solve 100 problems — done in seconds

Why Graphics and AI Use the Same Hardware

At first, this confused me. What do video games and artificial intelligence have in common?

Then I realized — both are "simple, repetitive labor."

A 4K monitor has 8.3 million pixels. Changing each pixel's RGB value isn't rocket science. It's just simple arithmetic: R += 10, G -= 5, repeat 8 million times. AI's matrix multiplication is the same. Millions of multiplications and additions.

For this kind of "manual labor," 10,000 minimum-wage workers (GPU) are far more efficient than one high-salary genius (CPU).

This metaphor clicked for me perfectly.
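The arithmetic behind the analogy can be sketched in a few lines of Python. The 3840 x 2160 resolution is the usual 4K UHD figure, the worker count of 10,000 comes from the analogy rather than from any particular GPU, and `split_evenly` is a hypothetical helper named only for this illustration.

```python
import math

def split_evenly(total_items: int, workers: int) -> int:
    """Return the largest share any single worker gets when work is split evenly."""
    return math.ceil(total_items / workers)

# 4K UHD resolution (assumed here as 3840 x 2160)
pixels = 3840 * 2160

# One worker would have to touch every item; 10,000 workers each take a small slice
additions_each = split_evenly(1_000_000, 10_000)
pixels_each = split_evenly(pixels, 10_000)

print(f"4K pixels: {pixels:,}")
print(f"1,000,000 additions / 10,000 workers = {additions_each} each")
print(f"{pixels:,} pixels / 10,000 workers = at most {pixels_each} each")
```

The point is only that the per-worker share stays tiny even when the total is in the millions, which is what makes slow workers acceptable.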

3. GPU Evolution: From Gaming Accessory to AI's Heart

How did GPUs get here? Let me trace the timeline.

1996: 3Dfx Voodoo (The Dinosaur)

One of the first widely successful consumer 3D accelerator cards. It did one thing, drawing triangles, for games like Quake and Tomb Raider.

1999: GeForce 256 (The First "GPU")

NVIDIA marketed the GeForce 256 as "the world's first GPU," which popularized the term. It was the first consumer card with hardware Transform & Lighting. Still just graphics.

2006: GeForce 8800 GTX & CUDA (The Revolution)

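NVIDIA's first unified shader architecture. More importantly, CUDA was announced alongside it.

CUDA's pitch: "Let's use GPUs for math, not just graphics." Researchers had already been coaxing computation out of graphics shaders, but CUDA made GPGPU (general-purpose computing on GPUs) practical with a C-like language. Scientists started using GPUs as small supercomputers. In my view, this was the real turning point.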

2012: AlexNet (The AI Big Bang)

Geoffrey Hinton's team trained a deep learning model on 2 GPUs and crushed the ImageNet competition. After this, "GPUs are essential for AI" became gospel. I started studying GPUs because of this paper.

2018: RTX 2080 (Ray Tracing Era)

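RT Cores for ray tracing and Tensor Cores for AI. Tensor Cores had first appeared in the data center V100 in 2017, and this generation brought them to consumer cards. Hardware acceleration became specialized: graphics and AI each got dedicated units.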

2022: H100 (The AI Monster)

Gaming? Irrelevant. Designed purely for AI training in data centers. Costs tens of thousands of dollars. GPUs are no longer graphics cards — they're AI engines.

When I mapped this timeline, I realized GPUs didn't just "evolve" — they were reborn.

4. Deep Dive: Moore's Law is Dead

Why did parallel processing suddenly matter?

To answer this, we need to understand Moore's Law.

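Intel co-founder Gordon Moore observed that the number of transistors on a chip doubles roughly every two years. The popular "performance doubles every 18 months" version is a later paraphrase. From the 1970s to the early 2000s, shrinking transistors also allowed higher clock speeds, so CPUs kept getting faster.

But around 2005, CPU clock speeds stopped increasing. We hit a physical limit, the thermal wall. Transistor counts kept growing, but pushing the clock higher made chips draw more power and produce more heat than could be removed.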

Engineers' Pivot: "One Fast Core" → "Many Slow Cores"

If we can't make CPUs faster, let's add more of them. This kicked off the multi-core era. GPUs took this concept to the extreme.

Architecture Comparison

Now the design philosophies became crystal clear.

CPU: Latency-Oriented (Low Latency)

Massive cache hierarchy (L1/L2/L3)
Complex control unit
Branch predictor
Out-of-order execution
Goal: "Fast response time" — finish one task as quickly as possible

GPU: Throughput-Oriented (High Throughput)

Minimal cache and control logic
Pack the remaining space with ALUs (arithmetic units)
Warp scheduler managing thousands of threads simultaneously
Goal: "Mass processing" — handle many tasks at once

Understanding this difference explained why CPUs and GPUs can have 100x performance differences on the same problem.
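A toy model makes the latency-versus-throughput trade-off concrete. The numbers below are invented for illustration (4 workers at 1 ns per operation versus 10,000 workers at 100 ns per operation) and do not describe any real chip; `batch_time_ns` is a hypothetical helper that assumes fully independent operations and no overhead.

```python
import math

def batch_time_ns(n_ops: int, workers: int, ns_per_op: float) -> float:
    """Time for `workers` identical workers to finish n_ops independent operations."""
    rounds = math.ceil(n_ops / workers)   # in each round, every worker does one op
    return rounds * ns_per_op

# Made-up figures: a few fast cores vs many slow ones
CPU = dict(workers=4, ns_per_op=1.0)
GPU = dict(workers=10_000, ns_per_op=100.0)

for n_ops in (1, 1_000, 1_000_000):
    cpu = batch_time_ns(n_ops, **CPU)
    gpu = batch_time_ns(n_ops, **GPU)
    winner = "CPU" if cpu < gpu else "GPU"
    print(f"{n_ops:>9,} ops: CPU {cpu:>9,.0f} ns, GPU {gpu:>7,.0f} ns -> {winner}")
```

With these assumed figures the fast cores win for a single operation, and the slow-but-numerous workers win once there is enough independent work to keep them all busy.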

5. Execution Model: SISD vs SIMD

Code makes this even clearer.

CPU: SISD (Single Instruction Single Data)

One instruction processes one piece of data. Modern CPUs have SIMD extensions like AVX, but they're narrow compared to GPUs: a 256-bit AVX register holds 8 single-precision floats.

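# CPU Style: Sequential Execution
import time

def cpu_add(n):
    a = [1] * n
    b = [2] * n
    c = [0] * n

    # A single thread walks through all n elements one by one
    start = time.perf_counter()
    for i in range(n):
        c[i] = a[i] + b[i]

    print(f"CPU Time: {time.perf_counter() - start:.3f}s")

# With n = 1,000,000 this takes a noticeable fraction of a second in pure Python.
# Much of that is interpreter overhead, but the one-at-a-time structure is the point.
cpu_add(1_000_000)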

When I first ran this, I thought, "What's so slow about this?" Then I saw the GPU version.

GPU: SIMD (Single Instruction Multiple Data)

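One command ("ADD!") is issued once, and thousands of threads execute it at the same time, each on its own piece of data. NVIDIA calls its flavor of this model SIMT (Single Instruction, Multiple Threads).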

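// GPU Style: Parallel Execution (CUDA C++)
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// __global__: Function runs on GPU (Kernel)
__global__ void gpu_add(const int *a, const int *b, int *c, int n) {
    // threadIdx.x: My thread number within the block (which one am I?)
    // blockIdx.x: My block number within the grid
    // blockDim.x: How many threads per block?

    // Calculate my unique global index (my position in the field)
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Process only my assigned element; the check stops surplus threads in the last block
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}

/*
CPU: Loop through one by one (For-Loop)
GPU: "Hey! Positions 0 through 999,999, GO!" (parallel execution)

This is the SIMT (Single Instruction Multiple Threads) model.
*/

// Host code (runs on CPU)
int main() {
    int n = 1000000;
    size_t bytes = n * sizeof(int);

    // Host buffers
    int *h_a = (int *)malloc(bytes), *h_b = (int *)malloc(bytes), *h_c = (int *)malloc(bytes);
    for (int i = 0; i < n; i++) { h_a[i] = 1; h_b[i] = 2; }

    // GPU memory allocation and data transfer
    int *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // Kernel launch: 977 blocks of 1024 threads (rounded up to cover n)
    int threadsPerBlock = 1024;
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
    gpu_add<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_c, n);

    // Copy the result back (this call waits for the kernel to finish)
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %d, c[n-1] = %d\n", h_c[0], h_c[n - 1]);

    // 1,000,448 threads were launched; the hardware runs them in warps of 32,
    // as many at a time as fit on the GPU. No loop in the kernel.
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}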

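When I finally grasped this code, the satisfaction was immense. There is no loop in the kernel. Every element gets its own thread, and the hardware runs those threads in parallel, in groups of 32 called warps, as many at a time as the chip can hold. This was the GPU's essence.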
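The index formula is easier to trust after watching it run. The following is a plain-Python stand-in, not real GPU code: `launch` is a hypothetical helper that calls the kernel once per (block, thread) pair sequentially, which is enough to check that every element is covered exactly once and that surplus threads in the last block are filtered out.

```python
def launch(kernel, blocks_per_grid, threads_per_block, *args):
    """Sequential stand-in for a CUDA launch: call the kernel once per (block, thread)."""
    for block_idx in range(blocks_per_grid):
        for thread_idx in range(threads_per_block):
            kernel(block_idx, threads_per_block, thread_idx, *args)

def add_kernel(block_idx, block_dim, thread_idx, a, b, c, n):
    i = block_idx * block_dim + thread_idx   # same formula as the CUDA kernel
    if i < n:                                # surplus threads in the last block do nothing
        c[i] = a[i] + b[i]

n = 10                                       # tiny on purpose so the run is instant
threads_per_block = 4
blocks_per_grid = (n + threads_per_block - 1) // threads_per_block   # ceil(10 / 4)

a, b, c = [1] * n, [2] * n, [0] * n
launch(add_kernel, blocks_per_grid, threads_per_block, a, b, c, n)
print(blocks_per_grid, c)

# Same ceiling division for n = 1,000,000 elements and 1024 threads per block
big_blocks = (1_000_000 + 1024 - 1) // 1024
print(big_blocks, big_blocks * 1024)
```

For one million elements the ceiling division gives 977 blocks, so 1,000,448 threads are launched and the last 448 fail the `i < n` check.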

6. Advanced: CUDA Memory Hierarchy

The hardest part of GPU programming is memory management. It took me a while to understand this.

GPUs have a memory hierarchy like CPUs.

1. Global Memory (VRAM)

Largest but slowest (like RAM)
Accessible by all threads
Latency: hundreds of clock cycles
Example: RTX 4090 has 24GB GDDR6X

2. Shared Memory

Ultra-fast memory shared within a thread block
Like L1 cache, but developer-controlled
Latency: tens of clock cycles
Size: typically up to 48KB per block by default; newer architectures allow more with an opt-in

3. Registers

Private to each thread
Fastest of all
Limited per thread; using too many reduces how many threads can run at once

The core of CUDA optimization: "Minimize slow Global Memory access, maximize Shared Memory usage." This is called Tiling.

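// Matrix multiplication using Shared Memory (simplified, square N x N matrices)
#define TILE_SIZE 16

// Launch with dim3 block(TILE_SIZE, TILE_SIZE) and a grid that covers the N x N output
__global__ void matmul_shared(const float *A, const float *B, float *C, int N) {
    // Declare Shared Memory (shared by all threads in block)
    __shared__ float As[TILE_SIZE][TILE_SIZE];
    __shared__ float Bs[TILE_SIZE][TILE_SIZE];

    int tx = threadIdx.x, ty = threadIdx.y;
    int row = blockIdx.y * TILE_SIZE + ty;   // output row this thread computes
    int col = blockIdx.x * TILE_SIZE + tx;   // output column this thread computes
    float sum = 0.0f;

    int numTiles = (N + TILE_SIZE - 1) / TILE_SIZE;
    for (int tile = 0; tile < numTiles; tile++) {
        // Copy data: Global Memory → Shared Memory (pad with zeros past the matrix edge)
        int aCol = tile * TILE_SIZE + tx;
        int bRow = tile * TILE_SIZE + ty;
        As[ty][tx] = (row < N && aCol < N) ? A[row * N + aCol] : 0.0f;
        Bs[ty][tx] = (bRow < N && col < N) ? B[bRow * N + col] : 0.0f;

        // Synchronize: Wait for all threads to finish copying
        __syncthreads();

        // Compute from Shared Memory (far lower latency than Global Memory)
        for (int k = 0; k < TILE_SIZE; k++) {
            sum += As[ty][k] * Bs[k][tx];
        }

        // Synchronize again before the tiles are overwritten
        __syncthreads();
    }

    if (row < N && col < N) {
        C[row * N + col] = sum;
    }
}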

When I understood this pattern, I realized: "GPU programming is fundamentally a memory game."
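Why tiling helps can be estimated with simple counting. The sketch below is a back-of-the-envelope model, not a measurement: it assumes square N x N matrices, N divisible by the tile size, and no hardware caching, and the function names are hypothetical.

```python
def naive_loads(n: int) -> int:
    """Global-memory reads when every output element reads a full row of A and column of B."""
    return 2 * n * n * n

def tiled_loads(n: int, tile: int) -> int:
    """Global-memory reads when each block stages tile x tile sub-matrices in shared memory.

    Assumes n is a multiple of tile. Each of the (n / tile)^2 blocks walks n / tile
    tile pairs, and loading one pair costs 2 * tile * tile reads.
    """
    blocks = (n // tile) ** 2
    tile_pairs_per_block = n // tile
    return blocks * tile_pairs_per_block * 2 * tile * tile

n, tile = 1024, 16
naive = naive_loads(n)
tiled = tiled_loads(n, tile)
print(f"naive: {naive:,} reads")
print(f"tiled: {tiled:,} reads")
print(f"reduction: {naive // tiled}x")
```

Under these assumptions the number of global-memory reads drops by a factor equal to the tile size, because every value fetched into shared memory is reused by a whole row or column of threads in the block.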

7. Graphics API Wars: DirectX vs Vulkan vs Metal vs CUDA

To command a GPU, you need a language (API). Each has a different philosophy.

DirectX (Microsoft)

Standard for Windows games
Easy and powerful
Xbox uses it
Downside: Windows-only

OpenGL

Older cross-platform standard
Dated design with high draw call overhead
Still found in legacy code, but new projects mostly pick Vulkan, DirectX 12, or Metal

Vulkan

Next-gen standard
Hard to learn but much lower CPU overhead
Popular on Android and Linux
Strong multithreading support

Metal (Apple)

Mac/iPhone exclusive
Optimized for Apple Silicon (M1/M2/M3)
Swift/Objective-C based
Closed ecosystem but high performance

CUDA (NVIDIA)

Built for compute, not graphics
De facto standard for AI/deep learning
An extension of C++ (relatively easy to learn)
Downside: NVIDIA GPUs only

Compute Shaders

In the old days, to use GPUs for computation, you had to "pretend numbers were pixel colors" and draw fake images. It was ridiculous.

But Compute Shaders changed that. Now you can use GPUs as pure calculators, independent of the graphics pipeline. Unity and Unreal Engine use this for simulations like fish flocking (thousands of fish).

My takeaway: GPUs are no longer graphics cards — they're general-purpose parallel computers.

8. Generative AI: How Stable Diffusion Works

How do AI image generators like Stable Diffusion work? It took me a while to understand.

The Denoising Process

Start from random noise: Like a sandy, grainy image
Gradual refinement: Remove noise little by little to reveal the image
50 iterations: Repeat this process 50 times
U-Net computation: Each step costs billions of multiply-add operations

On a CPU, generating one image takes on the order of 10 minutes. On a GPU, it takes about 5 seconds. The exact numbers depend heavily on the hardware and settings.

Tensor Core Magic

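RTX series Tensor Cores are built for low-precision matrix multiplication such as FP16 (half precision). Most of the math in AI training and inference does not need full 32-bit precision; 16-bit inputs are usually enough, typically with results accumulated at higher precision. Tensor Cores accelerate this at the hardware level.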

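# Stable Diffusion's core loop (heavily simplified, runnable toy)
import random

def unet(latent, timestep, text_embedding):
    # Stand-in for the real U-Net, which predicts the noise present in `latent`
    return [0.1 * x for x in latent]

latent = [random.gauss(0.0, 1.0) for _ in range(16)]  # start from pure random noise
text_embedding = None      # placeholder for the encoded prompt
scheduler_scale = 1.0      # placeholder for the scheduler's per-step scaling

for timestep in reversed(range(50)):  # 50 denoising steps, noisiest first
    # U-Net Forward Pass
    noise_pred = unet(latent, timestep, text_embedding)

    # Subtract predicted noise
    latent = [x - scheduler_scale * n for x, n in zip(latent, noise_pred)]

# In the real model each step is billions of multiply-adds,
# which is impractically slow without a GPU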

This is the power of FP16 and Tensor Cores. My conclusion: generative AI at today's scale would not be practical without GPUs.
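What "16-bit is enough" gives up can be seen with Python's standard `struct` module, whose `"e"` format code packs IEEE 754 half-precision values. This is a sketch of FP16 rounding on the CPU, not of how Tensor Cores are programmed, and `to_fp16` is a hypothetical helper.

```python
import struct

def to_fp16(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision value and return it as a Python float."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

print(to_fp16(0.1))        # close to 0.1, but only about 3 decimal digits survive
print(to_fp16(2049.0))     # above 2048, not every integer is representable

# Summing many small values with an FP16 accumulator stalls once the total is large
total16 = 0.0
total_hi = 0.0
for _ in range(10_000):
    total16 = to_fp16(total16 + 0.25)   # round after every add, as an FP16 accumulator would
    total_hi += 0.25                    # higher-precision accumulator for comparison
print(total16, total_hi)
```

The stalled sum is the reason mixed-precision setups usually keep the inputs in 16 bits but accumulate the results at higher precision.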

9. Gaming (GeForce) vs Workstation (Quadro/RTX A-Series)

"Why is Quadro 5x more expensive than GeForce with the same chip?"

I didn't know either at first. But there are real reasons.

GeForce (Gaming)

Goal: High frame rates (FPS)
Tolerance: Minor glitches or calculation errors are acceptable (a wall glitch in a game won't kill anyone)
Price: Relatively cheap ($500 - $1,500)

Quadro / RTX A-Series (Professional)

Goal: Absolute accuracy (architecture, car simulation, medical imaging)
Features:
1. ECC Memory: Self-corrects bit errors (a bit flip in spacecraft design = disaster)
2. Precision: Some professional and data center parts offer much higher FP64 (double precision) throughput, though many current workstation cards do not
3. Certified Drivers: Verified with CAD/3D tools (Maya, SolidWorks, Catia)
4. Warranty: 5-year warranty, 24/7 tech support
Price: Expensive ($2,000 - $30,000)
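To give a feel for how memory can "self-correct" a flipped bit, here is a toy Hamming(7,4) code in Python. It is an illustration of the principle only: real ECC memory uses wider codes over larger words, and nothing here reflects how any specific GPU implements it. The function names are hypothetical.

```python
def hamming74_encode(d):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit codeword (positions 1..7)."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4      # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4      # parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4      # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_decode(code):
    """Return (data_bits, error_position); error_position is 0 if no error was detected."""
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    error_pos = s1 + 2 * s2 + 4 * s3   # syndrome is the 1-based position of the flipped bit
    if error_pos:
        c[error_pos - 1] ^= 1          # flip it back
    return [c[2], c[4], c[5], c[6]], error_pos

data = [1, 0, 1, 1]
stored = hamming74_encode(data)
corrupted = list(stored)
corrupted[4] ^= 1                      # simulate a single bit flip at position 5
recovered, pos = hamming74_decode(corrupted)
print(stored, corrupted, recovered, pos)
```

Three extra parity bits are enough to locate and undo any single flipped bit in the 7-bit word, which is the basic trade ECC memory makes: a little extra storage in exchange for silent correction.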

Conclusion: Tools that make money cost money. Hobby tools are cheap. Market logic.

10. Real-World Application: When to Use CPU vs GPU

Let me summarize when to choose each.

Use CPU When

Complex branching logic (if-else hell)
Sequential dependencies (need previous result to compute next)
Small datasets (parallelization overhead exceeds benefit)
I/O-heavy tasks (file reading/writing, networking)

Examples: Web servers, database queries, OS operations

Use GPU When

Independent mass data operations (data points don't depend on each other)
Matrix operations (deep learning, linear algebra)
Graphics rendering (obviously)
Cryptocurrency mining (repetitive hashing)

Examples: AI training/inference, video encoding, physics simulation

Practical Tips

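# gpu_kernel and simple_calc are placeholders, not real library calls
def simple_calc(x):
    return x * 2

def gpu_kernel(batch):
    # Stand-in: a real version would copy `batch` to the GPU and launch a kernel
    return [x * 2 for x in batch]

data = list(range(1_000_000))

# Bad: one GPU launch per tiny piece of work (overhead dominates)
for i in range(100):
    gpu_kernel(data[i:i + 1])

# Good: CPU for small tasks
results = [simple_calc(data[i]) for i in range(100)]

# Good: GPU for massive tasks, one launch for all 1 million items
results = gpu_kernel(data)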

I learned: Tool choice depends on task nature.
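The "too small for the GPU" rule can be turned into a rough break-even estimate. The cost figures below are invented for illustration (10 ns per item on the CPU, 1 ns per item on the GPU, and 50,000 ns of fixed launch and transfer overhead); real values have to be measured on the actual hardware, and the function names are hypothetical.

```python
CPU_NS_PER_ITEM = 10        # assumed CPU cost per item
GPU_NS_PER_ITEM = 1         # assumed GPU cost per item once running
GPU_OVERHEAD_NS = 50_000    # assumed fixed cost: kernel launch plus data transfer

def cpu_time_ns(n: int) -> int:
    return n * CPU_NS_PER_ITEM

def gpu_time_ns(n: int) -> int:
    return GPU_OVERHEAD_NS + n * GPU_NS_PER_ITEM

def break_even_items() -> int:
    """Smallest n for which the GPU model is strictly faster than the CPU model."""
    return GPU_OVERHEAD_NS // (CPU_NS_PER_ITEM - GPU_NS_PER_ITEM) + 1

for n in (100, break_even_items(), 1_000_000):
    faster = "GPU" if gpu_time_ns(n) < cpu_time_ns(n) else "CPU"
    print(f"n = {n:>9,}: CPU {cpu_time_ns(n):>10,} ns, GPU {gpu_time_ns(n):>9,} ns -> {faster}")
```

Below the break-even point the fixed overhead dominates and the CPU wins; above it the per-item advantage takes over, which is why batching work into one large launch matters.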

11. Summary: Right Tool for the Right Job

"Is CPU better or GPU better?" is a silly question. It's like asking "Is a Ferrari better or a bus better?"

For a date, take the Ferrari (CPU). For a school trip, take the bus (GPU).

As a developer, you need to understand your code's nature.

Lots of complex branching and sequential logic? → Optimize for CPU (algorithm improvements)
Lots of independent, mass data operations? → Accelerate with GPU (CUDA, Metal, Compute Shaders)

Computers don't work alone. When the genius professor (CPU) and ten thousand students (GPU) collaborate, peak performance emerges.

My final realization: Hardware is just a tool. True skill is knowing when to use which tool.
