Notification System Design: Sending Alerts to Millions of Users

It Looked Simple at First

Implementing a notification feature seems trivial at first. User triggers an event → send notification to another user. Done. Maybe 10 lines of code.

But once you open that box, it's a completely different world. Sending one notification to one person is easy. But as notifications scale up, the complexity compounds: some users want push, others want email, some notifications need to be immediate, others should be batched, failures need retries, duplicates must be prevented. In large-scale notification systems, the list is endless.

A notification system isn't just about sending messages. It's about designing a massive logistics operation.

Notifications Flow Through Four Channels

The first question I faced was "how do we deliver notifications?" because every user has different preferences.

Push Notifications: The most immediate and powerful channel for mobile apps. FCM (Firebase Cloud Messaging) for Android, APNs (Apple Push Notification service) for iOS. Timing is everything—delay kills the value.

Email: Can carry long content and be reviewed later. Using services like SendGrid or AWS SES. Better suited for less urgent notifications.

SMS: Reaches any phone without an app install or a data connection, but costs the most per message. Sent through services like Twilio. Reserved for truly critical notifications (payment confirmations, security alerts).

In-app Notifications: Only visible inside the app. Accumulates in a notification center until the user opens the app. The least intrusive channel.

The key insight: the same event needs different channels for different users. When someone comments on my post, one user wants push, another wants email, and another might not want any notification at all.

interface NotificationChannel {
  type: 'push' | 'email' | 'sms' | 'in_app';
  enabled: boolean;
  provider?: string; // 'fcm', 'apns', 'sendgrid', 'twilio'
}

interface UserPreferences {
  userId: string;
  channels: {
    comment: NotificationChannel[];
    like: NotificationChannel[];
    follow: NotificationChannel[];
    marketing: NotificationChannel[];
  };
  quietHours?: {
    start: string; // "22:00"
    end: string;   // "08:00"
    timezone: string;
  };
}
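To make the preference model concrete, here is a minimal Python sketch of how a worker might turn these preferences into a channel list. The names `resolve_channels`, `in_quiet_hours`, and `INTRUSIVE` are hypothetical, and the policy (drop push and SMS during quiet hours unless the event is urgent) is my assumption, not something the data model dictates. The caller is expected to convert "now" into the user's timezone first.

```python
from datetime import time

# Assumption: these are the channels that interrupt the user.
INTRUSIVE = {"push", "sms"}


def parse_hhmm(value: str) -> time:
    """Parse an "HH:MM" string such as "22:00"."""
    hour, minute = value.split(":")
    return time(int(hour), int(minute))


def in_quiet_hours(start: str, end: str, local_now: time) -> bool:
    """True if local_now falls inside [start, end), including windows that wrap midnight."""
    s, e = parse_hhmm(start), parse_hhmm(end)
    if s <= e:
        return s <= local_now < e
    # Window wraps past midnight, e.g. 22:00 -> 08:00.
    return local_now >= s or local_now < e


def resolve_channels(prefs: dict, event_type: str, local_now: time,
                     urgent: bool = False) -> list[str]:
    """Return the enabled channel types for an event, honoring quiet hours."""
    configured = prefs["channels"].get(event_type, [])
    enabled = [c["type"] for c in configured if c["enabled"]]

    quiet = prefs.get("quietHours")
    if quiet and not urgent and in_quiet_hours(quiet["start"], quiet["end"], local_now):
        # Keep only channels that wait silently (email, in-app).
        enabled = [c for c in enabled if c not in INTRUSIVE]
    return enabled
```

With quiet hours of 22:00 to 08:00, a comment at 23:30 would go out by email only, while the same comment at noon would also trigger a push.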

Without a Queue, the System Crashes

The second wall is "how do you send notifications at scale?"

The naive approach is simple: event happens → fetch target user list → loop through and send notifications. This works fine with a handful of users.

But as follower counts grow into the tens of thousands or more, the loop becomes a bottleneck—the server freezes while iterating, and if one notification fails (network error, API limit), everything blocks.

The answer was a message queue. Think of it like the postal system. People who want to send letters (Producers) drop them in mailboxes, then postal workers (Consumers) each take letters and deliver them independently.

[Event Occurs]
    ↓
[Producer: Create Notification]
    ↓
[Message Queue: RabbitMQ/Kafka/SQS]
    ↓        ↓        ↓
[Consumer 1][Consumer 2][Consumer 3] ... [Consumer N]
    ↓        ↓        ↓
[FCM]    [SendGrid] [Twilio]

Producers just drop messages into the queue and move on. Consumers process messages independently. If one fails, others continue working. Load increases? Spin up more consumers. Perfect distributed processing.
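For the "tens of thousands of followers" case, one common way to keep any single queue message small is to split the recipient list into chunks and enqueue one message per chunk. The sketch below only builds the messages; the function names and the chunk size of 500 are assumptions for illustration, and the actual enqueue call depends on the broker.

```python
from typing import Iterator


def chunk_recipients(recipient_ids: list[str], chunk_size: int = 500) -> Iterator[list[str]]:
    """Yield consecutive slices of at most chunk_size recipients."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(recipient_ids), chunk_size):
        yield recipient_ids[start:start + chunk_size]


def build_fanout_messages(event: dict, recipient_ids: list[str],
                          chunk_size: int = 500) -> list[dict]:
    """Build one queue message per chunk, each carrying a copy of the event."""
    return [
        {**event, "recipient_ids": chunk}
        for chunk in chunk_recipients(recipient_ids, chunk_size)
    ]
```

Each message can then be picked up by a different consumer, so a failure in one chunk only delays that chunk.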

# Producer (FastAPI example)
from datetime import datetime, timezone

from celery import Celery
from fastapi import FastAPI, HTTPException

app = FastAPI()
celery_app = Celery('notifications', broker='redis://localhost:6379/0')

@app.post("/api/post/{post_id}/like")
async def like_post(post_id: str, user_id: str):
    # Handle like (db is assumed to be an async MongoDB client, e.g. Motor)
    post = await db.posts.find_one({"_id": post_id})
    if post is None:
        raise HTTPException(status_code=404, detail="Post not found")
    author_id = post["author_id"]

    # Enqueue the notification; a worker sends it outside the request path
    celery_app.send_task('send_notification', args=[{
        'type': 'like',
        'recipient_id': author_id,
        'actor_id': user_id,
        'post_id': post_id,
        'timestamp': datetime.now(timezone.utc).isoformat()
    }])

    return {"status": "success"}


# Consumer (Celery Worker)
# name= must match the string the producer passes to send_task
@celery_app.task(bind=True, name='send_notification', max_retries=3)
def send_notification(self, event):
    try:
        recipient_id = event['recipient_id']
        prefs = get_user_preferences(recipient_id)

        # Generate message from the template for this event type
        message = render_template(f"{event['type']}_notification", event)

        # Send via appropriate channels based on user settings
        for channel in prefs.get_enabled_channels(event['type']):
            if channel == 'push':
                send_push(recipient_id, message)
            elif channel == 'email':
                send_email(recipient_id, message)

    except Exception as exc:
        # Retry on failure (exponential backoff: 1s, 2s, 4s).
        # Note: a retry re-runs every channel, so a channel that already
        # succeeded is sent again unless sends are deduplicated.
        raise self.retry(exc=exc, countdown=2 ** self.request.retries)
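Retries raise the duplicate problem mentioned at the start: if push succeeds and email fails, retrying the whole task sends the push twice. A minimal sketch of one way to guard against that is an idempotency key per (event, channel). `InMemoryDedupStore` is a hypothetical stand-in; in production a shared store (for example Redis `SET` with `NX` and a TTL) would play that role. Leaving the timestamp out of the key is also an assumption about what counts as "the same" notification.

```python
import hashlib
import json


def idempotency_key(event: dict, channel: str) -> str:
    """Stable key identifying one delivery of one event on one channel."""
    identity = {
        "type": event["type"],
        "recipient_id": event["recipient_id"],
        "actor_id": event["actor_id"],
        "post_id": event["post_id"],
        "channel": channel,
    }
    raw = json.dumps(identity, sort_keys=True)
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class InMemoryDedupStore:
    """Single-process stand-in for a shared dedup store."""

    def __init__(self):
        self._seen = set()

    def claim(self, key: str) -> bool:
        """Return True only the first time a key is claimed."""
        if key in self._seen:
            return False
        self._seen.add(key)
        return True

    def release(self, key: str) -> None:
        """Forget a key so a later retry may claim it again."""
        self._seen.discard(key)


def deliver_once(event: dict, channel: str, store, send) -> bool:
    """Call send(event, channel) unless this delivery was already made."""
    key = idempotency_key(event, channel)
    if not store.claim(key):
        return False  # already delivered, skip
    try:
        send(event, channel)
    except Exception:
        store.release(key)  # failed sends must stay retryable
        raise
    return True
```

Wrapping each channel send in `deliver_once` lets the task retry as a whole while only the channels that actually failed are re-sent.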

Without Templates, Management Becomes Hell

The third realization: "same event, different message format per channel."

Push notifications need to be short and punchy: "John Doe liked your post". Emails can be long and detailed: HTML templates with images and multiple links. SMS needs to be ultra-brief: "New comment. Check: app.com/p/123".

Initially, I hardcoded messages for each notification, but changing copy meant modifying code. Multi-language support was a nightmare.

Introducing a template system changed everything. Templates are managed in DB or files, code just fills in variables.

interface NotificationTemplate {
  id: string;
  type: 'like' | 'comment' | 'follow' | 'mention';
  channel: 'push' | 'email' | 'sms' | 'in_app';
  locale: string;
  subject?: string;  // For email
  title?: string;    // For push/in-app
  body?: string;     // Plain text for push/SMS/in-app
  htmlBody?: string; // HTML for email
  variables: string[]; // ['actor_name', 'post_title']
}

// Template stored in DB (id, type, channel, and locale omitted for brevity)
const templates = {
  like_push_en: {
    title: "New Like",
    body: '{{actor_name}} liked your post "{{post_title}}"',
    variables: ['actor_name', 'post_title']
  },
  like_email_en: {
    subject: "{{actor_name}} liked your post",
    htmlBody: `
      <h2>Hello!</h2>
      <p><strong>{{actor_name}}</strong> liked
      your post <a href="{{post_url}}">{{post_title}}</a>.</p>
    `,
    variables: ['actor_name', 'post_title', 'post_url']
  }
}

function renderTemplate(templateId: string, variables: Record<string, string>) {
  const template = getTemplate(templateId);

  // Substitute {{name}} placeholders in a single pass, so a value that itself
  // contains "{{...}}" is never expanded again. Unknown names are left as-is.
  const fill = (text?: string) =>
    text?.replace(/\{\{(\w+)\}\}/g, (match, key) => variables[key] ?? match);

  // Render every text field the template defines, not just body.
  // Values are inserted verbatim: HTML-escape them before filling htmlBody.
  return {
    subject: fill(template.subject),
    title: fill(template.title),
    body: fill(template.body),
    htmlBody: fill(template.htmlBody),
  };
}
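For the worker side, here is a hedged Python sketch of the same "code just fills in variables" idea, with two safeguards a template system usually ends up needing: failing loudly when a variable is missing, and escaping values that go into HTML. The function name `render` and the `escape_html` flag are my own, not part of any library.

```python
import html
import re

# Matches {{name}} with optional spaces inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(template: str, variables: dict, escape_html: bool = False) -> str:
    """Fill {{name}} placeholders; raise KeyError if any variable is missing."""
    missing = set(PLACEHOLDER.findall(template)) - set(variables)
    if missing:
        raise KeyError(f"missing template variables: {sorted(missing)}")

    def substitute(match: re.Match) -> str:
        value = str(variables[match.group(1)])
        # User-supplied text must not inject markup into email HTML.
        return html.escape(value) if escape_html else value

    # Single pass: substituted values are never scanned for placeholders again.
    return PLACEHOLDER.sub(substitute, template)
```

Failing on a missing variable is a design choice: a notification that reads "{{actor_name}} liked your post" is worse than one that is never sent.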

To Avoid Spam, Batching and Rate Limiting Are Essential

The fourth trap: "too many notifications and users turn them off."

In an active community, dozens of notifications can arrive daily. If a post gets 50 comments, send 50 pushes? Users will delete the app.

Batching was the answer. Collect similar notifications over a time window and send them together.

interface NotificationBatch {
  userId: string;
  type: 'comment' | 'like' | 'follow';
  events: Array<{
    actorId: string;
    timestamp: Date;
    metadata: any;
  }>;
  firstEventTime: Date;
  shouldSendAt: Date; // First event + 5 minutes
}

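// Batching logic
// Note: get-then-set is not atomic. With concurrent workers, use a Redis list
// (RPUSH) or a Lua script so that simultaneous events are not lost.
async function handleNewEvent(event) {
  const key = `batch:${event.userId}:${event.type}`;
  const raw = await redis.get(key); // Redis stores strings, so batches are JSON

  if (raw) {
    // Add to existing batch
    const existingBatch = JSON.parse(raw);
    existingBatch.events.push(event);
    await redis.set(key, JSON.stringify(existingBatch));
  } else {
    // Create new batch
    const batch = {
      userId: event.userId,
      type: event.type,
      events: [event],
      firstEventTime: new Date(),
      shouldSendAt: new Date(Date.now() + 5 * 60 * 1000) // 5 minutes later
    };
    await redis.set(key, JSON.stringify(batch));

    // Schedule send after 5 minutes. Pass the key, not the batch object:
    // the scheduled job must re-read Redis to see events added in the meantime,
    // then delete the key.
    scheduleNotification(batch.shouldSendAt, key);
  }
}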

// Batch message example
// Individual: "John Doe commented on your post"
// Batched: "John Doe and 12 others commented on your post"
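Producing the batched copy is a small function of its own. A minimal sketch, assuming the batch has already been reduced to a list of actor display names (the function name `summarize_batch` is hypothetical):

```python
def summarize_batch(actor_names: list[str], action: str) -> str:
    """Collapse a batch of actors into one line of notification copy."""
    # Deduplicate while keeping first-seen order: one person commenting
    # five times should not count as five people.
    unique = list(dict.fromkeys(actor_names))
    if not unique:
        raise ValueError("cannot summarize an empty batch")

    first = unique[0]
    others = len(unique) - 1
    if others == 0:
        return f"{first} {action}"
    if others == 1:
        return f"{first} and {unique[1]} {action}"
    return f"{first} and {others} others {action}"
```

In a multi-language setup the three sentence shapes would themselves live in templates, since pluralization rules differ per locale.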

Rate limiting was equally important. Even crucial notifications become spam if you send 10 per minute.

import redis

redis_client = redis.Redis()

def check_rate_limit(user_id: str, channel: str, limit: int, window_seconds: int) -> bool:
    """
    Fixed-window counter (simpler than a token bucket)
    Example: Max 20 push notifications per hour
    """
    key = f"ratelimit:{user_id}:{channel}"

    # INCR is atomic and creates the key at 1 if it does not exist,
    # so concurrent workers cannot both see "no key yet"
    current = redis_client.incr(key)

    if current == 1:
        # First request in this window: start the expiry clock
        redis_client.expire(key, window_seconds)

    return current <= limit

# Usage
if check_rate_limit(user_id, 'push', limit=20, window_seconds=3600):
    send_push_notification(user_id, message)
else:
    # Queue for batch send later
    queue_for_batch(user_id, message)
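A counter per time window is the simplest limiter, but it lets a user receive a full quota at the end of one window and another full quota at the start of the next. A token bucket smooths that out: tokens refill continuously and each send spends one. Below is a minimal in-memory sketch; the class name and the injectable `clock` parameter are my own choices, and a multi-worker deployment would need the state in a shared store instead.

```python
import time


class TokenBucket:
    """Allows bursts up to `capacity`, then a steady `refill_per_second` rate."""

    def __init__(self, capacity: int, refill_per_second: float, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self._clock = clock  # injectable so the behavior can be tested without sleeping
        self._tokens = float(capacity)
        self._last = clock()

    def allow(self) -> bool:
        """Spend one token if available; return whether the send may proceed."""
        now = self._clock()
        elapsed = now - self._last
        self._last = now

        # Refill for the time that passed, never beyond capacity.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.refill_per_second)

        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False
```

For "20 pushes per hour" this would be `TokenBucket(capacity=20, refill_per_second=20 / 3600)`: a burst of 20 is allowed, after which one more send becomes available every three minutes.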

Without Priority, Critical Notifications Get Buried

The fifth lesson: "not all notifications are equal."

A payment failure alert and someone viewing your profile can't be the same priority. Security warnings need immediate delivery, but marketing notifications can wait days.

We introduced a priority system.

enum NotificationPriority {
  CRITICAL = 0,  // Security, payment failure - immediate send, aggressive retry
  HIGH = 1,      // Mentions, direct messages - send within minutes
  NORMAL = 2,    // Likes, follows - batching allowed
  LOW = 3        // Marketing, recommendations - batch + never send during quiet hours
}

interface NotificationJob {
  id: string;
  userId: string;
  type: string;
  priority: NotificationPriority;
  payload: any;
  createdAt: Date;
  maxRetries: number;
  retryCount: number;
}

// Different queues by priority
// (assumes a BullMQ-style Queue class; connection options omitted.
// BullMQ rejects ':' in queue names, hence the hyphens.)
const queues = {
  critical: new Queue('notifications-critical', {
    defaultJobOptions: { attempts: 5, backoff: { type: 'exponential', delay: 1000 } }
  }),
  high: new Queue('notifications-high', {
    defaultJobOptions: { attempts: 3, backoff: { type: 'exponential', delay: 2000 } }
  }),
  normal: new Queue('notifications-normal', {
    defaultJobOptions: { attempts: 2, backoff: { type: 'fixed', delay: 60000 } }
  }),
  low: new Queue('notifications-low', {
    defaultJobOptions: { attempts: 1 } // Give up if failed
  })
};
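The ordering idea itself fits in a few lines. This is an in-process sketch using a heap, not a replacement for real queues; the type-to-priority table is an assumed example, and the class name `PriorityDispatcher` is hypothetical.

```python
import heapq
import itertools
from enum import IntEnum


class Priority(IntEnum):
    CRITICAL = 0
    HIGH = 1
    NORMAL = 2
    LOW = 3


# Assumed mapping; the real table is a product decision.
TYPE_PRIORITY = {
    "security_alert": Priority.CRITICAL,
    "payment_failure": Priority.CRITICAL,
    "mention": Priority.HIGH,
    "like": Priority.NORMAL,
    "follow": Priority.NORMAL,
    "marketing": Priority.LOW,
}


class PriorityDispatcher:
    """Hands out jobs lowest priority number first, FIFO within a priority."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker, so job dicts are never compared

    def enqueue(self, job: dict) -> None:
        priority = TYPE_PRIORITY.get(job["type"], Priority.NORMAL)
        heapq.heappush(self._heap, (priority, next(self._seq), job))

    def dequeue(self) -> dict:
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)
```

One caveat of a single strictly ordered queue is starvation: a steady stream of high-priority jobs can keep low-priority ones waiting indefinitely. Separate queues with their own consumers avoid that, at the cost of more moving parts.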

Without Retry Logic, Notifications Disappear

The sixth reality: "external services fail all the time."

FCM returns 503 errors, SendGrid API calls time out, networks disconnect. You can't give up after one failure.

Retry with exponential backoff. Increase wait time after each failure, giving services time to recover.

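type SendResult = { success: boolean; reason?: 'non_retriable' | 'max_retries' };

const sleep = (ms: number) => new Promise<void>(resolve => setTimeout(resolve, ms));

// Note: maxRetries is the total number of attempts, including the first one
async function sendWithRetry(
  sendFn: () => Promise<void>,
  maxRetries: number = 3,
  baseDelay: number = 1000
): Promise<SendResult> {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      await sendFn();
      return { success: true };
    } catch (error) {
      const isLastAttempt = attempt === maxRetries - 1;

      // Non-retriable errors (invalid token, user deleted app)
      if (isNonRetriableError(error)) {
        await handlePermanentFailure(error);
        return { success: false, reason: 'non_retriable' };
      }

      if (isLastAttempt) {
        await logFailure(error);
        return { success: false, reason: 'max_retries' };
      }

      // Exponential backoff: 1s, 2s, 4s, 8s... (with the defaults: 1s, then 2s)
      const delay = baseDelay * Math.pow(2, attempt);
      await sleep(delay);
    }
  }

  // Only reached when maxRetries <= 0, i.e. no attempt was made
  return { success: false, reason: 'max_retries' };
}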

function isNonRetriableError(error: any): boolean {
  // FCM: invalid token, or token no longer registered (e.g. app uninstalled)
  if (error?.code === 'messaging/invalid-registration-token') return true;
  if (error?.code === 'messaging/registration-token-not-registered') return true;

  // SendGrid: Invalid email
  if (error?.code === 400 && error.message?.includes('invalid email')) return true;

  // Twilio: Invalid phone number
  if (error?.code === 21211) return true;

  return false;
}
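One refinement worth considering, which goes beyond plain exponential backoff: when a provider recovers from an outage, thousands of workers that all failed at the same moment will also retry at the same moment. Adding random jitter spreads them out. Here is a minimal sketch of the "full jitter" variant, where the delay is drawn uniformly between zero and the exponential ceiling. The function name, the 60-second cap, and the injectable `rng` are assumptions for illustration.

```python
import random


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0,
                  rng=random.random) -> float:
    """Delay in seconds before retrying after a zero-based failed attempt.

    The ceiling doubles per attempt (base, 2*base, 4*base, ...) up to `cap`;
    the actual delay is a random fraction of that ceiling.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng() * ceiling
```

The cap matters as much as the jitter: without it, the tenth retry would wait over a quarter of an hour even when the provider recovered long ago.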

Without Data, You Can't Improve

The final realization: "sending notifications isn't the end."

Did it actually reach the user? Did they read it? Click it? Which notifications are most effective? No data means no answers.

Metrics to track:

Delivery Rate: Percentage of queued notifications actually sent
Reach Rate: Percentage of sent notifications that reached devices
Open Rate: Percentage of delivered notifications users actually viewed
Click-Through Rate: Percentage of viewed notifications that led to actions
Opt-out Rate: Percentage of users who disabled notifications

interface NotificationMetrics {
  notificationId: string;
  userId: string;
  type: string;
  channel: string;
  priority: NotificationPriority;

  // Delivery stages
  queuedAt: Date;
  sentAt?: Date;
  deliveredAt?: Date;  // Confirmed by external service

  // User actions
  viewedAt?: Date;
  clickedAt?: Date;
  actionTaken?: string;

  // Result
  status: 'queued' | 'sent' | 'delivered' | 'failed' | 'bounced';
  failureReason?: string;

  // Metadata
  deviceType?: string;
  osVersion?: string;
  appVersion?: string;
}

// Analytics query example
async function getNotificationStats(type: string, days: number) {
  const results = await db.metrics.aggregate([
    {
      $match: {
        type: type,
        queuedAt: { $gte: new Date(Date.now() - days * 86400000) }
      }
    },
    {
      $group: {
        _id: "$channel",
        total: { $sum: 1 },
        // $gt null is true only when the field exists and is non-null;
        // $ne null would also count documents where the field is missing
        sent: { $sum: { $cond: [{ $gt: ["$sentAt", null] }, 1, 0] } },
        delivered: { $sum: { $cond: [{ $gt: ["$deliveredAt", null] }, 1, 0] } },
        viewed: { $sum: { $cond: [{ $gt: ["$viewedAt", null] }, 1, 0] } },
        clicked: { $sum: { $cond: [{ $gt: ["$clickedAt", null] }, 1, 0] } }
      }
    }
  ]);

  // Guard against division by zero when a stage has no events
  const pct = (part: number, whole: number) =>
    whole > 0 ? (part / whole * 100).toFixed(2) + '%' : 'n/a';

  return results.map(r => ({
    channel: r._id,
    deliveryRate: pct(r.sent, r.total),
    reachRate: pct(r.delivered, r.sent),
    openRate: pct(r.viewed, r.delivered),
    ctr: pct(r.clicked, r.viewed)
  }));
}
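Each rate in the funnel divides by the stage before it, not by the total, which is easy to get wrong. A small worked sketch in Python with made-up counts (the function names are hypothetical):

```python
from typing import Optional


def rate(numerator: int, denominator: int) -> Optional[float]:
    """Percentage rounded to two decimals, or None when there is nothing to divide by."""
    if denominator == 0:
        return None
    return round(numerator / denominator * 100, 2)


def funnel(total: int, sent: int, delivered: int, viewed: int, clicked: int) -> dict:
    """Each stage is measured against the stage immediately before it."""
    return {
        "delivery_rate": rate(sent, total),       # queued -> sent
        "reach_rate": rate(delivered, sent),      # sent -> delivered
        "open_rate": rate(viewed, delivered),     # delivered -> viewed
        "click_through_rate": rate(clicked, viewed),  # viewed -> clicked
    }


# Illustrative numbers only: 1000 queued, 950 sent, 900 delivered, 450 viewed, 90 clicked
stats = funnel(1000, 950, 900, 450, 90)
```

Here the click-through rate is 20% of viewed notifications, even though only 9% of everything queued ended in a click. Both numbers are useful, but they answer different questions.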

In the End, It Was a Logistics Center

Looking back, designing a notification system wasn't about sending messages. It was operating a massive logistics center.

Channel selection: Which shipping method (truck/ship/plane)
Message queue: The logistics center's conveyor belt
Templates: Standardized packaging boxes
Batching: Combining items going the same direction into one truck
Priority: Standard shipping vs same-day delivery
Rate limiting: Don't send more than 10 packages per day to one house
Retry: Redeliver if recipient is absent
Metrics: Delivery completion rate, receipt confirmation

When a single-line notification became millions, every layer of design mattered. No queue means crashes, no templates means management hell, no priority means critical notifications get buried, no data means no improvement.

The core was user experience. Technically perfect delivery means nothing if users don't want it or timing is wrong. Success isn't measured by "how many we sent" but "how many valuable notifications we sent at the right moment."

Now I know: sending one notification and designing a system are completely different dimensions of the problem.
