It Looked Simple at First
Implementing a notification feature seems trivial at first. User triggers an event → send notification to another user. Done. Maybe 10 lines of code.
But once you open that box, it's a completely different world. Sending one notification to one person is easy. But as notifications scale up, the complexity compounds: some users want push, others want email, some notifications need to be immediate, others should be batched, failures need retries, duplicates must be prevented. In large-scale notification systems, the list is endless.
A notification system isn't just about sending messages. It's about designing a massive logistics operation.
Notifications Flow Through Four Channels
The first question I faced was "how do we deliver notifications?" because every user has different preferences.
Push Notifications: The most immediate channel for mobile apps. FCM (Firebase Cloud Messaging) for Android, APNs (Apple Push Notification service) for iOS. Timing is everything: a delayed push loses most of its value.
Email: Can carry long content and be reviewed later, using services like SendGrid or AWS SES. Better suited for less urgent notifications.
SMS: Typically the most reliable delivery but also the most expensive, using services like Twilio. Reserved for truly critical notifications (payment confirmations, security alerts).
In-app Notifications: Only visible inside the app. They accumulate in a notification center until the user opens the app. The least intrusive channel.
The key insight: the same event needs different channels for different users. When someone comments on my post, one user wants push, another wants email, and another might not want any notification at all.
interface NotificationChannel {
  type: 'push' | 'email' | 'sms' | 'in_app';
  enabled: boolean;
  provider?: string; // 'fcm', 'apns', 'sendgrid', 'twilio'
}
interface UserPreferences {
  userId: string;
  channels: {
    comment: NotificationChannel[];
    like: NotificationChannel[];
    follow: NotificationChannel[];
    marketing: NotificationChannel[];
  };
  quietHours?: {
    start: string; // "22:00"
    end: string; // "08:00"
    timezone: string;
  };
}
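To make the preference shape concrete, here is a minimal Python sketch of how the enabled channels for one event might be resolved. Everything in it is illustrative rather than taken from the real system: the names `Preferences`, `QuietHours`, `in_quiet_hours`, and `resolve_channels` are made up, the channel map is simplified to plain strings, and the rule that only in-app notifications go out during quiet hours is an assumed policy.

```python
from dataclasses import dataclass, field
from datetime import datetime, time
from typing import Optional
from zoneinfo import ZoneInfo


@dataclass
class QuietHours:
    start: str     # "22:00"
    end: str       # "08:00"
    timezone: str  # IANA name such as "Asia/Seoul"


@dataclass
class Preferences:
    # event type -> channel types the user enabled for it (simplified)
    channels: dict[str, list[str]] = field(default_factory=dict)
    quiet_hours: Optional[QuietHours] = None


def in_quiet_hours(local: time, quiet: QuietHours) -> bool:
    """Return True if the user's local time falls inside the quiet window."""
    start = time.fromisoformat(quiet.start)
    end = time.fromisoformat(quiet.end)
    if start <= end:
        return start <= local < end
    # The window wraps past midnight, e.g. 22:00 -> 08:00.
    return local >= start or local < end


def resolve_channels(prefs: Preferences, event_type: str, now_utc: datetime) -> list[str]:
    """Pick the channels to use for one event; now_utc must be timezone-aware."""
    enabled = list(prefs.channels.get(event_type, []))
    quiet = prefs.quiet_hours
    if quiet is None:
        return enabled
    local = now_utc.astimezone(ZoneInfo(quiet.timezone)).time()
    if in_quiet_hours(local, quiet):
        # Assumed policy: only the non-intrusive in-app channel during quiet hours.
        return [c for c in enabled if c == "in_app"]
    return enabled
```

The part that is easy to get wrong is the window that wraps past midnight: with 22:00 to 08:00, a simple `start <= now < end` check is never true, which is why the sketch handles that case separately.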
Without a Queue, the System Crashes
The second wall is "how do you send notifications at scale?"
The naive approach is simple: event happens → fetch target user list → loop through and send notifications. This works fine with a handful of users.
But as follower counts grow into the tens of thousands or more, the loop becomes a bottleneck. The server stalls while iterating, and if one notification fails (network error, API limit), everything behind it blocks.
The answer was a message queue. Think of it like the postal system. People who want to send letters (Producers) drop them in mailboxes, then postal workers (Consumers) each take letters and deliver them independently.
[Event Occurs]
↓
[Producer: Create Notification]
↓
[Message Queue: RabbitMQ/Kafka/SQS]
↓ ↓ ↓
[Consumer 1][Consumer 2][Consumer 3] ... [Consumer N]
↓ ↓ ↓
[FCM] [SendGrid] [Twilio]
Producers just drop messages into the queue and move on. Consumers process messages independently. If one fails, others continue working. Load increases? Spin up more consumers. The work is distributed, and no single failure stops it.
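# Producer (FastAPI example)
from datetime import datetime, timezone
from celery import Celery
from fastapi import FastAPI, HTTPException
app = FastAPI()
celery_app = Celery('notifications', broker='redis://localhost:6379/0')
# `db` is assumed to be an async MongoDB client (for example Motor) configured elsewhere
@app.post("/api/post/{post_id}/like")
async def like_post(post_id: str, user_id: str):
    # Handle like
    post = await db.posts.find_one({"_id": post_id})
    if post is None:
        raise HTTPException(status_code=404, detail="Post not found")
    author_id = post["author_id"]
    # Enqueue the notification; a worker sends it outside the request cycle
    celery_app.send_task('send_notification', args=[{
        'type': 'like',
        'recipient_id': author_id,
        'actor_id': user_id,
        'post_id': post_id,
        'timestamp': datetime.now(timezone.utc).isoformat()
    }])
    return {"status": "success"}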
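# Consumer (Celery Worker)
# get_user_preferences, render_template, send_push and send_email are helpers defined elsewhere
# name= must match the string passed to send_task() in the producer
@celery_app.task(bind=True, name='send_notification', max_retries=3)
def send_notification(self, event):
    try:
        recipient_id = event['recipient_id']
        prefs = get_user_preferences(recipient_id)
        # Generate message from the template that matches the event type
        message = render_template(f"{event['type']}_notification", event)
        # Send via appropriate channels based on user settings
        for channel in prefs.get_enabled_channels(event['type']):
            if channel == 'push':
                send_push(recipient_id, message)
            elif channel == 'email':
                send_email(recipient_id, message)
    except Exception as exc:
        # Retry on failure with exponential backoff (1s, 2s, 4s).
        # The whole task reruns, so a channel that already succeeded is sent
        # again unless the send functions are idempotent.
        raise self.retry(exc=exc, countdown=2 ** self.request.retries)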
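The failure-isolation claim is easy to see in miniature. The sketch below is not the Celery setup above: it uses Python's standard `queue` and `threading` modules as an in-process stand-in for a broker, and `run_workers` and `flaky_send` are hypothetical names. It only illustrates that one failing message does not stop the others from being processed.

```python
import queue
import threading


def flaky_send(recipient_id):
    # Stand-in for a provider call; recipient 3 simulates a provider error.
    if recipient_id == 3:
        raise RuntimeError("provider unavailable")


def run_workers(messages, handler, worker_count=3):
    """Drain `messages` with `worker_count` threads; return (delivered, failed)."""
    q = queue.Queue()
    for message in messages:
        q.put(message)

    delivered, failed = [], []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                message = q.get_nowait()
            except queue.Empty:
                return  # nothing left to do
            try:
                handler(message)
                with lock:
                    delivered.append(message)
            except Exception:
                # A failure is recorded and the worker moves on to the next message.
                with lock:
                    failed.append(message)

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return delivered, failed
```

Calling `run_workers(range(10), flaky_send)` puts recipient 3 in the failed list while the other nine are delivered. In a real deployment the failed list would feed the retry path instead of being returned.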
Without Templates, Management Becomes Hell
The third realization: "same event, different message format per channel."
Push notifications need to be short and punchy: "John Doe liked your post". Emails can be long and detailed: HTML templates with images and multiple links. SMS needs to be ultra-brief: "New comment. Check: app.com/p/123".
Initially, I hardcoded messages for each notification, but changing copy meant modifying code. Multi-language support was a nightmare.
Introducing a template system changed everything. Templates are managed in a DB or in files, and the code just fills in variables.
interface NotificationTemplate {
  id: string;
  type: 'like' | 'comment' | 'follow' | 'mention';
  channel: 'push' | 'email' | 'sms' | 'in_app';
  locale: string;
  subject?: string; // For email
  title: string; // For push/in-app
  body: string;
  htmlBody?: string; // HTML for email
  variables: string[]; // ['actor_name', 'post_title']
}
// Template stored in DB
const templates = {
  like_push_en: {
    title: "New Like",
    body: "{{actor_name}} liked your {{post_title}}",
    variables: ['actor_name', 'post_title']
  },
  like_email_en: {
    subject: "{{actor_name}} liked your post",
    htmlBody: `
      <h2>Hello!</h2>
      <p><strong>{{actor_name}}</strong> liked
      your post <a href="{{post_url}}">{{post_title}}</a>.</p>
    `,
    variables: ['actor_name', 'post_title', 'post_url']
  }
}
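function renderTemplate(templateId: string, variables: Record<string, string>) {
  const template = getTemplate(templateId);
  // Fill every {{placeholder}} in one text field; unknown names are left as-is.
  // The braces are escaped, and a replacer function keeps "$" in values literal.
  const fill = (text?: string) =>
    text?.replace(/\{\{(\w+)\}\}/g, (match, key) => variables[key] ?? match);
  // Render every field, so email templates (subject + htmlBody) work as well as push.
  // Values should be HTML-escaped before they are inserted into htmlBody.
  return {
    subject: fill(template.subject),
    title: fill(template.title),
    body: fill(template.body),
    htmlBody: fill(template.htmlBody)
  };
}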
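Multi-language support is where templates pay off most, so here is a small Python sketch of locale fallback. It is an assumption-laden illustration, not the original implementation: the `TEMPLATES` table, the Spanish copy, and the function names `find_template` and `render` are all invented for the example, and the fallback order (exact locale, then base language, then English) is one reasonable choice among several.

```python
import re

# Hypothetical template store keyed by (event type, channel, locale).
TEMPLATES = {
    ("like", "push", "en"): "{{actor_name}} liked your post",
    ("like", "push", "es"): "{{actor_name}} indicó que le gusta tu publicación",
}

PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def find_template(event_type, channel, locale, default_locale="en"):
    """Try the exact locale, then its base language, then the default."""
    for candidate in (locale, locale.split("-")[0], default_locale):
        body = TEMPLATES.get((event_type, channel, candidate))
        if body is not None:
            return body
    raise KeyError(f"no template for {event_type}/{channel}")


def render(body, variables):
    """Fill {{name}} placeholders; fail loudly if a variable is missing."""
    def substitute(match):
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"missing template variable: {name}")
        return str(variables[name])

    return PLACEHOLDER.sub(substitute, body)
```

Raising on a missing variable is a deliberate choice in this sketch: a notification that literally says "{{actor_name}} liked your post" is worse than one that is never sent.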
To Avoid Spam, Batching and Rate Limiting Are Essential
The fourth trap: "too many notifications and users turn them off."
In an active community, dozens of notifications can arrive daily. If a post gets 50 comments, send 50 pushes? Users will delete the app.
Batching was the answer. Collect similar notifications over a time window and send them together.
interface NotificationBatch {
  userId: string;
  type: 'comment' | 'like' | 'follow';
  events: Array<{
    actorId: string;
    timestamp: Date;
    metadata: any;
  }>;
  firstEventTime: Date;
  shouldSendAt: Date; // First event + 5 minutes
}
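// Batching logic (assumes a promise-based Redis client)
async function handleNewEvent(event) {
  const key = `batch:${event.userId}:${event.type}`;
  // Redis stores strings, so the batch is serialized as JSON
  const raw = await redis.get(key);
  if (raw) {
    // Add to existing batch
    const batch = JSON.parse(raw);
    batch.events.push(event);
    await redis.set(key, JSON.stringify(batch));
  } else {
    // Create new batch
    const now = Date.now();
    const batch = {
      userId: event.userId,
      type: event.type,
      events: [event],
      firstEventTime: new Date(now),
      shouldSendAt: new Date(now + 5 * 60 * 1000) // 5 minutes later
    };
    await redis.set(key, JSON.stringify(batch));
    // Schedule send after 5 minutes. Pass the key rather than the batch object
    // so the job re-reads Redis and includes events added in the meantime.
    scheduleNotification(batch.shouldSendAt, key);
  }
  // Note: get-then-set is not atomic, so concurrent events for the same key
  // can overwrite each other. A Redis list (RPUSH) or a Lua script avoids this.
}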
// Batch message example
// Individual: "John Doe commented on your post"
// Batched: "John Doe and 12 others commented on your post"
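Turning a batch into the "and 12 others" message is a small but fiddly step. A Python sketch follows, assuming actor IDs have already been resolved to display names; the function name `summarize` is hypothetical, and counting each actor once even when they commented several times is an assumed product decision.

```python
def summarize(actor_names, action):
    """Build one line such as 'John Doe and 12 others commented on your post'."""
    # Deduplicate while keeping first-seen order, so one person who
    # commented five times is counted once.
    unique = list(dict.fromkeys(actor_names))
    if not unique:
        raise ValueError("a batch needs at least one actor")
    first, others = unique[0], len(unique) - 1
    if others == 0:
        return f"{first} {action}"
    if others == 1:
        return f"{first} and 1 other {action}"
    return f"{first} and {others} others {action}"
```

A batch with John Doe plus twelve other distinct commenters produces exactly the batched message shown above, while a batch of one falls back to the individual form.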
Rate limiting was equally important. Even crucial notifications become spam if you send 10 per minute.
import redis
redis_client = redis.Redis()
def check_rate_limit(user_id: str, channel: str, limit: int, window_seconds: int) -> bool:
    """
    Fixed-window counter (simpler than a token bucket)
    Example: Max 20 push notifications per hour
    """
    key = f"ratelimit:{user_id}:{channel}"
    # INCR is atomic and creates the key with value 1 if it does not exist,
    # so there is no gap between reading and updating the counter
    count = redis_client.incr(key)
    if count == 1:
        # First request in this window: start the expiry clock
        redis_client.expire(key, window_seconds)
    return count <= limit
# Usage
if check_rate_limit(user_id, 'push', limit=20, window_seconds=3600):
    send_push_notification(user_id, message)
else:
    # Queue for batch send later
    queue_for_batch(user_id, message)
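The limiter above counts sends inside a window that resets all at once. A token bucket is a common alternative that allows short bursts while holding the average rate steady. The sketch below is an in-memory illustration only, not a drop-in replacement: the class name `TokenBucket` and its parameters are assumptions, and a production version would keep its state in Redis so that all workers share it.

```python
import time


class TokenBucket:
    """Allow up to `capacity` sends in a burst, refilled at a steady rate."""

    def __init__(self, capacity, refill_per_second, now=time.monotonic):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self._now = now  # injectable clock, which makes the class easy to test
        self.tokens = float(capacity)
        self.updated = now()

    def allow(self):
        """Spend one token if available; return whether the send may proceed."""
        current = self._now()
        elapsed = current - self.updated
        self.updated = current
        # Add the tokens earned since the last call, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

For the "20 pushes per hour" example, `TokenBucket(capacity=20, refill_per_second=20 / 3600)` would let a user receive a burst of 20 and then roughly one more every three minutes.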
Without Priority, Critical Notifications Get Buried
The fifth lesson: "not all notifications are equal."
A payment failure alert and someone viewing your profile can't be the same priority. Security warnings need immediate delivery, but marketing notifications can wait days.
We introduced a priority system.
enum NotificationPriority {
  CRITICAL = 0, // Security, payment failure - immediate send, aggressive retry
  HIGH = 1, // Mentions, direct messages - send within minutes
  NORMAL = 2, // Likes, follows - batching allowed
  LOW = 3 // Marketing, recommendations - batch + never send during quiet hours
}
interface NotificationJob {
  id: string;
  userId: string;
  type: string;
  priority: NotificationPriority;
  payload: any;
  createdAt: Date;
  maxRetries: number;
  retryCount: number;
}
// Different queues by priority
// Names use hyphens because some queue libraries (recent BullMQ versions, for example) reject ':' in queue names
const queues = {
  critical: new Queue('notifications-critical', {
    defaultJobOptions: { attempts: 5, backoff: { type: 'exponential', delay: 1000 } }
  }),
  high: new Queue('notifications-high', {
    defaultJobOptions: { attempts: 3, backoff: { type: 'exponential', delay: 2000 } }
  }),
  normal: new Queue('notifications-normal', {
    defaultJobOptions: { attempts: 2, backoff: { type: 'fixed', delay: 60000 } }
  }),
  low: new Queue('notifications-low', {
    defaultJobOptions: { attempts: 1 } // Give up if failed
  })
};
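The ordering guarantee can be shown without any queue library. This Python sketch uses a single in-memory heap as a stand-in for the separate queues above. The `PRIORITY_BY_TYPE` mapping and the `PriorityInbox` class are hypothetical, and strict ordering like this can starve low-priority jobs under sustained load, which is one reason to run separate queues with their own workers.

```python
import heapq
import itertools
from enum import IntEnum


class Priority(IntEnum):
    CRITICAL = 0
    HIGH = 1
    NORMAL = 2
    LOW = 3


# Assumed mapping from event type to priority; unknown types default to NORMAL.
PRIORITY_BY_TYPE = {
    "security_alert": Priority.CRITICAL,
    "payment_failed": Priority.CRITICAL,
    "mention": Priority.HIGH,
    "like": Priority.NORMAL,
    "follow": Priority.NORMAL,
    "marketing": Priority.LOW,
}


class PriorityInbox:
    """Pop the most urgent job first; equal priorities keep arrival order."""

    def __init__(self):
        self._heap = []
        self._sequence = itertools.count()  # tie-breaker for equal priorities

    def push(self, event_type, payload):
        priority = PRIORITY_BY_TYPE.get(event_type, Priority.NORMAL)
        heapq.heappush(self._heap, (priority, next(self._sequence), event_type, payload))

    def pop(self):
        priority, _, event_type, payload = heapq.heappop(self._heap)
        return priority, event_type, payload

    def __len__(self):
        return len(self._heap)
```

The sequence counter matters: without it, two jobs with the same priority would be compared by event type and payload, and arrival order would be lost.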
Without Retry Logic, Notifications Disappear
The sixth reality: "external services fail all the time."
FCM throws 503 errors, SendGrid API calls time out, networks disconnect. You can't give up after one failure.
Retry with exponential backoff. Increase wait time after each failure, giving services time to recover.
const sleep = (ms: number) => new Promise<void>(resolve => setTimeout(resolve, ms));
async function sendWithRetry(
  sendFn: () => Promise<void>,
  maxRetries: number = 3, // total attempts, including the first one
  baseDelay: number = 1000
): Promise<{ success: boolean; reason?: string }> {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      await sendFn();
      return { success: true };
    } catch (error) {
      const isLastAttempt = attempt === maxRetries - 1;
      // Non-retriable errors (invalid token, user deleted app)
      if (isNonRetriableError(error)) {
        await handlePermanentFailure(error);
        return { success: false, reason: 'non_retriable' };
      }
      if (isLastAttempt) {
        await logFailure(error);
        return { success: false, reason: 'max_retries' };
      }
      // Exponential backoff: 1s, 2s, 4s, 8s...
      // (with the default of 3 attempts, only the 1s and 2s waits occur)
      const delay = baseDelay * Math.pow(2, attempt);
      await sleep(delay);
    }
  }
  // Reached only when maxRetries <= 0
  return { success: false, reason: 'max_retries' };
}
function isNonRetriableError(error: any): boolean {
  // FCM: malformed token, or token no longer registered (for example, app uninstalled)
  if (error.code === 'messaging/invalid-registration-token') return true;
  if (error.code === 'messaging/registration-token-not-registered') return true;
  // SendGrid: Invalid email
  if (error.code === 400 && error.message?.toLowerCase().includes('invalid email')) return true;
  // Twilio: Invalid phone number
  if (error.code === 21211) return true;
  return false;
}
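It helps to write the backoff schedule out as numbers. The Python sketch below computes the waits between attempts and adds optional jitter. The cap and the "full jitter" variant are additions for illustration, not part of the retry function above, and the names `backoff_delays` and `with_full_jitter` are hypothetical.

```python
import random


def backoff_delays(max_attempts, base_delay=1.0, cap=60.0):
    """Waits in seconds between attempts: base * 2**n, limited to `cap`.

    N attempts need only N - 1 waits, because nothing follows the last one.
    """
    return [min(cap, base_delay * 2 ** n) for n in range(max_attempts - 1)]


def with_full_jitter(delay, rng=random):
    """Pick a random wait in [0, delay] so that clients do not retry in lockstep."""
    return rng.uniform(0, delay)
```

With three attempts the waits are 1s and 2s; with five they are 1s, 2s, 4s, and 8s. Jitter matters most when a provider outage makes thousands of sends fail at the same moment, because without it they would all retry at the same moment too.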
Without Data, You Can't Improve
The final realization: "sending notifications isn't the end."
Did it actually reach the user? Did they read it? Click it? Which notifications are most effective? No data means no answers.
Metrics to track:
- Delivery Rate: Percentage of queued notifications actually sent
- Reach Rate: Percentage of sent notifications that reached devices
- Open Rate: Percentage of delivered notifications users actually viewed
- Click-Through Rate: Percentage of viewed notifications that led to actions
- Opt-out Rate: Percentage of users who disabled notifications
interface NotificationMetrics {
  notificationId: string;
  userId: string;
  type: string;
  channel: string;
  priority: NotificationPriority;
  // Delivery stages
  queuedAt: Date;
  sentAt?: Date;
  deliveredAt?: Date; // Confirmed by external service
  // User actions
  viewedAt?: Date;
  clickedAt?: Date;
  actionTaken?: string;
  // Result
  status: 'queued' | 'sent' | 'delivered' | 'failed' | 'bounced';
  failureReason?: string;
  // Metadata
  deviceType?: string;
  osVersion?: string;
  appVersion?: string;
}
// Analytics query example
async function getNotificationStats(type: string, days: number) {
  // A missing field is not equal to null inside aggregation expressions, so
  // $ne: [field, null] would count unset timestamps too. $gt: [field, null]
  // is true only when the timestamp is actually set.
  const isSet = (field: string) => ({ $cond: [{ $gt: [field, null] }, 1, 0] });
  // Guard against division by zero when a stage has no notifications
  const pct = (part: number, whole: number) =>
    whole > 0 ? (part / whole * 100).toFixed(2) + '%' : 'n/a';
  // With the native MongoDB driver, aggregate() returns a cursor: append .toArray()
  const results = await db.metrics.aggregate([
    {
      $match: {
        type: type,
        queuedAt: { $gte: new Date(Date.now() - days * 86400000) }
      }
    },
    {
      $group: {
        _id: "$channel",
        total: { $sum: 1 },
        sent: { $sum: isSet("$sentAt") },
        delivered: { $sum: isSet("$deliveredAt") },
        viewed: { $sum: isSet("$viewedAt") },
        clicked: { $sum: isSet("$clickedAt") }
      }
    }
  ]);
  return results.map(r => ({
    channel: r._id,
    deliveryRate: pct(r.sent, r.total),
    reachRate: pct(r.delivered, r.sent),
    openRate: pct(r.viewed, r.delivered),
    ctr: pct(r.clicked, r.viewed)
  }));
}
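The funnel arithmetic is worth isolating, because each rate divides by the previous stage rather than by the total. Here is a Python sketch, with the hypothetical names `pct` and `funnel`; returning `None` for an empty stage is an assumed convention.

```python
def pct(numerator, denominator):
    """Percentage rounded to two decimals, or None when the stage is empty."""
    if denominator == 0:
        return None
    return round(numerator / denominator * 100, 2)


def funnel(total, sent, delivered, viewed, clicked):
    """Each rate is measured against the stage directly before it."""
    return {
        "delivery_rate": pct(sent, total),      # queued -> sent
        "reach_rate": pct(delivered, sent),     # sent -> reached a device
        "open_rate": pct(viewed, delivered),    # delivered -> viewed
        "ctr": pct(clicked, viewed),            # viewed -> acted on
    }
```

Dividing stage by stage keeps the numbers diagnostic: a low open rate alongside a high reach rate points at the message itself, while a low reach rate points at tokens or providers.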
In the End, It Was a Logistics Center
Looking back, designing a notification system wasn't about sending messages. It was operating a massive logistics center.
- Channel selection: Which shipping method (truck/ship/plane)
- Message queue: The logistics center's conveyor belt
- Templates: Standardized packaging boxes
- Batching: Combining items going the same direction into one truck
- Priority: Standard shipping vs same-day delivery
- Rate limiting: Don't send more than 10 packages per day to one house
- Retry: Redeliver if recipient is absent
- Metrics: Delivery completion rate, receipt confirmation
When a single-line notification became millions, every layer of design mattered. No queue means crashes, no templates means management hell, no priority means critical notifications get buried, no data means no improvement.
The core was user experience. Technically perfect delivery means nothing if users don't want it or timing is wrong. Success isn't measured by "how many we sent" but "how many valuable notifications we sent at the right moment."
Now I know: sending one notification and designing a system are completely different dimensions of the problem.