1 billion monthly active users. 34 minutes of average daily watch time. A recommendation engine so good it can hook a brand-new user in under 30 minutes — before it knows anything about them.
Most system design guides skip TikTok entirely. Or they reduce it to "it's just a video app with a good algorithm." That's not good enough.
As a senior AI PM, you need to understand why TikTok's architecture is different from YouTube's, what product decisions live inside the For You Page (FYP), and what trade-offs your engineering team is navigating every sprint.
TikTok is arguably the most studied recommendation system in consumer AI right now. It combines short-form video ingestion, cold-start ML (serving great content to brand-new users with zero history), a real-time feedback loop that re-ranks every swipe, and global distribution across markets with wildly different content libraries.
If you can explain TikTok's architecture clearly — including the AI components — you can design any short-form media platform.
This is Part 3 of my Core System Design & AI System Design series — built specifically for AI product managers and data professionals.
Part 1 covered YouTube. Part 2 covered Twitter/X. TikTok is where both of those converge — and where the AI gets significantly more interesting.
This is the breakdown that actually covers it all. 👇
📌 TL;DR
TikTok serves 1B+ MAU with an average 34 min/day session — driven almost entirely by the For You Page (FYP), not the Following feed
The FYP recommendation engine is the product. Every architectural decision exists to serve it faster, fresher, and more accurately
TikTok's cold-start problem is uniquely hard: unlike YouTube or Twitter, a brand-new user with zero history must get an addictive feed within the first 5–10 videos
Five critical subsystems: Video ingestion & transcoding, CDN & streaming, For You Page ML engine, Content moderation, and Creator & live streaming
The most important PM insight: TikTok optimizes for completion rate (did you watch the full video?) not just clicks — and that single objective function change explains the entire content ecosystem it created

TikTok End to End System Design Architecture
📊 The Numbers That Define Every Decision
Scale isn't a detail — it's the constraint that shapes every architectural choice.
Metric | Scale |
|---|---|
Monthly active users | 1+ billion |
Daily active users | 600+ million |
Average daily session time | 34 minutes |
Videos uploaded per day | 34+ million |
Videos in catalog | 3+ billion |
Video length | 15s to 10 minutes |
FYP requests per second | Millions |
Content moderation volume | 34M+ videos/day |
Markets served | 150+ countries |
Languages supported | 75+ |
These numbers immediately tell you what the system must do:
Unlike YouTube (subscribe-heavy), TikTok is discovery-first — 70%+ of watch time comes from the FYP, not followed accounts
Video must start playing in under 1 second — TikTok's UX is scroll-native; any buffering kills the experience
The recommendation engine must work with zero user history — cold start is a core product problem, not an edge case
34 million videos/day = content moderation at a scale where human-only review is impractical
The system must support wildly different content libraries per region — a video popular in Indonesia may never be recommended in Germany
✅ Functional Requirements: What TikTok Must Do
Feature | Description | Priority |
|---|---|---|
For You Page (FYP) | Personalized infinite scroll feed, zero following required | P0 |
Video upload | 15s to 10min video, with effects, music, text overlays | P0 |
Video playback | Start in <1s, loop seamlessly, full-screen vertical | P0 |
Following feed | Chronological feed from followed creators | P0 |
Content moderation | Remove policy violations pre and post publish | P0 |
Search | Videos, sounds, hashtags, creators | P1 |
Live streaming | Real-time broadcast with virtual gifts | P1 |
Duet / Stitch | React to or remix another creator's video | P1 |
Sounds / music library | Licensed audio attached to videos | P1 |
Creator analytics | Views, watch time, audience demographics | P1 |
Comments & reactions | Per-video, with reply threading | P1 |
Ads (TopView, In-Feed) | Paid content in FYP stream | P1 |
Direct messages | 1:1 messaging | P2 |
💡 My Take: The P0 list here is deceptively short. But the FYP alone is more architecturally complex than most entire apps. A senior PM would say: "The FYP is the product. Upload and playback exist to serve the FYP. Content moderation exists to protect the FYP. I'll go deep on the recommendation engine because everything else is downstream of it."
⚙️ Non-Functional Requirements: Where Architecture Gets Designed
Requirement | Target | Justification |
|---|---|---|
Video start time | <1s (p99) | Scroll-native UX — buffering = user leaves |
FYP recommendation latency | <200ms | Every swipe triggers a re-rank; must feel instant |
Upload processing time | <60s to FYP-eligible | Creators expect near-instant distribution |
Cold-start feed quality | Engaging within 5–10 videos | Retention is won or lost in first session |
Content moderation | <24h for all uploaded content | Policy requirement; viral harmful content compounds fast |
Availability | 99.99% | Single outage = millions of sessions lost |
Storage durability | 11 nines | Videos are permanent creator assets |
CDN latency | <50ms to nearest edge | Global user base; buffering is unacceptable |
Consistency (view counts) | Eventual | Strong consistency at 600M DAU = global distributed lock |
Live stream latency | <3s | Virtual gift economy depends on real-time feel |
💡 My Take: The most underappreciated NFR on this list is cold-start feed quality. Every other platform can lean on your history — your subscriptions, your search queries, your likes. TikTok can't. A brand-new user has nothing. And yet TikTok turns new users into addicted daily users faster than any platform in history. That's not a UX trick. That's a cold-start ML system that is genuinely world-class. The PM who understands how it works can build anything.
🗂️ High-Level Architecture: The Five Major Subsystems
TikTok's architecture breaks into 5 independently scalable subsystems.
Client (iOS / Android)
↓
API Gateway / Global Load Balancer
↓
┌─────────────────────────────────────────────────────┐
│ 1. Video Ingestion & Transcoding Pipeline │
│ 2. CDN & Video Streaming Layer │
│ 3. For You Page ML Recommendation Engine │
│ 4. Content Moderation Pipeline │
│ 5. Creator Tools & Live Streaming │
└─────────────────────────────────────────────────────┘
↓
Storage Layer (Object Storage + Vector DB + Cache + Graph DB)
1️⃣ Video Ingestion & Transcoding Pipeline
TikTok's upload pipeline has one constraint that YouTube's doesn't: speed to FYP eligibility. A creator who posts a video expects it to be distributed within minutes — not hours.

Video Ingestion and Transcoding Pipeline for TikTok system design
Key Design Decisions
📱 Client-side pre-processing
Unlike YouTube, where raw files are uploaded, TikTok's mobile SDK does significant pre-processing before upload: compression, format normalization, and basic quality checks. This reduces upload bandwidth, reduces server-side processing time, and gets the video to FYP eligibility faster. The trade-off: more battery usage on the creator's device.
🎵 Music fingerprinting at ingest
Every video's audio is fingerprinted at upload time. This serves two purposes: copyright matching (like YouTube's Content ID) and sound discovery (linking the video to a sound trend in the FYP). The sound layer is a unique TikTok distribution mechanic: a trending sound can catapult an obscure creator's video into millions of FYPs.
🧠 Visual ML embedding at ingest
This is TikTok's most important ingest step. At upload time, a computer vision model generates a high-dimensional embedding of the video's visual content. This embedding is stored in a vector database and used by the FYP recommendation engine to find visually similar content. This means TikTok can recommend a new video to the right audience even before it has any engagement data, purely based on content similarity.
⚡ Speed to FYP eligibility
TikTok's target: a video should be FYP-eligible within 60 seconds of upload. This requires the moderation classifiers and basic embedding generation to run before the full transcoding pipeline completes. The system publishes a "provisional" FYP flag after fast-path checks, and upgrades to full distribution once all processing completes.
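Here is a minimal sketch of that two-phase eligibility logic. The state names and the three status fields (`moderation_fast_pass`, `embedding_ready`, `transcode_complete`) are hypothetical labels chosen for illustration, not TikTok's actual pipeline:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Eligibility(Enum):
    PENDING = "pending"          # fast-path checks not finished yet
    PROVISIONAL = "provisional"  # fast-path passed, full processing still running
    FULL = "full"                # all processing complete
    BLOCKED = "blocked"          # failed the fast moderation check


@dataclass
class IngestStatus:
    # Hypothetical status fields; None means the classifier is still running.
    moderation_fast_pass: Optional[bool] = None
    embedding_ready: bool = False
    transcode_complete: bool = False


def fyp_eligibility(status: IngestStatus) -> Eligibility:
    """Derive the distribution state from the ingest checks finished so far."""
    if status.moderation_fast_pass is False:
        return Eligibility.BLOCKED
    if status.moderation_fast_pass is None or not status.embedding_ready:
        return Eligibility.PENDING
    if status.transcode_complete:
        return Eligibility.FULL
    return Eligibility.PROVISIONAL
```

The point of the sketch: the provisional state depends only on the fast checks, so the slow transcoding work never sits on the critical path to first distribution.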
💡 My Take: The visual ML embedding at ingest is the architectural decision that makes TikTok's cold-start recommendation work. YouTube serves new videos to your existing subscribers first, then optimizes from their engagement. TikTok has no subscriber base for new creators — so it uses content embedding to find an audience from scratch. That's a fundamentally different product philosophy, and it requires a fundamentally different pipeline.
2️⃣ CDN & Video Streaming Layer
TikTok's streaming architecture has one non-negotiable: video must start playing in under 1 second. The scroll-native UX means any buffering breaks the loop.
How TikTok Pre-loads Video

The innermost layer — device local cache — is unique to mobile-first platforms. TikTok's app maintains a local video cache so that rewatching (looping) is served entirely from device storage, with zero network traffic. Loop plays are a major engagement signal, and making them instant is both a UX and data strategy decision.
Adaptive Bitrate for Mobile
Network condition | Video quality served |
|---|---|
5G / WiFi | 1080p |
4G strong | 720p |
4G weak | 480p |
3G / poor connection | 240p (audio priority) |
Offline | Cached videos only |
TikTok uses HLS/DASH adaptive streaming — same protocol as YouTube — but with much more aggressive pre-buffering because the video format (vertical, looping, short) makes pre-loading cheap.
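To make the quality ladder concrete, here is a rough sketch of how a client might pick a rendition from measured throughput. The resolutions come from the table above; the throughput cut-offs and the `safety_factor` are illustrative assumptions, not published TikTok values:

```python
# Rendition ladder: (label, assumed minimum sustained throughput in kbit/s).
# The kbit/s numbers are placeholders chosen only for illustration.
LADDER = [
    ("1080p", 5000),
    ("720p", 2500),
    ("480p", 1000),
    ("240p", 0),
]


def pick_rendition(throughput_kbps: float, safety_factor: float = 0.8) -> str:
    """Return the highest rendition that fits inside a safety margin."""
    budget = throughput_kbps * safety_factor
    for label, min_kbps in LADDER:
        if budget >= min_kbps:
            return label
    return LADDER[-1][0]  # fall back to the lowest rung


print(pick_rendition(10000))  # fast connection
print(pick_rendition(3000))   # mid-range connection
print(pick_rendition(100))    # poor connection
```

The safety margin is the interesting product lever here: a lower factor means fewer stalls but softer video, which is the same trade-off the table encodes.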
💡 My Take: Predictive pre-fetching is a product decision disguised as an infrastructure decision. It costs real money — you're downloading content the user might never watch. But TikTok bet that the engagement lift from instant playback was worth the bandwidth cost. And they were right. This is the kind of trade-off that requires a PM who can think across UX, infrastructure cost, and ML simultaneously.
3️⃣ For You Page (FYP) ML Recommendation Engine 🤖
This is the product. Every other architectural decision in TikTok exists to serve this system faster, fresher, and more accurately.
The FYP is responsible for over 70% of all content consumed on TikTok. It works with zero follow history for new users. It updates in real-time as you swipe. And it is, by most accounts, the best short-form recommendation system ever built.
🔥 Hot take: The FYP isn't a feature. It's a paradigm shift. Every previous social platform was follow-graph-first: who you follow defines what you see. TikTok is interest-graph-first: what you watch defines everything, and follows are optional. That's not a design choice. That's a product philosophy encoded into an ML architecture.
The Three-Stage FYP Architecture

Stage 1 — Candidate Retrieval: The Two-Tower Model
Like YouTube, TikTok uses a two-tower neural network for candidate retrieval — and understanding how it works is what separates senior PM candidates from everyone else.
The two-tower model runs two completely separate neural networks in parallel:
🟣 User tower
Input: watch history, completed videos, liked content, search queries, device type, time of day, current session signals
Output: a single 256-dimensional user embedding vector, a mathematical fingerprint of this user's content taste, computed fresh at every request
🟢 Video tower
Input: visual frame features (computer vision), audio features, transcript, captions, hashtags, engagement velocity, upload recency
Output: a 256-dimensional video embedding vector, a mathematical fingerprint of this video's content, pre-computed offline at ingest time and cached in a vector index (for example, one built with the Faiss similarity-search library)
Matching via ANN search: The system finds the ~500 videos whose embedding vectors are geometrically closest to the user's embedding. This uses Approximate Nearest Neighbor (ANN) search across 3B+ pre-cached video embeddings — returning results in under 50ms.
The critical efficiency: video embeddings are computed once at upload and never recomputed (unless the video's content changes). Only the user embedding is computed at request time. This means the most expensive computation happens offline — TikTok is essentially pre-indexing the entire content library for instant retrieval.
Why this matters for cold start: This is TikTok's architectural advantage over every competitor. For a brand-new user with zero history, the user tower generates a very weak initial embedding — but the video tower still works perfectly. Even one completed video is enough to update the user embedding and find similar content. The cold start problem is solved at the architecture level, not the product level.
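A toy sketch of the retrieval step helps here. It uses 3-dimensional vectors instead of 256, an exact brute-force search standing in for ANN, and a simple interpolation standing in for the user tower (which in reality is a neural network). All names and numbers are made up for illustration:

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(user_embedding, video_index, k):
    """Exact top-k by cosine similarity (a stand-in for ANN search)."""
    scored = sorted(
        video_index.items(),
        key=lambda item: cosine(user_embedding, item[1]),
        reverse=True,
    )
    return [video_id for video_id, _ in scored[:k]]


def update_user_embedding(current, watched_video, weight=0.5):
    """Nudge the user vector toward a video the user completed."""
    return [(1 - weight) * c + weight * v for c, v in zip(current, watched_video)]


# Video embeddings are precomputed; only the user vector changes per request.
video_index = {
    "cooking_1": [1.0, 0.0, 0.0],
    "cooking_2": [0.9, 0.1, 0.0],
    "dance_1": [0.0, 1.0, 0.0],
    "gaming_1": [0.0, 0.0, 1.0],
}
new_user = [0.0, 0.0, 0.0]  # zero history
after_one_watch = update_user_embedding(new_user, video_index["cooking_1"])
print(retrieve(after_one_watch, video_index, k=2))
```

One completed cooking video is enough to pull both cooking videos to the top of the candidate list, which is the cold-start behavior described above in miniature.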
Additional retrieval signals layered on top of two-tower results:
Trending / viral signals — videos with high completion velocity surface regardless of personalization
Geographic signals — region-specific content pools ensure cultural relevance
Social graph signals — followed creators as a weak secondary signal (TikTok is interest-graph-first, not follow-graph-first)
For new users, the system leans almost entirely on content similarity + trending signals until enough watch events accumulate to build a meaningful user embedding. The first 5–10 videos are a deliberate Bayesian experiment — each swipe narrows the interest profile.
Stage 2 — Ranking: What TikTok Actually Optimizes For
This is the most important product decision in TikTok's architecture.
The objective function: TikTok's ranking model primarily optimizes for video completion rate — what percentage of the video you watch before swiping.
Signal | Weight | Why |
|---|---|---|
Completion rate | Highest | Watched to end = genuine interest |
Replay / loop | Very high | Rewatched = strong positive signal |
Share | High | Shared = strong enough to show others |
Comment | Medium-high | Engaged enough to type |
Like | Medium | Easy to give; less signal than completion |
Follow from video | High | Converted to creator relationship |
"Not interested" | Very negative | Explicit negative signal |
Early swipe | Negative | Left before 50% = weak fit |
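To see how a completion-weighted objective plays out, here is a small sketch that scores candidates as a weighted sum of predicted signal probabilities. The numeric weights are assumptions that only preserve the ordering in the table above; the real weights and model form are not public:

```python
# Hypothetical weights mirroring the table's qualitative ordering.
WEIGHTS = {
    "completion": 1.0,
    "replay": 0.9,
    "share": 0.7,
    "follow": 0.7,
    "comment": 0.6,
    "like": 0.4,
    "early_swipe": -0.5,
    "not_interested": -1.0,
}


def rank_score(predicted):
    """Weighted sum of predicted per-signal probabilities (each in [0, 1])."""
    return sum(WEIGHTS[name] * p for name, p in predicted.items())


def rank(candidates):
    """Return video ids sorted from highest to lowest score."""
    return sorted(candidates, key=lambda vid: rank_score(candidates[vid]), reverse=True)


candidates = {
    # Unknown creator, but viewers are predicted to watch to the end.
    "small_creator": {"completion": 0.9, "like": 0.1},
    # Famous creator, lots of likes, but most viewers swipe away early.
    "big_creator": {"completion": 0.2, "like": 0.8, "early_swipe": 0.7},
}
print(rank(candidates))
```

With these assumed weights the small creator's video outranks the heavily liked one, which is the "completion beats fame" effect in a single comparison.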
💡 My Take: This is the most consequential PM decision in TikTok's history — and it's rarely discussed. By optimizing for completion rate over likes or clicks, TikTok made the feed merit-based in a way no previous platform had. A video from a 0-follower account that gets watched to completion will be served to more people than a video from a 10M-follower account that gets skipped. That's a product philosophy: content quality beats creator fame. Every aspect of TikTok's creator economy flows from this single ML objective.
Stage 3 — Re-ranking: The Guardrails
Raw ranking output goes through a final re-ranking step:
Diversity rules: No two consecutive videos from the same creator; max N% of feed from same sound/hashtag
Policy filters: Deprioritize (not remove) borderline content; don't recommend certain categories to users under 18
Freshness injection: Periodically inject new/trending content to prevent the feed from becoming a filter bubble
Ad insertion: Every N organic videos, insert a paid placement (optimized by a separate ad ranking model)
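Two of these guardrails are easy to sketch: the "no two consecutive videos from the same creator" rule and the "ad every N organic videos" rule. This is an illustrative greedy pass under assumed inputs (a ranked list of `(video_id, creator_id)` pairs and a queue of ad ids), not TikTok's actual re-ranker:

```python
def rerank(ranked, ads, ad_every=5):
    """Greedy re-rank: avoid back-to-back creators, insert an ad every N videos.

    ranked: list of (video_id, creator_id) in Stage 2 order.
    ads: list of ad ids, consumed in order.
    """
    pending = list(ranked)
    ad_queue = list(ads)
    feed = []
    last_creator = None
    organic_count = 0
    while pending:
        # Take the best-ranked video from a different creator than the last one.
        # If every remaining video is from the same creator, fall back to index 0.
        pick = next(
            (i for i, (_, creator) in enumerate(pending) if creator != last_creator),
            0,
        )
        video_id, last_creator = pending.pop(pick)
        feed.append(video_id)
        organic_count += 1
        if ad_queue and organic_count % ad_every == 0:
            feed.append(ad_queue.pop(0))
    return feed


ranked = [("v1", "a"), ("v2", "a"), ("v3", "b"), ("v4", "a")]
print(rerank(ranked, ads=["ad1"], ad_every=2))
```

Note the fallback: when only one creator's videos remain, the rule cannot be satisfied, so the sketch relaxes it rather than returning a shorter feed.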
The Cold Start Solution
This deserves its own section because it's TikTok's most impressive product engineering achievement.
The problem: A brand-new user has zero watch history. Collaborative filtering is useless. How do you build an engaging FYP in the first session?
TikTok's approach:
New user opens app
↓
Onboarding interest selection (optional, 3–5 categories)
↓
"Warm start" pool: curated diverse high-performers
(top videos across 20+ categories, globally trending)
↓
User watches Video 1 — record: completion? loop? skip?
↓
Bayesian update: "they watched a cooking video to completion"
→ Increase P(cooking interest)
↓
Video 2 is now biased toward cooking + adjacent categories
↓
After 10 videos: rough interest profile established
↓
After 50 videos: full personalization kicks in
The key insight: each video is an experiment. TikTok is running a real-time A/B test on every new user, using their watch behavior to narrow down their interest profile as fast as possible. By video 10, TikTok knows more about your content preferences than most friends do.
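One common way to implement this kind of "each video is an experiment" loop is a Beta-Bernoulli model per category with Thompson sampling. The sketch below assumes that approach purely for illustration; the source does not say which Bayesian method TikTok uses, and the class and category names are made up:

```python
import random


class InterestProfile:
    """Per-category Beta(alpha, beta) belief about P(user completes a video)."""

    def __init__(self, categories, prior=(1.0, 1.0)):
        # Beta(1, 1) is a uniform prior: no opinion about a brand-new user.
        self.params = {c: list(prior) for c in categories}

    def update(self, category, completed):
        """Bayesian update after one watch event."""
        if completed:
            self.params[category][0] += 1  # one more success
        else:
            self.params[category][1] += 1  # one more skip

    def expected_interest(self, category):
        alpha, beta = self.params[category]
        return alpha / (alpha + beta)

    def next_category(self, rng):
        """Thompson sampling: draw from each belief, show the highest draw."""
        draws = {c: rng.betavariate(a, b) for c, (a, b) in self.params.items()}
        return max(draws, key=draws.get)


profile = InterestProfile(["cooking", "dance", "gaming"])
profile.update("cooking", completed=True)   # watched a cooking video to the end
profile.update("gaming", completed=False)   # swiped away from a gaming video
print(profile.expected_interest("cooking"))
print(profile.next_category(random.Random(0)))
```

Sampling instead of always picking the current favorite keeps some exploration in the feed, so the profile does not lock onto the first category a new user happens to finish.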
Real-Time vs. Batch Signals
Signal | Pipeline | Latency |
|---|---|---|
Video content embeddings | Batch at ingest | One-time at upload |
Long-term user interest profile | Batch offline | Daily update |
Recent watch history (last 50 videos) | Near real-time | Minutes |
Current session signals (last 5 swipes) | Real-time | Seconds |
Video engagement velocity (viral detection) | Real-time | Seconds |
A/B test assignment | Real-time | Milliseconds |
4️⃣ Content Moderation Pipeline
34 million videos per day. Human reviewers cannot scale to this. TikTok's moderation is AI-first with human escalation — and the stakes are higher than most platforms because TikTok's user base skews younger.

The Confidence Threshold — A PM Decision
Confidence | Action |
|---|---|
>95% violation | Block immediately, no appeal required |
80–95% violation | Remove + notify creator + appeal available |
60–80% | Restrict: no FYP, no search, followers only |
40–60% | Age-gate or reduce distribution |
<40% | Allow, flag for periodic re-review |
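The threshold table maps directly to a small decision function. The action labels below are shorthand invented for this sketch, and the handling of exact boundary values (for example, whether 0.95 blocks or removes) is an assumption, since the table does not specify it:

```python
def moderation_action(violation_confidence: float) -> str:
    """Map a classifier's violation confidence (0.0 to 1.0) to a policy action."""
    if not 0.0 <= violation_confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if violation_confidence > 0.95:
        return "block"
    if violation_confidence >= 0.80:
        return "remove_with_appeal"
    if violation_confidence >= 0.60:
        return "restrict"
    if violation_confidence >= 0.40:
        return "age_gate_or_reduce"
    return "allow_and_recheck"


for score in (0.99, 0.85, 0.70, 0.50, 0.10):
    print(score, moderation_action(score))
```

Every number in that function is a product decision: moving a single cut-off shifts how many creators get false positives versus how much borderline content reaches the feed.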
💡 My Take: The moderation threshold is one of the highest-stakes product decisions a PM can own. Set it too aggressive: creators get false-positived, trust collapses, content supply drops. Set it too lenient: harmful content goes viral, brand damage, regulatory risk. TikTok has faced both — the CSAM detection failures in 2019 and the over-aggressive dance video removals in 2020. Both were threshold failures. Both were product failures. Not engineering failures.
Minor Protection: A Separate Policy Layer
TikTok's under-18 user protection is a distinct moderation layer built on top of the base classifier:
Content involving alcohol, extreme dieting, or relationship advice is never recommended to users under 16
DMs are disabled by default for users under 16
Live streaming is restricted to users 16+ (18+ for virtual gifts)
These rules are enforced at the re-ranking stage — the recommendation engine generates the same candidates, but the policy layer filters before serving
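A sketch of what "same candidates, separate policy layer" can look like in code. The category tags and the candidate format are hypothetical; only the under-16 rule itself comes from the list above:

```python
# Categories never recommended to viewers under 16 (from the rules above).
# The tag spellings are made up for this example.
RESTRICTED_UNDER_16 = {"alcohol", "extreme_dieting", "relationship_advice"}


def apply_minor_policy(candidates, viewer_age):
    """Filter ranked candidates by age policy without touching the model.

    candidates: list of (video_id, set_of_category_tags) in ranked order.
    """
    if viewer_age >= 16:
        return [video_id for video_id, _ in candidates]
    return [
        video_id
        for video_id, tags in candidates
        if not (tags & RESTRICTED_UNDER_16)
    ]


candidates = [
    ("v1", {"cooking"}),
    ("v2", {"alcohol", "cooking"}),
    ("v3", {"dance"}),
]
print(apply_minor_policy(candidates, viewer_age=15))
print(apply_minor_policy(candidates, viewer_age=30))
```

Because the rule set is plain data, a policy team could change it without retraining or redeploying the recommendation model.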
5️⃣ Creator Tools & Live Streaming
Live streaming on TikTok is architecturally different from the main feed — and it's also a primary monetization vector (virtual gifts are a multi-billion dollar revenue stream).
Live Streaming Architecture

Live vs. VoD Architecture Differences
Dimension | Regular video (VoD) | Live stream |
|---|---|---|
Ingest | File upload | RTMP real-time |
Transcoding | Async, post-upload | Real-time, during stream |
CDN | Pull (cached) | Push (no cache possible) |
Latency target | <1s start time | <3s glass-to-glass |
FYP treatment | ML-ranked after upload | Surfaced via "Live" tab + social graph |
Monetization | Ad revenue share | Virtual gifts (primary) |
Storage | Permanent | Optional VOD save after stream |
The Sound / Music System
TikTok's music layer deserves special mention because it's both a legal and technical achievement:
TikTok has licensing deals with all major labels (Universal, Warner, Sony)
Every sound used in a video is fingerprinted at upload and linked to a sound entity in the graph
Sounds become distribution nodes: using a trending sound increases a video's probability of FYP placement
When a sound is unlicensed in a region, the video is muted in that region but remains accessible — a graceful degradation rather than a takedown
The "sounds" tab creates a secondary content graph alongside the creator graph — two parallel recommendation systems in one app
💾 Storage Layer: The Database Decisions
Data Type | Storage System | Justification |
|---|---|---|
Raw video files | Object storage (S3-compatible) | Immutable blobs; write-once, read-many |
Processed video segments | CDN + Object storage | Edge-distributed; high read throughput |
Video metadata | Distributed KV store | High read throughput; eventual consistency fine |
User profiles + auth | MySQL (sharded) | ACID required for account/billing operations |
FYP interaction history | Cassandra | Append-only; massive write volume; time-ordered |
Video content embeddings | Vector DB (Faiss/Milvus) | ANN search for content-based retrieval |
Social graph (follows) | Graph DB / distributed KV | Follow relationships; read-heavy |
Like / view counters | Redis → async flush | Counter aggregation; eventual consistency |
Comments | MySQL sharded | Threading requires ordering |
Live gifts / earnings | Transactional DB | Exact counts required for creator payments |
Search index | Elasticsearch | Full-text + hashtag search |
ML feature store | Distributed KV + column store | Fast serving reads + batch training |
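The "Redis → async flush" row is worth a quick illustration, because it is what eventual consistency for counters looks like in practice. In this sketch two in-memory `Counter` objects stand in for Redis and the durable store; the class and method names are invented for the example:

```python
from collections import Counter


class BufferedCounter:
    """Absorb hot writes in a fast buffer, then flush deltas in batches."""

    def __init__(self):
        self.pending = Counter()  # stand-in for Redis hot counters
        self.durable = Counter()  # stand-in for the durable store

    def increment(self, video_id, n=1):
        # Cheap write path: no durable-store round trip per like or view.
        self.pending[video_id] += n

    def flush(self):
        """Apply all pending deltas in one batch; return how many keys moved."""
        flushed = len(self.pending)
        self.durable.update(self.pending)
        self.pending.clear()
        return flushed

    def read(self, video_id):
        # Eventually consistent: reflects only what has been flushed so far.
        return self.durable[video_id]


counter = BufferedCounter()
counter.increment("v1")
counter.increment("v1")
before = counter.read("v1")   # likes not yet visible
counter.flush()
after = counter.read("v1")    # visible after the batch flush
print(before, after)
```

The displayed count lags by at most one flush interval, which is an acceptable cost for likes and views but not for gift earnings, hence the separate transactional store in the table.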
Why TikTok Uses a Vector Database
This is the storage decision most PMs can't explain — and it's the most important one.
TikTok's content-based recommendation (especially for cold start) requires finding videos that are semantically similar to what a user just watched. That's not a keyword search problem. That's a nearest-neighbor search problem in high-dimensional embedding space.
User watches a video about Korean street food
↓
Video embedding: [0.23, 0.87, -0.12, ..., 0.45] (768 dimensions)
↓
Vector DB query: "find the 500 nearest vectors"
↓
Returns: Korean BBQ video, Japanese ramen video,
food tour Bangkok video, NYC food market video
↓
These become candidates for Stage 2 ranking
A traditional SQL database cannot do this efficiently. A vector database (Faiss, Milvus, or similar) uses approximate nearest neighbor (ANN) indexing to search billions of embeddings in milliseconds.
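To show what "approximate" buys you, here is a toy ANN index using random-hyperplane locality-sensitive hashing (LSH). This is one classic ANN technique picked for illustration; production systems typically use other index types, and nothing here reflects TikTok's actual index. Vectors are assumed to be roughly unit length:

```python
import random
from collections import defaultdict


class HyperplaneLSH:
    """Bucket vectors by which side of random hyperplanes they fall on."""

    def __init__(self, dim, num_planes=8, seed=0):
        rng = random.Random(seed)
        self.planes = [
            [rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_planes)
        ]
        self.buckets = defaultdict(list)

    def _hash(self, vec):
        # One bit per hyperplane: the sign of the dot product.
        return tuple(
            sum(p * x for p, x in zip(plane, vec)) >= 0 for plane in self.planes
        )

    def add(self, video_id, vec):
        self.buckets[self._hash(vec)].append((video_id, vec))

    def query(self, vec, k):
        """Score only the vectors in the query's bucket, not the whole catalog."""
        candidates = self.buckets.get(self._hash(vec), [])
        ranked = sorted(
            candidates,
            key=lambda item: sum(a * b for a, b in zip(vec, item[1])),
            reverse=True,
        )
        return [video_id for video_id, _ in ranked[:k]]


index = HyperplaneLSH(dim=3)
index.add("korean_street_food", [1.0, 0.0, 0.0])
index.add("unrelated_video", [-1.0, 0.0, 0.0])
print(index.query([1.0, 0.0, 0.0], k=1))
```

The speed comes from skipping most of the catalog; the cost is that a true neighbor in an adjacent bucket can be missed, which is why real systems query multiple tables or use graph-based indexes.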
💡 My Take: The vector database is the architectural decision that makes TikTok's cold start work. Without it, you can't do content-based recommendation at 3 billion video scale. This is also the infrastructure investment that makes TikTok defensible — building and maintaining a high-quality video embedding model + the vector index to search it is a multi-year engineering investment that most competitors can't replicate quickly. For PMs: when you're evaluating a new AI product investment, ask "does this create a data/infrastructure moat?" The vector index is TikTok's moat.
The answer that wins interviews:
"TikTok's hardest design problem isn't the recommendation engine — it's the cold start. Every other recommendation system can lean on history. TikTok can't. Their solution is to treat the first 10 videos as a real-time Bayesian experiment: each video is a probe, each swipe is a data point, and the model updates after every interaction. That requires content embeddings generated at ingest time, a vector database for similarity search, and a re-ranking loop that operates in seconds, not minutes. Those three components together are what make the FYP feel magical to a brand-new user."
TikTok System Design Interview Questions
Q: What's the biggest architectural difference between TikTok and YouTube's recommendation systems?
Both use a two-tower neural network — a user tower generating a user embedding and a video tower generating a video embedding, matched via ANN search. The difference is in the primary candidate pool and objective function. YouTube is follow-graph-first: subscriptions seed the candidate pool, and the model optimizes for session watch time. TikTok is interest-graph-first: the two-tower model itself is the primary candidate pool (follows are a weak secondary signal), and the model optimizes for per-video completion rate. The practical consequence: TikTok's video tower must do far more work — every new video finds its own audience through content similarity alone, with no subscriber base to seed distribution. That's why video embeddings are computed at ingest and cached in a vector database — the entire cold-start solution lives inside the two-tower architecture.
Q: How does TikTok handle regional content differences?
TikTok runs region-specific content pools. A video popular in Indonesia is unlikely to appear in the German FYP unless it crosses a global virality threshold. The recommendation engine maintains separate interest embeddings per region, and trending signals are computed regionally. The CDN is also region-aware — content is stored at edge nodes closest to its primary audience.
Q: Why does TikTok optimize for completion rate instead of likes?
Likes are cheap to give and easy to manipulate. Completion rate is harder to fake — you either watched the video or you didn't. More importantly, completion rate correlates with genuine interest better than likes do. A 15-second video you watch three times tells the algorithm more than a 5-minute video you liked but left after 30 seconds. The objective function change created the "raw, unpolished, authentic" content aesthetic TikTok is known for — because authentic content gets watched, while polished content gets scrolled past.
Q: How does the "sounds" mechanic affect the recommendation system?
Sound is a second distribution graph running in parallel with the creator graph. Using a trending sound doesn't just get you music licensing — it connects your video to a sound entity that the FYP already knows drives completions. The algorithm surfaces videos with high-completion sounds more aggressively because the sound itself is a quality signal. For creators, this is the fastest path to FYP distribution: attach a trending sound, and you inherit its recommendation history.
Q: How does TikTok protect younger users architecturally?
Minor protection is implemented as a policy layer at Stage 3 re-ranking — not at the model level. The recommendation model generates the same candidates regardless of age, but the re-ranking step applies age-based filters before serving. This is a deliberate architectural choice: keeping policy separate from modeling means policy rules can be updated without retraining the model. It also means auditors can inspect the policy layer independently.
Q: How should an AI PM talk about TikTok's architecture in an interview without just listing components?
Start with the objective function: "TikTok optimizes for completion rate — that's a product decision that created an entirely different content ecosystem." Explain why cold start is unique and how content embeddings solve it. Name the three-stage FYP pipeline and explain why each stage exists (scale reduction, quality ranking, policy enforcement). Close with what you'd measure: not just retention, but creator distribution diversity and content ecosystem health. The engineers want to know you understand that the recommendation model is a product decision with consequences for the entire creator economy — not just a technical optimization problem.
💡 The Honest Take
TikTok's architecture is the most studied AI product system in the world right now — not because the individual components are novel, but because the product decisions inside the architecture are.
The decision to optimize for completion rate isn't an ML decision — it's a product decision that created an entirely different content ecosystem. The decision to use content embeddings for cold start isn't a database decision — it's a product decision that made TikTok accessible to creators with zero audience. The decision to treat sounds as a distribution graph isn't a licensing decision — it's a product decision that created a viral mechanic that no other platform has successfully replicated.
Every architectural component exists because a PM or leader made a call about what trade-off was acceptable.
Understanding the architecture without understanding those trade-offs is just memorizing boxes and arrows.
Your edge as a senior AI PM isn't that you can draw the FYP pipeline. It's that you can explain why it's structured that way, what the alternative was, and what content creator ecosystem it produced.
That's the difference between a PM who can talk about AI systems and one who thinks in them. 🚀
Written by Ashima Malik · LinkedIn
