1 billion monthly active users. 34 minutes of average daily watch time. A recommendation engine so good it can hook a brand-new user in under 30 minutes — before it knows anything about them.

Most system design guides skip TikTok entirely. Or they reduce it to "it's just a video app with a good algorithm." That's not good enough.

As a senior AI PM, you need to understand why TikTok's architecture is different from YouTube's, what product decisions live inside the For You Page (FYP), and what trade-offs your engineering team is navigating every sprint.

TikTok is arguably the most studied recommendation system in consumer AI right now. It combines short-form video ingestion, cold-start ML (serving great content to brand-new users with zero history), a real-time feedback loop that re-ranks every swipe, and global distribution across markets with wildly different content libraries.

If you can explain TikTok's architecture clearly — including the AI components — you can design any short-form media platform.

This is Part 3 of my Core System Design & AI System Design series — built specifically for AI product managers and data professionals.

Part 1 covered YouTube. Part 2 covered Twitter/X. TikTok is where both of those converge — and where the AI gets significantly more interesting.

This is the breakdown that actually covers it all. 👇

📌 TL;DR

  • TikTok serves 1B+ MAU who average 34 min/day in the app, driven almost entirely by the For You Page (FYP), not the Following feed

  • The FYP recommendation engine is the product. Every architectural decision exists to serve it faster, fresher, and more accurately

  • TikTok's cold-start problem is uniquely hard: unlike YouTube or Twitter, a brand-new user with zero history must get an addictive feed within the first 5–10 videos

  • Five critical subsystems: Video ingestion & transcoding, CDN & streaming, For You Page ML engine, Content moderation, and Creator & live streaming

  • The most important PM insight: TikTok optimizes for completion rate (did you watch the full video?) not just clicks — and that single objective function change explains the entire content ecosystem it created

Ashima Malik

TikTok End to End System Design Architecture

📊 The Numbers That Define Every Decision

Scale isn't a detail — it's the constraint that shapes every architectural choice.

| Metric | Scale |
| --- | --- |
| Monthly active users | 1+ billion |
| Daily active users | 600+ million |
| Average daily session time | 34 minutes |
| Videos uploaded per day | 34+ million |
| Videos in catalog | 3+ billion |
| Video length | 15s to 10 minutes |
| FYP requests per second | Millions |
| Content moderation volume | 34M+ videos/day |
| Markets served | 150+ countries |
| Languages supported | 75+ |

These numbers immediately tell you what the system must do:

  • Unlike YouTube (subscribe-heavy), TikTok is discovery-first — 70%+ of watch time comes from the FYP, not followed accounts

  • Video must start playing in under 1 second — TikTok's UX is scroll-native; any buffering kills the experience

  • The recommendation engine must work with zero user history — cold start is a core product problem, not an edge case

  • 34 million videos/day = content moderation at a scale where human-only review is impractical; automated classifiers have to take the first pass

  • The system must support wildly different content libraries per region — a video popular in Indonesia may never be recommended in Germany

Functional Requirements: What TikTok Must Do

| Feature | Description | Priority |
| --- | --- | --- |
| For You Page (FYP) | Personalized infinite scroll feed, zero following required | P0 |
| Video upload | 15s to 10min video, with effects, music, text overlays | P0 |
| Video playback | Start in <1s, loop seamlessly, full-screen vertical | P0 |
| Following feed | Chronological feed from followed creators | P0 |
| Content moderation | Remove policy violations pre and post publish | P0 |
| Search | Videos, sounds, hashtags, creators | P1 |
| Live streaming | Real-time broadcast with virtual gifts | P1 |
| Duet / Stitch | React to or remix another creator's video | P1 |
| Sounds / music library | Licensed audio attached to videos | P1 |
| Creator analytics | Views, watch time, audience demographics | P1 |
| Comments & reactions | Per-video, with reply threading | P1 |
| Ads (TopView, In-Feed) | Paid content in FYP stream | P1 |
| Direct messages | 1:1 messaging | P2 |

💡 My Take: The P0 list here is deceptively short. But the FYP alone is more architecturally complex than most entire apps. A senior PM would say: "The FYP is the product. Upload and playback exist to serve the FYP. Content moderation exists to protect the FYP. I'll go deep on the recommendation engine because everything else is downstream of it."

Ashima Malik

⚙️ Non-Functional Requirements: Where Architecture Gets Designed

| Requirement | Target | Justification |
| --- | --- | --- |
| Video start time | <1s (p99) | Scroll-native UX: buffering = user leaves |
| FYP recommendation latency | <200ms | Every swipe triggers a re-rank; must feel instant |
| Upload processing time | <60s to FYP-eligible | Creators expect near-instant distribution |
| Cold-start feed quality | Engaging within 5–10 videos | Retention is won or lost in first session |
| Content moderation | <24h for all uploaded content | Policy requirement; viral harmful content compounds fast |
| Availability | 99.99% | Single outage = millions of sessions lost |
| Storage durability | 11 nines | Videos are permanent creator assets |
| CDN latency | <50ms to nearest edge | Global user base; buffering is unacceptable |
| Consistency (view counts) | Eventual | Strong consistency at 600M DAU = global distributed lock |
| Live stream latency | <3s | Virtual gift economy depends on real-time feel |

💡 My Take: The most underappreciated NFR on this list is cold-start feed quality. Every other platform can lean on your history — your subscriptions, your search queries, your likes. TikTok can't. A brand-new user has nothing. And yet TikTok turns new users into addicted daily users faster than any platform in history. That's not a UX trick. That's a cold-start ML system that is genuinely world-class. The PM who understands how it works can build anything.

Ashima Malik

🗂️ High-Level Architecture: The Five Major Subsystems

TikTok's architecture breaks into 5 independently scalable subsystems.

Client (iOS / Android)
         ↓
API Gateway / Global Load Balancer
         ↓
┌─────────────────────────────────────────────────────┐
│  1. Video Ingestion & Transcoding Pipeline           │
│  2. CDN & Video Streaming Layer                     │
│  3. For You Page ML Recommendation Engine           │
│  4. Content Moderation Pipeline                     │
│  5. Creator Tools & Live Streaming                  │
└─────────────────────────────────────────────────────┘
         ↓
Storage Layer (Object Storage + Vector DB + Cache + Graph DB)

1️⃣ Video Ingestion & Transcoding Pipeline

TikTok's upload pipeline has one constraint that YouTube's doesn't: speed to FYP eligibility. A creator who posts a video expects it to be distributed within minutes — not hours.

Video Ingestion and Transcoding Pipeline for TikTok system design

Key Design Decisions

📱 Client-side pre-processing: Unlike YouTube where raw files are uploaded, TikTok's mobile SDK does significant pre-processing before upload: compression, format normalization, and basic quality checks. This reduces upload bandwidth, reduces server-side processing time, and gets the video to FYP eligibility faster. The trade-off: more battery usage on the creator's device.

🎵 Music fingerprinting at ingest: Every video's audio is fingerprinted at upload time. This serves two purposes: copyright matching (like YouTube's Content ID) and sound discovery (linking the video to a sound trend in the FYP). The sound layer is a unique TikTok distribution mechanic: a trending sound can catapult an obscure creator's video into millions of FYPs.

🧠 Visual ML embedding at ingest: This is TikTok's most important ingest step. At upload time, a computer vision model generates a high-dimensional embedding of the video's visual content. This embedding is stored in a vector database and used by the FYP recommendation engine to find visually similar content. This means TikTok can recommend a new video to the right audience even before it has any engagement data, purely based on content similarity.

⚡ Speed to FYP eligibility: TikTok's target: a video should be FYP-eligible within 60 seconds of upload. This requires the moderation classifiers and basic embedding generation to run before the full transcoding pipeline completes. The system publishes a "provisional" FYP flag after fast-path checks, and upgrades to full distribution once all processing completes.
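The provisional-then-full flow described above can be sketched as a small status function. This is a minimal illustration, not TikTok's actual pipeline: the names `IngestState`, `FypStatus`, and `fyp_status`, and the exact set of checks, are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class FypStatus(Enum):
    BLOCKED = "blocked"          # a moderation check failed
    PENDING = "pending"          # fast-path checks have not finished
    PROVISIONAL = "provisional"  # eligible while slower processing continues
    FULL = "full"                # every processing step is complete


@dataclass
class IngestState:
    # None means the classifier has not returned a verdict yet.
    fast_moderation_passed: Optional[bool]
    embedding_ready: bool
    transcoding_complete: bool
    full_moderation_passed: Optional[bool] = None


def fyp_status(state: IngestState) -> FypStatus:
    """Map ingest progress to an FYP eligibility status (hypothetical rules)."""
    if state.fast_moderation_passed is False or state.full_moderation_passed is False:
        return FypStatus.BLOCKED
    if state.fast_moderation_passed is None or not state.embedding_ready:
        return FypStatus.PENDING
    if state.transcoding_complete and state.full_moderation_passed:
        return FypStatus.FULL
    # Fast-path checks passed, but transcoding or full moderation is still running.
    return FypStatus.PROVISIONAL


just_uploaded = fyp_status(IngestState(True, True, False))
fully_processed = fyp_status(IngestState(True, True, True, True))
```

The point of the split is that the fast path (a quick moderation verdict plus an embedding) gates the first distribution, while full transcoding only gates the upgrade.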

💡 My Take: The visual ML embedding at ingest is the architectural decision that makes TikTok's cold-start recommendation work. YouTube serves new videos to your existing subscribers first, then optimizes from their engagement. TikTok has no subscriber base for new creators — so it uses content embedding to find an audience from scratch. That's a fundamentally different product philosophy, and it requires a fundamentally different pipeline.

Ashima Malik

2️⃣ CDN & Video Streaming Layer

TikTok's streaming architecture has one non-negotiable: video must start playing in under 1 second. The scroll-native UX means any buffering breaks the loop.

How TikTok Pre-loads Video

The innermost layer — device local cache — is unique to mobile-first platforms. TikTok's app maintains a local video cache so that rewatching (looping) is served entirely from device storage, with zero network traffic. Loop plays are a major engagement signal, and making them instant is both a UX and data strategy decision.

Adaptive Bitrate for Mobile

| Network condition | Video quality served |
| --- | --- |
| 5G / WiFi | 1080p |
| 4G strong | 720p |
| 4G weak | 480p |
| 3G / poor connection | 240p (audio priority) |
| Offline | Cached videos only |

TikTok uses HLS/DASH-style adaptive bitrate streaming, the same family of protocols YouTube uses, but with much more aggressive pre-buffering because the video format (vertical, looping, short) makes pre-loading cheap.
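In practice an adaptive bitrate player picks a rendition from measured throughput rather than from the network label. Here is a minimal sketch of that selection, assuming a hypothetical bitrate ladder; the kbps thresholds, the `safety_factor`, and the function name are illustrative, not TikTok's real values.

```python
# (minimum usable kbps, rendition) pairs, highest quality first.
# These thresholds are illustrative assumptions.
LADDER = [(5000, "1080p"), (2500, "720p"), (1000, "480p"), (0, "240p")]


def pick_rendition(measured_kbps: float, safety_factor: float = 0.8) -> str:
    """Choose the best rendition that fits within a fraction of measured throughput."""
    if measured_kbps <= 0:
        return "cached-only"  # offline: only locally cached videos can play
    budget = measured_kbps * safety_factor  # leave headroom to avoid stalls
    for min_kbps, rendition in LADDER:
        if budget >= min_kbps:
            return rendition
    return LADDER[-1][1]


on_wifi = pick_rendition(8000)
on_weak_4g = pick_rendition(3000)
```

The safety factor is the knob that trades picture quality for the sub-second start target: a lower value picks smaller renditions and stalls less.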

💡 My Take: Predictive pre-fetching is a product decision disguised as an infrastructure decision. It costs real money — you're downloading content the user might never watch. But TikTok bet that the engagement lift from instant playback was worth the bandwidth cost. And they were right. This is the kind of trade-off that requires a PM who can think across UX, infrastructure cost, and ML simultaneously.

Ashima Malik

3️⃣ For You Page (FYP) ML Recommendation Engine 🤖

This is the product. Every other architectural decision in TikTok exists to serve this system faster, fresher, and more accurately.

The FYP is responsible for over 70% of all content consumed on TikTok. It works with zero follow history for new users. It updates in real-time as you swipe. And it is, by most accounts, the best short-form recommendation system ever built.

🔥 Hot take: The FYP isn't a feature. It's a paradigm shift. Every previous social platform was follow-graph-first: who you follow defines what you see. TikTok is interest-graph-first: what you watch defines everything, and follows are optional. That's not a design choice. That's a product philosophy encoded into an ML architecture.

Ashima Malik

The Three-Stage FYP Architecture

Stage 1 — Candidate Retrieval: The Two-Tower Model

Like YouTube, TikTok uses a two-tower neural network for candidate retrieval — and understanding how it works is what separates senior PM candidates from everyone else.

The two-tower model runs two completely separate neural networks in parallel:

🟣 User tower
Input: watch history, completed videos, liked content, search queries, device type, time of day, current session signals
Output: a single 256-dimensional user embedding vector, a mathematical fingerprint of this user's content taste, computed fresh at every request

🟢 Video tower
Input: visual frame features (computer vision), audio features, transcript, captions, hashtags, engagement velocity, upload recency
Output: a 256-dimensional video embedding vector, a mathematical fingerprint of this video's content, pre-computed offline at ingest time and cached in a vector index (for example, one built with the Faiss library)

Matching via ANN search: The system finds the ~500 videos whose embedding vectors are geometrically closest to the user's embedding. This uses Approximate Nearest Neighbor (ANN) search across 3B+ pre-cached video embeddings — returning results in under 50ms.

The critical efficiency: the content-derived video embedding is computed once at upload and not recomputed unless the video's content changes, while fast-moving signals such as engagement velocity are tracked separately in real time. Only the user embedding is computed at request time. This means the most expensive computation happens offline: TikTok is essentially pre-indexing the entire content library for instant retrieval.
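To make the retrieval step concrete, here is a minimal sketch of matching a user embedding against pre-computed video embeddings. It uses exact brute-force cosine similarity as a stand-in for ANN search (a production system would use an ANN index instead), and the 3-dimensional vectors, video ids, and function names are made up for illustration.

```python
import heapq
import math


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def retrieve(user_embedding, video_index, k):
    """Return the k video ids most similar to the user embedding (exact search)."""
    user = normalize(user_embedding)
    scored = ((dot(user, vec), video_id) for video_id, vec in video_index.items())
    return [video_id for _, video_id in heapq.nlargest(k, scored)]


# "Offline" step: video embeddings are computed and normalized once, at ingest.
video_index = {
    "ramen_tour": normalize([0.9, 0.1, 0.0]),
    "bbq_review": normalize([0.8, 0.3, 0.1]),
    "dance_clip": normalize([0.0, 0.2, 0.9]),
}

# "Online" step: only the user embedding is computed at request time.
user_embedding = [1.0, 0.2, 0.0]
top = retrieve(user_embedding, video_index, k=2)
```

The split mirrors the text: the dictionary is built once per video, and each request costs one user vector plus a similarity lookup.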

Why this matters for cold start: This is TikTok's architectural advantage over every competitor. For a brand-new user with zero history, the user tower generates a very weak initial embedding — but the video tower still works perfectly. Even one completed video is enough to update the user embedding and find similar content. The cold start problem is solved at the architecture level, not the product level.

Additional retrieval signals layered on top of two-tower results:

  • Trending / viral signals — videos with high completion velocity surface regardless of personalization

  • Geographic signals — region-specific content pools ensure cultural relevance

  • Social graph signals — followed creators as a weak secondary signal (TikTok is interest-graph-first, not follow-graph-first)

For new users, the system leans almost entirely on content similarity + trending signals until enough watch events accumulate to build a meaningful user embedding. The first 5–10 videos are a deliberate Bayesian experiment — each swipe narrows the interest profile.

Stage 2 — Ranking: What TikTok Actually Optimizes For

This is the most important product decision in TikTok's architecture.

The objective function: TikTok's ranking model primarily optimizes for video completion rate — what percentage of the video you watch before swiping.

| Signal | Weight | Why |
| --- | --- | --- |
| Completion rate | Highest | Watched to end = genuine interest |
| Replay / loop | Very high | Rewatched = strong positive signal |
| Share | High | Shared = strong enough to show others |
| Comment | Medium-high | Engaged enough to type |
| Like | Medium | Easy to give; less signal than completion |
| Follow from video | High | Converted to creator relationship |
| "Not interested" | Very negative | Explicit negative signal |
| Early swipe | Negative | Left before 50% = weak fit |
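One way to read the table is as a weighted sum over predicted engagement probabilities. The sketch below assumes exactly that; the numeric weights only preserve the table's ordering and are invented for illustration, as are the function names and the example predictions.

```python
# Illustrative weights that follow the ordering in the table above.
# The actual values are not public; these numbers are assumptions.
WEIGHTS = {
    "completion": 1.0,
    "replay": 0.8,
    "share": 0.6,
    "follow": 0.6,
    "comment": 0.5,
    "like": 0.3,
    "early_swipe": -0.5,
    "not_interested": -1.0,
}


def rank_score(predicted):
    """Weighted sum of predicted probabilities for each engagement signal."""
    return sum(weight * predicted.get(signal, 0.0) for signal, weight in WEIGHTS.items())


def rank(candidates):
    """Order candidate video ids by descending score."""
    return sorted(candidates, key=lambda vid: rank_score(candidates[vid]), reverse=True)


candidates = {
    "famous_creator": {"completion": 0.2, "like": 0.6, "early_swipe": 0.7},
    "small_creator": {"completion": 0.8, "replay": 0.3, "like": 0.1},
}
ordered = rank(candidates)
```

Note that follower count is not an input at all in this sketch, which is the mechanism behind a small account outranking a large one when its video is actually watched.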

💡 My Take: This is the most consequential PM decision in TikTok's history — and it's rarely discussed. By optimizing for completion rate over likes or clicks, TikTok made the feed merit-based in a way no previous platform had. A video from a 0-follower account that gets watched to completion will be served to more people than a video from a 10M-follower account that gets skipped. That's a product philosophy: content quality beats creator fame. Every aspect of TikTok's creator economy flows from this single ML objective.

Stage 3 — Re-ranking: The Guardrails

Raw ranking output goes through a final re-ranking step:

  • Diversity rules: No two consecutive videos from the same creator; max N% of feed from same sound/hashtag

  • Policy filters: Deprioritize (not remove) borderline content; don't recommend certain categories to users under 18

  • Freshness injection: Periodically inject new/trending content to prevent the feed from becoming a filter bubble

  • Ad insertion: Every N organic videos, insert a paid placement (optimized by a separate ad ranking model)
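Two of the guardrails above, creator diversity and ad insertion, can be sketched as simple post-processing passes over the ranked list. This is a hedged illustration: the greedy strategy, the function names, and the tuple layout are assumptions, not a description of TikTok's implementation.

```python
def enforce_creator_diversity(ranked):
    """Greedy pass: never place two consecutive videos from the same creator.

    ranked is a list of (video_id, creator_id) tuples, best first.
    Videos that cannot be placed without breaking the rule are left out of this page.
    """
    feed, pending = [], list(ranked)
    while pending:
        last_creator = feed[-1][1] if feed else None
        idx = next(
            (i for i, (_, creator) in enumerate(pending) if creator != last_creator),
            None,
        )
        if idx is None:
            break  # only same-creator videos remain
        feed.append(pending.pop(idx))
    return feed


def insert_ads(feed, ads, every_n):
    """Insert one ad after every `every_n` organic videos while ads remain."""
    out, ad_iter = [], iter(ads)
    for position, (video_id, _) in enumerate(feed, start=1):
        out.append(video_id)
        if position % every_n == 0:
            ad = next(ad_iter, None)
            if ad is not None:
                out.append(ad)
    return out


ranked = [("v1", "a"), ("v2", "a"), ("v3", "b"), ("v4", "a")]
diverse = enforce_creator_diversity(ranked)
final_feed = insert_ads(diverse, ["ad1"], every_n=2)
```

Keeping these as separate passes after ranking means a rule like "every N videos" can change without touching the ranking model.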

The Cold Start Solution

This deserves its own section because it's TikTok's most impressive product engineering achievement.

The problem: A brand-new user has zero watch history. Collaborative filtering is useless. How do you build an engaging FYP in the first session?

TikTok's approach:

New user opens app
  ↓
Onboarding interest selection (optional, 3–5 categories)
  ↓
"Warm start" pool: curated diverse high-performers
  (top videos across 20+ categories, globally trending)
  ↓
User watches Video 1 — record: completion? loop? skip?
  ↓
Bayesian update: "they watched a cooking video to completion"
  →  Increase P(cooking interest)
  ↓
Video 2 is now biased toward cooking + adjacent categories
  ↓
After 10 videos: rough interest profile established
  ↓
After 50 videos: full personalization kicks in

The key insight: each video is an experiment. TikTok is effectively running a real-time explore/exploit loop on every new user (closer to a multi-armed bandit than a classic A/B test), using their watch behavior to narrow down their interest profile as fast as possible. By video 10, TikTok knows more about your content preferences than most friends do.
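The "Bayesian update" step in the flow above can be illustrated with a Beta-Bernoulli model per category: a completion counts as a success, a skip as a failure. This is one simple way to model it, chosen for illustration; the class name, the uniform prior, and the category list are assumptions.

```python
class InterestProfile:
    """Per-category Beta posterior over 'this user completes videos in this category'."""

    def __init__(self, categories, prior_alpha=1.0, prior_beta=1.0):
        # Beta(1, 1) is a uniform prior: no opinion before the first video.
        self.alpha = {c: prior_alpha for c in categories}
        self.beta = {c: prior_beta for c in categories}

    def update(self, category, completed):
        """Record one watch event: a completion is a success, a skip is a failure."""
        if completed:
            self.alpha[category] += 1
        else:
            self.beta[category] += 1

    def p_interest(self, category):
        """Posterior mean of the completion probability for this category."""
        return self.alpha[category] / (self.alpha[category] + self.beta[category])

    def next_category(self):
        """Greedy choice: the category with the highest posterior mean."""
        return max(self.alpha, key=self.p_interest)


profile = InterestProfile(["cooking", "sports", "comedy"])
profile.update("cooking", completed=True)   # watched a cooking video to the end
profile.update("sports", completed=False)   # swiped away from a sports video
best = profile.next_category()
```

The greedy `next_category` only exploits; a real cold-start loop also needs exploration (for example, sampling from each posterior instead of taking the mean) so that untested categories still get shown.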

Real-Time vs. Batch Signals

| Signal | Pipeline | Latency |
| --- | --- | --- |
| Video content embeddings | Batch at ingest | One-time at upload |
| Long-term user interest profile | Batch offline | Daily update |
| Recent watch history (last 50 videos) | Near real-time | Minutes |
| Current session signals (last 5 swipes) | Real-time | Seconds |
| Video engagement velocity (viral detection) | Real-time | Seconds |
| A/B test assignment | Real-time | Milliseconds |

4️⃣ Content Moderation Pipeline

34 million videos per day. Human reviewers cannot scale to this. TikTok's moderation is AI-first with human escalation — and the stakes are higher than most platforms because TikTok's user base skews younger.
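A quick back-of-the-envelope calculation shows why human-only review does not work at this volume. The upload figure comes from the scale table earlier in this piece; the 30 seconds per review and the 8-hour shift are assumptions picked only to make the arithmetic concrete.

```python
UPLOADS_PER_DAY = 34_000_000   # from the scale table earlier in this piece
SECONDS_PER_REVIEW = 30        # assumption: one quick human pass per video
REVIEW_HOURS_PER_SHIFT = 8     # assumption: one reviewer works one shift per day

uploads_per_second = UPLOADS_PER_DAY / 86_400
review_hours_per_day = UPLOADS_PER_DAY * SECONDS_PER_REVIEW / 3600
reviewers_needed = review_hours_per_day / REVIEW_HOURS_PER_SHIFT

print(f"{uploads_per_second:.0f} uploads per second")
print(f"{review_hours_per_day:,.0f} review hours per day")
print(f"{reviewers_needed:,.0f} reviewers per day for a single pass")
```

Under these assumptions that is roughly 394 uploads every second and about 35,000 reviewers working full shifts every day just for one first pass, before appeals, re-review, or live content.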

The Confidence Threshold — A PM Decision

| Confidence | Action |
| --- | --- |
| >95% violation | Block immediately, no appeal required |
| 80–95% violation | Remove + notify creator + appeal available |
| 60–80% | Restrict: no FYP, no search, followers only |
| 40–60% | Age-gate or reduce distribution |
| <40% | Allow, flag for periodic re-review |
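The threshold table maps directly onto a small decision function. In the sketch below the band edges come from the table, while the handling of exact boundary values (for example, whether 0.95 blocks or removes) and the action names are assumptions.

```python
def moderation_action(violation_confidence: float) -> str:
    """Map a classifier's violation confidence (0.0 to 1.0) to a moderation action."""
    if not 0.0 <= violation_confidence <= 1.0:
        raise ValueError("confidence must be between 0.0 and 1.0")
    if violation_confidence > 0.95:
        return "block"                 # block immediately
    if violation_confidence >= 0.80:
        return "remove_with_appeal"    # remove, notify creator, allow appeal
    if violation_confidence >= 0.60:
        return "restrict"              # no FYP, no search, followers only
    if violation_confidence >= 0.40:
        return "age_gate_or_reduce"    # age-gate or reduce distribution
    return "allow_and_flag"            # allow, flag for periodic re-review


actions = [moderation_action(c) for c in (0.97, 0.85, 0.70, 0.50, 0.10)]
```

Because the thresholds are plain constants, moving one of them is a one-line change with a large product effect, which is why the threshold is a PM decision rather than a modeling detail.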

💡 My Take: The moderation threshold is one of the highest-stakes product decisions a PM can own. Set it too aggressive: creators get false-positived, trust collapses, content supply drops. Set it too lenient: harmful content goes viral, brand damage, regulatory risk. TikTok has drawn criticism in both directions, for harmful content that slipped through and for over-removal of legitimate creator videos. Both are threshold failures, and that makes them product failures, not just engineering failures.

Ashima Malik

Minor Protection: A Separate Policy Layer

TikTok's under-18 user protection is a distinct moderation layer built on top of the base classifier:

  • Content involving alcohol, extreme dieting, or relationship advice is never recommended to users under 16

  • DMs are disabled by default for users under 16

  • Hosting a live stream is restricted to users 18+ (raised from 16+ in late 2022), as is sending or receiving virtual gifts

  • These rules are enforced at the re-ranking stage — the recommendation engine generates the same candidates, but the policy layer filters before serving
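The "same candidates, separate policy layer" idea can be sketched as a filter applied after ranking. The restricted categories come from the list above; the tag names, the function name, and the data layout are hypothetical.

```python
# Categories the list above says are never recommended to users under 16.
RESTRICTED_UNDER_16 = {"alcohol", "extreme_dieting", "relationship_advice"}


def apply_minor_policy(candidates, user_age):
    """Filter an already-ranked candidate list by age policy.

    candidates is a list of (video_id, set_of_category_tags), best first.
    The ranking model is untouched; only this layer knows about age rules.
    """
    if user_age >= 16:
        return [video_id for video_id, _ in candidates]
    return [
        video_id
        for video_id, tags in candidates
        if not (tags & RESTRICTED_UNDER_16)
    ]


candidates = [
    ("v1", {"cooking"}),
    ("v2", {"alcohol", "cooking"}),
    ("v3", {"comedy"}),
]
feed_for_15 = apply_minor_policy(candidates, user_age=15)
feed_for_30 = apply_minor_policy(candidates, user_age=30)
```

Since the rules live in one set and one function, a policy change is a config update rather than a model retrain.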

5️⃣ Creator Tools & Live Streaming

Live streaming on TikTok is architecturally different from the main feed — and it's also a primary monetization vector (virtual gifts are a multi-billion dollar revenue stream).

Live Streaming Architecture

Live vs. VoD Architecture Differences

| Dimension | Regular video (VoD) | Live stream |
| --- | --- | --- |
| Ingest | File upload | RTMP real-time |
| Transcoding | Async, post-upload | Real-time, during stream |
| CDN | Pull (cached) | Push (no cache possible) |
| Latency target | <1s start time | <3s glass-to-glass |
| FYP treatment | ML-ranked after upload | Surfaced via "Live" tab + social graph |
| Monetization | Ad revenue share | Virtual gifts (primary) |
| Storage | Permanent | Optional VOD save after stream |

The Sound / Music System

TikTok's music layer deserves special mention because it's both a legal and technical achievement:

  • TikTok has licensing deals with all major labels (Universal, Warner, Sony)

  • Every sound used in a video is fingerprinted at upload and linked to a sound entity in the graph

  • Sounds become distribution nodes: using a trending sound increases a video's probability of FYP placement

  • When a sound is unlicensed in a region, the video is muted in that region but remains accessible — a graceful degradation rather than a takedown

  • The "sounds" tab creates a secondary content graph alongside the creator graph — two parallel recommendation systems in one app

💾 Storage Layer: The Database Decisions

| Data Type | Storage System | Justification |
| --- | --- | --- |
| Raw video files | Object storage (S3-compatible) | Immutable blobs; write-once, read-many |
| Processed video segments | CDN + Object storage | Edge-distributed; high read throughput |
| Video metadata | Distributed KV store | High read throughput; eventual consistency fine |
| User profiles + auth | MySQL (sharded) | ACID required for account/billing operations |
| FYP interaction history | Cassandra | Append-only; massive write volume; time-ordered |
| Video content embeddings | Vector DB (Faiss/Milvus) | ANN search for content-based retrieval |
| Social graph (follows) | Graph DB / distributed KV | Follow relationships; read-heavy |
| Like / view counters | Redis → async flush | Counter aggregation; eventual consistency |
| Comments | MySQL sharded | Threading requires ordering |
| Live gifts / earnings | Transactional DB | Exact counts required for creator payments |
| Search index | Elasticsearch | Full-text + hashtag search |
| ML feature store | Distributed KV + column store | Fast serving reads + batch training |
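The "Redis → async flush" row for counters is worth a small illustration, since it is where eventual consistency shows up most visibly. In this sketch plain in-memory `Counter` objects stand in for both Redis and the durable store; the class and method names are hypothetical.

```python
from collections import Counter


class ViewCounter:
    """Buffer increments in a fast store, then flush them to a durable store in batches."""

    def __init__(self):
        self.buffer = Counter()   # stands in for Redis (fast, in-memory)
        self.durable = Counter()  # stands in for the durable database

    def record_view(self, video_id, n=1):
        """Hot path: a cheap in-memory increment with no database write."""
        self.buffer[video_id] += n

    def read(self, video_id):
        """Serve the freshest count available: flushed total plus unflushed buffer."""
        return self.durable[video_id] + self.buffer[video_id]

    def flush(self):
        """Background job: move buffered increments to the durable store."""
        pending, self.buffer = self.buffer, Counter()
        self.durable.update(pending)  # Counter.update adds counts
        return sum(pending.values())


counter = ViewCounter()
counter.record_view("v1")
counter.record_view("v1")
counter.record_view("v2")
before_flush = counter.durable["v1"]   # durable store lags until the flush runs
flushed = counter.flush()
after_flush = counter.durable["v1"]
```

Between flushes the durable store is stale, which is acceptable for likes and views but not for gift earnings, hence the separate transactional store in the table.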

Why TikTok Uses a Vector Database

This is the storage decision most PMs can't explain — and it's the most important one.

TikTok's content-based recommendation (especially for cold start) requires finding videos that are semantically similar to what a user just watched. That's not a keyword search problem. That's a nearest-neighbor search problem in high-dimensional embedding space.

User watches a video about Korean street food
  ↓
Video embedding: [0.23, 0.87, -0.12, ..., 0.45]  (256 dimensions)
  ↓
Vector DB query: "find the 500 nearest vectors"
  ↓
Returns: Korean BBQ video, Japanese ramen video,
         food tour Bangkok video, NYC food market video
  ↓
These become candidates for Stage 2 ranking

A traditional SQL database cannot do this efficiently. A vector index or vector database (the Faiss library, Milvus, or similar) uses approximate nearest neighbor (ANN) indexing to search billions of embeddings in milliseconds.

💡 My Take: The vector database is the architectural decision that makes TikTok's cold start work. Without it, you can't do content-based recommendation at 3 billion video scale. This is also the infrastructure investment that makes TikTok defensible — building and maintaining a high-quality video embedding model + the vector index to search it is a multi-year engineering investment that most competitors can't replicate quickly. For PMs: when you're evaluating a new AI product investment, ask "does this create a data/infrastructure moat?" The vector index is TikTok's moat.

Ashima Malik

The answer that wins interviews:

"TikTok's hardest design problem isn't the recommendation engine — it's the cold start. Every other recommendation system can lean on history. TikTok can't. Their solution is to treat the first 10 videos as a real-time Bayesian experiment: each video is a probe, each swipe is a data point, and the model updates after every interaction. That requires content embeddings generated at ingest time, a vector database for similarity search, and a re-ranking loop that operates in seconds, not minutes. Those three components together are what make the FYP feel magical to a brand-new user."

Ashima Malik

TikTok System Design Interview Questions

Q: What's the biggest architectural difference between TikTok and YouTube's recommendation systems?

Both use a two-tower neural network — a user tower generating a user embedding and a video tower generating a video embedding, matched via ANN search. The difference is in the primary candidate pool and objective function. YouTube is follow-graph-first: subscriptions seed the candidate pool, and the model optimizes for session watch time. TikTok is interest-graph-first: the two-tower model itself is the primary candidate pool (follows are a weak secondary signal), and the model optimizes for per-video completion rate. The practical consequence: TikTok's video tower must do far more work — every new video finds its own audience through content similarity alone, with no subscriber base to seed distribution. That's why video embeddings are computed at ingest and cached in a vector database — the entire cold-start solution lives inside the two-tower architecture.

Q: How does TikTok handle regional content differences?

TikTok runs region-specific content pools. A video popular in Indonesia is unlikely to appear in the German FYP unless it crosses a global virality threshold. The recommendation engine maintains separate interest embeddings per region, and trending signals are computed regionally. The CDN is also region-aware — content is stored at edge nodes closest to its primary audience.

Q: Why does TikTok optimize for completion rate instead of likes?

Likes are cheap to give and easy to manipulate. Completion rate is harder to fake — you either watched the video or you didn't. More importantly, completion rate correlates with genuine interest better than likes do. A 15-second video you watch three times tells the algorithm more than a 5-minute video you liked but left after 30 seconds. The objective function change created the "raw, unpolished, authentic" content aesthetic TikTok is known for — because authentic content gets watched, while polished content gets scrolled past.

Q: How does the "sounds" mechanic affect the recommendation system?

Sound is a second distribution graph running in parallel with the creator graph. Using a trending sound doesn't just get you music licensing — it connects your video to a sound entity that the FYP already knows drives completions. The algorithm surfaces videos with high-completion sounds more aggressively because the sound itself is a quality signal. For creators, this is the fastest path to FYP distribution: attach a trending sound, and you inherit its recommendation history.

Q: How does TikTok protect younger users architecturally?

Minor protection is implemented as a policy layer at Stage 3 re-ranking — not at the model level. The recommendation model generates the same candidates regardless of age, but the re-ranking step applies age-based filters before serving. This is a deliberate architectural choice: keeping policy separate from modeling means policy rules can be updated without retraining the model. It also means auditors can inspect the policy layer independently.

Q: How should an AI PM talk about TikTok's architecture in an interview without just listing components?

Start with the objective function: "TikTok optimizes for completion rate — that's a product decision that created an entirely different content ecosystem." Explain why cold start is unique and how content embeddings solve it. Name the three-stage FYP pipeline and explain why each stage exists (scale reduction, quality ranking, policy enforcement). Close with what you'd measure: not just retention, but creator distribution diversity and content ecosystem health. The engineers want to know you understand that the recommendation model is a product decision with consequences for the entire creator economy — not just a technical optimization problem.

💡 The Honest Take

TikTok's architecture is the most studied AI product system in the world right now — not because the individual components are novel, but because the product decisions inside the architecture are.

The decision to optimize for completion rate isn't an ML decision — it's a product decision that created an entirely different content ecosystem. The decision to use content embeddings for cold start isn't a database decision — it's a product decision that made TikTok accessible to creators with zero audience. The decision to treat sounds as a distribution graph isn't a licensing decision — it's a product decision that created a viral mechanic that no other platform has successfully replicated.

Every architectural component exists because a PM or leader made a call about what trade-off was acceptable.

Understanding the architecture without understanding those trade-offs is just memorizing boxes and arrows.

Your edge as a senior AI PM isn't that you can draw the FYP pipeline. It's that you can explain why it's structured that way, what the alternative was, and what content creator ecosystem it produced.

That's the difference between a PM who can talk about AI systems and one who thinks in them. 🚀

📬 Found this useful? AI PM Insider publishes every week for AI PMs and leaders building at the frontier. This is Part 3 of the Core System Design & AI System Design series. Join at aiskillshub.io

Written by Ashima Malik · LinkedIn
