In my previous article — AI Model Evaluation & Robustness: 30+ Must-Know Metrics for Interviews — I covered all the essential evaluation metrics every AI engineer should be confident about.
👉 (You can read it here: https://www.aiskillshub.io/p/ai-model-evaluation-robustness-interview-preparation)

In this follow-up article, I’m focusing on the core system-design and architecture questions you’re likely to encounter in interviews for AI Agent, LLM Engineer, and Senior AI Engineer roles.

These are 30 carefully curated, high-impact questions that not only prepare you for interviews but also help you develop a deeper understanding of how modern AI Agent systems are designed, orchestrated, and deployed in production.

To make this easier to consume, I’ve organized the questions in a clean, readable question-and-answer format.
After the questions, you’ll find detailed explanations of key terminologies and concepts, especially those that may be unfamiliar, such as ZeRO, FSDP, DeepSpeed optimizations, vector databases, observability tools, and more.

I recommend reading the questions first to build intuition, and then reviewing the terminology section to strengthen your conceptual foundation.

Q1: Design a distributed system for training a massive foundational model (e.g., a 7B parameter LLM).

Goal: fit the model and optimizer state in GPU memory, maximize throughput, and minimize communication overhead. Key concepts & components:
• Model Parallelism (Sharding): split model parameters across GPUs so each GPU stores only part of the weights (layer-wise or tensor slicing). Reduces per-GPU memory need. Implemented via PyTorch FSDP or DeepSpeed ZeRO, where optimizer state and gradients are sharded.
• Data Parallelism: replicate the (sharded) model across multiple nodes; each replica processes different mini-batches and performs a local backward pass; gradients are synchronized (all-reduce).
• Hybrid Approach: combine model + data parallelism to scale both memory and throughput for very large models.
• Communication fabric: high-bandwidth, low-latency interconnects (NVLink inside a node, InfiniBand across nodes) and NCCL for efficient all-reduce.
• ZeRO optimizer stages: ZeRO-1/2/3 progressively shard optimizer states, gradients, and parameters across replicas to reduce memory.
• Checkpointing / Activation Checkpointing: trade compute for memory by recomputing activations during the backward pass.
• Storage & IO: parallel filesystems or object stores for checkpoints; staged prefetching to avoid IO stalls.
• Autoscaling & orchestration: schedule jobs with K8s or cluster schedulers; preemptible spot instances can lower cost but require checkpointing.
Why this matters: large models require carefully balancing memory, compute, and communication; design choices directly affect cost, convergence speed, and fault tolerance.

Q2: What are the trade-offs between monolithic and microservice architectures for an AI Agent platform?

Definitions:
• Monolithic: a single deployable process contains all functionality.
• Microservices: the system is split into small, independently deployable services (retriever, ranker, policy, tool executor, etc.).
Trade-offs explained:
• Deployment & complexity: a monolith is simpler to deploy and debug (single process, single log stream). Microservices add operational complexity (service discovery, network management).
• Scaling: a monolith must scale the whole app; microservices let you scale only the heavy components (e.g., LLM inference), saving cost.
• Fault isolation: microservices can fail independently (circuit breakers), reducing blast radius; monolith failures often impact all functionality.
• Team autonomy & tech heterogeneity: microservices allow teams to choose appropriate languages/infra per service.
• Latency & overhead: microservices introduce RPC/network latency, serialization overhead, and increased testing effort.
• Observability & debugging: more moving parts require distributed tracing (OpenTelemetry) and structured logs.
When to pick: small/simple systems → monolith; large, multi-model, multi-team systems with independent scaling needs → microservices.

Q3: How would you architect a RAG (Retrieval-Augmented Generation) pipeline for low-latency queries?

Goal: answer user queries with grounded LLM output, fast and accurately. Pipeline components & optimizations:
1. Document preprocessing & chunking: split documents into semantically coherent chunks (passages) with overlap to avoid boundary loss.
2. Embedding precomputation: compute vector embeddings offline for all chunks and store them in a vector DB.
3. Vector database (ANN): use a fast ANN engine (Milvus, Qdrant, a Faiss backend) tuned with HNSW/IVF + quantization for memory/latency trade-offs (see the sketch after this list).
4. Index tuning: adjust index parameters (M/ef) to trade recall vs latency.
5. Caching: cache hot query embeddings, retrieval results, and common prompt→response pairs.
6. Parallelization & pipelining: run retrieval and LLM pre-processing in parallel; start LLM token generation as soon as partial context is ready (token streaming).
7. Low-latency LLM serving: use vLLM/Triton/TensorRT-LLM with dynamic batching and KV-cache reuse for short-latency inference.
8. Small verifier/evaluator: a lightweight faithfulness checker can filter hallucinations before returning the result.
9. Monitoring & fallbacks: define latency SLOs; degrade to cached answers or simpler models when load is high.
Terminology: ANN = Approximate Nearest Neighbors; KV-cache = cached transformer key/value tensors reused across decoding steps so context is not recomputed on every call.
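
To make steps 2–4 concrete, here is a minimal retrieval sketch using Faiss with an HNSW index. The embedding dimension, the `embed()` helper, and the sample chunks are placeholder assumptions, not part of the original pipeline.

```python
# Minimal ANN retrieval sketch for a RAG pipeline (assumes faiss-cpu and numpy
# are installed; embed() is a hypothetical wrapper around your embedding model).
import numpy as np
import faiss

DIM = 384          # embedding dimension (assumption)
M = 32             # HNSW graph connectivity; higher M = better recall, more memory

def embed(texts):  # hypothetical helper: call your embedding model here
    return np.random.rand(len(texts), DIM).astype("float32")

# Offline: precompute chunk embeddings and build the index.
chunks = ["chunk one ...", "chunk two ...", "chunk three ..."]
index = faiss.IndexHNSWFlat(DIM, M)
index.hnsw.efSearch = 64           # latency/recall knob (the "ef" mentioned above)
index.add(embed(chunks))

# Online: embed the query and retrieve the top-k chunks for the prompt.
query_vec = embed(["user question ..."])
distances, ids = index.search(query_vec, 2)
context = [chunks[i] for i in ids[0]]
print(context)  # feed these chunks into the LLM prompt
```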

Q4: Design a component for "Tool Use" or "Function Calling" within an Agent's architecture.

Component: the Execution Handler (Tool Executor). Responsibilities and design:
• Tool Registry: a catalog of allowed tools: name, input schema (Pydantic), output schema, permission scope, timeout, and cost.
• LLM Parser / Schema Enforcement: the LLM must return structured intent (e.g., a JSON function call). The parser validates it against a Pydantic schema to avoid injection (see the sketch after this list).
• Sandboxed Execution: run tool calls in an isolated environment (restricted filesystem, network ACLs, resource limits). Use ephemeral containers or function sandboxes (Firecracker).
• Security & Access Controls: authentication, rate limits, and capability scoping (which agents/users can call which tools).
• Result Normalization: convert raw tool output into a format the LLM expects; include provenance (timestamps, source).
• Retry & Compensation: exponential backoff for transient failures, plus idempotency keys so retries of non-idempotent calls are safe.
• Observability: structured logs of tool calls, latency, and errors for auditing and debugging.
Why: tool execution must be trustworthy, auditable, and robust because agents interact with external systems and could otherwise cause harm.
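
As a minimal sketch of the schema-enforcement step, the snippet below validates an LLM-produced JSON function call with Pydantic v2 before any tool runs. The tool name, its fields, and the registry dict are illustrative assumptions.

```python
# Validate an LLM-produced "function call" before executing it (Pydantic v2).
from pydantic import BaseModel, Field, ValidationError

class SearchFlightsArgs(BaseModel):      # hypothetical tool input schema
    origin: str = Field(min_length=3, max_length=3)       # IATA code
    destination: str = Field(min_length=3, max_length=3)
    max_results: int = Field(default=5, ge=1, le=50)

TOOL_REGISTRY = {"search_flights": SearchFlightsArgs}     # name -> input schema

def handle_tool_call(tool_name: str, raw_json_args: str):
    schema = TOOL_REGISTRY.get(tool_name)
    if schema is None:
        return {"error": f"unknown tool: {tool_name}"}    # never execute unlisted tools
    try:
        args = schema.model_validate_json(raw_json_args)  # rejects malformed/injected input
    except ValidationError as exc:
        return {"error": "invalid arguments", "details": exc.errors()}
    # At this point, dispatch to the sandboxed executor with validated args.
    return {"ok": True, "args": args.model_dump()}

print(handle_tool_call("search_flights", '{"origin": "SFO", "destination": "JFK"}'))
```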

Q5: What storage technologies would you select for a system that needs to store and serve billions of real-time user-feature vector data points?

Requirements: low-latency reads/writes, horizontal scalability, and efficient similarity search for embeddings. Best-fit architecture:
• Real-time features: Redis or Aerospike as low-latency key-value stores for frequently updated counters and session features.
• Embeddings & similarity search: a vector DB (Pinecone, Milvus, Qdrant) optimized for ANN queries. Choose based on scale, supported index types (HNSW, IVF+PQ), persistence, and durability.
• Cold/historical storage: S3 (object store) plus a data warehouse (Snowflake, BigQuery) for large-scale analytics and model training.
• Consistency & streaming: use a stream (Kafka) as the source of truth for feature updates, with changefeeds to keep KV stores and vector DBs in sync.
Considerations: cost of replication, read/write patterns, consistency SLAs, backup & restore for vector indices, and the ability to reindex.

Q6: Describe the role of API Gateways in securing and managing access to AI microservices.

API Gateway responsibilities:
• Authentication & Authorization: validate tokens (JWT, OAuth) and enforce roles & scopes for secure access.
• Rate Limiting / Quota Management: protect expensive model endpoints from abusive or accidental spikes.
• Request/Response Transformation & Validation: normalize client requests, enforce schemas, and reject malformed inputs before they reach the model.
• Routing & Versioning: route to the correct microservice or model version (blue/green/canary).
• Observability & Logging: a central place for request metrics and traces.
• WAF capabilities: simple filtering (block suspicious payloads) before traffic reaches the model.
Why this matters for AI: model endpoints are expensive; the gateway prevents misuse, enforces policy, and centralizes cross-cutting concerns.

Q7: How do you design for resilience in a multi-model agent system?

Resilience patterns & practices:
• Circuit breakers: stop calling a failing downstream to avoid cascading failure; open/half-open/closed states (see the sketch after this list).
• Retries with exponential backoff & jitter: for transient network errors.
• Graceful degradation / fallbacks: replace heavy LLM calls with cached answers, smaller models, or template responses during an outage.
• Redundancy & multi-AZ deployment: deploy replicas across availability zones/regions.
• Health checks & auto-healing: liveness/readiness probes trigger the autoscaler or orchestrator to restart unhealthy instances.
• Bulkhead partitioning: isolate failure domains (e.g., separate queues for high-priority vs low-priority work).
• Observability & SLOs: define SLAs, monitor error budgets, and use alerting plus runbooks.
Terminology: circuit breaker = a pattern that stops calls to a failing service; bulkhead = isolating resources per function.
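
A minimal, illustrative circuit breaker in Python follows; the thresholds, cooldown, and the `call_llm` placeholder are assumptions, not a production implementation.

```python
# Tiny circuit breaker: closed -> open after N failures, half-open after a cooldown.
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.state = "closed"
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            if time.time() - self.opened_at >= self.reset_timeout:
                self.state = "half-open"          # allow one probe request
            else:
                raise RuntimeError("circuit open: use a fallback response")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold or self.state == "half-open":
                self.state, self.opened_at = "open", time.time()
            raise
        self.failures, self.state = 0, "closed"   # success closes the circuit
        return result

def call_llm(prompt):                             # placeholder downstream call
    return f"response to: {prompt}"

breaker = CircuitBreaker()
print(breaker.call(call_llm, "hello"))
```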

Q8: When designing a feature pipeline, what is the trade-off between streaming and batch processing?

Definitions:
• Streaming: continuous, event-driven computation (Kafka, Flink).
• Batch: periodic bulk processing (Spark, Airflow).
Trade-offs:
• Freshness: streaming provides low latency (near-real-time features) suitable for personalization; batch produces aggregated, stable features (hourly/daily) but they go stale between runs.
• Complexity & cost: streaming systems are operationally complex and compute-intensive; batch is simpler and can handle large aggregations efficiently.
• Consistency & recomputation: batch pipelines are easier to backfill and reproduce; streaming requires careful watermarking and late-event handling.
Typical design: a hybrid (Lambda architecture) → stream for real-time features, batch for heavy aggregates; unify both with a Feature Store (Feast) to avoid skew.

Q9: Explain the concept of 'State Management' for a conversational AI Agent.

State Management: the storage and retrieval of contextual data across turns for coherence and personalization. Key elements:
• Session state: the immediate conversation history (utterances, timestamps).
• Memory types: short-term (last n turns), long-term (user profile, preferences), episodic (previous actions/results).
• Storage: a low-latency KV store (Redis/DynamoDB) keyed by session ID or user ID (see the sketch after this list).
• Consistency: atomic append/overwrite operations to avoid lost updates when parallel requests occur.
• Privacy & retention: policies for what memory persists and for how long (GDPR).
Why: this prevents agents from starting fresh each turn and enables continuity, personalization, and tool re-use.
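
A minimal sketch of session-state storage with redis-py; it assumes a Redis instance on localhost, and the key layout, TTL, and turn format are illustrative assumptions.

```python
# Append conversation turns to a per-session Redis list and read them back.
import json
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)
SESSION_TTL_SECONDS = 60 * 60  # assumption: keep session state for one hour

def append_turn(session_id: str, role: str, text: str) -> None:
    key = f"session:{session_id}:turns"
    r.rpush(key, json.dumps({"role": role, "text": text}))
    r.expire(key, SESSION_TTL_SECONDS)          # retention policy per session

def load_recent_turns(session_id: str, n: int = 10):
    key = f"session:{session_id}:turns"
    return [json.loads(t) for t in r.lrange(key, -n, -1)]  # last n turns only

append_turn("abc123", "user", "Book me a flight to NYC")
append_turn("abc123", "assistant", "Which dates work for you?")
print(load_recent_turns("abc123"))
```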

Q10: What is the primary motivation for decoupling the Feature Store from the Model Registry?

Reasoning & benefits:
• Separation of concerns: the Feature Store manages data transformations and online/offline serving of features; the Model Registry manages model artifacts and metadata (versioning, lineage).
• Reusability & stability: many models share the same features; decoupling lets teams update models without touching feature pipelines.
• Avoiding tight coupling & deployment risk: changing a model should not force reworking the feature-serving infrastructure, and vice versa.
• Operational control: separate SLAs, access control, and scaling strategies.
Result: faster iteration, safer rollouts, and clearer governance.

Q11: How do you handle non-determinism when testing and deploying LLM-powered agents?

Sources of non-determinism: sampling temperature, beam-search randomness, retrieval non-determinism, async tool execution. Mitigations & testing strategy:
• Deterministic modes: set temperature ≈ 0 and use deterministic decoding for critical flows.
• Large test suites & statistical evaluation: use many examples and measure aggregate metrics (mean, variance, confidence intervals) instead of matching single examples (see the sketch after this list).
• Seed control & reproducibility: log RNG seeds and deterministic components.
• Shadow / canary evaluation: run the new behavior in shadow mode and compare distributions (token usage, latency, correctness).
• Accept probabilistic correctness: define tolerance thresholds and use human validation for edge cases.
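
A small sketch of the statistical-evaluation idea: run the agent several times per test case and report a mean pass rate with a normal-approximation confidence interval. The `run_agent` and `passes` helpers are placeholders.

```python
# Aggregate pass-rate evaluation for a non-deterministic agent.
import math
import random

def run_agent(prompt: str) -> str:          # placeholder for the real agent call
    return random.choice(["good answer", "bad answer", "good answer"])

def passes(output: str) -> bool:            # placeholder correctness check
    return output == "good answer"

def evaluate(prompts, trials_per_prompt=20):
    results = [passes(run_agent(p)) for p in prompts for _ in range(trials_per_prompt)]
    n = len(results)
    rate = sum(results) / n
    # 95% normal-approximation confidence interval for the pass rate.
    half_width = 1.96 * math.sqrt(rate * (1 - rate) / n)
    return rate, (rate - half_width, rate + half_width)

rate, ci = evaluate(["book a flight", "summarize this doc"], trials_per_prompt=50)
print(f"pass rate = {rate:.2f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```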

Q12: Describe the architecture and role of a sandbox environment in an AI Agent system.

Purpose: a secure, isolated environment to execute untrusted code or interact with external systems safely. Design elements:
• Isolation: containerization (Docker) or microVMs (Firecracker) limit kernel/system access.
• Resource limiting: CPU, memory, disk, and network constraints to prevent DoS.
• Filesystem & network ACLs: only allow explicitly permitted endpoints.
• Timeouts & monitoring: cap execution time and track system calls.
• Audit & logging: capture all tool outputs and commands for post-hoc review.
Use cases: user-provided code execution, LLM-generated API calls, or third-party plugin invocation.

Q13: What is Dynamic Batching in LLM serving, and why is it crucial for cost efficiency?

Definition: a runtime technique that merges multiple concurrent inference requests into a single GPU batch to amortize kernel-launch overhead and increase GPU utilization. Benefits:
• Higher throughput: more tokens processed per GPU-second.
• Reduced per-request cost: lower effective cost per inference.
• Smoothing: handles spikes by grouping short requests.
Considerations: latency SLOs vs throughput; excessive batching can increase tail latency, so use adaptive policies (a max-latency cap). Engines: vLLM, TensorRT-LLM, and NVIDIA Triton support dynamic batching and KV-cache reuse. A toy batching loop is sketched below.
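
To illustrate the idea (this is not vLLM's actual scheduler), here is a toy batcher that collects requests until either a batch-size or max-wait limit is hit; the queue, limits, and `run_batch` function are assumptions.

```python
# Toy dynamic batcher: flush when the batch is full OR the oldest request
# has waited longer than MAX_WAIT_MS, whichever comes first.
import queue
import time

MAX_BATCH_SIZE = 8
MAX_WAIT_MS = 20.0

requests: "queue.Queue[str]" = queue.Queue()

def run_batch(prompts):                      # placeholder for a real GPU forward pass
    return [f"completion for: {p}" for p in prompts]

def batching_loop(num_iterations=3):
    for _ in range(num_iterations):
        batch, start = [], time.time()
        while len(batch) < MAX_BATCH_SIZE and (time.time() - start) * 1000 < MAX_WAIT_MS:
            try:
                batch.append(requests.get(timeout=0.005))
            except queue.Empty:
                continue                      # keep waiting until the deadline
        if batch:
            print(run_batch(batch))           # one GPU call serves many requests

for p in ["hi", "summarize X", "translate Y"]:
    requests.put(p)
batching_loop()
```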

Q14: Detail the components of a Traceability and Observability Stack for a multi-step agent.

Core stack pieces:
• Metrics (Prometheus / Grafana): latency, token counts, queue lengths, tool-call success rates.
• Structured logs (ELK / Splunk): prompts, responses, tool outputs, user IDs (with PII redaction).
• Distributed tracing (OpenTelemetry / Jaeger): traces and spans for each LLM call, retrieval, and tool call, so end-to-end flows can be reconstructed.
• Auditing & lineage: persistent traces tied to model versions and dataset versions for compliance.
• Alerting & SLOs: set error budgets and use anomaly detection for drift/hallucination spikes.
Why: multi-step agents require visibility into every intermediate step to debug incorrect reasoning or data leaks.

Q15: How would you use a Canary Release strategy when upgrading a core LLM model used by an agent?

Approach:
1. Deploy the new model as a canary (a small % of traffic).
2. Collect metrics: latency, error rate, hallucination rate, business KPIs (conversion, task success).
3. Compare statistically (A/B testing or canary analysis) and watch for regressions.
4. Gradually ramp if stable; roll back automatically on anomalies.
Key mechanisms: traffic split via the gateway, mirrored logging, automated rollback triggers, and human oversight for safety-critical tasks. A minimal traffic-split sketch follows below.
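
A minimal sketch of the traffic-split step, routing a configurable fraction of requests to the canary model; the model names and the 5% weight are assumptions.

```python
# Weighted canary routing: send ~5% of traffic to the new model version.
import random

CANARY_WEIGHT = 0.05                      # ramp this up as the canary proves stable
STABLE_MODEL = "llm-v1"                   # hypothetical deployed model names
CANARY_MODEL = "llm-v2-canary"

def route_request(request_id: str) -> str:
    model = CANARY_MODEL if random.random() < CANARY_WEIGHT else STABLE_MODEL
    # In practice, log (request_id, model) so canary metrics can be compared
    # against the stable baseline before ramping traffic further.
    return model

counts = {"llm-v1": 0, "llm-v2-canary": 0}
for i in range(10_000):
    counts[route_request(f"req-{i}")] += 1
print(counts)   # roughly a 95/5 split
```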

Q16: Explain the difference between Data Parallelism and Pipeline Parallelism in LLM training.

Data Parallelism: entire model replicated across workers; each worker processes different data batches and synchronizes gradients. Pros: simple and scales well for moderate model sizes. Pipeline Parallelism: split the model across devices (layers assigned to GPUs) and stream micro-batches through pipeline stages. Pros: reduces memory footprint per device for very deep models. Hybrid: combine both to scale huge models; need micro-batching, activation checkpointing, and careful scheduling to avoid bubbles.

Q17: Describe the purpose of a Feature Store and how it prevents Training-Serving Skew in production.

Feature Store: a canonical layer for computing, storing, and serving features consistently for training and inference. It prevents skew by:
• A single code path: the same transformation logic is applied offline and online.
• Timestamps & backfills: consistent historical features for training and correct event-time attribution to avoid leakage.
• Online serving: low-latency feature retrieval for inference that matches the shape of the training data.
Example skew causes: computing aggregates differently in training vs online, or accidentally using future data. Feature Stores enforce commonality and lineage.

Q18: What is the main design consideration when choosing between an open-source LLM (self-hosted) and a proprietary LLM (API-based)?

Trade-offs:
• Control & privacy: self-hosted gives full data control (no external API exposure), which is critical for sensitive data.
• Cost & ops overhead: self-hosting needs GPUs, an ops team, and infra spend; APIs trade a higher per-token cost for no infrastructure burden.
• Latency & customization: a self-hosted model can be tuned and fine-tuned locally; APIs may offer best-in-class quality but limited customization, with speed dependent on the network.
• Compliance & SLAs: API providers may supply compliance attestations; with self-hosting you must build compliance yourself.
Decision drivers: data sensitivity, team MLOps maturity, expected scale, and cost model.

Q19: How do you implement Confidence-Aware Routing for an AI Agent?

Mechanics:
1. Compute confidence signals: token log-probabilities, ensemble consensus, self-consistency votes, retrieval-overlap scores, or tool-success indicators.
2. Define thresholds: high confidence → a single cheaper LLM; medium → a larger LLM or RAG; low → human review or restricted actions.
3. Implement a routing layer that uses these signals to select the model/handler (see the sketch after this list).
4. Continuously calibrate thresholds using A/B experiments.
Benefit: reduces cost and risk by escalating only uncertain cases to expensive models or human workflows.
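
A minimal routing sketch based on average token log-probability; the thresholds and tier names are illustrative assumptions that would need calibration against real traffic.

```python
# Route by confidence: cheap model for confident cases, bigger model or human
# review for uncertain ones. Thresholds are assumptions to be calibrated.
HIGH_CONF = -0.3   # average log-prob thresholds (hypothetical values)
MED_CONF = -1.0

def avg_logprob(token_logprobs):
    return sum(token_logprobs) / max(len(token_logprobs), 1)

def route(token_logprobs):
    score = avg_logprob(token_logprobs)
    if score >= HIGH_CONF:
        return "small-llm"          # confident: answer directly with the cheap model
    if score >= MED_CONF:
        return "large-llm-with-rag" # uncertain: escalate to a stronger, grounded path
    return "human-review"           # very uncertain or risky: hand off to a human

print(route([-0.1, -0.2, -0.15]))   # -> small-llm
print(route([-0.8, -1.1, -0.9]))    # -> large-llm-with-rag
print(route([-2.5, -3.0, -1.8]))    # -> human-review
```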

Q20: What is the role of Tokenization in the end-to-end architecture, and how does it impact latency?

Tokenization: converts raw text into the sequence of discrete tokens the LLM processes (subwords via BPE, WordPiece, SentencePiece). Impacts:
• Token count influences compute: more tokens = more transformer ops = higher latency and cost.
• Granularity trade-off: aggressive subword splitting reduces OOV errors but can inflate token count; good tokenizers keep sequences compact.
• Preprocessing time: tokenization itself must be low overhead; if done repeatedly server-side, it adds latency.
Optimization: batch tokenization, reuse embeddings for cached inputs, and use efficient C++ tokenizers in production.

Q21: How would you design a multi-agent workflow orchestrator?

Components & behavior:
• Task graph / DAG engine (Ray, Airflow, Prefect): represent agents/tasks as nodes with dependencies and data contracts.
• Coordinator / controller: schedules agents, enforces timeouts, manages retries, aggregates outputs.
• Message bus / event streaming: Kafka or Redis Streams for async inter-agent communication.
• Shared storage & memory: a feature store or vector DB for shared state.
• Observability & tracing: trace inter-agent messages and decisions.
Key policies: prioritization, resource allocation, per-agent isolation to avoid resource contention, and idempotent processing.

Q22: How do you choose between synchronous vs asynchronous agent execution?

Synchronous: the caller blocks until the response arrives (a simple flow), suitable for low-latency single-step tasks. Asynchronous: the caller receives an acknowledgement or job handle and gets the result later (callbacks, webhooks, polling), which is useful when operations are long (tool calls, external APIs). Decision factors: SLOs (latency vs throughput), coordination complexity (fan-out/fan-in), and user experience (immediate UX vs waiting). Use async for heavy tasks and orchestrate with state machines.

Q23: How do you prevent hallucinations in production LLM agents?

Strategies:
• Grounding via RAG: force the model to cite retrieved docs and prefer retrieval evidence.
• Strict function calls: require structured outputs (schemas) for critical operations and validate the results.
• Confidence thresholds & verifiers: use a small verifier model or an LLM judge to cross-check claims.
• Post-processing checks: rule-based or fact-check APIs for named entities and dates.
• Prompt engineering: chain-of-thought moderation and constraint prompts.
Monitoring: track hallucination metrics (human labels, automated checks) and create alerts.

Q24: How would you scale a multimodal agent (vision + text)?

Decomposition: separate encoders/decoders as microservices: image encoder (CNN/ViT) → produces embeddings, text LLM consumes embeddings via cross-attention or adapter layers. Scaling: keep vision encoder in GPU pool for batched inference; cache common image embeddings; use model distillation to create cheaper student models for high-volume inference. Latency patterns: pipeline image preprocessing and embedding generation before text generation; use async flows and reuse cached multimodal context when possible.

Q25: How to design a high-throughput inference cluster?

Key components:
• Model servers optimized for GPU inference (vLLM, Triton) with support for dynamic batching and KV caching.
• Load balancing with request routing and model sharding.
• Autoscaling policies based on queue length and GPU utilization.
• Edge caching & CDN for static responses.
• Telemetry for token usage, per-request cost, and tail latency.
Optimization techniques: mixed precision (FP16/BF16), quantization, kernel fusion, and batching windows to maximize utilization without violating latency SLOs.

Q26: What is an Agent Memory Store and how is it designed?

Purpose: persist agent memories to enable continuity and personalization. Design:
• Short-term memory: recent turns kept in a session store (Redis) for immediate context.
• Long-term memory: compacted summaries or embeddings stored in a vector DB or database (Weaviate/Chroma).
• Memory management: eviction policies, summarization to compress long histories, and privacy controls.
Access patterns: fast reads on turn start, append writes after each turn, and background summarization.

Q27: How do you design an Evaluation Harness for AI agents?

Components:
• Synthetic and real test suites with diverse scenarios (OOD, edge cases, adversarial).
• Deterministic replay of user sessions for regression tests.
• Automated metric collection: success rate, tool accuracy, latency, hallucination rate.
• Human-evaluation pipelines for subjective metrics and alignment.
• Model gating: fail fast if regressions cross thresholds before promotion to prod (see the sketch after this list).
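
A minimal sketch of the model-gating step: compare the candidate's metrics against the current production baseline and block promotion on regression. The metric names, values, and tolerances are assumptions.

```python
# Gate a candidate agent build: block promotion if any metric regresses
# beyond its allowed tolerance relative to the production baseline.
BASELINE = {"task_success": 0.86, "hallucination_rate": 0.04, "p95_latency_s": 2.1}
CANDIDATE = {"task_success": 0.88, "hallucination_rate": 0.07, "p95_latency_s": 2.0}

# Tolerances are hypothetical: "higher is better" metrics may not drop by more
# than the tolerance; "lower is better" metrics may not rise by more than it.
TOLERANCES = {
    "task_success": ("higher", 0.01),
    "hallucination_rate": ("lower", 0.01),
    "p95_latency_s": ("lower", 0.3),
}

def gate(baseline, candidate, tolerances):
    failures = []
    for metric, (direction, tol) in tolerances.items():
        delta = candidate[metric] - baseline[metric]
        if direction == "higher" and delta < -tol:
            failures.append(f"{metric} regressed by {-delta:.3f}")
        if direction == "lower" and delta > tol:
            failures.append(f"{metric} regressed by {delta:.3f}")
    return ("PROMOTE", []) if not failures else ("BLOCK", failures)

print(gate(BASELINE, CANDIDATE, TOLERANCES))  # hallucination rate regression blocks promotion
```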

Q28: How do you design a scalable embedding pipeline?

Flow: ingestion → text chunking → batching → GPU embedder workers → persist to vector DB. Scalability patterns: queueing (Kafka), autoscale worker pool, deduplication (hashing), incremental reindexing, and sharded vector DB indices. Maintaining freshness: TTL-based refresh, update streams for changed docs, and consistent hashing for rebalancing.

Q29: How do you prevent prompt injections and security risks in agents?

Mitigations:
• Input sanitization & instruction stripping: remove suspicious system tokens or user content that tries to alter agent behavior.
• Schema validation (Pydantic): require structured inputs/outputs for all tool calls.
• Capability policy enforcement: ensure the LLM cannot call sensitive tools without explicit scope.
• Rate limits & monitoring: detect anomalous prompt patterns.
• Sandboxing of external actions, with human approval required for destructive operations.

Q30: How do you design a human-in-the-loop (HITL) system for agents?

Design elements:
• Escalation criteria: confidence thresholds or flagged outputs route to a human reviewer.
• Reviewer UI: show context, retrieved docs, chain-of-thought, and suggested corrections.
• Asynchronous review & feedback loop: human corrections are logged and used to retrain verifiers or fine-tune models.
• Operational metrics: review latency, reviewer accuracy, and cost.
Goal: combine automation with human oversight to manage high-risk decisions while improving models via labeled corrections.

Terminologies Explanation

🧠 1. DeepSpeed ZeRO (Zero Redundancy Optimizer)

DeepSpeed ZeRO is a memory-optimization technique that makes it possible to train very large models (billions of parameters) on multiple GPUs without running out of memory.

💡 Why was ZeRO created?

Training large models requires storing:

  • Model weights

  • Gradients

  • Optimizer states (Adam stores 2 extra vectors for every weight → 3× memory!)

A single GPU cannot store all of this, so ZeRO shards (splits) these across multiple GPUs.

⭐ ZeRO has 3 main stages

Each stage shards a different component.

🔹 ZeRO Stage 1 – Optimizer State Sharding

Optimizers like Adam store:

  • weight

  • m (momentum)

  • v (variance)

For a 7B-parameter model → Adam states = 7B × 2 = 14B extra numbers.

Stage 1 splits these across GPUs
→ each GPU only stores 1/N of the optimizer states.

Weights and gradients still fully replicated.

🔹 ZeRO Stage 2 – + Gradient Sharding

Adds on top of Stage 1.

Now:

  • Optimizer states → sharded

  • Gradients → sharded

Only model weights are replicated.

In the ZeRO paper, this roughly doubles the savings of Stage 1 (about 8× memory reduction versus vanilla data parallelism).

🔹 ZeRO Stage 3 – + Parameter (Weights) Sharding

Adds on top of Stage 1 & 2.

Now everything is sharded:

  • Optimizer states

  • Gradients

  • Model weights

This enables training trillion-scale models.

Example (Simple)

Suppose:

  • Model has 4 parameters: [p1, p2, p3, p4]

  • 2 GPUs

| Component | Vanilla | ZeRO Stage 3 |
| --- | --- | --- |
| Weights | Both GPUs store [p1, p2, p3, p4] | GPU1: [p1, p2], GPU2: [p3, p4] |
| Gradients | Both store all | Sharded |
| Optimizer states | Both store all | Sharded |

Result:
You reduce memory usage by 50% with 2 GPUs — in real models, savings go up to 10×–20×.
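
For context, a ZeRO stage is usually selected through the DeepSpeed JSON config. The sketch below (Stage 3 with CPU optimizer offload) follows the documented config layout, but treat the exact fields as an assumption to verify against the DeepSpeed docs for your version.

```python
# Sketch of a DeepSpeed ZeRO Stage 3 configuration, expressed as a Python dict.
import json

ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 8,
    "bf16": {"enabled": True},                  # mixed precision to cut memory further
    "zero_optimization": {
        "stage": 3,                             # shard optimizer states, gradients, and parameters
        "offload_optimizer": {"device": "cpu"}, # ZeRO-Offload: push optimizer states to CPU RAM
        "overlap_comm": True,                   # overlap communication with computation
    },
}

# Typically this dict is written to ds_config.json and passed to the deepspeed
# launcher / deepspeed.initialize(...) along with the model and optimizer.
with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```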

🧠 2. DeepSpeed ZeRO-Offload

Allows parts of the model (like optimizer states) to be offloaded to CPU RAM, freeing GPU memory.

Useful when:

  • GPU memory < 24GB (e.g., consumer GPUs)

  • you want to train multi-billion parameter models.

🧠 3. DeepSpeed ZeRO-Infinity

A super version of ZeRO:

  • Offloads model weights + optimizer states + gradients into NVMe SSDs

  • Streams only required parts into GPU when needed

This allows training 100B+ parameter models on a handful of GPUs.

🧠 4. FSDP (Fully Sharded Data Parallelism) – PyTorch Alternative to ZeRO

FSDP = PyTorch’s built-in version of ZeRO Stage 3.

It shards:

  • weights

  • grads

  • optimizer states

across GPUs.

💡 How does FSDP work?

During forward/backward:

  1. GPU gathers only the weights it needs for that layer

  2. Runs compute

  3. Re-shards weights

  4. Moves to next layer

This allows training enormous models without OOM.
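
Here is a minimal FSDP sketch in PyTorch. It assumes a single node with CUDA GPUs and a launch via `torchrun`; the toy model and hyperparameters are placeholders standing in for a real LLM training setup.

```python
# Minimal FSDP sketch (PyTorch >= 2.0). Assumes it is launched with:
#   torchrun --nproc_per_node=<num_gpus> fsdp_demo.py
# on a single machine with CUDA GPUs; an illustrative sketch, not a full recipe.
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def main():
    dist.init_process_group("nccl")                    # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Sequential(                       # toy model standing in for an LLM
        torch.nn.Linear(1024, 4096), torch.nn.ReLU(), torch.nn.Linear(4096, 1024)
    ).cuda()

    # FSDP shards parameters, gradients, and optimizer state across ranks
    # (conceptually like ZeRO Stage 3) and gathers weights per layer on the fly.
    model = FSDP(model)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    x = torch.randn(8, 1024, device="cuda")
    loss = model(x).pow(2).mean()
    loss.backward()
    optimizer.step()
    if dist.get_rank() == 0:
        print("step done, loss =", loss.item())
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```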

🧠 5. Tensor (Model) Parallelism

Instead of splitting optimizer states, we split the model itself.

Example:

Dense layer: y = Wx

If W is 10000 × 10000:

  • GPU1 holds first 5000 rows

  • GPU2 holds last 5000 rows

Each GPU computes part of the output → results merged.
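
A toy single-process illustration of this row-wise split using NumPy (real systems like Megatron-LM do this across GPUs with NCCL; the small sizes here are assumptions):

```python
# Toy illustration of splitting a dense layer y = W x by rows of W across two
# "devices" (here just two NumPy arrays in one process).
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 6))     # full weight matrix (would not fit on one GPU at scale)
x = rng.standard_normal(6)

# "GPU1" holds the first half of the rows, "GPU2" the second half.
W_gpu1, W_gpu2 = W[:4, :], W[4:, :]

y_part1 = W_gpu1 @ x                # each device computes its slice of the output
y_part2 = W_gpu2 @ x
y = np.concatenate([y_part1, y_part2])   # merge partial results (an all-gather in practice)

assert np.allclose(y, W @ x)        # identical to computing the full matmul on one device
print(y)
```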

Used in:

  • Megatron-LM

  • GPT-3 training

🧠 6. Pipeline Parallelism

Splits the model layer-wise across GPUs.

Example (4 GPUs):

  • GPU1 → Layers 1–12

  • GPU2 → Layers 13–24

  • GPU3 → Layers 25–36

  • GPU4 → Layers 37–48

Each microbatch flows through this pipeline.

Benefit:

Memory is reduced since each GPU holds only part of the model.

Downside:

Pipeline bubbles (idle time) if not enough microbatches.

🧠 7. Data Parallelism

Classic approach:

  • each GPU has full model copy

  • each GPU trains on a different batch

  • gradients are averaged

E.g.:

GPT-3-scale training combines Data Parallelism + Tensor Parallelism + Pipeline Parallelism (3D Parallelism)

🧠 8. Parameter Offloading / Activation Offloading

Techniques for saving memory:

  • Activation checkpointing: re-compute activations instead of storing them

  • CPU offload: move parameters to CPU

  • NVMe offload: move them to SSD

Used by ZeRO-Infinity & FSDP.

🧠 9. KV Cache & KV Cache Parallelism (for LLM Inference)

LLMs store Key/Value vectors from previous tokens so they don’t recompute history.

Problem: KV cache uses huge memory.

Solution:
Shard or compress KV cache across GPUs → faster inference.

Used by:

  • vLLM

  • TensorRT-LLM

  • LMDeploy

🧠 10. vLLM Paged Attention

Key innovation for fast inference.

Instead of storing KV cache in one big contiguous block, vLLM uses a virtual memory paging system.

Why is this important?

Reduces fragmentation
→ allows higher throughput
→ allows long context windows (think 200k tokens)

🧠 11. Token Streaming / Speculative Decoding

Used to speed up inference.

Technique:

  1. A small model generates tokens fast (draft model)

  2. A large model verifies/accepts them

Serving stacks for models such as Llama 3.1 and GPT-4o reportedly use this to achieve roughly 2× decoding speedups.

🧠 12. LLM Scheduling — Continuous Batching

In serving systems like vLLM:

  • multiple users send prompts

  • they are merged ("batched") dynamically

  • GPU utilization stays high

This is a key reason vLLM achieves far higher serving throughput than a naive PyTorch/Transformers serving loop.

🧠 13. Sharded Optimizers

An optimizer where its internal states are distributed across GPUs.

Used by:

  • ZeRO

  • FSDP

  • Megatron-LM

🧠 14. NCCL & High-Speed Networking

NCCL = NVIDIA library for GPU-to-GPU communication.

Used for:

  • all-reduce

  • broadcast

  • sharding communication

High-speed interconnects:

  • InfiniBand

  • NVLink

  • NVSwitch

These enable multi-node LLM training.

🧠 15. FlashAttention

A memory-efficient attention algorithm.

Why does it matter?

Standard attention has O(n²) memory → impractical for long sequences.

FlashAttention:

  • computes attention in blocks

  • keeps everything in GPU SRAM

  • avoids unnecessary reads/writes

Result:

  • up to 3× speed improvements

  • 10× less memory

  • enables long contexts (128k+)
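
PyTorch 2.x exposes fused attention kernels (including FlashAttention-style ones, when supported by the hardware) through `torch.nn.functional.scaled_dot_product_attention`. A small sketch, with arbitrary assumed shapes:

```python
# Fused attention via PyTorch's scaled_dot_product_attention (PyTorch >= 2.0).
# On supported GPUs this dispatches to FlashAttention-style kernels.
import torch
import torch.nn.functional as F

batch, heads, seq_len, head_dim = 2, 8, 1024, 64
q = torch.randn(batch, heads, seq_len, head_dim)
k = torch.randn(batch, heads, seq_len, head_dim)
v = torch.randn(batch, heads, seq_len, head_dim)

# Computes softmax(q k^T / sqrt(d)) v without materializing the full
# seq_len x seq_len attention matrix when a fused kernel is available.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)   # torch.Size([2, 8, 1024, 64])
```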

🧠 16. Quantization (4-bit, 8-bit)

Reducing weights from 32-bit float → 4/8-bit integers.

Effect:

  • Model size shrinks

  • Inference becomes faster

  • Accuracy drops slightly

LLM.int8(), GPTQ, AWQ are popular methods.

Used for:

  • on-device LLM

  • cheaper inference

🧠 17. LoRA / QLoRA

Fine-tuning methods that replace full weight training.

LoRA

Adds small rank-decomposition matrices → trains only 0.1–2% of parameters.

QLoRA

Quantizes the base model to 4-bit + trains LoRA adapters.

This is why you can fine-tune LLaMA 7B using:

  • a single GPU

  • 24 GB VRAM
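
A minimal sketch of attaching LoRA adapters with the Hugging Face peft library. The small base model here stands in for a LLaMA-style 7B model, and the rank, alpha, and target modules are illustrative assumptions to adapt to your setup.

```python
# Attach LoRA adapters to a causal LM with peft (assumes transformers + peft installed).
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")  # small stand-in for a 7B model

lora_cfg = LoraConfig(
    r=16,                                  # rank of the low-rank update matrices
    lora_alpha=32,                         # scaling factor for the LoRA update
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections commonly adapted
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # typically well under 1-2% of all parameters
# From here, train with the usual Trainer / training loop; only the adapters update.
```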

🧠 18. Checkpointing vs Sharding

Checkpointing:

  • periodically saving the model to disk

Sharding:

  • splitting model across GPUs

Both used to prevent:

  • OOM

  • data loss

  • training crashes

🧠 19. Replica vs Shard

| Term | Meaning |
| --- | --- |
| Replica | A full copy of the model |
| Shard | A part (slice) of the model |

Data Parallel → replicas
ZeRO/FSDP → shards

🧠 20. Gradient Accumulation

Useful when batch size is too big to fit into GPU memory.

Technique:

  • split batch into microbatches

  • accumulate gradients

  • run optimizer step after N microbatches
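
A minimal PyTorch sketch of gradient accumulation over N microbatches; the toy model, data, and accumulation count are assumptions.

```python
# Gradient accumulation: average gradients over several microbatches before
# taking one optimizer step, emulating a larger effective batch size.
import torch

model = torch.nn.Linear(16, 1)                 # toy model standing in for an LLM
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.MSELoss()

ACCUM_STEPS = 4                                # effective batch = 4 microbatches
microbatches = [(torch.randn(8, 16), torch.randn(8, 1)) for _ in range(8)]

optimizer.zero_grad()
for step, (x, y) in enumerate(microbatches, start=1):
    loss = loss_fn(model(x), y) / ACCUM_STEPS  # scale so the accumulated grads average correctly
    loss.backward()                            # gradients accumulate in .grad
    if step % ACCUM_STEPS == 0:
        optimizer.step()                       # one update per ACCUM_STEPS microbatches
        optimizer.zero_grad()
print("done:", loss.item())
```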

That covers the most important AI Agent system-design and architecture questions. In the next article, I will cover AI Agent interview questions from an MLOps and production-systems point of view.
