
Content teams on YouTube are making strategy decisions on week-old data. They publish a video, check the numbers days later, and by the time a pattern becomes clear — a format gaining traction, a topic losing audience interest, a posting window that consistently underperforms — the opportunity to act on it has already closed.
The data isn’t missing. YouTube generates enormous amounts of it. The problem is that most teams access it through a static report that shows what happened last week, not what is happening right now. A content strategy built on delayed data is always reacting. The goal here was to build a system that puts teams ahead of the curve instead.
A real-time YouTube analytics platform that continuously extracts video and channel metrics from the YouTube Data API v3, streams them through a live data pipeline, and surfaces them in an always-current dashboard — so content teams can make decisions based on what is happening now, not what happened last week.
┌─────────────────────┐
│ YouTube Data API v3 │ ← Python extraction layer
│ (video + channel │ view counts, likes, comments,
│ metrics, PII │ engagement rate, duration, tags,
│ masked at source) │ category, caption, thumbnail score,
└──────────┬──────────┘ subscriber count
│
▼
┌─────────────────────┐
│ Apache Kafka │ ← Real-time streaming ingestion
│ Topic: yt_metrics │ Decouples extraction from processing
│ 3 partitions │ No data loss under downstream load
│ Snappy compression │
└──────────┬──────────┘
│
▼
┌─────────────────────┐ ┌──────────────────────────┐
│ Spark Structured │ │ Apache Airflow │
│ Streaming │ │ (Orchestration layer) │
│ │ │ - Schedules pipeline │
│ - Parse JSON events │ │ - Monitors health │
│ - Compute derived │ │ - Data quality checks │
│ fields (engagement│ │ - Email alerts on │
│ rate, duration │ │ failure │
│ mins, publish │ └──────────────────────────┘
│ hour/day/month) │
│ - Aggregate channel │
│ rollups │
└──────────┬──────────┘
│
▼
┌─────────────────────┐
│ BigQuery │ ← Cloud data warehouse
│ │ Partitioned by day
│ video_metrics │ Clustered by channel + category
│ channel_aggregates │ Controlled dashboard_view (PII excluded)
└──────────┬──────────┘
│
▼
┌─────────────────────┐
│ Streamlit Dashboard │ ← Live analytics UI
│ (refreshes every │ Engagement trends, publish patterns,
│ 5 minutes) │ content strategy signals
└─────────────────────┘
| Category | Metrics |
|---|---|
| Engagement | View counts, like counts, comment volume, engagement rate |
| Audience | Subscriber count per channel |
| Content | Video duration, tags, category classification, caption availability |
| Visual | Thumbnail quality score (resolution-based, 1–5) |
| Derived | Engagement rate, duration in minutes, is_long_form flag, publish hour, publish day of week, publish month |
The pipeline starts with a Python extraction layer that queries the YouTube Data API v3 for every video in a channel’s upload history. I engineered the extraction to go beyond the obvious metrics — view and like counts — and capture thumbnail quality scores, caption availability, and category classification, which are the signals that give content strategy analysis real analytical depth.
The API returns nested, inconsistently typed JSON. Getting clean, consistently structured output required explicit schema design and type coercion from the start — any ambiguity here compounds into data quality problems downstream.
All free-text fields (video descriptions, titles) pass through a PII masking layer before leaving the extractor, stripping email addresses and phone numbers before they enter the pipeline.

Extracted events are published to a Kafka topic (youtube_metrics) with three partitions,
keyed on video_id to ensure ordered processing per video. Snappy compression reduces
network and disk overhead on high-volume runs.
The critical design decision here was decoupling extraction from processing. Without Kafka in the middle, a slow Spark job would back up the entire pipeline. With Kafka, the extractor publishes at its own pace and Spark consumes at its own pace — independently, without either blocking the other. No events are lost if downstream processing is temporarily under load.
Spark consumes from Kafka and handles all transformation and enrichment in flight,
before any data reaches BigQuery. This is where engagement rate is calculated, duration is
converted from ISO 8601 to minutes, is_long_form is derived, and publish timestamps are
decomposed into hour, day-of-week, and month fields for time-pattern analysis.
Computing these in Spark rather than retrospectively in SQL means the warehouse always contains analysis-ready data — no post-load transformation jobs required.
Two sinks run in parallel:
Processed events land in BigQuery, designed for analytical query performance from day one:
processing_time — queries that filter by date scan only
the relevant partitions, not the full tablechannel_id and category_id — dashboard queries that filter by
channel or content category run significantly faster at scaledashboard_view — a BigQuery view that deduplicates rows (keeping
only the most recent extract per video_id) and excludes the description field,
which carries PII risk in free textThe dashboard queries this view, never the raw table directly.
The full pipeline runs on a 6-hour schedule managed by Airflow. Each DAG run:
video_id check)Any infrastructure failure — API rate limiting, Kafka lag, Spark job error — triggers an immediate email alert. The pipeline never fails silently.
| Control | Where Applied |
|---|---|
| PII masking (email, phone) | Extractor — before Kafka |
| Encryption in transit | All API and BigQuery connections (TLS) |
| Encryption at rest | BigQuery default (Google-managed keys) |
| PII field exclusion | dashboard_view excludes description column |
| Deduplication | dashboard_view surfaces only latest extract per video |
Compliance was designed into the pipeline architecture, not bolted on afterwards.
Content teams get a live dashboard that updates every five minutes — tracking channel performance, engagement trends, and audience behaviour as they develop, not after the fact.
The dashboard surfaces:
| Layer | Tools |
|---|---|
| Language | Python 3.11 |
| Data Extraction | YouTube Data API v3 |
| Streaming Ingestion | Apache Kafka 3.6 (Confluent) |
| Stream Processing | Apache Spark Structured Streaming 3.5 |
| Cloud Data Warehouse | Google BigQuery |
| Pipeline Orchestration | Apache Airflow 2.9 |
| Dashboard | Streamlit + Plotly |
| Containerisation | Docker, Docker Compose |
| Compliance | PII masking, TLS, BigQuery encryption at rest |
Comment sentiment analysis — comment volume tells you how much conversation a video sparked. Comment sentiment tells you whether that conversation was positive, critical, or polarised — a qualitative signal that view counts alone can’t provide. Adding an NLP layer in the Spark processing stage to classify comment tone in-flight would make the engagement picture significantly more complete.
Predictive performance scoring — the current platform is descriptive: it tells you what is happening. The natural next layer is predictive: using a video’s first-hour metrics as input features to forecast its 7-day performance, giving content teams an early signal before the algorithm has fully decided whether to amplify or suppress it.
Competitor benchmarking — pulling the same metrics from competitor channels through the same pipeline would let content teams contextualise their own performance against the market rather than just their own historical baseline. The pipeline already supports multiple channel IDs; the analysis layer is the gap.
Anomaly detection alerting — Airflow currently alerts on pipeline failures. A separate layer that flags unusual spikes or drops in metric values — a video suddenly gaining 10x normal engagement, or a channel’s subscriber count dropping overnight — would let teams respond to both opportunities and problems in real time.
If you’re working on a data engineering, real-time analytics, or cloud data platform problem — get in touch.