Real-Time-YouTube-Analytics

Real-Time YouTube Analytics Platform

Portfolio LinkedIn

Banner

The Problem

Content teams on YouTube are making strategy decisions on week-old data. They publish a video, check the numbers days later, and by the time a pattern becomes clear — a format gaining traction, a topic losing audience interest, a posting window that consistently underperforms — the opportunity to act on it has already closed.

The data isn’t missing. YouTube generates enormous amounts of it. The problem is that most teams access it through a static report that shows what happened last week, not what is happening right now. A content strategy built on delayed data is always reacting. The goal here was to build a system that puts teams ahead of the curve instead.

What I Built

A real-time YouTube analytics platform that continuously extracts video and channel metrics from the YouTube Data API v3, streams them through a live data pipeline, and surfaces them in an always-current dashboard — so content teams can make decisions based on what is happening now, not what happened last week.

Pipeline Architecture

┌─────────────────────┐
│  YouTube Data API v3 │  ← Python extraction layer
│  (video + channel    │    view counts, likes, comments,
│   metrics, PII       │    engagement rate, duration, tags,
│   masked at source)  │    category, caption, thumbnail score,
└──────────┬──────────┘    subscriber count
           │
           ▼
┌─────────────────────┐
│    Apache Kafka      │  ← Real-time streaming ingestion
│  Topic: yt_metrics   │    Decouples extraction from processing
│  3 partitions        │    No data loss under downstream load
│  Snappy compression  │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐       ┌──────────────────────────┐
│  Spark Structured    │       │     Apache Airflow        │
│  Streaming           │       │  (Orchestration layer)    │
│                      │       │  - Schedules pipeline     │
│  - Parse JSON events │       │  - Monitors health        │
│  - Compute derived   │       │  - Data quality checks    │
│    fields (engagement│       │  - Email alerts on        │
│    rate, duration    │       │    failure                │
│    mins, publish     │       └──────────────────────────┘
│    hour/day/month)   │
│  - Aggregate channel │
│    rollups           │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│     BigQuery         │  ← Cloud data warehouse
│                      │    Partitioned by day
│  video_metrics       │    Clustered by channel + category
│  channel_aggregates  │    Controlled dashboard_view (PII excluded)
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│  Streamlit Dashboard │  ← Live analytics UI
│  (refreshes every    │    Engagement trends, publish patterns,
│   5 minutes)         │    content strategy signals
└─────────────────────┘

Metrics Captured

Category Metrics
Engagement View counts, like counts, comment volume, engagement rate
Audience Subscriber count per channel
Content Video duration, tags, category classification, caption availability
Visual Thumbnail quality score (resolution-based, 1–5)
Derived Engagement rate, duration in minutes, is_long_form flag, publish hour, publish day of week, publish month

How I Built It — Layer by Layer

Data Extraction — YouTube Data API v3

The pipeline starts with a Python extraction layer that queries the YouTube Data API v3 for every video in a channel’s upload history. I engineered the extraction to go beyond the obvious metrics — view and like counts — and capture thumbnail quality scores, caption availability, and category classification, which are the signals that give content strategy analysis real analytical depth.

The API returns nested, inconsistently typed JSON. Getting clean, consistently structured output required explicit schema design and type coercion from the start — any ambiguity here compounds into data quality problems downstream.

All free-text fields (video descriptions, titles) pass through a PII masking layer before leaving the extractor, stripping email addresses and phone numbers before they enter the pipeline.

Real-Time Streaming — Apache Kafka

Kafka Flow

Extracted events are published to a Kafka topic (youtube_metrics) with three partitions, keyed on video_id to ensure ordered processing per video. Snappy compression reduces network and disk overhead on high-volume runs.

The critical design decision here was decoupling extraction from processing. Without Kafka in the middle, a slow Spark job would back up the entire pipeline. With Kafka, the extractor publishes at its own pace and Spark consumes at its own pace — independently, without either blocking the other. No events are lost if downstream processing is temporarily under load.

Stream Processing — Apache Spark Structured Streaming

Spark consumes from Kafka and handles all transformation and enrichment in flight, before any data reaches BigQuery. This is where engagement rate is calculated, duration is converted from ISO 8601 to minutes, is_long_form is derived, and publish timestamps are decomposed into hour, day-of-week, and month fields for time-pattern analysis.

Computing these in Spark rather than retrospectively in SQL means the warehouse always contains analysis-ready data — no post-load transformation jobs required.

Two sinks run in parallel:

Cloud Data Warehouse — BigQuery

Processed events land in BigQuery, designed for analytical query performance from day one:

The dashboard queries this view, never the raw table directly.

Orchestration — Apache Airflow

The full pipeline runs on a 6-hour schedule managed by Airflow. Each DAG run:

  1. Extracts from YouTube API and publishes to Kafka
  2. Runs data quality checks against BigQuery (row count, null video_id check)
  3. Branches — if checks pass, refreshes channel aggregates; if they fail, fires an alert email and halts
  4. Logs a completion summary with events published and channels processed

Any infrastructure failure — API rate limiting, Kafka lag, Spark job error — triggers an immediate email alert. The pipeline never fails silently.

Compliance — PII Masking & Encryption

Control Where Applied
PII masking (email, phone) Extractor — before Kafka
Encryption in transit All API and BigQuery connections (TLS)
Encryption at rest BigQuery default (Google-managed keys)
PII field exclusion dashboard_view excludes description column
Deduplication dashboard_view surfaces only latest extract per video

Compliance was designed into the pipeline architecture, not bolted on afterwards.

The Outcome

Content teams get a live dashboard that updates every five minutes — tracking channel performance, engagement trends, and audience behaviour as they develop, not after the fact.

The dashboard surfaces:

Tech Stack

Layer Tools
Language Python 3.11
Data Extraction YouTube Data API v3
Streaming Ingestion Apache Kafka 3.6 (Confluent)
Stream Processing Apache Spark Structured Streaming 3.5
Cloud Data Warehouse Google BigQuery
Pipeline Orchestration Apache Airflow 2.9
Dashboard Streamlit + Plotly
Containerisation Docker, Docker Compose
Compliance PII masking, TLS, BigQuery encryption at rest

What I’d Do Differently

Comment sentiment analysis — comment volume tells you how much conversation a video sparked. Comment sentiment tells you whether that conversation was positive, critical, or polarised — a qualitative signal that view counts alone can’t provide. Adding an NLP layer in the Spark processing stage to classify comment tone in-flight would make the engagement picture significantly more complete.

Predictive performance scoring — the current platform is descriptive: it tells you what is happening. The natural next layer is predictive: using a video’s first-hour metrics as input features to forecast its 7-day performance, giving content teams an early signal before the algorithm has fully decided whether to amplify or suppress it.

Competitor benchmarking — pulling the same metrics from competitor channels through the same pipeline would let content teams contextualise their own performance against the market rather than just their own historical baseline. The pipeline already supports multiple channel IDs; the analysis layer is the gap.

Anomaly detection alerting — Airflow currently alerts on pipeline failures. A separate layer that flags unusual spikes or drops in metric values — a video suddenly gaining 10x normal engagement, or a channel’s subscriber count dropping overnight — would let teams respond to both opportunities and problems in real time.

Let’s Talk

If you’re working on a data engineering, real-time analytics, or cloud data platform problem — get in touch.