How Glance turns hours of video into mobile-ready clips with AI

Overview

Google Cloud describes how Glance, a mobile-first content platform, uses AI to turn 1-2 hour horizontal videos into short vertical clips for mobile lock screens. With daily volume projected to grow from 3,500 to over 10,000 videos per day, manual editing was not viable, so Glance built an automated pipeline. The solution combines Google Cloud Speech-to-Text v2, Gemini and the Google Vision API with open-source tools to identify key moments, detect active speakers and reframe video.

Key Takeaways

Glance converts 1-2 hour horizontal videos into 30 to 180-second vertical clips for mobile lock screens.
Daily volume is projected to grow from 3,500 to over 10,000 videos per day, making manual editing impractical.
The pipeline uses Google Cloud Speech-to-Text v2, Gemini and the Google Vision API.
It also uses open-source tools including Samurai for object tracking, OpenCV and MoviePy.
The architecture is split into three modules. This summary covers the first two, video clipping and intelligent reframing; the third is not detailed here.

Stats & Key Facts

Source videos are 1-2 hours long
Output clips are 30 to 180 seconds
Daily volume projected to grow from 3,500 to over 10,000 videos per day
Input format is 16:9, output is 9:16
Uses Gemini 2.5 Flash for segment identification

The problem Glance set out to solve

Long horizontal video does not fit vertical mobile feeds.

›Most video lives in long-form, horizontal formats while audiences scroll vertical feeds.
›Glance processes 1-2 hour videos from podcasts, news reports, movies and web series.
›It turns them into 30 to 180-second vertical clips optimized for mobile lock screens.

With daily volume projected to grow from 3,500 to over 10,000 videos per day, manual editing was not a realistic path. The solution also needed to go beyond simple cropping: it had to center the primary speaker, or split the screen and stack the speakers during conversations.

What the pipeline must handle

The goal is landscape-to-portrait conversion at scale.

›Key moment identification finds the most engaging 60-second segments within hours of footage.
›Active speaker detection identifies who is talking and positions them at the top of a split screen.
›Split screen detection recognizes interview layouts and stacks frames vertically.
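The stacking behavior in the list above comes down to frame geometry. The sketch below is an illustration, not Glance's implementation: the names `Rect`, `crop_around` and `stacked_layout`, the 1920x1080 input, the 1080x1920 output and the speaker center positions are all assumptions made for the example. It gives each of two speakers one half of the portrait canvas and places the active speaker on top.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Rect:
    x: int
    y: int
    w: int
    h: int


def crop_around(center_x: float, frame_w: int, frame_h: int, aspect: float) -> Rect:
    """Full-height crop with the given width/height aspect, centered on
    center_x and clamped so it never leaves the frame."""
    w = min(frame_w, round(frame_h * aspect))
    x = round(center_x) - w // 2
    x = max(0, min(x, frame_w - w))
    return Rect(x, 0, w, frame_h)


def stacked_layout(
    speaker_centers: Sequence[float],  # hypothetical input: x position of each speaker
    active_index: int,                 # hypothetical input: which speaker is talking
    frame_w: int = 1920,
    frame_h: int = 1080,
    out_w: int = 1080,
    out_h: int = 1920,
) -> List[Tuple[Rect, Rect]]:
    """Return (source_crop, destination_slot) pairs for two speakers,
    with the active speaker in the top slot."""
    if len(speaker_centers) != 2 or active_index not in (0, 1):
        raise ValueError("this sketch handles exactly two speakers")
    slot_h = out_h // 2
    aspect = out_w / slot_h  # each slot is wider than it is tall
    order = [active_index, 1 - active_index]
    layout = []
    for row, idx in enumerate(order):
        src = crop_around(speaker_centers[idx], frame_w, frame_h, aspect)
        dst = Rect(0, row * slot_h, out_w, slot_h)
        layout.append((src, dst))
    return layout
```

Each source crop would then be scaled into its destination slot by whatever video library performs the rendering; that step is left out here.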

The pipeline also performs intelligent reframing, dynamic caption highlighting with word-level timestamps for karaoke-style captions, and automated branding that applies masks, logos and overlays programmatically.
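Karaoke-style highlighting depends on knowing which word is being spoken at a given playback time. The following is a minimal sketch of that lookup, assuming the word-level timestamps are already available as sorted (text, start, end) records; the `Word` type, the function names and the asterisk marker are hypothetical and stand in for whatever styling the real renderer applies.

```python
from bisect import bisect_right
from typing import NamedTuple, Optional, Sequence


class Word(NamedTuple):
    text: str
    start: float  # seconds from the start of the clip
    end: float    # seconds from the start of the clip


def active_word_index(words: Sequence[Word], t: float) -> Optional[int]:
    """Index of the word being spoken at time t, or None during a pause.
    Assumes words are sorted by start time and do not overlap."""
    starts = [w.start for w in words]
    i = bisect_right(starts, t) - 1  # last word that started at or before t
    if i >= 0 and t < words[i].end:
        return i
    return None


def render_caption(words: Sequence[Word], t: float, marker: str = "*") -> str:
    """Plain-text stand-in for a caption frame: the active word is wrapped
    in markers where a real renderer would change its color or weight."""
    active = active_word_index(words, t)
    return " ".join(
        f"{marker}{w.text}{marker}" if i == active else w.text
        for i, w in enumerate(words)
    )
```

For example, with the words "turn hours into clips" and a playback time that falls inside "hours", `render_caption` returns `turn *hours* into clips`.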

The technology stack

Glance combines Google services with open-source tools.

›The solution uses Google Cloud Speech-to-Text v2, Gemini and the Google Vision API.
›It adds custom video manipulation with Samurai, an open-source object tracking tool.
›It also uses OpenCV and MoviePy.

Module 1: Video clipping

This module turns long videos into transcript-aligned clips.

›It extracts audio, transcribes speech to text with precise word-level timestamps, and clips the video.
›It uses Gemini 2.5 Flash to analyze transcripts and identify optimal start and end timestamps.
›It also uses Gemini to verify that phrases and words are transcribed accurately; Gemini does not check word timing.

The output is a set of short video clips, each paired with its time-aligned transcript, ready for the reframing engine.
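Pairing a clip with its time-aligned transcript can be sketched as two small steps: widen the proposed window so it does not cut through a word, then keep the words inside it and re-base their timestamps to the start of the clip. This is an illustrative sketch under assumptions, not the pipeline's actual code. In the pipeline described here Gemini 2.5 Flash proposes the start and end timestamps; below they are plain function arguments, and the names `snap_to_words` and `clip_transcript` are hypothetical.

```python
from typing import List, Sequence, Tuple

Word = Tuple[str, float, float]  # (text, start_seconds, end_seconds)


def snap_to_words(words: Sequence[Word], start: float, end: float) -> Tuple[float, float]:
    """Widen [start, end] so neither edge falls in the middle of a word."""
    for _, w_start, w_end in words:
        if w_start < start < w_end:
            start = w_start  # pull the start back to the word's beginning
        if w_start < end < w_end:
            end = w_end      # push the end out to the word's finish
    return start, end


def clip_transcript(words: Sequence[Word], start: float, end: float) -> List[Word]:
    """Words that lie fully inside the clip, with times relative to the clip start."""
    return [
        (text, round(w_start - start, 3), round(w_end - start, 3))
        for text, w_start, w_end in words
        if w_start >= start and w_end <= end
    ]
```

Snapping before clipping keeps the first and last caption words intact, which matters when the same timestamps later drive caption highlighting.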

Module 2: Intelligent reframing engine

This module converts 16:9 frames into 9:16 vertical frames.

›A simple center crop often cuts out key speakers or action.
›The solution uses a multi-stage scene analysis pipeline instead.
›Active speaker detection runs frame-by-frame using the face detection capabilities of the Google Cloud Vision API.

The reframing engine distinguishes between a static image and a live person so the crop focuses on the actual speaker.
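Once the speaker's face is located, the crop itself is a small calculation. The sketch below assumes a face bounding box in pixel coordinates is already available from an upstream detector; it does not call the Vision API, and the `portrait_crop` name and `Box` type are hypothetical. It is not the engine's actual logic, only a way to show how a 9:16 window can follow a face and fall back to a center crop when no face is found.

```python
from typing import Optional, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def portrait_crop(
    frame_w: int,
    frame_h: int,
    face: Optional[Box] = None,
    aspect: Tuple[int, int] = (9, 16),
) -> Box:
    """Full-height crop with the target aspect ratio, centered horizontally
    on the face when one is given, otherwise on the frame center."""
    crop_w = min(frame_w, frame_h * aspect[0] // aspect[1])
    if face is None:
        center_x = frame_w // 2               # plain center crop as a fallback
    else:
        center_x = (face[0] + face[2]) // 2   # horizontal middle of the face box
    # Clamp so the window stays inside the frame when the face is near an edge.
    left = max(0, min(center_x - crop_w // 2, frame_w - crop_w))
    return (left, 0, left + crop_w, frame_h)
```

On a 1920x1080 frame the window is 607 pixels wide, so a speaker standing near the right edge is kept in view where a center crop would cut them out. The cropped region would then be scaled to the final output resolution.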

Frequently Asked Questions

What does Glance's pipeline do?

It converts 1-2 hour horizontal (16:9) videos into multiple short vertical (9:16) clips of 30 to 180 seconds, optimized for mobile lock screens.

Why did Glance automate this?

Daily volume is projected to grow from 3,500 to over 10,000 videos per day, so manual editing was not a realistic path forward.

What Google technologies does it use?

Google Cloud Speech-to-Text v2, Gemini (including Gemini 2.5 Flash) and the Google Vision API, plus open-source tools like Samurai, OpenCV and MoviePy.

How does it know who is speaking?

Active speaker detection runs frame-by-frame using the face detection capabilities of the Google Cloud Vision API, distinguishing a live person from a static image.

What is the role of word-level timestamps?

They ensure clips start and end exactly where they should and enable karaoke-style caption highlighting that increases engagement on silent mobile screens.

Glance's three-module pipeline shows how Google Cloud and open-source tools can turn hours of horizontal video into mobile-ready vertical clips at scale.
