☁️Google Cloud AI
May 21, 2026
Business

How Glance turns hours of video into mobile-ready clips with AI

Overview

Every day, thousands of hours of new video content sit waiting to be discovered. Most of it lives in long-form, horizontal formats, while audiences are scrolling through vertical feeds on their phones. Glance, a mobile-first content platform, knows this challenge well.

Key Takeaways

  • Glance processes 1-2 hour videos from sources like podcasts, news reports, movies, and web series, and transforms them into 30 to 180-second vertical clips optimized for mobile lock screens.

  • With daily volume projected to grow from 3,500 to over 10,000 videos per day, manual editing wasn't a realistic path forward.

  • The goal was to create a complete pipeline that takes a long-form landscape video (16:9) and outputs multiple ready-to-publish short-form portrait videos (9:16).

  • The pipeline is divided into three distinct modules. Module 1, video clipping, converts long videos to transcripts, identifies key segments, and clips the video.

  • The key functions of Module 1 include audio extraction, speech-to-text transcription with precise timestamps for each word, and segment identification using Gemini 2.

  • Active speaker detection happens on a frame-by-frame basis using the face detection capabilities of the Google Cloud Vision API.

The company processes 1-2 hour videos from sources like podcasts, news reports, movies, and web series, and transforms them into 30 to 180-second vertical clips optimized for mobile lock screens. With daily volume projected to grow from 3,500 to over 10,000 videos per day, manual editing wasn't a realistic path forward. The solution also needed to go beyond simple cropping.

It required the intelligence to identify and center the primary speaker, or dynamically split the screen to stack speakers vertically during conversations, preserving the context that makes content worth watching. Here's how Glance's video generation solution works.

Building for the lock screen era

The goal was to create a complete pipeline that takes a long-form landscape video (16:9) and outputs multiple ready-to-publish short-form portrait videos (9:16).

The solution needed to handle:

  • Key Moment Identification: Finding the most engaging 60-second segments within hours of long-form footage.

  • Active Speaker Detection: Identifying who's talking in each frame and positioning them at the top of a split screen. This includes distinguishing between a static image and a live person to ensure the crop focuses on the actual speaker.

Architecture overview

The pipeline is divided into three distinct modules.

Figure 2: High-level architecture

Module 1: Video clipping

This module converts long videos to transcripts, identifies key segments, and clips the video. Accuracy matters here: precise word-level timestamps ensure clips start and end exactly where they should.

Figure 3: Video clipping workflow

The process involves audio extraction, speech-to-text transcription, and timestamp identification using generative AI.
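To illustrate why word-level timestamps matter for clipping, here is a minimal sketch of widening a rough segment proposal so that it never cuts through a word. The `Word` type, the `snap_to_words` helper, and the sample transcript are hypothetical and are not taken from Glance's pipeline. The sketch assumes the transcription step yields a start and end time for every word, sorted by time.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """One transcribed word with its timing (hypothetical structure)."""
    text: str
    start: float  # seconds from the beginning of the video
    end: float


def snap_to_words(words: list[Word], seg_start: float, seg_end: float) -> tuple[float, float]:
    """Widen a rough (seg_start, seg_end) so every overlapping word is kept whole.

    Assumes `words` is sorted by start time.
    """
    overlapping = [w for w in words if w.end > seg_start and w.start < seg_end]
    if not overlapping:
        raise ValueError("segment does not overlap any transcribed word")
    return overlapping[0].start, overlapping[-1].end


# Hypothetical transcript fragment with per-word timestamps.
transcript = [
    Word("Welcome", 12.0, 12.4),
    Word("back", 12.4, 12.7),
    Word("to", 12.7, 12.8),
    Word("the", 12.8, 12.9),
    Word("show", 12.9, 13.5),
]

# A rough proposal that starts and ends in the middle of a word.
clip_start, clip_end = snap_to_words(transcript, 12.2, 13.1)
print(clip_start, clip_end)  # 12.0 13.5
```

The rough proposal of 12.2 to 13.1 seconds would have clipped both "Welcome" and "show"; snapping to word boundaries moves the cut points to 12.0 and 13.5 seconds instead.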

The module performs the following key functions:

  • Audio extraction: Extracting the audio from the original video file.

  • Speech-to-text transcription: Converting audio into text with precise timestamps for each word.

  • Segment identification: Using Gemini 2.

Module 2: Intelligent Reframing Engine

The core technical work here is converting a horizontal 16:9 frame into a compelling 9:16 vertical frame.
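The geometry of that conversion can be sketched in a few lines. The example below computes a full-height 9:16 crop window inside a landscape frame and places it in the center. The function names and the 1920x1080 frame size are assumptions chosen for illustration, not details from the article.

```python
from __future__ import annotations


def portrait_crop_width(frame_height: int, aspect_w: int = 9, aspect_h: int = 16) -> int:
    """Width in pixels of a full-height crop with a portrait aspect ratio (rounded down)."""
    return frame_height * aspect_w // aspect_h


def center_crop(frame_width: int, frame_height: int) -> tuple[int, int, int, int]:
    """Return (left, top, width, height) of a centered 9:16 crop window."""
    crop_w = portrait_crop_width(frame_height)
    left = (frame_width - crop_w) // 2
    return left, 0, crop_w, frame_height


# Assumed input: a 1920x1080 (16:9) frame.
left, top, crop_w, crop_h = center_crop(1920, 1080)
kept_fraction = crop_w / 1920
print(left, top, crop_w, crop_h)  # 656 0 607 1080
```

For a 1920x1080 frame the crop window is 607 pixels wide, so a centered crop keeps a little under a third of the frame width. Anything outside that band is lost.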

A simple center crop often cuts out key speakers or action, so our solution uses a multi-stage scene analysis pipeline.

Figure 4: Intelligent reframing engine

Active speaker detection

To know what to crop, we first need to know who's talking. This happens on a frame-by-frame basis using the face detection capabilities of the Google Cloud Vision API.
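As a rough sketch of how per-frame face boxes could drive the crop, the example below centers a 9:16 window on one face and clamps it to the frame. It assumes a face detection step has already produced bounding boxes for each frame. Picking the largest face is only a stand-in heuristic: the article does not describe how the active speaker is chosen, and this sketch does not distinguish a live person from a static image. `FaceBox`, `crop_left_for_frame`, and `smooth` are hypothetical names.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class FaceBox:
    """Pixel bounding box of one detected face (hypothetical structure)."""
    left: int
    top: int
    right: int
    bottom: int

    @property
    def area(self) -> int:
        return (self.right - self.left) * (self.bottom - self.top)

    @property
    def center_x(self) -> float:
        return (self.left + self.right) / 2


def crop_left_for_frame(faces: list[FaceBox], frame_w: int, frame_h: int) -> int:
    """Left edge of a full-height 9:16 crop centered on the largest face.

    Falls back to a center crop when no face is found, and clamps the
    window so it never extends past the frame edges.
    """
    crop_w = frame_h * 9 // 16
    if faces:
        # Assumption: the largest face stands in for the primary speaker.
        center = max(faces, key=lambda f: f.area).center_x
    else:
        center = frame_w / 2
    left = int(center - crop_w / 2)
    return max(0, min(left, frame_w - crop_w))


def smooth(values: list[float], alpha: float = 0.2) -> list[float]:
    """Exponential moving average to damp frame-to-frame jitter in the crop position."""
    smoothed: list[float] = []
    current = None
    for v in values:
        current = v if current is None else alpha * v + (1 - alpha) * current
        smoothed.append(current)
    return smoothed


# Two hypothetical faces in a 1920x1080 frame; the larger one is near the left edge.
faces = [FaceBox(200, 300, 400, 560), FaceBox(1500, 350, 1600, 480)]
print(crop_left_for_frame(faces, 1920, 1080))  # 0 (window clamped to the left edge)
```

Because detection runs on every frame, the raw crop position can jump between frames. Passing the per-frame left edges through something like `smooth` is one way to keep the virtual camera steady, at the cost of reacting more slowly to real speaker changes.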

For more details, please read the original article at Google Cloud AI.

Why It Matters for Business

Real business deployments are the most reliable signal of where AI is generating measurable ROI. Watching which sectors operationalize AI, what they pay for it, and how it changes their P&L tells you more than any vendor demo. These case studies are what serious buyers and investors triangulate on.

Originally published by Google Cloud AI
