Extraction Guide
A comprehensive guide to extracting data with Cloudglue
Video contains a wealth of information, but developers often need this information in a structured format that their applications can easily consume. While transcription gives you raw information from a video, entity extraction allows you to get precisely the structured data you need.
With Cloudglue’s Extract API, you can define exactly what information you want to extract and receive it in a format that’s ready for your database or application logic.
Cloudglue allows you to extract entities from both locally uploaded files and YouTube videos.
Understanding Entities
Entities are structured pieces of information that can be extracted from videos. They come in two forms:
- Video-level entities: Information that applies to the entire video
- Segment entities: Information that appears at specific moments in the video
For example, if you’re analyzing product review videos, you might want to extract:
- Video-level: Overall product rating, reviewer name, product category
- Segments: Individual features discussed, pros/cons mentioned at different timestamps
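To make the distinction concrete, the sketch below models the product-review example as plain Python data. The class and field names (`overall_rating`, `start_time`, and so on) and the sample values are illustrative assumptions, not the shape the Cloudglue API returns.

```python
from dataclasses import dataclass, field


@dataclass
class SegmentEntity:
    """Hypothetical segment entity: tied to a time range in the video."""
    start_time: float  # seconds from the start of the video
    end_time: float
    feature: str
    pros: list[str] = field(default_factory=list)
    cons: list[str] = field(default_factory=list)


@dataclass
class VideoEntities:
    """Hypothetical video-level entities: one value per video."""
    overall_rating: float
    reviewer_name: str
    product_category: str
    segments: list[SegmentEntity] = field(default_factory=list)


def segments_mentioning_cons(video: VideoEntities) -> list[SegmentEntity]:
    """Return the segments in which at least one con was mentioned."""
    return [s for s in video.segments if s.cons]


# Made-up sample values, for illustration only.
review = VideoEntities(
    overall_rating=4.0,
    reviewer_name="Example Reviewer",
    product_category="headphones",
    segments=[
        SegmentEntity(0.0, 30.0, "battery life", pros=["lasts all day"]),
        SegmentEntity(30.0, 75.0, "fit", cons=["tight on larger heads"]),
    ],
)
```

Video-level fields hold one value for the whole review, while each segment carries its own time range, which is what makes queries such as "where were drawbacks mentioned?" possible.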
When to Use Entity Extraction
While transcription gives you comprehensive raw data from a video, entity extraction is better when you:
- Need specific structured data rather than raw transcripts
- Want to populate a database with consistent fields
- Need to track information across time segments
- Want type-safe data that’s ready for your application
Defining Extraction Schemas
Schema Structure
An entity schema defines what information you want to extract. You can specify it in two ways:
1. Abbreviated Form (Recommended for simple schemas)
2. Full JSON Schema Specification
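As a rough illustration of how the two forms relate, the sketch below expands a compact, abbreviated-style schema into a standard JSON Schema object. The abbreviated notation used here (type names as strings, `[item]` for arrays, nested dicts for objects) is an assumption made for this example; consult the API reference for the exact syntax Cloudglue accepts.

```python
import json

# Assumed abbreviated notation (not confirmed by this guide):
#   "string" / "number" / "integer" / "boolean" -> a primitive field
#   {...}                                       -> a nested object
#   [item]                                      -> an array of `item`
PRIMITIVES = {"string", "number", "integer", "boolean"}


def expand(abbrev):
    """Expand the assumed abbreviated form into a JSON Schema fragment."""
    if isinstance(abbrev, str):
        if abbrev not in PRIMITIVES:
            raise ValueError(f"unknown type name: {abbrev!r}")
        return {"type": abbrev}
    if isinstance(abbrev, list):
        if len(abbrev) != 1:
            raise ValueError("a list must contain exactly one item template")
        return {"type": "array", "items": expand(abbrev[0])}
    if isinstance(abbrev, dict):
        return {
            "type": "object",
            "properties": {name: expand(value) for name, value in abbrev.items()},
        }
    raise TypeError(f"unsupported schema node: {type(abbrev).__name__}")


# Hypothetical field names based on the product-review example.
abbreviated = {
    "reviewer_name": "string",
    "overall_rating": "number",
    "features": [{"name": "string", "pros": ["string"]}],
}

full_schema = expand(abbreviated)
print(json.dumps(full_schema, indent=2))
```

The abbreviated form is quicker to write and read; the full specification becomes worthwhile once you need the extra keywords JSON Schema offers, such as `required` or `enum`.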
Using the Schema Builder
We provide a graphical tool, the Schema Builder, to help you build and test your schemas.
Example: Political Speech Analysis
Let’s look at a real-world example: extracting structured information from a political speech about tariffs. It demonstrates how to combine a schema definition with a clear prompt to get both video-level and segment-level information.
While the video below shows the source content, our analysis was performed on a local copy to enable full multimodal understanding.
The Prompt
The Schema
The Output
The extraction produces both video-level entities and time-segmented entities.
This example demonstrates how Cloudglue can:
- Extract structured information from both visual and audio content
- Capture both high-level video entities and time-specific segments
- Handle complex nested schemas with arrays and objects
- Combine multiple types of information (visual, audio, textual) into a coherent structure
Extraction Methods
On-demand Extraction
For quick experiments or single extractions, use the direct extract endpoint.
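A single extraction boils down to sending a video reference, a prompt, and a schema in one request. The sketch below only assembles such a request body and does not call Cloudglue. Every key name in it (`url`, `prompt`, `schema`, and the two `enable_*` flags), the placeholder URL, and the validation rules are assumptions made for illustration; check the Extract API reference for the real field names.

```python
import json


def build_extract_request(video_url, prompt, schema,
                          video_level=True, segment_level=True):
    """Assemble a request body for a hypothetical extract call.

    All key names below are assumptions, not confirmed API fields.
    """
    if not prompt.strip():
        raise ValueError("prompt must not be empty")
    if not isinstance(schema, dict) or not schema:
        raise ValueError("schema must be a non-empty mapping")
    if not (video_level or segment_level):
        raise ValueError("request at least one entity level")
    return {
        "url": video_url,
        "prompt": prompt,
        "schema": schema,
        "enable_video_level_entities": video_level,
        "enable_segment_level_entities": segment_level,
    }


body = build_extract_request(
    "https://example.com/speech.mp4",  # placeholder URL
    "Extract the speaker and the topics they discuss.",
    {"speaker": "string", "topics": ["string"]},
)
payload = json.dumps(body)  # JSON string you would send over HTTP
```

Validating the prompt and schema before sending keeps obvious mistakes out of quick experiments, where each failed request costs a round trip.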
Using Collections
For production use cases, create an entities collection to process multiple videos consistently.
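The point of a collection is that one prompt and one schema apply to every video, so results stay comparable. The sketch below illustrates that idea only and is independent of any SDK: `ExtractionConfig`, `extract_all`, and the stand-in `fake_extract` are hypothetical names, and in practice the injected function would wrap your actual Cloudglue call.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class ExtractionConfig:
    """One prompt and schema, fixed for every video in the batch."""
    prompt: str
    schema: dict


def extract_all(video_urls, config: ExtractionConfig,
                extract: Callable[[str, str, dict], dict[str, Any]]):
    """Run `extract` on every video with the same prompt and schema."""
    return {url: extract(url, config.prompt, config.schema) for url in video_urls}


def fake_extract(url, prompt, schema):
    """Stand-in extractor: returns one empty value per schema field."""
    return {name: None for name in schema}


config = ExtractionConfig(
    prompt="Extract the speaker and the topics they discuss.",
    schema={"speaker": "string", "topics": ["string"]},
)
results = extract_all(["video-a.mp4", "video-b.mp4"], config, fake_extract)
```

Because the configuration is frozen, the prompt and schema cannot be reassigned halfway through a batch, which is the consistency guarantee a collection is meant to provide.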
Best Practices
- Focus on Observable Attributes: Design your schema around information that can be:
  - Visually seen in the video
  - Read from on-screen text
  - Heard in speech or narration
  - Understood from actions and events
  Avoid requesting:
  - Technical metadata (e.g., bitrate, duration)
  - Highly subjective interpretations
  - Information that requires external context
- Be Specific: Only extract the information you actually need. More fields aren’t always better.
- Use Prompts Effectively: Combine schemas with prompts to guide the extraction.
- Test Diverse Content: Use the Extract Playground to test your schema against different types of videos.
- Start Simple: Begin with a minimal schema and expand based on needs.
- Consider Segments: Decide if you need information at the video level, segment level, or both.
- Structure Based on Occurrence: Choose the right structure based on how entities appear in scenes:
  - Use lists ([]) for entities that can appear multiple times in a scene
  - Use single objects ({}) for attributes that typically appear once per scene
- Use Collections: For production systems, use collections to ensure consistent extraction across multiple videos.
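The choice between lists and single objects can be sketched with a small segment schema. The notation (`[item]` for a list, a nested dict for a single object) and all field names here are assumptions made for illustration, not Cloudglue's documented syntax.

```python
# Assumed abbreviated notation: [item] is a list, {...} is a single object.
scene_schema = {
    # Several speakers can appear in one scene, so use a list.
    "speakers": [{"name": "string", "role": "string"}],
    # A scene typically has one setting, so use a single object.
    "setting": {"location": "string", "time_of_day": "string"},
}


def split_by_occurrence(schema):
    """Split top-level field names into (repeatable, single)."""
    repeatable = [name for name, value in schema.items() if isinstance(value, list)]
    single = [name for name, value in schema.items() if not isinstance(value, list)]
    return repeatable, single


repeatable, single = split_by_occurrence(scene_schema)
print("lists:", repeatable)
print("single:", single)
```

A quick check like this during schema review helps catch a field that was modeled as a single object even though it can occur several times in a scene.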
Try it out
To try entity extraction, check out our Extract Endpoint or get started on our platform.
YouTube
At the moment, extracting entities directly from a YouTube video supports speech content only. For full multimodal entity extraction, download the video and upload it to Cloudglue.
When working with YouTube videos, consider adjusting your schema and prompt to focus on audio-centric information. Visual elements like camera_shots, backdrop, or on-screen symbols won’t be available through direct YouTube processing. For applications requiring comprehensive visual analysis, we recommend downloading the video first and using our Files API for complete multimodal understanding.
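One way to adapt a schema for direct YouTube processing is to strip its visual fields before sending it. The sketch below does this for a schema written as nested Python dicts and lists. The names `camera_shots`, `backdrop`, and `symbols` follow the visual elements mentioned above; the remaining field names and the schema notation are assumptions for illustration.

```python
# Visual fields that speech-only processing cannot fill in.
VISUAL_FIELDS = {"camera_shots", "backdrop", "symbols"}


def audio_only(schema, visual_fields=VISUAL_FIELDS):
    """Return a copy of a nested schema without the visual fields."""
    if isinstance(schema, dict):
        return {
            name: audio_only(value, visual_fields)
            for name, value in schema.items()
            if name not in visual_fields
        }
    if isinstance(schema, list):
        return [audio_only(item, visual_fields) for item in schema]
    return schema  # primitive type name such as "string"


speech_schema = {
    "speaker": "string",
    "backdrop": "string",
    "segments": [
        {"topic": "string", "camera_shots": ["string"], "symbols": ["string"]}
    ],
}

youtube_schema = audio_only(speech_schema)
print(youtube_schema)
```

The function builds a new schema rather than editing the original, so the full multimodal version stays available for videos uploaded through the Files API.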