What is On-Demand Extract?
On-Demand Extract is Cloudglue’s powerful capability that transforms individual videos into structured, programmable data. Unlike Entity Collections which process multiple videos with consistent schemas, On-Demand Extract allows you to customize extraction parameters for each video, making it perfect for one-off analyses, exploratory data extraction, or videos with unique content structures. The Extract API uses a combination of natural language prompts and structured schemas to identify and extract the exact entities you need from your video content.
With On-Demand Extract, you can define exactly what information you want to extract on a per-video basis. This flexibility is ideal when you’re refining your extraction approach, working with diverse video content, or when you need to quickly extract specific information without setting up a collection. For example, you might use On-Demand Extract to analyze a product demonstration video before deciding on a schema for your entire product catalog, or to extract unique information from a specific marketing video that doesn’t fit your standard extraction patterns.
On-Demand Extract works directly with your files (both video and audio), and YouTube videos (though YouTube extraction only uses speech and metadata as input signals).
You can choose between two extraction modes:
- Video-level entities: Information that applies to the entire video as a whole
- Segment-level entities: Information tied to specific timestamps within the video
These modes are mutually exclusive - you must choose one or the other per extraction job. Segment-level extraction is enabled by default.
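The mode rule above can be sketched as a small request builder. This is only an illustration, not the Extract API's documented request format: the field names `url`, `prompt`, `schema`, `enable_video_level_entities`, and `enable_segment_level_entities`, the helper `build_extract_request`, and the example URL are all assumptions made for this sketch. It shows one way to guarantee that exactly one mode is enabled, with segment-level as the default.

```python
import json
from typing import Any


def build_extract_request(
    url: str,
    prompt: str,
    schema: dict[str, Any],
    video_level: bool = False,
) -> dict[str, Any]:
    """Build a hypothetical extract request body.

    Exactly one mode is enabled: segment-level by default,
    video-level when video_level=True.
    """
    return {
        "url": url,
        "prompt": prompt,
        "schema": schema,
        # Assumed flag names. A single boolean drives both, so the two
        # modes can never be enabled (or disabled) at the same time.
        "enable_video_level_entities": video_level,
        "enable_segment_level_entities": not video_level,
    }


# Placeholder URL and schema, for illustration only.
request = build_extract_request(
    url="https://example.com/product-demo.mp4",
    prompt="List each product feature shown in the demo.",
    schema={"features": [{"name": "string", "description": "string"}]},
)
print(json.dumps(request, indent=2))
```

Deriving both flags from one argument keeps the mutual exclusivity out of the caller's hands, which is useful when the same helper is reused across many one-off extractions.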
Transcript-Based Extraction
You can enable transcript-based extraction using the enable_transcript_mode flag. When enabled, entities are extracted from the transcript only, similar to how YouTube videos are processed. This is useful for:
- Speech-heavy content like podcasts and interviews
- Faster and more cost-effective extraction when visual analysis isn’t needed
- Audio files where only speech content matters
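As a rough sketch of how a client might decide when to set the flag, the helper below turns on enable_transcript_mode for audio files or when the caller marks the content as speech-only. Only the flag name comes from the text above; the helper `with_transcript_mode`, the `url` field, the extension list, and the idea of auto-enabling the flag for audio files are assumptions of this example rather than documented behavior.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Hypothetical list of extensions treated as audio-only input.
AUDIO_EXTENSIONS = {".mp3", ".wav", ".m4a", ".flac", ".ogg"}


def with_transcript_mode(request: dict, speech_only: bool = False) -> dict:
    """Return a copy of request with enable_transcript_mode set.

    The flag is turned on when the caller says the content is speech-only
    (podcasts, interviews) or when the URL path ends in an audio extension.
    """
    suffix = PurePosixPath(urlparse(request["url"]).path).suffix.lower()
    is_audio = suffix in AUDIO_EXTENSIONS
    return {**request, "enable_transcript_mode": speech_only or is_audio}


podcast = with_transcript_mode({"url": "https://example.com/episode-12.mp3"})
interview = with_transcript_mode(
    {"url": "https://example.com/interview.mp4"}, speech_only=True
)
demo = with_transcript_mode({"url": "https://example.com/product-demo.mp4"})

print(podcast["enable_transcript_mode"])    # True: audio file
print(interview["enable_transcript_mode"])  # True: caller marked it speech-only
print(demo["enable_transcript_mode"])       # False: visual analysis stays on
```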