A comprehensive guide to extracting data with Cloudglue
Video contains a wealth of information, but developers often need this information in a structured format that their applications can easily consume. While transcription gives you raw information from a video, entity extraction allows you to get precisely the structured data you need. With Cloudglue’s Extract API, you can define exactly what information you want to extract and receive it in a format that’s ready for your database or application logic.
Cloudglue allows you to extract entities from both locally uploaded files and
YouTube videos.
Entities are structured pieces of information that can be extracted from videos. They come in two extraction modes:
Video-level entities: Information that applies to the entire video as a whole
Segment-level entities: Information that appears at specific moments in the video (default)
These modes are mutually exclusive - each extraction job can only use one mode. Segment-level extraction is enabled by default. Examples below show video-level and segment-level outputs from separate extraction jobs.
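Because a job uses exactly one mode, it can help to make that choice explicit in code. The sketch below is a minimal illustration rather than the SDK's own API: `build_extract_options` is a hypothetical helper, and the two flag names are assumptions that should be checked against the Extract API reference.

```python
def build_extract_options(schema, prompt=None, mode="segment"):
    """Return an options dict for one extraction job (hypothetical helper).

    The flag names below are assumptions for illustration; confirm the
    real parameter names in the Extract API reference before use.
    """
    if mode not in ("video", "segment"):
        raise ValueError("mode must be 'video' or 'segment'")
    options = {
        "schema": schema,
        # Exactly one flag is True, mirroring the one-mode-per-job rule.
        "enable_video_level_entities": mode == "video",
        "enable_segment_level_entities": mode == "segment",
    }
    if prompt:
        options["prompt"] = prompt
    return options


# Segment-level is the default, matching the behavior described above.
segment_options = build_extract_options({"speaker": "string"})
video_options = build_extract_options({"main_topic": "string"}, mode="video")
```

Keeping the mode as a single argument, instead of two independent booleans, makes it impossible to request both modes in one job by accident.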
For example, if you’re analyzing product review videos, you might want to extract details such as product names, prices, and ratings.
For this example, we’ll analyze a political speech about tariffs. While the video below shows the source content, our analysis was performed using a local copy to enable full multimodal understanding:
Let’s look at a real-world example of extracting structured information from a political speech video. This example demonstrates how to combine a schema definition with a clear prompt. Because each extraction job uses a single mode, the video-level and segment-level results come from separate jobs.
Extract the following structured information from C-SPAN videos:
1. SPEAKER: Identify main speakers by name and title
2. DISCOURSE: Determine the main topic, extract notable key phrases, identify rhetorical techniques with examples, and document stated policy positions.
3. REFERENCES: Record any executive orders, legislation (with names/numbers), or agreements mentioned.
4. VISUAL: Note on-screen text (chyrons), backdrop elements, types of camera shots used, and significant visual symbols.
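A schema to pair with this prompt might look like the following sketch. Every field name here is illustrative, chosen to mirror the four categories in the prompt, and the `["string"]` notation for lists of plain values is an assumption. It is not the schema that produced the analysis in this guide.

```python
# Illustrative schema mirroring the prompt's four categories.
# All field names are assumptions made for this example.
speech_schema = {
    "speakers": [{"name": "string", "title": "string"}],
    "discourse": {
        "main_topic": "string",
        "key_phrases": ["string"],
        "rhetorical_techniques": [{"technique": "string", "example": "string"}],
        "policy_positions": ["string"],
    },
    "references": [{"type": "string", "name_or_number": "string"}],
    "visual": {
        "on_screen_text": ["string"],
        "backdrop": ["string"],
        "camera_shots": ["string"],
        "symbols": ["string"],
    },
}
```

Naming the schema fields after the prompt's categories keeps the two in step, so each instruction in the prompt has an obvious place to land in the output.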
For quick experiments or single extractions, use the direct extract endpoint:
JavaScript SDK
Python SDK
// Define your schema
const schema = {
  products: [
    {
      name: "string",
      price: "string",
      rating: "string"
    }
  ]
};

// Create an extract job
const extractJob = await client.extract.createExtract(fileUri, {
  schema: schema,
  // Optionally include a prompt to guide the extraction
  prompt: "Extract product details including exact prices and ratings"
});

// Get the results
const result = await client.extract.getExtract(extractJob.job_id);
console.log(result.data);
# Define your schema
schema = {
    "products": [
        {
            "name": "string",
            "price": "string",
            "rating": "string"
        }
    ]
}

# Create an extract job
extract_job = client.extract.create(
    url=file_uri,
    schema=schema,
    # Optionally include a prompt to guide the extraction
    prompt="Extract product details including exact prices and ratings"
)

# Get the results
result = client.extract.get(job_id=extract_job.job_id)
print(result.data)
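Extraction runs as a job, so the results may not be ready the first time you read them. The generic polling helper sketched below avoids assuming any particular SDK field: `wait_until` is a hypothetical utility, and the `status == "completed"` check in the usage comment is an assumption about the job object rather than documented behavior.

```python
import time


def wait_until(fetch, is_done, interval=2.0, timeout=300.0, sleep=time.sleep):
    """Call fetch() repeatedly until is_done(result) is true or timeout passes.

    fetch and is_done are supplied by the caller, so this helper makes no
    assumption about the shape of the job object.
    """
    waited = 0.0
    while True:
        result = fetch()
        if is_done(result):
            return result
        if waited >= timeout:
            raise TimeoutError("job did not finish within the timeout")
        sleep(interval)
        waited += interval


# Hypothetical usage with the Python SDK calls shown above; the status
# field and its "completed" value are assumptions:
# result = wait_until(
#     lambda: client.extract.get(job_id=extract_job.job_id),
#     lambda job: getattr(job, "status", None) == "completed",
# )
```

Passing `sleep` as a parameter keeps the helper easy to test without real delays.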
Focus on Observable Attributes: Design your schema around information that can be:
Visually seen in the video
Read from on-screen text
Heard in speech or narration
Understood from actions and events
Avoid requesting:
Technical metadata (e.g., bitrate, duration)
Highly subjective interpretations
Information that requires external context
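One way to catch the first category early is a small check that runs over a draft schema before you submit it. This is a minimal sketch: `find_metadata_fields` is a hypothetical helper, and the name list contains only the two examples given above, so extend it to suit your own content.

```python
# Field names that describe file metadata rather than observable content.
# Only the two examples from this guide are listed; extend as needed.
METADATA_FIELDS = {"bitrate", "duration"}


def find_metadata_fields(schema, path=""):
    """Return dotted paths of schema fields whose names look like metadata."""
    hits = []
    if isinstance(schema, dict):
        for key, value in schema.items():
            child = f"{path}.{key}" if path else key
            if key.lower() in METADATA_FIELDS:
                hits.append(child)
            hits.extend(find_metadata_fields(value, child))
    elif isinstance(schema, list):
        # List items share the path of the list that contains them.
        for item in schema:
            hits.extend(find_metadata_fields(item, path))
    return hits


draft = {"products": [{"name": "string", "price": "string"}], "duration": "string"}
print(find_metadata_fields(draft))  # ['duration']
```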
Be Specific: Only extract the information you actually need. More fields aren’t always better.
Use Prompts Effectively: Combine schemas with prompts to guide the extraction:
// JavaScript SDK
const extractJob = await client.extract.createExtract(fileUri, {
  schema: productSchema,
  prompt: "Focus on extracting exact prices in USD and ratings out of 5 stars that are explicitly shown or mentioned"
});
# Python SDK
extract_job = client.extract.create(
    url=file_uri,
    schema=product_schema,
    prompt="Focus on extracting exact prices in USD and ratings out of 5 stars that are explicitly shown or mentioned"
)
Test Diverse Content: Use the Extract Playground to test your schema against different types of videos.
Start Simple: Begin with a minimal schema and expand based on needs.
Choose Your Extraction Mode: Decide if you need video-level entities (holistic information about the entire video) or segment-level entities (time-stamped information). You can only use one mode per extraction job.
Structure Based on Occurrence: Choose the right structure based on how entities appear in scenes:
Use lists ([]) for entities that can appear multiple times in a scene:
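As a rough illustration, a scene in a review video can show several products, so `products` is declared as a list. The field names are assumptions made for this example, and using a plain object for something expected at most once per scene is assumed here as the natural counterpart to a list.

```python
# Illustrative segment-level schema; field names are assumptions.
scene_schema = {
    # A list: several products can appear within one scene.
    "products": [{"name": "string", "price": "string"}],
    # A single object: treated here as occurring at most once per scene.
    "presenter": {"name": "string"},
}


def repeatable_fields(schema):
    """Return the top-level fields declared as lists."""
    return [key for key, value in schema.items() if isinstance(value, list)]


print(repeatable_fields(scene_schema))  # ['products']
```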
At the moment, if you want to extract entities from a YouTube video directly,
we only support extracting from speech content. For full multimodal entity
extraction, download the video and upload it to Cloudglue.
// Extract entities from YouTube video speech
const result = await client.extract.createExtract(
  'https://www.youtube.com/watch?v=VIDEO_ID',
  {
    schema: mySchema,
    prompt: "Extract key information from the speaker's content"
  }
);
When working with YouTube videos, consider adjusting your schema and prompt to focus on audio-centric information. Visual elements like camera_shots, backdrop, or on-screen symbols won’t be available through direct YouTube processing. For applications requiring comprehensive visual analysis, we recommend downloading the video first and using our Files API for complete multimodal understanding.
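One way to apply this advice is to trim visual fields from a schema whenever the source is a YouTube URL. The sketch below uses only the Python standard library. `is_youtube_url` and `audio_only_schema` are hypothetical helpers, and the host list and example field names are assumptions.

```python
from urllib.parse import urlparse

# Hosts treated as YouTube for this sketch; adjust as needed.
YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com", "youtu.be"}


def is_youtube_url(url):
    """Return True when the URL's host is one of the listed YouTube domains."""
    host = urlparse(url).hostname or ""
    return host.lower() in YOUTUBE_HOSTS


def audio_only_schema(schema, visual_fields):
    """Return a copy of the schema without the named top-level visual fields."""
    return {key: value for key, value in schema.items() if key not in visual_fields}


full_schema = {
    "speaker": "string",
    "key_phrases": "string",
    "camera_shots": "string",
    "backdrop": "string",
}
url = "https://www.youtube.com/watch?v=VIDEO_ID"

# Drop visual fields only when the source is a YouTube link.
if is_youtube_url(url):
    schema = audio_only_schema(full_schema, {"camera_shots", "backdrop"})
else:
    schema = full_schema

print(sorted(schema))  # ['key_phrases', 'speaker']
```

Keeping one full schema and deriving the audio-only variant from it means the two stay consistent when fields are added later.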