Video contains a wealth of information, but developers often need this information in a structured format that their applications can easily consume. While transcription gives you raw information from a video, entity extraction allows you to get precisely the structured data you need.

With Cloudglue’s Extract API, you can define exactly what information you want to extract and receive it in a format that’s ready for your database or application logic.

Cloudglue allows you to extract entities from both locally uploaded files and YouTube videos.

Understanding Entities

Entities are structured pieces of information that can be extracted from videos. They come in two forms:

  1. Video-level entities: Information that applies to the entire video
  2. Segment entities: Information that appears at specific moments in the video

For example, if you’re analyzing product review videos, you might want to extract:

  • Video-level: Overall product rating, reviewer name, product category
  • Segments: Individual features discussed, pros/cons mentioned at different timestamps
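To make the distinction concrete, a product-review extraction might return something shaped like the following (field names such as `entities`, `segment_entities`, `start_time`, and `end_time` are illustrative here, not the exact API response format):

```json
{
  "entities": {
    "reviewer_name": "Jane Doe",
    "overall_rating": "4/5",
    "product_category": "headphones"
  },
  "segment_entities": [
    {
      "start_time": 32,
      "end_time": 58,
      "entities": {
        "feature": "noise cancellation",
        "sentiment": "pro"
      }
    }
  ]
}
```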

When to Use Entity Extraction

While transcription gives you comprehensive raw data from a video, entity extraction is better when you:

  1. Need specific structured data rather than raw transcripts
  2. Want to populate a database with consistent fields
  3. Need to track information across time segments
  4. Want type-safe data that’s ready for your application

Defining Extraction Schemas

Schema Structure

An entity schema defines what information you want to extract. You can specify it in two ways:

1. Simple Schema Format

{
  "people": [
    {
      "name": "string",
      "description": "string",
      "gender": "string",
      "age_group": "string"
    }
  ],
  "vehicles": [
    {
      "make": "string",
      "model": "string",
      "color": "string"
    }
  ]
}

2. Full JSON Schema Specification

{
  "type": "object",
  "properties": {
    "people": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "name": { "type": "string" },
          "description": { "type": "string" },
          "gender": { "type": "string" },
          "age_group": { "type": "string" }
        }
      }
    },
    "vehicles": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "make": { "type": "string" },
          "model": { "type": "string" },
          "color": { "type": "string" }
        }
      }
    }
  }
}
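The two forms are interchangeable: the simple key-value form is shorthand that expands into the full JSON Schema form. A minimal sketch of that expansion (this helper is our own illustration, not part of the Cloudglue SDK):

```javascript
// Expand the simple key-value schema form into an equivalent full
// JSON Schema. Handles type-name leaves (e.g. "string"), arrays, and
// nested objects.
function toJsonSchema(simple) {
  if (typeof simple === "string") {
    // A leaf holds the type name directly, e.g. "string"
    return { type: simple };
  }
  if (Array.isArray(simple)) {
    // A single-element array describes the shape of each item
    return { type: "array", items: toJsonSchema(simple[0]) };
  }
  // Otherwise it's an object: expand each property recursively
  const properties = {};
  for (const [key, value] of Object.entries(simple)) {
    properties[key] = toJsonSchema(value);
  }
  return { type: "object", properties };
}

const simple = {
  people: [{ name: "string", description: "string" }],
};

console.log(JSON.stringify(toJsonSchema(simple), null, 2));
```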

Using the Schema Builder

We provide a graphical tool to help you build and test your schemas.

Example: Political Speech Analysis

For this example, we’ll analyze a political speech about tariffs. While the video below shows the source content, the analysis was performed on a locally uploaded copy to enable full multimodal understanding. This real-world example demonstrates how to combine a schema definition with a clear prompt to get both video-level and segment-level information.

The Prompt

Extract the following structured information from C-SPAN videos:

1. SPEAKER: Identify main speakers by name and title
2. DISCOURSE: Determine the main topic, extract notable key phrases, identify rhetorical techniques with examples, and document stated policy positions.
3. REFERENCES: Record any executive orders, legislation (with names/numbers), or agreements mentioned.
4. VISUAL: Note on-screen text (chyrons), backdrop elements, types of camera shots used, and significant visual symbols.

The Schema

{
  "speaker": {
    "name": "string",
    "title": "string"
  },
  "discourse": {
    "topic": "string",
    "key_phrases": ["string"],
    "rhetorical_devices": [
      {
        "type": "string",
        "example": "string"
      }
    ],
    "policy_positions": ["string"]
  },
  "references": {
    "executive_orders": ["string"],
    "legislation": [
      {
        "name": "string",
        "number": "string"
      }
    ],
    "agreements": ["string"]
  },
  "visual": {
    "chyrons": ["string"],
    "backdrop": ["string"],
    "camera_shots": ["string"],
    "symbols": ["string"]
  }
}

The Output

The extraction produces both video-level entities and time-segmented entities.

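A video-level slice of the result might look like the following (placeholder values, not the actual output for this video):

```json
{
  "speaker": { "name": "<speaker name>", "title": "<speaker title>" },
  "discourse": {
    "topic": "tariff policy",
    "key_phrases": ["<quoted phrase>"],
    "rhetorical_devices": [{ "type": "repetition", "example": "<quoted example>" }],
    "policy_positions": ["<stated position>"]
  },
  "references": {
    "executive_orders": ["<order title>"],
    "legislation": [{ "name": "<bill name>", "number": "<bill number>" }],
    "agreements": ["<agreement name>"]
  },
  "visual": {
    "chyrons": ["<on-screen text>"],
    "backdrop": ["<backdrop element>"],
    "camera_shots": ["<shot type>"],
    "symbols": ["<visual symbol>"]
  }
}
```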
This example demonstrates how Cloudglue can:

  • Extract structured information from both visual and audio content
  • Capture both high-level video entities and time-specific segments
  • Handle complex nested schemas with arrays and objects
  • Combine multiple types of information (visual, audio, textual) into a coherent structure

Extraction Methods

On-demand Extraction

For quick experiments or single extractions, use the direct extract endpoint:

// Define your schema
const schema = {
  products: [
    {
      name: "string",
      price: "string",
      rating: "string"
    }
  ]
};

// Create an extract job
const extractJob = await client.extract.createExtract(fileUri, {
  schema: schema,
  // Optionally include a prompt to guide the extraction
  prompt: "Extract product details including exact prices and ratings"
});

// Get the results
const result = await client.extract.getExtract(extractJob.job_id);
console.log(result.data);
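Extract jobs run asynchronously, so the result may not be ready immediately after the job is created. One way to wait is a small polling helper (the helper below is our own sketch; the `status` field name and `"completed"` value are assumptions to verify against the API reference):

```javascript
// Generic polling helper: call `fn` repeatedly until `isDone(result)`
// is true, waiting `intervalMs` between attempts, up to `maxAttempts`.
async function pollUntil(fn, isDone, intervalMs = 2000, maxAttempts = 30) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const result = await fn();
    if (isDone(result)) return result;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error("Extraction did not finish in time");
}

// Usage sketch (assumes the job object exposes a `status` field):
// const result = await pollUntil(
//   () => client.extract.getExtract(extractJob.job_id),
//   (job) => job.status === "completed",
// );
// console.log(result.data);
```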

Using Collections

For production use cases, create an entities collection to process multiple videos consistently:

// Create a collection for product reviews
const collection = await client.collections.createCollection({
  name: "Product Reviews",
  collection_type: "entities",
  description: "Product review videos with structured data extraction",
  extract_config: {
    schema: {
      products: [
        {
          name: "string",
          price: "string",
          rating: "string"
        }
      ]
    }
  }
});

// Add a video to the collection
const fileInfo = await client.collections.addVideo(collection.id, fileId);

// Get the extracted entities
const entities = await client.collections.getEntities(collection.id, fileId);
console.log(entities);

Best Practices

  1. Focus on Observable Attributes: Design your schema around information that can be:

    • Visually seen in the video
    • Read from on-screen text
    • Heard in speech or narration
    • Understood from actions and events

    Avoid requesting:

    • Technical metadata (e.g., bitrate, duration)
    • Highly subjective interpretations
    • Information that requires external context
  2. Be Specific: Only extract the information you actually need. More fields aren’t always better.

  3. Use Prompts Effectively: Combine schemas with prompts to guide the extraction:

    {
      schema: productSchema,
      prompt: "Focus on extracting exact prices in USD and ratings out of 5 stars that are explicitly shown or mentioned"
    }
    
  4. Test Diverse Content: Use the Extract Playground to test your schema against different types of videos.

  5. Start Simple: Begin with a minimal schema and expand based on needs.

  6. Consider Segments: Decide if you need information at the video level, segment level, or both.

  7. Structure Based on Occurrence: Choose the right structure based on how entities appear in scenes:

    • Use lists ([]) for entities that can appear multiple times in a scene:
      {
        "vehicles": [
          {
            "type": "string",
            "color": "string"
          }
        ],
        "people": [
          {
            "clothing": "string",
            "action": "string"
          }
        ]
      }
      
    • Use single objects ({}) for attributes that typically appear once per scene:
      {
        "scene_lighting": {
          "brightness": "string",
          "type": "string"
        },
        "weather": {
          "condition": "string",
          "time_of_day": "string"
        }
      }
      
  8. Use Collections: For production systems, use collections to ensure consistent extraction across multiple videos.

Try it out

Check out our Extract Endpoint to get started with entity extraction, or try it directly on our platform.

YouTube

At the moment, if you want to extract entities from a YouTube video directly, we only support extracting from speech content. For full multimodal entity extraction, download the video and upload it to Cloudglue.

// Extract entities from YouTube video speech
const result = await client.extract.createExtract(
  'https://www.youtube.com/watch?v=VIDEO_ID',
  {
    schema: mySchema,
    prompt: "Extract key information from the speaker's content",
  },
);

When working with YouTube videos, consider adjusting your schema and prompt to focus on audio-centric information. Visual elements like camera_shots, backdrop, or on-screen symbols won’t be available through direct YouTube processing. For applications requiring comprehensive visual analysis, we recommend downloading the video first and using our Files API for complete multimodal understanding.