Video contains a wealth of information, but developers often need this information in a structured format that their applications can easily consume. While transcription gives you raw information from a video, entity extraction allows you to get precisely the structured data you need.

With Cloudglue’s Extract API, you can define exactly what information you want to extract and receive it in a format that’s ready for your database or application logic.

Cloudglue allows you to extract entities from both locally uploaded files and YouTube videos.

Understanding Entities

Entities are structured pieces of information that can be extracted from videos. They come in two forms:

  1. Video-level entities: Information that applies to the entire video
  2. Segment entities: Information that appears at specific moments in the video

For example, if you’re analyzing product review videos, you might want to extract:

  • Video-level: Overall product rating, reviewer name, product category
  • Segments: Individual features discussed, pros/cons mentioned at different timestamps
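To make the distinction concrete, a product-review extraction might return something shaped like the following (field names such as `entities`, `segment_entities`, `start_time`, and `end_time` are illustrative here, not the exact API response format):

```json
{
  "entities": {
    "reviewer_name": "Jane Doe",
    "overall_rating": "4/5",
    "product_category": "headphones"
  },
  "segment_entities": [
    {
      "start_time": 32,
      "end_time": 58,
      "entities": {
        "feature": "noise cancellation",
        "sentiment": "pro"
      }
    }
  ]
}
```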

When to Use Entity Extraction

While transcription gives you comprehensive raw data from a video, entity extraction is better when you:

  1. Need specific structured data rather than raw transcripts
  2. Want to populate a database with consistent fields
  3. Need to track information across time segments
  4. Want type-safe data that’s ready for your application

Defining Extraction Schemas

Schema Structure

An entity schema defines what information you want to extract. You can specify it in two ways:

1. Simple Schema Format

{
  "people": [
    {
      "name": "string",
      "description": "string",
      "gender": "string",
      "age_group": "string"
    }
  ],
  "vehicles": [
    {
      "make": "string",
      "model": "string",
      "color": "string"
    }
  ]
}

2. Full JSON Schema Specification

{
  "type": "object",
  "properties": {
    "people": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "name": { "type": "string" },
          "description": { "type": "string" },
          "gender": { "type": "string" },
          "age_group": { "type": "string" }
        }
      }
    },
    "vehicles": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "make": { "type": "string" },
          "model": { "type": "string" },
          "color": { "type": "string" }
        }
      }
    }
  }
}
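The two forms are interchangeable: the simple key-value form is shorthand that expands into the full JSON Schema form. A minimal sketch of that expansion (this helper is our own illustration, not part of the Cloudglue SDK):

```javascript
// Expand the simple key-value schema form into an equivalent full
// JSON Schema. Handles type-name leaves (e.g. "string"), arrays, and
// nested objects.
function toJsonSchema(simple) {
  if (typeof simple === "string") {
    // A leaf holds the type name directly, e.g. "string"
    return { type: simple };
  }
  if (Array.isArray(simple)) {
    // A single-element array describes the shape of each item
    return { type: "array", items: toJsonSchema(simple[0]) };
  }
  // Otherwise it's an object: expand each property recursively
  const properties = {};
  for (const [key, value] of Object.entries(simple)) {
    properties[key] = toJsonSchema(value);
  }
  return { type: "object", properties };
}

const simple = {
  people: [{ name: "string", description: "string" }],
};

console.log(JSON.stringify(toJsonSchema(simple), null, 2));
```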

Using the Schema Builder

We provide a graphical tool to help you build and test your schemas.

Example: Political Speech Analysis

For this example, we’ll analyze a political speech about tariffs. While the video below shows the source content, the analysis was performed on a locally uploaded copy to enable full multimodal understanding. This real-world example demonstrates how to combine a schema definition with a clear prompt to get both video-level and segment-level information.

The Prompt

Extract the following structured information from C-SPAN videos:

1. SPEAKER: Identify main speakers by name and title
2. DISCOURSE: Determine the main topic, extract notable key phrases, identify rhetorical techniques with examples, and document stated policy positions.
3. REFERENCES: Record any executive orders, legislation (with names/numbers), or agreements mentioned.
4. VISUAL: Note on-screen text (chyrons), backdrop elements, types of camera shots used, and significant visual symbols.

The Schema

{
  "speaker": {
    "name": "string",
    "title": "string"
  },
  "discourse": {
    "topic": "string",
    "key_phrases": ["string"],
    "rhetorical_devices": [
      {
        "type": "string",
        "example": "string"
      }
    ],
    "policy_positions": ["string"]
  },
  "references": {
    "executive_orders": ["string"],
    "legislation": [
      {
        "name": "string",
        "number": "string"
      }
    ],
    "agreements": ["string"]
  },
  "visual": {
    "chyrons": ["string"],
    "backdrop": ["string"],
    "camera_shots": ["string"],
    "symbols": ["string"]
  }
}

The Output

The extraction produces both video-level entities and time-segmented entities.

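A video-level slice of the result might look like the following (placeholder values, not the actual output for this video):

```json
{
  "speaker": { "name": "<speaker name>", "title": "<speaker title>" },
  "discourse": {
    "topic": "tariff policy",
    "key_phrases": ["<quoted phrase>"],
    "rhetorical_devices": [{ "type": "repetition", "example": "<quoted example>" }],
    "policy_positions": ["<stated position>"]
  },
  "references": {
    "executive_orders": ["<order title>"],
    "legislation": [{ "name": "<bill name>", "number": "<bill number>" }],
    "agreements": ["<agreement name>"]
  },
  "visual": {
    "chyrons": ["<on-screen text>"],
    "backdrop": ["<backdrop element>"],
    "camera_shots": ["<shot type>"],
    "symbols": ["<visual symbol>"]
  }
}
```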
This example demonstrates how Cloudglue can:

  • Extract structured information from both visual and audio content
  • Capture both high-level video entities and time-specific segments
  • Handle complex nested schemas with arrays and objects
  • Combine multiple types of information (visual, audio, textual) into a coherent structure

Extraction Methods

On-demand Extraction

For quick experiments or single extractions, use the direct extract endpoint:

// Define your schema
const schema = {
  products: [
    {
      name: "string",
      price: "string",
      rating: "string"
    }
  ]
};

// Create an extract job
const extractJob = await client.extract.createExtract(fileUri, {
  schema: schema,
  // Optionally include a prompt to guide the extraction
  prompt: "Extract product details including exact prices and ratings"
});

// Get the results
const result = await client.extract.getExtract(extractJob.job_id);
console.log(result.data);
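Extract jobs run asynchronously, so the result may not be ready immediately after the job is created. One way to wait is a small polling helper (the helper below is our own sketch; the `status` field name and `"completed"` value are assumptions to verify against the API reference):

```javascript
// Generic polling helper: call `fn` repeatedly until `isDone(result)`
// is true, waiting `intervalMs` between attempts, up to `maxAttempts`.
async function pollUntil(fn, isDone, intervalMs = 2000, maxAttempts = 30) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const result = await fn();
    if (isDone(result)) return result;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error("Extraction did not finish in time");
}

// Usage sketch (assumes the job object exposes a `status` field):
// const result = await pollUntil(
//   () => client.extract.getExtract(extractJob.job_id),
//   (job) => job.status === "completed",
// );
// console.log(result.data);
```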

Using Collections

For production use cases, create an entities collection to process multiple videos consistently:

// Create a collection for product reviews
const collection = await client.collections.createCollection({
  name: "Product Reviews",
  collection_type: "entities",
  description: "Product review videos with structured data extraction",
  extract_config: {
    schema: {
      products: [
        {
          name: "string",
          price: "string",
          rating: "string"
        }
      ]
    }
  }
});

// Add a video to the collection
const fileInfo = await client.collections.addVideo(collection.id, fileId);

// Get the extracted entities
const entities = await client.collections.getEntities(collection.id, fileId);
console.log(entities);

Best Practices

  1. Focus on Observable Attributes: Design your schema around information that can be:

    • Visually seen in the video
    • Read from on-screen text
    • Heard in speech or narration
    • Understood from actions and events

    Avoid requesting:

    • Technical metadata (e.g., bitrate, duration)
    • Highly subjective interpretations
    • Information that requires external context
  2. Be Specific: Only extract the information you actually need. More fields aren’t always better.

  3. Use Prompts Effectively: Combine schemas with prompts to guide the extraction:

    {
      schema: productSchema,
      prompt: "Focus on extracting exact prices in USD and ratings out of 5 stars that are explicitly shown or mentioned"
    }
    
  4. Test Diverse Content: Use the Extract Playground to test your schema against different types of videos.

  5. Start Simple: Begin with a minimal schema and expand based on needs.

  6. Consider Segments: Decide if you need information at the video level, segment level, or both.

  7. Structure Based on Occurrence: Choose the right structure based on how entities appear in scenes:

    • Use lists ([]) for entities that can appear multiple times in a scene:
      {
        "vehicles": [
          {
            "type": "string",
            "color": "string"
          }
        ],
        "people": [
          {
            "clothing": "string",
            "action": "string"
          }
        ]
      }
      
    • Use single objects ({}) for attributes that typically appear once per scene:
      {
        "scene_lighting": {
          "brightness": "string",
          "type": "string"
        },
        "weather": {
          "condition": "string",
          "time_of_day": "string"
        }
      }
      
  8. Use Collections: For production systems, use collections to ensure consistent extraction across multiple videos.

Try it out

Check out our Extract Endpoint to get started with entity extraction, or try it directly on our platform.

YouTube

At the moment, if you want to extract entities from a YouTube video directly, we only support extracting from speech content. For full multimodal entity extraction, download the video and upload it to Cloudglue.

// Extract entities from YouTube video speech
const result = await client.extract.createExtract(
  'https://www.youtube.com/watch?v=VIDEO_ID',
  {
    schema: mySchema,
    prompt: "Extract key information from the speaker's content",
  },
);

When working with YouTube videos, consider adjusting your schema and prompt to focus on audio-centric information. Visual elements like camera_shots, backdrop, or on-screen symbols won’t be available through direct YouTube processing. For applications requiring comprehensive visual analysis, we recommend downloading the video first and using our Files API for complete multimodal understanding.