Entities

What are Entities?

Entities are structured pieces of information extracted from video content. Think of them as the meaningful objects, events, or concepts that appear in a video. For example, in a cooking video, entities might include ingredients, cooking techniques, or kitchen equipment. Entities can be tied to specific moments in the video (like when a chef uses a particular technique) or apply to the entire video (like the cuisine type or difficulty level).

While rich transcription provides dense descriptions of everything in a video (speech, text, and visuals), and chat interfaces enable ad-hoc queries, entities serve a distinct purpose: they extract specific, structured information in a format ready for your application. Instead of parsing through paragraphs of description or formulating questions, you get exactly the data points you need in a consistent, programmable format.

Video-Level vs. Segment-Level Entities

Unlike documents or images, which are static, videos unfold over time. This temporal nature creates two distinct types of entities:

Video-Level Entities: Information that applies to the entire video (e.g., the overall topic, the presenter’s name, or the production quality)
Segment-Level Entities: Information that appears at specific timestamps (e.g., when a product is shown, when a specific person speaks, or when a technique is demonstrated)

This distinction is crucial for applications that need to understand both the overall context and the specific moments where information appears. For instance, in a product review video, you might want to know both the overall rating (video-level) and exactly when specific features are discussed (segment-level).
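One way to picture the distinction is as a small data model. The sketch below is illustrative only: the class names (`SegmentEntity`, `VideoEntities`), the field names, and the sample review data are assumptions made for this example, not Cloudglue's actual response format.

```python
from dataclasses import dataclass, field


@dataclass
class SegmentEntity:
    """Hypothetical segment-level entity: data tied to a time range."""
    start_seconds: float
    end_seconds: float
    data: dict


@dataclass
class VideoEntities:
    """Hypothetical container holding both kinds of entities for one video."""
    video_level: dict                      # applies to the whole video
    segments: list = field(default_factory=list)  # list of SegmentEntity

    def segments_at(self, t: float) -> list:
        """Return the segment-level entities active at time t (in seconds)."""
        return [s for s in self.segments if s.start_seconds <= t < s.end_seconds]


# Made-up product review: one overall rating, two feature discussions.
review = VideoEntities(
    video_level={"overall_rating": "4/5"},
    segments=[
        SegmentEntity(30.0, 75.0, {"feature": "battery life"}),
        SegmentEntity(75.0, 120.0, {"feature": "camera"}),
    ],
)

print(review.video_level["overall_rating"])
print([s.data["feature"] for s in review.segments_at(80.0)])
```

The video-level dictionary answers "what is this video about overall?", while `segments_at` answers "what is being discussed at this moment?".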

Prompts and Schemas

Cloudglue uses two complementary approaches to extract entities:

Natural Language Prompts: These guide the extraction process by describing what information you’re interested in. For instance, “Extract all kitchen equipment used and ingredients shown in this cooking video” tells Cloudglue what to look for in culinary content.
Entity Schemas: These define the exact structure of the information you want. A schema acts like a template, ensuring the extracted data follows a consistent format. For example:

{
  "recipe": {
    "name": "string",
    "cuisine": "string",
    "servings": "string"
  },
  "equipment": [
    {
      "name": "string",
      "type": "string"
    }
  ],
  "ingredients": [
    {
      "name": "string",
      "amount": "string"
    }
  ]
}
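Because a schema fixes the shape of the output, an application can check extracted data before using it. The following is a minimal sketch of such a check against the template above. The `conforms` helper and the sample results are hypothetical, and the sketch assumes every leaf in the template is the literal `"string"`, as in the example.

```python
# The schema template from the example above, as a Python dict.
schema = {
    "recipe": {"name": "string", "cuisine": "string", "servings": "string"},
    "equipment": [{"name": "string", "type": "string"}],
    "ingredients": [{"name": "string", "amount": "string"}],
}


def conforms(value, template) -> bool:
    """Return True if value has the same shape as template (hypothetical helper)."""
    if template == "string":
        return isinstance(value, str)
    if isinstance(template, dict):
        # Objects must have exactly the template's keys, each conforming.
        return (
            isinstance(value, dict)
            and value.keys() == template.keys()
            and all(conforms(value[k], t) for k, t in template.items())
        )
    if isinstance(template, list):
        # A one-element list template describes every item of the list.
        return isinstance(value, list) and all(conforms(v, template[0]) for v in value)
    return False


# Made-up extraction results, for illustration only.
good = {
    "recipe": {"name": "Pad Thai", "cuisine": "Thai", "servings": "2"},
    "equipment": [{"name": "wok", "type": "pan"}],
    "ingredients": [{"name": "rice noodles", "amount": "200 g"}],
}
bad = {"recipe": {"name": "Pad Thai"}, "equipment": [], "ingredients": []}

print(conforms(good, schema))
print(conforms(bad, schema))
```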

You can use these approaches independently or together, depending on your needs:

Schema Only: Best for straightforward, unambiguous concepts where the structure matters most, such as speaker names and timestamps, jersey numbers, or license plates. These concepts are clear enough that no additional guidance is needed.
Prompt Only: Ideal for exploration or when you’re more interested in discovering what’s in the video than enforcing a specific structure. For instance, “What teaching methods does this instructor use?” or “Extract the key arguments from this debate.” This approach provides flexibility in understanding new content.
Both Together: The most powerful approach when you need both specific guidance and structured output. The prompt helps focus on exactly what you want to extract (e.g., “Identify weather conditions in each scene as either ‘sunny’, ‘cloudy’, ‘rainy’, or ‘snowy’”), while the schema ensures the data comes back in a format your application can immediately use. This combination is particularly valuable for building reliable, production-ready applications.
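The three modes can be sketched as a helper that assembles an extraction request. This is a hedged illustration, not the Cloudglue API: the function `build_extraction_request`, the field names `url`, `prompt`, and `schema`, and the example URL are all assumptions, and no network call is made.

```python
import json


def build_extraction_request(video_url, prompt=None, schema=None):
    """Assemble a hypothetical request body with a prompt, a schema, or both."""
    if prompt is None and schema is None:
        raise ValueError("provide a prompt, a schema, or both")
    request = {"url": video_url}  # field name is an assumption
    if prompt is not None:
        request["prompt"] = prompt
    if schema is not None:
        request["schema"] = schema
    return request


# Placeholder values for illustration.
weather_prompt = (
    "Identify weather conditions in each scene as either "
    "'sunny', 'cloudy', 'rainy', or 'snowy'"
)
weather_schema = {"scenes": [{"weather": "string"}]}
video = "https://example.com/video.mp4"

schema_only = build_extraction_request(video, schema=weather_schema)
prompt_only = build_extraction_request(video, prompt=weather_prompt)
both = build_extraction_request(video, prompt=weather_prompt, schema=weather_schema)

print(json.dumps(both, indent=2))
```

In the combined case the prompt narrows the allowed values while the schema fixes where those values appear in the output.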

Value and Applications

Entity extraction transforms unstructured video content into structured, programmable data. This opens up numerous possibilities:

Content Management: Automatically catalog and organize video libraries based on their content
Search and Discovery: Enable precise querying of video content using entity attributes (e.g., a filter such as “WHERE scene.weather = ‘sunny’ AND person.attire = ‘business_suit’” maps directly onto structured data fields)
Analytics: Track trends and patterns across video content (e.g., “COUNT scenes WHERE weather = ‘rainy’ GROUP BY month”)
Application Integration: Feed structured video data directly into applications, databases, or recommendation systems
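To make the search and analytics examples concrete, the sketch below loads a few segment-level entities into an in-memory SQLite table and runs both kinds of query. The table layout, column names, and sample rows are invented for illustration. In practice the rows would come from your own extraction results.

```python
import sqlite3

# Made-up scene entities: (video_id, month, weather, attire).
rows = [
    ("vid_1", "2024-01", "sunny", "business_suit"),
    ("vid_1", "2024-01", "rainy", "casual"),
    ("vid_2", "2024-02", "rainy", "business_suit"),
    ("vid_3", "2024-02", "rainy", "casual"),
    ("vid_3", "2024-02", "sunny", "business_suit"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE scenes (video_id TEXT, month TEXT, weather TEXT, attire TEXT)"
)
conn.executemany("INSERT INTO scenes VALUES (?, ?, ?, ?)", rows)

# Search: scenes that are sunny and show someone in a business suit.
sunny_suits = conn.execute(
    "SELECT video_id FROM scenes WHERE weather = ? AND attire = ? ORDER BY video_id",
    ("sunny", "business_suit"),
).fetchall()

# Analytics: number of rainy scenes per month.
rainy_by_month = conn.execute(
    "SELECT month, COUNT(*) FROM scenes WHERE weather = 'rainy' "
    "GROUP BY month ORDER BY month"
).fetchall()

print(sunny_suits)     # [('vid_1',), ('vid_3',)]
print(rainy_by_month)  # [('2024-01', 1), ('2024-02', 2)]
```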

The power of entities lies in their ability to bridge the gap between rich video content and the structured data that applications need. Whether you’re building a video-heavy application, analyzing content at scale, or automating video workflows, entities provide the foundation for working with video content programmatically.

Next Steps

To learn more about implementing entity extraction, check out our Extraction Guide. It provides step-by-step examples of schema definition, prompt crafting, and practical applications of entity extraction.
