Build a Video Knowledge Base
Create a searchable knowledge base from your video content
A video knowledge base is a searchable repository that allows AI applications to find semantically similar content based on user queries. This enables more accurate and contextual responses by grounding AI outputs in actual video content.
Before we dive into building our custom solution, it’s worth noting that Cloudglue’s Chat Completion API already provides powerful RAG capabilities for video content. Using collections and the chat API, you can get semantically relevant responses with citations directly.
However, there are cases where a custom implementation makes sense:
- Integrating with existing knowledge infrastructure
- Custom embedding models for domain-specific knowledge
- Fine-grained control over chunking and retrieval
- Cost optimization for specific use cases
Implementation Overview
Our implementation will:
- Use Cloudglue to transcribe videos
- Parse the markdown into manageable chunks
- Embed chunks using sentence-transformers
- Store these in a simple vector database
- Provide a search interface that returns relevant chunks
Let’s get started!
Setting Up the Environment
To build our video knowledge base, we’ll need several Python libraries to handle different aspects of the workflow:
- `cloudglue` for accessing the transcription and video analysis capabilities
- `sentence-transformers` for creating semantic embeddings from text
- `numpy` and `pandas` for data manipulation
- `scipy` for computing similarity between vectors
- `openai` for integrating with AI language models
Let’s start by installing these packages and importing the necessary modules:
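Assuming a notebook environment, the setup could look like the following; the package names match the libraries listed above, though you should verify the exact Cloudglue install instructions against its documentation:

```python
# Install the required packages (uncomment when running in a notebook)
# !pip install cloudglue sentence-transformers numpy pandas scipy openai

import os
import json

import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist
from sentence_transformers import SentenceTransformer
```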
Step 1: Configure Cloudglue Client
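First, instantiate the client with your API key. The sketch below assumes the SDK exposes a `CloudGlue` client class that accepts an `api_key` argument; check the SDK documentation for the exact constructor.

```python
import os

# Assumption: the SDK exposes a CloudGlue client class that takes an API key.
# Adjust the import and constructor to match the installed SDK version.
from cloudglue import CloudGlue

# Read the API key from an environment variable rather than hard-coding it
CLOUDGLUE_API_KEY = os.environ["CLOUDGLUE_API_KEY"]

client = CloudGlue(api_key=CLOUDGLUE_API_KEY)
```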
Step 2: Upload Videos
Before we can analyze our videos, we need to upload them to Cloudglue’s platform. This step is essential as it makes the videos accessible through Cloudglue’s infrastructure for subsequent processing. The following function handles the upload process and returns a file ID that will be used in later steps.
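A helper along these lines could wrap the upload; the `client.files.upload` call and the shape of its return value are assumptions about the SDK and may need adjusting to match the installed version.

```python
def upload_video(client, video_path: str) -> str:
    """Upload a local video file to Cloudglue and return its file ID.

    Assumption: the SDK exposes a files.upload method that accepts a local
    path and returns an object with an `id` attribute.
    """
    uploaded = client.files.upload(video_path)
    print(f"Uploaded {video_path} -> file id {uploaded.id}")
    return uploaded.id


# Example usage (hypothetical local paths)
# file_ids = [upload_video(client, p) for p in ["videos/pasta.mp4", "videos/curry.mp4"]]
```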
Step 3: Transcribe Videos
Now that our videos are uploaded to Cloudglue, we can leverage its powerful transcription capabilities. Cloudglue’s transcription service not only captures spoken content but also analyzes visual elements, on-screen text, and can even generate summaries. This comprehensive approach gives us rich data to work with when building our knowledge base.
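A transcription helper might look like the sketch below. The method name, its parameters, and the structure of the returned transcript are assumptions about the SDK, so treat this as an outline and adjust it to the actual interface.

```python
def transcribe_video(client, file_id: str):
    """Request a rich transcript (speech, visual descriptions, on-screen text)
    for an uploaded file.

    Assumption: the SDK exposes a transcription call keyed by file ID that
    returns (or can be polled for) the completed transcript. The exact method
    name and options differ between SDK versions.
    """
    transcript = client.transcribe(file_id)  # assumed method name
    return transcript


# transcripts = {fid: transcribe_video(client, fid) for fid in file_ids}
```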
Step 4: Process and Chunk the Transcription Data
The raw transcription data from Cloudglue contains various types of information - speech, visual descriptions, on-screen text, and metadata. To make this data usable for semantic search, we need to organize it into manageable chunks. Chunking is crucial for two reasons: it helps maintain semantic coherence by keeping related content together, and it ensures that our vector embeddings represent meaningful units of information.
In this step, we’ll define functions to process the transcription data into chunks, with options for controlling chunk size and overlap between chunks:
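Here is one possible sketch, assuming the transcript has already been flattened into a list of segment dicts, each with text, a type, start/end times, and a source label; the exact field names depend on the transcript format Cloudglue returns:

```python
def chunk_transcript(segments, max_chars=800, overlap_segments=1):
    """Group transcript segments into chunks of roughly max_chars characters.

    `segments` is assumed to be a list of dicts with keys:
    'text', 'type' (speech / visual description / on-screen text),
    'start_time', 'end_time', and 'source' (the video it came from).
    `overlap_segments` trailing segments are carried into the next chunk
    so that context is preserved across chunk boundaries.
    """
    chunks = []
    current = []
    current_len = 0

    def flush():
        if current:
            chunks.append({
                "source": current[0]["source"],
                "type": current[0]["type"],  # simplification: type of the first segment
                "content": " ".join(seg["text"] for seg in current),
                "start_time": current[0]["start_time"],
                "end_time": current[-1]["end_time"],
            })

    for seg in segments:
        if current and current_len + len(seg["text"]) > max_chars:
            flush()
            current = current[-overlap_segments:] if overlap_segments else []
            current_len = sum(len(s["text"]) for s in current)
        current.append(seg)
        current_len += len(seg["text"])

    flush()
    return chunks
```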
Step 5: Embed the Chunks
Now that we’ve processed our transcriptions into meaningful chunks, we need to convert these text representations into numerical vectors that capture their semantic meaning. This process, known as embedding, allows us to measure semantic similarity between different pieces of content.
We’ll use the Sentence Transformers library with the ‘all-MiniLM-L6-v2’ model, which offers a good balance between speed and embedding quality. Each chunk will be embedded into a fixed-length vector, creating a matrix of embeddings that we can search through efficiently.
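In code, that looks roughly like this, reusing the chunks produced in the previous step:

```python
from sentence_transformers import SentenceTransformer

# Load the embedding model (downloads weights on first use)
model = SentenceTransformer("all-MiniLM-L6-v2")

# Embed the text content of every chunk into a (num_chunks, 384) matrix
texts = [chunk["content"] for chunk in chunks]
embeddings = model.encode(texts, show_progress_bar=True, convert_to_numpy=True)

print(embeddings.shape)  # (num_chunks, 384) for all-MiniLM-L6-v2
```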
Example chunks
After processing our transcription data, we end up with well-structured chunks like the examples below. Each chunk contains metadata about its source, type, content, and timestamp range, making it easy to track and reference when returned in search results.
Step 6: Build a Simple Vector Database
The heart of our video knowledge base is a vector database that stores embeddings and allows for efficient semantic search. Before diving into the implementation, let’s understand the architecture:
With our chunks embedded as vectors, we need a way to store and efficiently search through them. While there are many sophisticated vector database solutions available (like Pinecone, Weaviate, or FAISS), for this tutorial we’ll implement a simple in-memory vector database using NumPy and SciPy’s cosine similarity function.
Our vector database will allow us to:
- Store embeddings and their associated metadata
- Search for semantically similar content based on query embeddings
- Save and load the database for persistence
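A minimal sketch covering those three capabilities might look like this; the `SimpleVectorDB` name and its method signatures are illustrative, not from any library:

```python
import json
import numpy as np
from scipy.spatial.distance import cdist


class SimpleVectorDB:
    """A tiny in-memory vector store backed by a NumPy matrix."""

    def __init__(self, embeddings, metadata):
        self.embeddings = np.asarray(embeddings, dtype=np.float32)
        self.metadata = metadata  # one dict per row: source, type, content, timestamps

    def search(self, query_embedding, top_k=5):
        """Return the top_k most similar chunks by cosine similarity."""
        # cdist returns cosine *distance*; similarity = 1 - distance
        distances = cdist(query_embedding.reshape(1, -1), self.embeddings, metric="cosine")[0]
        similarities = 1.0 - distances
        top_indices = np.argsort(similarities)[::-1][:top_k]
        return [
            {"score": float(similarities[i]), **self.metadata[i]}
            for i in top_indices
        ]

    def save(self, prefix):
        """Persist the embeddings and metadata to disk."""
        np.save(f"{prefix}_embeddings.npy", self.embeddings)
        with open(f"{prefix}_metadata.json", "w") as f:
            json.dump(self.metadata, f)

    @classmethod
    def load(cls, prefix):
        """Reload a previously saved database."""
        embeddings = np.load(f"{prefix}_embeddings.npy")
        with open(f"{prefix}_metadata.json") as f:
            metadata = json.load(f)
        return cls(embeddings, metadata)
```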
Step 7: Using the Knowledge Base
With our vector database built and saved, we can now start using it to search for relevant video content. In this step, we’ll define a function to format search results for better readability, then demonstrate several example queries to test the system’s capabilities.
These queries will show how the semantic search can find relevant content even when the exact wording isn’t present in the transcription. For example, we can ask about cooking techniques, ingredients, or visual elements, and the system will return contextually relevant results from across our video collection.
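Building on the `SimpleVectorDB` sketch above, the formatting helper and a couple of illustrative queries might look like this (the query wording is hypothetical):

```python
def format_results(results):
    """Pretty-print search results with score, time range, type, and a snippet."""
    lines = []
    for r in results:
        snippet = r["content"][:200].replace("\n", " ")
        lines.append(
            f"[{r['score']:.3f}] {r['source']} "
            f"({r['start_time']:.0f}s-{r['end_time']:.0f}s, {r['type']}): {snippet}..."
        )
    return "\n".join(lines)


def search_knowledge_base(db, model, query, top_k=5):
    """Embed a query and return the most relevant chunks from the database."""
    query_embedding = model.encode(query, convert_to_numpy=True)
    return db.search(query_embedding, top_k=top_k)


# Illustrative queries against the database built earlier
for query in [
    "What cooking techniques are demonstrated?",
    "Which ingredients are shown on screen?",
]:
    print(f"\nQuery: {query}")
    print(format_results(search_knowledge_base(db, model, query)))
```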
Sample Output
Let’s examine the raw results from our semantic search functionality. These are the direct outputs from our vector database before any AI processing. For each query, we see the top matches along with their similarity scores (higher is better), time ranges in the video, content types, and snippets of the actual content. Notice how even without AI interpretation, the system finds relevant video segments based on semantic meaning rather than just keyword matching:
Integrating with AI Agents
While our semantic search implementation is already useful for finding relevant video content, we can take it a step further by integrating with large language models (LLMs). By combining our knowledge base with an AI agent like OpenAI’s GPT models, we can create a powerful question-answering system that provides natural language responses grounded in our video content.
In this section, we’ll create two key functions:
- A wrapper to query our knowledge base and format the results for an AI agent
- A function to generate structured, informative responses using OpenAI’s API
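A sketch of both pieces is shown below, assuming the current openai Python SDK; the model name gpt-4o-mini is only an example and can be swapped for whichever model you prefer:

```python
from openai import OpenAI

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment


def build_context(db, model, query, top_k=5):
    """Query the knowledge base and format the hits as context for the LLM."""
    results = search_knowledge_base(db, model, query, top_k=top_k)
    return "\n\n".join(
        f"Source: {r['source']} ({r['start_time']:.0f}s-{r['end_time']:.0f}s)\n{r['content']}"
        for r in results
    )


def answer_question(db, model, query):
    """Generate an answer grounded in the retrieved video chunks."""
    context = build_context(db, model, query)
    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: substitute your preferred model
        messages=[
            {
                "role": "system",
                "content": (
                    "Answer the user's question using only the provided video "
                    "transcript excerpts. Cite the source and timestamps you used."
                ),
            },
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
    )
    return response.choices[0].message.content
```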
Sample Output
Here’s what the output might look like when running the QA system with real videos:
Conclusion
As demonstrated by our examples above, combining a semantic search vector database with large language models creates a powerful system for extracting and presenting information from video content. The integration produces natural, well-structured responses that feel more like answers from a knowledgeable assistant than raw search results.
In this tutorial, we built a custom video knowledge base using Cloudglue’s transcription capabilities and open-source embedding models. This approach gives you full control over the embedding and retrieval process while leveraging Cloudglue’s powerful video analysis features.
Our implementation provides:
- Video transcription with speech, visual content, and on-screen text
- Intelligent chunking of content
- Vector embeddings using sentence-transformers
- A simple in-memory vector database with cosine similarity search
- Integration points for AI agents
For production applications, you might want to consider:
- Using a more scalable vector database like Pinecone, Weaviate, or FAISS
- Implementing more sophisticated chunking strategies
- Adding metadata filtering capabilities
- Developing a feedback loop to improve retrieval quality
Remember that while this custom approach gives you flexibility, Cloudglue’s Chat Completion API provides many of these capabilities out of the box with its rich transcript collections and search capabilities.