Build a Video Knowledge Base
Create a searchable knowledge base from your video content
A video knowledge base is a searchable repository that allows AI applications to find semantically similar content based on user queries. This enables more accurate and contextual responses by grounding AI outputs in actual video content.
Before we dive into building our custom solution, it’s worth noting that Cloudglue’s Chat Completion API already provides powerful RAG capabilities for video content. Using collections and the chat API, you can get semantically relevant responses with citations directly.
However, there are cases where a custom implementation makes sense:
- Integrating with existing knowledge infrastructure
- Custom embedding models for domain-specific knowledge
- Fine-grained control over chunking and retrieval
- Cost optimization for specific use cases
Implementation Overview
Our implementation will:
- Use Cloudglue to transcribe videos
- Parse the markdown into manageable chunks
- Embed chunks using sentence-transformers
- Store these in a simple vector database
- Provide a search interface that returns relevant chunks
Let’s get started!
Setting Up the Environment
To build our video knowledge base, we’ll need several Python libraries to handle different aspects of the workflow:
- `cloudglue` for accessing the transcription and video analysis capabilities
- `sentence-transformers` for creating semantic embeddings from text
- `numpy` and `pandas` for data manipulation
- `scipy` for computing similarity between vectors
- `openai` for integrating with AI language models
Let’s start by installing these packages and importing the necessary modules:
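Assuming a notebook environment, the setup could look like the following; the package names match the libraries listed above, though you should verify the exact Cloudglue install instructions against its documentation:

```python
# Install the required packages (uncomment when running in a notebook)
# !pip install cloudglue sentence-transformers numpy pandas scipy openai

import os
import json

import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist
from sentence_transformers import SentenceTransformer
```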
Step 1: Configure Cloudglue Client
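First, instantiate the client with your API key. The sketch below assumes the SDK exposes a `CloudGlue` client class that accepts an `api_key` argument; check the SDK documentation for the exact constructor.

```python
import os

# Assumption: the SDK exposes a CloudGlue client class that takes an API key.
# Adjust the import and constructor to match the installed SDK version.
from cloudglue import CloudGlue

# Read the API key from an environment variable rather than hard-coding it
CLOUDGLUE_API_KEY = os.environ["CLOUDGLUE_API_KEY"]

client = CloudGlue(api_key=CLOUDGLUE_API_KEY)
```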
Step 2: Upload Videos
Before we can analyze our videos, we need to upload them to Cloudglue’s platform. This step is essential as it makes the videos accessible through Cloudglue’s infrastructure for subsequent processing. The following function handles the upload process and returns a file ID that will be used in later steps.
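A helper along these lines could wrap the upload; the `client.files.upload` call and the shape of its return value are assumptions about the SDK and may need adjusting to match the installed version.

```python
def upload_video(client, video_path: str) -> str:
    """Upload a local video file to Cloudglue and return its file ID.

    Assumption: the SDK exposes a files.upload method that accepts a local
    path and returns an object with an `id` attribute.
    """
    uploaded = client.files.upload(video_path)
    print(f"Uploaded {video_path} -> file id {uploaded.id}")
    return uploaded.id


# Example usage (hypothetical local paths)
# file_ids = [upload_video(client, p) for p in ["videos/pasta.mp4", "videos/curry.mp4"]]
```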
Step 3: Transcribe Videos
Now that our videos are uploaded to Cloudglue, we can leverage its powerful transcription capabilities. Cloudglue’s transcription service not only captures spoken content but also analyzes visual elements, on-screen text, and can even generate summaries. This comprehensive approach gives us rich data to work with when building our knowledge base.
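A transcription helper might look like the sketch below. The method name, its parameters, and the structure of the returned transcript are assumptions about the SDK, so treat this as an outline and adjust it to the actual interface.

```python
def transcribe_video(client, file_id: str):
    """Request a rich transcript (speech, visual descriptions, on-screen text)
    for an uploaded file.

    Assumption: the SDK exposes a transcription call keyed by file ID that
    returns (or can be polled for) the completed transcript. The exact method
    name and options differ between SDK versions.
    """
    transcript = client.transcribe(file_id)  # assumed method name
    return transcript


# transcripts = {fid: transcribe_video(client, fid) for fid in file_ids}
```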
Step 4: Process and Chunk the Transcription Data
The raw transcription data from Cloudglue contains various types of information - speech, visual descriptions, on-screen text, and metadata. To make this data usable for semantic search, we need to organize it into manageable chunks. Chunking is crucial for two reasons: it helps maintain semantic coherence by keeping related content together, and it ensures that our vector embeddings represent meaningful units of information.
In this step, we’ll define functions to process the transcription data into chunks, with options for controlling chunk size and overlap between chunks:
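Here is one possible sketch, assuming the transcript has already been flattened into a list of segment dicts, each with text, a type, start/end times, and a source label; the exact field names depend on the transcript format Cloudglue returns:

```python
def chunk_transcript(segments, max_chars=800, overlap_segments=1):
    """Group transcript segments into chunks of roughly max_chars characters.

    `segments` is assumed to be a list of dicts with keys:
    'text', 'type' (speech / visual description / on-screen text),
    'start_time', 'end_time', and 'source' (the video it came from).
    `overlap_segments` trailing segments are carried into the next chunk
    so that context is preserved across chunk boundaries.
    """
    chunks = []
    current = []
    current_len = 0

    def flush():
        if current:
            chunks.append({
                "source": current[0]["source"],
                "type": current[0]["type"],  # simplification: type of the first segment
                "content": " ".join(seg["text"] for seg in current),
                "start_time": current[0]["start_time"],
                "end_time": current[-1]["end_time"],
            })

    for seg in segments:
        if current and current_len + len(seg["text"]) > max_chars:
            flush()
            current = current[-overlap_segments:] if overlap_segments else []
            current_len = sum(len(s["text"]) for s in current)
        current.append(seg)
        current_len += len(seg["text"])

    flush()
    return chunks
```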
Step 5: Embed the Chunks
Now that we’ve processed our transcriptions into meaningful chunks, we need to convert these text representations into numerical vectors that capture their semantic meaning. This process, known as embedding, allows us to measure semantic similarity between different pieces of content.
We’ll use the Sentence Transformers library with the ‘all-MiniLM-L6-v2’ model, which offers a good balance between speed and embedding quality. Each chunk will be embedded into a fixed-length vector, creating a matrix of embeddings that we can search through efficiently.
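In code, that looks roughly like this, reusing the chunks produced in the previous step:

```python
from sentence_transformers import SentenceTransformer

# Load the embedding model (downloads weights on first use)
model = SentenceTransformer("all-MiniLM-L6-v2")

# Embed the text content of every chunk into a (num_chunks, 384) matrix
texts = [chunk["content"] for chunk in chunks]
embeddings = model.encode(texts, show_progress_bar=True, convert_to_numpy=True)

print(embeddings.shape)  # (num_chunks, 384) for all-MiniLM-L6-v2
```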
Example chunks
After processing our transcription data, we end up with well-structured chunks like the examples below. Each chunk contains metadata about its source, type, content, and timestamp range, making it easy to track and reference when returned in search results.
Step 6: Build a Simple Vector Database
The heart of our video knowledge base is a vector database that stores embeddings and allows for efficient semantic search. Before diving into the implementation, let’s understand the architecture:
With our chunks embedded as vectors, we need a way to store and efficiently search through them. While there are many sophisticated vector database solutions available (like Pinecone, Weaviate, or FAISS), for this tutorial we’ll implement a simple in-memory vector database using NumPy and SciPy’s cosine similarity function.
Our vector database will allow us to:
- Store embeddings and their associated metadata
- Search for semantically similar content based on query embeddings
- Save and load the database for persistence
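A minimal sketch covering those three capabilities might look like this; the `SimpleVectorDB` name and its method signatures are illustrative, not from any library:

```python
import json
import numpy as np
from scipy.spatial.distance import cdist


class SimpleVectorDB:
    """A tiny in-memory vector store backed by a NumPy matrix."""

    def __init__(self, embeddings, metadata):
        self.embeddings = np.asarray(embeddings, dtype=np.float32)
        self.metadata = metadata  # one dict per row: source, type, content, timestamps

    def search(self, query_embedding, top_k=5):
        """Return the top_k most similar chunks by cosine similarity."""
        # cdist returns cosine *distance*; similarity = 1 - distance
        distances = cdist(query_embedding.reshape(1, -1), self.embeddings, metric="cosine")[0]
        similarities = 1.0 - distances
        top_indices = np.argsort(similarities)[::-1][:top_k]
        return [
            {"score": float(similarities[i]), **self.metadata[i]}
            for i in top_indices
        ]

    def save(self, prefix):
        """Persist the embeddings and metadata to disk."""
        np.save(f"{prefix}_embeddings.npy", self.embeddings)
        with open(f"{prefix}_metadata.json", "w") as f:
            json.dump(self.metadata, f)

    @classmethod
    def load(cls, prefix):
        """Reload a previously saved database."""
        embeddings = np.load(f"{prefix}_embeddings.npy")
        with open(f"{prefix}_metadata.json") as f:
            metadata = json.load(f)
        return cls(embeddings, metadata)
```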
Step 7: Using the Knowledge Base
With our vector database built and saved, we can now start using it to search for relevant video content. In this step, we’ll define a function to format search results for better readability, then demonstrate several example queries to test the system’s capabilities.
These queries will show how the semantic search can find relevant content even when the exact wording isn’t present in the transcription. For example, we can ask about cooking techniques, ingredients, or visual elements, and the system will return contextually relevant results from across our video collection.
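Building on the `SimpleVectorDB` sketch above, the formatting helper and a couple of illustrative queries might look like this (the query wording is hypothetical):

```python
def format_results(results):
    """Pretty-print search results with score, time range, type, and a snippet."""
    lines = []
    for r in results:
        snippet = r["content"][:200].replace("\n", " ")
        lines.append(
            f"[{r['score']:.3f}] {r['source']} "
            f"({r['start_time']:.0f}s-{r['end_time']:.0f}s, {r['type']}): {snippet}..."
        )
    return "\n".join(lines)


def search_knowledge_base(db, model, query, top_k=5):
    """Embed a query and return the most relevant chunks from the database."""
    query_embedding = model.encode(query, convert_to_numpy=True)
    return db.search(query_embedding, top_k=top_k)


# Illustrative queries against the database built earlier
for query in [
    "What cooking techniques are demonstrated?",
    "Which ingredients are shown on screen?",
]:
    print(f"\nQuery: {query}")
    print(format_results(search_knowledge_base(db, model, query)))
```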
Sample Output
Let’s examine the raw results from our semantic search functionality. These are the direct outputs from our vector database before any AI processing. For each query, we see the top matches along with their similarity scores (higher is better), time ranges in the video, content types, and snippets of the actual content. Notice how even without AI interpretation, the system finds relevant video segments based on semantic meaning rather than just keyword matching:
Integrating with AI Agents
While our semantic search implementation is already useful for finding relevant video content, we can take it a step further by integrating with large language models (LLMs). By combining our knowledge base with an AI agent like OpenAI’s GPT models, we can create a powerful question-answering system that provides natural language responses grounded in our video content.
In this section, we’ll create two key functions:
- A wrapper to query our knowledge base and format the results for an AI agent
- A function to generate structured, informative responses using OpenAI’s API
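A sketch of both pieces is shown below, assuming the current openai Python SDK; the model name gpt-4o-mini is only an example and can be swapped for whichever model you prefer:

```python
from openai import OpenAI

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment


def build_context(db, model, query, top_k=5):
    """Query the knowledge base and format the hits as context for the LLM."""
    results = search_knowledge_base(db, model, query, top_k=top_k)
    return "\n\n".join(
        f"Source: {r['source']} ({r['start_time']:.0f}s-{r['end_time']:.0f}s)\n{r['content']}"
        for r in results
    )


def answer_question(db, model, query):
    """Generate an answer grounded in the retrieved video chunks."""
    context = build_context(db, model, query)
    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: substitute your preferred model
        messages=[
            {
                "role": "system",
                "content": (
                    "Answer the user's question using only the provided video "
                    "transcript excerpts. Cite the source and timestamps you used."
                ),
            },
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
    )
    return response.choices[0].message.content
```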
Sample Output
Here’s what the output might look like when running the QA system with real videos:
Conclusion
As demonstrated by our examples above, combining a semantic search vector database with large language models creates a powerful system for extracting and presenting information from video content. The integration produces natural, well-structured responses that feel more like answers from a knowledgeable assistant than raw search results.
In this tutorial, we built a custom video knowledge base using Cloudglue’s transcription capabilities and open-source embedding models. This approach gives you full control over the embedding and retrieval process while leveraging Cloudglue’s powerful video analysis features.
Our implementation provides:
- Video transcription with speech, visual content, and on-screen text
- Intelligent chunking of content
- Vector embeddings using sentence-transformers
- A simple in-memory vector database with cosine similarity search
- Integration points for AI agents
For production applications, you might want to consider:
- Using a more scalable vector database like Pinecone, Weaviate, or FAISS
- Implementing more sophisticated chunking strategies
- Adding metadata filtering capabilities
- Developing a feedback loop to improve retrieval quality
Remember that while this custom approach gives you flexibility, Cloudglue’s Chat Completion API provides many of these capabilities out of the box with its rich transcript collections and search capabilities.