Transcription Guide
Deep dive into how to transcribe and analyze videos using our Transcription API
Video is a treasure trove of information, but developers need a way to get that information out. In some cases, developers may have used speech-to-text APIs to get the speech from a video. However, this doesn’t give you the full picture.
With Cloudglue, you can transcribe a video and get the speech alongside other information like on-screen text and visual scene descriptions.
This lets you build more powerful applications that draw on more of the information in a video, in a simple and straightforward way.
Cloudglue can transcribe either files you upload or videos from YouTube.
The Basics
Transcribe Config
The transcribe config is a JSON object that holds the configuration for the transcription. It dictates what the transcription will include from the video/audio file.
By default, the transcription includes the speech and a summary of the video/audio file. You can also include visual scene descriptions and on-screen text or captions.
Example Config
Here’s an example of the config you can use to transcribe a video/audio file.
By altering the config, you can explicitly control what gets generated from the video/audio file.
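As a rough illustration of how such a config might be assembled, here is a minimal Python sketch. The key names (`enable_speech`, `enable_summary`, `enable_visual_scene_description`, `enable_scene_text`) and the helper `build_transcribe_config` are assumptions made for this example, not confirmed field names; check the API reference for the exact schema. The defaults mirror the behavior described above: speech and summary on, everything else off.

```python
import json


def build_transcribe_config(
    speech: bool = True,
    summary: bool = True,
    visual_scene_description: bool = False,
    scene_text: bool = False,
) -> dict:
    """Return a transcribe config dict.

    The key names are hypothetical; only the defaults (speech and
    summary enabled) come from the guide.
    """
    return {
        "enable_speech": speech,
        "enable_summary": summary,
        "enable_visual_scene_description": visual_scene_description,
        "enable_scene_text": scene_text,
    }


# Default config: speech and summary only.
default_config = build_transcribe_config()

# Opt in to visual scene descriptions and on-screen text as well.
full_config = build_transcribe_config(visual_scene_description=True, scene_text=True)

print(json.dumps(full_config, indent=2))
```

Keeping the config in one small helper makes it easy to see, and to change, exactly which outputs a given request asks for.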
Generating a Transcription
Example
All you need is a few lines of code to generate a transcription with Cloudglue.
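The sketch below shows one way a transcription request could be put together using only the Python standard library. Everything specific in it is an assumption for illustration: the endpoint URL is a placeholder, and the bearer-token header, the `url` field, and the config keys are hypothetical. Substitute the real values from the Transcribe Video Endpoint reference. The request is only built here, not sent.

```python
import json
import urllib.request

# Placeholder endpoint (assumption); replace with the real Transcribe Video Endpoint URL.
TRANSCRIBE_URL = "https://api.example.invalid/transcribe"


def build_transcribe_request(api_key: str, video_url: str, config: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request asking for a transcription.

    The payload shape and auth scheme are hypothetical.
    """
    payload = {"url": video_url, **config}
    return urllib.request.Request(
        TRANSCRIBE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_transcribe_request(
    api_key="YOUR_API_KEY",
    video_url="https://example.com/video.mp4",
    config={"enable_speech": True, "enable_summary": True},  # hypothetical keys
)

# Sending is left to the caller, for example:
# with urllib.request.urlopen(request) as response:
#     job = json.load(response)
```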
Transcription Outputs
Retrieving the transcription takes a single call.
Getting the Transcription
Examples
Here’s an example of a truncated output you get when you transcribe a video/audio file.
JSON Example
Let’s look at the different transcriptions available from the JSON output.
- title: The generated title of the video, based on the transcriptions generated.
- summary: A generated summary of the video, based on the transcriptions generated.
- speech: The speech transcription of the video.
- visual_scene_description: A description of the scene at different timestamps in the video.
- scene_text: The on-screen text at different timestamps in the video.
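To make the fields above concrete, the following sketch merges the three timestamped fields into a single timeline ordered by start time. The top-level field names come from the list above; the shape of each entry (`text`, `start_time`, `end_time`) and the sample values are assumptions for illustration, so adjust them to match the actual JSON you receive.

```python
from typing import Any

# Top-level fields that carry timestamped entries (names from the guide).
TIMED_FIELDS = ("speech", "visual_scene_description", "scene_text")


def build_timeline(transcript: dict[str, Any]) -> list[tuple[float, str, str]]:
    """Flatten timestamped fields into (start_time, field, text) tuples sorted by start time.

    Assumes each entry is a dict with "text" and "start_time" keys (hypothetical shape).
    Fields that are missing or empty are skipped.
    """
    events = []
    for field in TIMED_FIELDS:
        for entry in transcript.get(field) or []:
            events.append((float(entry["start_time"]), field, entry["text"]))
    return sorted(events, key=lambda event: event[0])


# Illustrative data only, not real API output.
sample = {
    "title": "Demo",
    "summary": "A short demo.",
    "speech": [{"text": "Hello everyone.", "start_time": 0.5, "end_time": 2.0}],
    "visual_scene_description": [
        {"text": "A presenter stands by a whiteboard.", "start_time": 0.0, "end_time": 5.0}
    ],
    "scene_text": [{"text": "Welcome", "start_time": 1.0, "end_time": 3.0}],
}

timeline = build_timeline(sample)
for start, field, text in timeline:
    print(f"{start:6.1f}s  {field}: {text}")
```

A merged timeline like this is a convenient starting point for search or captioning features, since what was said, shown, and written on screen at a given moment sits side by side.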
We also support markdown output, which is ideal for use with LLMs.
Markdown Example
The following is a truncated example of the markdown output.
Looking for other categories of information from the video? Learn more about our extraction features and what they can do for you.
Key Features
- Speech transcription: Speech-to-text transcription of video/audio files.
- Scene text transcriptions: On-screen text or captions from a video/audio file at different timestamps.
- Visual scene transcripts: Get descriptions of different scenes in a video, at different timestamps.
- Title + Summary: Get a generated title and summary of the video/audio file based on all the transcriptions we have available.
- Markdown compatible: Our transcriptions can also be generated as markdown, so you can use them in LLMs right away.
Try it out
Check out our Transcribe Video Endpoint to get started with building your own video/audio processing with Cloudglue. Get started on our platform.
YouTube
At the moment, if you want to transcribe a video directly from YouTube, we only support generating speech transcriptions.
If you would like to get the full spectrum of transcriptions for a YouTube video, you’ll need to download the video and upload it to Cloudglue.
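One way to handle this limitation in client code is to check the source URL before requesting outputs. The sketch below is an illustration, not part of the Cloudglue API: the helper names, the output labels, and the list of YouTube hostnames are all assumptions. It splits a requested set of outputs into those that can be generated directly and those that require downloading the video and uploading it first.

```python
from urllib.parse import urlparse

# Hostnames treated as YouTube in this sketch (assumption; extend as needed).
YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com", "youtu.be"}

# Per the guide, only speech transcription is supported directly from YouTube.
YOUTUBE_SUPPORTED = {"speech"}


def is_youtube_url(url: str) -> bool:
    """Return True if the URL's hostname is a known YouTube host."""
    host = (urlparse(url).hostname or "").lower()
    return host in YOUTUBE_HOSTS


def split_outputs(url: str, requested: set[str]) -> tuple[set[str], set[str]]:
    """Return (supported, needs_upload) for the requested outputs.

    For YouTube URLs, anything other than speech lands in needs_upload.
    """
    if is_youtube_url(url):
        return requested & YOUTUBE_SUPPORTED, requested - YOUTUBE_SUPPORTED
    return set(requested), set()


supported, needs_upload = split_outputs(
    "https://www.youtube.com/watch?v=abc",
    {"speech", "scene_text", "visual_scene_description"},
)
print("supported:", sorted(supported))
print("needs upload:", sorted(needs_upload))
```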