This SDK generates datasets for training Video LLMs from YouTube videos.

Chase-Aldridge/vlm_databuilder

 
 

TensorSense Data Generation SDK

This SDK generates datasets for training Video LLMs from YouTube videos. More sources coming later!

🐠 What it Does

  • Generate search queries with GPT.
  • Search YouTube for videos matching each query using scrapetube.
  • Download the videos that were found, along with their subtitles, using yt-dlp.
  • Detect segments in each video using CLIP and a fancy manual algorithm.
  • Generate annotations for each segment with GPT from the audio transcript (e.g. instructions) in 2 steps: first extract clues from the transcript, then generate annotations based on these clues.
  • Aggregate segments with annotations into one file.
  • Cut segments into separate video clips with ffmpeg.
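
The segment detection step is not spelled out here, so the following is only a rough illustration of the general idea, not the SDK's actual algorithm. It assumes you already have one relevance score per sampled frame (for example, a CLIP image-text similarity) and groups consecutive frames above a threshold into segments. The function name, the threshold and the minimum duration are all hypothetical.

```python
from typing import List, Tuple


def scores_to_segments(
    scores: List[float],
    fps: float = 1.0,
    threshold: float = 0.5,
    min_duration: float = 2.0,
) -> List[Tuple[float, float]]:
    """Group consecutive frames scoring >= threshold into (start, end) in seconds.

    `scores` holds one score per sampled frame; `fps` is the sampling rate.
    All parameters are illustrative assumptions, not SDK settings.
    """
    segments = []
    start = None  # index of the first frame of the currently open segment
    for i, score in enumerate(scores):
        if score >= threshold and start is None:
            start = i  # open a new segment
        elif score < threshold and start is not None:
            segments.append((start / fps, i / fps))  # close the open segment
            start = None
    if start is not None:
        # The video ended while a segment was still open.
        segments.append((start / fps, len(scores) / fps))
    # Drop segments that are too short to be useful clips.
    return [(s, e) for s, e in segments if e - s >= min_duration]
```

Raising `threshold` or `min_duration` trades recall for cleaner clips; a real pipeline would likely also merge segments separated by very short gaps.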

In the end you'll have a directory with useful video clips and an annotation file, which you can then train a model on.
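
As a minimal sketch of how that final output could be produced, the snippet below builds an ffmpeg command line for one segment and writes the aggregated annotations to a JSON file. The helper names, the record fields and the choice of JSON are assumptions made for illustration; the SDK's own file layout may differ.

```python
import json
from pathlib import Path
from typing import Dict, List


def ffmpeg_cut_command(src: str, dst: str, start: float, end: float) -> List[str]:
    """Build an ffmpeg argument list that re-encodes [start, end) of src into dst."""
    if end <= start:
        raise ValueError("end must be greater than start")
    return [
        "ffmpeg", "-y",                # overwrite dst if it already exists
        "-ss", f"{start:.3f}",         # seek to the segment start (input option)
        "-i", src,
        "-t", f"{end - start:.3f}",    # keep only the segment's duration
        "-c:v", "libx264", "-c:a", "aac",
        dst,
    ]


def write_annotations(segments: List[Dict], out_path: Path) -> None:
    """Write all segment records (hypothetical schema) into one JSON file."""
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(
        json.dumps(segments, indent=2, ensure_ascii=False), encoding="utf-8"
    )
```

To actually cut a clip, pass the returned list to `subprocess.run(cmd, check=True)` on a machine with ffmpeg installed. Re-encoding is slower than stream copying but keeps the cut points accurate.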

🐬 Installation

  • Run pip install -r requirements.txt. If it doesn't work, try upgrading the packages with pip install -U -r requirements.txt.
  • Create a .env file with:
    • OPENAI_API_KEY for openai
    • AZURE_OPENAI_ENDPOINT and AZURE_OPENAI_API_KEY for azure
    • OPENAI_API_VERSION='2023-07-01-preview'
  • Set config params in the notebook:
    • openai.type: openai/azure
    • openai.temperature: the higher the value, the more random/creative the output will be
    • openai.deployment: model for openai / deployment for azure. It needs to support structured output and image inputs. Tested on GPT-4o on azure.
    • data_dir: the path where all the results will be saved. Change it for each experiment/dataset.
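
Before running the notebook, it can help to check that the environment variables listed above are set for the chosen openai.type. The helper below is a hypothetical sketch and not part of the SDK. It also assumes that OPENAI_API_VERSION is only needed for azure, which the list above does not state explicitly.

```python
import os
from typing import List, Mapping, Optional

# Variables taken from the list above. Treating OPENAI_API_VERSION as
# azure-only is an assumption; move it if your setup needs it for openai too.
REQUIRED_ENV = {
    "openai": ["OPENAI_API_KEY"],
    "azure": ["AZURE_OPENAI_ENDPOINT", "AZURE_OPENAI_API_KEY", "OPENAI_API_VERSION"],
}


def missing_env(openai_type: str, env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the names of required variables that are unset or empty."""
    if openai_type not in REQUIRED_ENV:
        raise ValueError(f"openai.type must be one of {sorted(REQUIRED_ENV)}")
    env = os.environ if env is None else env
    return [name for name in REQUIRED_ENV[openai_type] if not env.get(name)]
```

Calling `missing_env("azure")` after loading the .env file returns an empty list when everything is in place, and otherwise names what still needs to be added.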

🐙 Usage

Please refer to getting_started.ipynb

If you have your own videos with descriptions, you can skip the download/filtering steps and move straight to generating annotations!
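
For orientation, the two-step annotation flow described under What it Does (clues first, then annotations) might look roughly like the sketch below when applied to your own transcripts or descriptions. The prompts, the function name and the `llm` callable are all hypothetical stand-ins; the real prompts and model calls live in the SDK and the notebook.

```python
from typing import Callable, Dict

# Hypothetical stand-in for a model call: takes a prompt, returns the reply text.
LLM = Callable[[str], str]


def annotate_segment(transcript: str, start: float, end: float, llm: LLM) -> Dict:
    """Two-step annotation: extract clues from the transcript, then annotate."""
    # Step 1: ask the model for clues grounded in the transcript.
    clues_prompt = (
        f"Transcript for the segment from {start:.1f}s to {end:.1f}s:\n"
        f"{transcript}\n\n"
        "List the clues about what is happening in the video."
    )
    clues = llm(clues_prompt)

    # Step 2: ask for the final annotation based only on those clues.
    annotation_prompt = (
        f"Clues about a video segment:\n{clues}\n\n"
        "Write a concise annotation of the segment based on these clues."
    )
    annotation = llm(annotation_prompt)

    return {"start": start, "end": end, "clues": clues, "annotation": annotation}
```

Passing the model as a plain callable keeps the sketch independent of any client library, so it can be exercised with a stub function before wiring in a real model.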
