This SDK generates datasets for training Video LLMs from YouTube videos. More sources coming later!
- Generate search queries with GPT.
- Search for youtube videos for each query using scrapetube.
- Download the videos that were found and subtitles using yt-dlp.
- Detect segments in each video using CLIP and a fancy manual algorithm.
- Generate annotations for each segment with GPT using the audio transcript (e.g. instructions), in 2 steps: first extract clues from the transcript, then generate annotations based on these clues.
- Aggregate segments with annotations into one file.
- Cut segments into separate video clips with ffmpeg.
In the end you'll have a directory with useful video clips and an annotation file, which you can then train a model on.
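The data flow of the steps above can be sketched as follows. Note that every function and class here is an illustrative stand-in, not the SDK's real API — the actual pipeline delegates to GPT, scrapetube, yt-dlp, CLIP, and ffmpeg; this only shows how segments and annotations chain together:

```python
from dataclasses import dataclass

@dataclass
class Segment:
    video_id: str
    start: float  # seconds
    end: float
    annotation: str = ""

def detect_segments(video_id: str) -> list[Segment]:
    # Stand-in for CLIP-based segment detection.
    return [Segment(video_id, 0.0, 5.0), Segment(video_id, 5.0, 12.0)]

def annotate(seg: Segment, transcript: str) -> Segment:
    clues = transcript[:40]  # stand-in for GPT step 1: extract clues
    seg.annotation = f"instruction based on: {clues}"  # stand-in for step 2
    return seg

def build_dataset(video_ids: list[str], transcripts: dict[str, str]) -> list[Segment]:
    # Detect segments per video, then annotate each one from its transcript.
    return [annotate(seg, transcripts[vid])
            for vid in video_ids
            for seg in detect_segments(vid)]
```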
To install, run `pip install -r requirements.txt`. If it doesn't work, try updating the dependencies: `pip install -U -r requirements.txt`.
- make a `.env` file with: `OPENAI_API_KEY` for openai, or `AZURE_OPENAI_ENDPOINT`, `AZURE_OPENAI_API_KEY`, and `OPENAI_API_VERSION='2023-07-01-preview'` for azure
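A minimal sketch of picking up these variables in Python, assuming the `python-dotenv` package is used to read the `.env` file (the `client_kwargs` helper is hypothetical; the variable names are the ones listed above):

```python
import os

# Optional: read the .env file if python-dotenv is installed;
# otherwise rely on variables already exported in the shell.
try:
    from dotenv import load_dotenv
    load_dotenv()
except ImportError:
    pass

def client_kwargs(api_type: str) -> dict:
    """Assemble keyword arguments for an OpenAI or Azure OpenAI client."""
    if api_type == "azure":
        return {
            "azure_endpoint": os.environ["AZURE_OPENAI_ENDPOINT"],
            "api_key": os.environ["AZURE_OPENAI_API_KEY"],
            "api_version": os.environ.get("OPENAI_API_VERSION", "2023-07-01-preview"),
        }
    return {"api_key": os.environ["OPENAI_API_KEY"]}
```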
- set config params in the notebook:
  - `openai.type`: openai/azure
  - `openai.temperature`: the bigger, the more random/creative the output will be
  - `openai.deployment`: model for openai / deployment for azure. Needs to be able to do structured output and process images. Tested on gpt4o on azure.
  - `data_dir`: the path where all the results will be saved. Change it for each experiment/dataset.
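For example, a notebook cell setting these params might look like this (the parameter names come from the list above; the values are purely illustrative):

```python
# Illustrative config values; adjust per experiment/dataset.
config = {
    "openai.type": "azure",          # "openai" or "azure"
    "openai.temperature": 0.7,       # higher -> more random/creative output
    "openai.deployment": "gpt-4o",   # must support structured output + images
    "data_dir": "data/experiment_1", # new directory per experiment/dataset
}
```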
Please refer to `getting_started.ipynb`.
If you have your own videos with descriptions, you can skip the download/filtering steps and move straight to generating annotations!