- 09/20/2024: Our paper has been accepted by EMNLP 2024. See you in Miami!🏝
- 07/04/2024: The OmAgent open-source project has been unveiled. 🎉
- 06/24/2024: The OmAgent research paper has been published.
OmAgent is a multimodal intelligent agent system that uses multimodal large language models and other multimodal algorithms to accomplish interesting tasks. The project includes omagent_core, a lightweight intelligent agent framework designed to address multimodal challenges. With this framework, we have built OmAgent, a long-form video comprehension system. You are also free to use the framework to realize your own ideas.
OmAgent comprises three core components:
- Video2RAG: The concept behind this component is to transform the comprehension of long videos into a multimodal RAG task. The advantage of this approach is that it transcends the limitations imposed by video length; however, the downside is that such preprocessing may lead to the loss of substantial video detail.
- DnCLoop: Inspired by the classical algorithmic paradigm of Divide and Conquer, we devised a recursive general-task processing logic. This method iteratively refines complex problems into a task tree, ultimately transforming intricate tasks into a series of solvable, simpler tasks.
- Rewinder Tool: To address the issue of information loss in the Video2RAG process, we have designed a "progress bar" tool named Rewinder that can be autonomously used by agents. This enables the agents to revisit any video details, allowing them to seek out the necessary information.
For more details, check out our paper, "OmAgent: A Multi-modal Agent Framework for Complex Video Understanding with Task Divide-and-Conquer".
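The divide-and-conquer idea behind DnCLoop can be illustrated with a minimal sketch. This is not the omagent_core implementation: `TaskNode`, `divide_and_conquer`, and the `is_simple`, `split`, `solve`, and `merge` callbacks are hypothetical names, and the lambdas below are toy stand-ins for decisions an agent would normally delegate to an MLLM.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class TaskNode:
    """One node in the task tree (hypothetical structure, not omagent_core's)."""
    description: str
    children: List["TaskNode"] = field(default_factory=list)
    result: str = ""


def divide_and_conquer(
    node: TaskNode,
    is_simple: Callable[[str], bool],
    split: Callable[[str], List[str]],
    solve: Callable[[str], str],
    merge: Callable[[List[str]], str],
    depth: int = 0,
    max_depth: int = 3,
) -> str:
    # Conquer: a simple task (or one at the depth limit) is solved directly.
    if is_simple(node.description) or depth >= max_depth:
        node.result = solve(node.description)
        return node.result
    # Divide: split the task into subtasks and recurse into each one.
    node.children = [TaskNode(d) for d in split(node.description)]
    partial = [
        divide_and_conquer(c, is_simple, split, solve, merge, depth + 1, max_depth)
        for c in node.children
    ]
    # Combine the subtask results into the result of the parent task.
    node.result = merge(partial)
    return node.result


# Toy stand-ins for what an agent would delegate to an MLLM.
root = TaskNode("find the red car and count the people")
answer = divide_and_conquer(
    root,
    is_simple=lambda t: " and " not in t,
    split=lambda t: t.split(" and "),
    solve=lambda t: f"done({t})",
    merge=lambda parts: "; ".join(parts),
)
print(answer)
```

The `max_depth` guard is an assumption added here so that the recursion always terminates even if the splitting step keeps producing complex subtasks.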
- Python >= 3.10
- Install omagent_core

      cd omagent-core
      pip install -e .

- Other requirements

      cd ..
      pip install -r requirements.txt
- Create a configuration file and set the necessary variables

      cd workflows/general && vim config.yaml

  | Configuration Name | Usage |
  | --- | --- |
  | custom_openai_endpoint | API address for calling OpenAI GPT or another MLLM, format: `{custom_openai_endpoint}/chat/completions` |
  | custom_openai_key | API key provided by the MLLM provider |
  | bing_api_key | Bing API key, used for web search |

- Set up `run.py`

      from pathlib import Path

      # logging, registry, Builder, DnCInterface and AgentTask are provided by
      # omagent_core; import them as in the repository's own run.py.


      def run_agent(task):
          logging.init_logger("omagent", "omagent", level="INFO")
          registry.import_module(project_root=Path(__file__).parent, custom=["./engine"])
          # General task processing workflow configuration directory
          bot_builder = Builder.from_file("workflows/general")
          agent_input = DnCInterface(bot_id="1", task=AgentTask(id=0, task=task))
          bot_builder.run_bot(agent_input)
          return agent_input.last_output


      if __name__ == "__main__":
          run_agent("Your Query")  # Enter your query
- Start OmAgent by running `python run.py`.
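The endpoint format in the configuration table, `{custom_openai_endpoint}/chat/completions`, can be checked before starting the agent with a small sketch like the one below. The helper names `chat_completions_url` and `missing_keys` are hypothetical and not part of omagent_core, the example endpoint value is only an illustration, and treating just the endpoint and key as required is an assumption.

```python
def chat_completions_url(custom_openai_endpoint: str) -> str:
    """Build the request URL from the configured endpoint.

    Follows the README's stated format
    {custom_openai_endpoint}/chat/completions. Stripping a trailing
    slash is an added convenience, not documented behaviour.
    """
    return custom_openai_endpoint.rstrip("/") + "/chat/completions"


def missing_keys(config: dict, required=("custom_openai_endpoint", "custom_openai_key")) -> list:
    """Return the required configuration names that are absent or empty."""
    return [key for key in required if not config.get(key)]


# Example values only; substitute the ones from your own config.yaml.
config = {"custom_openai_endpoint": "https://api.openai.com/v1/", "custom_openai_key": ""}
print(chat_completions_url(config["custom_openai_endpoint"]))
print(missing_keys(config))
```

If the endpoint in `config.yaml` already ends in `/chat/completions`, the resulting URL would repeat that suffix, so the configured value should stop at the API base path.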
- Optional: OmAgent uses Milvus Lite as its default vector database. The vector database stores video feature vectors and retrieves the vectors relevant to a query, which reduces MLLM computation. If you wish to use the full Milvus service instead, you can deploy the Milvus vector database using Docker. If Docker is not installed, refer to the Docker installation guide.

      # Download the Milvus startup script
      curl -sfL https://raw.githubusercontent.com/milvus-io/milvus/master/scripts/standalone_embed.sh -o standalone_embed.sh
      # Start Milvus in standalone mode
      bash standalone_embed.sh start

  After deployment, fill in the relevant configuration in `workflows/video_understanding/config.yml`.
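To make the role of the vector database concrete, here is a minimal, dependency-free sketch of query-based retrieval: stored feature vectors are ranked by cosine similarity to a query vector and only the top matches are kept. It does not use Milvus or omagent_core; the function names, the three-dimensional vectors, and the segment labels are all hypothetical, and the choice of cosine similarity is an assumption made for illustration.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: Sequence[float], store: Dict[str, Sequence[float]], k: int = 2) -> List[Tuple[str, float]]:
    """Return the k stored entries most similar to the query, best first."""
    scored = [(key, cosine(query, vec)) for key, vec in store.items()]
    return sorted(scored, key=lambda kv: kv[1], reverse=True)[:k]


# Hypothetical 3-dimensional "features" for three video segments.
segments = {
    "00:00-00:10": [1.0, 0.0, 0.0],
    "00:10-00:20": [0.0, 1.0, 0.0],
    "00:20-00:30": [0.7, 0.7, 0.0],
}
hits = top_k([1.0, 0.1, 0.0], segments, k=2)
print([key for key, _ in hits])
```

Only the retrieved segments would then need to be passed to the MLLM, which is how retrieval reduces computation on long videos.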
- Optional: Configure the face recognition algorithm. The agent can call face recognition as a tool, but it is optional. To disable it, edit the `workflows/video_understanding/tools/video_tools.json` configuration file and remove the FaceRecognition section. The default face recognition database is stored in the `data/face_db` directory, with a separate folder for each individual.
- Optional: Set up the Open Vocabulary Detection (OVD) service, which enhances OmAgent's ability to recognize various objects. The OVD tools depend on this service, but it is optional. To disable the OVD tools, remove the following from `workflows/video_understanding/tools/video_tools.json`:

      {
          "name": "ObjectDetection",
          "ovd_endpoint": "$<ovd_endpoint::http://host_ip:8000/inf_predict>",
          "model_id": "$<ovd_model_id::OmDet-Turbo_tiny_SWIN_T>"
      }
  If you use the OVD tools, OmDet serves as the demonstration model.

  - Install OmDet and its environment according to the OmDet Installation Guide.
  - Install the requirements for exposing OmDet Inference as API calls

        pip install pydantic fastapi uvicorn

  - Create a `wsgi.py` file to expose OmDet Inference as an API, then copy the OmDet Inference API code into it

        cd OmDet && vim wsgi.py

  - Start the OmDet Inference API; the default port is 8000

        python wsgi.py
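Disabling a tool as described above means deleting its entry from `video_tools.json`. A minimal sketch of doing that programmatically follows. The helper `remove_tool` is hypothetical, and so is the surrounding file layout in the example (the `"tools"` key and the other entries); only the ObjectDetection entry mirrors the snippet shown in this README. The function walks the parsed JSON recursively so that it does not depend on where the entries actually sit.

```python
import json
from typing import Any


def remove_tool(node: Any, tool_name: str) -> Any:
    """Return a copy of parsed JSON without list entries that are
    objects whose "name" equals tool_name.

    Walks lists and objects recursively, so it makes no assumption about
    where in the file the tool entries sit.
    """
    if isinstance(node, list):
        return [
            remove_tool(item, tool_name)
            for item in node
            if not (isinstance(item, dict) and item.get("name") == tool_name)
        ]
    if isinstance(node, dict):
        return {key: remove_tool(value, tool_name) for key, value in node.items()}
    return node


# Hypothetical file layout: only the ObjectDetection entry mirrors the README.
raw = """
{"tools": [
  {"name": "ObjectDetection",
   "ovd_endpoint": "$<ovd_endpoint::http://host_ip:8000/inf_predict>",
   "model_id": "$<ovd_model_id::OmDet-Turbo_tiny_SWIN_T>"},
  {"name": "FaceRecognition"},
  "WebSearch"
]}
"""
cleaned = remove_tool(json.loads(raw), "ObjectDetection")
print(json.dumps(cleaned, indent=2))
```

Calling `remove_tool(cleaned, "FaceRecognition")` in the same way would drop the face recognition entry; writing the result back with `json.dump` is left out so the sketch never touches a real file.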
- Download some interesting videos
- Create a configuration file and set the necessary environment variables

      cd workflows/video_understanding && vim config.yaml

- Configure the API addresses and API keys for the MLLM and tools.

  | Configuration Name | Usage |
  | --- | --- |
  | custom_openai_endpoint | API address for calling OpenAI GPT or another MLLM, format: `{custom_openai_endpoint}/chat/completions` |
  | custom_openai_key | API key provided by the respective API provider |
  | bing_api_key | Bing API key, used for web search |
  | ovd_endpoint | OVD tool API address. If using OmDet, the address should be `http://host:8000/inf_predict` |
  | ovd_model_id | Model ID used by the OVD tool. If using OmDet, the model ID should be `OmDet-Turbo_tiny_SWIN_T` |
- Set up `run.py`

      from pathlib import Path

      # logging, registry, Builder, DnCInterface and AgentTask are provided by
      # omagent_core; import them as in the repository's own run.py.


      def run_agent(task):
          logging.init_logger("omagent", "omagent", level="INFO")
          registry.import_module(project_root=Path(__file__).parent, custom=["./engine"])
          # Video understanding task workflow configuration directory
          bot_builder = Builder.from_file("workflows/video_understanding")
          agent_input = DnCInterface(bot_id="1", task=AgentTask(id=0, task=task))
          bot_builder.run_bot(agent_input)
          return agent_input.last_output


      if __name__ == "__main__":
          run_agent("")  # You will be prompted to enter the query in the console
- Start OmAgent by running `python run.py`. Enter the path of the video you want to process, wait a moment, then enter your query, and OmAgent will answer based on the query.
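The tool configuration shown earlier uses values such as `$<ovd_endpoint::http://host_ip:8000/inf_predict>`. These read like a variable name followed by a default value, with the variable presumably filled from entries such as `ovd_endpoint` in `config.yaml`. The sketch below illustrates that reading only; it is an assumption about the semantics, not omagent_core's actual resolution logic, and the `resolve` helper is hypothetical.

```python
import re
from typing import Mapping

# Matches $<name::default>; the default may contain single colons (as in URLs).
PLACEHOLDER = re.compile(r"\$<(\w+)::(.*?)>")


def resolve(text: str, values: Mapping[str, str]) -> str:
    """Replace each $<name::default> with values[name], else with the default."""
    return PLACEHOLDER.sub(lambda m: values.get(m.group(1), m.group(2)), text)


template = "$<ovd_endpoint::http://host_ip:8000/inf_predict>"
# No configured value: the default after "::" is used.
print(resolve(template, {}))
# A configured value overrides the default.
print(resolve(template, {"ovd_endpoint": "http://host:8000/inf_predict"}))
```

Under this reading, leaving `ovd_endpoint` unset would fall back to the literal `host_ip` default, which is why the configuration step asks for a real address.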
If you are intrigued by multimodal algorithms, large language models, and agent technologies, we invite you to delve deeper into our research endeavors:
🔆 How to Evaluate the Generalization of Detection? A Benchmark for Comprehensive Open-Vocabulary Detection (AAAI24)
🏠 GitHub Repository
🔆 OmDet: Large-scale vision-language multi-dataset pre-training with multimodal detection network (IET Computer Vision)
🏠 GitHub Repository
If you find our repository beneficial, please cite our paper:
@article{zhang2024omagent,
title={OmAgent: A Multi-modal Agent Framework for Complex Video Understanding with Task Divide-and-Conquer},
author={Zhang, Lu and Zhao, Tiancheng and Ying, Heting and Ma, Yibo and Lee, Kyusong},
journal={arXiv preprint arXiv:2406.16620},
year={2024}
}