diff --git a/docs/examples/DnC.md b/docs/examples/DnC.md index e69de29b..8d4add95 100644 --- a/docs/examples/DnC.md +++ b/docs/examples/DnC.md @@ -0,0 +1,131 @@ +# Divide-and-Conquer Example + +In computer science, divide and conquer is an algorithm design paradigm. A divide-and-conquer algorithm recursively breaks down a problem into two or more sub-problems of the same or related type, until these become simple enough to be solved directly. The solutions to the sub-problems are then combined to give a solution to the original problem. + +This example demonstrates how to use the framework for divide-and-conquer tasks. The example code can be found in the `examples/general_dnc` directory. + +```bash + cd examples/general_dnc +``` + +## Overview + +This example implements a general divide-and-conquer workflow that consists of the following components: + +1. **DnC Input Interface** + - Handles user input containing questions and/or images + - Constructs the data structure for workflow execution + +2. **Init Set Variable Task** + - Initializes the global workflow variables that are needed throughout the entire workflow + +3. **Conqueror Task** + - Executes and manages complex task trees: answer directly as the agent, conquer the current task, call a tool for the current task, or break the current task into several subtasks + - It takes a hierarchical task tree and processes each task node, maintaining context and state between task executions + +4. **Conqueror Update Set Variable Task** + - Updates the global workflow variables changed after the conqueror task execution for a better reading experience in the Conductor UI + +5. **Divider Task** + - Breaks down a complex task into multiple smaller subtasks + - Generates and matches milestones for each subtask + +6. **Divider Update Set Variable Task** + - Updates the global workflow variables changed after the divider task execution for a better reading experience in the Conductor UI + +7. **Rescue Task** + - Rescues a failed tool call task, attempting to fix the issue by retrying with corrected parameters + +8. **Conclude Task** + - The final step of the workflow; concludes the original root task based on all related information + +9. **Switch Task** + - After the conqueror task, switches to the specific next worker based on its decision (see the routing sketch at the end of this document) + - The default case is the next conqueror task + - If the task is too complex, switch to the divider task + - If the task failed, switch to the rescue task + +10. **Task Exit Monitor Task** + - Monitors whether the exit condition of the DnC loop task is met + - Based on the conqueror and divider task(s), the task tree is dynamically generated and continuously updated throughout the workflow + +11. **Post Set Variable Task** + - Updates the global workflow variables changed after the task exit monitor task execution for a better reading experience in the Conductor UI + +12. 
**DnC Loop Task** + - The core of the DnC workflow; it takes a hierarchical task tree and processes each task node, maintaining context and state between task executions + - It contains three main tasks: the conqueror task, the divider task and the rescue task, plus the other supporting tasks mentioned above + +### The complete workflow is shown in the following diagram: + +![DnC Workflow](../images/general_dnc_workflow_diagram.png) + +## Prerequisites + +- Python 3.10+ +- Required packages installed (see requirements.txt) +- Access to OpenAI API or compatible endpoint (see configs/llms/*.yml) +- [Optional] Access to Bing API for WebSearch tool (see configs/tools/*.yml) +- Redis server running locally or remotely +- Conductor server running locally or remotely + +## Configuration + +The container.yaml file is a configuration file that manages dependencies and settings for different components of the system, including Conductor connections, Redis connections, and other service configurations. To set up your configuration: + +1. Generate the container.yaml file: + ```bash + python compile_container.py + ``` + This will create a container.yaml file with default settings under `examples/general_dnc`. + + +2. Configure your LLM and tool settings in `configs/llms/*.yml` and `configs/tools/*.yml`: + - Set your OpenAI API key or compatible endpoint through an environment variable or by directly modifying the yml file + ```bash + export custom_openai_key="your_openai_api_key" + export custom_openai_endpoint="your_openai_endpoint" + ``` + - [Optional] Set your Bing API key or compatible endpoint through an environment variable or by directly modifying the yml file + ```bash + export bing_api_key="your_bing_api_key" + ``` + **Note: It isn't mandatory to set the Bing API key, as the WebSearch tool will fall back to DuckDuckGo search. However, it is recommended to set it for better search quality.** + - Configure other model settings like temperature as needed through environment variables or by directly modifying the yml file + +3. Update settings in the generated `container.yaml`: + - Modify Redis connection settings: + - Set the host, port and credentials for your Redis instance + - Configure both `redis_stream_client` and `redis_stm_client` sections + - Update the Conductor server URL under the conductor_config section + - Adjust any other component settings as needed + +## Running the Example + +1. Run the general DnC example: + + For terminal/CLI usage: + ```bash + python run_cli.py + ``` + + For app/GUI usage: + ```bash + python run_app.py + ``` + +## Troubleshooting + +If you encounter issues: +- Verify Redis is running and accessible +- Check that your OpenAI API key is valid +- Check that your Bing API key is valid if search results are not as expected +- Ensure all dependencies are installed correctly +- Review logs for any error messages +- **Open an issue on GitHub if you can't find a solution; we will do our best to help you out!** + + +## Building the Example + +Coming soon! This section will provide detailed instructions for building and packaging the general_dnc example step by step. 
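+
+## Appendix: Switch Task Routing Sketch
+
+The sketch below illustrates the routing behavior of the Switch Task described in the Overview. It is a minimal, self-contained illustration rather than the framework's actual implementation; the function name and the "complex" case value are assumptions, mirroring the `switch_case_value` field that the conqueror worker returns.
+
+```python
+# Minimal illustration of the Switch Task routing described in the Overview.
+# Not the framework's implementation; the function name and the "complex"
+# case value are illustrative assumptions.
+def route_next_worker(conqueror_output: dict) -> str:
+    """Map the conqueror's decision to the next worker in the DnC loop."""
+    case = conqueror_output.get("switch_case_value", "success")
+    if case == "complex":   # task too complex -> break it into subtasks
+        return "divider"
+    if case == "failed":    # tool call failed -> try to fix and retry
+        return "rescue"
+    return "conqueror"      # default: continue with the next task node
+
+# Example: a failed tool call is routed to the rescue worker.
+print(route_next_worker({"switch_case_value": "failed"}))  # -> rescue
+```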
+ diff --git a/docs/examples/outfit_with_loop.md b/docs/examples/outfit_with_loop.md index abb377be..c4831386 100644 --- a/docs/examples/outfit_with_loop.md +++ b/docs/examples/outfit_with_loop.md @@ -44,7 +44,7 @@ The workflow leverages Redis for state management and the Conductor server for w ## Prerequisites -- Python 3.8+ +- Python 3.10+ - Required packages installed (see requirements.txt) - Access to OpenAI API or compatible endpoint - Access to Bing API key for web search functionality to search real-time weather information for outfit recommendations (see configs/tools/websearch.yml) diff --git a/docs/examples/outfit_with_ltm.md b/docs/examples/outfit_with_ltm.md index 9b1283df..5f25dff5 100644 --- a/docs/examples/outfit_with_ltm.md +++ b/docs/examples/outfit_with_ltm.md @@ -45,7 +45,7 @@ The system uses Redis for state management, Milvus for long-term image storage, ## Prerequisites -- Python 3.8+ +- Python 3.10+ - Required packages installed (see requirements.txt) - Access to OpenAI API or compatible endpoint (see configs/llms/gpt.yml) - Access to Bing API key for web search functionality to search real-time weather information for outfit recommendations (see configs/tools/websearch.yml) diff --git a/docs/examples/outfit_with_switch.md b/docs/examples/outfit_with_switch.md index 873a1718..3eb3762f 100644 --- a/docs/examples/outfit_with_switch.md +++ b/docs/examples/outfit_with_switch.md @@ -37,7 +37,7 @@ The workflow follows this sequence: ## Prerequisites -- Python 3.8+ +- Python 3.10+ - Required packages installed (see requirements.txt) - Access to OpenAI API or compatible endpoint (see configs/llms/gpt.yml) - Access to Bing API key for web search functionality to search real-time weather information for outfit recommendations (see configs/tools/websearch.yml) diff --git a/docs/examples/simple_qa.md b/docs/examples/simple_qa.md index bfc6c5d1..93ca1075 100644 --- a/docs/examples/simple_qa.md +++ b/docs/examples/simple_qa.md @@ -24,7 +24,7 @@ The workflow follows a straightforward sequence: ## Prerequisites -- Python 3.8+ +- Python 3.10+ - Required packages installed (see requirements.txt) - Access to OpenAI API or compatible endpoint (see configs/llms/gpt.yml) - Redis server running locally or remotely diff --git a/docs/examples/video_understanding.md b/docs/examples/video_understanding.md index e69de29b..387e348b 100644 --- a/docs/examples/video_understanding.md +++ b/docs/examples/video_understanding.md @@ -0,0 +1,116 @@ +# Video Understanding Example + +This example demonstrates how to use the framework for hour-long video understanding tasks. The example code can be found in the `examples/video_understanding` directory. + +```bash + cd examples/video_understanding +``` + +## Overview + +This example implements a video understanding workflow based on the DnC workflow, which consists of the following components: + +1. **Video Preprocess Task** + - Preprocesses the video together with its audio information via the speech-to-text capability + - It detects scene boundaries, splits the video into several chunks and extracts frames at specified intervals + - Each scene chunk is summarized by the MLLM with detailed information, cached and written into the vector database for Q&A retrieval + - Video metadata and the video file MD5 are stored for filtering + +2. 
**Video QA Task** + - Takes the user's input question about the video + - Retrieves related information from the vector database using the question + - Extracts the approximate start and end time of the video segment related to the question + - Generates the video object from serialized data in short-term memory (STM) + - Builds the initial task tree with the question for the DnC task + +3. **Divide and Conquer Task** + - Executes the task tree with the question + - For detailed information, refer to the [DnC Example](./DnC.md#overview) + +The system uses Redis for state management, Milvus for long-term memory storage and Conductor for workflow orchestration. + +### The complete workflow is shown in the following diagram: + +![Video Understanding Workflow](../images/video_understanding_workflow_diagram.png) + +## Prerequisites + +- Python 3.10+ +- Required packages installed (see requirements.txt) +- Access to OpenAI API or compatible endpoint (see configs/llms/*.yml) +- [Optional] Access to Bing API for WebSearch tool (see configs/tools/*.yml) +- Redis server running locally or remotely +- Conductor server running locally or remotely + +## Configuration + +The container.yaml file is a configuration file that manages dependencies and settings for different components of the system, including Conductor connections, Redis connections, and other service configurations. To set up your configuration: + +1. Generate the container.yaml file: + ```bash + python compile_container.py + ``` + This will create a container.yaml file with default settings under `examples/video_understanding`. + + +2. Configure your LLM and tool settings in `configs/llms/*.yml` and `configs/tools/*.yml`: + - Set your OpenAI API key or compatible endpoint through an environment variable or by directly modifying the yml file + ```bash + export custom_openai_key="your_openai_api_key" + export custom_openai_endpoint="your_openai_endpoint" + ``` + - [Optional] Set your Bing API key or compatible endpoint through an environment variable or by directly modifying the yml file + ```bash + export bing_api_key="your_bing_api_key" + ``` + **Note: It isn't mandatory to set the Bing API key, as the WebSearch tool will fall back to DuckDuckGo search. However, it is recommended to set it for better search quality.** + - The default text encoder configuration uses OpenAI `text-embedding-3-large` with **3072** dimensions; make sure the `dim` value of `MilvusLTM` in `container.yaml` matches + - Configure other model settings like temperature as needed through environment variables or by directly modifying the yml file + +3. 
Update settings in the generated `container.yaml`: + - Modify Redis connection settings: + - Set the host, port and credentials for your Redis instance + - Configure both `redis_stream_client` and `redis_stm_client` sections + - Update the Conductor server URL under the conductor_config section + - Configure MilvusLTM in the `components` section: + - Set the `storage_name` and `dim` for MilvusLTM + - Set `dim` to **3072** if you use the default OpenAI encoder; make sure to modify the dimension accordingly if you use another custom text encoder model or endpoint + - Adjust other settings as needed + - Configure the hyper-parameters for the video preprocess task in `examples/video_understanding/configs/workers/video_preprocessor.yml` (see the scene-detection sketch at the end of this document) + - `use_cache`: Whether to use the cache for the video preprocess task + - `scene_detect_threshold`: The threshold for scene detection, which is used to determine if a scene change occurs in the video; a lower value means more scenes will be detected, and the default value is **27** + - `frame_extraction_interval`: The interval between frames to extract from the video; the default value is **5** + - `kernel_size`: The size of the kernel for scene detection; it should be an **odd** number, and the default value is automatically calculated based on the resolution of the video. For hour-long videos, it is recommended to leave it blank, but for short videos it is recommended to set a smaller value, like **3** or **5**, to make detection more sensitive to scene changes + - `stt.endpoint`: The endpoint for the speech-to-text service; by default it uses the OpenAI ASR service + - `stt.api_key`: The API key for the speech-to-text service; by default it uses the OpenAI API key + - Adjust any other component settings as needed + +## Running the Example + +1. Run the video understanding example (currently only CLI usage is supported): + + ```bash + python run_cli.py + ``` + + The first time, you need to input the video file path; it will take a while to preprocess the video and store the information in the vector database. + After the video is preprocessed, you can input your question about the video and the system will answer it. Note that the agent may give a wrong or vague answer, especially for questions related to the names of the characters in the video. + + +## Troubleshooting + +If you encounter issues: +- Verify Redis is running and accessible +- Try a larger `scene_detect_threshold` and `frame_extraction_interval` if you find too many scenes are detected +- Check that your OpenAI API key is valid +- Check that your Bing API key is valid if search results are not as expected +- Check that the `dim` value of `MilvusLTM` in `container.yaml` is set correctly; currently a mismatched dimension setting will not raise an error but will lose part of the information (we will add more checks in the future) +- Ensure all dependencies are installed correctly +- Review logs for any error messages +- **Open an issue on GitHub if you can't find a solution; we will do our best to help you out!** + + +## Building the Example + +Coming soon! This section will provide detailed instructions for building and packaging the video_understanding example step by step. 
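+
+## Appendix: Scene Detection Sketch
+
+The sketch below shows roughly how the scene-detection hyper-parameters described in the Configuration section map onto PySceneDetect, mirroring the loader in `examples/video_understanding/agent/misc/scene.py`. The video path and the concrete parameter values are illustrative assumptions, not a definitive configuration.
+
+```python
+# Rough sketch of how scene_detect_threshold, min_scene_len and kernel_size
+# are passed to PySceneDetect, mirroring agent/misc/scene.py.
+# The video path and parameter values are illustrative assumptions.
+from scenedetect import SceneManager, open_video
+from scenedetect.detectors import ContentDetector
+
+video = open_video("my_video.mp4")  # hypothetical input file
+scene_manager = SceneManager()
+scene_manager.add_detector(
+    ContentDetector(
+        threshold=27,                        # scene_detect_threshold: lower -> more scenes
+        min_scene_len=video.frame_rate * 1,  # keep scenes at least ~1 second long
+        # Weight hue/saturation/edge changes; ignore pure luminance changes.
+        weights=ContentDetector.Components(
+            delta_hue=1.0, delta_sat=1.0, delta_lum=0.0, delta_edges=1.0
+        ),
+        kernel_size=None,  # auto-sized from resolution; try 3 or 5 for short clips
+    )
+)
+scene_manager.detect_scenes(video, show_progress=True)
+scenes = scene_manager.get_scene_list(start_in_scene=True)
+print(f"Detected {len(scenes)} scenes")
+```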
+ diff --git a/docs/images/general_dnc_workflow_diagram.png b/docs/images/general_dnc_workflow_diagram.png new file mode 100644 index 00000000..04f5671c --- /dev/null +++ b/docs/images/general_dnc_workflow_diagram.png @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:83b9d130e092d03580fc3a8b1f36be76bb5ed010ca1e56e90ddc6a1433473857 +size 94207 diff --git a/docs/images/video_understanding_workflow_diagram.png b/docs/images/video_understanding_workflow_diagram.png new file mode 100644 index 00000000..bc596f99 --- /dev/null +++ b/docs/images/video_understanding_workflow_diagram.png @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:c67ace7251c7ab98ee484eaa3311a106b4bc968338ebb0d3b3fbedb348b874af +size 70400 diff --git a/examples/general_dnc/configs/tools/all_tools.yml b/examples/general_dnc/configs/tools/all_tools.yml index 576c81d1..1352f09e 100755 --- a/examples/general_dnc/configs/tools/all_tools.yml +++ b/examples/general_dnc/configs/tools/all_tools.yml @@ -6,5 +6,5 @@ tools: - WriteFileContent - ShellTool - name: WebSearch - bing_api_key: ${env|bing_api_key, microsoft_bing_api_key} + bing_api_key: ${env|bing_api_key, null} llm: ${sub|text_res} diff --git a/examples/video_understanding/agent/memories/video_ltm_milvus.py b/examples/video_understanding/agent/memories/video_ltm_milvus.py index 3070ab25..4dccbbb4 100644 --- a/examples/video_understanding/agent/memories/video_ltm_milvus.py +++ b/examples/video_understanding/agent/memories/video_ltm_milvus.py @@ -17,7 +17,9 @@ class VideoMilvusLTM(LTMBase): dim: int = Field(default=128) def model_post_init(self, __context: Any) -> None: - + pass + + def _create_collection(self) -> None: # Check if collection exists if not self.milvus_ltm_client._client.has_collection(self.storage_name): index_params = self.milvus_ltm_client._client.prepare_index_params() @@ -54,9 +56,6 @@ def model_post_init(self, __context: Any) -> None: # Create index separately after collection creation print(f"Created storage {self.storage_name} successfully") - else: - - print(f"{self.storage_name} storage already exists") def __getitem__(self, key: Any) -> Any: key_str = str(key) @@ -71,6 +70,8 @@ def __getitem__(self, key: Any) -> Any: raise KeyError(f"Key {key} not found") def __setitem__(self, key: Any, value: Any) -> None: + self._create_collection() + key_str = str(key) # Check if value is a dictionary containing 'value' and 'embedding' diff --git a/examples/video_understanding/agent/misc/scene.py b/examples/video_understanding/agent/misc/scene.py index f71a9e4d..34eb4367 100755 --- a/examples/video_understanding/agent/misc/scene.py +++ b/examples/video_understanding/agent/misc/scene.py @@ -35,7 +35,7 @@ def conversation(self): class VideoScenes(BaseModel): stream: VideoStream - audio: AudioSegment + audio: Union[AudioSegment, None] scenes: List[Scene] frame_extraction_interval: int @@ -53,6 +53,7 @@ def load( min_scene_len: int = 1, frame_extraction_interval: int = 5, show_progress: bool = False, + kernel_size: Optional[int] = None, ): """Load a video file. 
@@ -65,16 +66,32 @@ def load( """ video = open_video(video_path) scene_manager = SceneManager() - scene_manager.add_detector( - ContentDetector( - threshold=threshold, min_scene_len=video.frame_rate * min_scene_len - ) + weight = ContentDetector.Components( + delta_hue=1.0, + delta_sat=1.0, + delta_lum=0.0, + delta_edges=1.0, ) + if kernel_size is None: + scene_manager.add_detector( + ContentDetector( + threshold=threshold, min_scene_len=video.frame_rate * min_scene_len, weights=weight + ) + ) + else: + scene_manager.add_detector( + ContentDetector( + threshold=threshold, min_scene_len=video.frame_rate * min_scene_len, weights=weight, kernel_size=kernel_size + ) + ) scene_manager.detect_scenes(video, show_progress=show_progress) scenes = scene_manager.get_scene_list(start_in_scene=True) - audio = AudioSegment.from_file(video_path) - audio = normalize(audio) + try: + audio = AudioSegment.from_file(video_path) + audio = normalize(audio) + except (IndexError, OSError): + audio = None return cls( stream=video, scenes=[Scene.init(*scene) for scene in scenes], @@ -139,6 +156,8 @@ def get_audio_clip( Returns: AudioSegment: The audio clip of the scene. """ + if self.audio is None: + return None if isinstance(scene, int): scene = self.scenes[scene] start, end = scene.start, scene.end @@ -197,8 +216,11 @@ def to_serializable(self) -> dict: def from_serializable(cls, data: dict): """Rebuild VideoScenes from serialized data.""" video = open_video(data['video_path']) - audio = AudioSegment.from_file(data['video_path']) - audio = normalize(audio) + try: + audio = AudioSegment.from_file(data['video_path']) + audio = normalize(audio) + except Exception: + audio = None # Rebuild scenes list scenes = [] diff --git a/examples/video_understanding/agent/tools/video_rewinder/rewinder.py b/examples/video_understanding/agent/tools/video_rewinder/rewinder.py index 7b1bbf7a..c970eed7 100755 --- a/examples/video_understanding/agent/tools/video_rewinder/rewinder.py +++ b/examples/video_understanding/agent/tools/video_rewinder/rewinder.py @@ -95,6 +95,7 @@ def _run( ) )["choices"][0]["message"]["content"] image_contents = json_repair.loads(res) + self.stm(self.workflow_instance_id)['image_cache'] = {} return f"{extracted_frames} described as: {image_contents}." async def _arun( diff --git a/examples/video_understanding/agent/video_preprocessor/sys_prompt.prompt b/examples/video_understanding/agent/video_preprocessor/sys_prompt.prompt index 404859eb..1cf15dfa 100644 --- a/examples/video_understanding/agent/video_preprocessor/sys_prompt.prompt +++ b/examples/video_understanding/agent/video_preprocessor/sys_prompt.prompt @@ -7,12 +7,12 @@ You will be provided with a series of video frame images arranged in the order o "time": Optional[str].The time information of current video clip in terms of periods like morning or evening, seasons like spring or autumn, or specific years and time points. Please make sure to directly obtain the information from the provided context without inference or fabrication. If the relevant information cannot be obtained, please return null. "location": Optional[str]. Describe the location where the current event is taking place, including scene details. If the relevant information cannot be obtained, please return null. "character": Optional[str]. Provide a detailed description of the current characters, including their names, relationships, and what they are doing, etc. If the relevant information cannot be obtained, please return null. - "events": List[str]. 
List all the detailed events in the video content in chronological order. Please integrate the information provided by the video frames and the textual information in the audio, and do not overlook any key points. +    "events": List[str]. List and describe all the detailed events in the video content in chronological order. Please integrate the information provided by the video frames and the textual information in the audio, and do not overlook any key points.      "scene": List[str]. Give some detailed description of the scene of the video. This includes, but is not limited to, scene information, textual information, character status expressions, and events displayed in the video. -    "summary": str.Provide an overall description and summary of the content of this video. Please ensure that it remains objective and does not include any speculation or fabrication. This field is mandatory. +    "summary": str. Provide a detailed overall description and summary of the content of this video clip. Ensure that it remains objective and does not include any speculation or fabrication. This field is mandatory. } *** Important Notice *** -1. You will be provided with the video frames and speech-to-text results, so don't say you cannot watch the video. You have enough information to answer the questions. +1. You will be provided with the video frames and speech-to-text results. You have enough information to answer the questions. 2. Sometimes the speech-to-text results maybe empty since there are no person talking. Please analyze based on the information in the images in this situation. \ No newline at end of file diff --git a/examples/video_understanding/agent/video_preprocessor/video_preprocess.py b/examples/video_understanding/agent/video_preprocessor/video_preprocess.py index c9eef648..5fa4f12d 100755 --- a/examples/video_understanding/agent/video_preprocessor/video_preprocess.py +++ b/examples/video_understanding/agent/video_preprocessor/video_preprocess.py @@ -1,7 +1,7 @@ import hashlib import pickle from pathlib import Path -from typing import List +from typing import List, Optional, Union from omagent_core.models.llms.base import BaseLLMBackend from omagent_core.engine.worker.base import BaseWorker @@ -11,10 +11,12 @@ from omagent_core.models.asr.stt import STT from pydantic import Field, field_validator from pydub import AudioSegment +from pydub.effects import normalize from scenedetect import open_video import json_repair from ..misc.scene import VideoScenes from omagent_core.models.encoders.openai_encoder import OpenaiTextEmbeddingV3 +import time CURRENT_PATH = root_path = Path(__file__).parents[0] @@ -34,9 +36,10 @@ class VideoPreprocessor(BaseLLMBackend, BaseWorker): text_encoder: OpenaiTextEmbeddingV3 stt: STT -    scene_detect_threshold: int = 27 +    scene_detect_threshold: Union[float, int] = 27 min_scene_len: int = 1 frame_extraction_interval: int = 5 +    kernel_size: Optional[int] = None show_progress: bool = True use_cache: bool = False @@ -80,7 +83,6 @@ def _run(self, video_path: str, *args, **kwargs): dict: Dictionary containing video_md5 and video_path """ video_path = self.input.read_input(workflow_instance_id=self.workflow_instance_id, input_prompt="Please input the video path:")['messages'][0]['content'][0]['data'] -        print(f"video_path: {video_path}") video_md5 = self.calculate_md5(video_path) kwargs["video_md5"] = video_md5 @@ -93,9 +95,14 @@ def _run(self, video_path: str, *args, **kwargs): if self.use_cache and cache_path.exists(): with open(cache_path, "rb") as f: loaded_scene = 
pickle.load(f) + try: + audio = AudioSegment.from_file(video_path) + audio = normalize(audio) + except Exception: + audio = None video = VideoScenes( stream=open_video(video_path), - audio=AudioSegment.from_file(video_path), + audio=audio, scenes=loaded_scene, frame_extraction_interval=self.frame_extraction_interval, ) @@ -171,13 +178,17 @@ def _run(self, video_path: str, *args, **kwargs): min_scene_len=self.min_scene_len, frame_extraction_interval=self.frame_extraction_interval, show_progress=self.show_progress, + kernel_size=self.kernel_size, ) self.stm(self.workflow_instance_id)['video'] = video.to_serializable() for index, scene in enumerate(video.scenes): print(f"Processing scene {index} / {len(video.scenes)}...") audio_clip = video.get_audio_clip(scene) - scene.stt_res = self.stt.infer(audio_clip) + if audio_clip is None: + scene.stt_res = {"text": ""} + else: + scene.stt_res = self.stt.infer(audio_clip) video_frames, time_stamps = video.get_video_frames(scene) try: face_rec = registry.get_tool("FaceRecognition") diff --git a/examples/video_understanding/agent/video_qa/qa.py b/examples/video_understanding/agent/video_qa/qa.py index 9c0a4795..b59ad582 100755 --- a/examples/video_understanding/agent/video_qa/qa.py +++ b/examples/video_understanding/agent/video_qa/qa.py @@ -64,7 +64,8 @@ def _run(self, video_md5: str, video_path: str, instance_id: str, *args, **kwarg for _, each in related_information ] video = VideoScenes.from_serializable(self.stm(self.workflow_instance_id)['video']) - self.stm(self.workflow_instance_id).extra = { + self.stm(self.workflow_instance_id)['extra'] = { + "video_information": "video is already loaded in the short-term memory(stm).", "video_duration_seconds(s)": video.stream.duration.get_seconds(), "frame_rate": video.stream.frame_rate, "video_summary": "\n---\n".join(related_information), diff --git a/examples/video_understanding/configs/llms/gpt4o.yml b/examples/video_understanding/configs/llms/gpt4o.yml index e7962308..0d612dfc 100755 --- a/examples/video_understanding/configs/llms/gpt4o.yml +++ b/examples/video_understanding/configs/llms/gpt4o.yml @@ -1,5 +1,5 @@ name: OpenaiGPTLLM -model_id: gpt-4o-mini +model_id: gpt-4o api_key: ${env| custom_openai_key, openai_api_key} endpoint: ${env| custom_openai_endpoint, https://api.openai.com/v1} temperature: 0 diff --git a/examples/video_understanding/configs/llms/json_res.yml b/examples/video_understanding/configs/llms/json_res.yml index 3cc0655f..c5cf8adb 100755 --- a/examples/video_understanding/configs/llms/json_res.yml +++ b/examples/video_understanding/configs/llms/json_res.yml @@ -1,5 +1,5 @@ name: OpenaiGPTLLM -model_id: gpt-4o-mini +model_id: gpt-4o api_key: ${env| custom_openai_key, openai_api_key} endpoint: ${env| custom_openai_endpoint, https://api.openai.com/v1} temperature: 0 diff --git a/examples/video_understanding/configs/llms/text_res.yml b/examples/video_understanding/configs/llms/text_res.yml index 0505fb5c..b6c80bdf 100755 --- a/examples/video_understanding/configs/llms/text_res.yml +++ b/examples/video_understanding/configs/llms/text_res.yml @@ -1,5 +1,5 @@ name: OpenaiGPTLLM -model_id: gpt-4o-mini +model_id: gpt-4o api_key: ${env| custom_openai_key, openai_api_key} endpoint: ${env| custom_openai_endpoint, https://api.openai.com/v1} temperature: 0 diff --git a/examples/video_understanding/configs/workers/video_preprocessor.yml b/examples/video_understanding/configs/workers/video_preprocessor.yml index d416921e..0fc9083a 100755 --- 
a/examples/video_understanding/configs/workers/video_preprocessor.yml +++ b/examples/video_understanding/configs/workers/video_preprocessor.yml @@ -1,6 +1,8 @@ name: VideoPreprocessor llm: ${sub|gpt4o} use_cache: true +scene_detect_threshold: 27 +frame_extraction_interval: 5 stt: name: STT endpoint: ${env| custom_openai_endpoint, https://api.openai.com/v1} diff --git a/examples/video_understanding/run_cli.py b/examples/video_understanding/run_cli.py index 5ac2661b..cefe11e4 100644 --- a/examples/video_understanding/run_cli.py +++ b/examples/video_understanding/run_cli.py @@ -33,7 +33,7 @@ video_preprocess_task = simple_task(task_def_name='VideoPreprocessor', task_reference_name='video_preprocess') # 2. Video QA task for video QA -video_qa_task = simple_task(task_def_name='VideoQA', task_reference_name='video_qa', inputs={'video_md5': '${workflow.input.video_md5}', 'video_path': '${workflow.input.video_path}', 'instance_id': '${workflow.input.instance_id}'}) +video_qa_task = simple_task(task_def_name='VideoQA', task_reference_name='video_qa', inputs={'video_md5': video_preprocess_task.output('video_md5'), 'video_path': video_preprocess_task.output('video_path'), 'instance_id': video_preprocess_task.output('instance_id')}) # Divide-and-conquer workflow # 3. Initialize set variable task for global workflow variables diff --git a/omagent-core/src/omagent_core/advanced_components/node/conqueror/conqueror.py b/omagent-core/src/omagent_core/advanced_components/node/conqueror/conqueror.py index 1c1119a0..650b09b8 100644 --- a/omagent-core/src/omagent_core/advanced_components/node/conqueror/conqueror.py +++ b/omagent-core/src/omagent_core/advanced_components/node/conqueror/conqueror.py @@ -149,7 +149,7 @@ def _run(self, agent_task: dict, last_output: str, *args, **kwargs): # Call tool_manager to decide which tool to use and execute the tool execution_status, execution_results = self.tool_manager.execute_task( - content["tool_call"], related_info=self.stm['former_results'] + content["tool_call"], related_info=self.stm(self.workflow_instance_id)['former_results'] ) former_results = self.stm(self.workflow_instance_id)['former_results'] former_results['tool_call'] = content['tool_call'] @@ -169,7 +169,7 @@ def _run(self, agent_task: dict, last_output: str, *args, **kwargs): "tool_status": current_node.status, "tool_result": current_node.result, } - self.callback.info(agent_id=self.workflow_instance_id, progress=f'Conqueror', message=f'Tool call success.') + self.callback.info(agent_id=self.workflow_instance_id, progress=f'Conqueror', message=f'Tool call success. {toolcall_success_output_structure}') return {"agent_task": task.model_dump(), "switch_case_value": "success", "last_output": last_output, "kwargs": kwargs} # Handle the case where the tool call is failed else: @@ -177,7 +177,7 @@ def _run(self, agent_task: dict, last_output: str, *args, **kwargs): current_node.status = TaskStatus.FAILED former_results['tool_call_error'] = f"tool_call {content['tool_call']} raise error: {current_node.result}" self.stm(self.workflow_instance_id)['former_results'] = former_results - self.callback.info(agent_id=self.workflow_instance_id, progress=f'Conqueror', message=f'Tool call failed.') + self.callback.info(agent_id=self.workflow_instance_id, progress=f'Conqueror', message=f'Tool call failed. 
{former_results["tool_call_error"]}') return {"agent_task": task.model_dump(), "switch_case_value": "failed", "last_output": last_output, "kwargs": kwargs} # Handle the case where the LLM generation is not valid else: diff --git a/omagent-core/src/omagent_core/advanced_components/node/conqueror/sys_prompt.prompt b/omagent-core/src/omagent_core/advanced_components/node/conqueror/sys_prompt.prompt index 5eef23a3..0ee1e3a6 100644 --- a/omagent-core/src/omagent_core/advanced_components/node/conqueror/sys_prompt.prompt +++ b/omagent-core/src/omagent_core/advanced_components/node/conqueror/sys_prompt.prompt @@ -2,11 +2,11 @@ You are the most powerful AI agent with the ability to see images. As a Super Ag --- Output --- The output should be a dict in json format, key is one of "divide", "tool_call" and "agent_answer": -"divide": string. The subtasks that need to be done if you want to complete the current task. If the goal of the task cannot be achieved using a single tool, please use this field. Be sure to pay attention to each milestone of the task and ensure that all of them are achieved. Bring as much detail as possible to the task. +"divide": string. The subtasks that need to be done if you want to complete the current task, or when the current information cannot answer the current task precisely. If the goal of the task cannot be achieved using a single tool, or the agent's answer would be vague, you must use this field. Be sure to pay attention to each milestone of the task and ensure that all of them are achieved. Bring as much detail as possible to the task. -"tool_call": string. string. The task needs more specific details by calling a single tool that listed in Tools section. This value must be a comprehensive and detailed description of the task. +"tool_call": string. The task needs more specific details obtained by calling a single tool that is listed in the Tools section. This value must be a comprehensive and detailed description of the task. -"agent_answer": string. The task can be answered directly by the agent without any tools, retain the important parts of the original information. +"agent_answer": string. The task can be answered directly by the agent without any tools, precisely and in detail rather than vaguely and generally. { "divide" or "tool_call" or "agent_answer": "string" } @@ -41,6 +41,4 @@ Result:{"divide":" I need to first extract a frame from the video at 3 minutes a - If the task is something like text understanding, classification, analyzing, etc, output your answer directly, don't use external tools or execute codes. - You can only give up when the tools are insufficient. - You must pass literal and concrete value to tools, do not use reference or pointer, labels. -- Run any program with the shell if you need. You are not limited. -- You can doing any task with the help of given tools, even developing complex programs or configure complex develop environment. -- Only use tool to ask for help, do not use file system to write down your question to human. \ No newline at end of file +- You can do any task with the help of the given tools, even developing complex programs or configuring complex development environments. 
\ No newline at end of file diff --git a/omagent-core/src/omagent_core/advanced_components/node/divider/sys_prompt.prompt b/omagent-core/src/omagent_core/advanced_components/node/divider/sys_prompt.prompt index 9cda28a1..450f4a8d 100644 --- a/omagent-core/src/omagent_core/advanced_components/node/divider/sys_prompt.prompt +++ b/omagent-core/src/omagent_core/advanced_components/node/divider/sys_prompt.prompt @@ -1,39 +1,40 @@ -You are an efficient plan-generation agent, your job is to decompose a task into several subtasks that describe must achieved goals for the job. After you generate subtasks, another agent will attempt to execute them. Your workflow follows the principle of "divide and conquer", so try your best to make the subtasks clear and executable. -Remember you need to add a conclusion subtask to the end of the subtasks list, which should be the final goal of the parent task. The conclusion subtask conclude former results. The conclusion subtask should be the only subtask that does not have any subtasks. +You are an efficient plan-generation agent. +Your job is to decompose an overly complex task into multiple subtasks based on the available tools. +These subtasks need to be precise and include all relevant information required to complete them. +Another agent will execute these subtasks, and you both share all the tools. +Your workflow follows the principle of "divide and conquer," so the subtasks must be clear and executable. +A conclusion subtask must exist at the end of the subtasks list, providing a reasonable answer and explanation for the parent task. --- Output Structure --- The output result should follows this json structure: { -    "tasks": List[SubTask]. The series subtasks decomposed from the parent task. Each subtask should present in SubTask json object structure which will be explained in detail later. These subtasks should be executed sequentially and ultimately achieve the desired result of the parent task. You should control the number of subtasks between 2 to 4. If you can not decompose the parent task for some reasons, leave this field a empty list and put the reason in 'failed_reason' field. -    "failed_reason": string. describe the reason why you can not generate subtasks. Only provide when the 'task' field is an empty list, else return empty list. +    "tasks": List[SubTask]. The list of SubTask objects. The subtasks are decomposed from the parent task. These subtasks should be executed sequentially and ultimately achieve the desired result of the parent task. If you cannot decompose the parent task for some reason, leave this field as an empty list and put the reason in the 'failed_reason' field. +    "failed_reason": string. The reason why you cannot generate subtasks. Only provide this when the 'tasks' field is an empty list; otherwise return the empty string "". } -SubTask json object structure: +SubTask object json structure: { -    "task": string. The main purpose of the sub-task should handle, and what will you do to reach this goal? -    "criticism": string. What problems may the current subtask and goal have? -    "milestones": list[string]. How to automatically check the sub-task is done? -    "status": string = "waiting". Indicates the status of this task. Value 'running' for the now executing task; 'waiting' for tasks waiting for executing; 'success' for the successfully finished ones and 'failed' for failed ones. +    "task": string. The purpose of the sub-task and the detailed implementation path. +    "criticism": string. What problems the current subtask and goal may have. 
+    "milestones": list[string]. How can we automatically verify the completion of each sub-task? +    "status": string. Indicates the status of this task. Enum of ['running', 'waiting', 'success', 'failed']: 'running' for the task currently executing; 'waiting' for tasks waiting to execute; 'success' for tasks that finished successfully; and 'failed' for tasks that failed. } --- Background Information --- -The background information include uplevel tasks, former results. You can refer to these information to answer. 1. uplevel tasks: The tasks series of the level of the parent task. Structured in List[SubTask] format. This could be empty when the task is the original task. -2. former results: The result generated by the former task. This could be empty if the parent task is the first task in it's level. +2. former results: The result generated by the former subtask. This could be empty if the parent task is the first task in its level. ---- Resources --- -When generate subtasks, consider these kinds of tools can be used in task execution: +--- Available Tools --- {{tools}} --- Note --- The user is busy, so make efficient plans that can lead to successful task solving. Do not waste time on making irrelevant or unnecessary plans. -Don't use search engine if you have the knowledge for planning. -Don't divide trivial task into multiple steps. +Don't divide a trivial task into multiple steps. If a task is unsolvable, give up and return with the reason. *** Important Notice *** - Think step by step. Do not omit any necessary subtasks and do not plan unnecessary subtasks. -- Never create new subtasks that similar or same as the existing subtasks. -- For subtasks with similar goals, try to do them together in one subtask, rather than split them into multiple subtasks. -- The task handler is powered by sota LLM, which can directly answer many questions. So make sure your plan can fully utilize its ability and reduce the complexity of the subtasks tree. +- Never create new subtasks that are similar or identical to the existing subtasks or executed tasks. +- For subtasks with similar goals, try to merge them into one subtask. +- The task handler is powered by a SOTA LLM and numerous tools, which can directly solve any subtask. So make sure your plan can fully utilize its ability and reduce the complexity of the subtask tree. - The output should strictly adhere to the given output structure. \ No newline at end of file diff --git a/omagent-core/src/omagent_core/models/llms/base.py b/omagent-core/src/omagent_core/models/llms/base.py index b49f047d..0d2abf95 100644 --- a/omagent-core/src/omagent_core/models/llms/base.py +++ b/omagent-core/src/omagent_core/models/llms/base.py @@ -171,7 +171,7 @@ def infer(self, input_list: List[Dict[str, Any]], **kwargs) -> List[T]: output = self.llm.generate(prompt, **kwargs) for key, value in output["usage"].items(): if value is not None: -                    self.token_usage[key] += value +                    pass for choice in output["choices"]: if choice.get("message"): choice["message"]["content"] = self.output_parser.parse(