Feature enhancement: multimodal input (#30)
* Initialize PR for issue #29

* feat: add support for multiple input types

  This commit adds support for handling multiple types of input (text, image, video, audio) via the -input flag. The flag now accepts a comma-separated list of files, and each file's type is detected and validated against the model's capabilities.

  Key changes:
  - Add InputType enum and Input struct to represent different input types (see the first sketch below)
  - Update ModelConfig to specify supported input types per model
  - Update Executor interface to handle multiple inputs
  - Add input type detection using mime types
  - Add input type validation against model capabilities
  - Update all executors with TODOs for handling non-text inputs

  Model support:
  - Claude models: text and images
  - Gemini models: text, images, video, and audio
  - GPT-4o models: text and images
  - Other models: text only

  Note: Actual handling of non-text inputs will be implemented in future commits.

* fix: error on non-existent input files

  Instead of treating non-existent files passed to the -input flag as text input, throw an error. This prevents confusion and potential mistakes where a user misspells a file path and has it silently treated as text input.

* feat: implement multimodal support for Claude models

  Implement support for image inputs in the Anthropic executor:
  - Add support for reading and base64-encoding image files
  - Add support for detecting image mime types
  - Add support for creating image content blocks (see the Anthropic sketch below)
  - Add error handling for unsupported input types (video and audio)

  Note: Handling of non-text inputs for the other executors will be implemented in future commits.

* feat: implement multimodal support for OpenAI models

  Implement support for image inputs in the OpenAI executor:
  - Add support for reading and base64-encoding image files
  - Add support for detecting image mime types
  - Add support for creating image content parts with data URIs (see the data-URI sketch below)
  - Add error handling for unsupported input types (video and audio)

  Note: The image data is sent as a data URI (data:image/png;base64,...) as required by the OpenAI API.

* feat: implement multimodal support for Gemini models

  Implement support for image and audio inputs in the Gemini executor:
  - Add support for reading image and audio files
  - Add support for detecting mime types
  - Add support for creating image and audio parts (see the Gemini sketch below)
  - Add validation for supported image types (PNG, JPEG, WEBP, HEIC, HEIF)
  - Add validation for supported audio types (WAV, MP3, AIFF, AAC, OGG, FLAC)
  - Add TODO for video support

  Note: Video support is marked as not yet supported because it requires additional implementation work.

* feat: update remaining executors for input type handling

  - Update DeepSeek executor to error on non-text inputs
  - Update OpenAI Reasoning executor to error on non-text inputs
  - Update executor wrapper to only store text inputs in userMessage
  - Improve error messages to clearly indicate the supported input types

* feat: improve input file handling UX

  Change how input files are specified:
  - When the -input flag is used, all arguments except the last one are treated as input files
  - The last argument is always the prompt text
  - Without the -input flag, only one argument (the prompt) is accepted
  - Update help text to explain the new behavior

  Example usage:
    cpe -input file.txt image.png "What's in these files?"   # With input files
    cpe "What is 2+2?"                                       # Without input files
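For illustration, the InputType enum, Input struct, and mime-based detection described in the first commit might look roughly like the following Go sketch. Everything beyond the InputType and Input names (constant values, field names, the DetectInputType helper) is a guess, not the repository's actual code:

```go
package input

import (
	"mime"
	"path/filepath"
	"strings"
)

// InputType mirrors the enum described above; values are illustrative.
type InputType string

const (
	InputTypeText  InputType = "text"
	InputTypeImage InputType = "image"
	InputTypeVideo InputType = "video"
	InputTypeAudio InputType = "audio"
)

// Input pairs a file path with its detected type.
type Input struct {
	Type     InputType
	FilePath string
}

// DetectInputType is a hypothetical helper: it guesses the input type
// from the file extension's mime type, defaulting to text.
func DetectInputType(path string) InputType {
	mimeType := mime.TypeByExtension(filepath.Ext(path))
	switch {
	case strings.HasPrefix(mimeType, "image/"):
		return InputTypeImage
	case strings.HasPrefix(mimeType, "video/"):
		return InputTypeVideo
	case strings.HasPrefix(mimeType, "audio/"):
		return InputTypeAudio
	default:
		return InputTypeText
	}
}
```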
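The Anthropic executor commit describes base64-encoded image content blocks. A minimal sketch of that wire format, assembled here as a generic map rather than the SDK's typed content blocks, which the real executor presumably uses:

```go
package anthropic

import (
	"encoding/base64"
	"mime"
	"os"
	"path/filepath"
)

// buildImageBlock reads an image file and assembles the Anthropic
// Messages API base64 image content block. The helper name is
// illustrative, not the repository's actual code.
func buildImageBlock(path string) (map[string]any, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, err // missing files are an error, per the fix above
	}
	return map[string]any{
		"type": "image",
		"source": map[string]any{
			"type":       "base64",
			"media_type": mime.TypeByExtension(filepath.Ext(path)),
			"data":       base64.StdEncoding.EncodeToString(data),
		},
	}, nil
}
```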
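The OpenAI executor commit sends images as data URIs of the form data:image/png;base64,... . A small sketch of that encoding, with an illustrative helper name:

```go
package openai

import (
	"encoding/base64"
	"fmt"
	"mime"
	"os"
	"path/filepath"
)

// imageDataURI encodes an image file as the data: URI used in OpenAI
// image_url content parts. The helper name is made up for this sketch.
func imageDataURI(path string) (string, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return "", err
	}
	mimeType := mime.TypeByExtension(filepath.Ext(path)) // e.g. image/png
	return fmt.Sprintf("data:%s;base64,%s",
		mimeType, base64.StdEncoding.EncodeToString(data)), nil
}
```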
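The Gemini executor commit builds image and audio parts. A sketch assuming the github.com/google/generative-ai-go/genai package, with the image validation set taken from the commit message; the import path and the Blob field names are assumptions about that SDK:

```go
package gemini

import (
	"fmt"
	"os"

	"github.com/google/generative-ai-go/genai" // assumed import path
)

// Image mime types the commit message lists as supported by Gemini.
var supportedImageTypes = map[string]bool{
	"image/png":  true,
	"image/jpeg": true,
	"image/webp": true,
	"image/heic": true,
	"image/heif": true,
}

// imagePart validates the mime type and wraps the raw bytes in a
// genai.Blob, which carries binary data (images, audio) to the model.
func imagePart(path, mimeType string) (genai.Part, error) {
	if !supportedImageTypes[mimeType] {
		return nil, fmt.Errorf("unsupported image type for Gemini: %s", mimeType)
	}
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	return genai.Blob{MIMEType: mimeType, Data: data}, nil
}
```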
* refactor: improve input flag and file handling

  The -input flag has been converted from a string flag to a boolean flag for better UX:
  - When -input is provided, all but the last argument must be valid files
  - The last argument can be either a file or prompt text
  - Without the -input flag, only one argument (the prompt) is allowed
  - Improved error messages for invalid input combinations
  - Fixed empty text content handling in the model executors

  Example usage:
    cpe -input file.txt image.png "What's in these files?"   # Input files + prompt
    cpe "What is 2+2?"                                       # Single prompt

* feat: allow input files without prompt

  When using the -input flag:
  - At least one input file is required
  - The last argument can be either a file or prompt text
  - If the last argument is not a file, treat it as prompt text
  - If no prompt text is provided, use a default prompt

  Example usage:
    cpe -input file1.txt file2.txt                   # Just input files
    cpe -input file1.txt file2.txt "Analyze these"   # Input files + prompt
    cpe -input file1.txt                             # Single input file

* feat: read text file contents as input

  When using the -input flag:
  - If a file has a text mimetype, read its contents and use them as text input
  - For non-text files (images, etc.), use the file path as input
  - The last argument can still be either a file or prompt text
  - If the last argument is not a file, treat it as prompt text
  (The combined argument-resolution rule is sketched below.)

  Example usage:
    cpe -input prompt.txt                          # Read prompt from file
    cpe -input prompt.txt image.png                # Text from file + image file
    cpe -input prompt.txt "Additional context"     # Text from file + text prompt

* fix: register genai.Blob type with gob encoder

  The genai.Blob type is used by the Gemini models for handling binary data such as images and audio. It must be registered with the gob encoder so that conversations containing binary content can be saved.

* fix: register OpenAI image content type with gob encoder

  The ChatCompletionContentPartImageParam type is used by the OpenAI models for handling image inputs. It must be registered with the gob encoder so that conversations containing image content can be saved (see the gob sketch below).

---------

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
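The three UX commits above converge on a single argument-resolution rule: with -input, every leading argument must be an existing file, and the last argument may be either a file or free-form prompt text. A sketch of that rule as described; the function name and structure are illustrative, and reading text-file contents is left out for brevity:

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// resolveArgs splits CLI arguments into input files and a prompt,
// following the behavior described in the commits above.
func resolveArgs(inputFlag bool, args []string) (files []string, prompt string, err error) {
	if !inputFlag {
		if len(args) != 1 {
			return nil, "", errors.New("without -input, exactly one argument (the prompt) is allowed")
		}
		return nil, args[0], nil
	}
	if len(args) == 0 {
		return nil, "", errors.New("-input requires at least one input file")
	}
	// All leading arguments must be existing files.
	for _, a := range args[:len(args)-1] {
		if _, statErr := os.Stat(a); statErr != nil {
			return nil, "", fmt.Errorf("input file does not exist: %s", a)
		}
	}
	files = args[:len(args)-1]
	last := args[len(args)-1]
	if _, statErr := os.Stat(last); statErr == nil {
		files = append(files, last) // last argument is a file as well
	} else {
		prompt = last // otherwise treat it as the prompt text
	}
	return files, prompt, nil
}
```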
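Both gob fixes follow the standard encoding/gob pattern: each concrete type that can appear behind an interface value in a saved conversation is registered once at startup. A sketch, assuming the generative-ai-go import path:

```go
package conversation

import (
	"encoding/gob"

	"github.com/google/generative-ai-go/genai" // assumed import path
)

func init() {
	// gob can only encode concrete types it has been told about when
	// they are stored behind interface values, so each binary content
	// type is registered before any conversation is saved.
	gob.Register(genai.Blob{})
	// The OpenAI ChatCompletionContentPartImageParam type is registered
	// the same way from the OpenAI executor's package.
}
```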