Feature enhancement: multimodal input (#30)
* Initialize PR for issue #29

* feat: add support for multiple input types

This commit adds support for handling multiple types of input (text, image, video, audio) via the -input flag. The flag now accepts a comma-separated list of files, and each file's type is detected and validated against the model's capabilities.

Key changes:
- Add InputType enum and Input struct to represent different input types
- Update ModelConfig to specify supported input types per model
- Update Executor interface to handle multiple inputs
- Add input type detection using mime types
- Add input type validation against model capabilities
- Update all executors with TODOs for handling non-text inputs

Model support:
- Claude models: text and images
- Gemini models: text, images, video, and audio
- GPT-4o models: text and images
- Other models: text only

Note: Actual handling of non-text inputs will be implemented in future commits.
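
To make the detection and validation steps concrete, here is a minimal sketch (the helper names `detectInputType` and `validateInputs` are illustrative — the PR's main-package code is not shown on this page):

```go
package agent

import (
	"fmt"
	"slices"
	"strings"

	"github.com/gabriel-vasile/mimetype"
)

// detectInputType guesses an input's type from the file's contents.
// Anything that is not an image, video, or audio file is treated as text.
func detectInputType(path string) (InputType, error) {
	mime, err := mimetype.DetectFile(path)
	if err != nil {
		return "", fmt.Errorf("failed to detect mime type of %s: %w", path, err)
	}
	switch {
	case strings.HasPrefix(mime.String(), "image/"):
		return InputTypeImage, nil
	case strings.HasPrefix(mime.String(), "video/"):
		return InputTypeVideo, nil
	case strings.HasPrefix(mime.String(), "audio/"):
		return InputTypeAudio, nil
	default:
		return InputTypeText, nil
	}
}

// validateInputs rejects any input the selected model cannot handle.
func validateInputs(inputs []Input, cfg ModelConfig) error {
	for _, in := range inputs {
		if !slices.Contains(cfg.SupportedInputs, in.Type) {
			return fmt.Errorf("model %s does not support %s inputs", cfg.Name, in.Type)
		}
	}
	return nil
}
```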

* fix: error on non-existent input files

Instead of treating non-existent files passed to the -input flag as text input, return an error. This prevents confusion and subtle mistakes where a user misspells a file path and has it silently treated as text input.
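
A sketch of the check this describes (the exact error text is illustrative):

```go
// Fail fast on paths that do not exist instead of silently
// falling back to treating the argument as prompt text.
if _, err := os.Stat(path); os.IsNotExist(err) {
	return fmt.Errorf("input file does not exist: %s", path)
}
```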

* feat: implement multimodal support for Claude models

Implement support for image inputs in the Anthropic executor:
- Add support for reading and base64 encoding image files
- Add support for detecting image mime types
- Add support for creating image content blocks
- Add error handling for unsupported input types (video and audio)

Note: Actual handling of non-text inputs for other executors will be implemented in future commits.

* feat: implement multimodal support for OpenAI models

Implement support for image inputs in the OpenAI executor:
- Add support for reading and base64 encoding image files
- Add support for detecting image mime types
- Add support for creating image content parts with data URIs
- Add error handling for unsupported input types (video and audio)

Note: The image data is sent as a data URI (data:image/png;base64,...) as required by the OpenAI API.
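
The OpenAI executor's diff is not included on this page; the following is a hedged sketch of the data-URI construction described above (the helper name `imageDataURI` is illustrative):

```go
// imageDataURI reads an image file and encodes it as a data URI
// (data:image/png;base64,...), the form the OpenAI API expects.
func imageDataURI(path string) (string, error) {
	imgData, err := os.ReadFile(path)
	if err != nil {
		return "", fmt.Errorf("failed to read image file %s: %w", path, err)
	}
	mime := mimetype.Detect(imgData)
	if !strings.HasPrefix(mime.String(), "image/") {
		return "", fmt.Errorf("file %s is not an image", path)
	}
	encoded := base64.StdEncoding.EncodeToString(imgData)
	return fmt.Sprintf("data:%s;base64,%s", mime.String(), encoded), nil
}
```

The resulting URI would then be wrapped in a ChatCompletionContentPartImageParam content part.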

* feat: implement multimodal support for Gemini models

Implement support for image and audio inputs in the Gemini executor:
- Add support for reading image and audio files
- Add support for detecting mime types
- Add support for creating image and audio parts
- Add validation for supported image types (PNG, JPEG, WEBP, HEIC, HEIF)
- Add validation for supported audio types (WAV, MP3, AIFF, AAC, OGG, FLAC)
- Add TODO for video support

Note: Video input is not yet supported; it requires additional implementation work.

* feat: update remaining executors for input type handling

- Update DeepSeek executor to error on non-text inputs
- Update OpenAI Reasoning executor to error on non-text inputs
- Update executor wrapper to only store text inputs in userMessage
- Improve error messages to clearly indicate supported input types

* feat: improve input file handling UX

Change how input files are specified:
- When -input flag is used, all arguments except the last one are treated as input files
- The last argument is always the prompt text
- Without -input flag, only one argument (the prompt) is accepted
- Update help text to explain the new behavior

Example usage:
  With input files:    cpe -input file.txt image.png "What's in these files?"
  Without input files: cpe "What is 2+2?"

* refactor: improve input flag and file handling

The -input flag has been converted from a string to a boolean flag for better UX:
- When -input is provided, all but the last argument must be valid files
- Last argument can be either a file or prompt text
- Without -input flag, only one argument (prompt) is allowed
- Improved error messages for invalid input combinations
- Fixed empty text content handling in model executors

Example usage:
  cpe -input file.txt image.png "What's in these files?"  # Input files + prompt
  cpe "What is 2+2?"                                      # Single prompt

* feat: allow input files without prompt

When using -input flag:
- At least one input file is required
- Last argument can be either a file or prompt text
- If last argument is not a file, treat it as prompt text
- If no prompt text is provided, use default prompt

Example usage:
  cpe -input file1.txt file2.txt                    # Just input files
  cpe -input file1.txt file2.txt "Analyze these"    # Input files + prompt
  cpe -input file1.txt                              # Single input file

* feat: read text file contents as input

When using -input flag:
- If a file is of text mimetype, read its contents and use as text input
- For non-text files (images, etc.), use file path as input
- Last argument can still be either a file or prompt text
- If last argument is not a file, treat it as prompt text

Example usage:
  cpe -input prompt.txt                            # Read prompt from file
  cpe -input prompt.txt image.png                  # Text from file + image file
  cpe -input prompt.txt "Additional context"       # Text from file + text prompt

* fix: register genai.Blob type with gob encoder

The genai.Blob type is used by Gemini models for handling binary data like
images and audio. It needs to be registered with the gob encoder to allow
saving conversations that include binary content.

* fix: register OpenAI image content type with gob encoder

The ChatCompletionContentPartImageParam type is used by OpenAI models for
handling image inputs. It needs to be registered with the gob encoder to
allow saving conversations that include image content.
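
The Gemini registration appears in the diff below; the OpenAI one does not. A sketch of what it plausibly looks like (assuming an init function in the OpenAI executor):

```go
func init() {
	// Register the image content part type so conversations that
	// include image inputs can be round-tripped through encoding/gob.
	gob.Register(openai.ChatCompletionContentPartImageParam{})
}
```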

---------

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
github-actions[bot] and github-actions[bot] authored Feb 10, 2025
1 parent 9c83c83 commit dd1ffa8
Showing 11 changed files with 392 additions and 52 deletions.
67 changes: 59 additions & 8 deletions internal/agent/anthropic.go
@@ -3,13 +3,16 @@ package agent
import (
"context"
_ "embed"
"encoding/base64"
"encoding/gob"
"encoding/json"
"fmt"
a "github.com/anthropics/anthropic-sdk-go"
"github.com/anthropics/anthropic-sdk-go/option"
"github.com/gabriel-vasile/mimetype"
gitignore "github.com/sabhiram/go-gitignore"
"io"
"os"
"strings"
"time"
)
@@ -126,19 +129,67 @@ func NewAnthropicExecutor(baseUrl string, apiKey string, logger Logger, ignorer
}
}

-func (s *anthropicExecutor) Execute(input string) error {
+func (s *anthropicExecutor) Execute(inputs []Input) error {
// Convert inputs into content blocks
var contentBlocks []a.BetaContentBlockParamUnion
for _, input := range inputs {
switch input.Type {
case InputTypeText:
if len(strings.TrimSpace(input.Text)) > 0 {
contentBlocks = append(contentBlocks, &a.BetaTextBlockParam{
Text: a.F(input.Text),
Type: a.F(a.BetaTextBlockParamTypeText),
})
}
case InputTypeImage:
// Read and base64 encode the image file
imgData, err := os.ReadFile(input.FilePath)
if err != nil {
return fmt.Errorf("failed to read image file %s: %w", input.FilePath, err)
}

// Detect mime type
mime := mimetype.Detect(imgData)
if !strings.HasPrefix(mime.String(), "image/") {
return fmt.Errorf("file %s is not an image", input.FilePath)
}

// Base64 encode the image data
encodedData := base64.StdEncoding.EncodeToString(imgData)

// Create image block
contentBlocks = append(contentBlocks, &a.BetaImageBlockParam{
Type: a.F(a.BetaImageBlockParamTypeImage),
Source: a.F(a.BetaImageBlockParamSource{
Type: a.F(a.BetaImageBlockParamSourceTypeBase64),
MediaType: a.F(a.BetaImageBlockParamSourceMediaType(mime.String())),
Data: a.F(encodedData),
}),
})
case InputTypeVideo:
return fmt.Errorf("video input is not supported by Claude models")
case InputTypeAudio:
return fmt.Errorf("audio input is not supported by Claude models")
default:
return fmt.Errorf("unknown input type: %s", input.Type)
}
}

if !s.params.Messages.Present {
s.params.Messages = a.F([]a.BetaMessageParam{})
}

// If we have no content blocks, fall back to a default prompt
if len(contentBlocks) == 0 {
contentBlocks = append(contentBlocks, &a.BetaTextBlockParam{
Text: a.F("Please analyze these files."),
Type: a.F(a.BetaTextBlockParamTypeText),
})
}

s.params.Messages = a.F(append(s.params.Messages.Value, a.BetaMessageParam{
-Content: a.F([]a.BetaContentBlockParamUnion{
-&a.BetaTextBlockParam{
-Text: a.F(input),
-Type: a.F(a.BetaTextBlockParamTypeText),
-},
-}),
-Role: a.F(a.BetaMessageParamRoleUser),
+Content: a.F(contentBlocks),
+Role: a.F(a.BetaMessageParamRoleUser),
}))

for {
11 changes: 10 additions & 1 deletion internal/agent/deepseek.go
@@ -108,7 +108,16 @@ func NewDeepSeekExecutor(baseUrl string, apiKey string, logger Logger, ignorer *
}
}

-func (o *deepseekExecutor) Execute(input string) error {
+func (o *deepseekExecutor) Execute(inputs []Input) error {
// Only text input is supported
var textInputs []string
for _, input := range inputs {
if input.Type != InputTypeText {
return fmt.Errorf("input type %s is not supported by DeepSeek models, only text input is supported", input.Type)
}
textInputs = append(textInputs, input.Text)
}
input := strings.Join(textInputs, "\n")
o.logger.Println("Note that the current V3 model is not yet perfected, it seems like the instruction following and tool calling performance is not yet tuned.")
o.logger.Println("Recommend using this model for one-off tasks like generating git commit messages or bash commands.")

20 changes: 18 additions & 2 deletions internal/agent/executor.go
@@ -23,8 +23,24 @@ var agentInstructions
var reasoningAgentInstructions string

type InputType string

const (
InputTypeText InputType = "text"
InputTypeImage InputType = "image"
InputTypeVideo InputType = "video"
InputTypeAudio InputType = "audio"
)

// Input represents a single input to be processed by the model
type Input struct {
Type InputType
Text string // Used when Type is InputTypeText
FilePath string // Used when Type is InputTypeImage, InputTypeVideo, or InputTypeAudio
}

// Executor defines the interface for executing agentic workflows
type Executor interface {
-Execute(input string) error
+Execute(inputs []Input) error
LoadMessages(r io.Reader) error
SaveMessages(w io.Writer) error
PrintMessages() string
@@ -159,7 +175,7 @@ func InitExecutor(logger Logger, flags ModelOptions) (Executor, error) {
executor: executor,
convoManager: convoManager,
model: genConfig.Model,
-userMessage: flags.Input,
+userMessage: "", // Will be set during Execute
continueID: continueId,
}, nil
}
74 changes: 71 additions & 3 deletions internal/agent/gemini.go
@@ -7,11 +7,13 @@ import (
"encoding/json"
"errors"
"fmt"
"github.com/gabriel-vasile/mimetype"
"github.com/google/generative-ai-go/genai"
gitignore "github.com/sabhiram/go-gitignore"
"google.golang.org/api/googleapi"
"google.golang.org/api/option"
"io"
"os"
"strings"
"time"
)
@@ -25,6 +27,7 @@ func init() {
gob.Register(genai.FunctionCall{})
gob.Register(genai.FunctionResponse{})
gob.Register(map[string]interface{}{})
gob.Register(genai.Blob{}) // needed so saved conversations can include binary (image/audio) content
}

type geminiExecutor struct {
@@ -186,23 +189,88 @@ func NewGeminiExecutor(baseUrl string, apiKey string, logger Logger, ignorer *gi
}, nil
}

-func (g *geminiExecutor) Execute(input string) error {
+func (g *geminiExecutor) Execute(inputs []Input) error {
if g.session == nil {
g.session = g.model.StartChat()
}

ctx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
defer cancel()

-// Send initial user message with retries
+// Convert inputs into parts
var parts []genai.Part
for _, input := range inputs {
switch input.Type {
case InputTypeText:
parts = append(parts, genai.Text(input.Text))
case InputTypeImage:
// Read image file
imgData, err := os.ReadFile(input.FilePath)
if err != nil {
return fmt.Errorf("failed to read image file %s: %w", input.FilePath, err)
}

// Detect mime type
mime := mimetype.Detect(imgData)
if !strings.HasPrefix(mime.String(), "image/") {
return fmt.Errorf("file %s is not an image", input.FilePath)
}

// Verify supported image type
switch mime.String() {
case "image/png", "image/jpeg", "image/webp", "image/heic", "image/heif":
// These are supported
default:
return fmt.Errorf("unsupported image type %s for file %s. Supported types: PNG, JPEG, WEBP, HEIC, HEIF", mime.String(), input.FilePath)
}

// Get format without the "image/" prefix
format := strings.TrimPrefix(mime.String(), "image/")

// Create image part
parts = append(parts, genai.ImageData(format, imgData))
case InputTypeAudio:
// Read audio file
audioData, err := os.ReadFile(input.FilePath)
if err != nil {
return fmt.Errorf("failed to read audio file %s: %w", input.FilePath, err)
}

// Detect mime type
mime := mimetype.Detect(audioData)
if !strings.HasPrefix(mime.String(), "audio/") {
return fmt.Errorf("file %s is not an audio file", input.FilePath)
}

// Verify supported audio type
switch mime.String() {
case "audio/wav", "audio/mp3", "audio/aiff", "audio/aac", "audio/ogg", "audio/flac":
// These are supported
default:
return fmt.Errorf("unsupported audio type %s for file %s. Supported types: WAV, MP3, AIFF, AAC, OGG, FLAC", mime.String(), input.FilePath)
}

// Create audio part
parts = append(parts, genai.Blob{
MIMEType: mime.String(),
Data: audioData,
})
case InputTypeVideo:
return fmt.Errorf("video input is not yet supported by this implementation")
default:
return fmt.Errorf("unknown input type: %s", input.Type)
}
}

// Send initial message with retries
var resp *genai.GenerateContentResponse
var err error
maxRetries := 5
retryCount := 0
retryWait := 1 * time.Minute

for retryCount <= maxRetries {
-resp, err = g.session.SendMessage(ctx, genai.Text(input))
+resp, err = g.session.SendMessage(ctx, parts...)
if err == nil {
break
}
27 changes: 23 additions & 4 deletions internal/agent/models.go
@@ -34,9 +34,10 @@ type ModelDefaults struct {
}

type ModelConfig struct {
-Name string
-IsKnown bool
-Defaults ModelDefaults
+Name            string
+IsKnown         bool
+Defaults        ModelDefaults
+SupportedInputs []InputType
}

type ProviderConfig interface {
@@ -71,78 +72,97 @@ var ModelConfigs = map[string]ModelConfig{
"o3-mini": {
Name: openai.ChatModelO3Mini, IsKnown: true,
Defaults: ModelDefaults{MaxTokens: 100000, Temperature: 1},
SupportedInputs: []InputType{InputTypeText},
},
"deepseek-chat": {
Name: "deepseek-chat", IsKnown: true,
Defaults: ModelDefaults{MaxTokens: 8192, Temperature: 0.3},
SupportedInputs: []InputType{InputTypeText},
},
"deepseek-r1": {
Name: "deepseek-reasoner", IsKnown: true,
Defaults: ModelDefaults{MaxTokens: 8192, Temperature: 1},
SupportedInputs: []InputType{InputTypeText},
},
"claude-3-opus": {
Name: anthropic.ModelClaude_3_Opus_20240229, IsKnown: true,
Defaults: ModelDefaults{MaxTokens: 4096, Temperature: 0.3},
SupportedInputs: []InputType{InputTypeText, InputTypeImage},
},
"claude-3-5-sonnet": {
Name: anthropic.ModelClaude3_5Sonnet20241022, IsKnown: true,
Defaults: ModelDefaults{MaxTokens: 8192, Temperature: 0.3},
SupportedInputs: []InputType{InputTypeText, InputTypeImage},
},
"claude-3-5-haiku": {
Name: anthropic.ModelClaude3_5Haiku20241022, IsKnown: true,
Defaults: ModelDefaults{MaxTokens: 8192, Temperature: 0.3},
SupportedInputs: []InputType{InputTypeText, InputTypeImage},
},
"claude-3-haiku": {
Name: anthropic.ModelClaude_3_Haiku_20240307, IsKnown: true,
Defaults: ModelDefaults{MaxTokens: 4096, Temperature: 0.3},
SupportedInputs: []InputType{InputTypeText, InputTypeImage},
},
"gemini-1-5-flash-8b": {
Name: "gemini-1.5-flash-8b", IsKnown: true,
Defaults: ModelDefaults{MaxTokens: 8192, Temperature: 0.3},
SupportedInputs: []InputType{InputTypeText, InputTypeImage, InputTypeVideo, InputTypeAudio},
},
"gemini-1-5-flash": {
Name: "gemini-1.5-flash-002", IsKnown: true,
Defaults: ModelDefaults{MaxTokens: 8192, Temperature: 0.3},
SupportedInputs: []InputType{InputTypeText, InputTypeImage, InputTypeVideo, InputTypeAudio},
},
"gemini-2-flash-exp": {
Name: "gemini-2.0-flash-exp", IsKnown: true,
Defaults: ModelDefaults{MaxTokens: 8192, Temperature: 0.3},
SupportedInputs: []InputType{InputTypeText, InputTypeImage, InputTypeVideo, InputTypeAudio},
},
"gemini-2-flash": {
Name: "gemini-2.0-flash", IsKnown: true,
Defaults: ModelDefaults{MaxTokens: 8192, Temperature: 0.3},
SupportedInputs: []InputType{InputTypeText, InputTypeImage, InputTypeVideo, InputTypeAudio},
},
"gemini-2-flash-lite-preview": {
Name: "gemini-2.0-flash-lite-preview-02-05", IsKnown: true,
Defaults: ModelDefaults{MaxTokens: 8192, Temperature: 0.3},
SupportedInputs: []InputType{InputTypeText, InputTypeImage, InputTypeVideo, InputTypeAudio},
},
"gemini-2-pro-exp": {
Name: "gemini-2.0-pro-exp-02-05", IsKnown: true,
Defaults: ModelDefaults{MaxTokens: 8192, Temperature: 0.3},
SupportedInputs: []InputType{InputTypeText, InputTypeImage, InputTypeVideo, InputTypeAudio},
},
"gemini-1-5-pro": {
Name: "gemini-1.5-pro-002", IsKnown: true,
Defaults: ModelDefaults{MaxTokens: 8192, Temperature: 0.3},
SupportedInputs: []InputType{InputTypeText, InputTypeImage, InputTypeVideo, InputTypeAudio},
},
"gpt-4o": {
Name: openai.ChatModelGPT4o2024_11_20, IsKnown: true,
Defaults: ModelDefaults{MaxTokens: 8192, Temperature: 0.3},
SupportedInputs: []InputType{InputTypeText, InputTypeImage},
},
"gpt-4o-mini": {
Name: openai.ChatModelGPT4oMini2024_07_18, IsKnown: true,
Defaults: ModelDefaults{MaxTokens: 8192, Temperature: 0.3},
SupportedInputs: []InputType{InputTypeText, InputTypeImage},
},
"o1": {
Name: openai.ChatModelO1_2024_12_17, IsKnown: true,
Defaults: ModelDefaults{MaxTokens: 100000, Temperature: 1},
SupportedInputs: []InputType{InputTypeText},
},
"o1-mini": {
Name: openai.ChatModelO1Mini2024_09_12, IsKnown: true,
Defaults: ModelDefaults{MaxTokens: 65536, Temperature: 1},
SupportedInputs: []InputType{InputTypeText},
},
"o1-preview": {
Name: openai.ChatModelO1Preview2024_09_12, IsKnown: true,
Defaults: ModelDefaults{MaxTokens: 100000, Temperature: 1},
SupportedInputs: []InputType{InputTypeText},
},
}

@@ -158,7 +178,6 @@ type ModelOptions struct {
FrequencyPenalty float64
PresencePenalty float64
NumberOfResponses int
-Input string
Version bool
Continue string // conversation ID to continue from
ListConversations bool // List all conversations