Feature enhancement: multimodal input (#30)
* Initialize PR for issue #29

* feat: add support for multiple input types

This commit adds support for handling multiple types of input (text, image, video, audio) via the -input flag. The flag now accepts a comma-separated list of files, and each file's type is detected and validated against the model's capabilities.

Key changes:
- Add InputType enum and Input struct to represent different input types
- Update ModelConfig to specify supported input types per model
- Update Executor interface to handle multiple inputs
- Add input type detection using mime types
- Add input type validation against model capabilities
- Update all executors with TODOs for handling non-text inputs

Model support:
- Claude models: text and images
- Gemini models: text, images, video, and audio
- GPT-4o models: text and images
- Other models: text only

Note: Actual handling of non-text inputs will be implemented in future commits.
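
To make the detection and validation steps concrete, here is a minimal sketch (the helper names `detectInputType` and `validateInputs` are illustrative — the PR's main-package code is not shown on this page):

```go
package agent

import (
	"fmt"
	"slices"
	"strings"

	"github.com/gabriel-vasile/mimetype"
)

// detectInputType guesses an input's type from the file's contents.
// Anything that is not an image, video, or audio file is treated as text.
func detectInputType(path string) (InputType, error) {
	mime, err := mimetype.DetectFile(path)
	if err != nil {
		return "", fmt.Errorf("failed to detect mime type of %s: %w", path, err)
	}
	switch {
	case strings.HasPrefix(mime.String(), "image/"):
		return InputTypeImage, nil
	case strings.HasPrefix(mime.String(), "video/"):
		return InputTypeVideo, nil
	case strings.HasPrefix(mime.String(), "audio/"):
		return InputTypeAudio, nil
	default:
		return InputTypeText, nil
	}
}

// validateInputs rejects any input the selected model cannot handle.
func validateInputs(inputs []Input, cfg ModelConfig) error {
	for _, in := range inputs {
		if !slices.Contains(cfg.SupportedInputs, in.Type) {
			return fmt.Errorf("model %s does not support %s inputs", cfg.Name, in.Type)
		}
	}
	return nil
}
```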

* fix: error on non-existent input files

Instead of treating non-existent files passed to the -input flag as text input, return an error. This prevents confusion and subtle mistakes where a user misspells a file path and has it silently treated as text input.
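
A sketch of the check this describes (the exact error text is illustrative):

```go
// Fail fast on paths that do not exist instead of silently
// falling back to treating the argument as prompt text.
if _, err := os.Stat(path); os.IsNotExist(err) {
	return fmt.Errorf("input file does not exist: %s", path)
}
```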

* feat: implement multimodal support for Claude models

Implement support for image inputs in the Anthropic executor:
- Add support for reading and base64 encoding image files
- Add support for detecting image mime types
- Add support for creating image content blocks
- Add error handling for unsupported input types (video and audio)

Note: Actual handling of non-text inputs for other executors will be implemented in future commits.

* feat: implement multimodal support for OpenAI models

Implement support for image inputs in the OpenAI executor:
- Add support for reading and base64 encoding image files
- Add support for detecting image mime types
- Add support for creating image content parts with data URIs
- Add error handling for unsupported input types (video and audio)

Note: The image data is sent as a data URI (data:image/png;base64,...) as required by the OpenAI API.
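
The OpenAI executor's diff is not included on this page; the following is a hedged sketch of the data-URI construction described above (the helper name `imageDataURI` is illustrative):

```go
// imageDataURI reads an image file and encodes it as a data URI
// (data:image/png;base64,...), the form the OpenAI API expects.
func imageDataURI(path string) (string, error) {
	imgData, err := os.ReadFile(path)
	if err != nil {
		return "", fmt.Errorf("failed to read image file %s: %w", path, err)
	}
	mime := mimetype.Detect(imgData)
	if !strings.HasPrefix(mime.String(), "image/") {
		return "", fmt.Errorf("file %s is not an image", path)
	}
	encoded := base64.StdEncoding.EncodeToString(imgData)
	return fmt.Sprintf("data:%s;base64,%s", mime.String(), encoded), nil
}
```

The resulting URI would then be wrapped in a ChatCompletionContentPartImageParam content part.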

* feat: implement multimodal support for Gemini models

Implement support for image and audio inputs in the Gemini executor:
- Add support for reading image and audio files
- Add support for detecting mime types
- Add support for creating image and audio parts
- Add validation for supported image types (PNG, JPEG, WEBP, HEIC, HEIF)
- Add validation for supported audio types (WAV, MP3, AIFF, AAC, OGG, FLAC)
- Add TODO for video support

Note: Video input is not yet supported; it requires additional implementation work.

* feat: update remaining executors for input type handling

- Update DeepSeek executor to error on non-text inputs
- Update OpenAI Reasoning executor to error on non-text inputs
- Update executor wrapper to only store text inputs in userMessage
- Improve error messages to clearly indicate supported input types

* feat: improve input file handling UX

Change how input files are specified:
- When -input flag is used, all arguments except the last one are treated as input files
- The last argument is always the prompt text
- Without -input flag, only one argument (the prompt) is accepted
- Update help text to explain the new behavior

Example usage:
  With input files:    cpe -input file.txt image.png "What's in these files?"
  Without input files: cpe "What is 2+2?"

* refactor: improve input flag and file handling

The -input flag has been converted from a string to a boolean flag for better UX:
- When -input is provided, all but the last argument must be valid files
- Last argument can be either a file or prompt text
- Without -input flag, only one argument (prompt) is allowed
- Improved error messages for invalid input combinations
- Fixed empty text content handling in model executors

Example usage:
  cpe -input file.txt image.png "What's in these files?"  # Input files + prompt
  cpe "What is 2+2?"                                      # Single prompt

* feat: allow input files without prompt

When using -input flag:
- At least one input file is required
- Last argument can be either a file or prompt text
- If last argument is not a file, treat it as prompt text
- If no prompt text is provided, use default prompt

Example usage:
  cpe -input file1.txt file2.txt                    # Just input files
  cpe -input file1.txt file2.txt "Analyze these"    # Input files + prompt
  cpe -input file1.txt                              # Single input file

* feat: read text file contents as input

When using -input flag:
- If a file is of text mimetype, read its contents and use as text input
- For non-text files (images, etc.), use file path as input
- Last argument can still be either a file or prompt text
- If last argument is not a file, treat it as prompt text

Example usage:
  cpe -input prompt.txt                            # Read prompt from file
  cpe -input prompt.txt image.png                  # Text from file + image file
  cpe -input prompt.txt "Additional context"       # Text from file + text prompt

* fix: register genai.Blob type with gob encoder

The genai.Blob type is used by Gemini models for handling binary data like
images and audio. It needs to be registered with the gob encoder to allow
saving conversations that include binary content.

* fix: register OpenAI image content type with gob encoder

The ChatCompletionContentPartImageParam type is used by OpenAI models for
handling image inputs. It needs to be registered with the gob encoder to
allow saving conversations that include image content.
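
The Gemini registration appears in the diff below; the OpenAI one does not. A sketch of what it plausibly looks like (assuming an init function in the OpenAI executor):

```go
func init() {
	// Register the image content part type so conversations that
	// include image inputs can be round-tripped through encoding/gob.
	gob.Register(openai.ChatCompletionContentPartImageParam{})
}
```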

---------

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
github-actions[bot] and github-actions[bot] authored Feb 10, 2025
1 parent 9c83c83 commit dd1ffa8
Showing 11 changed files with 392 additions and 52 deletions.
67 changes: 59 additions & 8 deletions internal/agent/anthropic.go
@@ -3,13 +3,16 @@ package agent
import (
"context"
_ "embed"
"encoding/base64"
"encoding/gob"
"encoding/json"
"fmt"
a "github.com/anthropics/anthropic-sdk-go"
"github.com/anthropics/anthropic-sdk-go/option"
"github.com/gabriel-vasile/mimetype"
gitignore "github.com/sabhiram/go-gitignore"
"io"
"os"
"strings"
"time"
)
@@ -126,19 +129,67 @@ func NewAnthropicExecutor(baseUrl string, apiKey string, logger Logger, ignorer
}
}

-func (s *anthropicExecutor) Execute(input string) error {
+func (s *anthropicExecutor) Execute(inputs []Input) error {
// Convert inputs into content blocks
var contentBlocks []a.BetaContentBlockParamUnion
for _, input := range inputs {
switch input.Type {
case InputTypeText:
if len(strings.TrimSpace(input.Text)) > 0 {
contentBlocks = append(contentBlocks, &a.BetaTextBlockParam{
Text: a.F(input.Text),
Type: a.F(a.BetaTextBlockParamTypeText),
})
}
case InputTypeImage:
// Read and base64 encode the image file
imgData, err := os.ReadFile(input.FilePath)
if err != nil {
return fmt.Errorf("failed to read image file %s: %w", input.FilePath, err)
}

// Detect mime type
mime := mimetype.Detect(imgData)
if !strings.HasPrefix(mime.String(), "image/") {
return fmt.Errorf("file %s is not an image", input.FilePath)
}

// Base64 encode the image data
encodedData := base64.StdEncoding.EncodeToString(imgData)

// Create image block
contentBlocks = append(contentBlocks, &a.BetaImageBlockParam{
Type: a.F(a.BetaImageBlockParamTypeImage),
Source: a.F(a.BetaImageBlockParamSource{
Type: a.F(a.BetaImageBlockParamSourceTypeBase64),
MediaType: a.F(a.BetaImageBlockParamSourceMediaType(mime.String())),
Data: a.F(encodedData),
}),
})
case InputTypeVideo:
return fmt.Errorf("video input is not supported by Claude models")
case InputTypeAudio:
return fmt.Errorf("audio input is not supported by Claude models")
default:
return fmt.Errorf("unknown input type: %s", input.Type)
}
}

if !s.params.Messages.Present {
s.params.Messages = a.F([]a.BetaMessageParam{})
}

// If we have no content blocks, fall back to a default prompt
if len(contentBlocks) == 0 {
contentBlocks = append(contentBlocks, &a.BetaTextBlockParam{
Text: a.F("Please analyze these files."),
Type: a.F(a.BetaTextBlockParamTypeText),
})
}

s.params.Messages = a.F(append(s.params.Messages.Value, a.BetaMessageParam{
-Content: a.F([]a.BetaContentBlockParamUnion{
-&a.BetaTextBlockParam{
-Text: a.F(input),
-Type: a.F(a.BetaTextBlockParamTypeText),
-},
-}),
-Role: a.F(a.BetaMessageParamRoleUser),
+Content: a.F(contentBlocks),
+Role: a.F(a.BetaMessageParamRoleUser),
}))

for {
11 changes: 10 additions & 1 deletion internal/agent/deepseek.go
@@ -108,7 +108,16 @@ func NewDeepSeekExecutor(baseUrl string, apiKey string, logger Logger, ignorer *
}
}

-func (o *deepseekExecutor) Execute(input string) error {
+func (o *deepseekExecutor) Execute(inputs []Input) error {
// Only text input is supported
var textInputs []string
for _, input := range inputs {
if input.Type != InputTypeText {
return fmt.Errorf("input type %s is not supported by DeepSeek models, only text input is supported", input.Type)
}
textInputs = append(textInputs, input.Text)
}
input := strings.Join(textInputs, "\n")
o.logger.Println("Note that the current V3 model is not yet perfected, it seems like the instruction following and tool calling performance is not yet tuned.")
o.logger.Println("Recommend using this model for one-off tasks like generating git commit messages or bash commands.")

20 changes: 18 additions & 2 deletions internal/agent/executor.go
@@ -23,8 +23,24 @@ var agentInstructions
var reasoningAgentInstructions string

type InputType string

const (
InputTypeText InputType = "text"
InputTypeImage InputType = "image"
InputTypeVideo InputType = "video"
InputTypeAudio InputType = "audio"
)

// Input represents a single input to be processed by the model
type Input struct {
Type InputType
Text string // Used when Type is InputTypeText
FilePath string // Used when Type is InputTypeImage, InputTypeVideo, or InputTypeAudio
}

// Executor defines the interface for executing agentic workflows
type Executor interface {
-Execute(input string) error
+Execute(inputs []Input) error
LoadMessages(r io.Reader) error
SaveMessages(w io.Writer) error
PrintMessages() string
@@ -159,7 +175,7 @@ func InitExecutor(logger Logger, flags ModelOptions) (Executor, error) {
executor: executor,
convoManager: convoManager,
model: genConfig.Model,
-userMessage: flags.Input,
+userMessage: "", // Will be set during Execute
continueID: continueId,
}, nil
}
74 changes: 71 additions & 3 deletions internal/agent/gemini.go
@@ -7,11 +7,13 @@ import (
"encoding/json"
"errors"
"fmt"
"github.com/gabriel-vasile/mimetype"
"github.com/google/generative-ai-go/genai"
gitignore "github.com/sabhiram/go-gitignore"
"google.golang.org/api/googleapi"
"google.golang.org/api/option"
"io"
"os"
"strings"
"time"
)
@@ -25,6 +27,7 @@ func init() {
gob.Register(genai.FunctionCall{})
gob.Register(genai.FunctionResponse{})
gob.Register(map[string]interface{}{})
gob.Register(genai.Blob{}) // needed so saved conversations can include binary (image/audio) content
}

type geminiExecutor struct {
@@ -186,23 +189,88 @@ func NewGeminiExecutor(baseUrl string, apiKey string, logger Logger, ignorer *gi
}, nil
}

-func (g *geminiExecutor) Execute(input string) error {
+func (g *geminiExecutor) Execute(inputs []Input) error {
if g.session == nil {
g.session = g.model.StartChat()
}

ctx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
defer cancel()

-// Send initial user message with retries
+// Convert inputs into parts
var parts []genai.Part
for _, input := range inputs {
switch input.Type {
case InputTypeText:
parts = append(parts, genai.Text(input.Text))
case InputTypeImage:
// Read image file
imgData, err := os.ReadFile(input.FilePath)
if err != nil {
return fmt.Errorf("failed to read image file %s: %w", input.FilePath, err)
}

// Detect mime type
mime := mimetype.Detect(imgData)
if !strings.HasPrefix(mime.String(), "image/") {
return fmt.Errorf("file %s is not an image", input.FilePath)
}

// Verify supported image type
switch mime.String() {
case "image/png", "image/jpeg", "image/webp", "image/heic", "image/heif":
// These are supported
default:
return fmt.Errorf("unsupported image type %s for file %s. Supported types: PNG, JPEG, WEBP, HEIC, HEIF", mime.String(), input.FilePath)
}

// Get format without the "image/" prefix
format := strings.TrimPrefix(mime.String(), "image/")

// Create image part
parts = append(parts, genai.ImageData(format, imgData))
case InputTypeAudio:
// Read audio file
audioData, err := os.ReadFile(input.FilePath)
if err != nil {
return fmt.Errorf("failed to read audio file %s: %w", input.FilePath, err)
}

// Detect mime type
mime := mimetype.Detect(audioData)
if !strings.HasPrefix(mime.String(), "audio/") {
return fmt.Errorf("file %s is not an audio file", input.FilePath)
}

// Verify supported audio type
switch mime.String() {
case "audio/wav", "audio/mp3", "audio/aiff", "audio/aac", "audio/ogg", "audio/flac":
// These are supported
default:
return fmt.Errorf("unsupported audio type %s for file %s. Supported types: WAV, MP3, AIFF, AAC, OGG, FLAC", mime.String(), input.FilePath)
}

// Create audio part
parts = append(parts, genai.Blob{
MIMEType: mime.String(),
Data: audioData,
})
case InputTypeVideo:
return fmt.Errorf("video input is not yet supported by this implementation")
default:
return fmt.Errorf("unknown input type: %s", input.Type)
}
}

// Send initial message with retries
var resp *genai.GenerateContentResponse
var err error
maxRetries := 5
retryCount := 0
retryWait := 1 * time.Minute

for retryCount <= maxRetries {
-resp, err = g.session.SendMessage(ctx, genai.Text(input))
+resp, err = g.session.SendMessage(ctx, parts...)
if err == nil {
break
}
27 changes: 23 additions & 4 deletions internal/agent/models.go
@@ -34,9 +34,10 @@ type ModelDefaults struct {
}

type ModelConfig struct {
-Name string
-IsKnown bool
-Defaults ModelDefaults
+Name            string
+IsKnown         bool
+Defaults        ModelDefaults
+SupportedInputs []InputType
}

type ProviderConfig interface {
@@ -71,78 +72,97 @@ var ModelConfigs = map[string]ModelConfig{
"o3-mini": {
Name: openai.ChatModelO3Mini, IsKnown: true,
Defaults: ModelDefaults{MaxTokens: 100000, Temperature: 1},
SupportedInputs: []InputType{InputTypeText},
},
"deepseek-chat": {
Name: "deepseek-chat", IsKnown: true,
Defaults: ModelDefaults{MaxTokens: 8192, Temperature: 0.3},
SupportedInputs: []InputType{InputTypeText},
},
"deepseek-r1": {
Name: "deepseek-reasoner", IsKnown: true,
Defaults: ModelDefaults{MaxTokens: 8192, Temperature: 1},
SupportedInputs: []InputType{InputTypeText},
},
"claude-3-opus": {
Name: anthropic.ModelClaude_3_Opus_20240229, IsKnown: true,
Defaults: ModelDefaults{MaxTokens: 4096, Temperature: 0.3},
SupportedInputs: []InputType{InputTypeText, InputTypeImage},
},
"claude-3-5-sonnet": {
Name: anthropic.ModelClaude3_5Sonnet20241022, IsKnown: true,
Defaults: ModelDefaults{MaxTokens: 8192, Temperature: 0.3},
SupportedInputs: []InputType{InputTypeText, InputTypeImage},
},
"claude-3-5-haiku": {
Name: anthropic.ModelClaude3_5Haiku20241022, IsKnown: true,
Defaults: ModelDefaults{MaxTokens: 8192, Temperature: 0.3},
SupportedInputs: []InputType{InputTypeText, InputTypeImage},
},
"claude-3-haiku": {
Name: anthropic.ModelClaude_3_Haiku_20240307, IsKnown: true,
Defaults: ModelDefaults{MaxTokens: 4096, Temperature: 0.3},
SupportedInputs: []InputType{InputTypeText, InputTypeImage},
},
"gemini-1-5-flash-8b": {
Name: "gemini-1.5-flash-8b", IsKnown: true,
Defaults: ModelDefaults{MaxTokens: 8192, Temperature: 0.3},
SupportedInputs: []InputType{InputTypeText, InputTypeImage, InputTypeVideo, InputTypeAudio},
},
"gemini-1-5-flash": {
Name: "gemini-1.5-flash-002", IsKnown: true,
Defaults: ModelDefaults{MaxTokens: 8192, Temperature: 0.3},
SupportedInputs: []InputType{InputTypeText, InputTypeImage, InputTypeVideo, InputTypeAudio},
},
"gemini-2-flash-exp": {
Name: "gemini-2.0-flash-exp", IsKnown: true,
Defaults: ModelDefaults{MaxTokens: 8192, Temperature: 0.3},
SupportedInputs: []InputType{InputTypeText, InputTypeImage, InputTypeVideo, InputTypeAudio},
},
"gemini-2-flash": {
Name: "gemini-2.0-flash", IsKnown: true,
Defaults: ModelDefaults{MaxTokens: 8192, Temperature: 0.3},
SupportedInputs: []InputType{InputTypeText, InputTypeImage, InputTypeVideo, InputTypeAudio},
},
"gemini-2-flash-lite-preview": {
Name: "gemini-2.0-flash-lite-preview-02-05", IsKnown: true,
Defaults: ModelDefaults{MaxTokens: 8192, Temperature: 0.3},
SupportedInputs: []InputType{InputTypeText, InputTypeImage, InputTypeVideo, InputTypeAudio},
},
"gemini-2-pro-exp": {
Name: "gemini-2.0-pro-exp-02-05", IsKnown: true,
Defaults: ModelDefaults{MaxTokens: 8192, Temperature: 0.3},
SupportedInputs: []InputType{InputTypeText, InputTypeImage, InputTypeVideo, InputTypeAudio},
},
"gemini-1-5-pro": {
Name: "gemini-1.5-pro-002", IsKnown: true,
Defaults: ModelDefaults{MaxTokens: 8192, Temperature: 0.3},
SupportedInputs: []InputType{InputTypeText, InputTypeImage, InputTypeVideo, InputTypeAudio},
},
"gpt-4o": {
Name: openai.ChatModelGPT4o2024_11_20, IsKnown: true,
Defaults: ModelDefaults{MaxTokens: 8192, Temperature: 0.3},
SupportedInputs: []InputType{InputTypeText, InputTypeImage},
},
"gpt-4o-mini": {
Name: openai.ChatModelGPT4oMini2024_07_18, IsKnown: true,
Defaults: ModelDefaults{MaxTokens: 8192, Temperature: 0.3},
SupportedInputs: []InputType{InputTypeText, InputTypeImage},
},
"o1": {
Name: openai.ChatModelO1_2024_12_17, IsKnown: true,
Defaults: ModelDefaults{MaxTokens: 100000, Temperature: 1},
SupportedInputs: []InputType{InputTypeText},
},
"o1-mini": {
Name: openai.ChatModelO1Mini2024_09_12, IsKnown: true,
Defaults: ModelDefaults{MaxTokens: 65536, Temperature: 1},
SupportedInputs: []InputType{InputTypeText},
},
"o1-preview": {
Name: openai.ChatModelO1Preview2024_09_12, IsKnown: true,
Defaults: ModelDefaults{MaxTokens: 100000, Temperature: 1},
SupportedInputs: []InputType{InputTypeText},
},
}

@@ -158,7 +178,6 @@ type ModelOptions struct {
FrequencyPenalty float64
PresencePenalty float64
NumberOfResponses int
-Input string
Version bool
Continue string // conversation ID to continue from
ListConversations bool // List all conversations