GooseBench
We’re working on an evaluation benchmark for Goose that we’re calling goosebench. We’d love your feedback!
What are we doing?
We’re building out a minimal framework and a set of evaluations to include in a lightweight benchmark. The purpose of this benchmark is to quickly verify Goose’s practical functionality and performance on a small set of standardized use cases. Unlike industry benchmarks such as SWE-Bench or Tau Bench, this benchmark is intentionally less statistically rigorous: its initial goal is to test a user’s setup/configuration and answer practical questions such as:

- Does function calling work, and how many calls can we make?
- Is goose able to send images and take screenshots?
- What provider limits is the user dealing with (e.g. rate limits)?
- Do all the supported extensions work?
- Is the agent using write/replace where it should?

A sketch of how one of these checks could be expressed as code is shown below.
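For illustration only, here is a minimal sketch of how a single eval could be defined and run, assuming a simple trait-plus-runner design in Rust. Every name in it (Evaluation, EvalOutcome, ToolCallEval, run_all) is hypothetical and is not taken from the actual goosebench code; it only shows the general shape of a standardized check.

```rust
/// Result of running one evaluation case (hypothetical type, not goosebench's API).
struct EvalOutcome {
    name: String,
    passed: bool,
    detail: String,
}

/// One standardized check, e.g. "does the agent make a tool call when asked?".
trait Evaluation {
    fn name(&self) -> &str;
    fn run(&self) -> EvalOutcome;
}

/// Hypothetical eval: verify that a tool call was produced for a simple prompt.
struct ToolCallEval;

impl Evaluation for ToolCallEval {
    fn name(&self) -> &str {
        "tool_call_basic"
    }

    fn run(&self) -> EvalOutcome {
        // A real eval would drive the agent against a provider and inspect the
        // transcript; here the check is stubbed out.
        let tool_call_seen = true; // placeholder for an actual transcript check
        EvalOutcome {
            name: self.name().to_string(),
            passed: tool_call_seen,
            detail: "expected at least one tool call in the response".to_string(),
        }
    }
}

/// Run every registered eval and print a one-line summary per case.
fn run_all(evals: &[Box<dyn Evaluation>]) {
    for eval in evals {
        let outcome = eval.run();
        let status = if outcome.passed { "PASS" } else { "FAIL" };
        println!("{} {} - {}", status, outcome.name, outcome.detail);
    }
}

fn main() {
    let evals: Vec<Box<dyn Evaluation>> = vec![Box::new(ToolCallEval)];
    run_all(&evals);
}
```

Keeping each eval behind a small trait like this would let new checks (screenshots, rate limits, extension coverage) be added without touching the runner, but again, this is only a sketch of the pattern, not the benchmark's actual interface.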
Current work and progress:
We’ve kicked off this work by creating a framework for defining and running the evals, along with an initial set of evals that test tool call usage with extensions and compare the search/replace/write functionality in the text_edit tool. Check out these PRs for the foundational work and an early set of evals:
We’re continuing to add evals and functionality to the benchmark. Check back here for updates!
Looking for community feedback/contributions:
We’re continuing to grow this benchmark, but we’d love feedback on its usefulness and welcome contributions for any evaluations you’d like to see included here. These could come from existing workflows you use to test new releases of goose, or from issues you’ve run into with goose that should be covered here.