GooseBench
We’re working on an evaluation benchmark for Goose that we’re calling goosebench. We’d love your feedback!
What are we doing?
We’re building out a minimal framework and a set of evaluations to include in a lightweight benchmark. The purpose of this benchmark is to quickly verify Goose’s practical functionality and performance on a small set of standardized use cases. Unlike industry benchmarks such as SWE-Bench or Tau Bench, this benchmark is intentionally less statistically rigorous: its initial goal is to test a user’s setup/configuration and answer practical questions such as:

- Does function calling work, and how many calls can we make?
- Is goose able to send images and take screenshots?
- What provider limits is the user dealing with (e.g. rate limits)?
- Do all the supported extensions work?
- Is the agent using write/replace where it should?

A sketch of how one of these checks could be expressed as code is shown below.
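For illustration only, here is a minimal sketch of how a single eval could be defined and run, assuming a simple trait-plus-runner design in Rust. Every name in it (Evaluation, EvalOutcome, ToolCallEval, run_all) is hypothetical and is not taken from the actual goosebench code; it only shows the general shape of a standardized check.

```rust
/// Result of running one evaluation case (hypothetical type, not goosebench's API).
struct EvalOutcome {
    name: String,
    passed: bool,
    detail: String,
}

/// One standardized check, e.g. "does the agent make a tool call when asked?".
trait Evaluation {
    fn name(&self) -> &str;
    fn run(&self) -> EvalOutcome;
}

/// Hypothetical eval: verify that a tool call was produced for a simple prompt.
struct ToolCallEval;

impl Evaluation for ToolCallEval {
    fn name(&self) -> &str {
        "tool_call_basic"
    }

    fn run(&self) -> EvalOutcome {
        // A real eval would drive the agent against a provider and inspect the
        // transcript; here the check is stubbed out.
        let tool_call_seen = true; // placeholder for an actual transcript check
        EvalOutcome {
            name: self.name().to_string(),
            passed: tool_call_seen,
            detail: "expected at least one tool call in the response".to_string(),
        }
    }
}

/// Run every registered eval and print a one-line summary per case.
fn run_all(evals: &[Box<dyn Evaluation>]) {
    for eval in evals {
        let outcome = eval.run();
        let status = if outcome.passed { "PASS" } else { "FAIL" };
        println!("{} {} - {}", status, outcome.name, outcome.detail);
    }
}

fn main() {
    let evals: Vec<Box<dyn Evaluation>> = vec![Box::new(ToolCallEval)];
    run_all(&evals);
}
```

Keeping each eval behind a small trait like this would let new checks (screenshots, rate limits, extension coverage) be added without touching the runner, but again, this is only a sketch of the pattern, not the benchmark's actual interface.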
Current work and progress:
We’ve kicked off this work by creating a framework for defining and running the evals, along with an initial set of evals that test tool call usage with extensions and compare the search/replace/write functionality in the text_edit tool. Check out these PRs for the foundational work and an early set of evals:
We’re continuing to add evals and functionality to the benchmark. Check back here for updates!
Looking for community feedback/contributions:
We’re continuing to grow this benchmark, but we’d love feedback on its usefulness and welcome contributions for any evaluations you’d like to see included here. These could come from existing workflows you use to test new releases of goose, or from issues you’ve run into with goose that should be covered here.