Agent2Bench is a benchmark that tests LLMs' abilities in daily-life computer tasks such as booking flights, downloading programs, or exiting vim.

DKeAlvaro/Agent2Bench

Agent2Bench is a proposed benchmark that tests LLMs' abilities in daily-life computer tasks such as booking flights, downloading programs, or exiting vim. A demo is available at https://dkealvaro.github.io/Agent2Bench/. Note that the demo does not include real results; it is only a proof of concept.

Motivation

LLMs have gained a lot of attention in recent years, and people are claiming that AGI is just around the corner. Yet LLMs have not been tested on simple, daily-life computer tasks. I have a vision of a future where LLMs can use a computer's full capabilities, just as a human would:

  • Given a natural-language instruction describing a task to complete with a computer (e.g. "Book a flight to Tokyo")
  • Come up with and execute a plan to complete the task without needing task-specific APIs or similar tools, simply by inspecting the computer's screen and using the keyboard and mouse (e.g. "Open a browser, go to Google Flights, search for flights to Tokyo, book the cheapest flight")

Currently, no benchmark that measures this capability exists. Agent2Bench is my attempt to fill this gap.

Such a benchmark should be as open as possible. It will therefore be open source and allow anyone to submit tasks to be benchmarked and to view the results.

Tasks should be easily verifiable. For example, if the task is to solve the Wordle, the LLM should return the Wordle solution; if the task is to book a flight, the LLM should return the flight details; and so on.
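
As a rough illustration of what "easily verifiable" could look like in practice, here is a minimal Python sketch. It is not part of Agent2Bench: the `Task` class, the `exact_match` and `score` helpers, and the expected answers are all hypothetical, and it assumes each task ends with the LLM returning a single string that can be checked automatically.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Task:
    """Hypothetical task record: an instruction plus a check on the returned answer."""
    instruction: str
    verify: Callable[[str], bool]


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences don't fail a task."""
    return " ".join(text.lower().split())


def exact_match(expected: str) -> Callable[[str], bool]:
    """Build a verifier that accepts answers equal to `expected` after normalization."""
    target = normalize(expected)
    return lambda answer: normalize(answer) == target


def score(tasks: list[Task], answers: list[str]) -> float:
    """Return the fraction of tasks whose answer passes its verifier."""
    if len(tasks) != len(answers):
        raise ValueError("need exactly one answer per task")
    if not tasks:
        return 0.0
    passed = sum(task.verify(answer) for task, answer in zip(tasks, answers))
    return passed / len(tasks)


# Hypothetical tasks; the expected values are made up for illustration only.
tasks = [
    Task("Solve today's Wordle and return the solution.", exact_match("crane")),
    Task("Book a flight to Tokyo and return the flight number.", exact_match("NH 105")),
]
```

Calling `score(tasks, answers)` with one returned string per task gives the share of tasks passed. Exact string matching is only the simplest possible check; tasks with richer outputs, such as full flight details, would likely need a more structured verifier.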

Collaboration

I'm looking for collaborators to help me build this benchmark. If you're interested in collaborating, please contact me.
