Agent2Bench is a proposed benchmark that tests LLMs' abilities on daily-life computer tasks like booking flights, downloading programs, or exiting vim. A demo can be found at https://dkealvaro.github.io/Agent2Bench/. Note that the demo doesn't include real results; it's just a proof of concept.
LLMs have been gaining a lot of attention in recent years. People are claiming that AGI is just around the corner. Yet LLMs have not been tested on simple, daily-life computer tasks. I have a vision of a future where LLMs are able to use a computer's full capabilities, just as a human would:
- Given a natural-language instruction for a task to complete with a computer (e.g. "Book a flight to Tokyo")
- Come up with and execute a plan to complete the task without needing specific APIs or similar tools, simply by inspecting the computer's screen and using the keyboard and mouse (e.g. "Open a browser, go to Google Flights, search for flights to Tokyo, book the cheapest flight")
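The loop described above could be sketched roughly as follows. This is only an illustration of the idea, not an implementation: the class and method names (`Action`, `Agent`, `next_action`) and the toy policy are all assumptions of mine, and a real agent would send the screenshot and instruction to an LLM and parse its reply into an action.

```python
from dataclasses import dataclass, field

@dataclass
class Action:
    """A single keyboard/mouse action the agent wants to perform."""
    kind: str          # e.g. "click", "type", "key"
    payload: str = ""  # target description, text to type, or key name

@dataclass
class Agent:
    """Loop: look at the screen, decide the next action, execute it, repeat."""
    instruction: str
    history: list = field(default_factory=list)

    def next_action(self, screen_text: str) -> Action:
        # Placeholder policy standing in for an LLM call: in the real
        # benchmark, the model would see the actual screen contents.
        if "browser" not in screen_text:
            return Action(kind="click", payload="browser icon")
        return Action(kind="type", payload=self.instruction)

    def step(self, screen_text: str) -> Action:
        action = self.next_action(screen_text)
        self.history.append(action)
        return action

agent = Agent(instruction="Book a flight to Tokyo")
first = agent.step("desktop, browser window open")
print(first.kind)  # the browser is visible, so the toy policy types
```

The point of the sketch is the interface: the agent only ever sees the screen and only ever emits keyboard/mouse actions, with no task-specific APIs.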
Currently, no benchmark that measures this capability exists. Agent2Bench is my attempt to fill this gap.
Such a benchmark should be as open as possible, so it will be open source and will let anyone submit tasks to be benchmarked and view the results.
Tasks should be easily verifiable. For example, if the task is to solve the Wordle, the LLM should return the Wordle solution; if the task is to book a flight, it should return the flight details; and so on.
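One way this verifiability requirement could look in practice is a task record paired with a simple pass/fail check. The field names and the exact-match comparison below are assumptions for illustration, not a final spec; the Wordle answer shown is made up.

```python
from dataclasses import dataclass

@dataclass
class Task:
    task_id: str
    instruction: str    # natural-language prompt given to the agent
    answer_format: str  # what the agent must return, e.g. "the 5-letter word"
    expected: str       # ground-truth answer used for scoring

def verify(task: Task, agent_answer: str) -> bool:
    """Pass/fail check: normalize whitespace and case, compare to ground truth."""
    return agent_answer.strip().lower() == task.expected.strip().lower()

# Hypothetical example task (the expected word is invented for the demo).
wordle = Task(
    task_id="wordle-0001",
    instruction="Solve today's Wordle and return the solution.",
    answer_format="the 5-letter word",
    expected="CRANE",
)

print(verify(wordle, "crane"))  # normalization makes case irrelevant
print(verify(wordle, "slate"))
```

Keeping the check this simple is deliberate: if every task boils down to "return an answer string we can compare", anyone can contribute tasks without writing custom grading code.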
I'm looking for collaborators to help me build this benchmark. If you're interested, please contact me.