Agent2Bench is a proposed benchmark that tests LLMs' abilities on daily-life computer tasks like booking flights, downloading programs, or exiting vim. A demo can be found at https://dkealvaro.github.io/Agent2Bench/. Note that the demo doesn't include real results; it's just a proof of concept.
LLMs have been gaining a lot of attention in recent years. People are claiming that AGI is just around the corner. Yet LLMs have not been tested on simple, daily-life computer tasks. I have a vision of a future where LLMs are able to use a computer's full capabilities, just as a human would:
- Given a natural-language instruction for a task to complete with a computer (e.g. "Book a flight to Tokyo")
- Come up with and execute a plan to complete the task without needing specific APIs or similar tools, simply by inspecting the computer's screen and using the keyboard and mouse (e.g. "Open a browser, go to Google Flights, search for flights to Tokyo, book the cheapest flight")
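The loop described above could be sketched roughly as follows. This is only an illustration of the idea, not an implementation: the class and method names (`Action`, `Agent`, `next_action`) and the toy policy are all assumptions of mine, and a real agent would send the screenshot and instruction to an LLM and parse its reply into an action.

```python
from dataclasses import dataclass, field

@dataclass
class Action:
    """A single keyboard/mouse action the agent wants to perform."""
    kind: str          # e.g. "click", "type", "key"
    payload: str = ""  # target description, text to type, or key name

@dataclass
class Agent:
    """Loop: look at the screen, decide the next action, execute it, repeat."""
    instruction: str
    history: list = field(default_factory=list)

    def next_action(self, screen_text: str) -> Action:
        # Placeholder policy standing in for an LLM call: in the real
        # benchmark, the model would see the actual screen contents.
        if "browser" not in screen_text:
            return Action(kind="click", payload="browser icon")
        return Action(kind="type", payload=self.instruction)

    def step(self, screen_text: str) -> Action:
        action = self.next_action(screen_text)
        self.history.append(action)
        return action

agent = Agent(instruction="Book a flight to Tokyo")
first = agent.step("desktop, browser window open")
print(first.kind)  # the browser is visible, so the toy policy types
```

The point of the sketch is the interface: the agent only ever sees the screen and only ever emits keyboard/mouse actions, with no task-specific APIs.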
Currently, no benchmark that measures this capability exists. Agent2Bench is my attempt to fill this gap.
Such a benchmark should be as open as possible, so it will be open source and will let anyone submit tasks to be benchmarked and view the results.
Tasks should be easily verifiable. For example, if the task is to solve the Wordle, the LLM should return the Wordle solution; if the task is to book a flight, it should return the flight details; and so on.
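One way this verifiability requirement could look in practice is a task record paired with a simple pass/fail check. The field names and the exact-match comparison below are assumptions for illustration, not a final spec; the Wordle answer shown is made up.

```python
from dataclasses import dataclass

@dataclass
class Task:
    task_id: str
    instruction: str    # natural-language prompt given to the agent
    answer_format: str  # what the agent must return, e.g. "the 5-letter word"
    expected: str       # ground-truth answer used for scoring

def verify(task: Task, agent_answer: str) -> bool:
    """Pass/fail check: normalize whitespace and case, compare to ground truth."""
    return agent_answer.strip().lower() == task.expected.strip().lower()

# Hypothetical example task (the expected word is invented for the demo).
wordle = Task(
    task_id="wordle-0001",
    instruction="Solve today's Wordle and return the solution.",
    answer_format="the 5-letter word",
    expected="CRANE",
)

print(verify(wordle, "crane"))  # normalization makes case irrelevant
print(verify(wordle, "slate"))
```

Keeping the check this simple is deliberate: if every task boils down to "return an answer string we can compare", anyone can contribute tasks without writing custom grading code.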
I'm looking for collaborators to help me build this benchmark. If you're interested, please contact me.