<!DOCTYPE html>
<html lang="en">
<head>
<!-- Google tag (gtag.js) -->
<script async src="https://www.googletagmanager.com/gtag/js?id=G-Q6XW00J8ZY"></script>
<script>
window.dataLayer = window.dataLayer || [];
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
gtag('config', 'G-Q6XW00J8ZY');
</script>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Agent2Bench - Testing LLM Real-World Capabilities</title>
<link rel="icon" type="image/x-icon" href="favicon.ico">
<link rel="stylesheet" href="css/styles.css">
<link href="https://fonts.googleapis.com/css2?family=Inter:wght@400;500;600;700;800&display=swap" rel="stylesheet">
</head>
<body>
<main class="landing-page">
<section class="landing-hero">
<div class="floating-nav">
<a href="pages/results.html">Results</a>
<a href="pages/tasks.html">Tasks</a>
<a href="pages/submit.html">Submit Task</a>
</div>
<h1>Can AI Handle Real Human Tasks?</h1>
<p class="hero-subtitle">The first benchmark testing LLMs on everyday tasks like solving Wordle or booking flights</p>
<div class="hero-stats">
<div class="stat-item">
<span class="stat-number">100+</span>
<span class="stat-label">Real Tasks</span>
</div>
<div class="stat-item">
<span class="stat-number">15+</span>
<span class="stat-label">LLMs</span>
</div>
<div class="stat-item">
<span class="stat-number">1000+</span>
<span class="stat-label">Test Runs</span>
</div>
</div>
<div class="hero-cta">
<a href="pages/results.html" class="cta-button">View Results</a>
<a href="pages/submit.html" class="cta-button secondary">Submit Task</a>
</div>
</section>
<section class="features-section">
<div class="feature-item">
<h3>Real-World Tasks</h3>
<p>From solving Wordle to booking flights, we test AI on tasks humans do daily</p>
</div>
<div class="feature-item">
<h3>Verifiable Results</h3>
<p>Clear success criteria and automated verification for each task</p>
</div>
<div class="feature-item">
<h3>Community Driven</h3>
<p>Submit your own tasks and help expand the benchmark</p>
</div>
</section>
<section class="insights-section">
<h2>Latest Insights</h2>
<div class="insights-grid">
<div class="insight-item">
<span class="insight-title">Best Overall</span>
<span class="insight-value">Claude 3.5 Sonnet</span>
<span class="insight-detail">85% Success Rate</span>
</div>
<div class="insight-item">
<span class="insight-title">Most Cost-Effective</span>
<span class="insight-value">DeepSeek v3</span>
<span class="insight-detail">$0.08 per task</span>
</div>
<div class="insight-item">
<span class="insight-title">Hardest Task</span>
<span class="insight-value">Flight Booking</span>
<span class="insight-detail">65% Success Rate</span>
</div>
</div>
</section>
<footer>
<div class="footer-content">
<p>&copy; 2025 Agent2Bench.</p>
</div>
</footer>
</main>
<script src="js/landing.js"></script>
<script src="js/cursor-effect.js"></script>
</body>
</html>