Upgrading benchmark features #6

Merged (6 commits) on Dec 21, 2024
120 changes: 113 additions & 7 deletions README.ipynb
@@ -2,19 +2,19 @@
"cells": [
{
"cell_type": "code",
"execution_count": 5,
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[NbConvertApp] Converting notebook README.ipynb to markdown\n",
"[NbConvertApp] Writing 29531 bytes to README.md\n",
"[NbConvertApp] Writing 32369 bytes to README.md\n",
"┌──────────┬────────────┬───────────┐\n",
"│ \u001b[1mlast_day\u001b[0m │ \u001b[1mlast_month\u001b[0m │ \u001b[1mlast_week\u001b[0m │\n",
"├──────────┼────────────┼───────────┤\n",
"│ 190778369 │\n",
"│ 134905472 │\n",
"└──────────┴────────────┴───────────┘\n",
"\n"
]
@@ -50,7 +50,8 @@
"\n",
"Here are the tools most likely to be useful to you:\n",
"- 🎯 **Forecasting Bot:** General Forecaster that integrates with the Metaculus AI benchmarking competition. You can forecast with a pre-existing bot or override the class to customize your own (without redoing all the API code, etc)\n",
"- 🔍 **Perplexity++ Smart Searcher:** An AI-powered internet-informed llm powered by Exa.ai. Its a better (but slightly more expensive) alternative to Perplexity.ai that is configurable, more accurate, able to decide on filters, able to link to exact paragraphs, etc.\n",
"- 📊 **Benchmarking:** Randomly sample quality questions from Metaculus and run your bot against them so you can get an early sense of how your bot is doing by comparing to the community prediction and expected log scores.\n",
"- 🔍 **Perplexity++ Smart Searcher:** A custom AI-powered, internet-informed LLM built on Exa.ai and GPT. It's a better (but slightly more expensive) alternative to Perplexity.ai that is configurable, more accurate, able to decide on filters, able to link to exact paragraphs, etc.\n",
"- 🔑 **Key Factor Analysis:** Key Factors Analysis for scoring, ranking, and prioritizing important variables in forecasting questions\n",
"\n",
"\n",
@@ -61,9 +62,9 @@
"- **Metaculus API Wrapper:** for interacting with questions and tournaments\n",
"- **Monetary Cost Manager:** for tracking AI and API expenses\n",
"\n",
"Join the [discord](https://discord.gg/Dtq4JNdXnw) for updates and to give feedback (btw feedback is very appreciated, even just a quick 'I did/didn't decide to use the tool for reason X' is helpful to know)\n",
"Join the [discord](https://discord.gg/Dtq4JNdXnw) for updates and to give feedback (btw feedback is very appreciated, even just a quick \"I did/didn't decide to use tool X for reason Y, though am busy and don't have time to elaborate\" is helpful to know)\n",
"\n",
"Note: This package is still in a experimental phase. The goal is to keep the API fairly stable, though no guarantees are given at this phase. There will be special effort to keep the ForecastBot and TemplateBot APIs consistent."
"Note: This package is still in an experimental phase. The goal is to keep the package API fairly stable, though no guarantees are given at this phase. There will be special effort to keep the ForecastBot and TemplateBot APIs consistent."
]
},
{
@@ -275,8 +276,113 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Forecasting Tools Examples\n",
"# Forecasting Tools Examples"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Benchmarking\n",
"Below is an example of how to run the benchmarker"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from forecasting_tools import Benchmarker, TemplateBot, BenchmarkForBot\n",
"\n",
"class CustomBot(TemplateBot):\n",
" ...\n",
"\n",
"# Run benchmark on multiple bots\n",
"bots = [TemplateBot(), CustomBot()] # Add your custom bots here\n",
"benchmarker = Benchmarker(\n",
" forecast_bots=bots,\n",
" number_of_questions_to_use=2, # Recommended 100+ for meaningful results\n",
" file_path_to_save_reports=\"benchmarks/\"\n",
")\n",
"benchmarks: list[BenchmarkForBot] = await benchmarker.run_benchmark()\n",
"\n",
"# View results\n",
"for benchmark in benchmarks:\n",
" print(f\"Bot: {benchmark.name}\")\n",
" print(f\"Score: {benchmark.average_inverse_expected_log_score}\") # Lower is better\n",
" print(f\"Num Forecasts: {len(benchmark.forecast_reports)}\")\n",
" print(f\"Time: {benchmark.time_taken_in_minutes}min\")\n",
" print(f\"Cost: ${benchmark.total_cost}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The ideal number of questions to get a good sense of whether one bot is better than another can vary. 100+ should tell you something decent. See [this analysis](https://forum.effectivealtruism.org/posts/DzqSh7akX28JEHf9H/comparing-two-forecasters-in-an-ideal-world) for an exploration of the numbers.\n",
"\n",
"As of Dec 20, 2024, the benchmarker automatically selects a random set of questions from Metaculus that:\n",
"- Are binary questions (yes/no)\n",
"- Are currently open\n",
"- Will resolve within 3 months\n",
"- Have at least 40 forecasters\n",
"- Have a community prediction\n",
"- Are not notebook/group/conditional questions\n",
"\n",
"As of the last edit, there are plans to expand this to numeric and multiple choice questions, but right now it only benchmarks binary questions.\n",
"\n",
"You can grab these questions without using the Benchmarker by running the code below.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from forecasting_tools import MetaculusApi\n",
"\n",
"questions = MetaculusApi.get_benchmark_questions(\n",
" num_of_questions_to_return=100,\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can also save/load benchmarks to/from JSON."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from forecasting_tools import BenchmarkForBot\n",
"\n",
"# Load\n",
"file_path = \"benchmarks/benchmark.json\"\n",
"benchmarks: list[BenchmarkForBot] = BenchmarkForBot.load_json_from_file_path(file_path)\n",
"\n",
"# Save\n",
"new_benchmarks: list[BenchmarkForBot] = benchmarks\n",
"BenchmarkForBot.save_object_list_to_file_path(new_benchmarks, file_path)\n",
"\n",
"# To/From Json String\n",
"single_benchmark = benchmarks[0]\n",
"json_object: dict = single_benchmark.to_json()\n",
"new_benchmark: BenchmarkForBot = BenchmarkForBot.from_json(json_object)\n",
"\n",
"# Note: Make sure to set the 'FILE_WRITING_ALLOWED' environment variable to true if you want to save the benchmarks to a file\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Smart Searcher\n",
"The Smart Searcher acts like an LLM with internet access. It works a lot like Perplexity.ai API, except:\n",
"- It has clickable citations that highlights and links directly to the paragraph cited using text fragments\n",
88 changes: 79 additions & 9 deletions README.md
@@ -1,9 +1,3 @@
```python
# To keep the notebook and readme in sync, run this, then delete this code from the top
!jupyter nbconvert --to markdown README.ipynb --output README.md
!pypistats recent forecasting-tools
```

![PyPI version](https://badge.fury.io/py/forecasting-tools.svg)
![Python Versions](https://img.shields.io/pypi/pyversions/forecasting-tools.svg)
![License](https://img.shields.io/badge/License-MIT-blue.svg)
@@ -22,7 +16,8 @@ This repository contains forecasting and research tools built with Python and St

Here are the tools most likely to be useful to you:
- 🎯 **Forecasting Bot:** General Forecaster that integrates with the Metaculus AI benchmarking competition. You can forecast with a pre-existing bot or override the class to customize your own (without redoing all the API code, etc)
- 🔍 **Perplexity++ Smart Searcher:** An AI-powered internet-informed llm powered by Exa.ai. Its a better (but more expensive) alternative to Perplexity.ai that is configurable, more accurate, able to decide on filters, able to link to exact paragraphs, etc.
- 📊 **Benchmarking:** Randomly sample quality questions from Metaculus and run your bot against them so you can get an early sense of how your bot is doing by comparing to the community prediction and expected log scores.
- 🔍 **Perplexity++ Smart Searcher:** A custom AI-powered, internet-informed LLM built on Exa.ai and GPT. It's a better (but slightly more expensive) alternative to Perplexity.ai that is configurable, more accurate, able to decide on filters, able to link to exact paragraphs, etc.
- 🔑 **Key Factor Analysis:** Key Factors Analysis for scoring, ranking, and prioritizing important variables in forecasting questions


@@ -33,9 +28,9 @@ Here are some other cool components and features of the project:
- **Metaculus API Wrapper:** for interacting with questions and tournaments
- **Monetary Cost Manager:** for tracking AI and API expenses

Join the [discord](https://discord.gg/Dtq4JNdXnw) for updates and to give feedback (btw feedback is very appreciated, even just a quick 'I did/didn't decide to use the tool for reason X' is helpful to know)
Join the [discord](https://discord.gg/Dtq4JNdXnw) for updates and to give feedback (btw feedback is very appreciated, even just a quick "I did/didn't decide to use tool X for reason Y, though am busy and don't have time to elaborate" is helpful to know)

Note: This package is still in a experimental phase. The goal is to keep the API fairly stable, though no guarantees are given at this phase. There will be special effort to keep the ForecastBot and TemplateBot APIs consistent.
Note: This package is still in an experimental phase. The goal is to keep the package API fairly stable, though no guarantees are given at this phase. There will be special effort to keep the ForecastBot and TemplateBot APIs consistent.


# Forecasting Bot Building
@@ -207,6 +202,81 @@ Whether running locally or through Github actions, you will need to set environm

# Forecasting Tools Examples

## Benchmarking
Below is an example of how to run the benchmarker


```python
from forecasting_tools import Benchmarker, TemplateBot, BenchmarkForBot

class CustomBot(TemplateBot):
...

# Run benchmark on multiple bots
bots = [TemplateBot(), CustomBot()] # Add your custom bots here
benchmarker = Benchmarker(
forecast_bots=bots,
number_of_questions_to_use=2, # Recommended 100+ for meaningful results
file_path_to_save_reports="benchmarks/"
)
benchmarks: list[BenchmarkForBot] = await benchmarker.run_benchmark()

# View results
for benchmark in benchmarks:
print(f"Bot: {benchmark.name}")
print(f"Score: {benchmark.average_inverse_expected_log_score}") # Lower is better
print(f"Num Forecasts: {len(benchmark.forecast_reports)}")
print(f"Time: {benchmark.time_taken_in_minutes}min")
print(f"Cost: ${benchmark.total_cost}")
```
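
For intuition on the score printed above, here is a minimal sketch (not the library's exact implementation) of how an inverse expected log score can be computed for a single binary question, assuming the community prediction is treated as the probability that the question resolves Yes:

```python
import math


def inverse_expected_log_score(bot_prob: float, community_prob: float) -> float:
    # Expected log score of the bot's forecast, where the expectation over the
    # resolution uses the community prediction as the "true" probability of Yes.
    expected_log_score = community_prob * math.log(bot_prob) + (
        1 - community_prob
    ) * math.log(1 - bot_prob)
    # Negate so that lower values are better, matching the note above.
    return -expected_log_score


# Example: the bot says 70% on a question where the community says 80%.
print(inverse_expected_log_score(0.7, 0.8))  # ~0.53
```

Averaging this quantity over many questions gives a number in the same spirit as `average_inverse_expected_log_score`, though the exact formula the package uses may differ.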

The ideal number of questions to get a good sense of whether one bot is better than another can vary. 100+ should tell you something decent. See [this analysis](https://forum.effectivealtruism.org/posts/DzqSh7akX28JEHf9H/comparing-two-forecasters-in-an-ideal-world) for an exploration of the numbers.

As of Dec 20, 2024, the benchmarker automatically selects a random set of questions from Metaculus that:
- Are binary questions (yes/no)
- Are currently open
- Will resolve within 3 months
- Have at least 40 forecasters
- Have a community prediction
- Are not notebook/group/conditional questions

As of the last edit, there are plans to expand this to numeric and multiple choice questions, but right now it only benchmarks binary questions.

You can grab these questions without using the Benchmarker by running the code below.



```python
from forecasting_tools import MetaculusApi

questions = MetaculusApi.get_benchmark_questions(
num_of_questions_to_return=100,
)
```

You can also save/load benchmarks to/from JSON.


```python
from forecasting_tools import BenchmarkForBot

# Load
file_path = "benchmarks/benchmark.json"
benchmarks: list[BenchmarkForBot] = BenchmarkForBot.load_json_from_file_path(file_path)

# Save
new_benchmarks: list[BenchmarkForBot] = benchmarks
BenchmarkForBot.save_object_list_to_file_path(new_benchmarks, file_path)

# To/From Json String
single_benchmark = benchmarks[0]
json_object: dict = single_benchmark.to_json()
new_benchmark: BenchmarkForBot = BenchmarkForBot.from_json(json_object)

# Note: Make sure to set the 'FILE_WRITING_ALLOWED' environment variable to true if you want to save the benchmarks to a file

```
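
As an aside on the `FILE_WRITING_ALLOWED` note above, one way to set that environment variable for the current Python process (you can also export it in your shell or CI environment before running) is:

```python
import os

# Allow benchmark files to be written to disk; variable name taken from the note above.
os.environ["FILE_WRITING_ALLOWED"] = "true"
```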

## Smart Searcher
The Smart Searcher acts like an LLM with internet access. It works a lot like Perplexity.ai API, except:
- It has clickable citations that highlights and links directly to the paragraph cited using text fragments
44 changes: 20 additions & 24 deletions code_tests/low_cost_or_live_api_tests/test_benchmarker.py
@@ -1,5 +1,4 @@
import os
from typing import Literal
from unittest.mock import Mock

import pytest
@@ -11,6 +10,9 @@
TemplateBot,
)
from forecasting_tools.forecasting.helpers.benchmarker import Benchmarker
from forecasting_tools.forecasting.questions_and_reports.benchmark_for_bot import (
BenchmarkForBot,
)
from forecasting_tools.util import file_manipulation


@@ -34,7 +36,11 @@ async def test_file_is_made_for_benchmark(mocker: Mock) -> None:
if os.path.isfile(os.path.join(absolute_path, f))
)

await Benchmarker.benchmark_forecast_bot(bot, "shallow")
await Benchmarker(
forecast_bots=[bot],
number_of_questions_to_use=10,
file_path_to_save_reports=file_path_to_save_reports,
).run_benchmark()

files_after = set(
f
@@ -49,33 +55,23 @@
os.remove(os.path.join(absolute_path, new_file))


@pytest.mark.parametrize("num_questions", [1, 5, 10])
async def test_each_benchmark_mode_calls_forecaster_more_time(
mocker: Mock,
num_questions: int,
) -> None:
if ForecastingTestManager.quarterly_cup_is_not_active():
pytest.skip("Quarterly cup is not active")

bot_type = TemplateBot
bot = bot_type()

bot = TemplateBot()
mock_run_forecast = ForecastingTestManager.mock_forecast_bot_run_forecast(
bot_type, mocker
)

modes: list[Literal["shallow", "medium", "deep"]] = [
"shallow",
"medium",
"deep",
]
num_calls_for_modes = []
for mode in modes:
score = await Benchmarker.benchmark_forecast_bot(bot, mode)
assert isinstance(score, float), "The score should be a float"

previous_calls = num_calls_for_modes[-1] if num_calls_for_modes else 0
current_calls = mock_run_forecast.call_count - previous_calls
num_calls_for_modes.append(current_calls)

assert (
current_calls > previous_calls
), "No new forecast calls were made"
benchmarks = await Benchmarker(
forecast_bots=[bot],
number_of_questions_to_use=num_questions,
).run_benchmark()
assert isinstance(benchmarks, list)
assert all(
isinstance(benchmark, BenchmarkForBot) for benchmark in benchmarks
)
assert mock_run_forecast.call_count == num_questions
@@ -70,9 +70,7 @@ def get_forecast_example_reports() -> list[ForecastReport]:

def get_base_rate_example_reports() -> list[BaseRateReport]:
base_rate_data_path = "code_tests/unit_tests/test_forecasting/forecasting_test_data/base_rate_reports.json"
base_rate_reports = (
BaseRateReport.convert_project_file_path_to_object_list(
base_rate_data_path
)
base_rate_reports = BaseRateReport.load_json_from_file_path(
base_rate_data_path
)
return base_rate_reports
@@ -11,6 +11,7 @@
MonetaryCostManager,
)
from forecasting_tools.forecasting.forecast_bots.bot_lists import (
get_all_bot_classes,
get_cheap_bot_question_type_pairs,
)
from forecasting_tools.forecasting.forecast_bots.forecast_bot import (
@@ -100,3 +101,11 @@ async def test_no_reports_when_questions_already_forecasted(
assert len(reports) == len(
questions
), "Expected all questions to be forecasted on"


@pytest.mark.parametrize("bot", get_all_bot_classes())
def test_bot_has_config(bot: type[ForecastBot]):
probable_minimum_number_of_bot_params = 3
bot_config = bot().get_config()
assert bot_config is not None
assert len(bot_config.keys()) > probable_minimum_number_of_bot_params