A more powerful version of Virtual Scientists that supports million-agent-level scientific collaboration simulation. Our scientific collaboration pipeline includes six stages: (1) collaborator selection; (2) topic selection; (3) idea generation; (4) novelty check; (5) abstract generation; (6) review generation.
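The six stages above form a sequential pipeline. The following is an illustrative sketch (not the project's actual code) of how such a pipeline can be wired together; the stage names and handler interface are assumptions for illustration only:

```python
# Hypothetical names for the six collaboration stages described above.
STAGES = [
    "collaborator_selection",
    "topic_selection",
    "idea_generation",
    "novelty_check",
    "abstract_generation",
    "review_generation",
]

def run_pipeline(team_state, handlers):
    """Run each stage in order, passing the evolving team state along.

    `handlers` maps a stage name to a function that takes and returns
    the team state (e.g., an LLM-backed step in the real system).
    """
    for stage in STAGES:
        team_state = handlers[stage](team_state)
    return team_state
```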
[2025-05]
- We release a simple reinforcement learning (RL)-based algorithm for collaborator selection in the `RL-Based` branch.
[2025-04]
- We release the code and data of VirSci-v2, a powerful platform for scientific collaboration simulation.
```bash
git clone https://github.com/RenqiChen/Virtual-Scientists-v2
conda create --name virsci python=3.11
conda activate virsci
```
Install the dependencies of the base multi-agent framework, CAMEL.

```bash
cd camel-master
pip install --upgrade pip setuptools
pip install -e .  # installs the dependencies in pyproject.toml and camel in editable mode
```
Then, install the following necessary packages.
```bash
pip install ollama
pip install faiss-gpu
```
Some other dependencies can be installed as needed.
In our experiments, we use `ollama` to deploy the `llama3.1-8b` and `llama3.1-70b` language models and the `mxbai-embed-large` embedding model. For deployment details, refer to URL. Here we show some key steps:
- Install Ollama. The Linux version:

```bash
curl -fsSL https://ollama.com/install.sh | sh
```

- Run ollama from the directory where it is installed:

```bash
./ollama serve
```

- Pull the models used with the library:

```bash
./ollama pull llama3.1
./ollama pull llama3.1:70b
./ollama pull mxbai-embed-large
```

- Install the ollama Python library in your environment:

```bash
pip install ollama
```

- Complete the installation and close the terminal.
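Before running the simulation, you can verify that the Ollama server is reachable. A minimal sketch, assuming the default Ollama port 11434 (adjust the host if you changed `OLLAMA_HOST`):

```python
import urllib.request

def ollama_running(host="http://127.0.0.1:11434", timeout=2):
    """Return True if an HTTP server (e.g., `ollama serve`) responds at host."""
    try:
        with urllib.request.urlopen(host, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        # Connection refused / timeout: no server listening at this address.
        return False
```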
The raw data is based on the AMiner Computer Science Dataset and the Open Academic Graph. After preprocessing, the data we use is publicly available on Google Drive. The files on Google Drive correspond to the `__init__` method of class `Platform` in `sci_platform/sci_platform_fast.py`.
- Computer Science Dataset
  - The past paper database is in `Papers/papers.tar.gz`, which is used in `paper_folder_path`. The corresponding embedding database is in `Embeddings/faiss_index.index`, which is used in `paper_index_path`.
  - The contemporary paper database is in `Papers/papers_future.tar.gz`, which is used in `future_paper_folder_path`. The corresponding embedding database is in `Embeddings/faiss_index_future.index`, which is used in `paper_future_index_path`.
  - The author knowledge bank is in `Authors/books.tar`, which is used in `input_dir` in `sci_platform/configs/knowledge_config.json` and in `author_folder_path`.
  - The adjacency matrix is in `adjacency.txt`, which is used in `adjacency_matrix_dir`.
- Open Academic Graph Dataset
  - The past paper database is in `Papers/papers_OAG.zip`, which is used in `paper_folder_path`. The corresponding embedding database is in `Embeddings/faiss_index_OAG.index`, which is used in `paper_index_path`.
  - The contemporary paper database is in `Papers/papers_future_OAG.tar.gz`, which is used in `future_paper_folder_path`. The corresponding embedding database is in `Embeddings/faiss_index_OAG_future.index`, which is used in `paper_future_index_path`.
  - The author knowledge bank is in `Authors/books_OAG.zip`, which is used in `input_dir` in `sci_platform/configs/knowledge_config.json` and in `author_folder_path`.
  - The adjacency matrix is in `weight_matrix.txt`, which is used in `adjacency_matrix_dir`.
Note: please replace all paths in `sci_platform/sci_platform_fast.py` with your own settings after downloading the data.
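For example, the Computer Science Dataset paths can be collected in one place before being substituted into `sci_platform_fast.py`. This is a hypothetical sketch: the variable names follow the config fields listed above, but the base directory and extracted folder names are assumptions, so adapt them to where you actually unpacked the Google Drive files:

```python
# Hypothetical local layout after downloading and extracting the data.
BASE = "/data/virsci"  # assumption: replace with your own download directory

paths = {
    "paper_folder_path":        f"{BASE}/Papers/papers",
    "paper_index_path":         f"{BASE}/Embeddings/faiss_index.index",
    "future_paper_folder_path": f"{BASE}/Papers/papers_future",
    "paper_future_index_path":  f"{BASE}/Embeddings/faiss_index_future.index",
    "author_folder_path":       f"{BASE}/Authors/books",
    "adjacency_matrix_dir":     f"{BASE}/adjacency.txt",
}
```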
Here we explain the roles of several critical files.

- `sci_platform/configs/deploy_config.py` defines all hyper-parameter settings.
- `sci_platform/social_agent/sci_agent.py` defines the customized scientist agent in this project.
- `sci_platform/social_agent/channel.py` defines message sending and receiving; this is the lowest-level module.
- `sci_platform/inference` controls the messages sent to or received from the channel, corresponding to different threads.
- `sci_platform/run_fast.py` is the main execution file.
- `sci_platform/sci_platform_fast.py` defines the platform that initializes our multi-agent system.
- `sci_platform/utils/prompt.py` contains all the prompts used.
- `sci_platform/utils/scientist_utils.py` contains all the common functions used.
- `sci_platform/sci_team/SciTeam.py` defines the execution mechanism of each scientist team.
Our code supports different environment settings. The commonly used arguments in `deploy_config.py` are:
- Deploy Setup
  - `ips`: the IPs for LLM model deployment
  - `port`: the ports of each IP for LLM model deployment
- Experiment Setup
  - `agent_num`: how many independent scientists are included in the simulation
  - `runs`: how many times the program runs
  - `team_limit`: the maximum number of teams a scientist can join
  - `max_discuss_iteration`: the maximum discussion iterations for a team in one step
  - `max_team_member`: the maximum number of members in a team (including the team leader)
  - `epochs`: the allowed time steps for one program run (publishing a complete paper usually needs 5 epochs)
  - `model_name`: the base LLM for simulation (e.g., llama3.1)
  - `leader_mode`: who is the leader (e.g., normal or random)
- Checkpoint Setup
  - `checkpoint`: whether to resume from a checkpoint or start a new run
  - `test_time`: the name of the test used as a checkpoint
  - `load_time`: the name of the checkpoint to load
In `deploy_config.py`, set `ips=['127.0.0.1']`. In `port2.sh`, `CUDA_VISIBLE_DEVICES` specifies which GPUs are used, and `OLLAMA_HOST=0.0.0.0:XXXXX` sets the port on which each GPU serves an LLM model.
```bash
cd sci_platform
bash port2.sh
```
In `deploy_config.py`, set `ips` to include the IPs of all machines. `port1.sh` deploys LLM models on these distributed machines.
```bash
cd sci_platform
bash port1.sh
bash port2.sh
```
This project is supported by Shanghai Artificial Intelligence Laboratory.
The multi-agent framework in this work is based on CAMEL.
The concurrent distributed system in this work is based on OASIS.
The raw data is based on the AMiner Computer Science Dataset and the Open Academic Graph.
This repository is licensed under the Apache-2.0 License.