Before attempting the steps in this guide, please ensure you have completed all onboarding steps from the Getting Started section of the Strong Compute Developer Docs.
Using the VSCode Remote SSH extension that you configured in the Quick Start, connect to your ISC container. (You might have to start it on Control Plane first.)
Create and source a new Python virtual environment.
python3 -m virtualenv /root/.chess
source /root/.chess/bin/activate
Clone this repo and install the requirements.
cd /root
git clone https://github.com/StrongResearch/chess-hackathon.git
cd /root/chess-hackathon
pip install -r requirements.txt
- Navigate to the models subdirectory of this repository.
- Decide whether you want to train a chessGPT or chessVision model.
- Navigate to the appropriate model type subdirectory for your chosen model type.
- The model type subdirectory will contain two further subdirectories, one for each example model of this type. Decide which of the two example models you want to train.
Copy the experiment launch file <type>.isc and training script train_<type>.py from your chosen model type subdirectory to the root directory for this repo (i.e. copy from /root/chess-hackathon/models/<type> to /root/chess-hackathon).
cd /root/chess-hackathon
cp models/CHOSEN_MODEL_TYPE/MODEL_ISC_FILE.isc .
cp models/CHOSEN_MODEL_TYPE/MODEL_TRAIN_SCRIPT.py .
Copy the model.py and model_config.yaml files from your chosen example model subdirectory to the root directory for this repo (i.e. copy from /root/chess-hackathon/models/<type>/<example> to /root/chess-hackathon).
cd /root/chess-hackathon # or wherever you cloned it
cp models/CHOSEN_MODEL_TYPE/NAME_OF_MODEL/model.py .
cp models/CHOSEN_MODEL_TYPE/NAME_OF_MODEL/model_config.yaml .
Update your chosen experiment launch file with your Project ID. The provided experiment launch files are prepared with a suitable dataset already, but if you want to select another dataset (see below) you can also update the User Dataset ID.
Launch your experiment with the following.
isc train <type>.isc
- In your terminal, run isc experiments to obtain the output path for the experiment you launched.
- Wait for your experiment to reach the status completed (re-run isc experiments until you see your experiment completed).
- Navigate to the output path for your experiment and copy the checkpoint.pt from within the /root/<output>/<path>/latest_pt subdirectory into the home directory for this repo (i.e. /root/chess-hackathon).
cp /root/<output>/<path>/latest_pt/checkpoint.pt /root/chess-hackathon/checkpoint.pt
- In your terminal, navigate to the home directory for this repo with cd /root/chess-hackathon and run python pre_submission_val.py. This will validate that your model is able to initialize correctly, load the checkpoint, and infer fast enough to play in the tournament, and is an important step before launching burst. Otherwise, you might develop a model and spend time training it only to discover that it is too big, and you will need to train a smaller model instead.
For more information about this see below under Pre-submission model validation.
Once your model has successfully completed a run with compute_mode = "cycle" you will have confidence that it will run successfully on a dedicated cluster. Your next step is to update your experiment launch file with compute_mode = "burst" and again run isc train <type>.isc.
This time you will see a message directing you to Control Plane to launch your burst experiment. Visit the Experiments page on Control Plane and click "Launch Burst" next to your experiment.
Click on the "View" button for your experiment in Control Plane to follow progress initializing your experiment to run on a dedicated cluster. Be patient, this can take a few minutes.
Once your experiment reaches the state of running, visit the Workstations page in Control Plane and click Stop on your container, then click Start on your container again. When your container has started again, you will find artefacts from your experiment training on its dedicated cluster syncing to a directory in /root/exports/<experiment-id>/outputs. Interacting with this directory is slow because it is a mounted bucket, so again please be patient. To track performance metrics logged to rank_0.txt or to access checkpoints, first copy the files you need from /root/exports/<experiment-id>/outputs to another subdirectory in /root.
cd /root/chess-hackathon
cp /root/exports/<experiment-id>/outputs/rank_0.txt .
cp /root/exports/<experiment-id>/outputs/checkpoint.pt .
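The copy step above can also be scripted. The following is a minimal sketch, assuming the two file names used in the commands above; the helper name stage_outputs and its parameters are hypothetical and not part of this repo.

```python
import shutil
from pathlib import Path


def stage_outputs(exports_dir, dest_dir, names=("rank_0.txt", "checkpoint.pt")):
    """Copy selected files out of the (slow) mounted exports directory.

    Hypothetical helper: returns the destination paths that were copied and
    skips any file that has not synced into exports_dir yet.
    """
    exports_dir, dest_dir = Path(exports_dir), Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for name in names:
        src = exports_dir / name
        if src.is_file():
            copied.append(Path(shutil.copy2(src, dest_dir / name)))
    return copied


# Example call (substitute your own experiment ID for the placeholder):
# stage_outputs("/root/exports/<experiment-id>/outputs", "/root/chess-hackathon")
```

Reading the local copies afterwards avoids repeated slow reads from the mounted bucket.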
If your experiment stops with status strong_fail, or if you Stop your experiment via the CLI or Control Plane, then you may be able to resume training your experiment from its most recent checkpoint.
The training scripts included in this repo under /chess-hackathon/models implement an optional argument --load-path. Include this argument in your experiment launch file as follows, passing in the path to the most recent checkpoint from the stopped experiment.
command = '''
source /root/.chess/bin/activate &&
cd /root/chess-hackathon/ &&
torchrun --nnodes=$NNODES --nproc-per-node=$N_PROC --master_addr=$MASTER_ADDR
--master_port=$MASTER_PORT --node_rank=$RANK
train_<type>.py --load-path /root/<path>/<to>/checkpoint.pt'''
You can then launch a new experiment with isc train <type>.isc, which will resume training from that checkpoint.
Note: when resuming from compute_mode = "burst" experiments, ensure you have copied the most recent checkpoint out of the /root/exports directory into another location in /root before resuming your experiment.
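For readers writing their own training script, the sketch below shows one way a --load-path argument could be parsed and used to resume. It is an illustration under stated assumptions, not the implementation in this repo's training scripts: the function names are hypothetical, and the checkpoint is assumed to store the model state dictionary under the "model" key, as the submission specification requires.

```python
import argparse
from pathlib import Path


def get_args(argv=None):
    """Parse command-line arguments; --load-path is optional."""
    parser = argparse.ArgumentParser(description="Training script sketch")
    parser.add_argument("--load-path", type=Path, default=None,
                        help="checkpoint to resume from (omit to start fresh)")
    return parser.parse_args(argv)


def maybe_resume(model, load_path):
    """Restore model weights from load_path if it is given and exists.

    Returns True when weights were restored, False for a fresh start.
    """
    if load_path is None or not Path(load_path).is_file():
        return False
    import torch  # imported lazily so argument parsing works without PyTorch

    checkpoint = torch.load(load_path, map_location="cpu")
    model.load_state_dict(checkpoint["model"])  # assumes a "model" key
    return True
```

A full resume would typically also restore optimizer and data-sampler state, which this sketch omits.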
To understand how your model will be instantiated and called during gameplay, refer to the gameplay.ipynb notebook.
You may develop most any kind of model you like, but your submission must adhere to the following rules.
- Your submission must conform to the specification (below),
- Your model must pass the pre-submission validation check (below) to be admitted into the tournament,
- Your model must be trained entirely from scratch using the provided compute resources.
- You may not use pretrained models (this includes no transfer learning, fine-tuning, or adaptation modules).
- You may not hard-code any moves (e.g. no opening books).
- Your model must use or be compatible with the dependencies included in the requirements.txt file for this repo. You may install additional dependencies for the purpose of training, but for inference (e.g. game play / tournament) your model must not require any dependencies other than those included in the requirements.txt file.
Your submission must follow the directory structure shown below. Ensure you have moved your model.py, model_config.yaml, and checkpoint.pt files into a separate subdirectory. Then copy in pre_submission_val.py and chess_gameplay.py and run the validation script with python pre_submission_val.py to test that your model will build and infer within the allowed time. For more information, see Pre-submission model validation below.
└─team-name
├─ model.py
├─ model_config.yaml
├─ checkpoint.pt
├─ pre_submission_val.py
└─ chess_gameplay.py
Do not make any changes to the contents of pre_submission_val.py or chess_gameplay.py.
- The model_config.yaml file must conform to standard YAML syntax.
- The model_config.yaml file must contain all necessary arguments for instantiating your model. See below for demonstration of how the model_config.yaml is expected to be used during the tournament.
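A minimal sketch of how model_config.yaml might be consumed, assuming its top-level keys are passed to the Model constructor as keyword arguments. That calling convention, the helper names, the argument names, and the TinyModel stand-in are all assumptions for illustration; gameplay.ipynb shows how the model is actually instantiated.

```python
def load_config(path="model_config.yaml"):
    """Read the YAML config into a plain dict (requires PyYAML)."""
    import yaml  # lazy import; PyYAML is assumed to be installed

    with open(path) as f:
        return yaml.safe_load(f) or {}


def build_model(model_cls, config):
    """Instantiate model_cls with every config entry as a keyword argument."""
    return model_cls(**config)


class TinyModel:
    """Hypothetical stand-in for the Model class defined in model.py."""

    def __init__(self, embed_dim, n_layers):
        self.embed_dim = embed_dim
        self.n_layers = n_layers


# A missing or misspelled key raises TypeError here, which is why the file
# must contain every argument needed to instantiate the model.
model = build_model(TinyModel, {"embed_dim": 64, "n_layers": 2})
```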
- The model.py file must contain a class description of your model, which must be a PyTorch module called Model.
- The Model class must be self-contained. All code necessary to instantiate your model should be included in the model.py file and dependencies installed with requirements.txt. Your model.py file must not import from any ancillary files in your project directory.
- The model must not move any weights to the GPU upon initialization; it will be expected to run entirely on the CPU during the tournament.
- The model must implement a score method.
- The score method must accept as input the following two positional arguments:
  - A PGN string representing the current game up to the most recent move, and
  - A string representing a potential next move.
- The score method must return a float value which represents a score for the potential move given the PGN, where higher positive scores always indicate preference for selecting that move.
- The model must not require GPU access to execute the score method.
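To make the contract of score concrete, the sketch below shows how a caller could rank candidate moves with it. The pick_move helper and the CaptureLover stand-in are hypothetical and exist only for illustration (a real Model is a trained PyTorch module, and this toy scorer would not be a valid submission); the way moves are actually selected during the tournament is defined in chess_gameplay.py.

```python
def pick_move(model, pgn, candidate_moves):
    """Return the candidate with the highest score (higher means preferred)."""
    return max(candidate_moves, key=lambda move: float(model.score(pgn, move)))


class CaptureLover:
    """Hypothetical scorer exposing the required interface: score(pgn, move) -> float."""

    def score(self, pgn, move):
        # Toy heuristic for illustration only: prefer moves written with an "x".
        return 1.0 if "x" in move else 0.0


pgn = "1. e4 d5"
best = pick_move(CaptureLover(), pgn, ["Nf3", "exd5", "e5"])
```

Because only the relative ordering of scores matters to a caller like this, the scale of the returned float is a design choice left to the model.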
- The checkpoint.pt file must be able to be loaded with the torch.load function.
- Your model state dictionary must be able to be obtained from the loaded checkpoint object by calling checkpoint["model"].
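The checkpoint requirements can be sketched as follows. This is an illustration, assuming the checkpoint is a plain dictionary written with torch.save; the helper names and the extra-entries parameter are hypothetical.

```python
def save_checkpoint(model, path="checkpoint.pt", **extra):
    """Save a checkpoint whose "model" entry holds the state dictionary."""
    import torch  # lazy import; PyTorch is assumed to be installed

    # Extra entries (e.g. an epoch counter) may sit alongside "model".
    torch.save({"model": model.state_dict(), **extra}, path)


def state_dict_from(checkpoint):
    """Return checkpoint["model"], with a clear error if the key is missing."""
    if "model" not in checkpoint:
        raise KeyError('checkpoint has no "model" entry')
    return checkpoint["model"]
```

Running state_dict_from on the object returned by torch.load is a quick local check that a checkpoint has the expected layout before submitting.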
Your model must satisfy the pre-submission validation check to gain admittance into the tournament. You can run the pre-submission validation check with the following.
python pre_submission_val.py
If successful, this test will return the following.
Outputs pass validation tests.
Model passes validation test.
If any errors are reported, your model has failed the test and must be amended in order to be accepted into the tournament.
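During development it can help to estimate inference time before running the official check. The sketch below times calls to score on any object exposing that method; the helper and the ConstantScorer stand-in are hypothetical, and no time limit is assumed here because the allowed time is enforced by pre_submission_val.py itself.

```python
import time


def mean_score_seconds(model, pgn, moves, repeats=3):
    """Average wall-clock seconds per score() call over the given moves."""
    start = time.perf_counter()
    for _ in range(repeats):
        for move in moves:
            model.score(pgn, move)
    elapsed = time.perf_counter() - start
    return elapsed / (repeats * len(moves))


class ConstantScorer:
    """Hypothetical stand-in; replace with your real Model instance."""

    def score(self, pgn, move):
        return 0.0


seconds = mean_score_seconds(ConstantScorer(), "1. e4 e5", ["Nf3", "Bc4"])
print(f"{seconds * 1e3:.3f} ms per score() call")
```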
There are four datasets that have been published for this hackathon which can be found on the Datasets page of Control Plane under Public Datasets.
Chess Hackathon - PGNs - Grand Master Games (ID: 3bd77ed0-cda1-4274-8b5b-7582fabb9754)
Chess Hackathon - PGNs - Leela Chess Zero Training Run 60 (ID: a6ebbed3-c0ec-49f9-8759-f17bb28d5376)
Chess Hackathon - Board Evals - Grand Master Games (ID: 96f6d30d-3dec-474b-880e-d2fa3ba3756e)
Chess Hackathon - Board Evals - Leela Chess Zero Training Run 60 (ID: 9714cc3f-7383-43de-bb06-1e23ba2887ac)
The PGN datasets are suitable for chessGPT model training. The EVAL datasets are suitable for chessVision model training. Choose a dataset that is suitable for your chosen model and note the Dataset ID.
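As a convenience, the pairing of datasets to model types can be written down as a small lookup. The IDs are copied from the list above; the DATASETS table and dataset_ids helper are hypothetical and not part of this repo.

```python
# Dataset IDs copied from the list above, grouped by the model type they suit.
DATASETS = {
    "chessGPT": {
        "Chess Hackathon - PGNs - Grand Master Games":
            "3bd77ed0-cda1-4274-8b5b-7582fabb9754",
        "Chess Hackathon - PGNs - Leela Chess Zero Training Run 60":
            "a6ebbed3-c0ec-49f9-8759-f17bb28d5376",
    },
    "chessVision": {
        "Chess Hackathon - Board Evals - Grand Master Games":
            "96f6d30d-3dec-474b-880e-d2fa3ba3756e",
        "Chess Hackathon - Board Evals - Leela Chess Zero Training Run 60":
            "9714cc3f-7383-43de-bb06-1e23ba2887ac",
    },
}


def dataset_ids(model_type):
    """Return {dataset name: dataset ID} for "chessGPT" or "chessVision"."""
    try:
        return DATASETS[model_type]
    except KeyError:
        raise ValueError(f"unknown model type: {model_type!r}") from None
```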
A further two datasets have also been prepared which contain both of the above PGN and both of the above EVAL datasets respectively. Those datasets are named as follows.
Chess Hackathon - PGNs - Combined
Chess Hackathon - Board Evals - Combined
Please note the training scripts published in this repo will not work with these two combined datasets without adjustment. You will need to update the training scripts, or write your own, to work with the above combined datasets if you wish.
All code used to develop these datasets can be found in /root/chess-hackathon/utils/data_preprocessing. The Hackathon 3 - PGN - Grand Master Games dataset was generated using the gm_preproc.ipynb notebook. The Hackathon 3 - PGN - Leela Chess Zero Training Test 60 dataset was generated using the lc0_preproc.ipynb notebook. The Hackathon 3 - EVAL - Grand Master Games and Hackathon 3 - EVAL - Leela Chess Zero Training Test 60 datasets were generated by running a distributed processing workload with preproc_boardeval.py launched with preproc.isc, and post-processed with eval_preproc.ipynb.