Before running the Python scripts below, install the necessary Python packages with this command:
pip install -r requirements.txt
Use the tinyshakespeare dataset for a quick setup. This dataset is the fastest to download and tokenize. Run the following command to download and prepare the dataset:
python prepro_tinyshakespeare.py
(All Python scripts in this repo are from Andrej Karpathy's llm.c repository.)
Alternatively, download and tokenize the larger TinyStories dataset with the following command:
python prepro_tinystory.py
Next, download the GPT-2 weights and save them as a checkpoint that Mojo can load, using the following command:
python train_gpt2.py
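The setup steps above can also be scripted. The following is a minimal sketch, not part of the repository: it walks through the three commands quoted above in order and stops at the first failure. The names SETUP_COMMANDS and run_setup are hypothetical, and the sketch assumes it is started from the repository root with python and pip available on the PATH. By default it only prints the commands without running them.

```python
import shlex
import subprocess

# Commands copied from the steps above, in the order they are given.
# Swap the second entry for the TinyStories script to use the larger dataset.
SETUP_COMMANDS = [
    "pip install -r requirements.txt",
    "python prepro_tinyshakespeare.py",
    "python train_gpt2.py",
]


def run_setup(commands=SETUP_COMMANDS, dry_run=True):
    """Print each command and, unless dry_run is set, execute it.

    Execution stops at the first command that exits with a non-zero status,
    because subprocess.run(..., check=True) raises CalledProcessError.
    Returns the argument lists that were processed.
    """
    processed = []
    for command in commands:
        argv = shlex.split(command)
        print("$", command)
        if not dry_run:
            subprocess.run(argv, check=True)
        processed.append(argv)
    return processed
```

Calling run_setup() prints the three commands as a dry run; run_setup(dry_run=False) would execute them for real.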
Ensure that the Magic command line tool is installed by following the Modular Docs.
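One quick way to confirm the tool is reachable before training is to look it up on the PATH. This is a small sketch using only the Python standard library; the helper name find_tool is hypothetical, and the check assumes the executable is called magic, as in the commands in this section.

```python
import shutil


def find_tool(name="magic"):
    """Return the full path of an executable on the PATH, or None if absent."""
    return shutil.which(name)


path = find_tool()
if path is None:
    print("magic not found on PATH; install it by following the Modular Docs")
else:
    print("magic found at", path)
```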
Train your model by running:
magic shell
mojo train_gpt2.mojo
The second command starts the training process using the prepared data. The first time you run a magic command, it automatically installs all necessary dependencies.