Restore from master weights (& allow restoring from a checkpoint of different precision) #702
Conversation
…rministically by also saving RNG state of last update)
6387e66 to 4d77ece
@@ -61,6 +61,16 @@ __global__ void adamw_kernel3(Tp* params_memory, float* master_params_memory, Tg
     );
 }

+template <typename Tp>
+__global__ void init_from_master_kernel(Tp* params_memory, float* master_params_memory, size_t num_parameters,
+                                        ptrdiff_t w_stride, ptrdiff_t s_stride, unsigned int seed, bool check_identical) {
unused check_identical?
oops, that used to be for debug logs that I removed, thanks
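For context, here is a minimal sketch of what a kernel like init_from_master_kernel plausibly does, assuming a bf16 parameter type and the usual bit-trick form of stochastic rounding; the hash, the omitted stride handling, and the function name are simplifications, not the PR's actual code:

```cuda
#include <cuda_bf16.h>

// Sketch only: re-materialize low-precision params from their FP32 master
// copies via stochastic rounding, so that given the same seed the result
// can match what the last optimizer update produced.
__global__ void init_from_master_sketch(__nv_bfloat16* params_memory, const float* master_params_memory,
                                        size_t num_parameters, unsigned int seed) {
    size_t idx = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= num_parameters) { return; }
    // cheap per-element hash of (seed, idx); llm.c uses its own noise hash,
    // this multiply is just a stand-in
    unsigned int r = (seed ^ (unsigned int)idx) * 2654435761u;
    // stochastic rounding to the bf16 grid: add 16 bits of noise to the fp32
    // bit pattern, then clear the low 16 mantissa bits -- a carry into the
    // kept bits implements the probabilistic round-up
    // (edge cases like Inf/NaN are ignored in this sketch)
    unsigned int bits = __float_as_uint(master_params_memory[idx]);
    bits = (bits + (r & 0xFFFFu)) & 0xFFFF0000u;
    params_memory[idx] = __float2bfloat16(__uint_as_float(bits));  // exact: low bits are zero
}
```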
@@ -1532,7 +1554,7 @@ int main(int argc, char *argv[]) {
     gpt2_init_common(&model);
     if (resuming == 1) {
         // if `-y 1` was set, then we are resuming from the latest checkpoint
-        gpt2_build_from_checkpoint(&model, filename_buffer);
+        gpt2_build_from_checkpoint(&model, filename_buffer, true);
At this point model.use_master_weights is not yet initialized with use_master_weights; that happens below, so it still holds the default (true). This variable is used inside gpt2_build_from_checkpoint, so this is probably a bug?
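If that's right, one possible shape of a fix (a sketch, not an actual patch from this PR) is to move the assignment above the call, so the flag is set before gpt2_build_from_checkpoint reads it:

```diff
@@ int main(int argc, char *argv[]) @@
     gpt2_init_common(&model);
+    model.use_master_weights = use_master_weights; // hypothetical: set before it is read below
     if (resuming == 1) {
         // if `-y 1` was set, then we are resuming from the latest checkpoint
         gpt2_build_from_checkpoint(&model, filename_buffer, true);
```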
This is fully deterministic for new checkpoints where the new rng_state_last_update is saved: stochastic rounding from the master weights is done with the exact same seeds as the last optimizer update, while the actual final rng_state is restored again afterwards, in case anything else changed it between that update and saving the checkpoint.
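Concretely, the restore path looks roughly like the sketch below; rng_state and rng_state_last_update are the fields this PR describes, while the struct layout and the helper name are assumptions:

```c
// Sketch: restore low-precision weights deterministically from FP32 masters.
// init_weights_from_master is a hypothetical helper standing in for the
// kernel launch above.
void restore_params_from_master(GPT2 *model) {
    unsigned long long final_rng_state = model->rng_state;  // state at checkpoint-save time
    model->rng_state = model->rng_state_last_update;        // replay the last update's seeds
    init_weights_from_master(model);                        // stochastic-round masters -> params
    model->rng_state = final_rng_state;                     // keep the true final state, in case
                                                            // anything advanced the RNG between
                                                            // that update and saving
}
```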
In the case where we are resuming from a checkpoint (not just a regular model file) and we have master weights, this simply skips loading the weights from the checkpoint entirely, so it doesn't matter if they aren't even the right number of bytes.
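The load path then branches roughly like this (a sketch; the helper name and byte-count variable are placeholders, and in the real code the masters come from the optimizer state):

```c
// When resuming with master weights available, skip the stored low-precision
// weights entirely -- which is why their byte count (and hence their saved
// precision) no longer matters -- and rebuild them from the FP32 masters.
if (resuming && model->use_master_weights) {
    fseekCheck(model_file, (long)weights_bytes_in_file, SEEK_CUR); // skip stored weights
} else {
    read_weights_into_model(model, model_file);                    // normal load path
}
```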
It should be useful for checking whether FP32 master weights help with runs exploding, and going forward it will allow FP8 runs to not care too much about the format the non-master weights are saved in, so we don't need to worry about changes breaking compatibility, etc.