
Zero 2 #593

Open: ngc92 wants to merge 17 commits into master from zero2

Conversation

@ngc92 (Contributor) commented Jun 14, 2024

Trying to get a first version working. The code isn't nice yet: we currently lose the asynchrony in the communication code because we need to reuse the buffer for the next layer, and it doesn't give correct results.

train_gpt2.cu (Outdated)
GPT2Config wave_config = model->config;
// allocate space for two layers, so we can do double-buffering
wave_config.num_layers = 2;
fill_in_parameter_sizes(param_elements, param_sizeof, wave_config);

@gordicaleksa (Contributor) commented Jun 17, 2024

Curious: does DeepSpeed shard the embedding layer as well?

train_gpt2.cu (Outdated, resolved)
@ngc92 force-pushed the zero2 branch 3 times, most recently from 3b2ab8d to eaebc20, on July 25, 2024 at 21:34
@ngc92 marked this pull request as ready for review on July 25, 2024 at 21:37

@ademeure (Contributor) left a comment

LGTM! It looks like it could be made significantly faster/more efficient with relatively little extra complexity, but based on your performance data it already looks useful, and possibly good enough as-is for now.

train_gpt2.cu (Outdated)
// NCCL stream: Wait for buffer 2 to be ready
// Main stream: calculate grads of layer 3 in buffer 1
// ...
cudaCheck(cudaStreamSynchronize(multi_gpu_config.nccl_stream));

Contributor:

cudaStreamSynchronize waits on the host, which isn't great for performance/overlap. Would it be possible to synchronize between the streams with events instead, similar to what we do elsewhere with:

cudaCheck(cudaEventRecord(config->compute_nccl_sync, compute_stream));
cudaCheck(cudaStreamWaitEvent(config->nccl_stream, config->compute_nccl_sync));

llmc/cuda_utils.cuh (Outdated, resolved)
llmc/zero.cuh (Outdated)
src[i] + multi_gpu_config.process_rank * n,
n, seed + i);
cudaCheck(cudaGetLastError());
cudaCheck(cudaMemset(src[i], 0, nelem[i] * sizeof(floatX)));

Contributor:

This memset could be fused into vector_add: these are tiny kernel calls on the same stream, so the idle time between them might be significant (i.e. noticeable in Nsight Systems but not in Nsight Compute).

@ngc92 (Contributor, Author):

If we move zeroing (pun intended) inside, I think we should then move the vector_add function from cuda_utils.h to zero.h.
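
For illustration, a minimal sketch of the fusion being discussed (not the PR's actual code; the kernel name and signature are hypothetical, it assumes llm.c's floatX typedef and stochastic_rounding helper, and it skips the 128-bit vectorized loads): each thread accumulates src into dst with stochastic rounding and then clears src, so the separate cudaMemset launch goes away.

__global__ void vector_add_and_zero(floatX* dst, floatX* src, size_t n, unsigned int seed) {
    size_t idx = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) { return; }
    // accumulate in fp32, write back with stochastic rounding (as in the existing kernel)
    float sum = (float)dst[idx] + (float)src[idx];
    stochastic_rounding(sum, &dst[idx], seed + (unsigned int)idx);
    // zero the source in the same kernel, replacing the follow-up cudaMemset
    src[idx] = (floatX)0.0f;
}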

@ngc92 changed the title from "Zero 2 - WIP" to "Zero 2" on Jul 26, 2024

@gordicaleksa (Contributor) left a comment

Other than the comments I left, which I think are all about refactoring: LGTM!

t128 dst_v = load128cs(dst + idx);
for(int k = 0; k < t128::size; ++k) {
float sum = (float)dst_v[k] + (float)src_v[k];
stochastic_rounding(sum, &dst_v[k], seed + idx);

Contributor:

What benefit do we get from stochastic rounding here?

@ngc92 (Contributor, Author):

If you do lots of gradient accumulation, you incur more and more error, because you end up adding small new gradients to a buffer of large accumulated gradients. With stochastic rounding, we at least stay correct in expectation and do not systematically ignore small changes.
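
To make the failure mode concrete, here is a toy host-side illustration (not PR code; it only assumes CUDA's cuda_bf16.h conversions and compiling with nvcc): with deterministic round-to-nearest, an update smaller than half a ulp of the accumulator is lost entirely, every single time, whereas stochastic rounding would round up with probability proportional to the lost fraction and so remain correct in expectation.

#include <cuda_bf16.h>
#include <stdio.h>

int main(void) {
    // bf16 has 8 mantissa bits, so the ulp at 1.0 is 2^-7 = 0.0078125
    __nv_bfloat16 acc = __float2bfloat16(1.0f);
    for (int i = 0; i < 1000; ++i) {
        float sum = __bfloat162float(acc) + 1e-3f;  // small "gradient" update
        acc = __float2bfloat16(sum);                // round-to-nearest collapses back to 1.0
    }
    // prints 1.000000 even though the true sum is 2.0: every update was rounded away
    printf("%f\n", __bfloat162float(acc));
    return 0;
}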

@@ -293,7 +293,11 @@ typedef struct {
size_t num_parameters_bytes;
// gradients of the weights
ParameterTensors grads;
size_t grads_bytes;

Contributor:

Nit: I'd rename this to grads_num_bytes; otherwise it's very easy to confuse with a pointer to the grad buffer.

@@ -293,7 +293,11 @@ typedef struct {
size_t num_parameters_bytes;
// gradients of the weights
ParameterTensors grads;
size_t grads_bytes;
ParameterTensors grad_shards; // ZeRO-2 gradient shards
size_t grad_shards_bytes;

Contributor:

Similarly here.

// Allocate parameter buffers for the current layers active "wave" of computation
size_t param_elements[NUM_PARAMETER_TENSORS];
size_t param_sizeof[NUM_PARAMETER_TENSORS];
GPT2Config wave_config = model->config;

Contributor:

nit: maybe double_buffer_config instead of wave_config

// allocate as if we had a two-layer network
wave_config.num_layers = 2;
fill_in_parameter_sizes(param_elements, param_sizeof, wave_config);
size_t alloc_bytes = 0;

Contributor:

Refactoring nit: computing the number of bytes could be done inside fill_in_parameter_sizes, because we repeat it so many times throughout the main file and it always comes directly after the fill_in_* call; that might make things a bit more readable.

@ngc92 (Contributor, Author):

That seems like a good cleanup, but IMO it should be a separate PR to keep the changes here to a minimum.
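
One possible shape for the cleanup suggested above, purely as a sketch (the wrapper name and calling convention are hypothetical, not from the PR): keep fill_in_parameter_sizes as-is and return the total byte count so callers don't repeat the summation loop.

size_t fill_in_parameter_sizes_and_bytes(size_t* param_elements, size_t* param_sizeof,
                                         GPT2Config config) {
    // fill in the per-tensor element counts and element sizes as before
    fill_in_parameter_sizes(param_elements, param_sizeof, config);
    // ...and also compute the total allocation size, which callers currently redo by hand
    size_t total_bytes = 0;
    for (int i = 0; i < NUM_PARAMETER_TENSORS; ++i) {
        total_bytes += param_elements[i] * param_sizeof[i];
    }
    return total_bytes;
}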

train_gpt2.cu (Outdated, resolved)
train_gpt2.cu (Outdated, resolved)
train_gpt2.cu (Outdated, resolved)
train_gpt2.cu (Outdated)
// ...
cudaCheck(cudaStreamSynchronize(multi_gpu_config.nccl_stream));
#endif
}
multi_gpu_async_reduce_gradient(pointers, nelem, &multi_gpu_config, main_stream);

Contributor:

As per our offline conversation, let's refactor this one so that it's clear that multi_gpu_async_reduce_gradient actually runs on the NCCL stream, and that the compute stream is passed in only so we can tell it to wait until the computation on the compute stream has finished.

@@ -997,12 +1090,21 @@ float gpt2_calculate_grad_norm(GPT2 *model, MultiGpuConfig* multi_gpu_config) {
// further sum the (partial) squared norm across all GPUs
ncclCheck(ncclAllReduce(grad_norm_squared, grad_norm_squared, sizeof(float), ncclFloat, ncclSum, multi_gpu_config->nccl_comm, main_stream));
#endif
} else {

Contributor:

Shouldn't we extend the #if MULTI_GPU guard to the if (multi_gpu_config->zero_stage == 1) { branch above as well?

Only ZeRO stage 0 makes sense for a non-multi-GPU setup, right?

@ngc92 (Contributor, Author):

I would generally try to keep #if guards to a minimum; they just make it harder to reason about the code because you end up compiling different source code. So I'm trying to hide code behind #if only if it actually cannot compile in single-GPU mode, e.g. because we don't have the nccl_stream. Ideally, I think all of that should be hidden inside zero.h.

@@ -976,7 +976,8 @@ void gpt2_backward_and_reduce(GPT2 *model, int* inputs, const int* targets, int
cudaCheck(cudaStreamSynchronize(multi_gpu_config.nccl_stream));
#endif
}
multi_gpu_async_reduce_gradient(pointers, nelem, &multi_gpu_config, main_stream);
nccl_wait_on_compute(&multi_gpu_config, main_stream);

Contributor:

Cool! I like this solution.
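
For reference, a minimal sketch of what a helper like nccl_wait_on_compute presumably boils down to, based on the event snippet quoted earlier (the field names are taken from that snippet; this is not necessarily the PR's exact implementation): record an event on the compute stream and have the NCCL stream wait on it, so the host never blocks.

void nccl_wait_on_compute(MultiGpuConfig* config, cudaStream_t compute_stream) {
    // make the NCCL stream wait for all work queued so far on the compute stream,
    // without a host-side cudaStreamSynchronize
    cudaCheck(cudaEventRecord(config->compute_nccl_sync, compute_stream));
    cudaCheck(cudaStreamWaitEvent(config->nccl_stream, config->compute_nccl_sync));
}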
