Zero 2 #593
base: master
Conversation
train_gpt2.cu
Outdated
GPT2Config wave_config = model->config;
// allocate space for two layers, so we can do double-buffering
wave_config.num_layers = 2;
fill_in_parameter_sizes(param_elements, param_sizeof, wave_config);
curious whether DeepSpeed also shards the embedding layer?
3b2ab8d to eaebc20
LGTM, looks like it could be made significantly faster/more efficient with relatively little extra complexity, but based on your performance data it already looks useful and possibly good enough for now as-is!
train_gpt2.cu
Outdated
// NCCL stream: Wait for buffer 2 to be ready
// Main stream: calculate grads of layer 3 in buffer 1
// ...
cudaCheck(cudaStreamSynchronize(multi_gpu_config.nccl_stream));
cudaStreamSynchronize will wait on the host, which isn't great for performance/overlap. Would it be possible to synchronise with events between streams, similar to what we do elsewhere with:
cudaCheck(cudaEventRecord(config->compute_nccl_sync, compute_stream));
cudaCheck(cudaStreamWaitEvent(config->nccl_stream, config->compute_nccl_sync));
llmc/zero.cuh
Outdated
src[i] + multi_gpu_config.process_rank * n,
n, seed + i);
cudaCheck(cudaGetLastError());
cudaCheck(cudaMemset(src[i], 0, nelem[i] * sizeof(floatX)));
could be fused into the vector_add - these are tiny kernel calls in the same stream, so the idle time between might be significant (i.e. noticeable in Nsight Systems but not Nsight Compute)
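One possible shape for that fusion, sketched from the vector_add loop quoted elsewhere in this review (this is an illustration, not the PR's actual code; it assumes t128 provides a zeros() helper, as Packed128 types in the repo do):

```cuda
// Sketch: add src into dst and zero src in one launch, replacing the
// separate cudaMemset and removing the launch gap between the two kernels.
__global__ void vector_add_and_zero(floatX* dst, floatX* src, size_t n, unsigned int seed) {
    size_t idx = ((size_t)blockIdx.x * blockDim.x + threadIdx.x) * t128::size;
    if (idx >= n) { return; }
    t128 dst_v = load128cs(dst + idx);
    t128 src_v = load128cs(src + idx);
    for (int k = 0; k < t128::size; ++k) {
        float sum = (float)dst_v[k] + (float)src_v[k];
        stochastic_rounding(sum, &dst_v[k], seed + idx);
    }
    store128(dst + idx, dst_v);
    store128(src + idx, t128::zeros());  // zero src in the same pass
}
```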
If we move zeroing (pun intended) inside, I think we should then move the vector_add function from cuda_utils.h to zero.h.
other than the comments i left which i think are all about refactoring - lgtm!
t128 dst_v = load128cs(dst + idx);
for(int k = 0; k < t128::size; ++k) {
float sum = (float)dst_v[k] + (float)src_v[k];
stochastic_rounding(sum, &dst_v[k], seed + idx);
what benefit do we have from stochastic rounding here?
if you do lots of gradient accumulation, you will incur more and more error because you end up adding small new gradients to the buffer of large accumulated gradients. With stochastic rounding, we at least stay correct in expectation, and will not systematically ignore small changes.
@@ -293,7 +293,11 @@ typedef struct {
size_t num_parameters_bytes;
// gradients of the weights
ParameterTensors grads;
size_t grads_bytes;
nit: i'd rename this to grads_num_bytes, otherwise it's very easy to confuse with a pointer to the grad buffer
@@ -293,7 +293,11 @@ typedef struct {
size_t num_parameters_bytes;
// gradients of the weights
ParameterTensors grads;
size_t grads_bytes;
ParameterTensors grad_shards; // ZeRO-2 gradient shards
size_t grad_shards_bytes;
similarly here
// Allocate parameter buffers for the currently active "wave" of computation
size_t param_elements[NUM_PARAMETER_TENSORS];
size_t param_sizeof[NUM_PARAMETER_TENSORS];
GPT2Config wave_config = model->config;
nit: maybe double_buffer_config instead of wave_config
// allocate as if we had a two-layer network
wave_config.num_layers = 2;
fill_in_parameter_sizes(param_elements, param_sizeof, wave_config);
size_t alloc_bytes = 0;
refactoring nit: computing the number of bytes could be done inside fill_in_parameter_sizes; we repeat that loop so many times throughout the main file, and it always comes directly after the fill_in_* call, so folding it in might make things a bit more readable
that seems like a good cleanup, but IMO should be a separate PR to keep the changes here to a minimum
train_gpt2.cu
Outdated
// ...
cudaCheck(cudaStreamSynchronize(multi_gpu_config.nccl_stream));
#endif
}
multi_gpu_async_reduce_gradient(pointers, nelem, &multi_gpu_config, main_stream);
as per offline conversation, let's refactor this one so that it's clear multi_gpu_async_reduce_gradient is actually running on the nccl stream, and the compute stream is passed in just so that we tell it to wait until the computation on the compute stream has finished.
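A minimal sketch of what that helper could look like, built from the event pattern already quoted in this review (field names like compute_nccl_sync and nccl_stream are taken from the existing code; the exact signature is an assumption):

```cuda
// Sketch: make the ordering explicit. The NCCL stream waits on the compute
// stream via an event recorded on the device, so the host never blocks and
// the reduction can overlap with subsequent compute work.
void nccl_wait_on_compute(MultiGpuConfig* config, cudaStream_t compute_stream) {
    cudaCheck(cudaEventRecord(config->compute_nccl_sync, compute_stream));
    cudaCheck(cudaStreamWaitEvent(config->nccl_stream, config->compute_nccl_sync));
}
```

With this in place, multi_gpu_async_reduce_gradient no longer needs the compute stream at all; callers invoke nccl_wait_on_compute first and the reduce is enqueued purely on the nccl stream.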
@@ -997,12 +1090,21 @@ float gpt2_calculate_grad_norm(GPT2 *model, MultiGpuConfig* multi_gpu_config) {
// further sum the (partial) squared norm across all GPUs
ncclCheck(ncclAllReduce(grad_norm_squared, grad_norm_squared, sizeof(float), ncclFloat, ncclSum, multi_gpu_config->nccl_comm, main_stream));
#endif
} else {
shouldn't we extend the #if MULTI_GPU guard to the above if (multi_gpu_config->zero_stage == 1) branch as well? only zero stage 0 makes sense for a non-multi-GPU setup?
I generally would try to keep #if guards to a minimum; they just make it harder to reason about the code because you end up compiling different source code. So I'm trying to only hide code behind #if if it actually cannot compile in single-gpu mode, e.g., because we don't have the nccl_stream. Ideally, I think all that should be hidden inside zero.h.
@@ -976,7 +976,8 @@ void gpt2_backward_and_reduce(GPT2 *model, int* inputs, const int* targets, int
cudaCheck(cudaStreamSynchronize(multi_gpu_config.nccl_stream));
#endif
}
multi_gpu_async_reduce_gradient(pointers, nelem, &multi_gpu_config, main_stream);
nccl_wait_on_compute(&multi_gpu_config, main_stream);
cool! i like this solution
Trying to get a first version working. The code isn't nice yet: we currently lose the asynchrony in the communication code because we need to reuse the buffer for the next layer, and it doesn't give correct results.