
Zero 2 #593

Open: ngc92 wants to merge 17 commits into master from zero2

Conversation

@ngc92 (Contributor) commented Jun 14, 2024

Trying to get a first version working. The code isn't nice yet: we currently lose the asynchrony in the communication code because we need to reuse the buffer for the next layer, and it doesn't give correct results.

train_gpt2.cu (Outdated)
GPT2Config wave_config = model->config;
// allocate space for two layers, so we can do double-buffering
wave_config.num_layers = 2;
fill_in_parameter_sizes(param_elements, param_sizeof, wave_config);

@gordicaleksa (Contributor) commented Jun 17, 2024

Curious: does DeepSpeed shard the embedding layer as well?

train_gpt2.cu (Outdated, resolved)
@ngc92 force-pushed the zero2 branch 3 times, most recently from 3b2ab8d to eaebc20, on July 25, 2024 at 21:34
@ngc92 marked this pull request as ready for review on July 25, 2024 at 21:37

@ademeure (Contributor) left a comment

LGTM! It looks like it could be made significantly faster/more efficient with relatively little extra complexity, but based on your performance data it already looks useful, and possibly good enough as-is for now.

train_gpt2.cu (Outdated)
// NCCL stream: Wait for buffer 2 to be ready
// Main stream: calculate grads of layer 3 in buffer 1
// ...
cudaCheck(cudaStreamSynchronize(multi_gpu_config.nccl_stream));

Contributor:

cudaStreamSynchronize waits on the host, which isn't great for performance/overlap. Would it be possible to synchronize between the streams with events instead, similar to what we do elsewhere with:

cudaCheck(cudaEventRecord(config->compute_nccl_sync, compute_stream));
cudaCheck(cudaStreamWaitEvent(config->nccl_stream, config->compute_nccl_sync));

llmc/cuda_utils.cuh (Outdated, resolved)
llmc/zero.cuh (Outdated)
src[i] + multi_gpu_config.process_rank * n,
n, seed + i);
cudaCheck(cudaGetLastError());
cudaCheck(cudaMemset(src[i], 0, nelem[i] * sizeof(floatX)));

Contributor:

This memset could be fused into vector_add: these are tiny kernel calls on the same stream, so the idle time between them might be significant (i.e. noticeable in Nsight Systems but not in Nsight Compute).

@ngc92 (Contributor, Author):

If we move zeroing (pun intended) inside, I think we should then move the vector_add function from cuda_utils.h to zero.h.
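
For illustration, a minimal sketch of the fusion being discussed (not the PR's actual code; the kernel name and signature are hypothetical, it assumes llm.c's floatX typedef and stochastic_rounding helper, and it skips the 128-bit vectorized loads): each thread accumulates src into dst with stochastic rounding and then clears src, so the separate cudaMemset launch goes away.

__global__ void vector_add_and_zero(floatX* dst, floatX* src, size_t n, unsigned int seed) {
    size_t idx = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) { return; }
    // accumulate in fp32, write back with stochastic rounding (as in the existing kernel)
    float sum = (float)dst[idx] + (float)src[idx];
    stochastic_rounding(sum, &dst[idx], seed + (unsigned int)idx);
    // zero the source in the same kernel, replacing the follow-up cudaMemset
    src[idx] = (floatX)0.0f;
}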

@ngc92 changed the title from "Zero 2 - WIP" to "Zero 2" on Jul 26, 2024

@gordicaleksa (Contributor) left a comment

Other than the comments I left, which I think are all about refactoring: LGTM!

t128 dst_v = load128cs(dst + idx);
for(int k = 0; k < t128::size; ++k) {
float sum = (float)dst_v[k] + (float)src_v[k];
stochastic_rounding(sum, &dst_v[k], seed + idx);

Contributor:

What benefit do we get from stochastic rounding here?

@ngc92 (Contributor, Author):

If you do lots of gradient accumulation, you incur more and more error, because you end up adding small new gradients to a buffer of large accumulated gradients. With stochastic rounding, we at least stay correct in expectation and do not systematically ignore small changes.
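
To make the failure mode concrete, here is a toy host-side illustration (not PR code; it only assumes CUDA's cuda_bf16.h conversions and compiling with nvcc): with deterministic round-to-nearest, an update smaller than half a ulp of the accumulator is lost entirely, every single time, whereas stochastic rounding would round up with probability proportional to the lost fraction and so remain correct in expectation.

#include <cuda_bf16.h>
#include <stdio.h>

int main(void) {
    // bf16 has 8 mantissa bits, so the ulp at 1.0 is 2^-7 = 0.0078125
    __nv_bfloat16 acc = __float2bfloat16(1.0f);
    for (int i = 0; i < 1000; ++i) {
        float sum = __bfloat162float(acc) + 1e-3f;  // small "gradient" update
        acc = __float2bfloat16(sum);                // round-to-nearest collapses back to 1.0
    }
    // prints 1.000000 even though the true sum is 2.0: every update was rounded away
    printf("%f\n", __bfloat162float(acc));
    return 0;
}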

@@ -293,7 +293,11 @@ typedef struct {
size_t num_parameters_bytes;
// gradients of the weights
ParameterTensors grads;
size_t grads_bytes;

Contributor:

Nit: I'd rename this to grads_num_bytes; otherwise it's very easy to confuse with a pointer to the grad buffer.

@@ -293,7 +293,11 @@ typedef struct {
size_t num_parameters_bytes;
// gradients of the weights
ParameterTensors grads;
size_t grads_bytes;
ParameterTensors grad_shards; // ZeRO-2 gradient shards
size_t grad_shards_bytes;

Contributor:

Similarly here.

// Allocate parameter buffers for the current layers active "wave" of computation
size_t param_elements[NUM_PARAMETER_TENSORS];
size_t param_sizeof[NUM_PARAMETER_TENSORS];
GPT2Config wave_config = model->config;

Contributor:

nit: maybe double_buffer_config instead of wave_config

// allocate as if we had a two-layer network
wave_config.num_layers = 2;
fill_in_parameter_sizes(param_elements, param_sizeof, wave_config);
size_t alloc_bytes = 0;

Contributor:

Refactoring nit: computing the number of bytes could be done inside fill_in_parameter_sizes, because we repeat it so many times throughout the main file and it always comes directly after the fill_in_* call; that might make things a bit more readable.

@ngc92 (Contributor, Author):

That seems like a good cleanup, but IMO it should be a separate PR to keep the changes here to a minimum.
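
One possible shape for the cleanup suggested above, purely as a sketch (the wrapper name and calling convention are hypothetical, not from the PR): keep fill_in_parameter_sizes as-is and return the total byte count so callers don't repeat the summation loop.

size_t fill_in_parameter_sizes_and_bytes(size_t* param_elements, size_t* param_sizeof,
                                         GPT2Config config) {
    // fill in the per-tensor element counts and element sizes as before
    fill_in_parameter_sizes(param_elements, param_sizeof, config);
    // ...and also compute the total allocation size, which callers currently redo by hand
    size_t total_bytes = 0;
    for (int i = 0; i < NUM_PARAMETER_TENSORS; ++i) {
        total_bytes += param_elements[i] * param_sizeof[i];
    }
    return total_bytes;
}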

train_gpt2.cu (Outdated, resolved)
train_gpt2.cu (Outdated, resolved)
train_gpt2.cu (Outdated, resolved)
train_gpt2.cu (Outdated)
// ...
cudaCheck(cudaStreamSynchronize(multi_gpu_config.nccl_stream));
#endif
}
multi_gpu_async_reduce_gradient(pointers, nelem, &multi_gpu_config, main_stream);

Contributor:

As per our offline conversation, let's refactor this one so that it's clear that multi_gpu_async_reduce_gradient actually runs on the NCCL stream, and that the compute stream is passed in only so we can tell it to wait until the computation on the compute stream has finished.

@@ -997,12 +1090,21 @@ float gpt2_calculate_grad_norm(GPT2 *model, MultiGpuConfig* multi_gpu_config) {
// further sum the (partial) squared norm across all GPUs
ncclCheck(ncclAllReduce(grad_norm_squared, grad_norm_squared, sizeof(float), ncclFloat, ncclSum, multi_gpu_config->nccl_comm, main_stream));
#endif
} else {

Contributor:

Shouldn't we extend the #if MULTI_GPU guard to the if (multi_gpu_config->zero_stage == 1) { branch above as well?

Only ZeRO stage 0 makes sense for a non-multi-GPU setup, right?

@ngc92 (Contributor, Author):

I would generally try to keep #if guards to a minimum; they just make it harder to reason about the code because you end up compiling different source code. So I'm trying to hide code behind #if only if it actually cannot compile in single-GPU mode, e.g. because we don't have the nccl_stream. Ideally, I think all of that should be hidden inside zero.h.

@@ -976,7 +976,8 @@ void gpt2_backward_and_reduce(GPT2 *model, int* inputs, const int* targets, int
cudaCheck(cudaStreamSynchronize(multi_gpu_config.nccl_stream));
#endif
}
multi_gpu_async_reduce_gradient(pointers, nelem, &multi_gpu_config, main_stream);
nccl_wait_on_compute(&multi_gpu_config, main_stream);

Contributor:

Cool! I like this solution.
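
For reference, a minimal sketch of what a helper like nccl_wait_on_compute presumably boils down to, based on the event snippet quoted earlier (the field names are taken from that snippet; this is not necessarily the PR's exact implementation): record an event on the compute stream and have the NCCL stream wait on it, so the host never blocks.

void nccl_wait_on_compute(MultiGpuConfig* config, cudaStream_t compute_stream) {
    // make the NCCL stream wait for all work queued so far on the compute stream,
    // without a host-side cudaStreamSynchronize
    cudaCheck(cudaEventRecord(config->compute_nccl_sync, compute_stream));
    cudaCheck(cudaStreamWaitEvent(config->nccl_stream, config->compute_nccl_sync));
}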
