
How to train a BERT model from scratch #27

Open · dwjung1 opened this issue Mar 16, 2022 · 6 comments
Labels: enhancement (New feature or request), question (Further information is requested)

Comments

@dwjung1 commented Mar 16, 2022

How can I train a BERT model from scratch?

@bwdGitHub added the question and enhancement labels on Mar 16, 2022

@bwdGitHub (Collaborator)

We don't have any example code for this, but it is possible. You'll need to do a few things:

  1. Get a dataset to train on. The original BERT uses Wikipedia and BookCorpus.
  2. Set up the pre-training tasks. See section 3.1 of the paper, or create_pretraining_data.py for how this was done in Python (a minimal masking sketch follows below).
  3. Initialize the learnable parameters. To use our bert.model implementation you need a struct of parameters in the same format as the Parameters field of the struct that bert() returns. The original weight initialization scheme is detailed in modeling.py.
  4. Write the pre-training script. This is similar to FineTuneBERT.m, but you'll need to tweak the configuration (mini-batch size, number of epochs, learn rate), and to replicate run_pretraining.py you also need extra pieces in the training loop, such as learn rate warmup/decay and gradient clipping (as in optimization.py); a rough sketch of those appears at the end of this comment. If you're attempting to train at scale you probably want to adapt the training loop to use multiple GPUs as in this example.

It is worth being aware of the conclusions in section 5 of RoBERTa, particularly for setting up the pre-training tasks in step 2.
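For step 2, here is a minimal sketch (not from this repo) of the masked language modeling corruption from section 3.1: 15% of positions are selected, and of those 80% become [MASK], 10% become a random token, and 10% stay unchanged. The function name and the maskTokenId/vocabSize inputs are placeholders for whatever your tokenizer provides:

function [corrupted, maskedPositions] = maskTokens(tokenIds, maskTokenId, vocabSize)
% tokenIds is a row vector of vocabulary indices for one packed sequence.
numTokens = numel(tokenIds);
maskedPositions = find(rand(1, numTokens) < 0.15);
corrupted = tokenIds;
for pos = maskedPositions
    r = rand;
    if r < 0.8
        corrupted(pos) = maskTokenId;      % 80%: replace with [MASK]
    elseif r < 0.9
        corrupted(pos) = randi(vocabSize); % 10%: replace with a random token
    end                                    % 10%: keep the original token
end
end

The next-sentence prediction labels (and whether to keep that task at all, given the RoBERTa findings) would be built alongside this when you pack sequence pairs.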

There's definitely a lot of work here, so I think we should keep this issue open as an enhancement to add a pre-training script.
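For step 4, a minimal sketch of the learn rate schedule and gradient clipping, assuming a custom training loop that computes a gradients struct with dlfeval. The numbers are placeholders (not the original BERT settings), and note that optimization.py clips by the global norm across all parameters, whereas this clips each parameter separately:

peakLearnRate = 1e-4;
numWarmupSteps = 1e4;
numTrainSteps = 1e6;
maxGradNorm = 1;

% Linear warmup to the peak learn rate, then linear decay to zero.
learnRateAtStep = @(step) peakLearnRate * min( step/numWarmupSteps, ...
    max(0, (numTrainSteps - step)/(numTrainSteps - numWarmupSteps)) );

% Per-parameter L2-norm clipping, applied to the gradients struct
% inside the custom training loop:
clipGradient = @(g) g .* min(1, maxGradNorm./max(sqrt(sum(g.^2,"all")), eps));
% gradients = dlupdate(clipGradient, gradients);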

@micklexqg


The given example builds the BERT model from a pretrained parameter struct. How can I create a new BERT model without loading the pretrained weights? Step 3 above points in that direction, but bert() still returns the loaded struct, so how do I get a struct of parameters in the same format as the Parameters field of the struct that bert() returns? Should I define the struct myself first? Any example scripts would be appreciated.

@bwdGitHub (Collaborator)

If you can use the same initializer for every parameter then the quickest thing you can do is something like:

mdl = bert;
% write an initializer function that 
% takes an existing dlarray parameter as input
% and returns a dlarray parameter of the same size.
initializer = @(w) 0.1*randn(size(w),"like",w);
mdl.Parameters.Weights = dlupdate(initializer,mdl.Parameters.Weights);

This is a little limited if you need different initializers for the embeddings, linear layers, layer norms, and so on. In that case I would write a suite of functions to initialize the struct, which might start like this:

function weights = initializeBert()
weights = struct(...
  "embeddings",initializeEmbeddings(),...
  "encoder_layers",initializeEncoderLayers());
end

function weights = initializeEmbeddings()

% The numbers here are sizes from bert-base
weights = struct(...
  "LayerNorm", initializeLayerNorm(768),...
  "position_embeddings", initializeEmbedding(768,512),...
  "token_type_embeddings", initializeEmbedding(768,2),...
  "word_embeddings", initializeEmbedding(768,30522));
end

% etc.

You have to implement initializeEmbedding, initializeLayerNorm, and initializeEncoderLayers. It takes some time, but luckily each encoder layer has the same structure, so you can write a loop that initializes all of them with a single initializeEncoderLayer implementation.
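As a rough sketch of what two of those helpers could look like (the "gamma"/"beta" field names, the single-precision dlarray format, and the 0.02 standard deviation are assumptions to check against the Parameters struct of a pretrained bert() and against modeling.py):

function weights = initializeEmbedding(embeddingSize, vocabSize)
% Embedding matrix drawn from a zero-mean normal with a small standard deviation.
weights = dlarray(0.02*randn(embeddingSize, vocabSize, "single"));
end

function weights = initializeLayerNorm(embeddingSize)
% Layer norm starts out as the identity transform.
weights = struct(...
  "gamma", dlarray(ones(embeddingSize, 1, "single")),...
  "beta", dlarray(zeros(embeddingSize, 1, "single")));
end

initializeEncoderLayers would then loop over a single initializeEncoderLayer that builds the attention and fully connected sub-structs in the same way, again mirroring the field names and sizes of the pretrained struct.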

@micklexqg


Thanks, that is clear. Since the BERT model has been trained, there must be model-creation scripts somewhere, so I wonder why the demo does not include a function for creating the model. More generally, if we want to create a different transformer model, would we need to implement something like createParameterStruct()? Demos that include functions for creating general transformer models would help popularize the toolbox.

@bwdGitHub (Collaborator)

For the record, we didn't train this ourselves, we imported the pre-trained weights for the original BERT models. That's why we didn't need to initialize the model ourselves, and don't have a nice pre-training demo.

I agree it would be nice for us to add initializer functions for the parameters that the layers in transformer.layer need; typically we rely on built-in layers and dlnetwork to handle this for us.

Could you describe what you mean by a general transformer? I know of the BERT encoder-only type, GPT-2 decoder-only type, and encoder-decoder type like the original. Is there something else beyond those?

@micklexqg


Sorry for being unclear. By "general transformer" I mean creating custom transformer models from the basic building blocks, just like building different kinds of CNNs.
