
Par_bn and mam_adapter got "eval_matthews_correlation: 0.0" on GLUE-CoLA #618

Status: Closed
Labels: question (Further information is requested)

wenjingk-xilinx opened this issue Dec 15, 2023 · 2 comments


wenjingk-xilinx commented Dec 15, 2023

Environment info

  • adapters version: 0.1.0
  • Platform: linux
  • Python version: 3.8
  • PyTorch version (GPU?): 2.1.1+gpu
  • Tensorflow version (GPU?): -
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: no

Information

Model I am using (Bert, XLNet ...): Bert-large

Language I am using the model on (English, Chinese ...): English

Adapter setup I am using (if any): I use setup_adapter_training(model, adapter_args, data_args.task_name or "glue"), i.e.:

# for Par_bn
setup_adapter_training(model, "par_bn", data_args.task_name or "glue")
# for mam_adapter
setup_adapter_training(model, "mam", data_args.task_name or "glue")

The problem arises when using:

  • my own modified scripts: (give details below)

model=bert-large-uncased
adapter=mam  # or par_bn
TASK_NAME=cola
python run_glue.py \
  --model_name_or_path $model \
  --task_name $TASK_NAME \
  --do_train \
  --do_eval \
  --max_seq_length 128 \
  --per_device_train_batch_size 32 \
  --learning_rate 1e-4 \
  --num_train_epochs 20.0 \
  --evaluation_strategy "epoch" \
  --output_dir ./save/ \
  --overwrite_output_dir \
  --train_adapter \
  --adapter_config ${adapter}

The task I am working on is:

  • an official GLUE/SQuAD task: GLUE (CoLA)

To reproduce

Steps to reproduce the behavior:

  1. Use this run_glue.py and run it with the script listed above.

Expected behavior

A nonzero eval_matthews_correlation on CoLA. Instead, the metric stays at 0.0 throughout training:

$ python run_script.py
/proj/ossdataset1/wenjingk/peft/adapters/run_adapters.sh: line 18: cd: adapters: No such file or directory
Using the WANDB_DISABLED environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
12/15/2023 17:06:33 - WARNING - main - Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: True, 16-bits training: False
/proj/ossdataset1/wenjingk/anaconda3/envs/llm/lib/python3.8/site-packages/datasets/load.py:2088: FutureWarning: 'use_auth_token' was deprecated in favor of 'token' in version 2.14.0 and will be removed in 3.0.0.
You can remove this warning by passing 'token=<use_auth_token>' instead.
warnings.warn(
[WARNING|modeling_utils.py:3952] 2023-12-15 17:06:41,336 >> Some weights of BertAdapterModel were not initialized from the model checkpoint at bert-large-uncased and are newly initialized: ['heads.default.3.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
12/15/2023 17:06:44 - WARNING - accelerate.utils.other - Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
 10%| 536/5360 [05:10<34:51, 2.31it/s]
{'eval_loss': 0.6233669519424438, 'eval_matthews_correlation': 0.0, 'eval_runtime': 10.1056, 'eval_samples_per_second': 103.21, 'eval_steps_per_second': 12.963, 'epoch': 1.0}
 15%| 804/5360 [07:50<32:58, 2.30it/s]
 20%| 1072/5360 [10:32<30:59, 2.31it/s]
{'eval_loss': 0.618126392364502, 'eval_matthews_correlation': 0.0, 'eval_runtime': 10.0855, 'eval_samples_per_second': 103.415, 'eval_steps_per_second': 12.989, 'epoch': 3.0}
 25%| 1340/5360 [13:11<29:06, 2.30it/s]
 30%| 1608/5360 [15:52<27:10, 2.30it/s]
{'eval_loss': 0.6205015182495117, 'eval_matthews_correlation': 0.0, 'eval_runtime': 10.0833, 'eval_samples_per_second': 103.439, 'eval_steps_per_second': 12.992, 'epoch': 5.0}
 35%| 1876/5360 [18:32<25:13, 2.30it/s]
 40%| 2144/5360 [21:14<23:17, 2.30it/s]
{'eval_loss': 0.6273216009140015, 'eval_matthews_correlation': 0.0, 'eval_runtime': 10.0745, 'eval_samples_per_second': 103.529, 'eval_steps_per_second': 13.003, 'epoch': 7.0}
 45%| 2412/5360 [23:53<21:16, 2.31it/s]
 50%| 2680/5360 [26:35<19:23, 2.30it/s]
{'eval_loss': 0.6188081502914429, 'eval_matthews_correlation': 0.0, 'eval_runtime': 10.0585, 'eval_samples_per_second': 103.694, 'eval_steps_per_second': 13.024, 'epoch': 9.0}
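
(For context, an eval_matthews_correlation of exactly 0.0 on CoLA usually means the classifier is predicting the same label for every validation example, so the correlation degenerates. A quick sanity check with scikit-learn, assuming it is installed, illustrates this:)

# Illustration only: constant predictions yield a Matthews correlation of 0.0
from sklearn.metrics import matthews_corrcoef

labels = [1, 0, 1, 1, 0, 1]   # toy ground-truth binary labels (CoLA-style)
preds = [1, 1, 1, 1, 1, 1]    # degenerate classifier: always the majority class
print(matthews_corrcoef(labels, preds))  # 0.0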

wenjingk-xilinx added the "bug: Something isn't working" label on Dec 15, 2023
wenjingk-xilinx changed the title from "Par_bn and MAM got 'eval_matthews_correlation': 0.0 on GLUE-Cola" to "Par_bn and mam_adapter got 'eval_matthews_correlation': 0.0 on GLUE-Cola" on Dec 15, 2023
calpt self-assigned this on Dec 16, 2023

calpt (Member) commented Dec 16, 2023

Hey @wenjingk-xilinx,

From the script parameters you shared, it seems you're using a bert-large checkpoint. For successful training, you might need to tune the training hyperparameters for this model a bit, e.g. by lowering the learning rate or enabling warmup steps to help the model converge.

I was able to get a Matthews coefficient of ~0.546 after 1 epoch with your training setup (par_bn config) by lowering the learning rate to 5e-5 and setting --warmup_steps 200.
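
For reference, the same invocation as in your script with only those two settings changed would look roughly like this:

python run_glue.py \
  --model_name_or_path bert-large-uncased \
  --task_name cola \
  --do_train \
  --do_eval \
  --max_seq_length 128 \
  --per_device_train_batch_size 32 \
  --learning_rate 5e-5 \
  --warmup_steps 200 \
  --num_train_epochs 20.0 \
  --evaluation_strategy "epoch" \
  --output_dir ./save/ \
  --overwrite_output_dir \
  --train_adapter \
  --adapter_config par_bn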

Hope this helps!

calpt added the "question: Further information is requested" label and removed the "bug: Something isn't working" label on Dec 16, 2023
wenjingk-xilinx (Author) commented

Hi @calpt,
Yes, I tried a lower learning rate (lr=5e-6) and got eval_matthews_correlation = 0.502 after one epoch. Thanks for your quick reply!
