Corpify is a project aimed at developing a "corpy" textual style-transfer model, which rewrites informal, casual English text in a more professional, workplace-appropriate style. The repository contains:
- The Corpify dataset, a parallel dataset for training and evaluating models on the corpy textual style-transfer task, available through HuggingFace (a loading sketch follows this list): https://huggingface.co/datasets/maayans/Corpify_Dataset/.
- The code for fine-tuning and evaluating language models on this style-transfer task. We've made two of our fine-tuned models accessible via HuggingFace: https://huggingface.co/noystl/corpify-flan-large, https://huggingface.co/noystl/corpify_t5_large
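As a quick start, the dataset can be loaded directly with the datasets library. The lines below are a minimal sketch and make no assumptions about split or column names; check the dataset card for the exact schema:

# Minimal sketch: load the Corpify parallel dataset from the HuggingFace Hub.
# No split or column names are assumed here -- see the dataset card for the schema.
from datasets import load_dataset

dataset = load_dataset("maayans/Corpify_Dataset")
print(dataset)                          # lists the available splits and their columns
first_split = next(iter(dataset.values()))
print(first_split[0])                   # shows one parallel (casual -> corpy) example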
1. Clone the repository:
   git clone https://github.com/maayansharon10/Corpify.git
2. Enter the repository:
   cd Corpify
3. Create a virtual environment:
   python3 -m venv myenv
4. Activate the virtual environment:
   source myenv/bin/activate.csh
5. Install requirements:
   pip install -r requirements.txt
Alternatively, on the cluster, you can run the build_env.sh script, which performs steps 3-5 for you:
chmod +x ./build_env.sh
sbatch ./build_env.sh
We maintain comprehensive documentation for all our experiments using Weights and Biases. To get started with running a new experiment, please follow these steps:
- Register and create a new project on Weights and Biases by following the instructions in the quickstart guide: https://docs.wandb.ai/quickstart.
- After creating the project, log in to your Weights and Biases account.
- In the config.json file, include your project name under the training key, in the following format (a short sketch of how this field is picked up follows these steps):
  "training": {
      ...
      "wandb_project": "YOUR-PROJECT-NAME",
      ...
  }
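For reference, a run can then be initialized from this field with the wandb Python API. This is only a sketch of the pattern; the actual wiring inside main.py may differ:

# Sketch: read the project name from config.json and start a Weights and Biases run.
# The "training"/"wandb_project" keys match the snippet above; the rest is illustrative.
import json
import wandb

with open("config.json") as f:
    config = json.load(f)

run = wandb.init(project=config["training"]["wandb_project"])
run.finish()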
To bypass the use of Weights and Biases entirely, you can run the following command inside the virtual environment:
wandb disabled
If you are working on the cluster, remember to uncomment the relevant line in run.sh to disable Weights and Biases.
All jobs are defined using a JSON configuration file, which contains all the necessary parameters for the job. An example configuration file can be found in config.json. To execute a job, pass the configuration file as an argument to the main.py script:
python3 main.py --config-file config.json
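For reference, the entry point only needs to parse this flag and read the JSON file. The sketch below shows that pattern; the real main.py is likely structured differently:

# Sketch of the --config-file entry-point pattern shown above.
# Only the flag name comes from the command line above; the rest is illustrative.
import argparse
import json

def load_config():
    parser = argparse.ArgumentParser()
    parser.add_argument("--config-file", required=True,
                        help="Path to the JSON job configuration")
    args = parser.parse_args()
    with open(args.config_file) as f:
        return json.load(f)

if __name__ == "__main__":
    config = load_config()
    print(sorted(config.keys()))  # e.g. job_mode, model, training, ...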
For cluster environments, an alternative method is available:
1. Make the run script executable:
   chmod +x ./run.sh
2. Submit the job to the cluster using the run script:
   sbatch ./run.sh
The run.sh script takes care of activating the virtual environment, adjusting the default cache path of Hugging-Face libraries, and running main.py with config.json as the configuration file.
We currently support the following models as a base for fine-tuning:
- t5-large (https://huggingface.co/t5-large): the default T5-large model.
- t5-detox (https://huggingface.co/s-nlp/t5-paranmt-detox): T5 fine-tuned on ParaNMT (a dataset of English-English paraphrasing, filtered for the task of detoxification).
- t5-formal (https://huggingface.co/Isotonic/informal_to_formal): T5-base fine-tuned on the GYAFC (informal-formal) dataset.
- flan-large (https://huggingface.co/google/flan-t5-large): the default FLAN-large model.
- bart-large (https://huggingface.co/facebook/bart-large): the default BART-large model.
- bart-detox (https://huggingface.co/s-nlp/bart-paranmt-detox): BART-base trained on the ParaDetox (toxic → not-toxic) dataset.
The model is defined in the configuration file under the model key. For example:
"model": {
"value": "t5-large",
"choices": [
"t5-detox",
"t5-formal",
"t5-large",
"bart-detox",
"bart-large",
"flan-large"
]
}
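For reference, these identifiers correspond to the Hugging Face checkpoints listed above. The sketch below restates that mapping, taken from the links in the model list, and loads one of them; the names used internally by the code may differ:

# Sketch: map the config's model identifier to its Hugging Face checkpoint and load it.
# The mapping restates the checkpoints listed in this README; internal naming may differ.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

CHECKPOINTS = {
    "t5-large": "t5-large",
    "t5-detox": "s-nlp/t5-paranmt-detox",
    "t5-formal": "Isotonic/informal_to_formal",
    "flan-large": "google/flan-t5-large",
    "bart-large": "facebook/bart-large",
    "bart-detox": "s-nlp/bart-paranmt-detox",
}

checkpoint = CHECKPOINTS["t5-large"]
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)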
We have made our two best fine-tuned models available for download. You can find them here:
https://huggingface.co/noystl/corpify-flan-large
https://huggingface.co/noystl/corpify_t5_large
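As a quick usage sketch (not taken from the repository), either model can be loaded with transformers and used to rewrite a sentence. Whether a task prefix is required, and the generation settings below, are assumptions; check the model cards:

# Sketch: corpify a sentence with one of the released checkpoints.
# The absence of a task prefix and the generation settings are assumptions.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "noystl/corpify-flan-large"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

text = "hey, this is totally wrong, you need to fix it asap"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64, num_beams=4)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))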
The job mode is defined in the configuration file under the job_mode key. The following modes are available:
In the first mode, the model is trained and evaluated using the default hyperparameters.
Note: This mode is not supported for BART models.
In the second mode, hyperparameter optimization is performed using the Optuna library. The supported hyperparameters are:
- weight_decay
- num_train_epochs
- per_device_train_batch_size
- learning_rate
The allowed values for each hyperparameter are defined in the configuration file under the hpo key. For example:
"learning_rate": {
"type": "float",
"min": 1e-05,
"max": 1e-02
},
Optuna performs a randomized search by default, meaning that it makes multiple attempts (trials) to find the best hyperparameters, selecting the hyperparameters for each trial at random from the allowed values defined in the configuration file. The number of trials is defined in the configuration file under the hpo_trials key. For example:
"hpo_trials": 10,
The best trial is selected based on the evaluation loss, and a new training and evaluation session is performed using the best hyperparameters.
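A minimal sketch of this kind of Optuna search over the four supported hyperparameters is shown below. Only the learning-rate range matches the example above; the other ranges and the dummy objective are illustrative placeholders, not the repository's actual settings:

# Sketch: Optuna search over the supported hyperparameters.
# Only the learning-rate range comes from the config example above; the other
# ranges and the dummy objective value are illustrative placeholders.
import optuna

def objective(trial):
    learning_rate = trial.suggest_float("learning_rate", 1e-05, 1e-02, log=True)
    weight_decay = trial.suggest_float("weight_decay", 0.0, 0.3)                        # assumed range
    num_train_epochs = trial.suggest_int("num_train_epochs", 1, 10)                     # assumed range
    batch_size = trial.suggest_categorical("per_device_train_batch_size", [4, 8, 16])   # assumed choices
    # In the real job, the model would be fine-tuned with these values and the
    # evaluation loss returned; a dummy value keeps this sketch runnable.
    return learning_rate + weight_decay / (num_train_epochs * batch_size)

study = optuna.create_study(direction="minimize",
                            sampler=optuna.samplers.RandomSampler())  # random sampling, lower eval loss is better
study.optimize(objective, n_trials=10)                                # matches "hpo_trials": 10
print(study.best_trial.params)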
The third mode evaluates a model from a given checkpoint, without additional training. The path to the checkpoint is defined in the configuration file under the initial_checkpoint key. For example:
"initial_checkpoint": "./results/2023-07-22_13_04_16/t5-large_best_checkpoint",
In the last mode, the default model checkpoint is downloaded from Hugging Face and evaluated on the test set.
"max_dups": 1,
"eval_size": 0.2,
max_dups
is the maximum number of examples with the same source sentence allowed in the data.
eval_size
is the portion of the data to be used for evaluation. Half of it is used for creating the dev-set and the
rest is used for the test-set.
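A sketch of what these two options imply, assuming the data is a list of (source, target) pairs; the repository's actual preprocessing may differ:

# Sketch: apply max_dups and eval_size as described above.
# Assumes `pairs` is a list of (source, target) examples; the real code may differ.
from collections import Counter

def split_data(pairs, max_dups=1, eval_size=0.2):
    # Keep at most `max_dups` examples per source sentence.
    counts = Counter()
    filtered = []
    for source, target in pairs:
        if counts[source] < max_dups:
            counts[source] += 1
            filtered.append((source, target))

    # eval_size of the data is held out; half of it becomes the dev set,
    # the other half the test set.
    n_eval = int(len(filtered) * eval_size)
    n_dev = n_eval // 2
    train = filtered[: len(filtered) - n_eval]
    dev = filtered[len(filtered) - n_eval : len(filtered) - n_dev]
    test = filtered[len(filtered) - n_dev :]
    return train, dev, test

pairs = [("gonna need that report", "I will need that report soon.")] * 5
print([len(split) for split in split_data(pairs)])  # duplicates beyond max_dups are dropped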
The tokenizers we use cannot process certain characters (ASCII code above 127 or below 32) that are sometimes introduced into automatically generated data. We avoided this issue by rephrasing the examples we created using GPT; however, to avoid breaking on newer data, the code drops examples containing illegal ASCII characters.
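A minimal sketch of that filter; the function name is illustrative, not the one used in the code:

# Sketch: drop examples containing characters outside the ASCII range 32-127.
def has_only_legal_ascii(text: str) -> bool:
    return all(32 <= ord(char) <= 127 for char in text)

examples = ["Please review the attached report.", "Great job \u2728"]
clean = [example for example in examples if has_only_legal_ascii(example)]
print(clean)  # the second example is dropped because of the non-ASCII character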
To examine our raw human annotations, or to re-run our human annotation evaluation process and examine the results, see the Evaluation_Data_And_Annotation directory, specifically the human_eval.ipynb file.
If you have any questions regarding this project, please don't hesitate to ask! You can contact us via our emails:
- Maayan Sharon - maayan.sharon@mail.huji.ac.il
- Nitzan Barzilay - nitzan.barzilay@mail.huji.ac.il
- Noy Sternlicht - noy.sternlicht@mail.huji.ac.il