Below is an outline of how to reproduce my solution for the Recencia Papers competition. If you run into any trouble with the setup or code, or have any questions, please contact me at f.nunezb@gmail.com.
- models.zip : trained model weights. Can also be downloaded from GDrive
- recencia_papers : GitHub repo with code and documentation
- I used a standard Google Colab notebook instance with GPU enabled (whichever GPU Google provided at the time of the run).
There are 2 Notebook examples saved in the examples folder.
- Recencia_Complete_demo.ipynb: preprocessing, training and prediction for the competition data.
- Recencia_FinalTest.ipynb: final test prediction.
- Data preprocessing (generate embeddings):
  - script: prepare_data.py
  - arguments:
    - input_file: path to the train.csv (or test.csv) file
    - max_length: controls how much of each abstract is encoded, in tokens. Shorter texts are padded and longer ones are truncated. Some models won't accept more than 512 tokens. Values between 180 and 250 gave better results.
    - output_dir: folder in which to store the resulting embeddings.
    - do_train/do_predict: whether you are generating embeddings for the training or the test set. This matters for the naming of the output files.
Training set
!python /content/recencia_papers/prepare_data.py \
--input_file '/content/train.csv' \
--max_length 250 \
--output_dir '/content/embeddings/' \
--do_train
Test set (note the different input file and the do_predict flag)
!python /content/recencia_papers/prepare_data.py \
--input_file '/content/test.csv' \
--max_length 250 \
--output_dir '/content/embeddings/' \
--do_predict
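To make the effect of max_length concrete, here is a minimal sketch of padding and truncation on a list of token ids. It is only an illustration: the function name `pad_or_truncate` and the padding id `0` are assumptions, and prepare_data.py may well delegate this step to its tokenizer, which typically also keeps the special end-of-sequence token when truncating.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return exactly max_length ids: cut long inputs, right-pad short ones.

    pad_id=0 is an assumed padding id; real tokenizers define their own.
    """
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    clipped = list(token_ids[:max_length])          # truncate if too long
    padding = [pad_id] * (max_length - len(clipped))  # empty if already full
    return clipped + padding


short_text = pad_or_truncate([101, 7592, 102], max_length=5)
long_text = pad_or_truncate(list(range(10)), max_length=5)
print(short_text)  # [101, 7592, 102, 0, 0]
print(long_text)   # [0, 1, 2, 3, 4]
```

Every encoded abstract ends up with the same length, which is why a larger max_length keeps more text but also costs more memory and time per sample.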
- Model training (CatBoost):
  - script: train.py
  - arguments:
    - input_file: path to the train.csv file (usually the same as above)
    - emb_folder: folder where the training text embeddings have been saved
    - output_dir: folder in which to store the CatBoost model weights
    - seed: random seed; defaults to 0, but any other integer can be used
!python /content/recencia_papers/train.py \
--input_file '/content/train.csv' \
--emb_folder '/content/embeddings/' \
--output_dir '/content/models/' \
--seed 0
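The training flags listed above can be pictured as a small command-line interface. The sketch below is a hypothetical argparse parser that mirrors those flags. It is not taken from train.py, and whether the real script marks the paths as required is an assumption. It mainly shows that leaving out `--seed` falls back to 0.

```python
import argparse


def build_train_parser():
    # Hypothetical parser mirroring the documented flags; not the repo's code.
    parser = argparse.ArgumentParser(
        description="Train the CatBoost model on text embeddings"
    )
    parser.add_argument("--input_file", required=True, help="path to train.csv")
    parser.add_argument("--emb_folder", required=True,
                        help="folder with the training text embeddings")
    parser.add_argument("--output_dir", required=True,
                        help="folder for the CatBoost model weights")
    parser.add_argument("--seed", type=int, default=0,
                        help="random seed (defaults to 0)")
    return parser


args = build_train_parser().parse_args([
    "--input_file", "/content/train.csv",
    "--emb_folder", "/content/embeddings/",
    "--output_dir", "/content/models/",
])
print(args.seed)  # 0, because --seed was omitted
```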
- Generate predictions:
  - script: predict.py
  - arguments:
    - input_file: path to the test.csv/final_test.csv file
    - emb_folder: folder where the test text embeddings have been saved
    - models_folder: folder where the CatBoost model weights have been saved
    - output_dir: folder in which to save the final predictions
    - do_predict
!python /content/recencia_papers/predict.py \
--input_file '/content/test.csv' \
--emb_folder '/content/embeddings/' \
--models_folder '/content/models/' \
--output_dir '/content/predictions/'
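The order of the steps matters: embeddings have to exist before training, and both the test embeddings and the trained models have to exist before prediction. As a rough sketch, the commands above could be chained from Python as shown below. The helpers `build_command` and `run_pipeline` are hypothetical names introduced here, the paths are the ones used in the examples above, and the default is a dry run that only prints the commands.

```python
import shlex
import subprocess

REPO = "/content/recencia_papers"  # repo location used in the commands above


def build_command(script, **flags):
    """Turn keyword arguments into a python command line (list of tokens)."""
    cmd = ["python", f"{REPO}/{script}"]
    for name, value in flags.items():
        if value is True:  # boolean switch such as --do_train
            cmd.append(f"--{name}")
        else:
            cmd.extend([f"--{name}", str(value)])
    return cmd


def run_pipeline(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # stop at the first failing step


pipeline = [
    build_command("prepare_data.py", input_file="/content/train.csv",
                  max_length=250, output_dir="/content/embeddings/", do_train=True),
    build_command("prepare_data.py", input_file="/content/test.csv",
                  max_length=250, output_dir="/content/embeddings/", do_predict=True),
    build_command("train.py", input_file="/content/train.csv",
                  emb_folder="/content/embeddings/", output_dir="/content/models/", seed=0),
    build_command("predict.py", input_file="/content/test.csv",
                  emb_folder="/content/embeddings/", models_folder="/content/models/",
                  output_dir="/content/predictions/"),
]

run_pipeline(pipeline)  # dry run: prints the four commands in order
```

Calling `run_pipeline(pipeline, dry_run=False)` would execute the steps one after another and raise as soon as one of them exits with an error.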