Skip to content

felipenunezb/recencia_papers

Repository files navigation

Recencia Papers

Below you can find a outline of how to reproduce my solution for the Recencia Papers competition. If you run into any trouble with the setup/code or have any questions please contact me at f.nunezb@gmail.com

ARCHIVE CONTENTS

  • models.zip : trained model weights. Can also be downloaded from GDrive
  • recencia_papers : Github repo with codes and documentation

HARDWARE/SOFTWARE: (The following specs were used to create the original solution)

  • I used a common Google Colab Notebook instance with GPU enabled (whichever Google provided at the time of run).

Example

There are 2 Notebook examples saved in the examples folder.

  1. Recencia_Complete_demo.ipynb: Preprocess, train and predictions for the competition data. Open in Google Colab

  2. Recencia_FinalTest.ipynb: Final test prediction. Open in Google Colab

MODEL BUILD:

  1. Data PreProcessing (Generate Embeddings):
  • function: prepare_data.py
  • arguments:
    • input_file: folder where the train.csv has been saved
    • max_length: this controls how much text from the abstracts will be encoded. If shorter, the text will be padded. If longer, it will be truncated. Some models won't allow more than 512 tokens. Better results between 180 and 250.
    • output_dir: folder to store the results embeddings.
    • do_train/do_test: whether you are generating embeddings for the training of test set. This is important for the output files naming.

Training set

!python /content/recencia_papers/prepare_data.py \
 --input_file '/content/train.csv' \
 --max_length 250 \
 --output_dir '/content/embeddings/' \
 --do_train

Test set (notice the difference in input file and do_predict)

!python /content/recencia_papers/prepare_data.py \
--input_file '/content/test.csv' \
--max_length 250 \
--output_dir '/content/embeddings/' \
--do_predict
  1. Model training (Catboost):
  • function: train.py
  • arguments:
    • input_file: folder where the train.csv has been saved (usually same as above)
    • emb_folder: folder where the training text embeddings have been saved
    • output_dir: folder where to store the Catboost model weights
    • seed: defaults to 0, but could be other integers
!python /content/recencia_papers/train.py \
  --input_file '/content/train.csv' \
  --emb_folder '/content/embeddings/' \
  --output_dir '/content/models/' \
  --seed 0
  1. Generate predictions:
  • function: predict.py
  • arguments:
    • input_file: folder where the test.csv/final_test.csv files has been saved
    • emb_folder: folder where the test text embeddings have been saved
    • models_folder: folder where the Catboost model weights have been saved
    • output_dir: folder where to save the final predictions
    • do_predict
!python /content/recencia_papers/predict.py \
  --input_file '/content/test.csv' \
  --emb_folder '/content/embeddings/' \
  --models_folder '/content/models/' \
  --output_dir '/content/predictions/'

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages