This is source code/solution for 4th place (top 1%) in Private LB with score 0.84923 of [Student] Shopee Code League 2020 - Product Detection. Check Overview Archive and Leaderboard Archive if Kaggle is down or the link is invalid.
- This is work of Cocoa team, we DO NOT represent any organization.
- There's no reproducibility guarantee for notebook which uses GPU and TPU
- Although we use License "The Unlicense", dataset and generated dataset falls under Shopee Terms and Conditions which can be seen on Google Docs, Google Docs (2) or Internet Archive
Some of the notebook are run on different environment
Environment Name | Description |
---|---|
Kaggle CPU | 2C/4T CPU, 16GB RAM |
Kaggle GPU | 2C/4T CPU, 16GB RAM, Nvidia Tesla P100 |
Kaggle TPU | 2C/4T CPU, 16GB RAM, TPU v3-8 |
Local | 6C/12T CPU, 16GB RAM, 16GB Swapfile |
To ensure you can run jupyter notebook which runs on "Local" environment, make sure that :
- Use OS based on GNU/Linux
- Use Python 3.8
- Install required application & library
sudo apt update
sudo apt install enchant libenchant hunspell libhunspell hunspell-en-us hunspell-id
- Install required Python library
python3.8 -m pip install pandas matplotlib seaborn nltk pyenchant jellyfish scikit-learn scipy numpy pattern notebook
Jupyter notebook on
extra_notebook
directory isn't used on the competition
Filename | Link to Kaggle Kernel | Environment | Description |
---|---|---|---|
01a_ocr.ipynb | https://www.kaggle.com/ilosvigil/shopee-competition-2-ocr?scriptVersionId=37547674 | Kaggle GPU | Perform OCR on Train Images (0 - 26346) |
01b_ocr.ipynb | https://www.kaggle.com/ilosvigil/shopee-competition-2-ocr?scriptVersionId=37573844 | Kaggle GPU | Perform OCR on Train Images (26347 - 52691) |
01c_ocr.ipynb | https://www.kaggle.com/ilosvigil/shopee-competition-2-ocr?scriptVersionId=37601445 | Kaggle GPU | Perform OCR on Train Images (52695 - 79041) |
01d_ocr.ipynb | https://www.kaggle.com/williammulianto/shopee-2-ocr-train-41b3f4?scriptVersionId=37614455 | Kaggle GPU | Perform OCR on Train Images (79042 - 105390) |
01e_ocr.ipynb | https://www.kaggle.com/ilosvigil/shopee-competition-2-ocr?scriptVersionId=37631409 | Kaggle GPU | Perform OCR on Test Images |
02a_clean.ipynb | - | Local | Clean OCR text from train images |
02b_clean.ipynb | - | Local | Clean OCR text from text images |
03_tfidf.ipynb | - | Local | Create TF-IDF representation of cleaned text. Warning: use lots of RAM when creating parquest file. |
04a_tfrecord.ipynb | https://www.kaggle.com/ilosvigil/scl2020-2-4a-create-tfrecord-1-3?scriptVersionId=37853684 | Kaggle CPU | Create TFRecord file from train images & parquet dataset (0 - 52694) |
04b_tfrecord.ipynb | https://www.kaggle.com/ilosvigil/scl2020-2-4b-create-tfrecord-2-3?scriptVersionId=37853702 | Kaggle CPU | Create TFRecord file from train images & parquet dataset (52695 - 105390) |
04c_tfrecord.ipynb | https://www.kaggle.com/ilosvigil/scl2020-2-4c-create-tfrecord-3-3?scriptVersionId=37853715 | Kaggle CPU | Create TFRecord file from test images & parquet dataset |
05a_model.ipynb | https://www.kaggle.com/ilosvigil/scl2020-2-5-model?scriptVersionId=37953296 | Kaggle TPU | Create Multimodal Model. Some comment might not match code. |
05b_model.ipynb | https://www.kaggle.com/ilosvigil/shopee-competition-2?scriptVersionId=38054791 | Kaggle TPU | Create Multimodal Model |
The dataset is available on this repository and/or Kaggle datasets platform.
Created by | Directory path | Link to Kaggle datasets |
---|---|---|
Shopee | dataset/0_original_csv |
https://www.kaggle.com/c/student-shopee-code-league-sentiment-analysis/data |
01a_ocr.ipynb | dataset/1_ocr_text |
https://www.kaggle.com/ilosvigil/sc-21-ocr |
01b_ocr.ipynb | dataset/1_ocr_text |
https://www.kaggle.com/ilosvigil/sc-21-ocr |
01c_ocr.ipynb | dataset/1_ocr_text |
https://www.kaggle.com/ilosvigil/sc-21-ocr |
01d_ocr.ipynb | dataset/1_ocr_text |
https://www.kaggle.com/ilosvigil/sc-21-ocr |
01e_ocr.ipynb | dataset/1_ocr_text |
https://www.kaggle.com/ilosvigil/sc-21-ocr |
02a_clean.ipynb | dataset/2_with_cleaned_text |
- |
02b_clean.ipynb | dataset/2_with_cleaned_text |
- |
03_tfidf.ipynb | dataset/3_with_tfidf |
https://www.kaggle.com/ilosvigil/csv-with-cleaned-ocr-text |
04a_tfrecord.ipynb | - | https://www.kaggle.com/ilosvigil/tfrecords |
04b_tfrecord.ipynb | - | https://www.kaggle.com/ilosvigil/tfrecords-2 |
04c_tfrecord.ipynb | - | https://www.kaggle.com/ilosvigil/tfrecords-3 |
Notebook filename | Submission filename | Used for Final Score | Public LB | Private LB |
---|---|---|---|---|
05a_model.ipynb | submission.csv | Yes | 0.85325 | 0.84923 |
05b_model.ipynb | submission.csv | No | 0.84085 | 0.84339 |
05b_model.ipynb | submission_tta_0.csv | Yes | 0.84323 | 0.84244 |
05b_model.ipynb | submission_tta_all.csv | No | 0.84085 | 0.84339 |
This guide assume you have necessary files (full dataset provided by Shopee) and move it to correct directory path
- Run
01a_ocr.ipynb
,01b_ocr.ipynb
,01c_ocr.ipynb
,01d_ocr.ipynb
&01e_ocr.ipynb
on Kaggle Notebook - Rename output filename from each notebook
Notebook Filename | Old filename | New Filename |
---|---|---|
01a_ocr.ipynb | train2a.csv | train_ocr_1.csv |
01b_ocr.ipynb | train2a.csv | train_ocr_2.csv |
01c_ocr.ipynb | train2c.csv | train_ocr_3.csv |
01d_ocr.ipynb | train2d.csv | train_ocr_4.csv |
01e_ocr.ipynb | test2.csv | test_ocr.csv |
- Run
02a_clean.ipynb
&02b_clean.ipynb
- Run
03_tfidf.ipynb
- Run
04a_tfrecord.ipynb
,04b_tfrecord.ipynb
&04c_tfrecord.ipynb
- Run
05a_model.ipynb
or05b_model.ipynb