This repository contains the official implementation associated with this paper. The corresponding dataset is publicly available here.
PotSim is a large-scale simulated agricultural dataset specifically designed for AI-driven research on potato cultivation. This dataset is grounded in real-world crop management scenarios and extrapolated to approximately 4.9 million hypothetical crop management scenarios. It encompasses diverse factors including varying planting dates, fertilizer application rates and timings, irrigation strategies, and 24 years of weather data. The resulting dataset comprises over 675 million daily simulation records, offering an extensive and realistic framework for agricultural AI research.
The repository contains three main files: `example.ipynb`, `plots.ipynb`, and `run.py`. To reproduce the train/test results presented in the paper, we provide `run.py`, which can be executed from a command-line interface or terminal. To follow a step-by-step procedure and work with our dataset, we provide `example.ipynb`, a Jupyter notebook template that acts as a starting point for further exploration. To make it easier to visualize and plot the results, we provide `plots.ipynb`, a Jupyter notebook template that contains a few example plots and can be edited according to your requirements.
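Several of the models in this repository (LSTM, TCN, Transformer) consume fixed-length windows of daily records, controlled by the `-sl`/`--seq_len` flag described below. As a minimal sketch of that idea (not the repository's actual loader, which lives in `utils/potsimloader`), a sliding window over a daily series might look like:

```python
def make_windows(series, seq_len=15):
    """Split a daily time series into overlapping windows of length seq_len.

    Each window is one input sample for a sequence model; the value
    immediately following the window serves as the prediction target.
    This is an illustrative helper, not the repository's loader.
    """
    windows, targets = [], []
    for i in range(len(series) - seq_len):
        windows.append(series[i:i + seq_len])
        targets.append(series[i + seq_len])
    return windows, targets

# Stand-in for one simulated season's daily records.
daily = list(range(20))
X, y = make_windows(daily, seq_len=15)
print(len(X), len(X[0]))  # 5 windows, each of length 15
```

The default `seq_len=15` mirrors the default of the `-sl` flag used by `run.py`.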
| Directory Name | Description |
|---|---|
| `data` | Contains all datasets required for experiments and analyses. |
| `data/potsim_yearly` | Default location for yearly dataset files utilized in the study. |
| `models` | Houses all model architecture definitions and related scripts. |
| `outputs` | Default directory for saving model checkpoints, logs, and results generated during training. |
| `saves` | Stores pre-trained model states and checkpoints from experiments referenced in the paper. |
| `testing` | Includes scripts and functions for evaluating model performance and generating metrics. |
| `training` | Contains training routines, configuration files, and code for model optimization. |
| `utils` | Utility functions for data preprocessing, splitting, and model configuration management. |
| `utils/potsimloader` | Specialized utilities for efficient data loading and processing workflows. |
To install the requirements:

```shell
conda env create -f environment.yml
conda activate potsim_env
```
Depending on the version of CUDA on your system, install PyTorch v2.5.1 from the official PyTorch source at https://pytorch.org:

```shell
# Example for CUDA 12.4
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124
```
To enable on-GPU metrics and display model parameters clearly:

```shell
pip install torchmetrics==1.7.1 torchinfo==1.8
```
If your system is not set up with the `conda` package manager, please visit https://www.anaconda.com/download to install Miniconda according to your operating system, and then continue by installing the requirements as above.
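After installation, a quick sanity check can confirm that PyTorch is importable and whether CUDA is usable. This is an optional verification sketch, not part of the repository's scripts; it mirrors how a `-d`/`--device` default of `None` could be resolved:

```python
# Verify the PyTorch install and pick a device, falling back to CPU
# if PyTorch is not (yet) installed in the active environment.
try:
    import torch
    device = "cuda" if torch.cuda.is_available() else "cpu"
    print(f"PyTorch {torch.__version__} found, using device: {device}")
except ImportError:
    device = "cpu"
    print("PyTorch not installed yet, falling back to device: cpu")
```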
The script supports two main commands: `train` and `test`.
- Make sure your datasets are in the `.parquet` format and accessible to the script in the `data` folder.
- For more details on available target variables and models, check the code or add the `--help` flag:
```shell
python run.py --help
python run.py train --help
python run.py test --help
```
```shell
python run.py train -tgt -m [options]
```

Arguments:
| Argument | Type | Required | Default | Description |
|---|---|---|---|---|
| `-tgt`, `--target` | str | Yes | | Target variable to predict. Choices: see below |
| `-m`, `--model` | str | Yes | | Model type to use. Choices: see below |
| `-tdata`, `--train_dataset` | str | No | `train_split` | Training dataset split |
| `-vdata`, `--val_dataset` | str | No | `val_split` | Validation dataset split |
| `-bs`, `--batch_size` | int | No | `256` | Batch size |
| `-lr`, `--learning_rate` | float | No | `0.005` | Learning rate |
| `-ep`, `--epochs` | int | No | `100` | Maximum number of epochs |
| `-sl`, `--seq_len` | int | No | `15` | Sequence length (for sequence models) |
| `-d`, `--device` | str | No | `None` | Device: `cpu` or `cuda` |
Example:

```shell
python run.py train -tgt="NTotL1" -m="lstm" -tdata="train_split" -vdata="val_split" -bs=256 -lr=0.001 -ep=10 -sl=15 -d="cuda"
```
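The arguments above map naturally onto Python's standard `argparse` module. As an illustrative sketch only (the actual parser definitions live in `run.py` and may differ in detail), the `train` flags could be declared like this:

```python
import argparse

# Illustrative parser matching the documented train arguments;
# not the repository's actual implementation.
parser = argparse.ArgumentParser(prog="run.py train")
parser.add_argument("-tgt", "--target", required=True)
parser.add_argument("-m", "--model", required=True)
parser.add_argument("-tdata", "--train_dataset", default="train_split")
parser.add_argument("-vdata", "--val_dataset", default="val_split")
parser.add_argument("-bs", "--batch_size", type=int, default=256)
parser.add_argument("-lr", "--learning_rate", type=float, default=0.005)
parser.add_argument("-ep", "--epochs", type=int, default=100)
parser.add_argument("-sl", "--seq_len", type=int, default=15)
parser.add_argument("-d", "--device", default=None)

# Parse the flags from the example command above.
args = parser.parse_args(["-tgt", "NTotL1", "-m", "lstm", "-lr", "0.001"])
print(args.target, args.model, args.learning_rate, args.batch_size)
```

Unspecified flags fall back to the defaults listed in the table, e.g. `batch_size=256` and `epochs=100`.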
```shell
python run.py test -tgt -m -data [options]
```

Arguments:
| Argument | Type | Required | Default | Description |
|---|---|---|---|---|
| `-tgt`, `--target` | str | Yes | | Target variable to predict. Choices: see below |
| `-m`, `--model` | str | Yes | | Model type to use. Choices: see below |
| `-data`, `--dataset` | str | Yes | | Dataset to run the test on |
| `-mdir`, `--model_dir` | str | No | `saves` | Directory where trained models are saved (`outputs` or `saves`) |
| `-bs`, `--batch_size` | int | No | `256` | Batch size |
| `-sl`, `--seq_len` | int | No | `15` | Sequence length (for sequence models) |
| `-d`, `--device` | str | No | `None` | Device: `cpu` or `cuda` |
Example:

```shell
python run.py test -tgt="NTotL1" -m="lstm" -data="test_split" -mdir="saves" -bs=256 -sl=15 -d="cuda"
```
R² Metrics for Different Models
| Target | CNN1D | Transformer | LSTM | LinearRegression | MLP | TCN |
|---|---|---|---|---|---|---|
| `NLeach` | 0.432 | -0.02 | 0.343 | 0.002 | 0.014 | 0.265 |
| `NPlantUp` | 0.803 | 0.733 | 0.794 | 0.322 | 0.753 | 0.791 |
| `NTotL1` | 0.843 | 0.764 | 0.831 | 0.481 | 0.779 | 0.823 |
| `NTotL2` | 0.861 | 0.799 | 0.849 | 0.489 | 0.792 | 0.843 |
| `SWatL1` | 0.973 | 0.949 | 0.972 | 0.620 | 0.841 | 0.950 |
| `SWatL2` | 0.944 | 0.783 | 0.928 | 0.700 | 0.816 | 0.914 |
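For reference, the R² (coefficient of determination) values above can be computed as 1 − SS_res / SS_tot; a score of 1 means a perfect fit, while a score at or below 0 means the model does no better than always predicting the mean. A minimal sketch (the repository's metrics are computed with `torchmetrics`, so this is illustrative only):

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean) ** 2 for t in y_true)               # total sum of squares
    return 1.0 - ss_res / ss_tot

print(r2_score([1, 2, 3, 4], [1, 2, 3, 4]))  # perfect fit -> 1.0
```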