Model Overview
This document provides an overview of the steps involved in developing the machine learning (ML) model. The source code includes multiple user-defined functions to ensure consistency across the ML models. For a more detailed explanation of these functions, please review the code in the 'Library.R' file.
The outline below follows the general 'Sampling' code, since it includes every step used in both the 'Base' and 'Cluster' code.
**Steps:**
1. **Install All Libraries**
- The 'Library.R' file loads all the required R libraries and the user-defined functions.
- Change the working directory to the required location, as sketched below.
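A minimal sketch of this setup step (the project path is a placeholder):
```r
# Minimal setup sketch; the path below is a placeholder.
setwd("/path/to/project")  # change the working directory to the required location
source("Library.R")        # loads the required packages and user-defined functions
```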
2. **Define Model Name**
- Specify the name of the model.
- 'trainmethod' should be the name of an ML algorithm available in the "caret" package.
- Define the validation data: use validation == 1 to evaluate the model's performance on the training data (to check for overfitting), and validation == 2 to evaluate its performance on the validation data (to assess performance on unseen data). See the sketch below.
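A hedged sketch of these definitions; the model name and the "rf" method are illustrative assumptions:
```r
# Illustrative values only; "rf" is one of the method names caret supports
# (run names(caret::getModelInfo()) for the full list).
model_name  <- "Sampling_rf"  # hypothetical label for this model run
trainmethod <- "rf"           # must be a caret method name
validation  <- 2              # 1 = evaluate on training data (overfitting check),
                              # 2 = evaluate on validation data (unseen data)
```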
3. **Get Dataset**
- The 'matchup_data_clean' dataframe contains river reflectance data and the corresponding riverine phosphorus measurements.
- Update the path to 'matchup_data_clean' so it points to the required location (see the sketch below).
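A sketch of loading the dataset, assuming it is stored as a CSV (the path and file format are placeholders):
```r
# Assumes a CSV file; adjust the path and the reader (e.g., readRDS) to match
# where and how 'matchup_data_clean' is actually stored.
matchup_data_clean <- read.csv("/path/to/matchup_data_clean.csv")
str(matchup_data_clean)  # reflectance bands plus riverine phosphorus values
```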
4. **Create Train and Validation Sets**
- The 'data_partition' function partitions the 'matchup_data_clean' dataset into 90% training and 10% validation data while accounting for the spatial distribution of gauge locations, the temporal distribution of gauge measurements, and the phosphorus magnitude.
- The 'create_train_validate' function separates the training and validation datasets. Adjust the 'validate_partition' value in 'train_test' and 'label_partition_values' for the model accuracy assessment data: use '1' for the training dataset and '2' for the validation dataset. See the sketch below.
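A sketch of the partitioning calls; the function names come from 'Library.R', but the argument names shown here are assumptions:
```r
# Hypothetical signatures -- the actual arguments are defined in 'Library.R'.
partitioned <- data_partition(matchup_data_clean)  # flags rows 90/10 by space, time, and magnitude

# Per the step above, 1 selects the training set and 2 the validation set.
train_data    <- create_train_validate(partitioned, validate_partition = 1)
validate_data <- create_train_validate(partitioned, validate_partition = 2)
```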
5. **Clustering Method**
- This method divides the dataset by high-level vegetation type into five clusters, labeled with the following values (refer to ERA5 for details):
  - 0: Barren
  - 2.99990844307448: Evergreen needleleaf
  - 5.00013733538828: Deciduous
  - 18.0000305189752: Mixed
  - 19: Interrupted
- This step creates 90% training and 10% validation datasets for each cluster (see the sketch below).
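A sketch of separating the clusters by label; the 'vegetation' column name is an assumption:
```r
# Illustrative only: the labels are the raw vegetation values listed above,
# and the 'vegetation' column name is assumed, not taken from the source.
cluster_labels <- c(0, 2.99990844307448, 5.00013733538828,
                    18.0000305189752, 19)
cluster_data <- lapply(cluster_labels, function(lbl) {
  matchup_data_clean[matchup_data_clean$vegetation == lbl, ]
})
```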
6. **Develop a Model**
- The 'Features.R' file includes 'feature_1', a list of band combinations used for developing the ML model.
- The 'grid_best' function assists in hyperparameter tuning and in selecting the best combination.
- The 'sampling_train_data' function iteratively undersamples and oversamples the training dataset to improve on the previous ML model's R-squared value. Because the process is iterative, it may take a long time (days) to complete.
- The optimal value must be selected manually from the corresponding R2.x dataframe file (where x is the cluster number). The optimal parameters are selected in the 'Smote_train' and 'undersample_filter' functions.
- The 'train_model' function creates an ML model using the optimal parameters (see the sketch below).
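A hedged sketch of how these functions might be chained; the function names come from this step, but the argument names are assumptions:
```r
# Hypothetical call pattern -- the real signatures are in 'Library.R'.
best_grid   <- grid_best(train_data, trainmethod)          # hyperparameter search
sampled     <- sampling_train_data(train_data, best_grid)  # iterative under-/oversampling
final_model <- train_model(sampled, trainmethod, best_grid)
```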
7. **Model Performance Evaluation**
- The 'prediction_function' estimates the model's performance using the validation dataset.
- The 'print_result' function generates a .txt output file containing the model performance. The file is created in the same directory as the source code. See the sketch below.
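A sketch of the evaluation calls, again with assumed argument names; the choice of dataset follows the validation flag defined in step 2:
```r
# Hypothetical call pattern; the real signatures are in 'Library.R'.
eval_data <- if (validation == 1) train_data else validate_data
preds <- prediction_function(final_model, eval_data)
print_result(preds, model_name)  # writes the .txt performance report
```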