Added changes for Multiple scripts execution + Documentation #151

anshuman-raina · 2025-01-26T05:39:43Z

Motivation and Context

Please include relevant motivation and context of the problem along with a short summary of the solution.

Changes

Please provide a detailed bullet point list of your changes.

Added new benchmarking dataset benchmarking
Added Documentation for Benchmarking

Note : Before merge, ensure changing branch name in benchmark.yml

Testing

Please describe any unit tests you added or modified to verify your changes.

Checklist Before Requesting a Review

I have read the MSstats contributing guidelines
My changes generate no new warnings
Any dependent changes have been merged and published in downstream modules
I have run the devtools::document() command after my changes and committed the added files

github-actions · 2025-01-26T05:40:38Z

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🧪 No relevant tests
🔒 No security concerns identified
⚡ Recommended focus areas for review Possible Dataset Configuration Issue The dataset configuration is hardcoded to use the first dataset from `scriptController.json`. This could cause issues if multiple datasets need to be processed or if the configuration changes. Consider making this more dynamic or configurable. config <- fromJSON("scriptController.json", simplifyVector = FALSE) dataset_config <- config$datasets[[1]] dataset_config <- as.list(dataset_config) cat("Processing Dataset:", dataset_config$name, "\n") cat("Dataset File Path:", dataset_config$file, "\n") Potential Performance Bottleneck The `calculateResult` function iterates over samples and filters proteins multiple times, which could be computationally expensive for large datasets. Consider optimizing this logic to improve performance. calculateResult <- function(summarized, label, samples) { model <- groupComparison("pairwise", summarized) comparisonResult <- model$ComparisonResult TP <- 0 FP <- 0 TN <- 0 FN <- 0 for (sample_name in names(samples)) { sample <- samples[[sample_name]] is_positive <- sample$type == "positive" filtered_proteins <- comparisonResult %>% filter(grepl(sample$pattern, Protein)) if (is_positive) { TP <- TP + nrow(filtered_proteins %>% filter(adj.pvalue < 0.05)) FN <- FN + nrow(filtered_proteins %>% filter(adj.pvalue >= 0.05)) } else { FP <- FP + nrow(filtered_proteins %>% filter(adj.pvalue < 0.05)) TN <- TN + nrow(filtered_proteins %>% filter(adj.pvalue >= 0.05)) } Script Execution Order The SLURM script executes multiple R scripts sequentially. If dependencies exist between these scripts, ensure the order of execution is correct and any errors are handled gracefully. R_SCRIPTS=("benchmark.R" "benchmark1.R") for script in $R_SCRIPTS; do echo "Executing script: $script" Rscript "$script" done

github-actions · 2025-01-26T05:40:54Z

PR Code Suggestions ✨

Explore these optional code suggestions:

Category	Suggestion	Score
Possible issue	Add file existence check for robustness Add error handling to ensure that the `dataset_config$file` path exists and is accessible before attempting to read it with `data.table::fread`. This will prevent runtime errors if the file is missing or the path is incorrect. benchmark/benchmark.R [20] +if (!file.exists(dataset_config$file)) { + stop(paste("File not found:", dataset_config$file)) +} fragpipe_raw <- data.table::fread(dataset_config$file) Suggestion importance[1-10]: 9 Why: Adding a file existence check ensures robustness by preventing runtime errors when the file path is invalid or inaccessible. This is a critical improvement for handling external dependencies.	9
	Validate MSstats format conversion output Add a check to ensure that the `msstats_format` object is not empty or NULL after conversion using `MSstatsConvert::FragPipetoMSstatsFormat`. This will prevent downstream errors if the conversion fails. benchmark/benchmark1.R [24] msstats_format = MSstatsConvert::FragPipetoMSstatsFormat(fragpipe_raw, use_log_file = FALSE) +if (is.null(msstats_format) \|\| nrow(msstats_format) == 0) { + stop("Error: MSstats format conversion failed or returned empty data.") +} Suggestion importance[1-10]: 9 Why: Ensuring the `msstats_format` object is valid after conversion prevents downstream errors and ensures the integrity of the data processing pipeline. This is a crucial validation step.	9
	Validate sample patterns before filtering Add a check to ensure that the `samples` parameter contains valid and non-empty patterns before filtering `comparisonResult`. This will prevent potential errors or incorrect results when `samples` is improperly configured. benchmark/calculateMetrics.R [18] +if (is.null(sample$pattern) \|\| sample$pattern == "") { + stop(paste("Invalid pattern for sample:", sample_name)) +} filtered_proteins <- comparisonResult %>% filter(grepl(sample$pattern, Protein)) Suggestion importance[1-10]: 8 Why: Validating the sample patterns before filtering avoids potential errors or incorrect results when the `samples` parameter is misconfigured. This enhances the reliability of the function.	8
	Verify existence of R scripts before execution Add error handling to verify that each R script in the `R_SCRIPTS` array exists before attempting to execute it. This will avoid silent failures if a script is missing. benchmark/config.slurm [32-34] for script in $R_SCRIPTS; do + if [ ! -f "$script" ]; then + echo "Error: Script $script not found." + exit 1 + fi echo "Executing script: $script" Rscript "$script" done Suggestion importance[1-10]: 8 Why: Adding a check for the existence of R scripts before execution prevents silent failures and ensures that the intended scripts are available, improving script reliability.	8

benchmark/benchmark_Puyvelde2022-HYE5600735_LFQ.R

benchmark/README.md

benchmark/benchmark_Dowell2021-HEqe408_LFQ.R

benchmark/scriptController.json

tonywu1999 · 2025-02-06T23:47:59Z

benchmark/README.md

+ssh new_user@remote_server
+```
+
+If successful, the new user should be logged into the remote server.


As an extra step, we should make sure to include instructions to store the private key as a secret within the MSstats repo (i.e. settings -> secrets -> SSH_PRIVATE_KEY)

tonywu1999 · 2025-02-06T23:50:24Z

benchmark/benchmark_Puyvelde2022-HYE5600735_LFQ.R

+# Useful for future datasets where BioReplicate and Condition columns are missing
+
+# fragpipe_raw$Condition = unlist(lapply(fragpipe_raw$Run, function(x){
+#   paste(str_split(x, "\\_")[[1]][4:5], collapse="_")
+# }))
+
+# fragpipe_raw$BioReplicate = unlist(lapply(fragpipe_raw$Run, function(x){
+#   paste(str_split(x, "\\_")[[1]][4:7], collapse="_")
+# }))


These comments can be removed. There isn't a pre-defined universal format for run ID, we just got lucky that this dataset's format was in a parseable format.

tonywu1999 · 2025-02-06T23:53:56Z

benchmark/calculateMetrics.R

+    ## significant implies positives
+    ## insignificant implies negative


You can remove these comments. In the context of statistics & proteomics, "significant" means there was a noticeable shift in protein abundance outside of 95% confidence intervals. This shift in protein abundance can be a positive shift or a negative shift (which is why these comments can be confusing).

Added changes for Multiple scripts execution + Documentation

5044287

github-actions bot added the Review effort [1-5]: 4 label Jan 26, 2025

Anshuman Raina added 16 commits January 26, 2025 00:48

Run benchmark for current-branch

6dfe2f8

Added changes for scp file transfer

fc1543b

Correction in formatting

0214f07

Added flag for correction in scp

767b3c7

Dummy Commit

7443733

Renamed files and changes

ce5bf5f

Corrected json config

7e4438a

Changes to ensure output appended to job_output.txt

aa17d4a

Added flag

605258f

Updated job name

5edd0c9

Changed order

2505bca

Added changes for sequential script run

773ade0

Added changes

79496ae

Removed FDR check step

3b00e90

Changes for line break

5d8dd84

Documentation changes

cf88b75

tonywu1999 reviewed Jan 29, 2025

View reviewed changes

benchmark/benchmark_Puyvelde2022-HYE5600735_LFQ.R Outdated Show resolved Hide resolved