Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added changes for Multiple scripts execution + Documentation #151

Merged
merged 20 commits into from
Feb 8, 2025

Conversation

anshuman-raina
Copy link
Contributor

@anshuman-raina anshuman-raina commented Jan 26, 2025

Motivation and Context

Please include relevant motivation and context of the problem along with a short summary of the solution.

Changes

Please provide a detailed bullet point list of your changes.

  • Added new benchmarking dataset benchmarking
  • Added Documentation for Benchmarking

Note : Before merge, ensure changing branch name in benchmark.yml

Testing

Please describe any unit tests you added or modified to verify your changes.

Checklist Before Requesting a Review

  • I have read the MSstats contributing guidelines
  • My changes generate no new warnings
  • Any dependent changes have been merged and published in downstream modules
  • I have run the devtools::document() command after my changes and committed the added files

Copy link

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🧪 No relevant tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Possible Dataset Configuration Issue

The dataset configuration is hardcoded to use the first dataset from scriptController.json. This could cause issues if multiple datasets need to be processed or if the configuration changes. Consider making this more dynamic or configurable.

config <- fromJSON("scriptController.json", simplifyVector = FALSE)

dataset_config <- config$datasets[[1]]
dataset_config <- as.list(dataset_config)

cat("Processing Dataset:", dataset_config$name, "\n")
cat("Dataset File Path:", dataset_config$file, "\n")
Potential Performance Bottleneck

The calculateResult function iterates over samples and filters proteins multiple times, which could be computationally expensive for large datasets. Consider optimizing this logic to improve performance.

calculateResult <- function(summarized, label, samples) {
  model <- groupComparison("pairwise", summarized)
  comparisonResult <- model$ComparisonResult

  TP <- 0
  FP <- 0
  TN <- 0
  FN <- 0

  for (sample_name in names(samples)) {
    sample <- samples[[sample_name]]
    is_positive <- sample$type == "positive"

    filtered_proteins <- comparisonResult %>% filter(grepl(sample$pattern, Protein))

    if (is_positive) {
      TP <- TP + nrow(filtered_proteins %>% filter(adj.pvalue < 0.05))
      FN <- FN + nrow(filtered_proteins %>% filter(adj.pvalue >= 0.05))
    } else {
      FP <- FP + nrow(filtered_proteins %>% filter(adj.pvalue < 0.05))
      TN <- TN + nrow(filtered_proteins %>% filter(adj.pvalue >= 0.05))
    }
Script Execution Order

The SLURM script executes multiple R scripts sequentially. If dependencies exist between these scripts, ensure the order of execution is correct and any errors are handled gracefully.

R_SCRIPTS=("benchmark.R" "benchmark1.R")

for script in $R_SCRIPTS; do
    echo "Executing script: $script"
    Rscript "$script"
done

Copy link

PR Code Suggestions ✨

Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Score
Possible issue
Add file existence check for robustness

Add error handling to ensure that the dataset_config$file path exists and is
accessible before attempting to read it with data.table::fread. This will prevent
runtime errors if the file is missing or the path is incorrect.

benchmark/benchmark.R [20]

+if (!file.exists(dataset_config$file)) {
+  stop(paste("File not found:", dataset_config$file))
+}
 fragpipe_raw <- data.table::fread(dataset_config$file)
Suggestion importance[1-10]: 9

Why: Adding a file existence check ensures robustness by preventing runtime errors when the file path is invalid or inaccessible. This is a critical improvement for handling external dependencies.

9
Validate MSstats format conversion output

Add a check to ensure that the msstats_format object is not empty or NULL after
conversion using MSstatsConvert::FragPipetoMSstatsFormat. This will prevent
downstream errors if the conversion fails.

benchmark/benchmark1.R [24]

 msstats_format = MSstatsConvert::FragPipetoMSstatsFormat(fragpipe_raw, use_log_file = FALSE)
+if (is.null(msstats_format) || nrow(msstats_format) == 0) {
+  stop("Error: MSstats format conversion failed or returned empty data.")
+}
Suggestion importance[1-10]: 9

Why: Ensuring the msstats_format object is valid after conversion prevents downstream errors and ensures the integrity of the data processing pipeline. This is a crucial validation step.

9
Validate sample patterns before filtering

Add a check to ensure that the samples parameter contains valid and non-empty
patterns before filtering comparisonResult. This will prevent potential errors or
incorrect results when samples is improperly configured.

benchmark/calculateMetrics.R [18]

+if (is.null(sample$pattern) || sample$pattern == "") {
+  stop(paste("Invalid pattern for sample:", sample_name))
+}
 filtered_proteins <- comparisonResult %>% filter(grepl(sample$pattern, Protein))
Suggestion importance[1-10]: 8

Why: Validating the sample patterns before filtering avoids potential errors or incorrect results when the samples parameter is misconfigured. This enhances the reliability of the function.

8
Verify existence of R scripts before execution

Add error handling to verify that each R script in the R_SCRIPTS array exists before
attempting to execute it. This will avoid silent failures if a script is missing.

benchmark/config.slurm [32-34]

 for script in $R_SCRIPTS; do
+    if [ ! -f "$script" ]; then
+        echo "Error: Script $script not found."
+        exit 1
+    fi
     echo "Executing script: $script"
     Rscript "$script"
 done
Suggestion importance[1-10]: 8

Why: Adding a check for the existence of R scripts before execution prevents silent failures and ensures that the intended scripts are available, improving script reliability.

8

benchmark/README.md Outdated Show resolved Hide resolved
ssh new_user@remote_server
```

If successful, the new user should be logged into the remote server.
Copy link
Contributor

@tonywu1999 tonywu1999 Feb 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As an extra step, we should make sure to include instructions to store the private key as a secret within the MSstats repo (i.e. settings -> secrets -> SSH_PRIVATE_KEY)

Comment on lines 24 to 32
# Useful for future datasets where BioReplicate and Condition columns are missing

# fragpipe_raw$Condition = unlist(lapply(fragpipe_raw$Run, function(x){
# paste(str_split(x, "\\_")[[1]][4:5], collapse="_")
# }))

# fragpipe_raw$BioReplicate = unlist(lapply(fragpipe_raw$Run, function(x){
# paste(str_split(x, "\\_")[[1]][4:7], collapse="_")
# }))
Copy link
Contributor

@tonywu1999 tonywu1999 Feb 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These comments can be removed. There isn't a pre-defined universal format for run ID, we just got lucky that this dataset's format was in a parseable format.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment on lines 16 to 17
## significant implies positives
## insignificant implies negative
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can remove these comments. In the context of statistics & proteomics, "significant" means there was a noticeable shift in protein abundance outside of 95% confidence intervals. This shift in protein abundance can be a positive shift or a negative shift (which is why these comments can be confusing).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@anshuman-raina anshuman-raina merged commit 488cbc7 into devel Feb 8, 2025
1 check passed
@tonywu1999 tonywu1999 deleted the feature/multiple-scripts branch February 10, 2025 14:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants