Running BUSCO with your own database #697

Closed
Thomieh73 opened this issue Oct 17, 2024 · 6 comments

@Thomieh73

Thomieh73 commented Oct 17, 2024

Description of the bug

Hi, I tried to use BUSCO for bin quality checking. Since our cluster does not allow downloads from inside a Slurm job, I downloaded the database and installed it myself. I did not use BUSCO to do this, as suggested in issue #545. But my BUSCO job fails in two ways: it does not seem to pick up the option --auto-lineage-prok, and it does not use the database I downloaded.

My params file looks like this:

{
		"input": "samplesheet_small_TP_reads.csv",
		"outdir": "\/cluster\/projects\/nn10070k\/projects\/phagedrive\/pd_data_control\/results\/20240916_MAG_results",
		"multiqc_title": "TP_cleaned_reads",
		"reads_minlength": 50,
		"igenomes_base" : "s3://ngi-igenomes/igenomes",
		"gtdb_db": "\/cluster\/projects\/nn10070k\/databases\/gtdbtk_r220_data.tar.gz", 
		"host_genome":"GRCh38",
		"kraken2_db": "\/cluster\/projects\/nn10070k\/databases\/kraken2_pluspfp_05.06.2024\/hash.k2d",
		"cat_db": "\/cluster\/projects\/nn10070k\/databases\/20240422_CAT_nr",
		"binqc_tool": "busco",
		"busco_db": "\/cluster\/shared\/biobases\/BUSCO\/2024-10-04",
		"busco_auto_lineage_prok": true,
		"busco_clean": true,
		"checkm_db": "\/cluster\/projects\/nn10070k\/databases\/checkm_db_2015.01.16",
		"refine_bins_dastool": true,
		"postbinning_input": "refined_bins_only",
		"run_virus_identification": false	
	}

When I check the log file from BUSCO I get this :

2024-10-17 09:23:16 INFO:       ***** Start a BUSCO v5.4.3 analysis, current time: 10/17/2024 09:23:16 *****
2024-10-17 09:23:16 INFO:       Configuring BUSCO with local environment
2024-10-17 09:23:16 INFO:       Mode is genome
2024-10-17 09:23:16 INFO:       Input file is /cluster/work/users/thhaverk/nf_mag/00/62b73b0fdc4a15c0b3e9bfa7b6270c/SPAdes-DASToolUnbinned-DNA_H1H_10_A1.fa
2024-10-17 09:23:16 INFO:       No lineage specified. Running lineage auto selector.

2024-10-17 09:23:16 INFO:       ***** Starting Auto Select Lineage *****
        This process runs BUSCO on the generic lineage datasets for the domains archaea, bacteria and eukaryota. Once the optimal domain is selected, BUSCO automatically attempts to find the most appropriate BUSCO dataset to use based on phylogenetic placement.
        --auto-lineage-euk and --auto-lineage-prok are also available if you know your input assembly is, or is not, an eukaryote. See the user guide for more information.
        A reminder: Busco evaluations are valid when an appropriate dataset is used, i.e., the dataset belongs to the lineage of the species to test. Because of overlapping markers/spurious matches among domains, busco matches in another domain do not necessarily mean that your genome/proteome contains sequences from this domain. However, a high busco score in multiple domains might help you identify possible contaminations.

I saw in the BUSCO manual that there is a flag called --offline that you can use when you provide your own database. I can see it in the .command.sh file, but it does not seem to take effect when BUSCO runs with my own database.

the .command.sh file looks like this:

#!/bin/bash -euo pipefail
run_busco.sh \
    "--auto-lineage-prok --offline --download_path 2024-10-04" \
    "Y" \
    "2024-10-04" \
    "SPAdes-DASToolUnbinned-DNA_H1H_10_A1.fa" \
    8 \
    "N" \
    "Y" \
    "--offline"

most_spec_db=$(<info_most_spec_db.txt)

cat <<-END_VERSIONS > versions.yml
"NFCORE_MAG:MAG:BUSCO_QC:BUSCO":
    python: $(python --version 2>&1 | sed 's/Python //g')
    R: $(R --version 2>&1 | sed -n 1p | sed 's/R version //' | sed 's/ (.*//')
    busco: $(busco --version 2>&1 | sed 's/BUSCO //g')
END_VERSIONS

# capture process environment
set +u
set +e
cd "$NXF_TASK_WORKDIR"

nxf_eval_cmd() {
    {
        IFS=$'\n' read -r -d '' "${1}";
        IFS=$'\n' read -r -d '' "${2}";
        (IFS=$'\n' read -r -d '' _ERRNO_; return ${_ERRNO_});
    } < <((printf '\0%s\0%d\0' "$(((({ shift 2; "${@}"; echo "${?}" 1>&3-; } | tr -d '\0' 1>&4-) 4>&2- 2>&1- | tr -d '\0' 1>&4-) 3>&1- | exit "$(cat)") 4>&1-)" "${?}" 1>&2) 2>&1)
}

echo '' > .command.env
#
echo most_spec_db="${most_spec_db[@]}" >> .command.env
echo /most_spec_db/ >> .command.env

I do not understand what the error is here. It looks like only the last part of the database path (2024-10-04) is being used.

Command used and terminal output

My nextflow command was:

nextflow run nf-core/mag -r 3.0.3 -profile apptainer -work-dir $USERWORK/nf_mag -resume -c saga_mag.simple.config -params-file params_test_2.json

Relevant files

No response

System information

Nextflow version: 24.04.3
Hardware: HPC
executor: Slurm
Container engine: Apptainer
OS : CentOS linux

@Thomieh73 Thomieh73 added the bug Something isn't working label Oct 17, 2024
@jfy133
Member

jfy133 commented Oct 25, 2024

Without really knowing how BUSCO works (sorry), I have a suspicion... could you try again but with -r busco-offline-fix?

I've tried adding --offline on line 33:

def p = params.busco_auto_lineage_prok ? "--auto-lineage-prok" : "--auto-lineage"
if ( "${lineage_dataset_provided}" == "Y" ) {
    p = "--offline --lineage_dataset dataset/${db}"
} else if ( "${lineage_dataset_provided}" == "N" ) {
    p += " --offline --download_path ${db}"
} else {
    lineage_dataset_provided = ""
}
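
To make the effect of that snippet easier to see, here is a minimal shell sketch that mirrors its branching and prints the argument string that would end up as the first argument of run_busco.sh. The function name busco_lineage_args and its three positional parameters are hypothetical and only for illustration; they are not part of nf-core/mag.

```shell
# Hypothetical helper mirroring the Groovy snippet above; not pipeline code.
# $1: "true" if busco_auto_lineage_prok is set, anything else otherwise
# $2: "Y" or "N" (lineage_dataset_provided); other values leave p unchanged
# $3: database name as it appears in the task work directory
busco_lineage_args() {
    auto_prok=$1
    provided=$2
    db=$3

    # Default: pick the auto-lineage flavour
    if [ "$auto_prok" = "true" ]; then
        p="--auto-lineage-prok"
    else
        p="--auto-lineage"
    fi

    if [ "$provided" = "Y" ]; then
        # A specific lineage dataset replaces the auto-lineage flag entirely
        p="--offline --lineage_dataset dataset/$db"
    elif [ "$provided" = "N" ]; then
        # A whole download folder is appended to the auto-lineage flag
        p="$p --offline --download_path $db"
    fi

    printf '%s\n' "$p"
}
```

Calling busco_lineage_args true N 2024-10-04 prints the same string that appears as the first argument in the .command.sh shown earlier in this thread.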

@Thomieh73
Author

Hi @jfy133, I have now started a run with your modification, -r busco-offline-fix.

I will let you know when I have a result.

@Thomieh73
Author

Hi, my run has finished, but it stopped at the BUSCO step with an error.

The error message:

Pipeline completed with errors-
ERROR ~ Error executing process > 'NFCORE_MAG:MAG:BUSCO_QC:BUSCO (MEGAHIT-DASToolUnbinned-DNA_H1H_30_C1.fa)'

Caused by:
  Process `NFCORE_MAG:MAG:BUSCO_QC:BUSCO (MEGAHIT-DASToolUnbinned-DNA_H1H_30_C1.fa)` terminated with an error exit status (1)


Command executed:

  run_busco.sh \
      "--auto-lineage-prok --offline --download_path 2024-10-04" \
      "Y" \
      "2024-10-04" \
      "MEGAHIT-DASToolUnbinned-DNA_H1H_30_C1.fa" \
      8 \
      "N" \
      "Y" \
      "--offline"

  most_spec_db=$(<info_most_spec_db.txt)

  cat <<-END_VERSIONS > versions.yml
  "NFCORE_MAG:MAG:BUSCO_QC:BUSCO":
      python: $(python --version 2>&1 | sed 's/Python //g')
      R: $(R --version 2>&1 | sed -n 1p | sed 's/R version //' | sed 's/ (.*//')
      busco: $(busco --version 2>&1 | sed 's/BUSCO //g')
  END_VERSIONS

Command exit status:
  1

Command output:
  (empty)

Command error:
  ERROR: BUSCO analysis failed for some unknown reason! See also MEGAHIT-DASToolUnbinned-DNA_H1H_30_C1.fa_busco.err.

Work dir:
  /cluster/work/users/thhaverk/nf_mag/f4/a588c0c4349b020ee22884cd596274

Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`

 -- Check '.nextflow.log' file for details
ERROR ~ Pipeline failed. Please refer to troubleshooting docs: https://nf-co.re/docs/usage/troubleshooting

 -- Check '.nextflow.log' file for details

My command to run the job was this:

 nextflow run nf-core/mag -profile apptainer -work-dir $USERWORK/nf_mag -resume -c saga_mag.config -params-file params.json -r busco-offline-fix

When I checked the error in the work directory, I found this in the .command.log file:

ERROR: BUSCO analysis failed for some unknown reason! See also MEGAHIT-DASToolUnbinned-DNA_H1H_30_C1.fa_busco.err.

and the log file from BUSCO gave this output:

2024-11-20 10:07:19 INFO:       ***** Start a BUSCO v5.4.3 analysis, current time: 11/20/2024 10:07:19 *****
2024-11-20 10:07:19 INFO:       Configuring BUSCO with local environment
2024-11-20 10:07:19 INFO:       Mode is genome
2024-11-20 10:07:19 INFO:       Input file is /cluster/work/users/thhaverk/nf_mag/f4/a588c0c4349b020ee22884cd596274/MEGAHIT-DASToolUnbinned-DNA_H1H_30_C1.fa
2024-11-20 10:07:19 INFO:       No lineage specified. Running lineage auto selector.

2024-11-20 10:07:19 INFO:       ***** Starting Auto Select Lineage *****
        This process runs BUSCO on the generic lineage datasets for the domains archaea, bacteria and eukaryota. Once the optimal domain is selected, BUSCO automatically attempts to find the most appropriate BUSCO dataset to use based on phylogenetic placement.
        --auto-lineage-euk and --auto-lineage-prok are also available if you know your input assembly is, or is not, an eukaryote. See the user guide for more information.
        A reminder: Busco evaluations are valid when an appropriate dataset is used, i.e., the dataset belongs to the lineage of the species to test. Because of overlapping markers/spurious matches among domains, busco matches in another domain do not necessarily mean that your genome/proteome contains sequences from this domain. However, a high busco score in multiple domains might help you identify possible contaminations

That does not give much information, so I checked the folder called BUSCO, which contains a log directory.
In the more detailed log file there, I found these lines:

2024-11-20 10:07:19 INFO:busco.BuscoRunner      Input file is /cluster/work/users/thhaverk/nf_mag/f4/a588c0c4349b020ee22884cd596274/MEGAHIT-DASToolUnbinned-DNA_H1H_30_C1.fa
2024-11-20 10:07:19 INFO:busco.BuscoRunner      No lineage specified. Running lineage auto selector.

2024-11-20 10:07:19 INFO:busco.AutoLineage      ***** Starting Auto Select Lineage *****
        This process runs BUSCO on the generic lineage datasets for the domains archaea, bacteria and eukaryota. Once the optimal domain is selected, BUSCO automatically attempts to find the most appropriate BUSCO dataset to use based on phylogenetic placement.
        --auto-lineage-euk and --auto-lineage-prok are also available if you know your input assembly is, or is not, an eukaryote. See the user guide for more information.
        A reminder: Busco evaluations are valid when an appropriate dataset is used, i.e., the dataset belongs to the lineage of the species to test. Because of overlapping markers/spurious matches among domains, busco matches in another domain do not necessarily mean that your genome/proteome contains sequences from this domain. However, a high busco score in multiple domains might help you identify possible contaminations.
2024-11-20 10:07:19 DEBUG:busco.AutoLineage     Running auto selector
2024-11-20 10:07:19 ERROR:busco.BuscoRunner     Unable to run BUSCO in offline mode. Dataset /cluster/work/users/thhaverk/nf_mag/f4/a588c0c4349b020ee22884cd596274/2024-10-04/lineages/archaea_odb10 does not exist.
2024-11-20 10:07:19 DEBUG:busco.BuscoRunner     Unable to run BUSCO in offline mode. Dataset /cluster/work/users/thhaverk/nf_mag/f4/a588c0c4349b020ee22884cd596274/2024-10-04/lineages/archaea_odb10 does not exist.

The last line indicates that the archaea_odb10 directory was not found.
When I checked, I saw that the location BUSCO looked in does not match my database layout.

BUSCO indicates this directory location

/cluster/work/users/thhaverk/nf_mag/f4/a588c0c4349b020ee22884cd596274/2024-10-04/lineages/archaea_odb10

but on my system the location is this:

/cluster/home/thhaverk/thhaverk/nf_mag/f4/a588c0c4349b020ee22884cd596274/2024-10-04/archaea_odb10

which is without the "lineages" subfolder.
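
A quick way to tell the two layouts apart is sketched below. It assumes only what the error message shows, namely that BUSCO in offline mode looked for the dataset under a lineages subfolder of the download path. The function name dataset_layout and the words it prints are hypothetical.

```shell
# Hypothetical diagnostic; the name and the printed words are illustrative.
# $1: BUSCO download path (the directory given to --download_path)
# $2: dataset name, for example archaea_odb10
dataset_layout() {
    root=$1
    dataset=$2
    if [ -d "$root/lineages/$dataset" ]; then
        echo "nested"    # dataset sits where BUSCO looked in the error above
    elif [ -d "$root/$dataset" ]; then
        echo "flat"      # dataset sits directly in the download path
    else
        echo "missing"   # dataset not found in either place
    fi
}
```

For example, dataset_layout /cluster/shared/biobases/BUSCO/2024-10-04 archaea_odb10 would be expected to print "flat" for the database as I first installed it.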

So I tried running the workflow again after making my BUSCO database match the path in the BUSCO output, that is:

/cluster/work/users/thhaverk/nf_mag/f4/a588c0c4349b020ee22884cd596274/2024-10-04/lineages

With all the taxon directories moved into lineages.

That seems to work. This is the output from one of the jobs:

2024-11-21 10:38:23 INFO:       ***** Start a BUSCO v5.4.3 analysis, current time: 11/21/2024 10:38:23 *****
2024-11-21 10:38:23 INFO:       Configuring BUSCO with local environment
2024-11-21 10:38:23 INFO:       Mode is genome
2024-11-21 10:38:23 INFO:       Input file is /cluster/work/users/thhaverk/nf_mag/10/103d8edb44aa99012952e5d3649693/MEGAHIT-DASToolUnbinned-DNA_H1H_48_C5.fa
2024-11-21 10:38:23 INFO:       No lineage specified. Running lineage auto selector.

2024-11-21 10:38:23 INFO:       ***** Starting Auto Select Lineage *****
        This process runs BUSCO on the generic lineage datasets for the domains archaea, bacteria and eukaryota. Once the optimal domain is selected, BUSCO automatically attempts to find the most appropriate BUSCO dataset to use based on phylogenetic placement.
        --auto-lineage-euk and --auto-lineage-prok are also available if you know your input assembly is, or is not, an eukaryote. See the user guide for more information.
        A reminder: Busco evaluations are valid when an appropriate dataset is used, i.e., the dataset belongs to the lineage of the species to test. Because of overlapping markers/spurious matches among domains, busco matches in another domain do not necessarily mean that your genome/proteome contains sequences from this domain. However, a high busco score in multiple domains might help you identify possible contaminations.
2024-11-21 10:38:27 INFO:       Running BUSCO using lineage dataset archaea_odb10 (prokaryota, 2024-01-08)
2024-11-21 10:38:27 INFO:       Running 1 job(s) on bbtools, starting at 11/21/2024 10:38:27
2024-11-21 10:38:30 INFO:       [bbtools]       1 of 1 task(s) completed
2024-11-21 10:38:31 INFO:       ***** Run Prodigal on input to predict and extract genes *****

@Thomieh73
Author

Thomieh73 commented Nov 21, 2024

So it looks like the error with BUSCO running offline had to do with the folder structure of the BUSCO database directory. Adding a subfolder called lineages to the directory resolved that error.

But it does not solve the problem completely. I also ran into issue #545, and I checked this issue: https://gitlab.com/ezlab/busco/-/issues/324

That shows that I should set up my database correctly.

I therefore used BUSCO version 5.5.0, which is installed on our cluster, to set up the database correctly with this command:

 busco --download all --download_path 2024-11-21

I should have done that earlier :-)

I will try again once my database has this folder structure:

2024-11-21
  /lineages
  /placement_files
  /information
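
A minimal sketch for checking that layout before submitting a job is given below. It tests only for the three subfolders listed above and says nothing about whether their contents are complete. The function name check_busco_download_path is hypothetical.

```shell
# Hypothetical pre-flight check; it tests only for the three subfolders above.
# $1: BUSCO download path
# Prints one line per missing subfolder and returns non-zero if any is missing.
check_busco_download_path() {
    root=$1
    status=0
    for sub in lineages placement_files information; do
        if [ ! -d "$root/$sub" ]; then
            echo "missing: $root/$sub"
            status=1
        fi
    done
    return $status
}
```

Running check_busco_download_path 2024-11-21 prints nothing and returns 0 when all three subfolders are present.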

@Thomieh73
Author

Okay. I have now set up a correct BUSCO database, and I ran the pipeline with this command:

nextflow run nf-core/mag -profile apptainer -work-dir $USERWORK/nf_mag -resume -c saga_mag.config -params-file params.json -r busco-offline-fix

The BUSCO step finished without any problems.

I then went back to the original job that was the reason for this issue and checked the BUSCO logs. There I could now see that the error was also caused by the database not being set up correctly.

So I ran my original pipeline command with the now correctly set-up database:

nextflow run nf-core/mag -r 3.0.3 -profile apptainer -work-dir $USERWORK/nf_mag -resume -c saga_mag.config -params-file params.json 

That works without a problem.

So the whole issue was due to me not setting up the BUSCO database correctly.

@jfy133, I will close this issue, as it was not a problem with the MAG workflow but with the database needed for BUSCO.

@jfy133
Member

jfy133 commented Nov 22, 2024

OK, thanks for the follow-up and clarifications, we appreciate it!
