
Not a valid path value when provide genome by fasta and gtf in compressed gz format, and not providing genome index or gene bed #427

Open
ljw20180420 opened this issue Oct 16, 2024 · 4 comments
Labels
bug Something isn't working

Comments

ljw20180420 commented Oct 16, 2024

If the genome is provided as fasta and gtf files, the workflow will try to generate the genome index and gene BED when they are not provided. This works fine when the fasta and gtf are uncompressed. However, if either the fasta or the gtf is compressed (in gz format in my case), and the corresponding downstream file (i.e. genome index or gene BED) is not provided, I get the following error:

Not a valid path value: '../genome/Mus_musculus.GRCm38.dna_sm.primary_assembly.fa.gz'
Not a valid path value: '../genome/Mus_musculus.GRCm38.102.gtf.gz'

My params.yaml is

input: './samplesheet.csv'
outdir: './results/'
fasta: '../genome/Mus_musculus.GRCm38.dna_sm.primary_assembly.fa.gz'
gtf: '../genome/Mus_musculus.GRCm38.102.gtf.gz'
narrow_peak: true
aligner: 'bowtie2'
read_length: 150
max_memory: '50.GB'
max_cpus: 24
save_reference: true

My command is

HTTPS_PROXY=http://localhost:1081 nextflow run $PWD/../pipeline/nf-core-chipseq/2_1_0/main.nf -profile singularity -resume -params-file params.yaml

Both should be correct, because everything works fine if I first uncompress 'Mus_musculus.GRCm38.dna_sm.primary_assembly.fa.gz' and 'Mus_musculus.GRCm38.102.gtf.gz'.
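Until a fix is merged, a workaround consistent with the observation above is to decompress the references up front and point params.yaml at the uncompressed copies. A minimal sketch, using the filenames from this issue (adjust paths as needed; `gunzip -k` keeps the original `.gz` file):

```shell
# Decompress the references once, keeping the compressed originals (-k).
gunzip -k ../genome/Mus_musculus.GRCm38.dna_sm.primary_assembly.fa.gz
gunzip -k ../genome/Mus_musculus.GRCm38.102.gtf.gz

# Then point params.yaml at the uncompressed files:
#   fasta: '../genome/Mus_musculus.GRCm38.dna_sm.primary_assembly.fa'
#   gtf:   '../genome/Mus_musculus.GRCm38.102.gtf'
```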

I also found some related issues in other nf-core workflows:

https://github.com/nf-core/rnaseq/issues/1311
https://github.com/nf-core/atacseq/issues/277
https://github.com/nf-core/cutandrun/issues/187

Thank you.

@ljw20180420 ljw20180420 added the bug Something isn't working label Oct 16, 2024
@ljw20180420 ljw20180420 changed the title Not a valid path value when using fasta and gtf without index → Not a valid path value when provide genome by fasta and gtf in compressed gz format, and not providing genome index or gene bed Oct 17, 2024
@Nicobouch

same for the --gff option

a1ultima commented Dec 4, 2024

@ljw20180420

Cause

I've managed to fix this bug. In v2.1.0 the gffread module was upgraded to match the latest nf-core/rnaseq version of gffread, but the channel handling in prepare_genome.nf was not correspondingly updated to match the inputs expected by the upgraded gffread process:

v2.1.0: https://github.com/nf-core/chipseq/blob/master/modules/nf-core/gffread/main.nf

In v2.0.0 this gffread process expects just the path of the gff file:

    input:
    path gff

whereas in v2.1.0 the process expects a (meta, gff) tuple plus a fasta input:

    input:
    tuple val(meta), path(gff)
    path fasta

Solution

The solution was to update the channel handling code in prepare_genome.nf to match these new input requirements of gffread: https://github.com/nf-core/chipseq/blob/master/subworkflows/local/prepare_genome.nf

    //
    // Uncompress GTF annotation file or create from GFF3 if required
    //
    if (gtf) {
        if (gtf.endsWith('.gz')) {
            ch_gtf      = GUNZIP_GTF ( [ [:], gtf ] ).gunzip.map { it[1] }
            ch_versions = ch_versions.mix(GUNZIP_GTF.out.versions)
        } else {
            ch_gtf = Channel.value(file(gtf))
        }
    } else if (gff) {
        if (gff.endsWith('.gz')) {
            ch_gff      = GUNZIP_GFF ( [ [:], gff ] ).gunzip.map { it[1] }
            ch_versions = ch_versions.mix(GUNZIP_GFF.out.versions)
        } else {
            ch_gff = Channel.value(file(gff))
        }

        // Create a channel emitting tuples of (meta, gff)  
        ch_gff = ch_gff.map { gff_file ->
            def meta = [ id: gff_file.baseName.tokenize('.')[0] ]  // Remove extensions
            tuple(meta, gff_file)
        }

        // Pass the input channel to GFFREAD along with ch_fasta
        ch_gtf = GFFREAD(ch_gff, ch_fasta).gtf
        ch_versions = ch_versions.mix(GFFREAD.out.versions)
    }
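
For clarity, the meta id built by `baseName.tokenize('.')[0]` above is just the filename up to the first dot. The same derivation mirrored in shell, using the annotation file from this issue (illustration only, not part of the pipeline):

```shell
# baseName strips the last extension; tokenize('.')[0] keeps the text
# before the first dot. In shell, ${var%%.*} does both in one step.
fname='Mus_musculus.GRCm38.102.gtf'
echo "${fname%%.*}"   # -> Mus_musculus
```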

Just replace your prepare_genome.nf with the following code (NOTE: I also fixed the channel handling for gtf2bed here):

//
// Uncompress and prepare reference genome files
//

include {
    GUNZIP as GUNZIP_FASTA
    GUNZIP as GUNZIP_GTF
    GUNZIP as GUNZIP_GFF
    GUNZIP as GUNZIP_GENE_BED
    GUNZIP as GUNZIP_BLACKLIST
} from '../../modules/nf-core/gunzip/main'

include {
    UNTAR as UNTAR_BWA_INDEX
    UNTAR as UNTAR_BOWTIE2_INDEX
    UNTAR as UNTAR_STAR_INDEX
} from '../../modules/nf-core/untar/main'

include { UNTARFILES } from '../../modules/nf-core/untarfiles/main'
include { GFFREAD } from '../../modules/nf-core/gffread/main'
include { CUSTOM_GETCHROMSIZES } from '../../modules/nf-core/custom/getchromsizes/main'
include { BWA_INDEX } from '../../modules/nf-core/bwa/index/main'
include { BOWTIE2_BUILD } from '../../modules/nf-core/bowtie2/build/main'
include { CHROMAP_INDEX } from '../../modules/nf-core/chromap/index/main'

include { GTF2BED } from '../../modules/local/gtf2bed'
include { GENOME_BLACKLIST_REGIONS } from '../../modules/local/genome_blacklist_regions'
include { STAR_GENOMEGENERATE } from '../../modules/local/star_genomegenerate'

workflow PREPARE_GENOME {
    take:
    genome             //  string: genome name
    genomes            //     map: genome attributes
    prepare_tool_index //  string: tool to prepare index for
    fasta              //    file: /path/to/genome.fasta
    gtf                //    file: /path/to/genome.gtf
    gff                //    file: /path/to/genome.gff
    blacklist          //    file: /path/to/blacklist.bed
    gene_bed           //    file: /path/to/gene.bed
    bwa_index          //    file: /path/to/bwa/index/
    bowtie2_index      //    file: /path/to/bowtie2/index/
    chromap_index      //    file: /path/to/chromap/index/
    star_index         //    file: /path/to/star/index/

    main:

    ch_versions = Channel.empty()

    //
    // Uncompress genome fasta file if required
    //
    if (fasta.endsWith('.gz')) {
        ch_fasta    = GUNZIP_FASTA ( [ [:], fasta ] ).gunzip.map { it[1] }
        ch_versions = ch_versions.mix(GUNZIP_FASTA.out.versions)
    } else {
        ch_fasta = Channel.value(file(fasta))
    }

    //
    // Uncompress GTF annotation file or create from GFF3 if required
    //
    if (gtf) {
        if (gtf.endsWith('.gz')) {
            ch_gtf      = GUNZIP_GTF ( [ [:], gtf ] ).gunzip.map { it[1] }
            ch_versions = ch_versions.mix(GUNZIP_GTF.out.versions)
        } else {
            ch_gtf = Channel.value(file(gtf))
        }
    } else if (gff) {
        if (gff.endsWith('.gz')) {
            ch_gff      = GUNZIP_GFF ( [ [:], gff ] ).gunzip.map { it[1] }
            ch_versions = ch_versions.mix(GUNZIP_GFF.out.versions)
        } else {
            ch_gff = Channel.value(file(gff))
        }

        // Create a channel emitting tuples of (meta, gff)
        ch_gff = ch_gff.map { gff_file ->
            def meta = [ id: gff_file.baseName.tokenize('.')[0] ]  // Remove extensions
            tuple(meta, gff_file)
        }

        // Pass the input channel to GFFREAD along with ch_fasta
        ch_gtf = GFFREAD(ch_gff, ch_fasta).gtf
        ch_versions = ch_versions.mix(GFFREAD.out.versions)
    }

    //
    // Uncompress blacklist file if required
    //
    ch_blacklist = Channel.empty()
    if (blacklist) {
        if (blacklist.endsWith('.gz')) {
            ch_blacklist = GUNZIP_BLACKLIST ( [ [:], blacklist ] ).gunzip.map { it[1] }
            ch_versions  = ch_versions.mix(GUNZIP_BLACKLIST.out.versions)
        } else {
            ch_blacklist = Channel.value(file(blacklist))
        }
    }

    //
    // Uncompress gene BED annotation file or create from GTF if required
    //
    def make_bed = false
    if (!gene_bed) {
        make_bed = true
    } else if (genome && gtf) {
        if (genomes[genome].gtf != gtf) {
            make_bed = true
        }
    }

    if (make_bed) {
        // ch_gtf emits tuples of (meta, gtf)
        // Extract the GTF file path from the tuple
        ch_gtf_path = ch_gtf.map { tuple -> tuple[1] }

        // Pass the GTF file path to GTF2BED
        ch_gene_bed = GTF2BED(ch_gtf_path).bed
        ch_versions = ch_versions.mix(GTF2BED.out.versions)
    } else {
        if (gene_bed.endsWith('.gz')) {
            ch_gene_bed = GUNZIP_GENE_BED ( [ [:], gene_bed ] ).gunzip.map { it[1] }
            ch_versions = ch_versions.mix(GUNZIP_GENE_BED.out.versions)
        } else {
            ch_gene_bed = Channel.value(file(gene_bed))
        }
    }

    //
    // Create chromosome sizes file
    //
    CUSTOM_GETCHROMSIZES(ch_fasta.map { [[:], it] })
    ch_chrom_sizes = CUSTOM_GETCHROMSIZES.out.sizes.map { it[1] }
    ch_fai         = CUSTOM_GETCHROMSIZES.out.fai.map { it[1] }
    ch_versions    = ch_versions.mix(CUSTOM_GETCHROMSIZES.out.versions)

    //
    // Prepare genome intervals for filtering by removing regions in blacklist file
    //
    ch_genome_filtered_bed = Channel.empty()

    GENOME_BLACKLIST_REGIONS(
        ch_chrom_sizes,
        ch_blacklist.ifEmpty([])
    )
    ch_genome_filtered_bed = GENOME_BLACKLIST_REGIONS.out.bed
    ch_versions = ch_versions.mix(GENOME_BLACKLIST_REGIONS.out.versions)

    //
    // Uncompress BWA index or generate from scratch if required
    //
    ch_bwa_index = Channel.empty()
    if (prepare_tool_index == 'bwa') {
        if (bwa_index) {
            if (bwa_index.endsWith('.tar.gz')) {
                ch_bwa_index = UNTAR_BWA_INDEX([[:], bwa_index]).untar
                ch_versions  = ch_versions.mix(UNTAR_BWA_INDEX.out.versions)
            } else {
                ch_bwa_index = [[:], file(bwa_index)]
            }
        } else {
            ch_bwa_index = BWA_INDEX(ch_fasta.map { [[:], it] }).index
            ch_versions  = ch_versions.mix(BWA_INDEX.out.versions)
        }
    }

    //
    // Uncompress Bowtie2 index or generate from scratch if required
    //
    ch_bowtie2_index = Channel.empty()
    if (prepare_tool_index == 'bowtie2') {
        if (bowtie2_index) {
            if (bowtie2_index.endsWith('.tar.gz')) {
                ch_bowtie2_index = UNTAR_BOWTIE2_INDEX([[:], bowtie2_index]).untar
                ch_versions      = ch_versions.mix(UNTAR_BOWTIE2_INDEX.out.versions)
            } else {
                ch_bowtie2_index = [[:], file(bowtie2_index)]
            }
        } else {
            ch_bowtie2_index = BOWTIE2_BUILD(ch_fasta.map { [[:], it] }).index
            ch_versions      = ch_versions.mix(BOWTIE2_BUILD.out.versions)
        }
    }

    //
    // Uncompress CHROMAP index or generate from scratch if required
    //
    ch_chromap_index = Channel.empty()
    if (prepare_tool_index == 'chromap') {
        if (chromap_index) {
            if (chromap_index.endsWith('.tar.gz')) {
                ch_chromap_index = UNTARFILES([[:], chromap_index]).files
                ch_versions      = ch_versions.mix(UNTARFILES.out.versions)
            } else {
                ch_chromap_index = [[:], file(chromap_index)]
            }
        } else {
            ch_chromap_index = CHROMAP_INDEX(ch_fasta.map { [[:], it] }).index
            ch_versions      = ch_versions.mix(CHROMAP_INDEX.out.versions)
        }
    }

    //
    // Uncompress STAR index or generate from scratch if required
    //
    ch_star_index = Channel.empty()
    if (prepare_tool_index == 'star') {
        if (star_index) {
            if (star_index.endsWith('.tar.gz')) {
                ch_star_index = UNTAR_STAR_INDEX([[:], star_index]).untar.map { it[1] }
                ch_versions   = ch_versions.mix(UNTAR_STAR_INDEX.out.versions)
            } else {
                ch_star_index = Channel.value(file(star_index))
            }
        } else {
            ch_star_index = STAR_GENOMEGENERATE(ch_fasta, ch_gtf).index
            ch_versions   = ch_versions.mix(STAR_GENOMEGENERATE.out.versions)
        }
    }

    emit:
    fasta          = ch_fasta                   // path: genome.fasta
    fai            = ch_fai                     // path: genome.fai
    gtf            = ch_gtf                     // path: genome.gtf
    gene_bed       = ch_gene_bed                // path: gene.bed
    chrom_sizes    = ch_chrom_sizes             // path: genome.sizes
    filtered_bed   = ch_genome_filtered_bed     // path: *.include_regions.bed
    bwa_index      = ch_bwa_index               // path: bwa/index/
    bowtie2_index  = ch_bowtie2_index           // path: bowtie2/index/
    chromap_index  = ch_chromap_index           // path: genome.index
    star_index     = ch_star_index              // path: star/index/
    versions       = ch_versions.ifEmpty(null)  // channel: [ versions.yml ]
}

@ljw20180420
Author

Thank you very much. It seems a little complex to me. Do you plan to fix this directly in the repository https://github.com/nf-core/chipseq/blob/master/subworkflows/local/prepare_genome.nf so that others can be rid of this bug for good?

@a1ultima

@ljw20180420 I am not part of their dev team, but there are multiple other errors I am debugging. So far I have run into 5-6 bugs, and I plan to open a PR for each fix for them to incorporate.
