update docs
tjparnell committed Sep 14, 2024
1 parent 789911f commit 2f3b2f9
Showing 3 changed files with 29 additions and 25 deletions.
10 changes: 5 additions & 5 deletions docs/Install.md
@@ -15,20 +15,20 @@ This package can by installed using the standard Perl

Otherwise, a Perl package manager may be used, such as CPAN or
[CPAN Minus](https://metacpan.org/pod/App::cpanminus). If you have downloaded this
- package from GitHub, you can run one of the following inside the package directory.
+ package from GitHub, run _one_ of the following inside the package directory.

cpan .
cpanm .
cpanp .

This installation will allow the scripts to work with Fastq files. However, to use
- the scripts with bam files, specifically bam_umi_dedup.pl, follow the Advanced
- installation, below.
+ the scripts with bam files, specifically [bam_umi_dedup.pl](apps/bam_umi_dedup.md),
+ follow the Advanced installation, below.


## Advanced installation

- The [bam_umi_dedup](bam_umi_dedupapps/.md) application requires the installation of
+ The [bam_umi_dedup](apps/bam_umi_dedup.md) application requires the installation of
the [Bio::DB::HTS](https://metacpan.org/pod/Bio::DB::HTS) Perl adapter, which in turn
requires the external HTSlib library to be installed.

@@ -67,7 +67,7 @@ should be installed first. These are listed below.
Once these two prerequisites are installed, the remaining Perl modules can be installed.
These are listed as recommendations in `Build.PL` and are not installed automatically
as dependencies. Most Perl package managers, such as CPANMinus, CPANPlus, or CPAN
- can be used here. Use one of the following, as appropriate:
+ can be used here. Use _one_ of the following, as appropriate:

cpan -i Bio::DB::HTS Parallel::ForkManager List::MoreUtils
cpanm -i Bio::DB::HTS Parallel::ForkManager List::MoreUtils
30 changes: 16 additions & 14 deletions docs/Usage.md
@@ -117,14 +117,14 @@ file with an extra step.

To align Fastq with SAM tag comments:

- bwa mem -t $CPU -v 1 -C reference.fasta *.fastq.gz | \
+ bwa mem -t $CPU -C reference.fasta *.fastq.gz | \
samtools fixmate -m - output.bam

- To align with unaligned Bam file, it must first be converted to Fastq with tags as
- comments using `samtools` as a pre-step:
+ To align with an unaligned Bam file, it must first be converted to Fastq with tags
+ as comments using `samtools` as a pre-step:

samtools fastq -T RX,RQ unaligned.bam | \
- bwa mem -t $CPU -C -v 1 -p reference.fasta - | \
+ bwa mem -t $CPU -C -p reference.fasta - | \
samtools fixmate -m - output.bam

### STAR
@@ -217,15 +217,16 @@ or from the name (`--barcode-name`). No mismatches in the UMI are tolerated.

### bam\_umi\_dedup

- The included [bam_umi_dedup.pl](bam_umi_dedupapps/.md) application is considerably
+ The included [bam_umi_dedup.pl](apps/bam_umi_dedup.md) application is considerably
faster than Picard, but with a **notable caveat**: no guarantee is made for retaining
identical molecules at secondary, supplementary, and inter-chromosomal alignments
(they are treated independently). Otherwise, for normal alignments, results are
- comparable to Picard. For most count-based applications such as ChIPSeq or RNASeq,
- this limitation may be acceptable.
+ comparable to other tools. For most count-based applications such as ChIPSeq or
+ RNASeq, this limitation may be acceptable.

- By default, duplicates are discarded, or they can marked with the `--mark` option.
- Unmapped alignments are silently discarded.
+ By default, duplicates are discarded, or they can be marked with the `--mark` option.
+ Unmapped alignments are silently discarded (an unfortunate side effect of
+ multi-threading by reference target sequences).

For de-duplication with SAM attribute tag `RX` (default):

@@ -235,11 +236,12 @@ For de-duplication with the UMI appended to the read name:

bam_umi_dedup.pl --in input.bam --out output.bam --umi name --cpu $CPU

- By default, one mismatch is tolerated when de-duplicating with UMI sequences.
- **Note** that extreme depth may substantially slow down execution time when mismatches
- are tolerated. For `bam_umi_dedup`, a maximum depth is allowed for mismatch tolerance
- before mismatch checking is dropped to avoid extreme impact (runtime of days). Dropping
- mismatch tolerance completely will also improve runtime.
+ One mismatch (by default) is tolerated when de-duplicating with UMI sequences. This
+ is an advantage over the other tools which do not tolerate mismatches. **Note** that
+ extreme depth may substantially slow down execution time when mismatches are
+ tolerated. For `bam_umi_dedup`, a maximum depth is allowed for mismatch tolerance
+ before mismatch checking is dropped to avoid extreme impact (runtime of days!?).
+ Dropping mismatch tolerance completely will improve runtime.

### Performance

14 changes: 8 additions & 6 deletions docs/apps/bam_umi_dedup.md
@@ -78,20 +78,22 @@ UMI options:
Specify 'name' when UMI appended to read name.
-m --mark Mark duplicates (flag 0x400) instead of discarding
-t --tolerance <int> UMI sequence edit distance tolerance (1)
- --indel <int> Set insertion/deletion penalty score (1)
- --skip <int> Skip mismatch detection if depth exceeds (5000)
+ --indel <int> Set insertion/deletion penalty score (1)
+ --skip <int> Skip mismatch detection if depth exceeds (5000)

Other options:

- -f --fasta <file> Provide fasta file for Cram files
+ -f --fasta <file> Provide indexed fasta file for Cram files
-d --distance <int> Set optical duplicate distance threshold.
Use 100 for unpatterned flowcell (HiSeq) or
- 2500 for patterned flowcell (NovaSeq). Default 0.
- --coord <string> Provide the tile:X:Y integer 1-base positions in the
+ 2500 for patterned flowcell (NextSeq or NovaSeq6000)
+ or 200 for NovaseqX. Default 0.
+ --coord <string> Provide the tile:X:Y integer 1-base positions in the
read name for optical checking. For Illumina CASAVA 1.8
7-element names, this is 5:6:7 (default)
-c --cpu <int> Specify the number of forks to use (4)
- --samtools <path> Path to samtools (/usr/local/bin/samtools)
+ --samtools <path> Path to samtools (/usr/local/bin/samtools)
--nosam Do not use samtools for final concatenation (slower)
-h --help Display full description and help

