update docs

HuntsmanCancerInstitute · Sep 14, 2024 · 2f3b2f9
1 parent 789911f
commit 2f3b2f9

Showing 3 changed files with 29 additions and 25 deletions.
diff --git a/docs/Install.md b/docs/Install.md
@@ -15,20 +15,20 @@ This package can by installed using the standard Perl
 
 Otherwise, a Perl package manager may be used, such as CPAN or
 [CPAN Minus](https://metacpan.org/pod/App::cpanminus). If you have downloaded this
-package from GitHub, you can run one of the following inside the package directory.
+package from GitHub, run _one_ of the following inside the package directory.
 
 	cpan  .
 	cpanm .
 	cpanp .
 
 This installation will allow the scripts to work with Fastq files. However, to use
-the scripts with bam files, specifically bam_umi_dedup.pl, follow the Advanced 
-installation, below.
+the scripts with bam files, specifically [bam_umi_dedup.pl](apps/bam_umi_dedup.md),
+follow the Advanced installation, below.
 
 
 ## Advanced installation
 
-The [bam_umi_dedup](bam_umi_dedupapps/.md) application requires the installation of
+The [bam_umi_dedup](apps/bam_umi_dedup.md) application requires the installation of
 the [Bio::DB::HTS](https://metacpan.org/pod/Bio::DB::HTS) Perl adapter, which in turn 
 requires the external HTSlib library to be installed. 
 
@@ -67,7 +67,7 @@ should be installed first. These are listed below.
 Once these two prerequisites are installed, the remaining Perl modules can be installed.
 These are listed as recommendations in `Build.PL` and are not installed automatically
 as dependencies. Most Perl package managers, such as CPANMinus, CPANPlus, or CPAN 
-can be used here. Use one of the following, as appropriate:
+can be used here. Use _one_ of the following, as appropriate:
 
 	cpan  -i Bio::DB::HTS Parallel::ForkManager List::MoreUtils
 	cpanm -i Bio::DB::HTS Parallel::ForkManager List::MoreUtils

diff --git a/docs/Usage.md b/docs/Usage.md
@@ -117,14 +117,14 @@ file with an extra step.
 
 To align Fastq with SAM tag comments:
 
-    bwa mem -t $CPU -v 1 -C reference.fasta *.fastq.gz | \
+    bwa mem -t $CPU -C reference.fasta *.fastq.gz | \
     samtools fixmate -m - output.bam
 
-To align with unaligned Bam file, it must first be converted to Fastq with tags as
-comments using `samtools` as a pre-step:
+To align with an unaligned Bam file, it must first be converted to Fastq with tags
+as comments using `samtools` as a pre-step:
 
     samtools fastq -T RX,RQ unaligned.bam | \
-    bwa mem -t $CPU -C -v 1 -p reference.fasta - | \
+    bwa mem -t $CPU -C -p reference.fasta - | \
     samtools fixmate -m - output.bam
 
 ### STAR
@@ -217,15 +217,16 @@ or from the name (`--barcode-name`). No mismatches in the UMI are tolerated.
 
 ### bam\_umi\_dedup
 
-The included [bam_umi_dedup.pl](bam_umi_dedupapps/.md) application is considerably
+The included [bam_umi_dedup.pl](apps/bam_umi_dedup.md) application is considerably
 faster than Picard, but with a **notable caveat**: no guarantee is made for retaining
 identical molecules at secondary, supplementary, and inter-chromosomal alignments
 (they are treated independently). Otherwise, for normal alignments, results are
-comparable to Picard. For most count-based applications such as ChIPSeq or RNASeq,
-this limitation may be acceptable.
+comparable to other tools. For most count-based applications such as ChIPSeq or
+RNASeq, this limitation may be acceptable.
 
-By default, duplicates are discarded, or they can marked with the `--mark` option.
-Unmapped alignments are silently discarded. 
+By default, duplicates are discarded, or they can be marked with the `--mark` option.
+Unmapped alignments are silently discarded (an unfortunate side effect of
+multi-threading by reference target sequences). 
 
 For de-duplication with SAM attribute tag `RX` (default):
 
@@ -235,11 +236,12 @@ For de-duplication with the UMI appended to the read name:
 
     bam_umi_dedup.pl --in input.bam --out output.bam --umi name --cpu $CPU
 
-By default, one mismatch is tolerated when de-duplicating with UMI sequences. 
-**Note** that extreme depth may substantially slow down execution time when mismatches 
-are tolerated. For `bam_umi_dedup`, a maximum depth is allowed for mismatch tolerance 
-before mismatch checking is dropped to avoid extreme impact (runtime of days). Dropping
-mismatch tolerance completely will also improve runtime.
+One mismatch (by default) is tolerated when de-duplicating with UMI sequences. This
+is an advantage over the other tools which do not tolerate mismatches. **Note** that
+extreme depth may substantially slow down execution time when mismatches are
+tolerated. For `bam_umi_dedup`, a maximum depth is allowed for mismatch tolerance
+before mismatch checking is dropped to avoid extreme impact (runtime of days!?).
+Dropping mismatch tolerance completely will improve runtime.
 
 ### Performance
 

diff --git a/docs/apps/bam_umi_dedup.md b/docs/apps/bam_umi_dedup.md
@@ -78,20 +78,22 @@ UMI options:
                             Specify 'name' when UMI appended to read name.
     -m --mark             Mark duplicates (flag 0x400) instead of discarding
     -t --tolerance <int>  UMI sequence edit distance tolerance (1)
-    --indel <int>         Set insertion/deletion penalty score (1)
-    --skip <int>          Skip mismatch detection if depth exceeds (5000)
+       --indel <int>      Set insertion/deletion penalty score (1)
+       --skip <int>       Skip mismatch detection if depth exceeds (5000)
 
 Other options:
 
-    -f --fasta <file>     Provide fasta file for Cram files
+    -f --fasta <file>     Provide indexed fasta file for Cram files
     -d --distance <int>   Set optical duplicate distance threshold.
                             Use 100 for unpatterned flowcell (HiSeq) or 
-                            2500 for patterned flowcell (NovaSeq). Default 0.
-    --coord <string>      Provide the tile:X:Y integer 1-base positions in the 
+                            2500 for patterned flowcell (NextSeq or NovaSeq6000)
+                            or 200 for NovaseqX. Default 0.
+       --coord <string>   Provide the tile:X:Y integer 1-base positions in the 
                             read name for optical checking. For Illumina CASAVA 1.8 
                             7-element names, this is 5:6:7 (default)
     -c --cpu <int>        Specify the number of forks to use (4) 
-    --samtools <path>     Path to samtools (/usr/local/bin/samtools)
+       --samtools <path>  Path to samtools (/usr/local/bin/samtools)
+       --nosam            Do not use samtools for final concatenation (slower)
     -h --help             Display full description and help
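
A minimal sketch of how the updated `--distance` guidance could be applied from a wrapper script. The `optical_distance` helper and its platform keywords are hypothetical names introduced here for illustration and are not part of the package; the thresholds are the ones listed in the help text above. The command is only printed, not run.

```shell
#!/bin/sh
# Hypothetical helper (not part of the package): map a sequencer keyword to
# the optical duplicate distance suggested in the bam_umi_dedup help text.
optical_distance() {
    case "$1" in
        hiseq)               echo 100  ;;  # unpatterned flowcell
        nextseq|novaseq6000) echo 2500 ;;  # patterned flowcell
        novaseqx)            echo 200  ;;
        *)                   echo 0    ;;  # documented default
    esac
}

# Dry run: print the command rather than executing it.
dist=$(optical_distance novaseqx)
echo "bam_umi_dedup.pl --in input.bam --out output.bam --distance $dist --cpu 4"
```

Printing the command first makes it easy to check the chosen threshold before committing to a long de-duplication run.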