Merge branch 'master' of https://github.com/chhylp123/Long_read_assembly

chhylp123 · Apr 24, 2021 · 80877b9
2 parents e52e897 + dfd7720
commit 80877b9
Showing 1 changed file with 87 additions and 40 deletions.
diff --git a/README.md b/README.md
@@ -1,4 +1,4 @@
-## Getting Started
+## <a name="started"></a>Getting Started
 
 ```sh
 # Install hifiasm (requiring g++ and zlib)
@@ -12,30 +12,47 @@ awk '/^S/{print ">"$2;print $3}' test.p_ctg.gfa > test.p_ctg.fa  # get primary c
 
 # Assemble inbred/homozygous genomes (-l0 disables duplication purging)
 hifiasm -o CHM13.asm -t32 -l0 CHM13-HiFi.fa.gz 2> CHM13.asm.log
-# Assemble heterozygous with built-in duplication purging
+# Assemble heterozygous genomes with built-in duplication purging
 hifiasm -o HG002.asm -t32 HG002-file1.fq.gz HG002-file2.fq.gz
 
+# Hi-C phasing with paired-end short reads in two FASTQ files
+hifiasm -o HG002.asm --h1 read1.fq.gz --h2 read2.fq.gz HG002-HiFi.fq.gz
+
 # Trio binning assembly (requiring https://github.com/lh3/yak)
 yak count -b37 -t16 -o pat.yak <(cat pat_1.fq.gz pat_2.fq.gz) <(cat pat_1.fq.gz pat_2.fq.gz)
 yak count -b37 -t16 -o mat.yak <(cat mat_1.fq.gz mat_2.fq.gz) <(cat mat_1.fq.gz mat_2.fq.gz)
 hifiasm -o HG002.asm -t32 -1 pat.yak -2 mat.yak HG002-HiFi.fa.gz
 ```
 
-## Introduction
+## Table of Contents
+
+- [Getting Started](#started)
+- [Introduction](#intro)
+- [Why Hifiasm?](#why)
+- [Usage](#use)
+  - [Assembling HiFi reads without additional data types](#hifionly)
+  - [Hi-C integration](#hic)
+  - [Trio binning](#trio)
+  - [Output files](#output)
+- [Results](#results)
+- [Getting Help](#help)
+- [Limitations](#limit)
+- [Citing Hifiasm](#cite)
+
+## <a name="intro"></a>Introduction
 
-Hifiasm is a fast haplotype-resolved de novo assembler for PacBio Hifi reads. 
-It can assemble a human genome in several hours and works with the California
-redwood genome, one of the most complex genomes sequenced so far. Hifiasm can
-produce primary/alternate assemblies of quality competitive with the best
-assemblers. It also introduces a new graph binning algorithm and achieves
-the best haplotype-resolved assembly given trio data.
+Hifiasm is a fast haplotype-resolved de novo assembler for PacBio HiFi reads.
+It can assemble a human genome in several hours and assemble a ~30Gb California
+redwood genome in a few days. Hifiasm emits partially phased assemblies of
+quality competitive with the best assemblers. Given parental short reads or
+Hi-C data, it produces arguably the best haplotype-resolved assemblies so far.
 
-## Why Hifiasm?
+## <a name="why"></a>Why Hifiasm?
 
 * Hifiasm delivers high-quality assemblies. It tends to generate longer contigs
   and resolve more segmental duplications than other assemblers.
 
-* Given sequence reads from the parents, hifiasm can produce overall the best
+* Given Hi-C reads or short reads from the parents, hifiasm can produce overall the best
   haplotype-resolved assembly so far. It is the assembler of choice by the
   [Human Pangenome Project][hpp] for the first batch of samples.
 
@@ -47,25 +64,37 @@ the best haplotype-resolved assembly given trio data.
 * Hifiasm is fast. It can assemble a human genome in half a day and assemble a
   ~30Gb redwood genome in three days. No genome is too large for hifiasm.
 
-* Hifiasm is trivial to install and easy to use. It does not required python,
-  R or C++11 compilers and can be compiled into a single executable. The
+* Hifiasm is trivial to install and easy to use. It does not require Python,
+  R or C++11 compilers, and can be compiled into a single executable. The
   default setting works well with a variety of genomes.
 
 [hpp]: https://humanpangenome.org
 
-## Usage
+## <a name="use"></a>Usage
+
+### <a name="hifionly"></a>Assembling HiFi reads without additional data types
 
 A typical hifiasm command line looks like:
 ```sh
 hifiasm -o NA12878.asm -t 32 NA12878.fq.gz
 ```
 where `NA12878.fq.gz` provides the input reads, `-t` sets the number of CPUs in
 use and `-o` specifies the prefix of output files. For this example, the
-primary contigs are written to `NA12878.asm.p_ctg.gfa` and alternate contigs to
-`NA12878.asm.a_ctg.gfa`. At the first run, hifiasm saves corrected reads and
+primary contigs are written to `NA12878.asm.bp.p_ctg.gfa` and alternate contigs to
+`NA12878.asm.bp.a_ctg.gfa`. Since v0.15, hifiasm also produces two sets of
+partially phased contigs at `NA12878.asm.bp.hap?.p_ctg.gfa`. This pair of files
+can be thought of as representing the two haplotypes in a diploid genome, though with
+occasional switch errors. The frequency of switches is determined by the
+heterozygosity of the input sample.
+
+At the first run, hifiasm saves corrected reads and
 overlaps to disk as `NA12878.asm.*.bin`. It reuses the saved results to avoid
 the time-consuming all-vs-all overlap calculation next time. You may specify
 `-i` to ignore precomputed overlaps and redo overlapping from raw reads.
+You can also dump error-corrected reads in FASTA and read overlaps in PAF with
+```sh
+hifiasm -o NA12878.asm -t 32 --write-paf --write-ec /dev/null
+```
 
 Hifiasm purges haplotig duplications by default. For inbred or homozygous
 genomes, you may disable purging with option `-l0`. Old HiFi reads may contain
@@ -75,7 +104,27 @@ bloom filter which takes 16GB memory at the beginning. For genomes much larger
 than human, applying `-f38` or even `-f39` is preferred to save memory on k-mer
 counting.
 
-When parental short reads are available, hifiasm can generate a pair of
+### <a name="hic"></a>Hi-C integration
+
+Hifiasm can generate a pair of haplotype-resolved assemblies with paired-end
+Hi-C reads:
+```sh
+hifiasm -o NA12878.asm -t32 --h1 read1.fq.gz --h2 read2.fq.gz HiFi-reads.fq.gz
+```
+In this mode, each contig is supposed to be a haplotig, which by definition
+comes from one parental haplotype only. Hifiasm often puts all contigs from the
+same parental chromosome in one assembly. It has cleanly separated chrX and
+chrY for a human male dataset. Nonetheless, phasing across centromeres is
+challenging. Users should not expect hifiasm to phase entire chromosomes at the
+moment. Also, contigs from different parental chromosomes are randomly mixed as
+it is just not possible to phase across chromosomes with Hi-C.
+
+Hifiasm does not perform scaffolding for now. You need to run a standalone
+scaffolder such as SALSA or 3D-DNA to scaffold phased haplotigs.
+
+### <a name="trio"></a>Trio binning
+
+When parental short reads are available, hifiasm can also generate a pair of
 haplotype-resolved assemblies with trio binning. To perform such assembly, you
 need to count k-mers with [yak][yak] first and then do assembly:
 ```sh
@@ -85,19 +134,15 @@ hifiasm -o NA12878.asm -t 32 -1 pat.yak -2 mat.yak NA12878.fq.gz
 ```
 Here `NA12878.asm.hap1.p_ctg.gfa` and `NA12878.asm.hap2.p_ctg.gfa` give the two
 haplotype assemblies. In the binning mode, hifiasm does not purge haplotig
-duplications by default. Because hifiasm reuses saved overlaps, you can
+duplicates by default. Because hifiasm reuses saved overlaps, you can
 generate both primary/alternate assemblies and trio binning assemblies with
 ```sh
 hifiasm -o NA12878.asm -t 32 NA12878.fq.gz 2> NA12878.asm.pri.log
 hifiasm -o NA12878.asm -t 32 -1 pat.yak -2 mat.yak /dev/null 2> NA12878.asm.trio.log
 ```
-The second command line will run much faster than the first. You can also dump
-error corrected in FASTA and/or overlaps in PAF with
-```sh
-hifiasm -o NA12878.asm -t 32 --write-paf --write-ec /dev/null
-```
+The second command line will run much faster than the first.
 
-## Output files
+### <a name="output"></a>Output files
 
 For non-trio assembly, hifiasm generates the following files:
 
@@ -126,9 +171,9 @@ For trio assembly, hifiasm generates the following files:
 Hifiasm writes error corrected reads to the *prefix*.ec.bin binary file and
 writes overlaps to *prefix*.ovlp.source.bin and *prefix*.ovlp.reverse.bin.
 
-## Results
+## <a name="results"></a>Results
 
-The following table shows the statistics of several hifiasm primary assemblies:
+The following table shows the statistics of several hifiasm primary assemblies produced with v0.12:
 
 |<sub>Dataset</sub>|<sub>Size</sub>|<sub>Cov.</sub>|<sub>Asm options</sub>|<sub>CPU time</sub>|<sub>Wall time</sub>|<sub>RAM</sub>|<sub>N50</sub>|
 |:---------------|-----:|-----:|:---------------------|-------:|--------:|----:|----------------:|
@@ -155,7 +200,10 @@ redwood genome in a few days on a single machine. For trio binning assembly:
 |:---------------|-----:|-------:|--------:|----:|----------------:|
 |<sub>[HG00733][HG00733-data], [\[father\]][HG00731-data], [\[mother\]][HG00732-data]</sub>|<sub>&times;33</sub>|<sub>269.1h</sub>|<sub>6.9h</sub>|<sub>135G</sub>|<sub>35.1Mb (paternal), 34.9Mb (maternal)</sub>|
 |<sub>[HG002][NA24385-data],   [\[father\]][NA24149-data], [\[mother\]][NA24143-data]</sub>|<sub>&times;36</sub>|<sub>305.4h</sub>|<sub>7.7h</sub>|<sub>137G</sub>|<sub>41.0Mb (paternal), 40.8Mb (maternal)</sub>|
+
+<!--
 |<sub>[NA12878][NA12878-data], [\[father\]][NA12891-data], [\[mother\]][NA12892-data]</sub>|<sub>&times;30</sub>|<sub>180.8h</sub>|<sub>4.9h</sub>|<sub>123G</sub>|<sub>27.7Mb (paternal), 27.0Mb (maternal)</sub>|
+-->
 
 [HG00733-data]: https://www.ebi.ac.uk/ena/data/view/ERX3831682
 [HG00731-data]: https://www.ebi.ac.uk/ena/data/view/ERR3241754
@@ -167,33 +215,32 @@ redwood genome in a few days on a single machine. For trio binning assembly:
 [NA12891-data]: https://www.ebi.ac.uk/ena/data/view/ERR194160
 [NA12892-data]: https://www.ebi.ac.uk/ena/data/view/ERR194161
 
-Except NA12878, the assemblies above were produced by hifiasm v0.12 and can be
-downloaded at
-```txt
-ftp://ftp.dfci.harvard.edu/pub/hli/hifiasm/submission/hifiasm-0.12/
-```
-NA12878 was assembled with an older version of hifiasm and is available at
-```txt
-ftp://ftp.dfci.harvard.edu/pub/hli/hifiasm/NA12878-r253/
-```
-
+Human assemblies above can be acquired [from Zenodo][zenodo-human] and
+non-human ones are available [here][zenodo-nonh].
 
+[zenodo-human]: https://zenodo.org/record/4393631
+[zenodo-nonh]: https://zenodo.org/record/4393750
 [unitig]: http://wgs-assembler.sourceforge.net/wiki/index.php/Celera_Assembler_Terminology
 [gfa]: https://github.com/pmelsted/GFA-spec/blob/master/GFA-spec.md
 [paf]: https://github.com/lh3/miniasm/blob/master/PAF.md
 [yak]: https://github.com/lh3/yak
 
-## Getting Help
+## <a name="help"></a>Getting Help
 
 For a detailed description of options, please see `man ./hifiasm.1`. The `-h`
 option of hifiasm also provides a brief description of options. If you have
 further questions, please raise an issue at the [issue
 page](https://github.com/chhylp123/hifiasm/issues).
 
-## Limitations
+## <a name="limit"></a>Limitations
 
 1. Purging haplotig duplications may introduce misassemblies.
 
-## Citation
+## <a name="cite"></a>Citing Hifiasm
+
+If you use hifiasm in your work, please cite:
 
-Cheng, H., Concepcion, G.T., Feng, X., Zhang, H., Li H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat Methods 18, 170–175 (2021). https://doi.org/10.1038/s41592-020-01056-5
+> Cheng, H., Concepcion, G.T., Feng, X., Zhang, H., Li, H. (2021)
+> Haplotype-resolved de novo assembly using phased assembly graphs with
+> hifiasm. *Nat Methods*, **18**:170-175.
+> https://doi.org/10.1038/s41592-020-01056-5
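
The updated README names the partially phased outputs `NA12878.asm.bp.hap?.p_ctg.gfa`, and its Getting Started block converts a GFA file to FASTA with an awk one-liner. Putting the two together, a minimal sketch for converting every haplotype GFA to FASTA might look like the following. The helper name `gfa2fa` and the default prefix `NA12878.asm` are assumptions made for illustration. The files themselves only exist after a hifiasm run that produces them (v0.15 or later, according to the README).

```sh
#!/bin/sh
# Sketch: convert hifiasm GFA contigs to FASTA by reusing the awk one-liner
# shown in the README's Getting Started section.

# gfa2fa is a hypothetical helper name, not part of hifiasm.
gfa2fa() {
    # GFA S-lines hold segments: column 2 is the contig name, column 3 the sequence.
    awk '/^S/{print ">"$2; print $3}' "$1"
}

# Assumption: the output prefix defaults to NA12878.asm, as in the README examples.
prefix=${1:-NA12878.asm}

for gfa in "$prefix".bp.hap?.p_ctg.gfa; do
    [ -e "$gfa" ] || continue              # skip when the glob matches nothing
    gfa2fa "$gfa" > "${gfa%.gfa}.fa"       # e.g. *.hap1.p_ctg.gfa -> *.hap1.p_ctg.fa
done
```

Passing a different prefix as the first argument (for example `HG002.asm`) applies the same conversion to another assembly. The loop does nothing when no matching files are present.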