From d9c47e21dbb0c55186bbc3ac159970b0632d6865 Mon Sep 17 00:00:00 2001 From: Heng Li Date: Sat, 17 Apr 2021 16:01:32 -0400 Subject: [PATCH 1/3] updated README --- README.md | 122 +++++++++++++++++++++++++++++++++++++----------------- 1 file changed, 83 insertions(+), 39 deletions(-) diff --git a/README.md b/README.md index 5dd7a5a..507e256 100644 --- a/README.md +++ b/README.md @@ -1,4 +1,4 @@ -## Getting Started +## Getting Started ```sh # Install hifiasm (requiring g++ and zlib) @@ -19,23 +19,40 @@ hifiasm -o HG002.asm -t32 HG002-file1.fq.gz HG002-file2.fq.gz yak count -b37 -t16 -o pat.yak <(cat pat_1.fq.gz pat_2.fq.gz) <(cat pat_1.fq.gz pat_2.fq.gz) yak count -b37 -t16 -o mat.yak <(cat mat_1.fq.gz mat_2.fq.gz) <(cat mat_1.fq.gz mat_2.fq.gz) hifiasm -o HG002.asm -t32 -1 pat.yak -2 mat.yak HG002-HiFi.fa.gz + +# Hi-C phasing with paired-end short reads in two FASTQ files +hifiasm -o HG002.asm --h1 read1.fq.gz --h2 read2.fq.gz HG002-HiFi.fq.gz ``` -## Introduction +## Table of Contents + +- [Getting Started](#started) +- [Introduction](#intro) +- [Why Hifiasm?](#why) +- [Usage](#use) + - [Assembling HiFi reads without additional data types](#hifionly) + - [Hi-C integration](#hic) + - [Trio binning](#trio) + - [Output files](#output) +- [Results](#results) +- [Getting Help](#help) +- [Limitations](#limit) +- [Citing Hifiasm](#cite) + +## Introduction -Hifiasm is a fast haplotype-resolved de novo assembler for PacBio Hifi reads. -It can assemble a human genome in several hours and works with the California -redwood genome, one of the most complex genomes sequenced so far. Hifiasm can -produce primary/alternate assemblies of quality competitive with the best -assemblers. It also introduces a new graph binning algorithm and achieves -the best haplotype-resolved assembly given trio data. +Hifiasm is a fast haplotype-resolved de novo assembler for PacBio HiFi reads. 
+It can assemble a human genome in several hours and assemble a ~30Gb California +redwood genome in a few days. Hifiasm emits partially phased assemblies of +quality competitive with the best assemblers. Given parental short reads or +Hi-C data, it produces arguably the best haplotype-resolved assemblies so far. -## Why Hifiasm? +## Why Hifiasm? * Hifiasm delivers high-quality assemblies. It tends to generate longer contigs and resolve more segmental duplications than other assemblers. -* Given sequence reads from the parents, hifiasm can produce overall the best +* Given Hi-C reads or short reads from the parents, hifiasm can produce overall the best haplotype-resolved assembly so far. It is the assembler of choice by the [Human Pangenome Project][hpp] for the first batch of samples. @@ -47,13 +64,15 @@ the best haplotype-resolved assembly given trio data. * Hifiasm is fast. It can assemble a human genome in half a day and assemble a ~30Gb redwood genome in three days. No genome is too large for hifiasm. -* Hifiasm is trivial to install and easy to use. It does not required python, - R or C++11 compilers and can be compiled into a single executable. The +* Hifiasm is trivial to install and easy to use. It does not require Python, + R or C++11 compilers, and can be compiled into a single executable. The default setting works well with a variety of genomes. [hpp]: https://humanpangenome.org -## Usage +## Usage + +### Assembling HiFi reads without additional data types A typical hifiasm command line looks like: ```sh @@ -61,11 +80,21 @@ hifiasm -o NA12878.asm -t 32 NA12878.fq.gz ``` where `NA12878.fq.gz` provides the input reads, `-t` sets the number of CPUs in use and `-o` specifies the prefix of output files. For this example, the -primary contigs are written to `NA12878.asm.p_ctg.gfa` and alternate contigs to -`NA12878.asm.a_ctg.gfa`. 
At the first run, hifiasm saves corrected reads and +primary contigs are written to `NA12878.asm.bp.p_ctg.gfa` and alternate contigs to +`NA12878.asm.bp.a_ctg.gfa`. Since v0.15, hifiasm also produces two sets of +partially phased contigs at `NA12878.asm.bp.hap?.p_ctg.gfa`. This pair of files +can be thought to represent the two haplotypes in a diploid genome, though with +occasional switch errors. The frequency of switches is determined by the +heterozygosity of the input sample. + +At the first run, hifiasm saves corrected reads and overlaps to disk as `NA12878.asm.*.bin`. It reuses the saved results to avoid the time-consuming all-vs-all overlap calculation next time. You may specify `-i` to ignore precomputed overlaps and redo overlapping from raw reads. +You can also dump error corrected in FASTA and/or overlaps in PAF with +```sh +hifiasm -o NA12878.asm -t 32 --write-paf --write-ec /dev/null +``` Hifiasm purges haplotig duplications by default. For inbred or homozygous genomes, you may disable purging with option `-l0`. Old HiFi reads may contain @@ -75,7 +104,24 @@ bloom filter which takes 16GB memory at the beginning. For genomes much larger than human, applying `-f38` or even `-f39` is preferred to save memory on k-mer counting. -When parental short reads are available, hifiasm can generate a pair of +### Hi-C integration + +Hifiasm can generate a pair of haplotype-resolved assemblies with paired-end +Hi-C reads: +```sh +hifiasm -o NA12878.asm -t32 --h1 read1.fq.gz --h2 read2.fq.gz HiFi-reads.fq.gz +``` +In this mode, each contig is supposed to be a haplotig, which by definition +comes from one parental haplotype only. Hifiasm often puts all contigs from the +same parental chromosome in one assembly. It has cleanly separated chrX and +chrY for a human male dataset. Nonetheless, phasing across centromeres is +challenging. Users should not expect hifiasm to phase entire chromosomes at the +moment. 
Also, contigs from different parental chromosomes are randomly mixed as +it is just not possible to phase across chromosomes with Hi-C. + +### Trio binning + +When parental short reads are available, hifiasm can also generate a pair of haplotype-resolved assemblies with trio binning. To perform such assembly, you need to count k-mers first with [yak][yak] first and then do assembly: ```sh @@ -85,19 +131,15 @@ hifiasm -o NA12878.asm -t 32 -1 pat.yak -2 mat.yak NA12878.fq.gz ``` Here `NA12878.asm.hap1.p_ctg.gfa` and `NA12878.asm.hap2.p_ctg.gfa` give the two haplotype assemblies. In the binning mode, hifiasm does not purge haplotig -duplications by default. Because hifiasm reuses saved overlaps, you can +duplicates by default. Because hifiasm reuses saved overlaps, you can generate both primary/alternate assemblies and trio binning assemblies with ```sh hifiasm -o NA12878.asm -t 32 NA12878.fq.gz 2> NA12878.asm.pri.log hifiasm -o NA12878.asm -t 32 -1 pat.yak -2 mat.yak /dev/null 2> NA12878.asm.trio.log ``` -The second command line will run much faster than the first. You can also dump -error corrected in FASTA and/or overlaps in PAF with -```sh -hifiasm -o NA12878.asm -t 32 --write-paf --write-ec /dev/null -``` +The second command line will run much faster than the first. -## Output files +### Output files For non-trio assembly, hifiasm generates the following files: @@ -126,9 +168,9 @@ For trio assembly, hifiasm generates the following files: Hifiasm writes error corrected reads to the *prefix*.ec.bin binary file and writes overlaps to *prefix*.ovlp.source.bin and *prefix*.ovlp.reverse.bin. 
-## Results +## Results -The following table shows the statistics of several hifiasm primary assemblies: +The following table shows the statistics of several hifiasm primary assemblies assembled with v0.12: |Dataset|Size|Cov.|Asm options|CPU time|Wall time|RAM| N50| |:---------------|-----:|-----:|:---------------------|-------:|--------:|----:|----------------:| @@ -155,7 +197,10 @@ redwood genome in a few days on a single machine. For trio binning assembly: |:---------------|-----:|-------:|--------:|----:|----------------:| |[HG00733][HG00733-data], [\[father\]][HG00731-data], [\[mother\]][HG00732-data]|×33|269.1h|6.9h|135G|35.1Mb (paternal), 34.9Mb (maternal)| |[HG002][NA24385-data], [\[father\]][NA24149-data], [\[mother\]][NA24143-data]|×36|305.4h|7.7h|137G|41.0Mb (paternal), 40.8Mb (maternal)| + + [HG00733-data]: https://www.ebi.ac.uk/ena/data/view/ERX3831682 [HG00731-data]: https://www.ebi.ac.uk/ena/data/view/ERR3241754 @@ -167,33 +212,32 @@ redwood genome in a few days on a single machine. For trio binning assembly: [NA12891-data]: https://www.ebi.ac.uk/ena/data/view/ERR194160 [NA12892-data]: https://www.ebi.ac.uk/ena/data/view/ERR194161 -Except NA12878, the assemblies above were produced by hifiasm v0.12 and can be -downloaded at -```txt -ftp://ftp.dfci.harvard.edu/pub/hli/hifiasm/submission/hifiasm-0.12/ -``` -NA12878 was assembled with an older version of hifiasm and is available at -```txt -ftp://ftp.dfci.harvard.edu/pub/hli/hifiasm/NA12878-r253/ -``` - +Human assemblies above can be acquired [from Zenodo][zenodo-human] and +non-human ones are available [here][zenodo-nonh]. 
+[zenodo-human]: https://zenodo.org/record/4393631 +[zenodo-nonh]: https://zenodo.org/record/4393750 [unitig]: http://wgs-assembler.sourceforge.net/wiki/index.php/Celera_Assembler_Terminology [gfa]: https://github.com/pmelsted/GFA-spec/blob/master/GFA-spec.md [paf]: https://github.com/lh3/miniasm/blob/master/PAF.md [yak]: https://github.com/lh3/yak -## Getting Help +## Getting Help For detailed description of options, please see `man ./hifiasm.1`. The `-h` option of hifiasm also provides brief description of options. If you have further questions, please raise an issue at the [issue page](https://github.com/chhylp123/hifiasm/issues). -## Limitations +## Limitations 1. Purging haplotig duplications may introduce misassemblies. -## Citation +## Citing Hifiasm + +If you use hifiasm in your work, please cite: -Cheng, H., Concepcion, G.T., Feng, X., Zhang, H., Li H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat Methods 18, 170–175 (2021). https://doi.org/10.1038/s41592-020-01056-5 +> Cheng, H., Concepcion, G.T., Feng, X., Zhang, H., Li, H. (2021) +> Haplotype-resolved de novo assembly using phased assembly graphs with +> hifiasm. *Nat Methods*, **18**:170-175. 
+> https://doi.org/10.1038/s41592-020-01056-5 From a0e4cbf80a376d70c5bb58993c0dcca853454277 Mon Sep 17 00:00:00 2001 From: Heng Li Date: Sat, 17 Apr 2021 16:07:34 -0400 Subject: [PATCH 2/3] minor changes --- README.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/README.md b/README.md index 507e256..4702cfa 100644 --- a/README.md +++ b/README.md @@ -12,16 +12,16 @@ awk '/^S/{print ">"$2;print $3}' test.p_ctg.gfa > test.p_ctg.fa # get primary c # Assemble inbred/homozygous genomes (-l0 disables duplication purging) hifiasm -o CHM13.asm -t32 -l0 CHM13-HiFi.fa.gz 2> CHM13.asm.log -# Assemble heterozygous with built-in duplication purging +# Assemble heterozygous genomes with built-in duplication purging hifiasm -o HG002.asm -t32 HG002-file1.fq.gz HG002-file2.fq.gz +# Hi-C phasing with paired-end short reads in two FASTQ files +hifiasm -o HG002.asm --h1 read1.fq.gz --h2 read2.fq.gz HG002-HiFi.fq.gz + # Trio binning assembly (requiring https://github.com/lh3/yak) yak count -b37 -t16 -o pat.yak <(cat pat_1.fq.gz pat_2.fq.gz) <(cat pat_1.fq.gz pat_2.fq.gz) yak count -b37 -t16 -o mat.yak <(cat mat_1.fq.gz mat_2.fq.gz) <(cat mat_1.fq.gz mat_2.fq.gz) hifiasm -o HG002.asm -t32 -1 pat.yak -2 mat.yak HG002-HiFi.fa.gz - -# Hi-C phasing with paired-end short reads in two FASTQ files -hifiasm -o HG002.asm --h1 read1.fq.gz --h2 read2.fq.gz HG002-HiFi.fq.gz ``` ## Table of Contents @@ -91,7 +91,7 @@ At the first run, hifiasm saves corrected reads and overlaps to disk as `NA12878.asm.*.bin`. It reuses the saved results to avoid the time-consuming all-vs-all overlap calculation next time. You may specify `-i` to ignore precomputed overlaps and redo overlapping from raw reads. 
-You can also dump error corrected in FASTA and/or overlaps in PAF with +You can also dump error corrected reads in FASTA and read overlaps in PAF with ```sh hifiasm -o NA12878.asm -t 32 --write-paf --write-ec /dev/null ``` From dfd7720f5a3e109ce556cdd17c02c023d91d8db0 Mon Sep 17 00:00:00 2001 From: Heng Li Date: Sat, 17 Apr 2021 16:28:12 -0400 Subject: [PATCH 3/3] clarify that hifiasm doesn't do scaffolding --- README.md | 3 +++ 1 file changed, 3 insertions(+) diff --git a/README.md b/README.md index 4702cfa..d7318c1 100644 --- a/README.md +++ b/README.md @@ -119,6 +119,9 @@ challenging. Users should not expect hifiasm to phase entire chromosomes at the moment. Also, contigs from different parental chromosomes are randomly mixed as it is just not possible to phase across chromosomes with Hi-C. +Hifiasm does not perform scaffolding for now. You need to run a standalone +scaffolder such as SALSA or 3D-DNA to scaffold phased haplotigs. + ### Trio binning When parental short reads are available, hifiasm can also generate a pair of
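The Getting Started context in this series converts a GFA to FASTA with `awk '/^S/{print ">"$2;print $3}'`. As a minimal sketch, the same one-liner could be looped over the `hap?.p_ctg.gfa` files that the updated README describes. The toy file name and its two segments below are made up for illustration; real output names depend on the `-o` prefix and the assembly mode.

```sh
# Hypothetical toy GFA standing in for a hifiasm haplotype output:
# two S (segment) records and one L (link) record, tab-separated.
printf 'S\tctg1\tACGT\nL\tctg1\t+\tctg2\t+\t0M\nS\tctg2\tGGCC\n' > toy.hap1.p_ctg.gfa

# Apply the README's awk one-liner to every matching haplotype GFA.
# Only S lines carry sequence: field 2 is the contig name, field 3 the bases.
for gfa in toy.hap?.p_ctg.gfa; do
  awk '/^S/{print ">"$2;print $3}' "$gfa" > "${gfa%.gfa}.fa"
done
```

Each `.fa` file then holds one FASTA record per contig; link lines are skipped because they do not start with `S`.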