Stats docs (#171)

* stats docs * np.int problem fix --------- Co-authored-by: Phlya <flyamer@gmail.com>
open2c · Feb 4, 2023 · 0c94f7b · 0c94f7b
1 parent 4a6a76e
commit 0c94f7b
Show file tree

Hide file tree

Showing 3 changed files with 126 additions and 2 deletions.
diff --git a/doc/index.rst b/doc/index.rst
@@ -70,7 +70,8 @@ Contents:
    installation
    parsing
    sorting
-   formats 
+   formats
+   stats
    technotes
    cli_tools
 

diff --git a/doc/stats.rst b/doc/stats.rst
@@ -0,0 +1,123 @@
+`Pairtools stats` as a source of quality control metrics
+===========================================
+
+Overview
+--------
+
+`pairtools stats` produces a human-readable nested dictionary of statistics stored in
+a YAML file or a tab-separated text table (specified through the parameters).
+
+When calculating statistics, any number of filters can be applied to generate separate
+statistics for different categories of pairs, for example they can be filtered by the
+read mapping quality (mapq values). These are then stored as separate sections of the
+output file.
+
+- **Global statistics** include:
+    - number of pairs (total, unmapped, single-side mapped, etc.),
+    - total number of different pair types (UU, NN, NU, and others, see ` Pair types in pairtools docs <https://pairtools.readthedocs.io/en/latest/formats.html#pair-types>`_),
+    - number of contacts between all chromosome pairs
+
+- **Summary statistics** include:
+    - fraction of duplicates
+    - fraction of cis interactions (at different minimal distance cutoffs) out of total
+    - estimation of library complexity
+
+Summary statistics can inform you about the quality of the data.
+For example, more trans interactions can be a sign of problems with the 3C+ procedure and lower signal-to-noise ratio.
+Substantial mapping to mitochondrial chromosome (chrM) might be a sign of random ligation.
+
+- **P(s), or scaling.**  The dependence of contact frequency on the genomic
+distance referred to as the P(s) curve or scaling, which is a rich source of both biologically relevant information and technical quality of 3C+ experiments.
+The shape of P(s) is often used to characterize mechanisms of genome folding and reveal issues with QC.
+
+Interactive visualization of stats with MultiQC
+---------
+
+Install `multiqc`:
+
+.. code-block:: bash
+
+    pip install --upgrade --force-reinstall git+https://github.com/open2c/MultiQC.git
+
+Note that (for now) the pairtools module for MultiQC is only available in the open2C fork and not in the main MultiQC repository.
+
+Run MultiQC in a folder with one or multiple .stats files:
+
+.. code-block:: bash
+
+    multiqc .
+
+
+This will produce a nice .html file with interactive graphical summaries of the stats.
+
+
+Estimating library complexity
+----------------------------
+
+Pairtools assumes that each sequencing read is randomly chosen with
+replacement from a finite pool of fragments in DNA library [1]_ [2]_.
+With each new sequenced molecule, the expected number of observed unique molecules
+increases according to a simple equation:
+
+$$ U(N+1) = U(N) + (1 - {U(N) \\over C}), $$
+
+where $N$ is the number of sequenced molecules, $U(N)$ is the expected number
+of observed unique molecules after sequencing $N$ molecules, and C is the library complexity.
+This differential equation yields [1, 2]:
+
+$$ {U(N) \\over C} = 1 - exp( - {N \\over C}), $$
+
+which can be solved as
+
+$$ C = \Re(lambert W( - { \exp( - {1 \\over u} ) \\over u} ) ) + {1 \\over u} $$
+
+Library complexity can guide in the choice of sequencing depth of the library
+and provide an estimate of library quality.
+
+
+Illumina sequencing duplicates
+-----------------
+
+Importantly, you can estimate the complexity of Hi-C libraries using only small QC
+samples to decide if their quality permits deeper sequencing [3]_.
+These estimates, however, can be significantly biased by the presence of “optical” or
+“clustering” duplicates. Such duplicates occur as artefacts of the sequencing procedure.
+Optical duplicates appear in data generated on sequencers with non-patterned flowcells in
+cases the instrument either erroneously splits a signal from a single sequenced molecule
+into two. On the other hand, clustering duplicates appear on patterned flowcells, when
+during cluster generation a cluster occupies adjacent nanowells. [4]_.
+
+The rate of optical and clustering duplication depends on the technology and the operating
+conditions (e.g. molarity of the library loaded onto the flowcell), but not on the
+library complexity or sequencing depth. Thus, in small sequencing samples in particular
+the clustering duplication on recent Illumina instruments can severely inflate the
+observed levels of duplication [5]_, resulting in underestimation of the library complexity.
+
+While the frequency of PCR duplicates increases with sequencing depth,
+optical or clustering duplication levels may stay constant for a particular sequencer,
+provided the library is loaded at the same molarity. This means that the high frequency of
+clustering duplicates on the NovaSeq leads to severe underestimation of library complexity
+in the pilot runs. In particular, the recent models of Illumina sequencers with patterned
+flowcells (such as NovaSeq) suffer from increased clustering duplication rate, which may
+far exceed the level of PCR duplication.
+
+Luckily, optical and clustering duplicates can be distinguished from the PCR ones,
+as the former are located next to each other on the sequencing flow cell.
+In case of Illumina sequencers, pairtools dedup can infer the positions of sequencing
+reads from their IDs and focuses on geometrically distant duplicates to produce unbiased
+estimates of PCR duplication and library complexity.  Although SRA does not store original
+read IDs from the sequencer, this analysis is possible when pairtools is run on a dataset
+with original Illumina-generated read IDs.
+Note that in our experience even when accounting for optical/clustering duplicates, the
+complexity can be greatly underestimated, but is still a useful measurement to choose the
+most complex libraries.
+
+
+.. [1] Picard. http://broadinstitute.github.io/picard/
+
+.. [2] Thread: [Samtools-help] Pickard estimate for the size of a library - wrong or non-transparent? https://sourceforge.net/p/samtools/mailman/samtools-help/thread/DUB405-EAS154589A1ACEF2BE4C573D4592180@phx.gbl/
+
+.. [3] Rao, S. S. P. et al. A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell 159, 1665–1680 (2014).
+
+.. [4] Duplicates on Illumina. BioStars. https://www.biostars.org/p/229842/
+.. [5] Illumina Patterned Flow Cells Generate Duplicated Sequences. https://sequencing.qcfail.com/articles/illumina-patterned-flow-cells-generate-duplicated-sequences/
diff --git a/pairtools/lib/stats.py b/pairtools/lib/stats.py
@@ -58,7 +58,7 @@ def __init__(
                 ** np.arange(
                     min_log10_dist, max_log10_dist + 0.001, log10_dist_bin_step
                 )
-            ).astype(np.int),
+            ).astype(np.int_),
         ]
 
         # establish structure of an empty _stat:
-Original file line number
+Diff line change
@@ Expand Up / @@ -70,7 +70,8 @@ Contents: @@
        installation
        parsing
        sorting
-       formats
+       formats
+       stats
        technotes
        cli_tools
@@ Expand Down @@