Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

docs: cleanup and improve descriptions #126

Merged
merged 18 commits into from
Nov 1, 2022
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
7 changes: 3 additions & 4 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,10 +1,9 @@
### Note: This repository is still under-construction!
Please wait until we have published our first tagged release before using our code.

# haptools

Haptools is a collection of tools for simulating and analyzing genotypes and phenotypes while taking into account haplotype information. It is particularly designed for analysis of individuals with admixed ancestries, although the tools can also be used for non-admixed individuals.
Haptools is a collection of tools for simulating and analyzing genotypes and phenotypes while taking into account haplotype information. Haptools supports fast simulation of admixed genomes (with `simgenotype`), visualization of admixture tracks (with `karyogram`), simulating haplotype- and local ancestry-specific phenotype effects (with `transform` and `simphenotype`), and computing a variety of common file operations and statistics in a haplotype-aware manner.

Homepage: [https://haptools.readthedocs.io/](https://haptools.readthedocs.io/)

Visit our homepage for installation and usage instructions.

![haptools commands](https://drive.google.com/uc?id=1c0i_Hjms7579s24zRsKp5yMs7BxNHed_)
2 changes: 1 addition & 1 deletion docs/api/data.rst
Original file line number Diff line number Diff line change
Expand Up @@ -180,7 +180,7 @@ The :class:`GenotypesPLINK` class offers experimental support for reading and wr

.. code-block:: bash

pip install git+https://github.com/cast-genomics/haptools.git#egg=haptools[files]
pip install haptools[files]

The :class:`GenotypesPLINK` class inherits from the :class:`GenotypesRefAlt` class, so it has all the same methods and properties. Loading genotypes is the exact same, for example.

Expand Down
2 changes: 1 addition & 1 deletion docs/commands/karyogram.rst
Original file line number Diff line number Diff line change
Expand Up @@ -4,7 +4,7 @@
karyogram
=========

Takes as input a breakpoints file (e.g. as output by ``simgenotype``) and a sample name, and plots a karyogram depicting local ancestry tracks.
Takes as input a :doc:`breakpoints file </formats/breakpoints>` (e.g. as output by :doc:`simgenotype </commands/simgenotype>`) and a sample name, and plots a karyogram depicting local ancestry tracks.

Basic Usage
~~~~~~~~~~~
Expand Down
19 changes: 12 additions & 7 deletions docs/commands/simphenotype.rst
Original file line number Diff line number Diff line change
Expand Up @@ -4,15 +4,10 @@
simphenotype
============

Simulates a complex trait, taking into account haplotype- or local-ancestry- specific effects as well as traditional variant-level effects. The user denotes causal variables to use within the simulation by specifying them in a :doc:`.hap file </formats/haplotypes>`. Phenotypes are simulated from genotypes output by the :doc:`transform command </commands/transform>`.

To encode simple SNPs as causal variants within a ``.hap`` file, use the haptools API like in :ref:`this example <api-examples-snps2hap>`.
Simulates a complex trait, taking into account haplotype- or local-ancestry- specific effects as well as traditional variant-level effects. The user denotes causal haplotypes or variants by specifying them in a :doc:`.hap file </formats/haplotypes>`. Phenotypes are simulated from genotypes output by the :doc:`transform command </commands/transform>`.

The implementation is based on the `GCTA GWAS Simulation <https://yanglab.westlake.edu.cn/software/gcta/#GWASSimulation>`_ utility.

.. note::
Your ``.hap`` files must contain a "beta" extra field. See :ref:`this section <formats-haplotypes-extrafields-simphenotype>` of the ``.hap`` format spec for more details.

Usage
~~~~~
.. code-block:: bash
Expand Down Expand Up @@ -57,11 +52,21 @@ The heritability :math:`h^2` is user-specified, but if it is not provided, then

If a prevalence for the disease is specified, the final :math:`\vec{y}` value will be thresholded to produce a binary case/control trait with the desired fraction of diseased individuals.

Input
~~~~~
Genotypes must be specified in VCF and haplotypes must be specified in the :doc:`.hap file format </formats/haplotypes>`. If you'd like to encode simple SNPs as causal variants within a ``.hap`` file, use the haptools API like in :ref:`this example <api-examples-snps2hap>`.

.. note::
Your ``.hap`` files must contain a "beta" extra field. See :ref:`this section <formats-haplotypes-extrafields-simphenotype>` of the ``.hap`` format spec for more details.

Alternatively, you may also specify genotypes in PLINK2 PGEN format. Just use the appropriate ".pgen" file extension in the input. See the documentation for genotypes in :ref:`the format docs <formats-genotypesplink>` for more information.

Output
~~~~~~
Phenotypes are output in the PLINK2-style ``.pheno`` file format. If ``--replications`` was set to greater than 1, additional columns are output for each simulated trait.

Note that case/control phenotypes are encoded as 0 (control) + 1 (case) **not** 1 (control) + 2 (case). The latter is used by PLINK2 unless the ``--1`` flag is used (see `the PLINK2 docs <https://www.cog-genomics.org/plink/2.0/input#input_missing_phenotype>`_). Therefore, you must use ``--1`` when providing our ``.pheno`` files to PLINK.
.. note::
Case/control phenotypes are encoded as 0 (control) + 1 (case) **not** 1 (control) + 2 (case). The latter is assumed by PLINK2 unless the ``--1`` flag is used (see `the PLINK2 docs <https://www.cog-genomics.org/plink/2.0/input#input_missing_phenotype>`_). Therefore, you must use ``--1`` when providing our ``.pheno`` files to PLINK.

Examples
~~~~~~~~
Expand Down
37 changes: 31 additions & 6 deletions docs/commands/transform.rst
Original file line number Diff line number Diff line change
Expand Up @@ -6,14 +6,14 @@ transform

Transform a set of genotypes via a list of haplotypes. Create a new VCF containing haplotypes instead of variants.

The ``transform`` command takes as input a set of genotypes in VCF and a list of haplotypes (specified as a :doc:`.hap file </formats/haplotypes>` without any extra fields) and outputs a set of haplotype "genotypes" in VCF.
The ``transform`` command takes as input a set of genotypes and a list of haplotypes and outputs a set of haplotype *pseudo-genotypes*, where each haplotype is encoded as a bi-allelic variant record in the output. In other words, each sample will have a genotype of ``0|0``, ``1|0``, ``0|1``, or ``1|1`` indicating whether each of their two chromosome copies contains the alleles of a haplotype.

You may also specify genotypes in PLINK2 PGEN format. Just use the appropriate ".pgen" file extension in the input and/or output. See the documentation for genotypes in :ref:`the format docs <formats-genotypesplink>` for more information.
.. figure:: https://drive.google.com/uc?id=1GyluoQ3IeGXo9FjWsCC3XwaRuflr68pn
:figwidth: 600
:align: center
:alt: Transforming genotypes via haplotypes

Ancestry
~~~~~~~~

If your ``.hap`` file contains an "ancestry" extra field and your VCF contains a "POP" format field or an accompanying :ref:`.bp file <formats-breakpoints>` (as output by ``simgenotype``), you should specify the ``--ancestry`` flag. This will enable us to match the population labels of each haplotype against those in the genotypes output by ``simgenotype``. See :ref:`this section <formats-haplotypes-extrafields-transform>` of the ``.hap`` format spec for more details.
Users may also specify an ancestral population label for each haplotype. See the :ref:`ancestry section <commands-transform-input-ancestry>` for more details.

Usage
~~~~~
Expand All @@ -32,6 +32,31 @@ Usage
--verbosity [CRITICAL|ERROR|WARNING|INFO|DEBUG|NOTSET] \
GENOTYPES HAPLOTYPES

Input
~~~~~
Genotypes must be specified in VCF and haplotypes must be specified in the :doc:`.hap file format </formats/haplotypes>`.

Alternatively, you may specify genotypes in PLINK2 PGEN format. Just use the appropriate ".pgen" file extension in the input. See the documentation for genotypes in :ref:`the format docs <formats-genotypesplink>` for more information.

.. _commands-transform-input-ancestry:

Ancestry
--------
If your ``.hap`` file contains :ref:`an "ancestry" extra field <formats-haplotypes-extrafields-transform>` and your VCF contains a "POP" format field (as output by :doc:`simgenotype </commands/simgenotype>`), you should specify the ``--ancestry`` flag.
This will enable us to match the population labels of each haplotype against those in the genotypes output by :doc:`simgenotype </commands/simgenotype>`.
In other words, a sample is said to contain a haplotype only if all of the alleles of the haplotype are labeled with the haplotype's ancestry.

.. figure:: https://drive.google.com/uc?id=1uQ08d6X0vdbyLOXDN9evdjPlnqdpI_3k
:figwidth: 600
:align: center
:alt: Transforming via ancestry labels

Alternatively, you may specify a :doc:`breakpoints file </formats/breakpoints>` accompanying the genotypes file. It must have the same name as the genotypes file but with a ``.bp`` file ending. If such a file exists, ``transform`` will ignore any "POP" format fields in the genotypes file and instead obtain the ancestry labels from the breakpoints file. This is primarily a speed enhancement, since it's faster to load ancestral labels from the breakpoints file.

Output
~~~~~~
Transform outputs *psuedo-genotypes* in VCF, but you may request genotypes in PLINK2 PGEN format, instead. Just use the appropriate ".pgen" file extension in the output path. See the documentation for genotypes in :ref:`the format docs <formats-genotypesplink>` for more information.

Examples
~~~~~~~~
.. code-block:: bash
Expand Down
2 changes: 1 addition & 1 deletion docs/formats/genotypes.rst
Original file line number Diff line number Diff line change
Expand Up @@ -27,4 +27,4 @@ If you run out memory when using PGEN files, consider reading variants from the

.. code-block:: bash

pip install git+https://github.com/cast-genomics/haptools.git#egg=haptools[files]
pip install haptools[files]
5 changes: 5 additions & 0 deletions docs/formats/haplotypes.rst
Original file line number Diff line number Diff line change
Expand Up @@ -6,6 +6,11 @@ Haplotypes

This document describes our custom file format specification for haplotypes: the ``.hap`` file.

.. figure:: https://drive.google.com/uc?id=16GsMOCihu7qvE27iRz6ydSTtt76ZeaAY
:figwidth: 600
:align: center
:alt: The .hap file format

This is a tab-separated file composed of different types of lines. The first field of each line is a single, uppercase character denoting the type of line. The following line types are supported.

.. list-table::
Expand Down
60 changes: 24 additions & 36 deletions docs/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -3,44 +3,31 @@
haptools
========

Haptools is a collection of tools for simulating and analyzing genotypes and phenotypes while taking into account haplotype information. It is particularly designed for analysis of individuals with admixed ancestries, although the tools can also be used for non-admixed individuals.
Haptools is a collection of tools for simulating and analyzing genotypes and phenotypes while taking into account haplotype and ancestry information.

Installation
~~~~~~~~~~~~
.. note::
To reduce the likelihood of errors, we recommend installing ``haptools`` within a new conda environment using a recent version of pip:

.. code-block:: bash

conda create -y -n haptools -c conda-forge 'pip>=22.2.2'
conda activate haptools

We have not officially published ``haptools`` yet, but in the meantime, you can install it directly from our Github repository.

.. code-block:: bash
We support fast simulation of admixed genomes, visualization of admixture tracks, simulating haplotype- and local ancestry-specific phenotype effects, and computing a variety of common file operations and statistics in a haplotype-aware manner.

pip install git+https://github.com/cast-genomics/haptools.git
At the core of haptools lies the :doc:`.hap file </formats/haplotypes>`, our new file format for haplotypes designed for speed, extensibility, and ease-of-use.

Installing ``haptools`` with the "files" extra requirements enables automatic support for a variety of additional file formats, like PLINK2 PGEN files.
Commands
~~~~~~~~

.. code-block:: bash
* :doc:`haptools simgenotype </commands/simgenotype>`: Simulate genotypes for admixed individuals under user-specified demographic histories.

pip install git+https://github.com/cast-genomics/haptools.git#egg=haptools[files]
* :doc:`haptools simphenotype </commands/simphenotype>`: Simulate a complex trait, taking into account local ancestry- or haplotype- specific effects. ``haptools simphenotype`` takes as input a VCF file (usually from ``haptools transform``) and outputs simulated phenotypes for each sample.

Summary of Commands
~~~~~~~~~~~~~~~~~~~
* :doc:`haptools karyogram </commands/karyogram>`: Visualize a "chromosome painting" of local ancestry labels based on breakpoints output by ``haptools simgenotype``.

``haptools`` consists of multiple utilities listed below. Click on a utility to see more detailed usage information.
* :doc:`haptools transform </commands/transform>`: Transform a set of genotypes via a list of haplotypes. Create a new VCF containing haplotypes instead of variants.

* `haptools simgenotype </commands/simgenotype>`_: Simulate genotypes for admixed individuals under user-specified demographic histories.
* :doc:`haptools index </commands/index>`: Sort, compress, and index our custom file format for haplotypes.

* `haptools simphenotype </commands/simphenotype>`_: Simulate a complex trait, taking into account local ancestry- or haplotype- specific effects. ``haptools simphenotype`` takes as input a VCF file (usually from ``haptools transform``) and outputs simulated phenotypes for each sample.
* :doc:`haptools ld </commands/ld>`: Compute Pearson's correlation coefficient between a target haplotype and a set of haplotypes.

* `haptools karyogram </commands/karyogram>`_: Visualize a "chromosome painting" of local ancestry labels based on breakpoints output by ``haptools simgenotype``.

* `haptools transform </commands/transform>`_: Transform a set of genotypes via a list of haplotypes. Create a new VCF containing haplotypes instead of variants.

* `haptools ld </commands/ld>`_: Compute Pearson's correlation coefficient between a target haplotype and a set of haplotypes.
.. figure:: https://drive.google.com/uc?id=1c0i_Hjms7579s24zRsKp5yMs7BxNHed_
:figwidth: 600
:align: center
:alt: Overview of haptools commands

Outputs produced by these utilities are compatible with each other.
For example ``haptools simgenotype`` outputs a VCF file with local ancestry information annotated for each variant.
Expand All @@ -55,6 +42,15 @@ We gladly welcome any contributions to ``haptools``!
Please read our :doc:`contribution guidelines </project_info/contributing>` and then submit a `Github issue <https://github.com/cast-genomics/haptools/issues>`_.


.. toctree::
:caption: Overview
:name: overview
:hidden:
:maxdepth: 1

project_info/installation
project_info/contributing

.. toctree::
:caption: File Formats
:name: formats
Expand Down Expand Up @@ -91,11 +87,3 @@ Please read our :doc:`contribution guidelines </project_info/contributing>` and
api/data
api/modules
api/examples

.. toctree::
:caption: Project Info
:name: project-info
:hidden:
:maxdepth: 1

project_info/contributing
23 changes: 23 additions & 0 deletions docs/project_info/installation.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,23 @@
.. _project_info-installation:

============
Installation
============

You can install ``haptools`` from PyPI.

.. code-block:: bash

pip install haptools

Installing ``haptools`` with the "files" extra requirements enables automatic support for a variety of additional file formats, like PLINK2 PGEN files.

.. code-block:: bash

pip install haptools[files]

We also support installing ``haptools`` from Bioconda.

.. code-block:: bash

conda install -c conda-forge -c bioconda haptools
25 changes: 21 additions & 4 deletions haptools/data/breakpoints.py
Original file line number Diff line number Diff line change
Expand Up @@ -285,7 +285,12 @@ def population_array(
dtype = HapBlock[0][1] if self.labels is None else np.uint8
arr = np.empty((len(data), len(variants), 2), dtype=dtype)
# iterate through the variants belonging to each chromosome
for chrom in set(variants["chrom"]):
gts_chroms = set(variants["chrom"])
self.log.info(
f"Obtaining ancestry for {len(data)} samples and {len(variants)} "
f"variants in {len(gts_chroms)} chromosomes"
)
for chrom in gts_chroms:
var_idxs = variants["chrom"] == chrom
positions = variants["pos"][var_idxs]
# obtain the population labels of each sample
Expand All @@ -297,9 +302,21 @@ def population_array(
# aren't sorted
# Now try to figure out the right population labels using binary
# search and then store them in the result matrix
arr[samp_idx, var_idxs, strand_num] = chrom_block["pop"][
self._find_blocks(chrom_block["bp"], positions)
]
try:
arr[samp_idx, var_idxs, strand_num] = chrom_block["pop"][
self._find_blocks(chrom_block["bp"], positions)
]
except ValueError as e:
diff = gts_chroms.difference(blocks["chrom"])
if str(e).startswith("Position ") and len(diff):
samp_id = tuple(data.keys())[samp_idx]
raise ValueError(
f"Chromosomes {diff} in the genotypes are absent in "
f"the breakpoints for sample {samp_id}_{strand_num+1}."
" Check that your 'chr' prefixes match!"
)
else:
raise e
return arr

def write(self):
Expand Down
5 changes: 2 additions & 3 deletions haptools/data/genotypes.py
Original file line number Diff line number Diff line change
Expand Up @@ -710,11 +710,10 @@ def __init__(self, fname: Path | str, log: Logger = None, chunk_size: int = None
global pgenlib
import pgenlib
except ImportError:
url = "https://github.com/cast-genomics/haptools.git##egg=haptools[files]"
raise ImportError(
"We cannot read PGEN files without the pgenlib library. Please "
"reinstall haptools with the 'files' extra requirements via\n"
f"pip install 'git+{url}'"
"reinstall haptools with the 'files' extra requirement via\n"
f"pip install haptools[files]"
)

def read_samples(self, samples: list[str] = None):
Expand Down
41 changes: 41 additions & 0 deletions haptools/data/haplotypes.py
Original file line number Diff line number Diff line change
Expand Up @@ -1131,3 +1131,44 @@ def sort(self):
self.data = dict(sorted(self.data.items(), key=lambda item: item[1]))
for hap in self.data.values():
hap.sort()

def subset(self, haplotypes: tuple[str], inplace: bool = False):
"""
Subset these haplotypes to a smaller set of haplotypes

The order of the haplotypes in the subsetted instance will match the order in
the provided tuple parameters.

Parameters
----------
haplotypes: tuple[str]
A subset of haplotype IDs to keep
inplace: bool, optional
If False, return a new Genotypes object; otherwise, alter the current one

Returns
-------
A new Haplotypes object if inplace is set to False, else returns None
"""
hps = self
if not inplace:
hps = self.__class__(self.fname, self.log)
hps.data = self.data
hps.types = self.types
hps.version = self.version
# Subset the haplotypes
data = {}
missing = set()
for hap_id in haplotypes:
try:
data[hap_id] = hps.data[hap_id]
except KeyError:
missing.add(hap_id)
if len(missing):
self.log.warning(
f"Saw {len(missing)} fewer haplotypes than requested. Proceeding with "
f"{len(hps.data)} haplotypes."
)
hps.data = data
if not inplace:
return hps
Loading