Bioframe enables flexible and scalable operations on genomic interval dataframes in Python.
Bioframe is built directly on top of Pandas. Bioframe provides:
- A variety of genomic interval operations that work directly on dataframes.
- Operations for special classes of genomic intervals, including chromosome arms and fixed-size bins.
- Conveniences for diverse tabular genomic data formats and loading genome assembly summary information.
Read the docs, including the guide, as well as the bioframe preprint for more information.
Bioframe is an Affiliated Project of NumFOCUS.
Bioframe is available on PyPI and bioconda:
pip install bioframe
Interested in contributing to bioframe? That's great! To get started, check out the contributing guide. Discussions about the project roadmap take place on the Open2C Slack and at the regular developer meetings scheduled there. Anyone can join and participate!
Key genomic interval operations in bioframe include:
- overlap: Find pairs of overlapping genomic intervals between two dataframes.
- closest: For every interval in a dataframe, find the closest intervals in a second dataframe.
- cluster: Group overlapping intervals in a dataframe into clusters.
- complement: Find genomic intervals that are not covered by any interval from a dataframe.
Bioframe additionally has functions that are frequently used for genomic interval operations and can be expressed as combinations of these core operations and dataframe operations, including: coverage, expand, merge, select, and subtract.
To overlap two dataframes, call:
import bioframe as bf
bf.overlap(df1, df2)
For these two input dataframes, with intervals all on the same chromosome:
[Figure: input intervals in df1 and df2]
overlap will return the following interval pairs as overlaps:
[Figure: overlapping interval pairs]
To merge all overlapping intervals in a dataframe, call:
import bioframe as bf
bf.merge(df1)
For this input dataframe, with intervals all on the same chromosome:
[Figure: input intervals in df1]
merge will return a new dataframe with these merged intervals:
[Figure: merged intervals]
See the guide for visualizations of other interval operations in bioframe.
Bioframe includes utilities for reading genomic file formats into dataframes and vice versa. One handy function is read_table, which mirrors pandas’s read_csv/read_table but provides a schema argument to populate column names for common tabular file formats.
import bioframe
jaspar_url = 'http://expdata.cmmt.ubc.ca/JASPAR/downloads/UCSC_tracks/2022/hg38/MA0139.1.tsv.gz'
# Skip the first line of the file and name the columns using the 'jaspar' schema
ctcf_motif_calls = bioframe.read_table(jaspar_url, schema='jaspar', skiprows=1)
See this Jupyter notebook for an example of how to assign TF motifs to ChIP-seq peaks using bioframe.
If you use bioframe in your work, please cite:
@article{bioframe_2024,
author = {Open2C and Abdennur, Nezar and Fudenberg, Geoffrey and Flyamer, Ilya M and Galitsyna, Aleksandra A and Goloborodko, Anton and Imakaev, Maxim and Venev, Sergey},
doi = {10.1093/bioinformatics/btae088},
journal = {Bioinformatics},
title = {{Bioframe: Operations on Genomic Intervals in Pandas Dataframes}},
year = {2024}
}