.. moduleauthor:: Derek Barnett, David Alexander, Marcus Kinsella, Yuan Li, James Drake
PacBio's previous alignment file format (cmp.h5) contained a data
table called the alignment index that recorded auxiliary identifying
information and precomputed summary statistics per aligned read. This
table served several purposes:
- it enabled fast random access to aligned reads satisfying fairly complex predicates, for example, reads from a specific list of ZMWs which had unambiguous mapping (MapQV==254), or a read with a given readname.
- it allowed summary reports (readlength, mapped identity/accuracy, etc.) to be constructed by quick operations over the alignment index instead of loading all of the sequence reads for each analysis.
In order to provide backwards-compatibility with the APIs enabled for
accessing the cmp.h5, we have devised a new BAM companion file,
the PacBio BAM index, which supports the two use cases above.
This is version 3.0.0 of the bam.pbi specification.
Changelog will go here in the future
- Random-access queries, including:

  - by reference or genomic region
  - by read group
  - by query name
  - by ZMW
  - by barcode index
  - etc.

- Obtain information without processing entire BAM file
- Calculate summary statistics
- Reverse-lookup - get information for a record, given its index
- Layout - file sections follow each other immediately in the file and are described below.

  - PBI Header
  - Subread Data
  - Mapped Data (optional)
  - Coordinate-Sorted Data (optional)
  - Barcode/Adapter Data (optional)

Field | Size | Definition | Value |
---|---|---|---|
magic | char[4] | PBI magic string | PBI\1 |
version | uint32_t | PBI format version (xx.yy.zz) | 0x00xxyyzz |
pbi_flags | uint16_t | bitflag describing file contents 1 | |
n_reads | uint32_t | number of reads in the BAM file | |
reserved | char[18] | reserved space for future expansion | fill(0x00) |
1 pbi_flags:

Flag | Value | Description |
---|---|---|
Basic | 0x0000 | PbiHeader & SubreadData only |
Mapped | 0x0001 | MappedData section present |
Coordinate Sorted | 0x0002 | CoordinateSortedData section present |
Barcode/Adapter | 0x0004 | BarcodeAdapterData section present |

Values 0x0008 through 0x8000 are available to mark future data modifiers, additional sections, etc.
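The header layout and flag values above can be sketched as a small parser. This is a minimal illustration, not a reference implementation: it assumes the 32 header bytes are already available uncompressed, that fields are little-endian and packed without padding, and that each component of the version occupies one byte of 0x00xxyyzz. The names `PBI_HEADER`, `PbiHeader`, and `parse_pbi_header` are hypothetical.

```python
import struct
from typing import List, NamedTuple, Tuple

# Field order and sizes follow the header table:
# magic char[4], version uint32, pbi_flags uint16, n_reads uint32, reserved char[18].
# ASSUMPTION: little-endian, no padding ("<" disables alignment), 32 bytes in total.
PBI_HEADER = struct.Struct("<4sIHI18s")

# Flag bits from the pbi_flags table; "Basic" (0x0000) means no optional section.
PBI_FLAGS = {
    0x0001: "Mapped",
    0x0002: "Coordinate Sorted",
    0x0004: "Barcode/Adapter",
}


class PbiHeader(NamedTuple):
    version: Tuple[int, int, int]
    flags: int
    n_reads: int
    sections: List[str]


def parse_pbi_header(raw: bytes) -> PbiHeader:
    """Decode the fixed-size PBI header from the start of `raw`."""
    magic, version, flags, n_reads, _reserved = PBI_HEADER.unpack(
        raw[:PBI_HEADER.size]
    )
    if magic != b"PBI\x01":
        raise ValueError(f"not a PBI header, magic={magic!r}")
    # version is stored as 0x00xxyyzz
    major = (version >> 16) & 0xFF
    minor = (version >> 8) & 0xFF
    patch = version & 0xFF
    sections = [name for bit, name in PBI_FLAGS.items() if flags & bit]
    return PbiHeader((major, minor, patch), flags, n_reads, sections)


# Build a synthetic header for version 3.0.0 with Mapped and Coordinate Sorted set.
raw = PBI_HEADER.pack(b"PBI\x01", 0x00030000, 0x0001 | 0x0002, 10, bytes(18))
header = parse_pbi_header(raw)
```

Here `header.version` is `(3, 0, 0)` and `header.sections` lists the optional sections a reader should expect after SubreadData.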
SubreadData | |||||
---|---|---|---|---|---|
Field | Size | Definition | |||
|
|||||
|
|||||
|
|||||
|
|||||
|
|||||
|
1 Read group identifiers for PacBio data are calculated as follows:

    RGID_STRING := md5(movieName + "//" + readType)[:8]
    RGID_INT := int32.Parse(RGID_STRING)

RGID_STRING is used in the @RG header. RGID_INT is used in the RG tag of BAM records and here in the PBI index. Note that RGID_INT may be negative.
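The read group calculation can be sketched in Python as follows. Two points are assumptions rather than statements from this specification: the md5 digest is taken in its hexadecimal text form, and `int32.Parse` is interpreted as reading those 8 hex digits as a 32-bit two's-complement value (which is one way RGID_INT can end up negative). The function names and the movie name in the usage line are hypothetical.

```python
import hashlib


def rgid_string(movie_name: str, read_type: str) -> str:
    """First 8 hex characters of md5(movieName + "//" + readType)."""
    text = f"{movie_name}//{read_type}"
    return hashlib.md5(text.encode("utf-8")).hexdigest()[:8]


def rgid_int(rgid: str) -> int:
    """Interpret an 8-digit hex string as a signed 32-bit integer.

    ASSUMPTION: hex parsing followed by two's-complement reinterpretation.
    """
    value = int(rgid, 16)
    # Values with the top bit set wrap around to negative numbers.
    return value - (1 << 32) if value >= (1 << 31) else value


# Hypothetical movie name and read type, for illustration only.
example_id = rgid_string("m00001_000000_000000", "SUBREAD")
example_int = rgid_int(example_id)
```

Any identifier whose first hex digit is 8 or higher maps to a negative RGID_INT, so readers comparing against the index should use a signed 32-bit type.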
MappedData | |||||
---|---|---|---|---|---|
Field | Size | Definition | |||
|
|||||
|
|||||
|
|||||
|
|||||
|
|||||
|
|||||
|
|||||
|
|||||
|
Note
Note the absence of the nDel and nIns values in the index.
These values are readily computed as:

    nIns = aEnd - aStart - nM - nMM
    nDel = tEnd - tStart - nM - nMM
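The two formulas follow from the fact that the aligned query span (aEnd - aStart) is made up of matches, mismatches, and insertions, while the reference span (tEnd - tStart) is made up of matches, mismatches, and deletions. A minimal sketch, with a hypothetical helper name and made-up coordinates:

```python
from typing import Tuple


def indel_counts(a_start: int, a_end: int,
                 t_start: int, t_end: int,
                 n_m: int, n_mm: int) -> Tuple[int, int]:
    """Return (nIns, nDel) derived from the stored MappedData values."""
    # Query bases that are neither matches nor mismatches are insertions.
    n_ins = a_end - a_start - n_m - n_mm
    # Reference bases that are neither matches nor mismatches are deletions.
    n_del = t_end - t_start - n_m - n_mm
    return n_ins, n_del


# Made-up record: 100 query bases aligned over 98 reference bases,
# with 90 matches and 5 mismatches.
n_ins, n_del = indel_counts(a_start=0, a_end=100,
                            t_start=1000, t_end=1098,
                            n_m=90, n_mm=5)
# n_ins == 5, n_del == 3
```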
CoordinateSortedData | |||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
Field | Size | Definition | |||||||||
n_tids | uint32_t | Number of reference sequences | |||||||||
|
In a coordinate-sorted BAM file, the records mapped to each reference form a contiguous block of row numbers.
1 This dataset should be sorted in ascending order of the uint32 cast of tId (thus a tId of -1 will follow all other tId values).

2 Data fields beginRow and endRow. If tId[i]==t, then [beginRow, endRow) represents the range of reads (by 0-based ordinal position in the BAM file) mapped to the reference contig with tId of t. If no BAM records are aligned to t, then we should have beginRow, endRow = -1.
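The ordering rule and the half-open row range can be illustrated with a short sketch. It assumes the per-reference entries have already been read into plain Python integers as (tId, beginRow, endRow) tuples; the on-disk types of these fields are not restated here, and the function names are hypothetical.

```python
from typing import List, Tuple

Entry = Tuple[int, int, int]  # (tId, beginRow, endRow)


def uint32_cast(t_id: int) -> int:
    """Reinterpret a signed 32-bit tId as unsigned, so -1 becomes 4294967295."""
    return t_id & 0xFFFFFFFF


def sort_entries(entries: List[Entry]) -> List[Entry]:
    """Order entries by the uint32 cast of tId; tId == -1 therefore sorts last."""
    return sorted(entries, key=lambda entry: uint32_cast(entry[0]))


def rows_for_reference(entries: List[Entry], t: int) -> range:
    """Row numbers of BAM records mapped to reference t, as [beginRow, endRow)."""
    for t_id, begin_row, end_row in entries:
        if t_id == t:
            if begin_row == -1 and end_row == -1:
                return range(0)  # no records aligned to this reference
            return range(begin_row, end_row)
    raise KeyError(f"tId {t} not present in CoordinateSortedData")


# Made-up entries: reference 2 has no aligned records, tId -1 holds unmapped reads.
entries = sort_entries([(-1, 5, 7), (1, 3, 5), (0, 0, 3), (2, -1, -1)])
rows = list(rows_for_reference(entries, 1))
```

With these made-up values the sorted tId order is 0, 1, 2, -1, and `rows` is `[3, 4]`.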
BarcodeAdapterData | |||||
---|---|---|---|---|---|
Field | Size | Definition | |||
|
|||||
|
|||||
|
|||||
|