Skip to content

Commit

Permalink
Merge branch 'master' of github.com:4dn-dcic/Rpairix
Browse files Browse the repository at this point in the history
  • Loading branch information
SooLee committed May 3, 2017
2 parents 7662c0a + 1a4ee14 commit df3f0cc
Show file tree
Hide file tree
Showing 22 changed files with 194 additions and 69 deletions.
2 changes: 2 additions & 0 deletions .travis.yml
Original file line number Diff line number Diff line change
Expand Up @@ -4,3 +4,5 @@ r:
- release
- devel
- oldrel
bioc_packages: GenomicRanges
bioc_packages: InteractionSet
5 changes: 4 additions & 1 deletion DESCRIPTION
Original file line number Diff line number Diff line change
Expand Up @@ -5,7 +5,10 @@ Authors@R: person("Soo", "Lee", email = "duplexa@gmail.com", role = c("aut", "cr
Description: R binder for pairix, tool for querying a pair of genomic ranges in a pairs file (pairix-indexed bgzipped text file)
Depends:
R (>= 3.1)
Imports:
GenomicRanges,
InteractionSet
License: MIT
Encoding: UTF-8
LazyData: true
RoxygenNote: 5.0.1
RoxygenNote: 6.0.1
2 changes: 2 additions & 0 deletions NAMESPACE
Original file line number Diff line number Diff line change
Expand Up @@ -17,6 +17,8 @@ export(px_seq2list)
export(px_seqlist)
export(px_startpos1_col)
export(px_startpos2_col)
import(GenomicRanges)
import(InteractionSet)
useDynLib(Rpairix,build_index)
useDynLib(Rpairix,check_1d_vs_2d)
useDynLib(Rpairix,get_chr1_col)
Expand Down
116 changes: 91 additions & 25 deletions R/px_query.R
Original file line number Diff line number Diff line change
@@ -1,74 +1,140 @@
#' Query function on pairix-indexed pairs file.
#' Query pairix-indexed pairs file.
#'
#' This function allows you to query a 2D range in a pairix-indexed pairs file.
#' This function allows you to query a 2D range in a pairix-indexed pairs file using strings or GenomicRanges-related objects.
#'
#' @param filename a pairs file, or a bgzipped text file (sometextfile.gz) with an index file sometextfile.gz.px2 in the same folder.
#' @param querystr a character vector containing a list of pairs of genomic coordinates in "chr1:start1-end1|chr2:start2-end2" format. start-end can be omitted (e.g. "chr1:start1-end1|chr2" or "chr1|chr2")
#' @param query One of three types: (1) a character vector containing a set of pairs of genomic coordinates in 1-based "chr1:start1-end1|chr2:start2-end2" format. start-end can be omitted (e.g. "chr1:start1-end1|chr2" or "chr1|chr2"); (2) A GInteractions object from the package "InteractionSet"; (3) A GRangesList composed of two GRanges objects of identical length (first pairs, second pairs), from the package "GenomicRanges".
#' @param max_mem the total string length allowed for the result. If the size of the output exceeds this number, the function will return NULL and print out a memory error. Default 100,000,000.
#' @param stringsAsFactors the stringsAsFactors parameter for the data frame returned. Default False.
#' @param linecount.only If TRUE, the function returns an integer corresponding to the number of output lines instead of the actual query result. (default FALSE)
#' @param autoflip If TRUE, the function will rerun on a flipped query (mate1 and mate2 swapped) if the original query results in an empty output. (default FALSE). If linecount.only option is used in combination with autoflip, the result count is on the flipped query in case the query gets flipped.
#'
#' @return data frame containing the query result. Column names are added if indexing was done with a pairs preset.
#' @keywords pairix query 2D
#' @keywords pairix query 2D GenomicRanges GInteractions
#' @import InteractionSet GenomicRanges
#' @details This function is compatible with Bioconductor packages InteractionSet and GenomicRanges.
#' @export px_query
#' @examples
#'
#' ##### -- query using strings -- ######
#'
#' ## 2D-indexed file
#' filename = system.file(".","test_4dn.pairs.gz", package="Rpairix")
#' querystr = c("chr10|chr20","chr22|chr22")
#' res = px_query(filename, querystr)
#' query = c("chr10|chr20","chr22|chr22")
#' res = px_query(filename, query)
#' print(res)
#'
#' n = px_query(filename, querystr, linecount.only=TRUE)
#' n = px_query(filename, query, linecount.only=TRUE)
#' print(n)
#'
#' querystr = "chr20|chr10"
#' res = px_query(filename, querystr, autoflip=TRUE)
#' query = "chr20|chr10"
#' res = px_query(filename, query, autoflip=TRUE)
#' print(res)
#'
#' ## wild card query
#' querystr = "chr21|*"
#' res = px_query(filename, querystr, autoflip=TRUE)
#' query = "chr21|*"
#' res = px_query(filename, query, autoflip=TRUE)
#' print(res)
#'
#' querystr = "*|chr21"
#' res = px_query(filename, querystr, autoflip=TRUE)
#' query = "*|chr21"
#' res = px_query(filename, query, autoflip=TRUE)
#' print(res)
#'
#' ## the following attempts will return NULL and give you a warning message.
#' querystr = "chr20"
#' res = px_query(filename, querystr, autoflip=TRUE)
#' query = "chr20"
#' res = px_query(filename, query, autoflip=TRUE)
#' print(res)
#'
#' filename = system.file(".","merged_nodups.space.chrblock_sorted.subsample1.txt.gz", package="Rpairix")
#' querystr = "10:1-1000000|20"
#' res = px_query(filename, querystr)
#' query = "10:1-1000000|20"
#' res = px_query(filename, query)
#' print(res)
#'
#' ## the following attempts will return NULL and give you a warning message.
#' res = px_query(filename, querystr, autoflip=TRUE)
#' res = px_query(filename, query, autoflip=TRUE)
#' print(res)
#'
#' filename = system.file(".","merged_nodups.space.chrblock_sorted.subsample1.txt.gz",
#' package="Rpairix")
#' querystr = "10:1-1000000|20"
#' res = px_query(filename, querystr)
#' query = "10:1-1000000|20"
#' res = px_query(filename, query)
#' print(res)
#'
#' ## 1D-indexed file
#' filename = system.file(".","SRR1171591.variants.snp.vqsr.p.vcf.gz", package="Rpairix")
#' querystr = 'chr10|5000000-20000000'
#' res = px_query(filename, querystr)
#' query = 'chr10|5000000-20000000'
#' res = px_query(filename, query)
#' print(res)
#'
#' ## the following attempts will return NULL and give you a warning message.
#' querystr = 'chr10|chr20'
#' res = px_query(filename, querystr)
#' query = 'chr10|chr20'
#' res = px_query(filename, query)
#' print(res)
#'
#'
#' ##### -- query using GenomicRanges-related objects -- #####
#'
#' filename = system.file(".","test_4dn.pairs.gz", package="Rpairix")
#'
#' # create GRangesList
#' library(GenomicRanges)
#' gr <- GRanges(
#' seqnames = Rle(c("chr10", "chr20", "chr21", "chr22"), c(1, 2, 1, 2)),
#' ranges = IRanges((0:5*1000000)+1, end = (0:5*1000000)+13000000))
#' grl <- split(gr, rep(1:2,3))
#' grl
#'
#' # create GInteractions
#' library(InteractionSet)
#' gi <- GInteractions(grl[[1]],grl[[2]])
#' gi
#'
#' # query with GInteractions
#' res = px_query(filename,query=gi)
#' print(res)
#'
#' # query with GRangesList
#' res = px_query(filename,query=grl)
#' print(res)
#'
#' @useDynLib Rpairix get_size get_lines check_1d_vs_2d
px_query<-function(filename, querystr, max_mem=100000000, stringsAsFactors=FALSE, linecount.only=FALSE, autoflip=FALSE){
px_query<-function(filename, query, max_mem=100000000, stringsAsFactors=FALSE, linecount.only=FALSE, autoflip=FALSE){

# -- helper function -- #
df_to_querystr <- function(qdf){
# expects presorted df (columns ordered: seqnames1,start1,end1,seqnames2,start2,end2)
querystr <- paste0(qdf[,1],":",qdf[,2],"-",qdf[,3],"|",qdf[,4],":",qdf[,5],"-",qdf[,6])
return(querystr)
}

# -- produce querystr -- #
if(class(query)=="character"){
querystr <- query
} else if (class(query)=="GInteractions"){
# -- produce querystr from GInteractions obj -- #
# convert
qdf <- as.data.frame(query)
qdf <- qdf[,c("seqnames1","start1","end1","seqnames2","start2","end2")]
querystr <- df_to_querystr(qdf)
rm(qdf)
} else if (class(query)=="GRangesList"){
# -- produce querystr from two identical-length, paired GRanges objects in a GRangesList-- #
# test lengths
if(length(query) != 2) stop("GRangesList must be composed of two GRanges objects.")
if(diff(sapply(query,length)) != 0) {
stop("Paired GRanges objects in the GRangesList must be of identical length.")
}
# convert
grldf <- lapply(query,as.data.frame)
names(grldf[[2]]) <- sub("$","2",names(grldf[[2]]))
grldf <- cbind(grldf[[1]],grldf[[2]])
grldf <- grldf[,c("seqnames","start","end","seqnames2","start2","end2")]
querystr <- df_to_querystr(grldf)
rm(grldf)
} else {
stop("query must be of class 'character', 'GInteractions', or 'GRangesList'.")
}
rm(query)

# sanity check for 2D query on 1D index.
ind_dim = .C("check_1d_vs_2d", filename, as.integer(0))
Expand Down
39 changes: 33 additions & 6 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -61,8 +61,8 @@ install_url("https://github.com/4dn-dcic/Rpairix/archive/0.1.6.zip")
```r
library(Rpairix)
px_build_index(filename,preset) # indexing
px_query(filename,querystr) # querying
px_query(filename,querystr,linecount.only=TRUE) # number of output lines for the query
px_query(filename,query) # querying using a string or GenomicRanges-related objects.
px_query(filename,query,linecount.only=TRUE) # number of output lines for the query
px_keylist(filename) # list of keys (chromosome pairs)
px_seqlist(filename) # list of chromosomes
px_seq1list(filename) # list of first chromosomes
Expand Down Expand Up @@ -114,6 +114,33 @@ data frame with 0 columns and 0 rows
> px_query("inst/test_4dn.pairs.gz", multi_querystr, linecount.only=TRUE)
[1] 104
>
> # query using GRangesList object
> library(GenomicRanges)
> gr <- GRanges(
> seqnames = Rle(c("chr10", "chr20", "chr21", "chr22"), c(1, 2, 1, 2)),
> ranges = IRanges((0:5*1000000)+1, end = (0:5*1000000)+13000000))
> grl <- split(gr, rep(1:2,3))
> px_query("test_4dn.pairs.gz",query=grl)
readID chr1 pos1 chr2 pos2 strand1 strand2
1 SRR1658581.33457260 chr10 2559777 chr20 7888262 - +
2 SRR1658581.15714901 chr10 4579507 chr20 10941340 + +
3 SRR1658581.39908038 chr22 16224023 chr22 16224245 + -
4 SRR1658581.18052095 chr22 16389136 chr22 16687983 - -
5 SRR1658581.52271223 chr22 16645528 chr22 17454018 + +
6 SRR1658581.22023475 chr22 16927100 chr22 17255207 - -
>
> # query using GInteractions object
> library(InteractionSet)
> gi <- GInteractions(grl[[1]],grl[[2]])
> px_query("test_4dn.pairs.gz",query=gi)
readID chr1 pos1 chr2 pos2 strand1 strand2
1 SRR1658581.33457260 chr10 2559777 chr20 7888262 - +
2 SRR1658581.15714901 chr10 4579507 chr20 10941340 + +
3 SRR1658581.39908038 chr22 16224023 chr22 16224245 + -
4 SRR1658581.18052095 chr22 16389136 chr22 16687983 - -
5 SRR1658581.52271223 chr22 16645528 chr22 17454018 + +
6 SRR1658581.22023475 chr22 16927100 chr22 17255207 - -
>
> # getting list of chromosome pairs and chromosomes
> keys = px_keylist(filename)
> length(keys)
Expand Down Expand Up @@ -186,10 +213,10 @@ sc2 second sequence (chromosome) column index (1-based). Zero (0) means not spec

### Querying
```
px_query(filename,querystr,max_mem=100000000,stringsAsFactors=FALSE,linecount.only=FALSE, autoflip=FALSE)
px_query(filename,query,max_mem=100000000,stringsAsFactors=FALSE,linecount.only=FALSE, autoflip=FALSE)
```
* `filename` is sometextfile.gz and an index file sometextfile.gz.px2 must exist.
* `querystr` (query string) is in the same format as the format for pairix. (e.g. '1:1-10000000|20:50000000-60000000'). It can be a vector of query strings.
* `filename` is sometextfile.gz, and an index file sometextfile.gz.px2 must exist.
* `query` is one of three types: (1) a character vector containing a set of pairs of genomic coordinates in 1-based "chr1:start1-end1|chr2:start2-end2" format. start-end can be omitted (e.g. "chr1:start1-end1|chr2" or "chr1|chr2"); (2) A GInteractions object from the package "InteractionSet"; (3) A GRangesList composed of two GRanges objects of identical length (first pairs, second pairs), from the package "GenomicRanges".
* `max_mem` is the maximum total length of the result strings (sum of string lengths).
* The return value is a data frame, each row corresponding to the line in the input file within the query range.
* If `linecount.only` is TRUE, the function returns only the number of output lines for the query.
Expand Down Expand Up @@ -287,7 +314,7 @@ Individual R functions are written and documented in `R/`. The `src/rpairixlib.c
* `px_query` : input query can be a GInteractions object or a list of GRanges objects.

### 0.1.5
* `px_query` : wild card (*) in a query now allowed (queries like 'chr11|*' or '*|chr2:1-20000' possible. '*' means whole genome.
* `px_query` : wild card (\*) in a query now allowed (queries like 'chr11|\*' or '\*|chr2:1-20000' possible. '\*' means whole genome.
* `px_exists` and `px_exists2` now returns TRUE/FALSE instead of 1/0.
* Function `px_colnames` is added (identical to `px_get_column_names`)
* Function `px_check_dim` is now renamed to `px_check_1d_vs_2d`.
Expand Down
1 change: 0 additions & 1 deletion man/px_build_index.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

1 change: 0 additions & 1 deletion man/px_check_1d_vs_2d.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

1 change: 0 additions & 1 deletion man/px_chr1_col.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

1 change: 0 additions & 1 deletion man/px_chr2_col.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

1 change: 0 additions & 1 deletion man/px_colnames.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

1 change: 0 additions & 1 deletion man/px_endpos1_col.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

1 change: 0 additions & 1 deletion man/px_endpos2_col.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

1 change: 0 additions & 1 deletion man/px_exists.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

1 change: 0 additions & 1 deletion man/px_exists2.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

1 change: 0 additions & 1 deletion man/px_get_column_names.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

1 change: 0 additions & 1 deletion man/px_keylist.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

Loading

0 comments on commit df3f0cc

Please sign in to comment.