From 355bfc14eabe2e2fb0498cfd67fa437afe09c0d1 Mon Sep 17 00:00:00 2001 From: DominikRafacz Date: Thu, 26 Sep 2024 17:25:05 +0200 Subject: [PATCH 1/3] fix typos --- NEWS.md | 4 +- vignettes/quick-start.Rmd | 300 +++++++++++++++++++------------------- 2 files changed, 152 insertions(+), 152 deletions(-) diff --git a/NEWS.md b/NEWS.md index 13af3c1..237b5cb 100644 --- a/NEWS.md +++ b/NEWS.md @@ -9,8 +9,8 @@ * `write_fasta()` and `find_motifs()` accept `data.frame` arguments now; sequences and their names are taken from specified two columns * more descriptive error messages for non-existing generics that print out classes of the first parameter -## Fixed-ish: -* return to autoexported `Rcpp` catch declaration +## Fixed: +* return to automatically exported `Rcpp` catch declaration ## Quality of code stuff: * added tests and adjusted vignettes for the changes diff --git a/vignettes/quick-start.Rmd b/vignettes/quick-start.Rmd index e5a768c..6225947 100644 --- a/vignettes/quick-start.Rmd +++ b/vignettes/quick-start.Rmd @@ -1,150 +1,150 @@ ---- -title: "Quick Start" -output: rmarkdown::html_vignette -vignette: > - %\VignetteIndexEntry{Quick Start} - %\VignetteEngine{knitr::rmarkdown} - %\VignetteEncoding{UTF-8} ---- - -```{r, include = FALSE} -knitr::opts_chunk$set( - collapse = TRUE, - comment = "#>" -) -``` - -`tidysq` package is meant to store and conduct operations on biological sequences. This vignette provides a guide to basic usage of `tidysq`, i.e. reading, manipulating and writing sequences to file. - -The most recent version of `tidysq` can be installed with `install_github()` function from `devtools`. - -```{r setup} -# devtools::install_github("BioGenies/tidysq") -library(tidysq) -``` - -## Sequence creation - -Biological sequences can be and often are represented as strings -- sequences of letters. For example, a DNA sequence can take the form of `"TAGGCCCTAGACCTG"`, where `A` means adenine, `C` -- cytosine, `G` -- guanine and `T` -- thymine. 
Exact IUPAC recommendations for one-letter codes can be found [here](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC341218/). - -Within `tidysq` package sequence data is stored in `sq` objects, that is, vectors of biological sequences. They can be created from string vectors as above: - -```{r sq_from_string} -sq_dna <- sq(c("TAGGCCCTAGACCTG", "TAGGCCCTGGGCATG")) -sq_dna -``` - -There are several thing to note. First, each sequence is an element of `sq` object. Many operations are vectorized --- they are applied to all sequences of a vector --- and `sq` objects are no different in this regard. Second, the first line of output says: `basic DNA sequences list`. This means that all sequences of this object are of DNA type and do not use ambiguous letters (more about that in "Advanced alphabet techniques" vignette). - -## Subsetting sequences - -Manipulating sequence objects is an integral part of `tidysq`. `sq` objects can be easily subsetted using usual R syntax: - -```{r sq_subset} -sq_dna[1] -``` - -Extracting subsequences is a bit more complicated than that --- because it uses designated function `bite()`. Its syntax, however, closely resembles that of base R --- indexing starts with one and negative indices are interpreted as "anything except that". It returns an `sq` object with all sequences subsetted: - -```{r sq_bite} -bite(sq_dna, 5:10) -bite(sq_dna, c(-9, -11, -13)) -``` - -It's possible to reverse sequences using this function: - -```{r sq_bite_reversing} -# Don't do it like that! -bite(sq_dna, 15:1) -``` - -However, this usage is strongly discouraged, because it's both ineffective and works badly with sequences of different lengths. 
Instead, there is a designated function `reverse()`: - -```{r sq_reverse} -reverse(sq_dna) -``` - -Note that it is very different to base `rev()`, which reverses only the order of sequences, not letters: - -```{r sq_rev} -rev(sq_dna) -``` - -We can combine two or more `sq` objects using base `c()` function: - -```{r sq_c} -sq_dna <- c(sq_dna, reverse(sq_dna)) -sq_dna -``` - -## Biological interpretation - -`tidysq` offers two functions specific to DNA/RNA sequences, namely `complement()` and `translate()`. The former creates sequences with complementary bases, that is, replaces `A` with `T`, `C` with `G` and *vice versa*. The latter translates input to amino acid sequences using [the translation table with three-letter codons](https://en.wikipedia.org/wiki/DNA_and_RNA_codon_tables). - -These functions can be called as shown below: - -```{r sq_complement_translate} -complement(sq_dna) -translate(sq_dna) -``` - -One noteworthy feature here is that translation can be done with any genetic code table of those listed [on this Wikipedia page](https://en.wikipedia.org/wiki/List_of_genetic_codes): - -```{r sq_translate_other_table} -translate(sq_dna, table = 6) -``` - -## Finding motifs - -Motifs are short subsequences. These are often searched for in biological sequences. `tidysq` has two distinct functions that allow the user to perform such search. - -One of them is a `%has%` operator that takes `sq` object and character vector as parameters respectively. It returns a logical vector of the same length as `sq` object, where each element says whether all motifs passed as strings were found in given sequence: - -```{r sq_has} -sq_dna %has% "ATC" -# It can be used to subset sq -sq_dna[sq_dna %has% c("AG", "CC")] -``` - -It says nothing about motif placement within sequence nor it exact form, however. 
In this case, there is `find_motifs()` function that returns a whole `tibble` (from `tibble` package; basically improved version of `data.frame`) with various info about found motifs. Important thing to note here is that the second argument is a character vector of sequence names to avoid embedding potentially long sequences in resulting `tibble` potentially many times: - -```{r sq_find_motifs} -find_motifs(sq_dna, c("seq1", "seq2", "rev1", "rev2"), c("ATC", "TAG")) -``` - -You can also provide this function with a `data.frame` (or, what we recommend, `tibble`) containing one column called `sq`, containing the sequences and the other colum `name` containing the names. - -```{r sqibble_find_motifs} -sqibble <- tibble::tibble(sq = sq_dna, - name = c("seq1", "seq2", "rev1", "rev2")) - -# does the same as the call from previous chunk of code -find_motifs(sqibble, c("ATC", "TAG")) -``` - -There are ambiguous DNA bases in IUPAC codes and these can be used in motifs. One of them is `"N"` --- its meaning is "any of `A`, `C`, `G` or `T`: - -```{r sq_find_motifs_amb} -find_motifs(sqibble, "GNCC") -``` - -This example displays the difference between `"sought"` and `"found"` columns. The former contains the string representation of motif that the user was looking for, while the latter contains a `tidysq`-encoded sequence with an "instance" of motif. - -Two additional characters are reserved because of their special meaning in motifs. `"^"` means that this motif must be found at the start of a sequence, while `"$"` means the same, but with the end instead. They can be mixed with ambiguous letters, of course: - -```{r sq_find_motifs_start_end} -find_motifs(sqibble, c("^TAG", "ATN$")) -``` - -## Exporting sq objects - -After doing computations the user might wish to save their sequences for future use. One of the most popular formats for storing biological sequences is FASTA. `tidysq` allows the user to write sequences to FASTA file with `write_fasta()` function. 
Important thing to remember here that the arguments for the function are analogous to those used in `find_motifs()` -- either `sq` object and a vector of names or a `tibble` with columns of sequences and names: - -```{r write_fasta, eval=FALSE} -write_fasta(sq_dna, - c("seq1", "seq2", "rev1", "rev2"), - "just_your_ordinary_fasta_file.fasta") -# or -write_fasta(sqibble, - "just_your_ordinary_fasta_file.fasta") -``` +--- +title: "Quick Start" +output: rmarkdown::html_vignette +vignette: > + %\VignetteIndexEntry{Quick Start} + %\VignetteEngine{knitr::rmarkdown} + %\VignetteEncoding{UTF-8} +--- + +```{r, include = FALSE} +knitr::opts_chunk$set( + collapse = TRUE, + comment = "#>" +) +``` + +`tidysq` package is meant to store and conduct operations on biological sequences. This vignette provides a guide to basic usage of `tidysq`, i.e. reading, manipulating and writing sequences to file. + +The most recent version of `tidysq` can be installed with `install_github()` function from `devtools`. + +```{r setup} +# devtools::install_github("BioGenies/tidysq") +library(tidysq) +``` + +## Sequence creation + +Biological sequences can be and often are represented as strings -- sequences of letters. For example, a DNA sequence can take the form of `"TAGGCCCTAGACCTG"`, where `A` means adenine, `C` -- cytosine, `G` -- guanine and `T` -- thymine. Exact IUPAC recommendations for one-letter codes can be found [here](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC341218/). + +Within `tidysq` package sequence data is stored in `sq` objects, that is, vectors of biological sequences. They can be created from string vectors as above: + +```{r sq_from_string} +sq_dna <- sq(c("TAGGCCCTAGACCTG", "TAGGCCCTGGGCATG")) +sq_dna +``` + +There are several thing to note. First, each sequence is an element of `sq` object. Many operations are vectorized --- they are applied to all sequences of a vector --- and `sq` objects are no different in this regard. 
Second, the first line of output says: `basic DNA sequences list`. This means that all sequences of this object are of DNA type and do not use ambiguous letters (more about that in "Advanced alphabet techniques" vignette). + +## Subsetting sequences + +Manipulating sequence objects is an integral part of `tidysq`. `sq` objects can be easily subsetted using usual R syntax: + +```{r sq_subset} +sq_dna[1] +``` + +Extracting subsequences is a bit more complicated than that --- because it uses designated function `bite()`. Its syntax, however, closely resembles that of base R --- indexing starts with one and negative indices are interpreted as "anything except that". It returns an `sq` object with all sequences subsetted: + +```{r sq_bite} +bite(sq_dna, 5:10) +bite(sq_dna, c(-9, -11, -13)) +``` + +It's possible to reverse sequences using this function: + +```{r sq_bite_reversing} +# Don't do it like that! +bite(sq_dna, 15:1) +``` + +However, this usage is strongly discouraged, because it's both ineffective and works badly with sequences of different lengths. Instead, there is a designated function `reverse()`: + +```{r sq_reverse} +reverse(sq_dna) +``` + +Note that it is very different to base `rev()`, which reverses only the order of sequences, not letters: + +```{r sq_rev} +rev(sq_dna) +``` + +We can combine two or more `sq` objects using base `c()` function: + +```{r sq_c} +sq_dna <- c(sq_dna, reverse(sq_dna)) +sq_dna +``` + +## Biological interpretation + +`tidysq` offers two functions specific to DNA/RNA sequences, namely `complement()` and `translate()`. The former creates sequences with complementary bases, that is, replaces `A` with `T`, `C` with `G` and *vice versa*. The latter translates input to amino acid sequences using [the translation table with three-letter codons](https://en.wikipedia.org/wiki/DNA_and_RNA_codon_tables). 
+ +These functions can be called as shown below: + +```{r sq_complement_translate} +complement(sq_dna) +translate(sq_dna) +``` + +One noteworthy feature here is that translation can be done with any genetic code table of those listed [on this Wikipedia page](https://en.wikipedia.org/wiki/List_of_genetic_codes): + +```{r sq_translate_other_table} +translate(sq_dna, table = 6) +``` + +## Finding motifs + +Motifs are short subsequences. These are often searched for in biological sequences. `tidysq` has two distinct functions that allow the user to perform such search. + +One of them is a `%has%` operator that takes `sq` object and character vector as parameters respectively. It returns a logical vector of the same length as `sq` object, where each element says whether all motifs passed as strings were found in given sequence: + +```{r sq_has} +sq_dna %has% "ATC" +# It can be used to subset sq +sq_dna[sq_dna %has% c("AG", "CC")] +``` + +It says nothing about motif placement within sequence nor it exact form, however. In this case, there is `find_motifs()` function that returns a whole `tibble` (from `tibble` package; basically improved version of `data.frame`) with various info about found motifs. Important thing to note here is that the second argument is a character vector of sequence names to avoid embedding potentially long sequences in resulting `tibble` potentially many times: + +```{r sq_find_motifs} +find_motifs(sq_dna, c("seq1", "seq2", "rev1", "rev2"), c("ATC", "TAG")) +``` + +You can also provide this function with a `data.frame` (or, what we recommend, `tibble`) containing one column called `sq`, containing the sequences and the other column `name` containing the names. 
+ +```{r sqibble_find_motifs} +sqibble <- tibble::tibble(sq = sq_dna, + name = c("seq1", "seq2", "rev1", "rev2")) + +# does the same as the call from previous chunk of code +find_motifs(sqibble, c("ATC", "TAG")) +``` + +There are ambiguous DNA bases in IUPAC codes and these can be used in motifs. One of them is `"N"` --- its meaning is "any of `A`, `C`, `G` or `T`: + +```{r sq_find_motifs_amb} +find_motifs(sqibble, "GNCC") +``` + +This example displays the difference between `"sought"` and `"found"` columns. The former contains the string representation of motif that the user was looking for, while the latter contains a `tidysq`-encoded sequence with an "instance" of motif. + +Two additional characters are reserved because of their special meaning in motifs. `"^"` means that this motif must be found at the start of a sequence, while `"$"` means the same, but with the end instead. They can be mixed with ambiguous letters, of course: + +```{r sq_find_motifs_start_end} +find_motifs(sqibble, c("^TAG", "ATN$")) +``` + +## Exporting sq objects + +After doing computations the user might wish to save their sequences for future use. One of the most popular formats for storing biological sequences is FASTA. `tidysq` allows the user to write sequences to FASTA file with `write_fasta()` function. 
Important thing to remember here that the arguments for the function are analogous to those used in `find_motifs()` -- either `sq` object and a vector of names or a `tibble` with columns of sequences and names: + +```{r write_fasta, eval=FALSE} +write_fasta(sq_dna, + c("seq1", "seq2", "rev1", "rev2"), + "just_your_ordinary_fasta_file.fasta") +# or +write_fasta(sqibble, + "just_your_ordinary_fasta_file.fasta") +``` From e2fac88f61e857b8b06f24bbd370de2d0ed7853b Mon Sep 17 00:00:00 2001 From: DominikRafacz Date: Thu, 26 Sep 2024 17:34:12 +0200 Subject: [PATCH 2/3] add cran comments --- cran-comments.md | 9 ++------- 1 file changed, 2 insertions(+), 7 deletions(-) diff --git a/cran-comments.md b/cran-comments.md index 4363b7f..72f33d2 100644 --- a/cran-comments.md +++ b/cran-comments.md @@ -1,11 +1,6 @@ -## Test environments -* local R installation, R 4.1.0 -* ubuntu 16.04 (on travis-ci), R 4.1.0 -* win-builder (devel) - ## R CMD check results -0 errors | 0 warnings | 1 note +0 errors | 0 warnings | 0 notes * This is a resubmission. 
-* Fixed the problem with deprecated usage of iterator +* Fixed issues related to new implementations of set operations on R-devel From d8a368ddf8f2bf173893c2b02aa4a1f8aad760ea Mon Sep 17 00:00:00 2001 From: DominikRafacz Date: Sun, 29 Sep 2024 17:28:42 +0200 Subject: [PATCH 3/3] fix unavailable URLs --- README.Rmd | 1 - README.md | 13 ++++++------- docs/index.html | 11 +++++------ vignettes/quick-start.Rmd | 2 +- 4 files changed, 12 insertions(+), 15 deletions(-) diff --git a/README.Rmd b/README.Rmd index 36fdb08..c013d84 100644 --- a/README.Rmd +++ b/README.Rmd @@ -17,7 +17,6 @@ knitr::opts_chunk$set( [![CRAN_Status_Badge](http://www.r-pkg.org/badges/version/tidysq)](https://cran.r-project.org/package=tidysq) [![Github Actions Build Status](https://github.com/BioGenies/tidysq/workflows/R-CMD-check-bioc/badge.svg)](https://github.com/BioGenies/tidysq/actions) - [![codecov.io](https://codecov.io/github/BioGenies/tidysq/coverage.svg?branch=master)](https://codecov.io/github/BioGenies/tidysq?branch=master) [![Lifecycle: experimental](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://lifecycle.r-lib.org/articles/stages.html#experimental) diff --git a/README.md b/README.md index e2aa744..dc733f7 100644 --- a/README.md +++ b/README.md @@ -6,7 +6,6 @@ [![CRAN_Status_Badge](http://www.r-pkg.org/badges/version/tidysq)](https://cran.r-project.org/package=tidysq) [![Github Actions Build Status](https://github.com/BioGenies/tidysq/workflows/R-CMD-check-bioc/badge.svg)](https://github.com/BioGenies/tidysq/actions) -[![codecov.io](https://codecov.io/github/BioGenies/tidysq/coverage.svg?branch=master)](https://codecov.io/github/BioGenies/tidysq?branch=master) [![Lifecycle: experimental](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://lifecycle.r-lib.org/articles/stages.html#experimental) @@ -17,11 +16,11 @@ experimental](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](h sequences (including amino acid and nucleic 
acid – e.g. RNA, DNA – sequences). Two major features of this package are: -- effective compression of sequence data, allowing to fit larger - datasets in **R**, +- effective compression of sequence data, allowing to fit larger + datasets in **R**, -- compatibility with most of `tidyverse` universe, especially `dplyr` - and `vctrs` packages, making analyses *tidier*. +- compatibility with most of `tidyverse` universe, especially `dplyr` + and `vctrs` packages, making analyses *tidier*. ## Getting started @@ -70,7 +69,7 @@ sqibble #> 8 VHPQKLVFF <15> AMY24|HABP2|Amyloid beta A4 peptide #> 9 VHHPKLVFF <15> AMY25|HABP3|Amyloid beta A4 peptide #> 10 VHHQPLVFF <15> AMY26|HABP4|Amyloid beta A4 peptide -#> # … with 411 more rows +#> # ℹ 411 more rows sq_ami <- sqibble$sq sq_ami @@ -156,7 +155,7 @@ sqibble %>% #> 8 VHHQEKLVF <16> AMY35|HABP13|Amyloid beta A4 peptide 16 #> 9 VHHQEKLVF <16> AMY36|HABP14|Amyloid beta A4 peptide 16 #> 10 KKLVFFAED  <9> AMY37|HABP15|Amyloid beta A4 peptide 9 -#> # … with 14 more rows +#> # ℹ 14 more rows ``` ## Citation diff --git a/docs/index.html b/docs/index.html index b531d47..f808ade 100644 --- a/docs/index.html +++ b/docs/index.html @@ -47,7 +47,7 @@