---
output: github_document
---
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.7040475.svg)](https://doi.org/10.5281/zenodo.7040475)
[![License: GPL v3](http://img.shields.io/badge/License-GPLv3-blue.svg)](https://www.gnu.org/licenses/gpl-3.0)
[![CRAN\_Status\_Badge](http://www.r-pkg.org/badges/version/RcppCWB)](https://cran.r-project.org/package=RcppCWB)
[![R build status](https://github.com/PolMine/RcppCWB/workflows/R-CMD-check/badge.svg)](https://github.com/PolMine/RcppCWB/actions)
[![codecov](https://codecov.io/gh/PolMine/RcppCWB/branch/master/graph/badge.svg)](https://app.codecov.io/gh/PolMine/RcppCWB)
# Rcpp bindings for the Corpus Workbench (CWB)
The package exposes functions of the Corpus Workbench (CWB) by way of Rcpp wrappers. Furthermore, the package includes Rcpp code for performance-critical operations. The main purpose of the package is to serve as an interface to the CWB for the package [polmineR](https://CRAN.R-project.org/package=polmineR).
There is a huge intellectual debt to the developers of the R package 'rcqp', Bernard Desgraupes and Sylvain Loiseau. The main impetus for developing RcppCWB is that using Rcpp makes it easier to maintain the package, to expand the CWB functionality exposed, and, most importantly, to make it portable to Windows systems.
## Installation on Windows
Pre-compiled 'RcppCWB' binaries can be installed from CRAN.
```{r install_RcppCWB_cran, eval = FALSE}
install.packages("RcppCWB")
```
If you want the development version, you need to compile RcppCWB yourself. This requires [Rtools](https://cran.r-project.org/bin/windows/Rtools/) to be installed on your system. Using the mechanism offered by the devtools package, you can then install RcppCWB from GitHub.
```{r install_RcppCWB_github, eval = FALSE}
if (!"devtools" %in% installed.packages()[,"Package"]) install.packages("devtools")
devtools::install_github("PolMine/RcppCWB")
```
During the installation, cross-compiled versions of the corpus library (CL) are downloaded from the GitHub repository [PolMine/libcl](https://github.com/PolMine/libcl). The libcl repository also includes a reproducible workflow using Docker to build static libraries from the CWB source code.
## Installation on Ubuntu
The package includes the source code of the Corpus Workbench (CWB), slightly modified to make it compatible with R requirements. Compiling the CWB requires the pcre2 and glib libraries to be present. On Ubuntu/Debian, running the following APT command from the shell will fulfill these dependencies.
```{sh ubuntu_install_dependencies, eval = FALSE}
sudo apt-get install libpcre2-dev libglib2.0-dev
```
Then, use the conventional R installation mechanism to install R dependencies, and the release of RcppCWB at CRAN.
```{r ubuntu_install_RcppCWB, eval = FALSE}
install.packages(pkgs = c("Rcpp", "knitr", "testthat"))
install.packages("RcppCWB")
```
To install the development version, using the mechanism offered by the devtools package is recommended.
```{r ubuntu_install_RcppCWB_github, eval = FALSE}
if (!"devtools" %in% installed.packages()[,"Package"]) install.packages("devtools")
devtools::install_github("PolMine/RcppCWB", ref = "dev")
```
## Installation on macOS
On macOS, the [pcre2](http://www.pcre.org/) and [GLib](https://docs.gtk.org/glib) libraries need to be present. We recommend using 'Homebrew' as a package manager for macOS. To install Homebrew, follow the instructions on the [Homebrew Website](https://brew.sh/index_de.html). It may also be necessary to install [Xcode](https://developer.apple.com/xcode/) and [XQuartz](https://www.xquartz.org).
The following commands then need to be executed from a terminal window. They will install the C libraries the CWB relies on:
```{sh install_dependencies_macos, eval = FALSE}
brew -v install pkg-config
brew -v install glib --universal
brew -v install pcre2 --universal
brew -v install readline
```
Then open R and use the conventional R installation mechanism to install dependencies, and the release of RcppCWB at CRAN.
```{r install_RcppCWB_macOS, eval = FALSE}
install.packages(pkgs = c("Rcpp", "knitr", "testthat"))
install.packages("RcppCWB")
```
To install the development version, using the mechanism offered by the devtools package is recommended.
```{r install_RcppCWB_macOS_github, eval = FALSE}
if (!"devtools" %in% installed.packages()[,"Package"]) install.packages("devtools")
devtools::install_github("PolMine/RcppCWB")
```
## Usage
The package offers low-level access to CWB-indexed corpora. Using RcppCWB may not be intuitive at the outset: It is designed to serve as an efficient backend for packages offering higher-level functionality, such as polmineR.
RcppCWB includes a small sample corpus called 'REUTERS'. After loading the package, we need to determine whether we can use the registry describing the corpus within the package, or whether we need to work with a temporary registry.
```{r initialize_RcppCWB}
library(RcppCWB)
registry <- use_tmp_registry()
```
To start with, we get the number of tokens of the corpus.
```{r total_no_tokens}
cpos_total <- cl_attribute_size(
  corpus = "REUTERS", attribute = "word",
  attribute_type = "p", registry = registry
)
cpos_total
```
Next, we decode the token stream of the corpus. Corpus positions are zero-based, so they run from 0 to the corpus size minus 1.
```{r decode_token_stream}
token_stream_str <- cl_cpos2str(
  corpus = "REUTERS", p_attribute = "word",
  cpos = seq.int(from = 0, to = cpos_total - 1),
  registry = registry
)
```
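As a quick sanity check on the decoded token stream, the following sketch compares its length with the corpus size and pastes the first tokens together as running text. The cut-off of 20 tokens and the variable names `corpus_size` and `first_tokens` are arbitrary choices made for illustration.

```{r preview_token_stream}
library(RcppCWB)
registry <- use_tmp_registry()

corpus_size <- cl_attribute_size(
  corpus = "REUTERS", attribute = "word",
  attribute_type = "p", registry = registry
)
token_stream_str <- cl_cpos2str(
  corpus = "REUTERS", p_attribute = "word",
  cpos = seq.int(from = 0, to = corpus_size - 1),
  registry = registry
)

# There should be exactly one decoded string per corpus position
length(token_stream_str) == corpus_size

# First 20 tokens (arbitrary cut-off) as running text
first_tokens <- paste(head(token_stream_str, 20), collapse = " ")
first_tokens
```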
To get the corpus positions of a token, we first look up its id and then retrieve the positions for that id.
```{r get_corpus_positions}
token_to_get <- "oil"
id_oil <- cl_str2id(corpus = "REUTERS", p_attribute = "word", str = token_to_get, registry = registry)
cpos_oil <- cl_id2cpos(corpus = "REUTERS", p_attribute = "word", id = id_oil, registry = registry)
```
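Corpus positions become more useful once they are decoded together with their neighbours. The following is a minimal keyword-in-context sketch built only from the functions used above. The context size `window` and the limit of five hits are assumptions made for illustration, not package defaults.

```{r kwic_sketch}
library(RcppCWB)
registry <- use_tmp_registry()

# Look up the corpus positions of "oil"
id_oil <- cl_str2id(corpus = "REUTERS", p_attribute = "word", str = "oil", registry = registry)
cpos_oil <- cl_id2cpos(corpus = "REUTERS", p_attribute = "word", id = id_oil, registry = registry)

# Corpus positions are zero-based, so the last valid position is size - 1
cpos_max <- cl_attribute_size(
  corpus = "REUTERS", attribute = "word",
  attribute_type = "p", registry = registry
) - 1L

window <- 3L # hypothetical context size (tokens to the left and right)

# Decode a window around each of the first five hits, clipped to the corpus bounds
kwic <- vapply(
  head(cpos_oil, 5),
  function(cpos) {
    span <- seq.int(from = max(0L, cpos - window), to = min(cpos_max, cpos + window))
    tokens <- cl_cpos2str(
      corpus = "REUTERS", p_attribute = "word",
      cpos = span, registry = registry
    )
    paste(tokens, collapse = " ")
  },
  character(1)
)
kwic
```

Higher-level packages such as polmineR provide a more complete concordance implementation. This sketch only shows how the low-level calls combine.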
To get the frequency of a token, we pass its id to `cl_id2freq()`.
```{r get_token_frequency}
oil_freq <- cl_id2freq(corpus = "REUTERS", p_attribute = "word", id = id_oil, registry = registry)
```
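The absolute frequency can be cross-checked against the number of corpus positions and turned into a relative frequency. This is a small sketch under the assumption that a rate per 1,000 tokens is the desired unit. The name `oil_per_thousand` is made up for this example.

```{r relative_frequency}
library(RcppCWB)
registry <- use_tmp_registry()

id_oil <- cl_str2id(corpus = "REUTERS", p_attribute = "word", str = "oil", registry = registry)
oil_freq <- cl_id2freq(corpus = "REUTERS", p_attribute = "word", id = id_oil, registry = registry)
cpos_oil <- cl_id2cpos(corpus = "REUTERS", p_attribute = "word", id = id_oil, registry = registry)
corpus_size <- cl_attribute_size(
  corpus = "REUTERS", attribute = "word",
  attribute_type = "p", registry = registry
)

# The frequency of a token should equal the number of its corpus positions
length(cpos_oil) == oil_freq

# Relative frequency, expressed per 1,000 tokens (unit chosen for illustration)
oil_per_thousand <- oil_freq / corpus_size * 1000
oil_per_thousand
```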
Regular expressions can be used to look up the ids of all matching tokens, which can then be decoded to strings.
```{r regex}
ids <- cl_regex2id(corpus = "REUTERS", p_attribute = "word", regex = "M.*", registry = registry)
m_words <- cl_id2str(corpus = "REUTERS", p_attribute = "word", id = ids, registry = registry)
```
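Because `cl_id2str()` and `cl_id2freq()` both accept a vector of ids, the regex result can be turned into a small frequency table. The following sketch assumes a plain `data.frame` is sufficient. The name `m_freq` and its column names are chosen for this example only.

```{r regex_frequencies}
library(RcppCWB)
registry <- use_tmp_registry()

ids <- cl_regex2id(corpus = "REUTERS", p_attribute = "word", regex = "M.*", registry = registry)

# One row per matching lexicon entry: the string and its corpus frequency
m_freq <- data.frame(
  word = cl_id2str(corpus = "REUTERS", p_attribute = "word", id = ids, registry = registry),
  count = cl_id2freq(corpus = "REUTERS", p_attribute = "word", id = ids, registry = registry),
  stringsAsFactors = FALSE
)

# Most frequent matches first
m_freq <- m_freq[order(m_freq$count, decreasing = TRUE), ]
head(m_freq)
```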
To use the CQP syntax, we need to initialize CQP first.
```{r cqp}
cqp_initialize(registry = registry)
cqp_query(corpus = "REUTERS", query = '"crude" "oil"')
cpos <- cqp_dump_subcorpus(corpus = "REUTERS")
cpos
```
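The dumped subcorpus holds one row per match, with the start and end corpus position of each match in the two columns. As a sketch of how these regions might be turned back into text, each row can be expanded to a sequence of positions and decoded with `cl_cpos2str()`. The variable name `matches` is an assumption made for this example.

```{r cqp_decode_matches}
library(RcppCWB)
registry <- use_tmp_registry()

cqp_initialize(registry = registry)
cqp_query(corpus = "REUTERS", query = '"crude" "oil"')
cpos <- cqp_dump_subcorpus(corpus = "REUTERS")

# Expand each region (start, end) to its corpus positions and decode the tokens
matches <- apply(
  cpos, 1,
  function(region) {
    tokens <- cl_cpos2str(
      corpus = "REUTERS", p_attribute = "word",
      cpos = seq.int(from = region[1], to = region[2]),
      registry = registry
    )
    paste(tokens, collapse = " ")
  }
)
table(matches)
```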
## License
The package is licensed under the [GNU General Public License 3](https://www.gnu.org/licenses/gpl-3.0.de.html). For the copyrights for the 'Corpus Workbench' (CWB) and acknowledgement of authorship, see the file COPYRIGHTS.
## Acknowledgements
There is a huge intellectual debt to the developers of the R-package 'rcqp', Bernard Desgraupes and Sylvain Loiseau. Developing RcppCWB would have been unthinkable without their original work to wrap the CWB into an R package.
The CWB is a classic and mature tool: The work of the CWB developers, Oliver Christ, Bruno Maximilian Schulze, Arne Fitschen and Stefan Evert is gratefully acknowledged.