This repository has been archived by the owner on Feb 22, 2024. It is now read-only.

few changes to documentation, work in progress
andreybratus committed Mar 28, 2015
1 parent ef19241 commit 65b83f8
Showing 8 changed files with 176 additions and 90 deletions.
228 changes: 156 additions & 72 deletions README.md
@@ -1,6 +1,5 @@
# BatchRefine [![Build Status](https://travis-ci.org/fusepoolP3/p3-batchrefine.svg?branch=master)](https://travis-ci.org/fusepoolP3/p3-batchrefine)

===========
BatchRefine [![Build Status](https://travis-ci.org/fusepoolP3/p3-batchrefine.svg?branch=master)](https://travis-ci.org/fusepoolP3/p3-batchrefine)
==========

BatchRefine is an effort to run [OpenRefine](http://openrefine.org)
effectively in batch workloads and ETL pipelines. Goals include:
@@ -11,8 +10,8 @@ effectively in batch workloads and ETL pipelines. Goals include:

This is a work in progress, and so is this documentation.

Introduction
------
### Introduction


BatchRefine currently works by providing a collection of wrappers
(referred to as _backends_) and, in some modes, a distribution layer,
@@ -25,23 +24,167 @@ chained with other transformers.

Backends and access clients can be combined to tailor the needs of the
application. We will discuss two simple combinations that provide
basic functionality: a basic Fusepool P3 transformer (accessible via
HTTP), and a command line tool.
basic functionality: a [Command Line Tool](#command-line-tool) and a [Fusepool P3 transformer](#p3-transformer) (accessible via HTTP).

Whatever way you choose to use BatchRefine, you will need two things:

1. a CSV to use as input file;
2. an OpenRefine command history (referred to as a _transform
script_), packaged as a JSON file.
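
As a minimal sketch of these two inputs, the snippet below writes a tiny CSV and a one-step transform script. The file contents are assumptions made for illustration: the column names and rows are invented, and the JSON follows the shape OpenRefine produces when an operation history is extracted from its Undo/Redo tab (here a single `core/column-rename` operation).

```sh
# Hypothetical sample data: column names and rows are made up for illustration.
cat > input.csv <<'EOF'
name,city
Ada,London
Grace,New York
EOF

# A one-step transform script: rename the column "name" to "full_name".
# The array-of-operations layout mirrors an extracted OpenRefine history.
cat > transform.json <<'EOF'
[
  {
    "op": "core/column-rename",
    "description": "Rename column name to full_name",
    "oldColumnName": "name",
    "newColumnName": "full_name"
  }
]
EOF
```

In practice the transform script would normally be exported from an OpenRefine project rather than written by hand.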

P3 Transformer
### Try it out

To try BatchRefine right away, use the pre-built Docker image:

```sh
docker run --rm -it -p 7100:7100 fusepool/batchrefine
```

This will start the [P3 BatchRefine transformer](#p3-transformer) with the default configuration,
which can be accessed as follows:

```sh
curl -XPOST -H 'Content-Type:text/csv' --data-binary @input.csv 'localhost:7100/?refinejson=http://url.to/transform.json'
```
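
The same request can be wrapped in a small helper that saves the result to a file and fails on HTTP errors. This is only a sketch under stated assumptions: `TRANSFORMER`, `TRANSFORM_URL`, `request_url` and `transform_csv` are hypothetical names introduced here, and the transformer is assumed to be listening on `localhost:7100` as started by the Docker command.

```sh
# Assumptions: the transformer listens on localhost:7100 and TRANSFORM_URL
# (a placeholder) is a URL from which transform.json can be fetched.
TRANSFORMER="http://localhost:7100"
TRANSFORM_URL="http://url.to/transform.json"

# Build the request URL; the refinejson parameter points at the transform script.
request_url() {
  printf '%s/?refinejson=%s' "$1" "$2"
}

# POST a CSV file ($1) and write the response to $2.
# -f makes curl exit non-zero when the server answers with an HTTP error status.
transform_csv() {
  curl -sS -f -X POST -H 'Content-Type: text/csv' \
    --data-binary "@$1" -o "$2" \
    "$(request_url "$TRANSFORMER" "$TRANSFORM_URL")"
}

# Usage (requires a running transformer):
#   transform_csv input.csv output.csv
```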

Compiling and Running
--------------

### Building from Sources

Building BatchRefine from sources requires Maven 3 and Apache ant (for
building OpenRefine). The procedure, which is somewhat complex because
OpenRefine is not meant to be used as a library, is as follows. On a
clean folder:

1. Download the OpenRefine 2.6-beta.1 source distribution from:

https://github.com/OpenRefine/OpenRefine/archive/2.6-beta.1.tar.gz

2. Unzip, untar, and then build OpenRefine, the server and web app
JARs by running:

```sh
ant build jar_server jar_webapp
```

3. Switch to the `./extensions` folder under the OpenRefine root and
then download the OpenRefine RDF extension alpha 0.9.0 source
distribution:

https://github.com/fadmaa/grefine-rdf-extension/archive/v0.9.0.tar.gz

Unzip, untar, and then rename the folder it extracts into
`rdf-extension` and build it as follows:

```sh
mv grefine-rdf-extension-0.9.0 rdf-extension
cd rdf-extension
JAVA_TOOL_OPTIONS='-Dfile.encoding=UTF-8' ant build
```

4. After that, switch back to the OpenRefine root and start it (`./refine`). A
running instance is required for the tests that BatchRefine will
run during the build.

5. Download BatchRefine from:

https://github.com/fusepoolP3/p3-batchrefine/releases/latest

into a sibling folder to OpenRefine (i.e. both OpenRefine and
BatchRefine should share the same parent folder). As usual, unzip
and untar. Switch to the `p3-batchrefine-v1.x.x` folder, and run:

```sh
./bin/refine-import.sh
mvn package
```

The JAR for starting the P3 transformer will be located under:

`./clients/clients-transformer/target/clients-transformer-{project.version}-jar-with-dependencies.jar`

whereas the JAR for starting the command line client will be under:

`./clients/clients-cli/target/clients-cli-{project.version}-jar-with-dependencies.jar`



### Running

This section describes how to run the tools. For more details, refer to the [Usage](#usage) section.

**Run the [Command Line Tool](#command-line-tool)**

```sh
./bin/batchrefine [--verbose] BACKEND_TYPE [backend_specific_options] INPUTFILE TRANSFORM [OUTPUTFILE]
```

If no `OUTPUTFILE` is specified, the output is written to `STDOUT`.

Available backends:
* remote - simple HTTP client that connects to an OpenRefine instance
* embedded - built-in OpenRefine version [(currently has limited functionality)](#running-with-the-embedded-backend)
* split - distributed backend able to connect to multiple OpenRefine instances and improve
performance by splitting the input file.
* spark - distributed backend based on [Apache Spark](http://spark.apache.org) aimed at very large workloads
(currently has limited functionality)

To list the `backend_specific_options`:

```
./bin/batchrefine BACKEND_TYPE --help
```

**Run the [P3 Transformer](#p3-transformer)**

```sh
./bin/transformer [TRANSFORMER_OPTIONS] BACKEND_TYPE [backend_specific_options]
```

`TRANSFORMER_OPTIONS` are:


```
-v -- verbose logging
-p [PORT] -- port on which the transformer listens (defaults to 7100)
-t [sync|async] -- transformer type: synchronous or asynchronous (defaults to sync)
```

Available backends:
* remote
* split
* spark

`backend_specific_options` are the same as for the command line client and can be listed with
the `--help` option.

To start the transformer in its most common configuration, connected
to a locally running instance of OpenRefine, run:

```sh
./bin/transformer remote
# which is equivalent to:
./bin/transformer -v -t sync remote -l localhost:3333
```



Usage
-----

## P3 Transformer


The BatchRefine P3 transformer wraps (multiple instances of)
OpenRefine under the Fusepool P3 HTTP API. We will show how to build a
transformer that operates over a single instance, running locally.

## Building with Docker
### Building with Docker

Building and deploying the P3 transformer with
[Docker](https://www.docker.com/) is easy. Assuming you have Docker
@@ -59,7 +202,7 @@ cd docker
./batchrefine-docker.sh bootstrap -t sync
```

## Running
### Running

After running the bootstrap step, you just have to run:

@@ -92,6 +235,7 @@ transformed according to what is described in `transform.json`.
best to build and run the transformer from sources (see the section
on building BatchRefine from sources).


Command Line Tool
=================

@@ -100,7 +244,7 @@ but clumsy for manual usage. The command line tool works better in
these cases, as you can simply do:

```sh
batchrefine input.csv transform.json > output.csv
./bin/batchrefine remote input.csv transform.json > output.csv
```

where, as before, `input.csv` is the input file, `transform.json` is
@@ -128,7 +272,7 @@ and BatchRefine, respectively. Then, to transform a file, from the
BatchRefine root, do:

```sh
./bin/batchrefine -e embedded input.csv transform.json
./bin/batchrefine embedded input.csv transform.json
```

this will produce a CSV file on stdout with the transform applied to
@@ -151,66 +295,6 @@ OpenRefine instance. If you have OpenRefine running on
./bin/batchrefine -e remote -h refine.example.com input.csv transform.json
```
Building from Sources
=====================

Building BatchRefine from sources requires Maven 3 and Apache ant (for
building OpenRefine). The procedure, which is somewhat complex because
OpenRefine is not meant to be used as a library, is as follows. On a
clean folder:

1. Download the OpenRefine 2.6-beta.1 source distribution from:

https://github.com/OpenRefine/OpenRefine/archive/2.6-beta.1.tar.gz

2. Unzip, untar, and then build OpenRefine, the server and web app
JARs by running:

```sh
ant build jar_server jar_webapp
```

3. Switch to the `./extensions` folder under the OpenRefine root and
then download the OpenRefine RDF extension alpha 0.9.0 source
distribution:

https://github.com/fadmaa/grefine-rdf-extension/archive/v0.9.0.tar.gz

Unzip, untar, and then rename the folder it extracts into
`rdf-extension` and build it as follows:

```sh
mv grefine-rdf-extension-0.9.0 rdf-extension
cd rdf-extension

JAVA_TOOL_OPTIONS='-Dfile.encoding=UTF-8' ant build
```

4. After that, switch back to the OpenRefine root and start it. A
running instance is required for the tests that BatchRefine will
run during the build.

5. Download BatchRefine from:

https://github.com/fusepoolP3/batchrefine/archive/master.tar.gz

into a sibling folder to OpenRefine (i.e. both OpenRefine and
BatchRefine should share the same parent folder). As usual, unzip
and untar. Switch to the `p3-batchrefine-master` folder, and run:

```sh
./bin/refine-import.sh

mvn package
```

The JAR for starting the P3 transformer will be located under:

`./clients/clients-transformer/target/clients-transformer-{project.version}-jar-with-dependencies.jar`

whereas the JAR for starting the command line client will be under:

`./clients/clients-cli/target/clients-cli-{project.version}-jar-with-dependencies.jar`
Miscellaneous
=============
File renamed without changes.
6 changes: 6 additions & 0 deletions bin/transformer
@@ -0,0 +1,6 @@
#!/bin/bash

BATCHREFINE_ROOT=${BATCHREFINE_ROOT:-$PWD}
VERSION=$(cat "${BATCHREFINE_ROOT}/version")

java -jar "${BATCHREFINE_ROOT}/clients/clients-transformer/target/clients-transformer-${VERSION}-jar-with-dependencies.jar" "$@"
@@ -8,17 +8,15 @@
import org.apache.log4j.Level;
import org.apache.log4j.Logger;
import org.json.JSONArray;
import org.kohsuke.args4j.Argument;
import org.kohsuke.args4j.CmdLineException;
import org.kohsuke.args4j.CmdLineParser;
import org.kohsuke.args4j.Option;
import org.kohsuke.args4j.*;
import org.kohsuke.args4j.spi.SubCommand;
import org.kohsuke.args4j.spi.SubCommandHandler;
import org.kohsuke.args4j.spi.SubCommands;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.net.ConnectException;
import java.net.URISyntaxException;
import java.util.List;
@@ -152,7 +150,7 @@ private void printUsage(CmdLineException ex) {
"USAGE: batchrefine [-v] ENGINETYPE [ENGINE OPTS] INPUT TRANSFORM [OUTPUT]\n\n" +
"If no OUTPUT is specified, writes to standard output.\n" +
"-------------------------------------------------------------------------------");
ex.getParser().printUsage(System.err);
ex.getParser().printUsage(new OutputStreamWriter(System.err),null, OptionHandlerFilter.ALL);
}

public static void main(String[] args) {
@@ -25,7 +25,7 @@ public static enum Format {
}


@Option(name = "-f", aliases = {("--format")}, usage = "The format in which to output the transformed data")
@Option(name = "-f", aliases = {("--format")}, usage = "The format in which to output the transformed data",hidden = true)
protected Format fFormat = Format.csv;
@Option(name = "-c", aliases = {("--config")}, metaVar = "config.properties", usage = "Load batchrefine config from properties file", handler = FileOptionHandler.class)
protected File configFile;
@@ -61,8 +61,8 @@ public String getDefaultMetaVariable() {
StringBuffer rv = new StringBuffer();
rv.append("[");
for (Split t : Split.class.getEnumConstants()) {
rv.append(":int");
rv.append(t).append(" | ");
rv.append(t).append(":int")
.append(" | ");
}
rv.delete(rv.length() - 3, rv.length());
rv.append("]");
@@ -9,14 +9,13 @@
import eu.spaziodati.eu.clients.core.commands.SplitCommand;
import org.apache.log4j.Level;
import org.apache.log4j.Logger;
import org.kohsuke.args4j.Argument;
import org.kohsuke.args4j.CmdLineException;
import org.kohsuke.args4j.CmdLineParser;
import org.kohsuke.args4j.Option;
import org.kohsuke.args4j.*;
import org.kohsuke.args4j.spi.SubCommand;
import org.kohsuke.args4j.spi.SubCommandHandler;
import org.kohsuke.args4j.spi.SubCommands;

import java.io.OutputStreamWriter;

/**
* Command Line Interface for launching {@link SynchronousTransformer}.
*
@@ -80,7 +79,7 @@ private void start() throws Exception {
private void printUsage(CmdLineParser parser) {
System.err.println("Usage: batchrefine-transformer [TRANSFORMER OPTIONS] ENGINETYPE [ENGINEOPTIONS]\n"
+ "Starts the BatchRefine Fusepool P3 Transformer.\n");
parser.printUsage(System.err);
parser.printUsage(new OutputStreamWriter(System.err), null, OptionHandlerFilter.PUBLIC);
}

public static void main(String[] args) throws Exception {