Swivel Spark prep

Distributed data preparation for the Swivel model

Distributed equivalent of prep.py and fastprep from Swivel using Apache Spark.

Development

./gradlew idea # if using IntelliJ
./gradlew build
./gradlew test

Run

On a single machine, with Apache Spark in local mode

./gradlew shadowJar
./sparkprep --help

On an Apache Spark standalone cluster

./gradlew build

# https://github.com/tensorflow/ecosystem/tree/master/hadoop#build-and-install
cp <path-to-ecosystem-hadoop>/target/tensorflow-hadoop-1.0-SNAPSHOT-shaded-protobuf.jar .
# or use the unofficial build from the .m2 or .gradle cache, after ./gradlew shadowJar
cp <path-to>/tensorflow-hadoop-1.0-01232017-SNAPSHOT-shaded-protobuf.jar .


MASTER="<master-url>" ./sparkprep-cluster --help

Algorithm

Pre-processing consists of 3 jobs:

1. read or create a vocabulary
2. co-occurrence matrix
   - vectorize the input: token->int using the vocabulary
   - build the full dense co-occurrence matrix for the given window size
   - shard the co-occurrence matrix into N pieces (over each dimension)
     - encode each shard in a single Protobuf
     - save N^2 files
3. co-occurrence matrix: compute marginal sums
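
The jobs above can be sketched on a single machine in plain Python. This is only an illustration of the data flow, not the Spark implementation in this repository. The function names, the `min_count` parameter, the 1/distance weighting, and the modulo-based shard assignment are all assumptions made for the example. The sketch also stores counts in a sparse dict rather than a dense matrix, and it stops before the Protobuf encoding and file output.

```python
from collections import Counter, defaultdict


def build_vocab(lines, min_count=1):
    """Vocabulary job: map each token to an int id, most frequent first."""
    counts = Counter(tok for line in lines for tok in line.split())
    kept = sorted(
        (tok for tok, c in counts.items() if c >= min_count),
        key=lambda tok: (-counts[tok], tok),  # ties broken alphabetically
    )
    return {tok: i for i, tok in enumerate(kept)}


def cooccurrence(lines, vocab, window=2):
    """Vectorize each line (token -> int) and count symmetric co-occurrences."""
    cooc = defaultdict(float)
    for line in lines:
        ids = [vocab[tok] for tok in line.split() if tok in vocab]
        for pos, center in enumerate(ids):
            for off in range(1, window + 1):
                if pos + off >= len(ids):
                    break
                other = ids[pos + off]
                weight = 1.0 / off  # assumed weighting: closer tokens count more
                cooc[(center, other)] += weight
                cooc[(other, center)] += weight
    return cooc


def shard(cooc, num_shards):
    """Split the matrix into num_shards x num_shards blocks (N^2 in total).

    Assumed layout: global row r goes to shard r % N at local row r // N,
    and likewise for columns.
    """
    shards = {(r, c): {} for r in range(num_shards) for c in range(num_shards)}
    for (row, col), value in cooc.items():
        key = (row % num_shards, col % num_shards)
        shards[key][(row // num_shards, col // num_shards)] = value
    return shards


def marginals(cooc, vocab_size):
    """Marginal sums job: total co-occurrence weight per row."""
    sums = [0.0] * vocab_size
    for (row, _col), value in cooc.items():
        sums[row] += value
    return sums
```

For a corpus given as a list of whitespace-tokenized lines, the calls chain as `vocab = build_vocab(lines)`, `cooc = cooccurrence(lines, vocab, window)`, `shards = shard(cooc, n)`, and `marginals(cooc, len(vocab))`. Because the counts are symmetric, row sums and column sums coincide, so one marginal vector is enough in this sketch.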
