Distributed data preparation for the Swivel model
Distributed equivalent of prep.py and fastprep from Swivel, using Apache Spark.
./gradlew idea # if using IntelliJ
./gradlew build
./gradlew test
On a single machine, Apache Spark in Local mode
./gradlew shadowJar
./sparkprep --help
On an Apache Spark standalone cluster
./gradlew build
# https://github.com/tensorflow/ecosystem/tree/master/hadoop#build-and-install
cp <path-to-ecosystem-hadoop>/target/tensorflow-hadoop-1.0-SNAPSHOT-shaded-protobuf.jar .
# or use an unofficial build from the .m2 or .gradle cache, after ./gradlew shadowJar
cp <path-to>/tensorflow-hadoop-1.0-01232017-SNAPSHOT-shaded-protobuf.jar .
MASTER="<master-url>" ./sparkprep-cluster --help
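Alternatively, the shaded jar can be submitted by hand with spark-submit, passing the tensorflow-hadoop jar via --jars. A minimal sketch; the entry-point class and jar name below are placeholders (assumptions), check the actual shadowJar output under build/libs/:

spark-submit \
  --master "<master-url>" \
  --jars tensorflow-hadoop-1.0-SNAPSHOT-shaded-protobuf.jar \
  --class <main-class> \
  build/libs/<shaded-jar>.jar --help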
Pre-processing consists of 3 jobs (sketched in Scala after this list):
- reading or creating a vocabulary
- building the co-occurrence matrix:
  - vectorizing the input: token -> int, using the vocabulary
  - building the full dense co-occurrence matrix for a given window size
  - sharding the co-occurrence matrix into N pieces over each dimension
  - encoding each shard as a single protobuf
  - saving N^2 files
- co-occurrence matrix: counting the marginal sums
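A minimal sketch of the vocabulary job, not the project's actual implementation: it assumes a whitespace-tokenized plain-text corpus at corpus.txt (the path and tokenization are assumptions) and assigns integer ids by descending token frequency:

import org.apache.spark.sql.SparkSession

object VocabSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("vocab-sketch").getOrCreate()
    val sc = spark.sparkContext

    sc.textFile("corpus.txt")            // assumed input path
      .flatMap(_.split("\\s+"))          // assumed whitespace tokenization
      .filter(_.nonEmpty)
      .map((_, 1L))
      .reduceByKey(_ + _)                // token frequencies
      .sortBy(_._2, ascending = false)   // most frequent tokens get the lowest ids
      .keys
      .zipWithIndex()                    // token -> integer id
      .map { case (token, id) => s"$token\t$id" }
      .saveAsTextFile("vocab")

    spark.stop()
  }
}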
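A sketch of the co-occurrence job, continuing in the same context (sc and the vocab output from the sketch above): vectorize each line with the vocabulary, then emit a weighted pair for every token pair inside the window. The harmonic 1/distance weighting mirrors Swivel's prep.py; the broadcast lookup and the paths are assumptions of this sketch:

val window = 10                          // assumed window size
val vocab: Map[String, Long] =
  sc.textFile("vocab").map { line =>
    val Array(token, id) = line.split("\t")
    token -> id.toLong
  }.collectAsMap().toMap
val vocabB = sc.broadcast(vocab)

val cooc = sc.textFile("corpus.txt")
  .map(_.split("\\s+").flatMap(vocabB.value.get))        // token -> int, dropping OOV
  .flatMap { ids =>
    for {
      i <- ids.indices
      d <- 1 to math.min(window, i)                      // distance to the left neighbor
      pair <- Seq((ids(i - d), ids(i)), (ids(i), ids(i - d)))  // symmetric matrix
    } yield (pair, 1.0 / d)                              // harmonic 1/distance weight
  }
  .reduceByKey(_ + _)                                    // sparse (row, col) -> count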
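Finally, a sketch of sharding and the marginal sums, continuing from cooc above. Global row i lands in row-shard i % N with local index i / N, and likewise for columns, which yields N^2 shards; the value of N and the elided per-shard serialization (one protobuf file per shard, via the tensorflow-hadoop classes) are assumptions of this sketch:

val numShards = 32                                       // N; an assumed value

val shards = cooc
  .map { case ((row, col), value) =>
    val shardKey = ((row % numShards).toInt, (col % numShards).toInt)
    (shardKey, (row / numShards, col / numShards, value))  // local coordinates
  }
  .groupByKey()          // one group per (rowShard, colShard): N^2 groups,
                         // each to be encoded and saved as a single protobuf file

// Marginals: per-row and per-column totals over the whole matrix.
val rowSums = cooc.map { case ((row, _), v) => (row, v) }.reduceByKey(_ + _)
val colSums = cooc.map { case ((_, col), v) => (col, v) }.reduceByKey(_ + _)
rowSums.map { case (r, s) => s"$r\t$s" }.saveAsTextFile("row_sums")
colSums.map { case (c, s) => s"$c\t$s" }.saveAsTextFile("col_sums")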