Apache Spark is a powerful tool for distributed data processing. Staying productive on the platform means writing Spark scripts in a modular, testable and reusable fashion.
The Uncharted Spark Pipeline provides a standardized way to express the individual components of a Spark script so that they can be:
- connected in series (or even in a more complex dependency graph of operations)
- unit tested effectively with mock inputs (see the sketch below)
- reused and shared
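For example, an operation is just an ordinary Scala function, so it can be chained into a pipe, exercised against a small in-memory input, and shared like any other code. A minimal sketch, assuming spark-shell (so sqlContext is already in scope); the adultsOnly operation and the sample data are illustrative, not part of the library:

import org.apache.spark.sql.DataFrame
import software.uncharted.sparkpipe.Pipe

// a reusable operation is just a function from DataFrame to DataFrame
def adultsOnly(df: DataFrame): DataFrame = df.filter("age > 21")

// a small mock input, built in memory instead of read from production data
val mock = sqlContext.createDataFrame(Seq(("Ann", 34), ("Bo", 12))).toDF("name", "age")

// connect the operation in series with others and run the pipe
Pipe(mock).to(adultsOnly).to(_.count).run  // 1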
Try the pipeline yourself using spark-shell:
$ spark-shell --packages software.uncharted.sparkpipe:sparkpipe-core:0.9.5
scala> import software.uncharted.sparkpipe.Pipe
scala> Pipe("hello").to(_+" world").run
Assuming you have a file named people.json, you can read it into a DataFrame and manipulate it (end the :paste block with Ctrl+D to evaluate):
scala> :paste
import software.uncharted.sparkpipe.Pipe
import software.uncharted.sparkpipe.ops
// read a DataFrame from the json file
Pipe(sqlContext)
  .to(ops.core.dataframe.io.read("people.json", "json"))
  // rename the "age" column to "personAge"
  .to(ops.core.dataframe.renameColumns(Map("age" -> "personAge")))
  // keep only people over 21 and count them
  .to(_.filter("personAge > 21").count)
  .run
The Uncharted Spark Pipeline comes bundled with core operations that perform a variety of common tasks and are intended to serve as building blocks for more domain-specific operations.
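For example, a domain-specific operation might wrap a core operation with extra logic of its own. A rough sketch; standardizeAges and its filter are hypothetical, while renameColumns and io.read are the core operations used above:

import org.apache.spark.sql.DataFrame
import software.uncharted.sparkpipe.ops

// a hypothetical domain-specific operation built on top of a core op
def standardizeAges(df: DataFrame): DataFrame = {
  val renamed = ops.core.dataframe.renameColumns(Map("age" -> "personAge"))(df)
  renamed.filter("personAge >= 0")  // drop records with clearly invalid ages
}

// used like any other operation:
// Pipe(sqlContext)
//   .to(ops.core.dataframe.io.read("people.json", "json"))
//   .to(standardizeAges)
//   .run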
For more information, check out the docs.