vivekaxl/hpcc_i_spark

hpcc_i_spark

This package samples a dataset, identified by its logical filename, from an HPCC cluster. The dataset is returned as a list of lists, which can then be used to build models with scikit-learn. It can also be used to create a Spark RDD as follows:

# Get the content of the logical file from the cluster at the given IP, sampling only 2 points from each partition
content = get_content(logical_filename='vivek::data::c_ecolids.csv', thor_ip="152.46.17.96", no_sample=2)

# Convert each value from a string to the appropriate data type (float in this case).
# Nested lists are built eagerly, since map() returns a lazy iterator on Python 3.
content = [[float(value) for value in row] for row in content]

# Convert to an RDD (sc is an existing SparkContext)
rows_rdd = sc.parallelize(content)
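
For the scikit-learn use case mentioned above, the sampled list of lists usually needs to be split into features and labels first. The following is only a minimal sketch: the helper `to_features_and_labels` is hypothetical and not part of this package, the sample rows are made up, and it assumes that the label is stored in the last column, which may not hold for your dataset.

```python
def to_features_and_labels(rows):
    """Split rows of numeric strings into float features and float labels.

    Assumption (not from this package): the last column of each row is the label.
    """
    features, labels = [], []
    for row in rows:
        values = [float(v) for v in row]  # strings -> floats
        features.append(values[:-1])      # every column except the last
        labels.append(values[-1])         # last column as the label
    return features, labels


# Made-up stand-in for the list of lists returned by get_content
sample_rows = [["0.49", "0.29", "1"], ["0.07", "0.40", "0"]]
X, y = to_features_and_labels(sample_rows)
```

`X` and `y` are plain Python lists here, which is a form that scikit-learn estimators generally accept in their `fit(X, y)` method.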
