Skip to content

dcamper/DataPatterns

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

58 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

DataPatterns

DataPatterns is an ECL bundle that provides some basic data profiling and research tools to an ECL programmer.

Installation

This code is installed as an ECL Bundle. Complete instructions for managing ECL Bundles can be found in The ECL IDE and HPCC Client Tools documentation.

Use the ecl command line tool to install this bundle:

ecl bundle install https://github.com/hpcc-systems/DataPatterns.git

You may have to either navigate to the client tools bin directory before executing the command, or use the full path to the ecl tool.

After installation, all of the code here becomes available after you import it:

IMPORT DataPatterns;

Note that is possible to use this code without installing it as a bundle. To do so, simply make it available within your IDE and just ignore the Bundle.ecl file. With the Windows IDE, the DataPatterns directory must not be a top-level item in your repository list; it needs to be installed one level below the top level, such as within your "My Files" folder.

Profile

Documentation as pulled from the beginning of Profile.ecl:

Profile() is a function macro for profiling all or part of a dataset.
The output is a dataset containing the following information for each
profiled attribute:

     attribute               The name of the attribute
     given_attribute_type    The ECL type of the attribute as it was defined
                             in the input dataset
     best_attribute_type     An ECL data type that both allows all values
                             in the input dataset and consumes the least
                             amount of memory
     rec_count               The number of records analyzed in the dataset;
                             this may be fewer than the total number of
                             records, if the optional sampleSize argument
                             was provided with a value less than 100
     fill_count              The number of rec_count records containing
                             non-nil values
     fill_rate               The percentage of rec_count records containing
                             non-nil values; a 'nil value' is an empty
                             string or a numeric zero; note that BOOLEAN
                             attributes are always counted as filled,
                             regardless of their value; also, fixed-length
                             DATA attributes (e.g. DATA10) are also counted
                             as filled, given their typical function of
                             holding data blobs
     cardinality             The number of unique, non-nil values within
                             the attribute
     modes                   The most common values in the attribute, after
                             coercing all values to STRING, along with the
                             number of records in which the values were
                             found; if no value is repeated more than once
                             then no mode will be shown; up to five (5)
                             modes will be shown; note that string values
                             longer than the maxPatternLen argument will
                             be truncated
     min_length              The shortest length of a value when expressed
                             as a string; null values are ignored
     max_length              The longest length of a value when expressed
                             as a string
     ave_length              The average length of a value when expressed
                             as a string
     popular_patterns        The most common patterns of values; see below
     rare_patterns           The least common patterns of values; see below
     is_numeric              Boolean indicating if the original attribute
                             was numeric and therefore whether or not
                             the numeric_xxxx output fields will be
                             populated with actual values; if this value
                             is FALSE then all numeric_xxxx output values
                             should be ignored
     numeric_min             The smallest non-nil value found within the
                             attribute as a DECIMAL; the attribute must be
                             a numeric ECL datatype; non-numeric attributes
                             will return zero
     numeric_max             The largest non-nil value found within the
                             attribute as a DECIMAL; the attribute must be
                             a numeric ECL datatype; non-numeric attributes
                             will return zero
     numeric_mean            The mean (average) non-nil value found within
                             the attribute as a DECIMAL; the attribute must
                             be a numeric ECL datatype; non-numeric
                             attributes will return zero
     numeric_std_dev         The standard deviation of the non-nil values
                             in the attribute as a DECIMAL; the attribute
                             must be a numeric ECL datatype; non-numeric
                             attributes will return zero
     numeric_lower_quartile  The value separating the first (bottom) and
                             second quarters of non-nil values within
                             the attribute as a DECIMAL; the attribute must
                             be a numeric ECL datatype; non-numeric
                             attributes will return zero
     numeric_median          The median non-nil value within the attribute
                             as a DECIMAL; the attribute must be a numeric
                             ECL datatype; non-numeric attributes will return
                             zero
     numeric_upper_quartile  The value separating the third and fourth
                             (top) quarters of non-nil values within
                             the attribute as a DECIMAL; the attribute must
                             be a numeric ECL datatype; non-numeric
                             attributes will return zero
     numeric_correlations    A child dataset containing correlation values
                             comparing the current numeric attribute with all
                             other numeric attributes, listed in descending
                             correlation value order; the attribute must be
                             a numeric ECL datatype; non-numeric attributes
                             will return an empty child dataset; note that
                             this can be a time-consuming operation,
                             depending on the number of numeric attributes
                             in your dataset and the number of rows (if you
                             have N numeric attributes, then
                             N * (N - 1) / 2 calculations are performed,
                             each scanning all data rows)

Most profile outputs can be disabled.  See the 'features' argument, below.

Data patterns can give you an idea of what your data looks like when it is
expressed as a (human-readable) string.  The function converts each
character of the string into a fixed character palette to produce a "data
pattern" and then counts the number of unique patterns for that attribute.
The most- and least-popular patterns from the data will be shown in the
output, along with the number of times that pattern appears and an example
(randomly chosen from the actual data).  The character palette used is:

     A   Any uppercase letter
     a   Any lowercase letter
     9   Any numeric digit
     B   A boolean value (true or false)

All other characters are left as-is in the pattern.

Only the top level attributes within a dataset are processed; embedded
records and child recordsets are skipped.  SET data types (such as
SET OF INTEGER) are also skipped.  If the dataset contains only
embedded records and/or child recordsets, or if fieldListStr is specified
but with attributes that don't actually exist in the top level (or are
invalid) then an error will be produced during compilation time.

This function works best when the incoming dataset contains attributes that
have precise data types (e.g. UNSIGNED4 data types instead of numbers
stored in a STRING data type).

Function parameters:

@param   inFile          The dataset to process; REQUIRED
@param   fieldListStr    A string containing a comma-delimited list of
                         attribute names to process; use an empty string to
                         process all attributes in inFile; attributes named
                         here that are not found in the top level of inFile
                         will be ignored; OPTIONAL, defaults to an
                         empty string
@param   maxPatterns     The maximum number of patterns (both popular and
                         rare) to return for each attribute; OPTIONAL,
                         defaults to 100
@param   maxPatternLen   The maximum length of a pattern; longer patterns
                         are truncated in the output; this value is also
                         used to set the maximum length of the data to
                         consider when finding cardinality and mode values;
                         must be 33 or larger; OPTIONAL, defaults to 100
@param   features        A comma-delimited string listing the profiling
                         elements to be included in the output; OPTIONAL,
                         defaults to a comma-delimited string containing all
                         of the available keywords:
                             KEYWORD         AFFECTED OUTPUT
                             fill_rate       fill_rate
                                             fill_count
                             cardinality     cardinality
                             best_ecl_types  best_attribute_type
                             modes           modes
                             lengths         min_length
                                             max_length
                                             ave_length
                             patterns        popular_patterns
                                             rare_patterns
                             min_max         numeric_min
                                             numeric_max
                             mean            numeric_mean
                             std_dev         numeric_std_dev
                             quartiles       numeric_lower_quartile
                                             numeric_median
                                             numeric_upper_quartile
                             correlations    numeric_correlations
                         To omit the output associated with a single keyword,
                         set this argument to a comma-delimited string
                         containing all other keywords; note that the
                         is_numeric output will appear only if min_max,
                         mean, std_dev, quartiles, or correlations features
                         are active
@param   sampleSize      A positive integer representing a percentage of
                         inFile to examine, which is useful when analyzing a
                         very large dataset and only an estimated data
                         profile is sufficient; valid range for this
                         argument is 1-100; values outside of this range
                         will be clamped; OPTIONAL, defaults to 100 (which
                         indicates that the entire dataset will be analyzed)

Here is a very simple example of executing the full data profiling code:

IMPORT DataPatterns;

DataRec := { string source, unsigned bid, string name };

ds := DATASET('~thor::my_sample_data', DataRec, FLAT);

profileResults := DataPatterns.Profile(ds);

OUTPUT(profileResults, ALL, NAMED('profileResults'));

The data profiling code can be easily tested with the included Tests module. Execute the following in either hThor or Roxie (because all of the test data is inline it is best to not use Thor for running these tests):

IMPORT DataPatterns;

EVALUATE(DataPatterns.Tests);

If the tests pass then the execution will succeed and there will be no output. You will see a runtime error if any of the tests fail or if you execute the tests on Thor. Note that it may take some time to compile the code before execution.

BestRecordStructure

This is a function macro that, given a dataset, returns a recordset containing the "best" record definition for the given dataset. By default, the entire dataset will be examined. You can override this behavior by providing a percentage of the dataset to examine (1-100) as the second argument. This is useful if you are checking a very large file and are confident that a sample will provide correct results.

There is an important limitation in this function: Child datasets and embedded record definitions are ignored entirely (this comes from the fact that this function calls the Profile() function, and that function ignores child datasets and embedded records). All other top-level attributes are returned, though, so you can edit the returned string and insert anything that is missing.

Sample call:

IMPORT DataPatterns;

DataRec := { string source, unsigned bid, string name };

ds := DATASET('~thor::my_sample_data', DataRec, FLAT);

recordDefinition := DataPatterns.BestRecordStructure(ds);

OUTPUT(recordDefinition, NAMED('recordDefinition'), ALL);

The result will be a recordset containing only a STRING field. The first record will always contain 'RECORD' and the last record will always contain 'END;'. The records in between will contain declarations for the attributes found within the given dataset. The entire result can be copied and pasted into an ECL code module.

Note that, when outputing the result of BestRecordStructure to a workunit, it is a good idea to add an ALL flag to the OUTPUT function. This ensures that all attributes will be displayed. Otherwise, if you have more than 100 attributes in the given dataset, the result will be truncated.

About

See parent repo for up-to-date code

Resources

Stars

Watchers

Forks

Packages

No packages published

Languages

  • ECL 100.0%