
Data Quality tool #17

Open
jmquintana79 opened this issue Oct 15, 2023 · 0 comments
Labels: ANALYSIS (Analysis tools), enhancement (New feature or request)

Comments


jmquintana79 commented Oct 15, 2023

Introduction

A tool to verify the usefulness of the data is essential, both for understanding the data (Data Analysis) and for an ML pipeline (quality checks on the training / testing datasets, as well as possible differences between the two).

The great_expectations library, which verifies data quality through a set of tests (expectations) that raise warnings, gave me the idea of developing a method to tackle this problem.

The idea is to create a simple tool / methodology with the following features:

  • As soon as a new dataset is received, the very first step, even before or right after an EDA, is to create an object (e.g. JSON) with the required parameters of each column to be analyzed, based on a template (e.g. with the maximum and minimum allowed ranges); see the sketch after the note below.
  • This template will come pre-populated with default values.

NOTE: The library mentioned above could perhaps already cover everything described so far.
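
A minimal sketch of what such a per-column template object could look like once serialized to JSON. All field names (`dtype`, `allow_nulls`, `min_value`, `max_value`, `allowed_values`) are assumptions for illustration, not an existing schema.

```python
import json

# Hypothetical per-column defaults: permissive values the analyst tightens after the EDA.
COLUMN_TEMPLATE_DEFAULTS = {
    "dtype": None,           # expected type, e.g. "float64"
    "allow_nulls": True,     # whether missing values are acceptable
    "min_value": None,       # lower bound (None = unbounded)
    "max_value": None,       # upper bound (None = unbounded)
    "allowed_values": None,  # optional whitelist for categorical columns
}

def build_template(columns):
    """Create the per-column quality spec, pre-populated with default values."""
    return {col: dict(COLUMN_TEMPLATE_DEFAULTS) for col in columns}

if __name__ == "__main__":
    spec = build_template(["temperature", "station_id"])
    spec["temperature"].update({"dtype": "float64", "min_value": -30, "max_value": 50})
    with open("quality_spec.json", "w") as f:
        json.dump(spec, f, indent=2)
```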

  • The goal is also to have a general warning system (whether the data passed the tests or not).
  • It would also be nice to have a more fine-grained warning system that reports the record, the variable and the nature of the failure in any of the tests. In practice this could be a log or even a queryable table.
  • Besides warning, it should act. The simplest option I can think of is to create a new boolean column "is_low_quality" to make it easy to filter the records that have an alert in any of the columns. Obviously, if every record turns out to be "low quality", that should be reported as well (see the sketch after this list).
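
A minimal, pandas-based sketch of the "act" part, with every name hypothetical: the checker reads the spec above, appends a boolean "is_low_quality" column for easy filtering, prints the general pass/fail warning, and returns the detailed failure log as a queryable table.

```python
import pandas as pd

def check_quality(df: pd.DataFrame, spec: dict):
    """Flag records violating the per-column spec; return (flagged df, failure log)."""
    failures = []                                # detailed log: record, variable, nature of failure
    flagged = pd.Series(False, index=df.index)   # per-record alert flag

    for col, rules in spec.items():
        if col not in df.columns:
            failures.append({"index": None, "column": col, "issue": "missing column"})
            continue
        s = df[col]
        if not rules.get("allow_nulls", True):
            bad = s.isna()
            flagged |= bad
            failures += [{"index": i, "column": col, "issue": "null value"} for i in df.index[bad]]
        if rules.get("min_value") is not None:
            bad = s < rules["min_value"]
            flagged |= bad
            failures += [{"index": i, "column": col, "issue": "below min"} for i in df.index[bad]]
        if rules.get("max_value") is not None:
            bad = s > rules["max_value"]
            flagged |= bad
            failures += [{"index": i, "column": col, "issue": "above max"} for i in df.index[bad]]

    out = df.copy()
    out["is_low_quality"] = flagged              # enables filtering of alerted records

    # General warning system: did the data pass the tests or not?
    if len(df) > 0 and flagged.all():
        print("WARNING: every record is low quality")
    elif flagged.any():
        print(f"WARNING: {int(flagged.sum())} / {len(df)} records failed some check")
    else:
        print("OK: all records passed the checks")

    return out, pd.DataFrame(failures)
```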

Glossary of Expectations

Table shape

expect_column_to_exist
expect_table_columns_to_match_ordered_list
expect_table_columns_to_match_set
expect_table_row_count_to_be_between
expect_table_row_count_to_equal
expect_table_row_count_to_equal_other_table

Missing values, unique values, and types

expect_column_values_to_be_unique
expect_column_values_to_not_be_null
expect_column_values_to_be_null
expect_column_values_to_be_of_type
expect_column_values_to_be_in_type_list

Sets and ranges

expect_column_values_to_be_in_set
expect_column_values_to_not_be_in_set
expect_column_values_to_be_between
expect_column_values_to_be_increasing
expect_column_values_to_be_decreasing

String matching

expect_column_value_lengths_to_be_between
expect_column_value_lengths_to_equal
expect_column_values_to_match_regex
expect_column_values_to_not_match_regex
expect_column_values_to_match_regex_list
expect_column_values_to_not_match_regex_list
expect_column_values_to_match_like_pattern
expect_column_values_to_not_match_like_pattern
expect_column_values_to_match_like_pattern_list
expect_column_values_to_not_match_like_pattern_list

Datetime and JSON parsing

expect_column_values_to_match_strftime_format
expect_column_values_to_be_dateutil_parseable
expect_column_values_to_be_json_parseable
expect_column_values_to_match_json_schema

Aggregate functions

expect_column_distinct_values_to_be_in_set
expect_column_distinct_values_to_contain_set
expect_column_distinct_values_to_equal_set
expect_column_mean_to_be_between
expect_column_median_to_be_between
expect_column_quantile_values_to_be_between
expect_column_stdev_to_be_between
expect_column_unique_value_count_to_be_between
expect_column_proportion_of_unique_values_to_be_between
expect_column_most_common_value_to_be_in_set
expect_column_max_to_be_between
expect_column_min_to_be_between
expect_column_sum_to_be_between

Multi-column

expect_column_pair_values_A_to_be_greater_than_B
expect_column_pair_values_to_be_equal
expect_column_pair_values_to_be_in_set
expect_select_column_values_to_be_unique_within_record
expect_multicolumn_sum_to_equal
expect_column_pair_cramers_phi_value_to_be_less_than
expect_compound_columns_to_be_unique

Distributional functions

expect_column_kl_divergence_to_be_less_than
expect_column_bootstrapped_ks_test_p_value_to_be_greater_than
expect_column_chisquare_test_p_value_to_be_greater_than
expect_column_parameterized_distribution_ks_test_p_value_to_be_greater_than

FileDataAsset

File data assets support expectations at the file level and at the line level (for text data).

expect_file_line_regex_match_count_to_be_between
expect_file_line_regex_match_count_to_equal
expect_file_hash_to_equal
expect_file_size_to_be_between
expect_file_to_exist
expect_file_to_have_valid_table_header
expect_file_to_be_valid_json
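
For reference, a minimal sketch of how a few of the expectations listed above are invoked, assuming the classic pandas-wrapper API of great_expectations 0.x (`ge.from_pandas`); the newer Fluent / Data Context API is organized differently. The sample data is made up.

```python
import great_expectations as ge
import pandas as pd

df = pd.DataFrame({
    "temperature": [21.5, 19.0, 999.0],   # 999.0 is an obvious outlier
    "station_id": ["A1", "A2", None],     # one missing identifier
})

# Wrap the pandas DataFrame so expectations can be called on it directly.
ge_df = ge.from_pandas(df)

# Two expectations from the glossary above.
result_range = ge_df.expect_column_values_to_be_between("temperature", min_value=-30, max_value=50)
result_nulls = ge_df.expect_column_values_to_not_be_null("station_id")

# Each result reports a success flag plus counts of unexpected values.
print(result_range)
print(result_nulls)
```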

References
