Merge pull request #19 from sumeshi/feature/v0.4.0
Feature/v0.4.0
sumeshi authored Jan 18, 2025
2 parents 153c71d + 3870388 commit 1d95507
Showing 32 changed files with 646 additions and 275 deletions.
192 changes: 142 additions & 50 deletions README.md
@@ -3,7 +3,7 @@
[![PyPI version](https://badge.fury.io/py/qsv.svg)](https://badge.fury.io/py/qsv)
![PyPI - Downloads](https://img.shields.io/pypi/dm/qsv)

![quilter-csv](https://gist.githubusercontent.com/sumeshi/644af27c8960a9b6be6c7470fe4dca59/raw/4115bc2ccf9ab5fb40a455c34ac0be885b7f263d/quilter-csv.svg)
![quilter-csv](https://gist.githubusercontent.com/sumeshi/644af27c8960a9b6be6c7470fe4dca59/raw/00d774e6814a462eb48e68f29fc6226976238777/quilter-csv.svg)

A tool that provides elastic and rapid filtering for efficient analysis of huge CSV files, such as eventlogs.

@@ -107,7 +107,7 @@ Filters rows where the specified column matches the given regex.
| -------- | ---------- | --------- | ------------- | -------------------------------------------------------------------- |
| Argument | colname | str | | The name of the column to test against the regex pattern. |
| Argument | pattern | str | | A regular expression pattern used for matching values in the column. |
| Option | ignorecase | bool | False | If True, performs case-insensitive pattern matching. |

```
$ qsv load ./Security.csv - contains 'Date and Time' '10/6/2016'
```
@@ -121,7 +121,7 @@ Replaces values using the specified regex.
| Argument | colname | str | | The name of the column whose values will be modified. |
| Argument | pattern | str | | A regular expression pattern identifying substrings to replace. |
| Argument | replacement | str | | The text that replaces matched substrings. |
| Option | ignorecase | bool | False | If True, the regex matching is performed in a case-insensitive manner. |

```
$ qsv load ./Security.csv - sed 'Date and Time' '/' '-'
```
@@ -134,7 +134,7 @@ This function is similar to running a grep command while preserving the header row.
| Category | Parameter | Data Type | Default Value | Description |
| -------- | ---------- | --------- | ------------- | ------------------------------------------------------------------------------- |
| Argument | pattern | str | | A regular expression pattern used to filter rows. Any row with a match is kept. |
| Option | ignorecase | bool | False | If True, the regex match is case-insensitive. |

```
$ qsv load ./Security.csv - grep 'LogonType'
```
@@ -191,12 +191,12 @@ Changes the timezone of the specified date column.

The datetime format strings follow the same conventions as [Python](https://docs.python.org/3/library/datetime.html)'s datetime module (based on the C99 standard).
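Because the format codes are the same as Python's, a format string can be verified directly in the interpreter before passing it to `changetz`. A quick sketch (the timestamp value is illustrative, taken from the sample log data used elsewhere in this README):

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# Parse a timestamp with the same format string shown in the changetz example
dt = datetime.strptime("10/6/2016 01:00:55 PM", "%m/%d/%Y %I:%M:%S %p")

# Attach the source timezone, then convert (UTC -> Asia/Tokyo, +9h)
utc = dt.replace(tzinfo=ZoneInfo("UTC"))
tokyo = utc.astimezone(ZoneInfo("Asia/Tokyo"))
print(tokyo)  # 2016-10-06 22:00:55+09:00
```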

| Category | Parameter | Data Type | Default Value | Description |
| -------- | --------- | --------- | ------------- | ---------------------------------------------------------------------------------------------- |
| Argument | colname | str | | The name of the date/time column to convert. |
| Option | tz_from | str | "UTC" | The original timezone of the column's values. |
| Option | tz_to | str | "UTC" | The target timezone to convert values into. |
| Option | dt_format | str | AutoDetect | The datetime format for parsing values. If not provided, the format is automatically inferred. |

```
$ qsv load ./Security.csv - changetz 'Date and Time' --tz_from=UTC --tz_to=Asia/Tokyo --dt_format="%m/%d/%Y %I:%M:%S %p"
```
@@ -312,51 +312,143 @@ $ qsv load Security.csv - dump ./Security-qsv.csv


### Quilt
Quilt is a command-line tool that allows you to define a sequence of **Initializer**, **Chainable Functions**, and **Finalizer** processes in a YAML configuration file and execute them in a single pipeline.

#### Usage

| Category | Parameter | Data Type | Default Value | Description |
| -------- | --------- | ---------- | ------------- | ----------------------------------------------------------------------------------------------------------- |
| Argument | config    | str        |               | Path to a YAML configuration file or directory that defines initializer, chainable function, and finalizer steps. |
| Argument | path | tuple[str] | | One or more paths to CSV files to be processed according to the predefined rules in the configuration file. |

#### Command Example
```bash
$ qsv quilt rules ./Security.csv
```

#### Configuration Example
`rules/test.yaml`

```yaml
title: 'test'
description: 'test processes'
version: '0.1.0'
author: 'John Doe <john@example.com>'
stages:
test_stage: # arbitrary stage name
type: process # operation type
steps:
load:
isin:
colname: EventId
values:
- 4624
head:
number: 5
select:
colnames:
- RecordNumber
- TimeCreated
changetz:
colname: TimeCreated
tz_from: UTC
tz_to: Asia/Tokyo
dt_format: "%Y-%m-%d %H:%M:%S%.f"
showtable:
```
The above configuration file defines the following sequence of operations:
1. Load a CSV file.
2. Filter rows where the `EventId` column contains the value `4624`.
3. Retrieve the first 5 rows.
4. Extract the `RecordNumber` and `TimeCreated` columns.
5. Convert the time zone of the `TimeCreated` column from `UTC` to `Asia/Tokyo`.
6. Display the processing results in a table format.

#### Pipeline Operations
| Operation Type | Description | Parameters |
| -------------- | ---------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------- |
| process | Executes a series of operations on the dataset. | `steps`: A dict of operations (e.g., `load`, `select`, `dump`) to apply. |
| concat | Concatenates multiple datasets vertically or horizontally. | `sources`: List of stages to concat. <br>`params.how`: `vertical`, `vertical_relaxed`, `horizontal`, `diagonal`, `align`, etc. |
| join | Joins multiple datasets using keys. | `sources`: List of stages to join.<br>`params.key`: Column(s) used for joining.<br>`params.how`: `inner`, `left`, `right`, `full`, `semi`, `anti`, `cross`.<br>`params.coalesce`: bool |
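The `join` semantics can be illustrated with a toy full outer join on a single key, behaving like `params.how: full` with `params.coalesce: True` (a plain-Python sketch of the concept, not Quilt's actual implementation; the column and stage names follow the sample YAML below):

```python
def full_join(left, right, key):
    """Toy full outer join of two row lists on one key column.

    Columns missing on one side are filled with None; the key column is
    coalesced, i.e. emitted once whether the row matched on either side.
    Assumes unique key values per side, for simplicity.
    """
    lidx = {row[key]: row for row in left}
    ridx = {row[key]: row for row in right}
    columns = {c for row in left + right for c in row}
    order = list(lidx) + [k for k in ridx if k not in lidx]
    out = []
    for k in order:
        merged = {c: None for c in columns}
        merged.update(lidx.get(k, {}))
        merged.update(ridx.get(k, {}))
        out.append(merged)
    return out

stage_1 = [{"TimeCreated": "t1", "PayloadData1": "a"},
           {"TimeCreated": "t2", "PayloadData1": "b"}]
stage_2 = [{"TimeCreated": "t1", "PayloadData2": "x"},
           {"TimeCreated": "t3", "PayloadData2": "y"}]

merged = full_join(stage_1, stage_2, "TimeCreated")
# t1 gets columns from both sides; t2 and t3 are padded with None
```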

#### Sample YAML (`rules/test.yaml`):
```yaml
title: 'test'
description: 'test pipelines'
version: '0.1.0'
author: 'John Doe <john@example.com>'
stages:
load_stage:
type: process
steps:
load:
stage_1:
type: process
source: load_stage
steps:
select:
colnames:
- TimeCreated
- PayloadData1
stage_2:
type: process
source: load_stage
steps:
select:
colnames:
- TimeCreated
- PayloadData2
merge_stage:
type: join
sources:
- stage_1
- stage_2
params:
how: full
key: TimeCreated
coalesce: True
stage_3:
type: process
source: merge_stage
steps:
showtable:
```

#### Note: Step Duplication
Quilt supports YAML configurations with duplicate keys in steps.

```yaml
stages:
test_stage:
steps:
load:
renamecol: # duplicate key
from: old_col1
to: new_col1
renamecol: # duplicate key
from: old_col2
to: new_col2
renamecol: # duplicate key
from: old_col3
to: new_col3
show:
```

Internally, these keys are handled as:

```yaml
renamecol
renamecol_
renamecol__
```

This ensures that each step is treated as a distinct operation in the pipeline. Although the standard YAML specification does not permit duplicate key names, Quilt accepts them under `steps`: repeated keys are internally renamed to `renamecol`, `renamecol_`, `renamecol__`, and so on, so each entry is recognized and processed as a distinct rule.
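The renaming convention described above is easy to sketch in plain Python (an illustration only, not Quilt's actual parser):

```python
def dedupe_step_names(names):
    """Append one underscore per repeat so duplicate step keys stay distinct."""
    seen = {}
    out = []
    for name in names:
        count = seen.get(name, 0)
        out.append(name + "_" * count)
        seen[name] = count + 1
    return out

steps = ["load", "renamecol", "renamecol", "renamecol", "show"]
print(dedupe_step_names(steps))
# ['load', 'renamecol', 'renamecol_', 'renamecol__', 'show']
```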


## Installation
### from PyPI
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -1,6 +1,6 @@
[project]
name = "qsv"
version = "0.4.0"
description = "A tool that provides elastic and rapid filtering for efficient analysis of huge CSV files, such as eventlogs."
readme = "README.md"
authors = [
13 changes: 1 addition & 12 deletions src/qsv/__init__.py
@@ -1,19 +1,8 @@
import fire
from qsv.controllers.DataFrameController import DataFrameController

# entrypoint
def main():
fire.Fire(DataFrameController)

if __name__ == '__main__':
main()
