This script automates the process of handling large pipe-delimited text files. It splits the input file into smaller chunks, converts the chunks into CSV format, and generates SQL scripts for database insertion.
- Splitting Large Files:
  - Splits the input text file into smaller files with a configurable number of lines (a rough sketch of all three processing steps appears after this list).
- CSV Conversion:
  - Converts split files from pipe-delimited format to CSV with proper encoding and delimiter replacement. Note: pass the `--delimiter=,` argument if the file you are working with is comma-delimited.
- SQL Script Generation:
  - Creates SQL `INSERT` statements in 1000-row batches: the values from each CSV row are buffered and combined into a single multi-row `INSERT` statement per 1000 rows, with proper formatting.
- Customizable Options:
  - Allows you to specify sub-folder names, delimiters, encoding, and the number of lines per split when they differ from the script's default configuration.
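Taken together, the three processing steps above could look roughly like the sketch below. This is a minimal illustration only: the variable names, the `Outpatient_Visit_Data` table name (taken from the example `INSERT` later in this README), and the default pipe delimiter are assumptions, not the script's actual internals.

```bash
#!/usr/bin/env bash
# Rough sketch of the three pipeline stages -- NOT the script's actual code.
# Assumed values: pipe-delimited input, 10000 lines per split, UTF-8 output
# encoding, and a target table named Outpatient_Visit_Data.
set -euo pipefail

input_file="large_file.txt"
sub_folder="batch_2024"
lines_per_split=10000
delimiter="|"
encoding="UTF-8"

split_dir="./output/split_files"
csv_dir="./output/converted_csv/${sub_folder}"
sql_dir="./output/sql_files"
mkdir -p "$split_dir" "$csv_dir" "$sql_dir"

# 1) Splitting: break the input into numbered chunks of N lines each.
split --lines="$lines_per_split" --numeric-suffixes --suffix-length=2 \
  --additional-suffix=".txt" "$input_file" "${split_dir}/split_${input_file%.txt}_"

# 2) CSV conversion: re-encode each chunk and replace the delimiter with commas.
for chunk in "${split_dir}"/split_*.txt; do
  base="$(basename "$chunk" .txt)"
  iconv -t "$encoding" "$chunk" | sed "s/[${delimiter}]/,/g" > "${csv_dir}/${base}.csv"
done

# 3) SQL generation: buffer 1000 CSV rows into one multi-row INSERT statement.
sql_file="${sql_dir}/${input_file%.txt}.sql"
awk -F',' -v q="'" -v table="Outpatient_Visit_Data" '
  {
    gsub(q, "")                                   # drop single quotes in the data
    row = ""
    for (i = 1; i <= NF; i++) row = row (i > 1 ? "," : "") q $i q
    buffer = buffer (count ? ",\n" : "") "(" row ")"
    if (++count == 1000) {
      print "INSERT INTO [" table "] VALUES\n" buffer ";"
      buffer = ""; count = 0
    }
  }
  END { if (count) print "INSERT INTO [" table "] VALUES\n" buffer ";" }
' "${csv_dir}"/*.csv > "$sql_file"
```

Combining 1000 rows per `INSERT` keeps the generated SQL file compact and means the database executes far fewer statements during import.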
- Bash (Linux or WSL on Windows)
- Tools:
  - `split` (for file splitting)
  - `sed` (for delimiter replacement)
  - `iconv` (for encoding conversion)
  - `awk` (for SQL formatting)
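All four tools are preinstalled on most Linux distributions and in WSL. A quick check that they are available on your PATH before running the script:

```bash
# Confirm the required tools are on PATH before running the pipeline.
for tool in split sed iconv awk; do
  command -v "$tool" >/dev/null || echo "Missing required tool: $tool" >&2
done
```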
(Optional) You can create an "input" folder and place the raw text file(s) in it instead of keeping them in the project root. In the command below, use the relative path (./input/<text_file>) if you choose to do so.
```bash
./lgtxtdata_processing_pipeline.sh <text_file> <sub_folder> [options]
```
- `--lines=<number>`: Number of lines per split file (default: `10000`).
- `--delimiter=<char>`: Delimiter in the input file (default: `|`).
- `--encoding=<encoding>`: Desired file encoding (default: `UTF-8`).
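Flags of this `--name=value` form are typically handled with a `case` statement. The snippet below is only an illustrative sketch of that pattern using the defaults above, not the script's actual argument handling:

```bash
# Illustrative parsing of the optional flags (not the script's actual code).
lines=10000       # --lines default
delimiter="|"     # --delimiter default
encoding="UTF-8"  # --encoding default

for arg in "${@:3}"; do            # options follow <text_file> and <sub_folder>
  case "$arg" in
    --lines=*)     lines="${arg#--lines=}" ;;
    --delimiter=*) delimiter="${arg#--delimiter=}" ;;
    --encoding=*)  encoding="${arg#--encoding=}" ;;
    *) echo "Unknown option: $arg" >&2; exit 1 ;;
  esac
done
```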
Example:

```bash
./lgtxtdata_processing_pipeline.sh large_file.txt batch_2024
```
- Input File: `large_file.txt`
- CSV Sub-folder: `batch_2024`
Example with custom options:

```bash
./lgtxtdata_processing_pipeline.sh large_file.txt batch_2024 --lines=5000 --delimiter="|" --encoding="ISO-8859-1"
```
The script organizes outputs into the following directories under `./output`:
- Split Files: `./output/split_files/`
- CSV Files: `./output/converted_csv/<sub_folder>/`
- SQL Script: `./output/sql_files/<input_file_name>.sql`
```text
output/
├── split_files/
│   ├── split_large_file_00.txt
│   ├── split_large_file_01.txt
│   └── ...
├── converted_csv/
│   └── batch_2024/
│       ├── split_large_file_00.csv
│       ├── split_large_file_01.csv
│       └── ...
└── sql_files/
    └── large_file.sql
```
Each row in the CSV is converted into an SQL `INSERT` statement (example):
```sql
INSERT INTO [Outpatient_Visit_Data] VALUES ('00000000','78530000','NYATHI THUSILE','EYEG','ST JOHNS RETINAL CLINIC','13/06/2022 10:54:50');
```
- Validates the existence of the input file.
- Ensures all output directories are created before processing.
- Handles single quotes in data by removing them.
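A minimal version of these checks might look like this (illustrative only; the script's actual messages and ordering may differ):

```bash
# Illustrative validation steps (not the script's actual code).
input_file="$1"
sub_folder="$2"

# Validate the existence of the input file.
if [[ ! -f "$input_file" ]]; then
  echo "Error: input file '$input_file' not found." >&2
  exit 1
fi

# Ensure all output directories are created before processing.
mkdir -p ./output/split_files "./output/converted_csv/${sub_folder}" ./output/sql_files

# Single quotes can be stripped at CSV-conversion time so they do not break
# the generated INSERT statements, e.g. by extending the sed step:
#   sed "s/'//g; s/|/,/g" chunk.txt > chunk.csv
```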
The script displays:
- Start Time: When execution begins.
- End Time: When processing finishes.
- Elapsed Time: Total duration of execution.
Feel free to fork and contribute improvements to this script.