Recursive_Masterize_Hospital_Entities.log

Command=  ['C:\\apps\\spark-3.0.3-bin-hadoop2.7\\./bin/spark-submit.cmd', '--conf', 'spark.master=local[1]', '--conf', 'spark.app.name=TempSession.com', 'pyspark-shell']
JAVA_HOME=  C:\Program Files\Java\jdk-13.0.2

Successfully created \Data_Files\hospital_account_info.csv!

Standardized the input data columns, and sorted them to ensure better compression-statistics!
Data_Files\hospital_account_info.csv is now ready to be processed by the algorithm.

Formatted the Data_Files\hospital_account_info_raw.csv file into Data_Files\hospital_account_info.csv using PySpark successfully.

Finished reading the Source-file Data_Files\hospital_account_info.csv

Columns: ['COUNTRY' 'POSTAL_CODE' 'SITE_NAME' 'STATE' 'CITY' 'ADDRESS_LINE_1'
 'ADDRESS_LINE_2' 'ADDRESS_LINE_3' 'COUNTY_NAME' 'PHONE_NUM' 'SITE_TYPE'
 'SITE_OWNERSHIP' 'ACCOUNT_NUM']


Countries identified are: ['USA']

Special characters that will be replaced are:  !"\#\$%\&'\(\)\*\+,\-\./:;<=>\?@\[\\\]\^_`\{\|\}\~

There will be 1 batches since incoming dataset-size = 64 and minibatch-size = 2000


Starting Batch[0]...

Successfully created \USA_country_df.csv!

USA_0 has 64 records.

Invoking the Rscript now...
R OUTPUT:

[1] "levenshtein .dll 0.6 0.5 3 USA Data_Files\\Raw_Scores 4 Dedup NA NA"

[1] "Loading levenshtein.dll !"

[1] 4000

         used (Mb) gc trigger (Mb) max used (Mb)
Ncells 286488  8.8     643737 19.7   462378 14.2
Vcells 285633  2.2    8388608 64.0  1450623 11.1

[1] "NRows= 64 , Candidate-pairs= 2016 , Columns are "

[1] "SR_NUM"         "CONCAT_ADDRESS" "SITE_NAME"      "STATE"         

[5] "CITY"           "POSTAL_CODE"   

[1] "N_combinations= 2016 , Columns are "

[1] "id1"            "id2"            "CONCAT_ADDRESS" "SITE_NAME"     

[5] "STATE"          "CITY"           "POSTAL_CODE"    "is_match"      

[1] "Scaling up column scores if threshold crossed"

[1] "SITE_NAME  :  0.6"

[1] "STATE  :  0.6"

[1] "CITY  :  0.6"

[1] "POSTAL_CODE  :  0.6"

[1] "CONCAT_ADDRESS  :  0.5"

[1] "Data_Files\\Raw_Scores//USA_Score_Features.csv"

[1] "Successfully created //Data_Files\\Raw_Scores//USA_Score_Features.csv !"



Time difference of 0.4069979 secs




2 raw-score pairs will be deleted because their cyclic dependencies have a lower score than the existing ones.

Successfully created \Data_Files\Cleaned_Scores\USA_Cleaned_Feature_Scores.csv!

"SR_NUM_2" will be the master record


Found potential duplicates. Processing their master and cross-reference...

64 records get merged into 48

Successfully created \Data_Files\Master_Data\Recursive_Staging_Area\USA_Cross_Ref_Full_Report.csv!

Successfully created \Data_Files\Master_Data\Recursive_Staging_Area\USA_0_Master.csv!
1 csvs generated are: ['USA_0_Master.csv']

Max-depth for USA will be 1
1 csvs need to be processed: ['USA_0_Master.csv'] , length = 2


Get the unique set of all record-ids since there isn't a second file to compare.

48 records get merged into 48

Successfully created \Data_Files\Master_Data\Recursive_Staging_Area\USA_d1_0_Master.csv!

Successfully created \Data_Files\Master_Data\Recursive_Staging_Area\USA_d1_Raw_Cross_Ref.csv!


Depth[1] processed successfully.




Processed all 1 levels. Generating the master and cross-reference at the final-layer...

Successfully created \Data_Files\Master_Data\USA_Master.csv!
64 records get merged into 48

Successfully created \Data_Files\Master_Data\USA_Raw_Cross_Ref.csv!

Successfully created \Data_Files\Master_Data\USA_Cross_Ref_Full_Report.csv!



Pipeline completed execution...
SUCCESS: The process with PID 5380 (child process of PID 17916) has been terminated.
SUCCESS: The process with PID 17916 (child process of PID 18988) has been terminated.
SUCCESS: The process with PID 18988 (child process of PID 19224) has been terminated.