Recursive_Masterize_Hospital_Entities.log
Command= ['C:\\apps\\spark-3.0.3-bin-hadoop2.7\\./bin/spark-submit.cmd', '--conf', 'spark.master=local[1]', '--conf', 'spark.app.name=TempSession.com', 'pyspark-shell']
JAVA_HOME= C:\Program Files\Java\jdk-13.0.2
Successfully created \Data_Files\hospital_account_info.csv!
Standardized the input data columns, and sorted them to ensure better compression-statistics!
Data_Files\hospital_account_info.csv is now ready to be processed by the algorithm.
Formatted the Data_Files\hospital_account_info_raw.csv file into Data_Files\hospital_account_info.csv using PySpark successfully.
Finished reading the Source-file Data_Files\hospital_account_info.csv
Columns: ['COUNTRY' 'POSTAL_CODE' 'SITE_NAME' 'STATE' 'CITY' 'ADDRESS_LINE_1'
'ADDRESS_LINE_2' 'ADDRESS_LINE_3' 'COUNTY_NAME' 'PHONE_NUM' 'SITE_TYPE'
'SITE_OWNERSHIP' 'ACCOUNT_NUM']
Countries identified are: ['USA']
Special Character that will be replaced are: !"\#\$%\&'\(\)\*\+,\-\./:;<=>\?@\[\\\]\^_`\{\|\}\~
There will be 1 batches since incoming dataset-size = 64 and minibatch-size = 2000
Starting Batch[0]...
Successfully created \USA_country_df.csv!
USA_0 has 64 records.
Invoking the Rscript now...
R OUTPUT:
[1] "levenshtein .dll 0.6 0.5 3 USA Data_Files\\Raw_Scores 4 Dedup NA NA"
[1] "Loading levenshtein.dll !"
[1] 4000
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells 286488  8.8     643737 19.7   462378 14.2
Vcells 285633  2.2    8388608 64.0  1450623 11.1
[1] "NRows= 64 , Candidate-pairs= 2016 , Columns are "
[1] "SR_NUM" "CONCAT_ADDRESS" "SITE_NAME" "STATE"
[5] "CITY" "POSTAL_CODE"
[1] "N_combinations= 2016 , Columns are "
[1] "id1" "id2" "CONCAT_ADDRESS" "SITE_NAME"
[5] "STATE" "CITY" "POSTAL_CODE" "is_match"
[1] "Scaling up column scores if threshold crossed"
[1] "SITE_NAME : 0.6"
[1] "STATE : 0.6"
[1] "CITY : 0.6"
[1] "POSTAL_CODE : 0.6"
[1] "CONCAT_ADDRESS : 0.5"
[1] "Data_Files\\Raw_Scores//USA_Score_Features.csv"
[1] "Successfully created //Data_Files\\Raw_Scores//USA_Score_Features.csv !"
Time difference of 0.4069979 secs
2 raw-score pairs will be deleted off as their cyclic dependecies have lower score than existing.
Successfully created \Data_Files\Cleaned_Scores\USA_Cleaned_Feature_Scores.csv!
"SR_NUM_2" will be the master record
Found potential duplicates. Processing their master and cross-reference...
64 records get merged into 48
Successfully created \Data_Files\Master_Data\Recursive_Staging_Area\USA_Cross_Ref_Full_Report.csv!
Successfully created \Data_Files\Master_Data\Recursive_Staging_Area\USA_0_Master.csv!
1 csvs generated are: ['USA_0_Master.csv']
Max-depth for USA will be 1
1 csvs need to be processed: ['USA_0_Master.csv'] , length = 2
Get the unique set of all record-ids since there isn't a second file to compare.
48 records get merged into 48
Successfully created \Data_Files\Master_Data\Recursive_Staging_Area\USA_d1_0_Master.csv!
Successfully created \Data_Files\Master_Data\Recursive_Staging_Area\USA_d1_Raw_Cross_Ref.csv!
Depth[1] processed successfully.
Processed all 1 levels. Generating the master and cross-reference at the final-layer...
Successfully created \Data_Files\Master_Data\USA_Master.csv!
64 records get merged into 48
Successfully created \Data_Files\Master_Data\USA_Raw_Cross_Ref.csv!
Successfully created \Data_Files\Master_Data\USA_Cross_Ref_Full_Report.csv!
Pipeline completed execution...
SUCCESS: The process with PID 5380 (child process of PID 17916) has been terminated.
SUCCESS: The process with PID 17916 (child process of PID 18988) has been terminated.
SUCCESS: The process with PID 18988 (child process of PID 19224) has been terminated.