\relax
\providecommand\hyper@newdestlabel[2]{}
\providecommand\HyperFirstAtBeginDocument{\AtBeginDocument}
\HyperFirstAtBeginDocument{\ifx\hyper@anchor\@undefined
\global\let\oldcontentsline\contentsline
\gdef\contentsline#1#2#3#4{\oldcontentsline{#1}{#2}{#3}}
\global\let\oldnewlabel\newlabel
\gdef\newlabel#1#2{\newlabelxx{#1}#2}
\gdef\newlabelxx#1#2#3#4#5#6{\oldnewlabel{#1}{{#2}{#3}}}
\AtEndDocument{\ifx\hyper@anchor\@undefined
\let\contentsline\oldcontentsline
\let\newlabel\oldnewlabel
\fi}
\fi}
\global\let\hyper@last\relax
\gdef\HyperFirstAtBeginDocument#1{#1}
\providecommand\HyField@AuxAddToFields[1]{}
\providecommand\HyField@AuxAddToCoFields[2]{}
\providecommand \oddpage@label [2]{}
\citation{pmid31562252}
\citation{pmid31562252}
\gdef \LT@i {\LT@entry
{1}{16.4137pt}\LT@entry
{1}{61.77504pt}\LT@entry
{1}{420.72505pt}}
\@writefile{lot}{\contentsline {table}{\numberline {I}{\color {black} \sffamily \fontsize {9}{10}\selectfont Disease Categories With Detailed Set of ICD9 Codes Used }}{3}{table.1}}
\newlabel{tab0}{{I}{3}{\color {black} \sffamily \fontsize {9}{10}\selectfont Disease Categories With Detailed Set of ICD9 Codes Used}{table.1}{}}
\citation{jarquin2011racial}
\citation{jarquin2011racial}
\citation{ltgranger80}
\citation{CL12g}
\citation{GEMS}
\@writefile{lof}{\contentsline {figure}{\numberline {1}{\ignorespaces \color {black} \sffamily \fontsize {9}{10}\selectfont \textbf {ASD Occurrence Patterns} Panel A illustrates the spatial distribution of ASD insurance claims, and panel B shows the same data after population normalization, illustrating the relatively small demographic skew to ASD prevalence within the general population with access to medical insurance, which is consistent with the suggestion that prevalence variation might be linked to regional and socioeconomic disparities in access to services\nobreakspace {}\cite {jarquin2011racial}. }}{9}{figure.1}}
\newlabel{figocc}{{1}{9}{\color {black} \sffamily \fontsize {9}{10}\selectfont \textbf {ASD Occurrence Patterns} Panel A illustrates the spatial distribution of ASD insurance claims, and panel B shows the same data after population normalization, illustrating the relatively small demographic skew to ASD prevalence within the general population with access to medical insurance, which is consistent with the suggestion that prevalence variation might be linked to regional and socioeconomic disparities in access to services~\cite {jarquin2011racial}}{figure.1}{}}
\@writefile{toc}{\contentsline {section}{\numberline {I}Detailed Mathematical Approach}{9}{section.1}}
\newlabel{sec:mathdetails}{{I}{9}{Detailed Mathematical Approach}{section.1}{}}
\@writefile{toc}{\contentsline {subsection}{\numberline {\unhbox \voidb@x \hbox {I-A}}Time-series Modeling of Diagnostic History}{9}{subsection.1.1}}
\@writefile{toc}{\contentsline {subsection}{\numberline {\unhbox \voidb@x \hbox {I-B}}Step 1: Partitioning The Human Disease Spectrum}{9}{subsection.1.2}}
\@writefile{lof}{\contentsline {figure}{\numberline {2}{\ignorespaces \color {black} \sffamily \fontsize {9}{10}\selectfont \color {black}\textbf {Effect of Race and Ethnicity on Predictive Performance with 95\% Confidence Bounds from the UCM dataset}. Panels A and B show the variation of AUC achieved in out-of-sample data in three race-based population groups (White, African-American and multi-racial). We find no significant differences. Panels C and D show the performance in Hispanic vs non-Hispanic sub-populations. We find that children with Hispanic background have a lower AUC, but the differences are not significant. Other races/ethnicities were not considered due to lack of sufficient data. }}{10}{figure.2}}
\newlabel{figrace}{{2}{10}{\color {black} \sffamily \fontsize {9}{10}\selectfont \HCOL \textbf {Effect of Race and Ethnicity on Predictive Performance with 95\% Confidence Bounds from the UCM dataset}. Panels A and B show the variation of AUC achieved in out-of-sample data in three race-based population groups (White, African-American and multi-racial). We find no significant differences. Panels C and D show the performance in Hispanic vs non-Hispanic sub-populations. We find that children with Hispanic background have a lower AUC, but the differences are not significant. Other races/ethnicities were not considered due to lack of sufficient data}{figure.2}{}}
\newlabel{eq1}{{1}{10}{Step 1: Partitioning The Human Disease Spectrum}{equation.1.1}{}}
\@writefile{toc}{\contentsline {subsection}{\numberline {\unhbox \voidb@x \hbox {I-C}}Step 2: Model Inference \& The Sequence Likelihood Defect}{10}{subsection.1.3}}
\citation{Cover,kullback1951}
\citation{doob1953stochastic}
\@writefile{lof}{\contentsline {figure}{\numberline {3}{\ignorespaces \color {black} \sffamily \fontsize {9}{10}\selectfont Probabilistic Finite State Automata models generated for different disease categories for the control and positive\xspace cohorts. We note that in the first case (digestive disorders), the models get more complex in the positive\xspace cohort, suggesting that the disorders become less random. However, in the categories of otic and integumentary disorders, the models become less complex, suggesting increased independence from past events of similar nature. In the case of infectious diseases, the model gets more complex for males, and less complex for females, suggesting distinct sex-specific responses associated with high ASD risk. }}{11}{figure.3}}
\newlabel{EXT-autgrid}{{3}{11}{\color {black} \sffamily \fontsize {9}{10}\selectfont Probabilistic Finite State Automata models generated for different disease categories for the control and \treatment cohorts. We note that in the first case (digestive disorders), the models get more complex in the \treatment cohort, suggesting that the disorders become less random. However, in the categories of otic and integumentary disorders, the models become less complex, suggesting increased independence from past events of similar nature. In the case of infectious diseases, the model gets more complex for males, and less complex for females, suggesting distinct sex-specific responses associated with high ASD risk}{figure.3}{}}
\citation{Cover}
\citation{friedman}
\citation{breiman}
\citation{friedman}
\citation{hochreiter}
\newlabel{eqR}{{4}{12}{Step 2: Model Inference \& The Sequence Likelihood Defect}{equation.1.4}{}}
\newlabel{eq6}{{5}{12}{Step 2: Model Inference \& The Sequence Likelihood Defect}{equation.1.5}{}}
\@writefile{toc}{\contentsline {subsection}{\numberline {\unhbox \voidb@x \hbox {I-D}}Step 3: Risk Estimation Pipeline With Semi-supervised \& Supervised Learning Modules}{12}{subsection.1.4}}
\@writefile{toc}{\contentsline {section}{\numberline {II}Comparison With State of the Art Off-the-shelf ML Algorithms}{12}{section.2}}
\newlabel{sec:offtheshelf}{{II}{12}{Comparison With State of the Art Off-the-shelf ML Algorithms}{section.2}{}}
\@writefile{toc}{\contentsline {section}{\numberline {III}Pipeline Variations, Feature Subsets and Neural Network (NN) Post-processing}{12}{section.3}}
\newlabel{sec:pipelinevar}{{III}{12}{Pipeline Variations, Feature Subsets and Neural Network (NN) Post-processing}{section.3}{}}
\@writefile{lof}{\contentsline {figure}{\numberline {4}{\ignorespaces \color {black} \sffamily \fontsize {9}{10}\selectfont \color {black}The simplest LSTM investigated as a baseline. More complex models have significantly larger number of trainable parameters (1 to 10 Million). In contrast our pipeline has 13,744 trainable parameters. }}{13}{figure.4}}
\newlabel{figlstmex}{{4}{13}{\color {black} \sffamily \fontsize {9}{10}\selectfont \HCOL The simplest LSTM investigated as a baseline. More complex models have significantly larger number of trainable parameters (1 to 10 Million). In contrast our pipeline has 13,744 trainable parameters}{figure.4}{}}
\@writefile{lof}{\contentsline {figure}{\numberline {5}{\ignorespaces \color {black} \sffamily \fontsize {9}{10}\selectfont \textbf {Panel A.} Performance of standard tools in correctly predicting eventual ASD diagnosis, computed at $150$ weeks of age. Long Short-Term Memory (LSTM) networks are the state-of-the-art variant of recurrent neural nets, and Random Forests and Gradient Boosting classifiers (CatBoost) are generally regarded as representative state-of-the-art classification algorithms. Sequence Likelihood Defect (SLD) is the approach developed in this study. LSTMB denotes LSTM with identical pre-processing as in our pipeline (instead of using raw diagnostic codes). \textbf {Panel B} illustrates that we get good performance with LSTMB for males in the Truven dataset, but the performance is sensitive to the size of the training set, and degrades for the smaller samples available for females and in the UCM database }}{14}{figure.5}}
\newlabel{EXT-figcompwsoa}{{5}{14}{\color {black} \sffamily \fontsize {9}{10}\selectfont \textbf {Panel A.} Performance of standard tools in correctly predicting eventual ASD diagnosis, computed at $150$ weeks of age. Long Short-Term Memory (LSTM) networks are the state-of-the-art variant of recurrent neural nets, and Random Forests and Gradient Boosting classifiers (CatBoost) are generally regarded as representative state-of-the-art classification algorithms. Sequence Likelihood Defect (SLD) is the approach developed in this study. LSTMB denotes LSTM with identical pre-processing as in our pipeline (instead of using raw diagnostic codes). \textbf {Panel B} illustrates that we get good performance with LSTMB for males in the Truven dataset, but the performance is sensitive to the size of the training set, and degrades for the smaller samples available for females and in the UCM database}{figure.5}{}}
\@writefile{toc}{\contentsline {subsection}{\numberline {\unhbox \voidb@x \hbox {III-A}}Feature Subset Evaluations \& Code Density As A Feature}{14}{subsection.3.1}}
\newlabel{subsec:features}{{\unhbox \voidb@x \hbox {III-A}}{14}{Feature Subset Evaluations \& Code Density As A Feature}{subsection.3.1}{}}
\@writefile{toc}{\contentsline {section}{\numberline {IV}Threshold Selection on ROC Curve}{14}{section.4}}
\newlabel{sec:F1}{{IV}{14}{Threshold Selection on ROC Curve}{section.4}{}}
\@writefile{lof}{\contentsline {figure}{\numberline {6}{\ignorespaces \color {black} \sffamily \fontsize {9}{10}\selectfont \textbf {Evaluations of Feature Subsets, Class Imbalance, Code Density, Coding Uncertainty, \& Disambiguation from Other Psychiatric Phenotypes.} Panel A illustrates the pipeline performance when the control group is restricted to children who have at least one psychiatric phenotype other than ASD. It is clear that we have very good discrimination between ASD and non-ASD phenotypes. Panel B illustrates the situation where we restrict the treatment cohort to children who have at least $2$ AD diagnostic codes, to see whether the pipeline performance is markedly different in populations where the coding errors/uncertainty is smaller. We see that such restrictions have no appreciable effect on pipeline performance. Panel C illustrates the AUC distributions obtained by using sampled control cohorts that are of the same size as the treatment cohort, to evaluate the effect of class imbalance. Again we see that such restrictions do not appreciably change performance. Panel D explores the performance changes when we use a restricted set of features, or simply use code density as the sole feature. We conclude that the combined feature set used in our optimized pipeline is superior to using the subsets individually. Code density is the least performant feature, and is not stable across databases. }}{15}{figure.6}}
\newlabel{EXT-figcompsi}{{6}{15}{\color {black} \sffamily \fontsize {9}{10}\selectfont \textbf {Evaluations of Feature Subsets, Class Imbalance, Code Density, Coding Uncertainty, \& Disambiguation from Other Psychiatric Phenotypes.} Panel A illustrates the pipeline performance when the control group is restricted to children who have at least one psychiatric phenotype other than ASD. It is clear that we have very good discrimination between ASD and non-ASD phenotypes. Panel B illustrates the situation where we restrict the treatment cohort to children who have at least $2$ AD diagnostic codes, to see whether the pipeline performance is markedly different in populations where the coding errors/uncertainty is smaller. We see that such restrictions have no appreciable effect on pipeline performance. Panel C illustrates the AUC distributions obtained by using sampled control cohorts that are of the same size as the treatment cohort, to evaluate the effect of class imbalance. Again we see that such restrictions do not appreciably change performance. Panel D explores the performance changes when we use a restricted set of features, or simply use code density as the sole feature. We conclude that the combined feature set used in our optimized pipeline is superior to using the subsets individually. Code density is the least performant feature, and is not stable across databases}{figure.6}{}}
\@writefile{toc}{\contentsline {section}{\numberline {V}Note on Receiver Operating Characteristics (ROC) and Precision-recall Curves}{15}{section.5}}
\newlabel{sec:ROC}{{V}{15}{Note on Receiver Operating Characteristics (ROC) and Precision-recall Curves}{section.5}{}}
\@writefile{lof}{\contentsline {figure}{\numberline {7}{\ignorespaces \color {black} \sffamily \fontsize {9}{10}\selectfont \textbf {Details of Co-morbidity Patterns (at age $<3$ years)} for immunologic (panel A), respiratory (panel B), infections (panel C), and disorders with similar pathobiology manifesting opposing association with autism (panel D). }}{16}{figure.7}}
\newlabel{EXT-fig4}{{7}{16}{\color {black} \sffamily \fontsize {9}{10}\selectfont \textbf {Details of Co-morbidity Patterns (at age $<3$ years)} for immunologic (panel A), respiratory (panel B), infections (panel C), and disorders with similar pathobiology manifesting opposing association with autism (panel D)}{figure.7}{}}
\@writefile{lof}{\contentsline {figure}{\numberline {8}{\ignorespaces \color {black} \sffamily \fontsize {9}{10}\selectfont \textbf {Co-morbidity Patterns} for mental disorders, vaccinations and health-service encounters. }}{17}{figure.8}}
\newlabel{EXT-figvv1}{{8}{17}{\color {black} \sffamily \fontsize {9}{10}\selectfont \textbf {Co-morbidity Patterns} for mental disorders, vaccinations and health-service encounters}{figure.8}{}}
\citation{pmid31562252}
\@writefile{lof}{\contentsline {figure}{\numberline {9}{\ignorespaces \color {black} \sffamily \fontsize {9}{10}\selectfont Predictive Performance without psychiatric codes (ICD9 290 - 319) and codes for health status and services (ICD9 V0-V91) included. As shown, the performance is comparable at 150 weeks, with the AUC for females marginally lower (compare with Fig.\nobreakspace {}\ref {main-fig1} in the main text). The feature importances also are similar, with infectious diseases inferred to have the most importance (or weight) in the pipeline, which is also the case once we add psychiatric phenotypes, and codes for health services in our analysis. As shown in SI-Fig.\nobreakspace {}\ref {EXT-figvv1}A, the psychiatric codes all increase risk, and the vaccination codes (See SI-Fig.\nobreakspace {}\ref {EXT-figvv1}B) all decrease risk when those codes are included. This is why an alternate analysis was carried out to make sure that we are not picking up on psychiatric codes alone. Note in particular that the sensitivity/specificity point highlighted in panel A above is identical after adding the codes. This suggests that our predictive performance arises from patterns learned from co-morbidities, which are not just neuropsychiatric in nature. }}{18}{figure.9}}
\newlabel{EXT-fig1nop}{{9}{18}{\color {black} \sffamily \fontsize {9}{10}\selectfont Predictive Performance without psychiatric codes (ICD9 290 - 319) and codes for health status and services (ICD9 V0-V91) included. As shown, the performance is comparable at 150 weeks, with the AUC for females marginally lower (compare with Fig.~\ref {main-fig1} in the main text). The feature importances also are similar, with infectious diseases inferred to have the most importance (or weight) in the pipeline, which is also the case once we add psychiatric phenotypes, and codes for health services in our analysis. As shown in SI-Fig.~\ref {EXT-figvv1}A, the psychiatric codes all increase risk, and the vaccination codes (See SI-Fig.~\ref {EXT-figvv1}B) all decrease risk when those codes are included. This is why an alternate analysis was carried out to make sure that we are not picking up on psychiatric codes alone. Note in particular that the sensitivity/specificity point highlighted in panel A above is identical after adding the codes. This suggests that our predictive performance arises from patterns learned from co-morbidities, which are not just neuropsychiatric in nature}{figure.9}{}}
\newlabel{eq9}{{14}{18}{Note on Receiver Operating Characteristics (ROC) and Precision-recall Curves}{equation.5.14}{}}
\newlabel{eqPPV}{{15}{18}{Note on Receiver Operating Characteristics (ROC) and Precision-recall Curves}{equation.5.15}{}}
\citation{gordon2016whittling,penner2018practice,hyman2020identification}
\citation{johnson2007identification,zwaigenbaum2015early}
\citation{robins2014validation,hyman2020identification}
\citation{hyman2020identification}
\citation{penner2018practice}
\citation{hyman2020identification}
\citation{esler2015autism}
\citation{chlebowski2010using}
\citation{hyman2020identification}
\@writefile{toc}{\contentsline {section}{\numberline {VI}Effect of Class Imbalance}{19}{section.6}}
\newlabel{subsec:classimbalance}{{VI}{19}{Effect of Class Imbalance}{section.6}{}}
\@writefile{toc}{\contentsline {section}{\numberline {VII}Note on ASD Clinical Diagnosis \& Uncertainty of EHR Record}{19}{section.7}}
\newlabel{sec:diag}{{VII}{19}{Note on ASD Clinical Diagnosis \& Uncertainty of EHR Record}{section.7}{}}
\@writefile{toc}{\contentsline {subsection}{\numberline {\unhbox \voidb@x \hbox {VII-A}}Diagnostic Evaluations}{19}{subsection.7.1}}
\newlabel{subsec:diageval}{{\unhbox \voidb@x \hbox {VII-A}}{19}{Diagnostic Evaluations}{subsection.7.1}{}}
\citation{falkmer2013diagnostic}
\citation{falkmer2013diagnostic}
\citation{falkmer2013diagnostic}
\citation{hyman2020identification}
\citation{robins2014validation,hyman2020identification}
\citation{pmid31562252}
\citation{hyman2020identification}
\citation{hyman2020identification}
\citation{pmid31562252}
\@writefile{toc}{\contentsline {subsection}{\numberline {\unhbox \voidb@x \hbox {VII-B}}Change In Diagnostic Criteria for ASD, Inclusion of PDD, Asperger, and Disambiguation From Unrelated Psychiatric Phenotypes}{20}{subsection.7.2}}
\newlabel{subsec:otherpsych}{{\unhbox \voidb@x \hbox {VII-B}}{20}{Change In Diagnostic Criteria for ASD, Inclusion of PDD, Asperger, and Disambiguation From Unrelated Psychiatric Phenotypes}{subsection.7.2}{}}
\@writefile{toc}{\contentsline {subsection}{\numberline {\unhbox \voidb@x \hbox {VII-C}}Performance Comparison With M-CHAT/F}{20}{subsection.7.3}}
\citation{lord2006autism,kleinman2008diagnostic}
\citation{bolton2012autism,kozlowski2011parents}
\citation{baio2014prevalence}
\citation{kalb2012determinants}
\citation{bisgaier2011access}
\citation{fenikile2015barriers}
\citation{fenikile2015barriers}
\citation{gordon2016whittling}
\citation{gordon2016whittling,althouse2006pediatric}
\citation{pmid31562252}
\citation{pmid31562252}
\@writefile{lof}{\contentsline {figure}{\numberline {10}{\ignorespaces \color {black} \sffamily \fontsize {9}{10}\selectfont \textbf {4D Search To Take Advantage of Data on Population Stratification (Using Prevalence of 2.23\% as reported by CHOP\nobreakspace {}\cite {pmid31562252}).} While as a standalone tool our approach is comparable to M-CHAT/F at around the 26 month mark (and later), we can take advantage of the independence of the tests to devise a conditional choice of the operating parameters for the new approach. In particular, taking advantage of published estimated prevalence rates of different categories of M-CHAT/F scores, and true positives in each sub-population upon stratification, we can choose a different set of specificity and sensitivity in each sub-population to yield significantly improved overall performance across databases, and much earlier. Additionally, we can choose to operate at a high recall point, where we maximize overall sensitivity, or a high precision point, where we maximize the positive predictive value. }}{21}{figure.10}}
\newlabel{EXT-fig4D}{{10}{21}{\color {black} \sffamily \fontsize {9}{10}\selectfont \textbf {4D Search To Take Advantage of Data on Population Stratification (Using Prevalence of 2.23\% as reported by CHOP~\cite {pmid31562252}).} While as a standalone tool our approach is comparable to M-CHAT/F at around the 26 month mark (and later), we can take advantage of the independence of the tests to devise a conditional choice of the operating parameters for the new approach. In particular, taking advantage of published estimated prevalence rates of different categories of M-CHAT/F scores, and true positives in each sub-population upon stratification, we can choose a different set of specificity and sensitivity in each sub-population to yield significantly improved overall performance across databases, and much earlier. Additionally, we can choose to operate at a high recall point, where we maximize overall sensitivity, or a high precision point, where we maximize the positive predictive value}{figure.10}{}}
\@writefile{toc}{\contentsline {section}{\numberline {VIII}Improving Wait-times For Diagnostic Evaluations by Reducing False Positives}{21}{section.8}}
\newlabel{sec:waittime}{{VIII}{21}{Improving Wait-times For Diagnostic Evaluations by Reducing False Positives}{section.8}{}}
\@writefile{toc}{\contentsline {subsection}{\numberline {\unhbox \voidb@x \hbox {VIII-A}}4D Decision Optimization Using M-CHAT/F Population Stratification To Boost PPV}{21}{subsection.8.1}}
\newlabel{subsec:4D}{{\unhbox \voidb@x \hbox {VIII-A}}{21}{4D Decision Optimization Using M-CHAT/F Population Stratification To Boost PPV}{subsection.8.1}{}}
\citation{pmid31562252}
\citation{pmid31562252}
\citation{pmid31562252}
\newlabel{eqscpop}{{22}{22}{4D Decision Optimization Using M-CHAT/F Population Stratification To Boost PPV}{equation.8.22}{}}
\@writefile{lot}{\contentsline {table}{\numberline {II}{\ignorespaces \color {black} \sffamily \fontsize {9}{10}\selectfont Boosted Sensitivity, Specificity and PPV Achieved at \textbf {150 weeks} Conditioned on M-CHAT/F Scores }}{22}{table.2}}
\newlabel{EXT-tabboost150}{{II}{22}{\color {black} \sffamily \fontsize {9}{10}\selectfont Boosted Sensitivity, Specificity and PPV Achieved at \textbf {150 weeks} Conditioned on M-CHAT/F Scores}{table.2}{}}
\@writefile{lot}{\contentsline {table}{\numberline {III}{\ignorespaces \color {black} \sffamily \fontsize {9}{10}\selectfont Population Stratification Results on large M-CHAT/F Study (n=20,375)\nobreakspace {}\cite {pmid31562252} }}{22}{table.3}}
\newlabel{EXT-tabCHOP}{{III}{22}{\color {black} \sffamily \fontsize {9}{10}\selectfont Population Stratification Results on large M-CHAT/F Study (n=20,375)~\cite {pmid31562252}}{table.3}{}}
\@writefile{lot}{\contentsline {table}{\numberline {IV}{\ignorespaces \color {black} \sffamily \fontsize {9}{10}\selectfont $\gamma ,\gamma '$ Computed from Population Stratification Recorded In M-CHAT/F Study\nobreakspace {}\cite {pmid31562252} ($\rho =0.0223$) }}{22}{table.4}}
\newlabel{EXT-tabCHOP2}{{IV}{22}{\color {black} \sffamily \fontsize {9}{10}\selectfont $\gamma ,\gamma '$ Computed from Population Stratification Recorded In M-CHAT/F Study~\cite {pmid31562252} ($\rho =0.0223$)}{table.4}{}}
\citation{hyman2020identification}
\citation{pmid31562252}
\citation{CL12g}
\citation{hopcroft2008introduction}
\citation{klenke2013probability}
\citation{doob1990stochastic}
\citation{klenke2013probability}
\@writefile{toc}{\contentsline {section}{\numberline {IX}Generating PFSA Models From Set of Input Streams with Variable Input Lengths}{23}{section.9}}
\newlabel{sec:varl}{{IX}{23}{Generating PFSA Models From Set of Input Streams with Variable Input Lengths}{section.9}{}}
\@writefile{toc}{\contentsline {section}{\numberline {X}Probabilistic Finite State Automata Inference}{23}{section.10}}
\newlabel{sec:PFSA}{{X}{23}{Probabilistic Finite State Automata Inference}{section.10}{}}
\@writefile{toc}{\contentsline {subsection}{\numberline {\unhbox \voidb@x \hbox {X-A}}Probabilistic Finite-State Automaton}{23}{subsection.10.1}}
\newlabel{subsec:DEFN_PFSA}{{\unhbox \voidb@x \hbox {X-A}}{23}{Probabilistic Finite-State Automaton}{subsection.10.1}{}}
\newlabel{defn:StochasticProcessOverSigma}{{2}{23}{Stochastic Process over $\Sigma $}{defn.2}{}}
\newlabel{defn:MeasureAndDeriv}{{3}{23}{Sequence-Induced Measure and Derivative}{defn.3}{}}
\citation{chattopadhyay2008structural}
\citation{Chattopadhyay20140826}
\citation{bondy2008graph}
\citation{vidyasagar2014hidden,kai1967markov_stdis}
\newlabel{defn:NerodeEquiv}{{4}{24}{Probabilistic Nerode Equivalence and Causal States \cite {chattopadhyay2008structural}}{defn.4}{}}
\newlabel{defn:PFSA}{{5}{24}{Probabilistic Finite-State Automaton (PFSA)}{defn.5}{}}
\newlabel{defn:StrongConn}{{8}{24}{Strongly Connected PFSA}{defn.8}{}}
\newlabel{defn:GammaExpr}{{9}{24}{$\Gamma $-Expression}{defn.9}{}}
\newlabel{defn:InducedDistr}{{10}{24}{Sequence-Induced Distribution on States}{defn.10}{}}
\newlabel{defn:StochasticProcessOfPFSA}{{11}{24}{Stochastic Process Generated by a PFSA}{defn.11}{}}
\citation{trahtman2008road}
\citation{cover2012elements}
\citation{matthews2016sparse}
\newlabel{def:JointSyncSeq}{{15}{25}{Joint $\varepsilon $-Synchronizing Sequence}{defn.15}{}}
\@writefile{toc}{\contentsline {section}{\numberline {XI}Sequence Likelihood Defect}{25}{section.11}}
\newlabel{sec:SLD}{{XI}{25}{Sequence Likelihood Defect}{section.11}{}}
\newlabel{thm:Closed-formFormulaForEntropyRate}{{1}{25}{Closed-form Formula for Entropy Rate and KL Divergence}{thm.1}{}}
\citation{cover2012elements}
\citation{hardy1992divergent}
\newlabel{alg:GenL}{{1}{26}{Probabilistic Finite-State Automaton}{AlgoLine.1.1}{}}
\newlabel{alg:GenConv}{{3}{26}{Probabilistic Finite-State Automaton}{AlgoLine.1.3}{}}
\newlabel{alg:GenSyncSeq}{{4}{26}{Probabilistic Finite-State Automaton}{AlgoLine.1.4}{}}
\newlabel{alg:GenStep2Start}{{5}{26}{Probabilistic Finite-State Automaton}{AlgoLine.1.5}{}}
\newlabel{alg:GenIdenStateStart}{{12}{26}{Probabilistic Finite-State Automaton}{AlgoLine.1.12}{}}
\newlabel{alg:GenIdenStateEnd}{{17}{26}{Probabilistic Finite-State Automaton}{AlgoLine.1.17}{}}
\newlabel{alg:GenStep2End}{{19}{26}{Probabilistic Finite-State Automaton}{AlgoLine.1.19}{}}
\newlabel{alg:GenIdenTransProbStart}{{20}{26}{Probabilistic Finite-State Automaton}{AlgoLine.1.20}{}}
\newlabel{alg:GenIdenTransProbEnd}{{25}{26}{Probabilistic Finite-State Automaton}{AlgoLine.1.25}{}}
\@writefile{loa}{\contentsline {algocf}{\numberline {1}{\ignorespaces \textrm {\bf \texttt {GenESeSS}}\xspace }}{26}{algocf.1}}
\newlabel{alg:GenESeSS}{{1}{26}{Probabilistic Finite-State Automaton}{algocf.1}{}}
\newlabel{thm:convergenceOfLLH}{{2}{26}{Convergence of log-likelihood}{thm.2}{}}
\citation{CL12g}
\citation{CL12g}
\citation{CL12g}
\@writefile{loa}{\contentsline {algocf}{\numberline {2}{\ignorespaces Log-likelihood}}{27}{algocf.2}}
\newlabel{alg:LLK}{{2}{27}{Sequence Likelihood Defect}{algocf.2}{}}
\@writefile{toc}{\contentsline {section}{\numberline {XII}Pipeline Optimization}{27}{section.12}}
\newlabel{sec:pipeline}{{XII}{27}{Pipeline Optimization}{section.12}{}}
\@writefile{toc}{\contentsline {subsection}{\numberline {\unhbox \voidb@x \hbox {XII-A}}Input Data Format}{27}{subsection.12.1}}
\@writefile{toc}{\contentsline {subsection}{\numberline {\unhbox \voidb@x \hbox {XII-B}}Algorithms}{27}{subsection.12.2}}
\@writefile{toc}{\contentsline {section}{\numberline {XIII}Example Run with Released Application}{27}{section.13}}
\newlabel{sec:app}{{XIII}{27}{Example Run with Released Application}{section.13}{}}
\@writefile{toc}{\contentsline {subsection}{\numberline {\unhbox \voidb@x \hbox {XIII-A}}Prerequisites \& Installation}{27}{subsection.13.1}}
\@writefile{loa}{\contentsline {algocf}{\numberline {3}{\ignorespaces ICD-9 Encoding}}{28}{algocf.3}}
\newlabel{algo1}{{3}{28}{Input Data Format}{algocf.3}{}}
\@writefile{toc}{\contentsline {subsection}{\numberline {\unhbox \voidb@x \hbox {XIII-B}}EHR data format}{28}{subsection.13.2}}
\@writefile{toc}{\contentsline {subsection}{\numberline {\unhbox \voidb@x \hbox {XIII-C}}Sample Python code for risk estimation}{28}{subsection.13.3}}
\@writefile{toc}{\contentsline {subsection}{\numberline {\unhbox \voidb@x \hbox {XIII-D}}Sample Python script for risk estimation}{28}{subsection.13.4}}
\@writefile{loa}{\contentsline {algocf}{\numberline {4}{\ignorespaces Prediction Pipeline Training}}{29}{algocf.4}}
\newlabel{algo2}{{4}{29}{Input Data Format}{algocf.4}{}}
\@writefile{lof}{\contentsline {figure}{\numberline {11}{\ignorespaces \color {black} \sffamily \fontsize {9}{10}\selectfont Pipeline schema: How the data set is split into test sets and two training sets: one for inferring HMM models, and one for training the boosting classifier. The two key algorithms here are \texttt {genESeSS}\nobreakspace {}\cite {CL12g} and the llk which does the sequence likelihood computation described in Section\nobreakspace {}\ref {sec:SLD} }}{30}{figure.11}}
\newlabel{figschema}{{11}{30}{\color {black} \sffamily \fontsize {9}{10}\selectfont Pipeline schema: How the data set is split into test sets and two training sets: one for inferring HMM models, and one for training the boosting classifier. The two key algorithms here are \texttt {genESeSS}~\cite {CL12g} and the llk which does the sequence likelihood computation described in Section~\ref {sec:SLD}}{figure.11}{}}
\@writefile{lof}{\contentsline {figure}{\numberline {12}{\ignorespaces \color {black} \sffamily \fontsize {9}{10}\selectfont Screen capture of the page on pypi.org hosting the released application Link: \href {https://pypi.org/project/ehrzero/}{https://pypi.org/project/ehrzero/} }}{31}{figure.12}}
\@writefile{lof}{\contentsline {figure}{\numberline {13}{\ignorespaces \color {black} \sffamily \fontsize {9}{10}\selectfont Python code prediction example }}{31}{figure.13}}
\@writefile{lof}{\contentsline {figure}{\numberline {14}{\ignorespaces \color {black} \sffamily \fontsize {9}{10}\selectfont Python script prediction example }}{31}{figure.14}}