\documentclass[onecolumn,10pt]{IEEEtran}
\input{preamble_science.tex}
\usepackage{multirow}
\usepackage[size=normal]{subcaption}
\externaldocument[SI-]{SI}
\cfoot{\scriptsize\thepage}
\lfoot{}
\rfoot{}
\begin{document}
\input{custom_defs}
\maketitle
% \begin{abstract}
% Autism spectrum disorder (ASD) is a developmental disability associated with significant social and behavioral challenges. There is a need for tools that help identify children with ASD as early as possible~\cite{cdc0,nimh}. Our current incomplete understanding of ASD pathogenesis, and the lack of reliable biomarkers hampers early detection, intervention, and developmental trajectories. In this study we develop and validate machine inferred digital biomarkers for autism using individual diagnostic codes already recorded during medical encounters. Our risk estimator identifies children at high risk with a corresponding area under the receiver operating characteristic curve (AUC) exceeding 80\% from shortly after two years of age for either sex, and across two independent databases of patient records. Thus, we systematically leverage ASD co-morbidities - with no requirement of additional blood work, tests or procedures - to compute the Autism Co-morbid Risk Score (\acor) which predicts elevated risk during the earliest childhood years, when interventions are the most effective. By itself, \acor has superior performance to common questionnaires-based screenings such as the M-CHAT/F~\cite{pmid31562252}, and has the potential to reduce socio-economic, ethnic and demographic biases. In addition to standalone performance, independence from questionnaire based screening allows us to further boost performance by conditioning on the individual M-CHAT/F scores -- we can either halve the false positive rate of current screening protocols or boost sensitivity to over 60\%, while maintaining specificity above 95\%. Adopted in practice, \acor could significantly reduce the median diagnostic age for ASD, and reduce long post-screen wait-times~\cite{pmid27565363} experienced by families for confirmatory diagnoses and access to evidence based interventions.
% \end{abstract}
% 1 abstract reduction
\begin{abstract}
In this study we develop digital biomarkers for autism spectrum disorder (ASD), computed from patterns of past medical encounters, identifying children at high risk with an area under the receiver operating characteristic curve (AUC) exceeding 80\% from shortly after two years of age, for either sex, and across two independent patient databases. We leverage uncharted ASD co-morbidities -- with no requirement of additional blood work or procedures -- to estimate the Autism Co-morbid Risk Score (\acor) during the earliest childhood years, when interventions are the most effective. \acor has superior predictive performance to common questionnaire-based screenings, and can reduce their current socio-economic, ethnic and demographic biases. Additionally, we can boost performance by conditioning on current screening scores -- either halving the state-of-the-art false positive rate or boosting sensitivity to over 60\%, while maintaining specificity above 95\%. Thus, \acor can significantly reduce the median diagnostic age, eliminating the long wait-times~\cite{pmid27565363} experienced by families for confirmatory diagnoses and access to evidence-based interventions.
\end{abstract}
\section*{Introduction}
%
Autism spectrum disorder is a developmental disability associated with significant social and behavioral challenges.
Even though ASD may be diagnosed as early as the age of two~\cite{cdc}, children frequently remain undiagnosed until after their fourth birthday~\cite{pmid24529515}. With genetic and metabolomic tests~\cite{howsmon2017classification,li2018high,hicks2018validation,smith2020metabolomics} still in their infancy, a careful review of behavioral history and direct observation of symptoms are currently necessary~\cite{volkmar2014practice,hyman2020identification} for a clinical diagnosis. Starting with a positive initial screen, a confirmed diagnosis of ASD is a multi-step process that often takes 3 months to 1 year, delaying entry into time-critical intervention programs. While lengthy evaluations~\cite{kalb2012determinants}, cost of care~\cite{bisgaier2011access}, lack of providers~\cite{fenikile2015barriers}, and lack of comfort in diagnosing ASD by primary care providers~\cite{fenikile2015barriers} are all responsible to varying degrees~\cite{gordon2016whittling}, one obvious source of this delay is the number of false positives produced by the initial ASD-specific screening tools in use today. For example, a large-scale study of the M-CHAT/F~\cite{pmid31562252} (n=20,375), which is commonly used as a screening tool~\cite{robins2014validation,hyman2020identification}, estimated its sensitivity at 38.8\%, specificity at 94.9\% and Positive Predictive Value (PPV) at 14.6\%. Thus, out of every 100 children with ASD, M-CHAT/F currently flags about 39, and out of every 100 children it flags, about 85 are false positives, exacerbating wait times and queues~\cite{gordon2016whittling}. Automated screening that can be administered with no specialized training, requires no behavioral observations, and is functionally independent of the tools employed in current practice has the potential for immediate transformative impact on patient care.
While the neurobiological basis of autism remains poorly understood, a detailed assessment conducted by the US Centers for Disease Control and Prevention (CDC) demonstrated that children with ASD experience higher than expected rates of many diseases~\cite{cdc}. These include conditions related to dysregulation of immune pathways such as eczema, allergies, asthma, as well as ear and respiratory infections, gastrointestinal problems, developmental issues, severe headaches, migraines, and seizures~\cite{pmid30733689,pmid22511918}. In the present study, we exploit these co-morbidities to estimate the risk of childhood neuropsychiatric disorders on the autism spectrum. We refer to the risk estimated by our approach as the Autism Co-morbid Risk Score (\acor). Using only sequences of diagnostic codes from past doctor's visits, our risk estimator reliably
predicts an eventual clinical diagnosis -- or the lack thereof -- for individual patients.
Thus, the key clinical contribution of this study is the formalization of subtle co-morbidity patterns as a reliable screening tool, potentially improving wait-times for diagnostic evaluations by significantly reducing the number of false positives encountered in initial screens in current practice.
A screening tool that tracks the risk of an eventual ASD diagnosis, based on information already being gathered during regular doctor's visits, and which may be implemented as a fully automated background process requiring no time commitment from providers, has the potential to reduce avoidable diagnostic delays at no additional burden of time, money, or personnel resources. Use of patterns emergent in the diagnostic history to estimate risk might help reduce the subjective component in questionnaire-based screening tools, resulting in 1) a reduced effect of potential language and cultural barriers in diverse populations, and 2) possibly better identification of children with milder symptoms~\cite{hyman2020identification}.
Furthermore, being functionally independent of the M-CHAT/F, we show that there is a clear advantage to combining the outcomes of the two tools: we can take advantage of any population stratification induced by the M-CHAT/F scores to significantly boost combined screening performance (See Materials \& Methods, and Supplementary text, Section~\ref{SI-sec:waittime}).
\input{figsandtab.tex}
Use of sophisticated analytics to identify children at high risk is a topic of substantial current interest, with independent progress being made by several groups~\cite{hyde2019applications,abbas2020multi,duda2016clinical,duda2014testing,fusaro2014potential,wall2012use,wall2012use2}. Many of these approaches focus on analyzing questionnaires, with recent efforts demonstrating the use of automated pattern recognition in video clips of toddler behavior. However, the inclusion of older children (above the age of $5$ years) and small cohort sizes might limit the immediate clinical adoption of these approaches for universal screening.
Laboratory tests for ASD have also begun to emerge, particularly leveraging the detection of abnormal metabolites in plasma~\cite{smith2020metabolomics,howsmon2017classification}, and salivary poly-omic RNA~\cite{hicks2018validation}. However, as before, the inclusion of older children limits applicability in screening, where we need a decision at $18-24$ months. In addition, such approaches -- while instrumental in deciphering ASD pathophysiology -- might be too expensive for universal adoption at this time.
In contrast to \acor, the above approaches require additional data or tests. Use of comorbidity patterns derived from past Electronic Health Records has been either limited to establishing correlative associations~\cite{doshi2014comorbidity,bishop2018using}, or has substantially underperformed~\cite{lingren2016electronic} (AUC $\leq 65\%$) compared to our results.
\section*{Materials \& Methods}
We view the task of predicting ASD diagnoses as a binary classification problem: sequences of diagnostic codes are classified into positive and control categories, where ``positive'' refers to children eventually diagnosed with ASD, as indicated by the presence of a clinical diagnosis (ICD9 code 299.X) in their medical records. Of the two independent sources of clinical incidence data used in this study, the primary source used to train our predictive pipeline is the Truven Health Analytics MarketScan\textsuperscript{\textregistered} Commercial Claims and Encounters Database for the years 2003 to 2012~\cite{hansen2017truven} (referred to as the Truven dataset). This US national database merges data contributed by over 150 insurance carriers and large self-insurance companies, and comprises over 4.6 billion inpatient and outpatient service claims and almost six billion diagnosis codes. We extracted histories of patients aged $0-6$ years, and excluded patients for whom one or more of the following criteria fail: 1) at least one code pertaining to one of the $17$ disease categories we use (see below for a discussion of disease categories) is present in the diagnostic history; 2) the first and last available records for a patient are at least 15 weeks apart. These exclusion criteria ensure that we are not considering patients who have too few observations. Additionally, during validation runs, we restricted the control set to patients whose last record occurs no earlier than the first 200 weeks of life. The characteristics of excluded patients are shown in Table~\ref{tab2}. We trained with over 30M diagnostic records (16,649,548 for males and 14,318,303 for females, with 9,835 unique codes).
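The exclusion logic itself is straightforward; the following is a minimal sketch, assuming a hypothetical table \texttt{records} with columns \texttt{patient\_id}, \texttt{week}, and \texttt{category} (the production pipeline operates on the raw claims schema):
\begin{verbatim}
# Sketch of the exclusion criteria (hypothetical schema): keep a patient
# only if (1) at least one code falls in one of the 17 tracked disease
# categories, and (2) the first and last records are >= 15 weeks apart.
import pandas as pd

def apply_exclusions(records: pd.DataFrame, n_categories: int = 17,
                     min_span_weeks: int = 15) -> pd.Index:
    grouped = records.groupby("patient_id")
    has_category = grouped["category"].apply(
        lambda c: c.between(0, n_categories - 1).any())
    span_ok = grouped["week"].apply(
        lambda w: w.max() - w.min() >= min_span_weeks)
    keep = has_category & span_ok
    return keep[keep].index
\end{verbatim}
The additional restriction on validation controls described above would add one more Boolean clause, requiring the last recorded week to be at least 200.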
While the Truven database is used for both training and out-of-sample cross-validation with held-back data, our second independent dataset, comprising de-identified diagnostic records for children treated at the University of Chicago Medical Center between 2006 and 2018 (the UCM dataset), aids in further cross-validation. We considered children between the ages of $0-6$ years, and applied the same exclusion criteria as for the Truven dataset. The number of patients used from the two databases is shown in Table~\ref{tab2}. Our datasets are consistent with documented ASD prevalence and median diagnostic age (3 years in the claims database versus 3 years 10 months to 4 years in the US~\cite{pmid29701730}), with no significant geospatial prevalence variation (See SI-Fig.~\ref{SI-figocc}).
The significant diversity of diagnostic codes (6,089 distinct ICD-9 codes and 11,522 distinct ICD-10 codes in total in the two datasets), along with the sparsity of codes per sequence and the need to make good predictions as early as possible, makes this a difficult learning problem. We proceed by partitioning the disease spectrum into $17$ broad categories, $e.g.$ infectious diseases, immunologic disorders, endocrine disorders, etc. Each patient is then represented by $17$ distinct time series, each tracking an individual disease category. At the population level, these disease-specific sparse stochastic time series are compressed into specialized Markov models (separately for the control and the treatment cohorts) to identify the distinctive patterns pertaining to elevated ASD risk. Each of these inferred models is a Probabilistic Finite State Automaton (PFSA)~\cite{CR08}. See Supplementary Text Section~\ref{SI-sec:PFSA} for details on PFSA inference~\cite{CL12g,Chattopadhyay20140826}. {\HCOL Importantly, the number of states and the connectivity of the PFSA models are explicitly inferred from data, with the algorithms generating higher resolution models when needed, and automatically opting for a lower resolution otherwise. This results in a framework with fewer nominal free parameters compared to standard neural network or deep learning architectures: the simplest deep learning model we investigated has over 185K trainable parameters, while our final pipeline has 13,744. A smaller set of free parameters implies a smaller sample complexity, $i.e.$, the model can be learned with less data, which, at least in the present application, gives us better performance relative to the standard architectures we investigated (See SI-Fig.~\ref{SI-EXT-figcompwsoa} in the Supplementary text).}
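Concretely, before model inference each patient history is re-encoded as $17$ weekly event streams. A minimal sketch of this representation follows, where \texttt{CATEGORY\_OF}, a lookup from diagnostic code prefix to category index, is purely illustrative:
\begin{verbatim}
# Sketch: re-encode a patient's diagnostic history as 17 binary weekly
# time series, one per broad disease category. CATEGORY_OF is a
# hypothetical lookup from ICD code prefix to a category index.
CATEGORY_OF = {"001": 0, "140": 1, "240": 2}  # illustrative entries only

def weekly_streams(history, n_weeks, n_categories=17):
    """history: iterable of (week, icd_prefix) pairs."""
    streams = {j: [0] * n_weeks for j in range(n_categories)}
    for week, icd_prefix in history:
        j = CATEGORY_OF.get(icd_prefix)
        if j is not None and 0 <= week < n_weeks:
            streams[j][week] = 1
    return streams
\end{verbatim}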
After inferring models of diagnostic history, we need to quantify how the models underlying the \treatment cohort diverge from those underlying the \control cohort.
We use a novel approach to evaluating subtle deviations in stochastic observations, known as the sequence likelihood defect (SLD)~\cite{huang2019data}, to quantify the similarity of an individual patient's observed time-series of diagnostic events to the \control vs the \treatment cohorts. This novel stochastic inference approach provides a significant boost to the overall performance of our predictors; with only state-of-the-art machine learning, the predictive performance is significantly worse (See Supplementary Text Sections~\ref{SI-sec:pipelinevar} and \ref{SI-sec:offtheshelf}, as well as reported performance in the literature for predicting ASD risk from EHR data with standard algorithms~\cite{lingren2016electronic}).
We briefly outline the SLD computation: To reliably infer the cohort-type of a new patient, $i.e.$, the likelihood of a diagnostic sequence being generated by the corresponding cohort model, we generalize the notion of Kullback-Leibler (KL) divergence~\cite{Cover,kullback1951} between probability distributions to a divergence $\mathcal{D}_{\textrm{KL}}(G \vert \vert H)$ between ergodic stationary categorical stochastic processes~\cite{doob1953stochastic} $G,H$ as:
\cgather{
\mathcal{D}_{\textrm{KL}}(G \vert \vert H) = \lim_{n\rightarrow \infty} \frac{1}{n} \sum_{x:|x| = n}p_G(x)\log\frac{p_G(x)}{p_H(x)} }%
where $\vert x\vert $ is the sequence length, and $p_G(x), p_H(x)$ are the probabilities of sequence $x$ being generated by the processes $G,H$ respectively. Defining the length-normalized negative log-likelihood of $x$ being generated by a process $G$ as:
\cgather{
L(x,G)= -\frac{1}{\vert x\vert}\log p_G(x)
}%
The cohort-type for an observed sequence $x$ -- which is actually generated by the hidden process $G$ -- can be formally inferred from observations based on the following provable relationships (See Suppl. text Section~\ref{SI-sec:PFSA}, Theorems 6 and 7):
\begin{subequations}\label{eqR}\cgather{
\lim_{\vert x \vert \rightarrow \infty}L(x,G) = \mathcal{H}(G) \\
\lim_{\vert x \vert \rightarrow \infty } L(x,H) =\mathcal{H}(G) + \mathcal{D}_{\textrm{KL}}(G \vert \vert H)
}\end{subequations}%
where $\mathcal{H}(\cdot)$ is the entropy rate of a process~\cite{Cover}. Importantly, Eq.~\eqref{eqR} shows that the computed likelihood has an additional non-negative contribution from the divergence term when we choose the incorrect generative process. Thus, if a patient is eventually going to be diagnosed with ASD, then we expect the disease-specific mapped series corresponding to their diagnostic history to be better modeled by the corresponding model for the \treatment cohort. Denoting the models corresponding to disease category $j$ for the \treatment and \control cohorts as $G^{j}_+,G^{j}_0$ respectively, we can compute the sequence likelihood defect (SLD, $\Delta^j$) as:
\cgather{
\Delta^j \triangleq L(x,G^{j}_0) - L(x,G^{j}_+) \rightarrow \mathcal{D}_{\textrm{KL}}(G^{j}_+ \vert \vert G^{j}_0) \label{eq6}
}%
With the inferred models and the individual diagnostic history, we estimate the SLD on the left-hand side of Eq.~\eqref{eq6}, which approximates the divergence on the right. The higher this likelihood defect, the greater the similarity of the diagnostic history to that of children with autism.
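In implementation terms, the per-category SLD is a difference of length-normalized negative log-likelihoods under the two inferred PFSA models. A minimal sketch, assuming each PFSA is given by per-state symbol-emission probabilities and a deterministic transition map (the actual inference of these objects is described in the Supplementary text):
\begin{verbatim}
import math

def neg_loglik_rate(pfsa, seq):
    # L(x, G): length-normalized negative log-likelihood of seq under a
    # PFSA given as (emit, trans), with emit[state][symbol] -> probability
    # and trans[(state, symbol)] -> next state; evaluation starts at state 0.
    emit, trans = pfsa
    state, logp = 0, 0.0
    for symbol in seq:
        logp += math.log(emit[state][symbol])
        state = trans[(state, symbol)]
    return -logp / len(seq)

def sld(pfsa_control, pfsa_treatment, seq):
    # Sequence likelihood defect for one disease category: positive values
    # indicate the sequence is better explained by the treatment model.
    return (neg_loglik_rate(pfsa_control, seq)
            - neg_loglik_rate(pfsa_treatment, seq))
\end{verbatim}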
In addition to the category specific Markov models, we use a range of engineered features that reflect various aspects of the diagnostic histories, including the proportion of weeks in which a diagnostic code is generated, the maximum number of consecutive weeks with codes, and the maximum number of consecutive weeks with no codes (See Tab.~\ref{EXT-tab1} for a complete description), resulting in a total of $165$ different features that are evaluated for each patient. With these inferred patterns included as features, we train a second level predictor that learns to map individual patients to the \control or the \treatment groups based on their similarity to the identified Markov models of category-specific diagnostic histories, and the other engineered features (See Section~\ref{SI-sec:mathdetails} in the Supplementary text for detailed mathematical derivations).
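For illustration, the simpler engineered features named above can be computed directly from a patient's weekly 0/1 code-indicator series (a sketch; the complete feature set is enumerated in Tab.~\ref{EXT-tab1}):
\begin{verbatim}
def run_lengths(bits, value):
    # Lengths of maximal runs of `value` in a 0/1 sequence.
    runs, count = [], 0
    for b in bits:
        if b == value:
            count += 1
        elif count:
            runs.append(count)
            count = 0
    if count:
        runs.append(count)
    return runs or [0]

def simple_features(weekly):
    # weekly: list of 0/1 flags, one entry per week of observation.
    return {
        "frac_weeks_with_codes": sum(weekly) / len(weekly),
        "max_consecutive_code_weeks": max(run_lengths(weekly, 1)),
        "max_consecutive_silent_weeks": max(run_lengths(weekly, 0)),
    }
\end{verbatim}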
Since we need to infer the Markov models prior to the calculation of the likelihood defects, we need two training sets: one that is used to infer the models, and one that subsequently trains the final classifier with features derived from the inferred models along with other engineered features. Thus, the analysis proceeds by first carrying out a random 3-way split of the set of unique patients (in the Truven dataset) into \textit{Markov model inference} ($25\%$), \textit{classifier training} ($25\%$) and \textit{test} ($50\%$) sets, as sketched below. The approximate sample sizes of the three sets are as follows: $\approx 700K$ for each of the training sets, and $\approx 1.5M$ for the test set. The features used in our pipeline may be ranked in order of their relative importance (See Fig.~\ref{fig1}B for the top 15 features), estimated as the loss in performance when each feature is dropped from the analysis. We verified that different random splits do not adversely affect performance. The UCM dataset in its entirety is used as a test set, with no retraining of the pipeline.
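The split itself is a routine random partition over unique patient identifiers; a minimal sketch with the stated proportions:
\begin{verbatim}
import random

def three_way_split(patient_ids, seed=0):
    # 25% for Markov model inference, 25% for classifier training,
    # and the remaining 50% held back for out-of-sample testing.
    ids = list(patient_ids)
    random.Random(seed).shuffle(ids)
    n = len(ids)
    return ids[: n // 4], ids[n // 4 : n // 2], ids[n // 2 :]
\end{verbatim}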
Our pipeline maps medical histories to a raw indicator of risk. Ultimately, to make crisp predictions, we must choose a decision threshold for this raw score. In this study, we base our analysis on maximizing the $F_1$-score, defined as the harmonic mean of precision (PPV) and sensitivity, to make a balanced trade-off between Type 1 and Type 2 errors (See Supplementary text, Section~\ref{SI-sec:F1}). The \textit{relative risk} is then defined as the ratio of the raw risk to the decision threshold, and a value $>1$ predicts a future ASD diagnosis. Our two-step learning algorithm outperforms standard tools, and achieves stable performance across datasets that is strictly superior to documented M-CHAT/F performance.
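Operationally, the decision threshold can be found by sweeping candidate cutoffs over the raw scores of the classifier-training split; a minimal sketch (with \texttt{scores} and binary \texttt{labels} assumed given) follows:
\begin{verbatim}
def best_f1_threshold(scores, labels):
    # Choose the raw-score cutoff maximizing the F1-score; the relative
    # risk of a patient is then raw_score / threshold (> 1 flags risk).
    best_t, best_f1 = None, -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and not y)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y)
        if tp == 0:
            continue
        precision, recall = tp / (tp + fp), tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t
\end{verbatim}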
The independence of our approach from questionnaire-based screening implies that we can further boost our performance by conditioning the sensitivity/specificity trade-offs on individual M-CHAT/F scores. In particular, we leverage the population stratification induced by M-CHAT/F to improve combined performance. Here a combination of \acor with M-CHAT/F refers to the conditional tuning of the sensitivity/specificity for \acor in each sub-population such that the overall performance is maximized.
To describe this approach briefly, assume there are $m$ sub-populations, where the achieved sensitivity and specificity, and the prevalence, in sub-population $i$ are given by $s_i$, $c_i$ and $\rho_i$ respectively, with $i \in \{1,\cdots, m\}$. Let $\beta_i$ be the relative size of each sub-population. Then we have (See Supplementary text, Section~\ref{SI-subsec:4D}):
\begin{subequations}\label{eqscpop}
\cgather{
s= \sum_{i=1}^m s_i \gamma_i \\
c= \sum_{i=1}^m c_i \gamma_i'
\intertext{
where we have denoted:
}
\gamma_i = \beta_i \frac{\rho_i }{\rho}, \textrm{ and } \gamma_i'= \beta_i \frac{1-\rho_i}{1-\rho}
}%
\end{subequations}%
and $s,c,\rho$ are the overall sensitivity, specificity, and prevalence. Knowing the values of $\gamma_i, \gamma_i'$, we can carry out an $m$-dimensional search to identify the feasible choices of $s_i,c_i$ pairs for each $i$, such that some global constraint is satisfied, $e.g.$ minimum values of specificity, sensitivity, and PPV.
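Eq.~\eqref{eqscpop} thus reduces the combined operation to a small constrained search over per-sub-population operating points. A coarse sketch follows; here the \texttt{roc\_points} grid stands in for the actual per-sub-population ROC curves of \acor, and all numbers are placeholders rather than the CHOP statistics:
\begin{verbatim}
import itertools

def pooled(s_list, c_list, beta, rho_sub, rho):
    # Combine per-sub-population sensitivities/specificities into the
    # overall (s, c) via the pooled-performance relations above.
    gamma = [b * r / rho for b, r in zip(beta, rho_sub)]
    gammap = [b * (1 - r) / (1 - rho) for b, r in zip(beta, rho_sub)]
    s = sum(si * g for si, g in zip(s_list, gamma))
    c = sum(ci * g for ci, g in zip(c_list, gammap))
    return s, c

def ppv(s, c, rho):
    return s * rho / (s * rho + (1 - c) * (1 - rho))

def high_recall_point(roc_points, beta, rho_sub, rho,
                      min_ppv=0.146, min_spec=0.95):
    # HR operating point: maximize sensitivity subject to minimum PPV
    # and specificity; the HP point swaps the objective and constraints.
    best = None
    for combo in itertools.product(roc_points, repeat=len(beta)):
        s, c = pooled([p[0] for p in combo], [p[1] for p in combo],
                      beta, rho_sub, rho)
        if c >= min_spec and ppv(s, c, rho) >= min_ppv:
            if best is None or s > best[0]:
                best = (s, c)
    return best
\end{verbatim}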
We consider $4$ sub-populations, defined by M-CHAT/F score brackets~\cite{pmid31562252} and by whether the screen result is considered positive (high risk, indicating the need for a full diagnostic evaluation) or negative ($i.e.$, low risk): 1) score $\leq 2$, screening ASD negative; 2) score $3-7$, screening ASD negative on follow-up; 3) score $3-7$, screening ASD positive on follow-up; and 4) score $\geq 8$, screening ASD positive (See SI-Table~\ref{SI-EXT-tabCHOP}). The ``follow-up'' in the context of M-CHAT/F refers to the re-evaluation of responses by qualified personnel. We use published data from the CHOP study~\cite{pmid31562252} on the relative sizes and the prevalence statistics of these sub-populations to compute the feasible conditional choices of our operating point, which vastly supersede standalone M-CHAT/F performance. The CHOP study is the only large-scale study of M-CHAT/F we are aware of with sufficient follow-up after the age of four years to provide a reasonable degree of confidence in the sensitivity of M-CHAT/F.
Two limiting operating conditions are of particular interest in this optimization scheme: 1) maximizing PPV under minimum specificity and sensitivity constraints (denoted the High Precision, or HP, operating point), and 2) maximizing sensitivity under minimum PPV and specificity constraints (denoted the High Recall, or HR, operating point). Taking these minimum values of specificity, sensitivity, and PPV to be those reported for M-CHAT/F, we identify the feasible set of conditional choices in a four-dimensional decision space that would significantly outperform M-CHAT/F in universal screening. The results are shown in Fig.~\ref{figprc}B.
We carried out a battery of tests to ensure that our results are not significantly impacted by class imbalance (since our control cohort is orders of magnitude larger) or systematic coding errors; $e.g.$, we verified that restricting the \treatment cohort to children with at least two distinct ASD diagnostic codes in their medical histories, instead of one, has little impact on out-of-sample predictive performance (SI-Fig.~\ref{SI-EXT-figcompsi}B).
Use of de-identified patient data for the UCM dataset was approved by the Institutional Review Board (IRB) of the University of Chicago. An exemption was granted due to the de-identified nature of the data. (IRB Committee: Biological Sciences Division, Contact: Amy Horst, ahorst@medicine.bsd.uchicago.edu. IRB\#: Predictive Diagnoses IRB19-1040).
\section*{Results}
We measure our performance using several standard metrics including the AUC, sensitivity, specificity and the PPV. For the prediction of the eventual ASD status, we achieve an out-of-sample AUC of $82.3\%$ and $82.5\%$ for males and females respectively at $125$ weeks for the Truven dataset. In the UCM dataset, our performance is comparable: $83.1\%$ and $81.3\%$ for males and females respectively (Fig.~\ref{fig1} and \ref{fig2}). Our AUC improves approximately linearly with patient age: Fig.~\ref{fig2}A illustrates that the AUC reaches 90\% in the Truven dataset at the age of four. Importantly, Fig.~\ref{fig2}B suggests that the risk progressions are largely monotonic. Additionally, the computed confidence bounds suggest that the odds of a child with low risk up to 100 or 150 weeks of age abruptly shifting to a high risk trajectory are low. Thus, ``borderline'' cases need to be surveilled longer, while robust decisions may be obtained much earlier in other cases.
Recall that the UCM dataset is used purely for validation; good performance on this independent dataset lends strong evidence for our claims. Furthermore, applicability to new datasets \textit{without local re-training} makes the pipeline readily deployable in clinical settings.
Enumerating the top $15$ predictive features (Fig.~\ref{fig1}B), ranked according to their automatically inferred weights (the feature ``importances''), we found that while infections and immunologic disorders are the most predictive, there is a significant effect from each of the $17$ disease categories. Thus, the co-morbid indicators are distributed across the disease spectrum, and no single disorder is uniquely implicated (See also Fig.~\ref{fig2}F). Notably, predictability is relatively agnostic to the number of local cases across US counties (Fig.~\ref{fig1}C-D), which is important in light of the current uneven distribution of diagnostic resources~\cite{gordon2016whittling,althouse2006pediatric} across states and regions.
Unlike individual predictions, which only become relevant after about 2 years of age, the average risks over the populations are clearly different from around the first birthday (Fig.~\ref{fig2}B), with the risk for the \treatment cohort rapidly rising. We also see a saturation of the risk after $\approx 3$ years, which corresponds to the median diagnosis age in the database. Thus, if a child is not diagnosed by that age, the estimated risk falls, since the probability of a diagnosis in the population starts to decline after this age. While average discrimination is not useful for individual patients, these trends reveal important clues as to how the risk evolves over time. Additionally, while each new diagnostic code within the first year of life increases the risk burden by approximately $2\%$ irrespective of sex (Fig.~\ref{fig2}D), distinct categories modulate the risk differently; $e.g.$, for the single random patient illustrated in Fig.~\ref{fig2}F, infections and immunological disorders dominate early, while diseases of the nervous system and sensory organs, as well as ill-defined symptoms, dominate the later period.
Given these results, it is important to ask: how much earlier can we trigger an intervention? On average, the first time the relative risk (risk divided by the decision threshold set to maximize the $F_1$-score, see Methods) crosses the $90\%$ threshold precedes diagnosis by $\approx 188$ weeks in the Truven dataset, and $\approx 129$ weeks in the UCM dataset. This does not mean that we are leading a possible clinical diagnosis by over $2$ years; a significant portion of this lead arises from families waiting in queue for diagnostic evaluations. Nevertheless, since such delays are rarely greater than one year~\cite{gordon2016whittling}, we are still likely to produce valid red flags significantly earlier than current practice.
Our approach produces a strictly superior PPV, exceeding the M-CHAT/F PPV by at least $14\%$ (range 14.1-33.6\%) when sensitivity and specificity are held at comparable values (approx. $38\%$ and $95\%$) around the age of 26 months ($\approx 112$ weeks). Fig.~\ref{figprc}A and Table~\ref{EXT-tabssp} show the out-of-sample PPV vs sensitivity curves for the two databases, stratified by sex, computed at $100$, $112$ and $150$ weeks. A single illustrative operating point is also shown on the ROC curve in Fig.~\ref{fig1}C, where at $150$ weeks we have a sensitivity of $51.8\%$, and a PPV of $15.8\%$ and $18.8\%$ for males and females respectively, both at a specificity of 95\%.
Beyond standalone performance, independence from standardized questionnaires implies that we stand to gain substantially from a combined operation. With the recently reported population stratification induced by M-CHAT/F scores~\cite{pmid31562252} (SI-Table~\ref{SI-EXT-tabCHOP2}), we can compute a conditional choice of sensitivity for our tool in each sub-population (M-CHAT/F score brackets: $0-2$, $3-7$ (negative assessment), $3-7$ (positive assessment), and $\geq 8$), leading to a significant performance boost. With such conditional operation, we get a PPV close to or exceeding $30\%$ at the high precision (HP) operating point across datasets ($>33\%$ for Truven, $>28\%$ for UCM), or a sensitivity close to or exceeding $50\%$ at the high recall (HR) operating point ($>58\%$ for Truven, $>50\%$ for UCM), when we restrict specificities to above $95\%$ (See Table~\ref{EXT-tabboost}, Fig.~\ref{figprc}B(i), and SI-Fig.~\ref{SI-EXT-fig4D} in the supplementary text). Comparing with standalone M-CHAT/F performance (Fig.~\ref{figprc}B(ii)), we show that for any prevalence between $1.7\%$ and $2.23\%$, we can \textit{double the PPV} without losing sensitivity at $>98\%$ specificity, or increase the sensitivity by $\sim 50\%$ without sacrificing PPV while keeping specificity $\geq 94\%$.%
\section*{Discussion}
In this study, we operationalize a documented aspect of ASD symptomology: its wide range of co-morbidities~\cite{pmid22511918,pmid30733689,pmid25681541} occurring at above-average rates~\cite{hyman2020identification}. Associations of ASD with epilepsy~\cite{pmid23935565}, gastrointestinal disorders~\cite{pmid30646068,pmid21651783,pmid30823414,pmid21282636,pmid29028817,pmid30109601}, mental health disorders~\cite{pmid24729779}, insomnia, decreased motor skills~\cite{pmid30337860}, allergies including eczema~\cite{pmid30646068,pmid21651783,pmid30823414,pmid21282636,pmid29028817,pmid30109601}, and immunologic~\cite{pmid30971960,pmid30941018,pmid29691724,pmid29307081,pmid27351598,pmid26793298,pmid30095240,pmid25681541} and metabolic~\cite{pmid30178105,pmid27957319,pmid29028817} disorders are widely reported. These studies, along with support from large-scale exome sequencing~\cite{Satterstrom484113,pmid25038753}, have linked the disorder to putative mechanisms of chronic neuroinflammation, implicating immune dysregulation and microglial activation~\cite{pmid15546155,pmid21595886,pmid21629840,pmid26793298,pmid30483058,pmid29691724} during important brain developmental periods of myelination and synaptogenesis. However, these advances have not yet led to clinically relevant diagnostic biomarkers. The majority of the co-morbid conditions are common in the control population, and rate differentials at the population level do not automatically yield individual risk~\cite{Pearce2000}.
ASD genes exhibit extensive phenotypic variability, with identical variants associated with diverse individual outcomes not limited to ASD, including schizophrenia, intellectual disability, language impairment, epilepsy, other neuropsychiatric disorders, and also typical development~\cite{pmid23537858}. Additionally, no single gene can be considered ``causal'' for more than 1\% of cases of idiopathic autism~\cite{pmid23637569}.
Despite these hurdles, laboratory tests and potential biomarkers for ASD have begun to emerge~\cite{smith2020metabolomics,howsmon2017classification,hicks2018validation}. These tools are still in their infancy, and have not demonstrated performance in the 18-24 month age group. In the absence of clinically useful biomarkers, current screening in pediatric primary care visits uses standardized questionnaires to categorize behavior. This is susceptible to potential interpretative biases arising from language barriers, as well as social and cultural differences, often leading to systematic under-diagnosis in diverse populations~\cite{hyman2020identification}. In this study we use time-stamped sequences of past disorders to elicit crucial information on the developing risk of an eventual diagnosis, and formulate the autism co-morbid risk score. \acor is largely free from the aforementioned biases (See performance comparison in out-of-sample populations stratified by race, gender and ethnicity in SI-Fig.~\ref{SI-figrace}, where we find no significant differences in predictive performance across racial and ethnic boundaries), and yet significantly outperforms the tools in current practice. Our ability to maintain high performance in diverse populations is an important feature of \acor; it is not obvious a priori that \acor resists inheriting or compounding the systemic biases that might exist in diagnostic procedures. Nevertheless, our investigations with the UCM dataset show no significant differences in predictive performance among African-American, white, or multi-racial sub-populations. While the average performance among Hispanic children was somewhat lower, the differences were not significant.
Going beyond screening performance, this approach provides a new tool to uncover clues to ASD pathobiology, potentially linking the observed vulnerability to diverse immunological, endocrinological and neurological impairments to the possibility of allostatic stress load disrupting key regulators of CNS organization and synaptogenesis. Charting individual disorders in the co-morbidity burden further reveals novel associations in normalized prevalence -- the odds of experiencing a specific disorder, particularly in the early years (age~$<3$ years), normalized over all unique disorders experienced in the specified time-frame. We focus on the true positives in the \treatment cohort and the true negatives in the \control cohort to investigate patterns that correctly disambiguate ASD status. Along these lines, Fig.~\ref{EXT-fig3} and SI-Fig.~\ref{SI-EXT-fig4} in the supplementary text outline two key observations: 1) \textit{negative associations:} some diseases are negatively associated with ASD with respect to normalized prevalence, $i.e.$, having those codes relatively over-represented in one's diagnostic history favors ending up in the \control cohort; 2) \textit{impact of sex:} there are sex-specific differences in the impact of specific disorders, and for a fixed level of impact, the number of codes that drive the outcomes is significantly larger in males (Fig.~\ref{EXT-fig3}A vs B).
Some of the disorders that show up in Fig.~\ref{EXT-fig3}, panels A and B, are surprising, $e.g.$, congenital hemiplegia or diplegia of the upper limbs, indicative of either cerebral palsy (CP) or a spinal cord/brain injury, neither of which has a direct link to autism. However, this effect is easily explainable: since only about $7\%$ of children with CP are estimated to have a co-occurring ASD~\cite{cdccp,christensen2014prevalence}, and the prevalence of CP is significantly lower than that of autism (1 in 352 vs 1 in 59), it follows that only a small fraction of children with autism (approximately $1.17\%$) have co-occurring CP. Nevertheless, with CP significantly more prevalent among children diagnosed with autism than in the general population ($1.17\%$ vs $0.28\%$), CP codes show up with higher odds in the true positive set. Other patterns are harder to explain. For example, SI-Fig.~\ref{SI-EXT-fig4}A shows that the immunological, metabolic, and endocrine disorders are almost completely risk-increasing, and respiratory diseases (panel B) are largely risk-decreasing. On the other hand, infectious diseases have roughly equal representations in the risk-increasing and risk-decreasing classes (panel C). The risk-decreasing infectious diseases tend to be due to viral or fungal organisms, which might point to the use of antibiotics in bacterial infections, and the consequent dysbiosis of the gut microbiota~\cite{pmid30823414,pmid27957319}, as a risk factor.
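The co-occurrence figure follows directly from Bayes' rule applied to the cited prevalences; a quick check:
\begin{verbatim}
# Bayes-rule check of the co-occurrence figures quoted above.
p_cp = 1 / 352         # prevalence of cerebral palsy
p_asd = 1 / 59         # prevalence of autism
p_asd_given_cp = 0.07  # ~7% of children with CP also have ASD
p_cp_given_asd = p_asd_given_cp * p_cp / p_asd
print(f"{p_cp_given_asd:.2%}")  # ~1.17% of children with ASD have CP
\end{verbatim}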
Any predictive analysis of ASD must address whether we can discriminate ASD from general developmental and behavioral disorders. The DSM-5 established a single category of ASD to replace the subtypes of autistic disorder, Asperger syndrome, and pervasive developmental disorders~\cite{hyman2020identification}. This aligns with our use of diagnostic codes under ICD9 299.X as the specification of an ASD diagnosis, with the standardized mapping to 299.X applied to ICD10 codes when we encounter them. For other psychiatric disorders, we get high discrimination, reaching AUCs over $90\%$ at $100-125$ weeks of age (SI-Fig.~\ref{SI-EXT-figcompsi}A), which establishes that our pipeline is indeed largely specific to ASD.
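Cohort labeling thus reduces to a prefix check on diagnosis codes; a minimal sketch (the ICD10 prefix shown, F84, is the pervasive developmental disorders block, whose standard equivalence mapping largely points to 299.X):
\begin{verbatim}
def is_asd_code(code: str) -> bool:
    # ASD label: ICD-9 299.X, or ICD-10 codes in the F84 block, which the
    # standard equivalence mapping largely sends to 299.X (a sketch only).
    code = code.replace(".", "").upper()
    return code.startswith("299") or code.startswith("F84")
\end{verbatim}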
Can our performance be matched by simply asking how often a child is sick? While the density of ICD codes in a child's medical history is higher for those eventually diagnosed with autism (See Table~\ref{tab2}), we found this to be a crude measure. Although somewhat predictive, achieving an AUC $\approx 75\%$ in the Truven database at $150$ weeks (See SI-Fig.~\ref{SI-EXT-figcompsi}, panel D in the supplementary text), code density by itself does not deliver stable performance across the two databases, with particularly poor validation performance in the UCM database. Furthermore, adding code density as a feature shows no appreciable improvement in our pipeline.
We also investigated the impact of removing all psychiatric codes (ICD9 290-319, and the corresponding ICD10 codes) from the patient histories, to eliminate the possibility that our performance is only reflective of prior psychiatric evaluation results. Training and validation on this modified data revealed no appreciable difference in performance (See SI-Fig.~\ref{SI-EXT-fig1nop}).
Similarly, including information on prescribed medications and medical procedures alongside the diagnostic codes did not improve results.
A key limitation of our approach is that automated pattern recognition strategies are not guaranteed to reveal true causal precursors of future diseases. {\HCOL Thus, we must further scrutinize the patterns we find to be predictive. This is particularly true for the inferred negative associations of some disorders with the future odds of an ASD diagnosis; these might be indicative of collider bias~\cite{berkson1946limitations}, $i.e.$, spurious associations arising from conditioning on the common effect of unrelated causes. The putative collider in our framework is the presence of the child in a clinic or a medical facility for a medical issue. In the future, we will investigate the possibility of identifying more transparent and interpretable risk precursors via post-processing of our inferred models.}
Our approach is also impacted by the uncurated nature of medical histories, which invariably include coding mistakes and other artifacts, $e.g.$, the potential for over-diagnosis of children on the borderline of the diagnostic criteria due to clinicians' desire to help families access services, and biases arising from changes in diagnostic practices over time~\cite{10.1001/jamapsychiatry.2019.1956}. Discontinuities in patient medical histories from changes in provider-networks can also introduce uncertainties in risk estimates, and the socio-economic status of patients and differential access to care might skew patterns in EHR databases. Despite these limitations, the design of a questionnaire-free component of ASD screening that systematically leverages co-morbidities has far-reaching consequences, by potentially slashing false positives and wait-times, as well as alleviating systemic under-diagnosis amongst females and minorities.
In the future, we will also explore the impact of maternal medical history, and the use of the calculated risk to trigger blood-work looking for expected transcriptomic signatures of ASD. Finally, the analysis developed here applies to phenotypes beyond ASD, thus opening the door to the possibility of general comorbidity-aware risk predictions from electronic health record databases.
{\HCOL One of our immediate future goals is to prospectively validate \acor in a clinical setting. Ultimately, we hope to see widespread adoption of this new tool. A key hurdle to adoption is a general lack of trust in non-transparent decision algorithms, particularly in light of the biases, mis-classifications, misinterpretations, and subjectivity in the training datasets~\cite{cheng2020there}. We hope that results from prospective trials, and the key advantages laid out in this study, will help alleviate such concerns.
}
\subsection*{Data \& Software Availability}
All data and models needed to evaluate the conclusions in the paper are present in the paper and/or the Supplementary Materials. The Truven and UCM datasets cannot be made available due to ethical and patient privacy considerations. Preliminary software implementation of the pipeline is available at:
\url{https://github.com/zeroknowledgediscovery/ehrzero}, and installation in standard python environments may be done from \url{https://pypi.org/project/ehrzero/}. To enable fast execution, some of the more compute-intensive features are disabled in this version. Results from this software are for demonstration purposes only, and must not be interpreted as medical advice, or serve as a replacement for such.
\section*{Acknowledgment}
This work is funded in part by the Defense Advanced Research Projects Agency (DARPA) project number HR00111890043/P00004. The claims made in this study do not reflect the position or the policy of the US Government. The UCM dataset is provided by the Clinical Research Data Warehouse (CRDW) maintained by the Center for Research Informatics (CRI) at the University of Chicago. The Center for Research Informatics is funded by the Biological Sciences Division, the Institute for Translational Medicine/CTSA (NIH UL1 TR000430) at the University of Chicago.
Competing Interests: The authors declare that they have no competing interests.
\section*{Author Contributions}
DO implemented the algorithm and ran validation tests. DO, YH, JH and IC carried out mathematical modeling, and algorithm design. PS, MM and IC interpreted results and guided research.
IC procured funding and wrote the paper.
\section*{Supplementary Materials} Supplemental tables, figures and modeling details are included in the Supplementary Text document.
\bibliographystyle{Science}
\nocite{*}
\bibliography{mergedbib}
\end{document}
%
% LocalWords: comorbidity pathophysiology comorbidities