Skip to content

DataMonitoring

grahamcrowell_laptop edited this page Feb 25, 2017 · 2 revisions

Factless Data Monitoring

Periodic data profile logging allows us to detect when a data set has changed in unexpected.

Table row count

Consider log of COUNT(*) of ED_VisitFact recorded each time a it's ETL package is executed.

$x_k=$ED_VisitFact_Log where $x=$RowCount (ie COUNT(*)) and $k$ is $t=$EtlDate or $i=$ROW_NUMBER()

for each log entry compute:

change in row count RowCount - LAG(RowCount) $$\Delta x_i=x_i-x_{i-1}$$ number of days between log entries DATEDIFF(DAY,LAG(EtlDate),EtlDate) $$\Delta t_i = t_i-t_{i-1}$$ row count change per day $$y_i=\frac{\Delta x_i}{\Delta t_i}=\frac{x_i-x_{i-1}}{t_i-t_{i-1}}$$

Model $y_i$ as discrete markov process and estimate expected value $\mu=\mathbb{E}(y_i)$ and variance $\sigma^2=\text{Var}(y_i)$.

Define alert triggers: if $|y_i-\mu|>\alpha\sigma$ where $\alpha>0$ is a sensitivity constant.

Define meta parameter object

  • database name,
  • schema name,
  • table name,
  • column name(s) (optional)

Meta parameter objects imply SQL COUNT(*) aggregation where each column is added to GROUP BY clause.

Case: No GROUP BY no columns

Let $x$=COUNT(*)(*) data:

  • $i=0,1,\ldots,N$ etl sequence index,
  • $T=(t_i)_{i=0}^N$ etl time stamp sequence,
  • $X=(x_i)_{i=0}^N$ row count

calculations:

  • row count change per day: $\frac{dx}{dt}\approx$
  • relative row count change per day: $\frac{dx}{dt}\approx\frac{x_i-x_{i-1}}{x_i(t_i-t_{i-1})}$
Clone this wiki locally