dandanlen edited this page Jan 7, 2020 · 6 revisions

Motivation

Data from Diffix-protected datasets is anonymized and therefore inherently restricted, so exploring it is not straightforward. The analyst needs to be familiar with the imposed limitations and knowledgeable about possible workarounds. The aim of this project is to build a system (or component, or process) that builds a high-level picture of the shape of a given dataset whilst intelligently navigating the restrictions imposed by Diffix.

The most fundamental limitation is that you can't query any data that would uniquely (or even almost uniquely) identify a person in the database. As a result, the main way of extracting information about a given dataset is through aggregates. On their own, aggregate functions such as min, max, count and avg return very coarse-grained statistics of limited usefulness. However, using tricks such as calculating aggregates over sub-ranges of the data, we can extract richer statistics such as histograms to aid the analyst in their exploration of the dataset.
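To make the sub-range idea concrete, here is a minimal sketch in Python. It is illustrative only and is not this project's implementation: `count_in_range` is a hypothetical stand-in for whatever issues an anonymized count query, and `make_fake_backend` is an invented in-memory substitute that models only low-count suppression. A real Diffix-protected source would also perturb the returned counts with noise.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical query interface (an assumption, not a Diffix API): returns the
# number of rows with lo <= value < hi, or None if the answer is suppressed.
CountFn = Callable[[float, float], Optional[int]]
Bucket = Tuple[float, float, Optional[int]]


def histogram_from_counts(
    count_in_range: CountFn, lo: float, hi: float, buckets: int
) -> List[Bucket]:
    """Build a histogram by asking for one count aggregate per sub-range."""
    width = (hi - lo) / buckets
    result: List[Bucket] = []
    for i in range(buckets):
        b_lo = lo + i * width
        b_hi = lo + (i + 1) * width
        # Only an aggregate over the sub-range is requested, never a row.
        result.append((b_lo, b_hi, count_in_range(b_lo, b_hi)))
    return result


def make_fake_backend(values: Sequence[float], min_count: int = 2) -> CountFn:
    """Invented in-memory backend: suppresses buckets with too few rows."""

    def count_in_range(lo: float, hi: float) -> Optional[int]:
        n = sum(1 for v in values if lo <= v < hi)
        return n if n >= min_count else None

    return count_in_range


if __name__ == "__main__":
    backend = make_fake_backend([1, 2, 3, 11, 12, 25])
    for b_lo, b_hi, count in histogram_from_counts(backend, 0, 30, 3):
        label = "suppressed" if count is None else str(count)
        print(f"[{b_lo}, {b_hi}): {label}")
```

A suppressed bucket is reported as `None` rather than zero, so the caller can tell "too few rows to report" apart from "no rows", and could choose to retry with wider buckets.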
