Evaluating hypothesis tests on differentially private histogram-based synthetic data
Böhmeke, Jan (2024-06-27)
This publication is subject to copyright regulations. The work may be read and printed for personal use. Use for commercial purposes is prohibited.
Open access
The permanent address of the publication is:
https://urn.fi/URN:NBN:fi-fe2024062859806
Abstract
Sharing privacy-preserving synthetic data has been suggested as a way to release sensitive data without compromising individuals' privacy. The synthetic data should maintain the structure and statistical characteristics of the original data while protecting individuals' privacy. Differential privacy (DP) effectively addresses privacy concerns while preserving the structure and characteristics of the original data. The objective of this research is to evaluate Student's t-test and the Mann-Whitney U test empirically, to determine whether these tests lose validity or power when applied to DP synthetic data. This is assessed in terms of Type I and Type II errors. I evaluate the statistical hypothesis tests on sets of additively smoothed DP synthetic data generated from sets of original data. The original data sets are simulated questionnaire data (n = 20 000) on 5-point and 10-point Likert scales and the Kaggle Cardiovascular Dataset (n = 70 000). The validity of the tests was preserved for all privacy budget values (0.001 ≤ ϵ ≤ 100) and sampled dataset sizes (50, 100, 500, 1000) for all data. The power of the tests was considerably reduced in all cases.
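To make the data-generation step concrete, here is a minimal sketch of how additively smoothed, histogram-based DP synthetic data could be produced for Likert-type responses. It is an illustration under stated assumptions, not the procedure used in the thesis: it assumes the Laplace mechanism with per-bin sensitivity 1 (add/remove-one neighbouring datasets), clips negative noisy counts to zero, and adds a smoothing constant `alpha` to every bin before normalising. The function names `laplace_noise` and `dp_histogram_synthetic`, the default `alpha=1.0`, and the example response distribution are all hypothetical.

```python
import random
from collections import Counter


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # is Laplace-distributed with location 0 and scale `scale`.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_histogram_synthetic(data, categories, epsilon, n_synth, alpha=1.0, seed=0):
    """Sample synthetic categorical data from a noisy, smoothed histogram.

    Assumptions (not taken from the thesis): Laplace mechanism with
    sensitivity 1, negative counts clipped to 0, additive smoothing `alpha`.
    """
    rng = random.Random(seed)
    counts = Counter(data)
    scale = 1.0 / epsilon  # Laplace scale = sensitivity / epsilon, sensitivity assumed 1

    # Perturb each bin count, then clip so that no count is negative.
    noisy = [max(counts.get(c, 0) + laplace_noise(scale, rng), 0.0) for c in categories]

    # Additive smoothing keeps every category at a strictly positive probability.
    smoothed = [v + alpha for v in noisy]
    total = sum(smoothed)
    probs = [v / total for v in smoothed]

    # Draw the synthetic sample from the smoothed DP histogram.
    synthetic = rng.choices(categories, weights=probs, k=n_synth)
    return synthetic, probs


# Hypothetical 5-point Likert data with n = 20 000 responses.
data_rng = random.Random(1)
categories = [1, 2, 3, 4, 5]
original = data_rng.choices(categories, weights=[0.1, 0.2, 0.4, 0.2, 0.1], k=20000)

synthetic, probs = dp_histogram_synthetic(original, categories, epsilon=1.0, n_synth=500)
```

Two synthetic samples generated this way could then be passed to a two-sample test such as `scipy.stats.ttest_ind` or `scipy.stats.mannwhitneyu`; repeating that over many independent draws, with and without a true group difference, would give empirical estimates of the Type I error rate and of power.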