Generating synthetic data sets using non-confidential published tables and confidentialised unit record data sets

Project description

Synthetic data sets are useful for project and software development, training and, in some cases, substantive social research. This project investigated methods for creating large synthetic data sets using both publicly available published tables and confidentialised microdata. It built on the lead researcher’s 2006 OS Research project by allowing the creation of synthetic data sets containing more variables than is possible when using marginal tables alone.

Project aims

Specifically, the project aimed to:

  1. produce statistically robust methods for constructing synthetic data sets that accurately mimic the statistical characteristics of the relevant population, using both published marginal tables and confidentialised microdata.
  2. investigate to what extent such synthetic data sets could be used for substantive social research, and under what circumstances conclusions reached from the analysis of such synthetic data sets are statistically valid.
  3. create statistical software that would implement these methods and allow the routine generation of synthetic data.
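
As an illustration of the kind of method the first aim describes, the sketch below combines the two data sources using iterative proportional fitting (raking) followed by weighted resampling. This is an assumed, generic technique and not necessarily the method the project developed; the variable names, the four microdata records and the marginal counts are all hypothetical.

```python
import random
from collections import defaultdict


def rake_weights(records, margins, iterations=50):
    """Iterative proportional fitting: rescale record weights until the
    weighted totals match each published marginal table in turn."""
    weights = [1.0] * len(records)
    for _ in range(iterations):
        for var, targets in margins.items():
            # Current weighted total for each category of this variable.
            totals = defaultdict(float)
            for rec, w in zip(records, weights):
                totals[rec[var]] += w
            # Scale each weight so the category totals hit the targets.
            weights = [w * targets[rec[var]] / totals[rec[var]]
                       for rec, w in zip(records, weights)]
    return weights


def synthesise(records, weights, n, seed=0):
    """Draw n synthetic records by sampling microdata records
    with probability proportional to their raked weights."""
    rng = random.Random(seed)
    return [dict(rec) for rec in rng.choices(records, weights=weights, k=n)]


# Hypothetical confidentialised microdata. "employed" does not appear in
# the published margins, so its joint structure comes from the microdata.
microdata = [
    {"sex": "F", "age": "young", "employed": True},
    {"sex": "F", "age": "old", "employed": False},
    {"sex": "M", "age": "young", "employed": True},
    {"sex": "M", "age": "old", "employed": True},
]

# Hypothetical published marginal tables (each must sum to the same total).
margins = {
    "sex": {"F": 520, "M": 480},
    "age": {"young": 600, "old": 400},
}

weights = rake_weights(microdata, margins)
synthetic = synthesise(microdata, weights, n=1000)
```

The marginal tables fix the population totals, while the microdata supply the relationships with variables that the tables do not cover. Because this sketch simply resamples microdata records, it makes no attempt to address disclosure risk or the statistical validity questions raised in the second aim.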