When you hear the term data governance, is your first thought one of draconian policies that put security and regulations above business value? Unfortunately, this is the approach that many organizations have taken. They focus so heavily on restricting data to meet security and regulatory requirements that they eliminate the ability to generate business value from it. The future of data governance must include finding ways to keep protecting the data while still enabling organizational innovation.
Even though a strong data governance policy and a strong culture of innovation may seem contradictory, a few constructs can make the combination feasible. Three of the most important practices for enabling innovative data governance are synthetic data, DataOps, and a walled garden for your citizen data scientists.
Synthetic Data
The first important feature of innovative data governance is providing a data set that is statistically similar to the real data set without exposing private or confidential data. This can be accomplished using synthetic data.
Synthetic data is created by using real data to seed a process that then generates data that appears real but is not. Variational autoencoders (VAEs), generative adversarial networks (GANs), and real-world simulation create data that can provide a basis for experimentation without leaking real data and exposing the organization to untenable risk.
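As a minimal sketch of that seeding idea, the Python below uses a deliberately simple statistical stand-in rather than a VAE or GAN: it fits a two-column Gaussian to a small set of "real" rows and samples new rows from the fit. The column meanings (age and income), the helper names `fit_gaussian` and `sample_synthetic`, and the seed data itself are hypothetical, invented purely for illustration.

```python
import math
import random
import statistics

def fit_gaussian(rows):
    """Summarize two numeric columns by their means, spreads, and correlation."""
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sx, sy = statistics.pstdev(xs), statistics.pstdev(ys)
    cov = sum((x - mx) * (y - my) for x, y in rows) / len(rows)
    return mx, my, sx, sy, cov / (sx * sy)

def sample_synthetic(params, n, rng):
    """Draw n new rows from the fitted two-column Gaussian."""
    mx, my, sx, sy, rho = params
    # Guard against tiny negative values caused by floating-point rounding.
    residual = math.sqrt(max(0.0, 1.0 - rho * rho))
    rows = []
    for _ in range(n):
        a, b = rng.gauss(0, 1), rng.gauss(0, 1)
        x = mx + sx * a
        # Mixing a and b reproduces the fitted correlation between columns.
        y = my + sy * (rho * a + residual * b)
        rows.append((x, y))
    return rows

# Hypothetical "real" records: (age, income). Invented here for illustration.
rng = random.Random(42)
real_rows = []
for _ in range(500):
    age = rng.gauss(40, 10)
    income = 20000 + 1000 * age + rng.gauss(0, 8000)
    real_rows.append((age, income))

synthetic_rows = sample_synthetic(fit_gaussian(real_rows), 5000, rng)
print("real fit:     ", fit_gaussian(real_rows))
print("synthetic fit:", fit_gaussian(synthetic_rows))
```

The synthetic rows share the seed data's means, spreads, and correlation but contain none of its actual records. A generator this simple only captures Gaussian-shaped relationships between two columns, which is one reason richer data sets call for the neural methods described here.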
VAEs are neural networks composed of an encoder and a decoder. During encoding, the data is transformed so that its feature set is compressed; features are transformed and combined, which removes the fine details of the original data. During decoding, the compression is reversed, producing a data set that resembles the original but is not identical to it. The purpose of this process is to find an encoder and decoder whose output data is not directly attributable to the initial data source.
Consider an analogy of this process: taking a book and running it through a language translator (encoder) and then running it through a language translator in reverse (decoder). The resulting text would be similar but different.
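The encode-then-decode round trip can be sketched without a neural network. The Python below is not a trained VAE; it is a linear stand-in, assuming two numeric features, that compresses each row to a single latent number along the data's main axis of variation, samples fresh latent values, and decodes them back into rows. The function names (`fit_codec`, `encode`, `decode`) and the seed data are hypothetical.

```python
import math
import random
import statistics

def fit_codec(rows):
    """Find the center of the data and its main axis of variation."""
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = statistics.pvariance(xs)
    syy = statistics.pvariance(ys)
    sxy = sum((x - mx) * (y - my) for x, y in rows) / len(rows)
    # Orientation of the principal axis of a 2x2 covariance matrix.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    return (mx, my), (math.cos(theta), math.sin(theta))

def encode(row, center, axis):
    """Compress two features into one latent value (detail is lost here)."""
    return (row[0] - center[0]) * axis[0] + (row[1] - center[1]) * axis[1]

def decode(z, center, axis):
    """Expand one latent value back into two features."""
    return (center[0] + z * axis[0], center[1] + z * axis[1])

# Hypothetical "real" records with two correlated features.
rng = random.Random(7)
real_rows = []
for _ in range(300):
    x = rng.gauss(50, 10)
    real_rows.append((x, 2 * x + rng.gauss(0, 5)))

center, axis = fit_codec(real_rows)
codes = [encode(r, center, axis) for r in real_rows]
spread = statistics.pstdev(codes)

# Sample new latent values instead of reusing the real ones, then decode.
synthetic_rows = [decode(rng.gauss(0, spread), center, axis) for _ in range(300)]
print(synthetic_rows[:3])
```

Because the latent values are sampled rather than copied from real rows, the decoded output follows the overall trend of the seed data without reproducing any individual record. A real VAE learns nonlinear encoders and decoders and a distribution over the latent space, so treat this only as an illustration of the shape of the process.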
GANs are a more complex construct that consists of a pair of neural nets: a generator and a discriminator. The generator uses seed data to create new data sets, and the discriminator is then used to determine whether a given data set is real or synthetic. Over an iterative process, the generator improves its output to the point where the discriminator can no longer tell the real data set from the synthetic one. At this point, the generator can create data sets that appear indistinguishable from the real data and can be used for data experimentation.
In addition to these two methods, some organizations are using game engines and physics-based engines to simulate data sets based on scientific principles (e.g., physics, chemistry, biology) and on how objects in the real world behave under those principles. As these virtual simulations run, the resulting data set, which is representative of the actual data, can be collected for analysis and experimentation.
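To make the simulation approach concrete, here is a small sketch in Python that stands in for a physics engine: it steps a drag-free projectile forward in time and records each launch as a row of synthetic data. The scenario, the `simulate_launch` function, the parameter ranges, and the sensor-noise level are all assumptions chosen for illustration, not a description of any particular engine.

```python
import math
import random

G = 9.81  # gravitational acceleration in m/s^2

def simulate_launch(speed, angle_deg, dt=0.001):
    """Step a drag-free projectile until it lands.

    Returns (flight_time_s, range_m).
    """
    angle = math.radians(angle_deg)
    x, y = 0.0, 0.0
    vx, vy = speed * math.cos(angle), speed * math.sin(angle)
    t = 0.0
    while True:
        vy -= G * dt   # gravity changes the vertical velocity
        x += vx * dt   # horizontal velocity stays constant
        y += vy * dt
        t += dt
        if y <= 0.0:   # the object has returned to the ground
            return t, x

# Build a synthetic data set from many simulated launches.
rng = random.Random(0)
dataset = []
for _ in range(100):
    speed = rng.uniform(10, 30)   # hypothetical launch speeds in m/s
    angle = rng.uniform(20, 70)   # hypothetical launch angles in degrees
    flight_time, distance = simulate_launch(speed, angle)
    dataset.append({
        "speed_m_s": speed,
        "angle_deg": angle,
        "flight_time_s": flight_time,
        # Assumed sensor noise so rows look like imperfect measurements.
        "measured_range_m": distance + rng.gauss(0, 0.5),
    })

print(dataset[0])
```

Every row comes from the simulated physics rather than from a real record, so the data set can be shared for experimentation without exposing anything confidential. How representative it is depends entirely on how well the simulation's assumptions match the real system.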