Generating Synthetic Sequential Data Using GANs
Sequential data, meaning data with a time dependency, is very common in business, ranging from credit card transactions to healthcare records to stock market prices. But privacy regulations limit and dramatically slow down access to the data that research and development depend on. This creates demand for highly representative, yet fully private, synthetic sequential data, which is challenging to produce, to say the least.
Generating synthetic time-series and sequential data is more challenging than generating tabular data, where all the information about one individual is normally stored in a single row. In sequential data, information can be spread across many rows, as with credit card transactions, and preserving the correlations both between rows (the events) and between columns (the variables) is key. Furthermore, the length of the sequences is variable; some cases may comprise just a few transactions while others may have thousands.
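To make the difference from tabular data concrete, here is a minimal sketch of how flat transaction rows turn into one variable-length sequence per individual. The column layout (customer id, timestamp, amount, merchant category) and the sample values are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical transaction log: one row per event, not one row per individual.
# Columns: (customer_id, timestamp, amount, merchant_category)
rows = [
    ("c1", 1, 12.50, "grocery"),
    ("c2", 1, 80.00, "travel"),
    ("c1", 2, 3.20, "coffee"),
    ("c1", 5, 45.00, "fuel"),
    ("c2", 3, 15.75, "grocery"),
]


def group_into_sequences(rows):
    """Collect each customer's events into a time-ordered sequence."""
    sequences = defaultdict(list)
    # Sort by customer, then by time, so each sequence is chronological.
    for customer_id, timestamp, amount, category in sorted(
        rows, key=lambda r: (r[0], r[1])
    ):
        sequences[customer_id].append((timestamp, amount, category))
    return dict(sequences)


sequences = group_into_sequences(rows)
lengths = {cid: len(events) for cid, events in sequences.items()}
print(lengths)  # {'c1': 3, 'c2': 2}
```

Each sequence has its own length, and a generator has to reproduce both the dependencies between consecutive events in a sequence and the dependencies between the variables within each event.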
Generative models for sequential data and time-series have been studied extensively; however, many of these efforts have resulted in relatively poor synthetic data quality and low flexibility. In many cases, the models are designed to be specific to each problem and therefore require detailed domain knowledge.
In this post, we describe and apply an extended version of DoppelGANger, a recent and powerful method for generating synthetic sequential data. It is a framework based on Generative Adversarial Networks (GANs) with some innovations that make it possible to generate synthetic versions of complex sequential datasets.
We build on this work by introducing two innovations:
A learning strategy to speed up the convergence of the GAN and avoid mode collapse.
Well-designed noise in the discriminator that makes the process differentially private without degrading the quality of the data, using a modified version of the moments accountant strategy to improve the stability of the model.
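As a rough illustration of what noise in the discriminator can look like, the sketch below shows the generic clip-and-add-Gaussian-noise gradient step used in differentially private SGD. This is an assumption about the general family of techniques, not the modified moments accountant variant mentioned above. The function names and the parameters `clip_norm` and `noise_multiplier` are hypothetical, and no privacy accounting is performed here.

```python
import math
import random


def clip(grad, clip_norm):
    """Scale a per-example gradient so its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def noisy_mean_gradient(per_example_grads, clip_norm, noise_multiplier, rng):
    """Clip each example's gradient, sum, add Gaussian noise, then average."""
    clipped = [clip(g, clip_norm) for g in per_example_grads]
    dim = len(clipped[0])
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    # The noise standard deviation is proportional to the clipping bound.
    sigma = noise_multiplier * clip_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [v / len(clipped) for v in noisy]


# Toy discriminator gradients for a batch of two examples (illustrative values).
grads = [[3.0, 4.0], [0.3, 0.4]]
update = noisy_mean_gradient(
    grads, clip_norm=1.0, noise_multiplier=1.1, rng=random.Random(0)
)
```

Clipping bounds how much any single record can influence the update, and the noise masks what influence remains. An accountant such as the moments accountant would then track the cumulative privacy cost over training steps.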
Common approaches to sequential data generation
Most of the models for time-series data generation use one of the following approaches:
Dynamic stationary processes represent each point in the time series as a sum of deterministic processes with some noise added. This is a widely used approach for modeling time-series with techniques like bootstrapping. However, some prior knowledge of long-term dependencies, like cyclical patterns, has to be incorporated to constrain the deterministic process. This makes it very difficult to model datasets with complex, unknown correlations.
Markov models are a popular approach for modeling categorical time series by representing system dynamics as a conditional probability distribution. Variants, such as Hidden Markov models, have...