Domain Gap Embeddings for Generative Dataset Augmentation

Carnegie Mellon University
*Equal Contribution

CVPR 2024

Can we use off-the-shelf large pre-trained models (LPMs) as synthetic data generators for effective few-shot dataset augmentation toward specific distributions?


To address this question, we propose DoGE (Domain Gap Embeddings), a few-shot, cross-distribution data augmentation framework that is task-agnostic and inference-only.


The performance of deep learning models is intrinsically tied to the quality, volume, and relevance of their training data. Gathering ample data for production scenarios often demands significant time and resources. Among various strategies, data augmentation circumvents exhaustive data collection by generating new data points from existing ones. However, traditional augmentation techniques can be less effective amidst a shift in training and testing distributions.

This paper explores the potential of synthetic data by leveraging large pre-trained models for data augmentation, especially when confronted with distribution shifts. Although recent advancements in generative models have enabled several prior works in cross-distribution data generation, they require model fine-tuning and a complex setup. To bypass these shortcomings, we introduce Domain Gap Embeddings, a plug-and-play semantic data augmentation framework in a cross-distribution few-shot setting. Our method extracts disparities between source and desired data distributions in a latent form, and subsequently steers a generative process to supplement the training set with endless diverse synthetic samples. Our evaluations, conducted on a subpopulation shift and three domain adaptation scenarios under a few-shot paradigm, reveal that our versatile method improves performance across tasks without needing hands-on intervention or intricate fine-tuning. Our method paves the way to effortlessly generate realistic, controllable synthetic datasets following the test distributions, bolstering real-world efficacy for downstream task models.



Our two-phase, inference-only augmentation framework: (left) domain gap extraction; (right) target domain generation.

The figure above shows our proposed framework, Domain Gap Embeddings for Generative Dataset Augmentation.

The left part illustrates the process of Domain Gap Extraction:

  1. The input consists of (a) a source dataset \(\mathcal{D_S}\), with \(|\mathcal{D_S}|=N\), and (b) a few data samples \(\mathcal{D_T}=\{y_j\}_{j=1}^m\) from a different target distribution with \(m\ll N\).
  2. We encode images from a randomly sampled subset \(\hat{\mathcal{D}}_S=\{x_i\}_{i=1}^n \subseteq\) \(\mathcal{D_S}\) and \(\mathcal{D_T}\) into the CLIP space via (c) a CLIP image encoder \(\mathcal{E_I}\), denoted as \(z_{x_i} = \mathcal{E_I}(x_i)\) and \(z_{y_j} = \mathcal{E_I}(y_j)\) respectively.
  3. The (d) Domain Gap Extractor captures a latent representation \(\Delta z\) of the gap between the two distributions.
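The extraction step above can be sketched in a few lines. This is a minimal illustration under the assumption that the gap extractor takes the difference between the mean target and mean source CLIP embeddings; the function name and the random stand-ins for CLIP embeddings are hypothetical.

```python
import numpy as np

def extract_domain_gap(z_source, z_target):
    """Extract a latent domain-gap vector from CLIP image embeddings.

    Assumes a simple extractor: the difference between the mean target
    embedding and the mean source embedding (an assumption for this sketch).

    z_source: (n, d) array of embeddings z_{x_i} from the source subset
    z_target: (m, d) array of embeddings z_{y_j} from the target set, m << n
    """
    return z_target.mean(axis=0) - z_source.mean(axis=0)

# Toy example with random stand-ins for CLIP embeddings (d = 768).
rng = np.random.default_rng(0)
z_src = rng.normal(size=(100, 768))        # n = 100 source embeddings
z_tgt = rng.normal(size=(20, 768)) + 0.5   # m = 20 target embeddings, shifted
delta_z = extract_domain_gap(z_src, z_tgt)
print(delta_z.shape)  # → (768,)
```

In the actual framework the inputs would come from a CLIP image encoder rather than a random generator; the arithmetic on the embeddings is the part being illustrated.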

The right part indicates the step of Target Domain Generation:

  1. We augment source image embeddings \(\{z_{x_i}\}_{i=1}^k\) with the domain gap embedding \(\Delta z\) to yield target-domain embeddings \(\{z_{y_i}\}_{i=1}^k\).
  2. \(z_{y_i} = z_{x_i} + C\cdot \Delta z + \epsilon\) with perturbation \(\epsilon \sim \mathcal{N}(\mathbf{0}, 10^{-3}I)\) and edit strength \(C \sim \mathcal{N}(c, 0.05)\).
  3. The (e) Stable UnCLIP model converts the augmented CLIP embeddings back into images, yielding (f) the synthetic target dataset.
  • Optionally, control images from the (g) Control Image Extractor regulate generation via ControlNet.
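The embedding augmentation in steps 1–2 can be sketched as follows. This is a minimal sketch assuming \(0.05\) and \(10^{-3}\) in the formulas above denote variances; the function name is hypothetical, and the resulting embeddings would then be passed to the Stable UnCLIP decoder.

```python
import numpy as np

def augment_embeddings(z_source, delta_z, c=1.0, rng=None):
    """Shift source CLIP embeddings toward the target domain.

    Implements z_y = z_x + C * delta_z + eps, with a per-sample edit
    strength C ~ N(c, 0.05) and perturbation eps ~ N(0, 1e-3 I),
    reading 0.05 and 1e-3 as variances (an assumption of this sketch).
    """
    rng = rng if rng is not None else np.random.default_rng()
    k, d = z_source.shape
    C = rng.normal(loc=c, scale=np.sqrt(0.05), size=(k, 1))     # edit strength
    eps = rng.normal(loc=0.0, scale=np.sqrt(1e-3), size=(k, d))  # perturbation
    return z_source + C * delta_z + eps

# Toy example: shift 10 embeddings of dimension 4 by a unit gap vector.
rng = np.random.default_rng(0)
z_src = np.zeros((10, 4))
delta_z = np.ones(4)
z_aug = augment_embeddings(z_src, delta_z, c=1.0, rng=rng)
print(z_aug.shape)  # → (10, 4)
```

Sampling a fresh \(C\) and \(\epsilon\) per source embedding is what lets the same source set yield endlessly diverse synthetic samples.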


To illustrate the versatility and efficacy of DoGE, we evaluate performance improvements in various experiments including semantic augmentation on faces and style transfer on objects.

[Figure: semantic augmentation results]

The figure above shows the effectiveness of semantic augmentation. We use a subset of the CelebA dataset with perceived males wearing eyeglasses, and vice versa. Selecting as few as 20 images from the target data distribution, we successfully add or remove eyeglasses from the faces.

The figure below illustrates our success in transferring styles. Starting from the real domain of the DomainNet dataset, we successfully convert realistic pictures of objects into four other styles.

[Figure: style transfer results]
The table below provides quantitative assessments of the synthetic datasets created with DoGE. Compared to other general data synthesis methods, we achieve the highest improvements across all tasks. More experiments and comparisons are detailed in the paper.

[Table: quantitative comparison of synthesis methods]


@inproceedings{wang2024domain,
      author={Wang, Yinong Oliver and Chung, Younjoon and Wu, Chen Henry and De la Torre, Fernando},
      title={Domain Gap Embeddings for Generative Dataset Augmentation},
      booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
      year={2024}
}