The Unmet Promise of Synthetic Training Images: Using Retrieved Real Images Performs Better (2024)

Scott Geng    Cheng-Yu Hsieh    Vivek Ramanujan    Matthew Wallingford
Chun-Liang Li      Pang Wei Koh♣♠*      Ranjay Krishna♣♠*

University of Washington  Allen Institute for AI
sgeng@cs.washington.edu    *Equal advising.

Abstract

Generative text-to-image models enable us to synthesize unlimited amounts of images in a controllable manner, spurring many recent efforts to train vision models with synthetic data. However, every synthetic image ultimately originates from the upstream data used to train the generator. What additional value does the intermediate generator provide over directly training on relevant parts of the upstream data? Grounding this question in the setting of image classification, we compare finetuning on task-relevant, targeted synthetic data generated by Stable Diffusion—a generative model trained on the LAION-2B dataset—against finetuning on targeted real images retrieved directly from LAION-2B. We show that while synthetic data can benefit some downstream tasks, it is universally matched or outperformed by real data from our simple retrieval baseline. Our analysis suggests that this underperformance is partially due to generator artifacts and inaccurate task-relevant visual details in the synthetic images. Overall, we argue that retrieval is a critical baseline to consider when training with synthetic data—a baseline that current methods do not yet surpass. We release code, data, and models at https://github.com/scottgeng00/unmet-promise.

1 Introduction

The success of modern machine learning systems fundamentally relies on the quantity [32, 9, 27, 54, 49, 62], quality [19, 69, 43, 42, 35], and distribution [16, 21, 59, 67, 8] of the data they are trained on. However, acquiring large amounts of quality data remains challenging, due to the sheer cost of data collection and annotation. As demand for training data continues to rise, the field is actively exploring approaches to automatically curate data at scale [19, 2, 17]. One burgeoning approach is to source synthetic training data from conditional generative models. Generative models enable data to be tailored to specific requirements and generated at scale, presenting a promising alternative to the challenges of real data curation. Recent work highlights this potential: for example, in natural language processing (NLP), researchers prompt strong proprietary language models to cheaply synthesize large-scale datasets for instruction tuning [65, 28, 58].

[Figure 1]

Analogously, in computer vision—the focus of our research—many recent works train models on synthetic images from modern text-to-image generators, aiming to achieve state-of-the-art visual recognition performance [53, 60, 3, 22, 23]. For example, SynCLR [59] cleverly prompts Stable Diffusion for synthetic images tailored to pre-specified downstream image recognition domains; they find that a CLIP-like model trained from scratch on the resulting targeted synthetic images can outperform CLIP trained on LAION-2B, a significantly larger untargeted dataset of real images. This result is quite surprising. Stable Diffusion is also trained on LAION-2B, so by the data processing inequality, the synthetic images it generates cannot contain any additional information over the images in LAION-2B. Yet, training on these derivative synthetic images appears to outperform training directly on LAION-2B. How do we make sense of these additional gains?

In this paper, we argue that the performance gained by training on generated synthetic images needs to be contextualized against a critical baseline missing in prior work: training on real images from the generative model’s pretraining data. In particular, prior work has often compared task-targeted synthetic images to general, untargeted real images, thereby entangling the effects of training on synthetic versus real images with the effects of targeted versus general data collection. However, these variables are not intrinsically conflated. Any generative model we use to synthesize images fundamentally derives from its upstream training data. Instead of using that upstream data to train an intermediate generative model and synthesize targeted synthetic images, we can alternatively seek to directly retrieve targeted real images from the upstream source (Figure 1). By comparing synthetic training data against this retrieval baseline, we isolate the value added by the generative model.

We formalize our study under the ubiquitous problem of task adaptation, where we seek to curate task-targeted images to finetune a pretrained vision model. We empirically compare training on targeted synthetic images generated from Stable Diffusion 1.5—a text-to-image model trained on the upstream LAION-2B dataset—against training on targeted real images retrieved from LAION-2B itself. We perform hundreds of experiment runs across an order of magnitude of data scales on five visual recognition tasks (ImageNet [13], Describable Textures (DTD) [11], FGVC-Aircraft [39], StanfordCars [34], and Oxford Flowers102 [44]) where training on synthetic data has shown promise [59, 22].

Together, we find that training on targeted real data retrieved from a generative model’s upstream training dataset outperforms training on synthetic data from the generative model. For example, while training on targeted synthetic images can improve downstream accuracy by up to 7.1% (absolute) on its best-case benchmark (FGVC-Aircraft), training on targeted real images helps even further, boosting accuracy by a massive 17.7%. On other benchmarks, such as ImageNet, we find that training on synthetic images can sometimes hurt performance even when training on real data improves it. We further show that these findings hold across several different versions of Stable Diffusion, as well as when we train on a mix of synthetic and real data. Our analysis suggests that the consistent underperformance of models trained on synthetic images is partially due to low-level generator artifacts in the synthetic images (e.g., blurs), and partially because synthetic images may distort high-level class-specific visual details that real images preserve.

Overall, we conclude that retrieval is a critical baseline to consider when evaluating the true added utility of generated synthetic training data. Our goal is not to make normative claims about whether synthetic data will ever surpass this standard, but to contribute a simple baseline to aim for, and a clean set of experiments to explicitly measure progress towards surpassing it. For instance, by conceptualizing retrieval from a generator’s training data as a strong alternative to synthesizing data, a natural future direction for improving synthetic training data is to synthesize image compositions that are explicitly absent from the generator’s upstream training set. Images generated in this manner may offer unique value beyond what can be retrieved from the training data. Finally, in settings where the upstream dataset of a generator is unavailable altogether (e.g., due to privacy concerns, due to proprietary data, or due to download bandwidth restrictions), the retrieval baseline is unrealizable by assumption; synthetic data therefore retains strong utility for distilling knowledge from generative models and for privacy preservation. We release all code, models, and over 1TB of generated images to guide future work (https://github.com/scottgeng00/unmet-promise).

2 Related Work

Learning from synthetic data.

Synthetic data has been widely explored in the context of many machine learning problems [65, 28, 24, 31, 58, 55, 5, 38]. For example, in NLP, synthetic data generated from strong large language models [45, 10] has been used to distill instruction-following behavior [65] and task-specific knowledge [29] into smaller models. In computer vision, prior works have sought to use synthetic data to improve the state-of-the-art across a breadth of visual tasks, such as object detection [46, 30], semantic segmentation [52, 50, 7], and optical flow [14]. Traditionally, this synthetic training data has been sourced from expert-crafted simulation and rendering pipelines [47, 14, 50, 7, 52, 30]. Recent advances in text-to-image synthesis via diffusion models [56, 26, 51] are changing this paradigm, inspiring a new wave of works that seek to train visual models on synthetic data algorithmically sampled from modern generative models [23, 60, 53, 22, 71]. This structural shift in the source of synthetic images—from expert-supervised programmatic simulation to a learned generator that itself derives supervision from upstream data—raises a critical question: does the intermediate step of training a generator and sampling synthetic data provide any gains over simply training on the upstream data directly? Our work formalizes and empirically grounds this question, contributing experiments and baselines to rigorously measure the benefits of training on modern data-derived synthetic data.

Adapting pretrained vision models. Large-scale pretrained image models such as CLIP [48, 9] offer transferable visual features that benefit a wide range of downstream vision tasks. It is now common practice to use pretrained models as a starting point when deploying downstream task-specific models instead of training them from scratch [66, 63]. From an algorithmic perspective, many methods have been proposed to adapt CLIP models to downstream tasks, each with varying trade-offs [68, 70, 40, 20, 4, 66]. We choose to study simple full finetuning, centering our work on the data we adapt on as opposed to the algorithm. In particular, the quality and relevance of adaptation data has a crucial impact on downstream task performance; distribution shifts at inference time can significantly hurt performance [33]. Acquiring task-targeted data thus remains an active area of research [37, 64, 24]. Most closely related to our work are [37, 64], which also employ retrieval as a technique for collecting task-targeted data. Our work builds upon these methods to construct baselines for systematically measuring the true added utility of model-generated synthetic training images.

3 Problem Setting and Method

Given a large dataset $\mathcal{D}$ of general real image-text pairs and a downstream visual classification task specified as a set of text class names $\mathcal{C}$, we aim to algorithmically curate a targeted adaptation dataset $\mathcal{D}_{\mathcal{C}}$ of images $x_i$ and one-hot class labels $y_i$ to finetune and improve a pretrained vision model’s performance on the downstream task. We compare two high-level approaches for sourcing this targeted data, shown in Figure 1: (1) we retrieve targeted real images directly from $\mathcal{D}$, forming a targeted retrieved dataset $\mathcal{D}_{\mathcal{C}}^{(\text{retrieved})} \subset \mathcal{D}$. Alternatively, (2) we generate targeted synthetic images by prompting an intermediate text-to-image generative model trained on $\mathcal{D}$, forming a targeted synthetic dataset $\mathcal{D}_{\mathcal{C}}^{(\text{synthetic})}$. We detail each approach below.

3.1 Sourcing data by generating synthetic images

We follow SynCLR [59], a representative method for curating synthetic training data from off-the-shelf text-to-image models. Given the set of visual class names $\mathcal{C}$, we first synthesize a large corpus of corresponding image captions by prompting a large language model (details in Appendix B.1). For example, if the class name $c \in \mathcal{C}$ is “rose,” then a generated caption might be “a close-up of a pink rose in bloom.” We then use these captions as input for a text-to-image generator $G$ trained on the upstream data $\mathcal{D}$, yielding a large set of synthesized images $\widetilde{x_i}$. Each image is assigned a class label $y_i$ based on the class name $c$ used to synthesize its caption. These synthetic images and labels $\{(\widetilde{x_i}, y_i)\}$ form our curated dataset $\mathcal{D}_{\mathcal{C}}^{(\text{synthetic})}$.
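A minimal sketch of the generation step, assuming the LLM-produced captions are already available as a mapping from class name to caption list and using the Hugging Face diffusers interface to Stable Diffusion 1.5; sampler settings and batch sizes are illustrative rather than the paper's exact configuration:

```python
# Sketch: generate class-targeted synthetic images with Stable Diffusion 1.5.
# `captions_per_class` maps each class name to its LLM-generated captions
# (cf. Appendix B.1); all hyperparameters here are placeholders.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

def generate_targeted_synthetic(captions_per_class, images_per_caption=1):
    dataset = []  # list of (PIL image, class label index) pairs
    class_names = sorted(captions_per_class)
    for label, class_name in enumerate(class_names):
        for caption in captions_per_class[class_name]:
            images = pipe(caption, num_images_per_prompt=images_per_caption).images
            dataset.extend((img, label) for img in images)
    return dataset
```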

3.2 Sourcing data by retrieving real images

Rather than querying a generator trained on an upstream dataset $\mathcal{D}$, we can directly train on parts of $\mathcal{D}$ itself by retrieving relevant data. $\mathcal{D}$ consists of image-text pairs $(x_i, t_i)$. To retrieve relevant pairs, we consider two strategies. We additionally deduplicate all retrieved images with respect to our evaluation datasets following [19] to minimize test set leakage. We apply NSFW filtering [54].

Strategy 1: hard substring matching.

Inspired by [64], we retrieve the set $\mathcal{D}_{\mathcal{C}}^{(\text{retrieved})}$ of all images $x_i$ whose corresponding caption $t_i$ contains at least one target class name $c \in \mathcal{C}$ as a substring:

\[
\mathcal{D}_{\mathcal{C}}^{(\text{retrieved})} = \left\{ (x_i, y_i) \,:\, (x_i, t_i) \in \mathcal{D} \text{ such that some class } c \in \mathcal{C} \text{ is a substring of } t_i \right\}.
\]

Here, label $y_i$ is assigned based on the class $c$ contained in $t_i$. If an image-text pair $(x_i, t_i) \in \mathcal{D}$ has text $t_i$ containing multiple class names $c, c' \in \mathcal{C}$, then we simply retrieve $x_i$ multiple times and assign each instance a different label, once for each unique matched class name.
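A minimal sketch of this strategy over an in-memory list of image-text pairs; case-insensitive matching is an assumption on our part rather than a detail stated above:

```python
# Sketch: hard substring matching over captioned image-text pairs.
# `pairs` is an iterable of (image_id, caption); an image is retrieved once
# per distinct matched class name, mirroring the multi-label rule above.
def substring_retrieve(pairs, class_names):
    retrieved = []  # list of (image_id, label index)
    lowered = [(label, name.lower()) for label, name in enumerate(class_names)]
    for image_id, caption in pairs:
        caption_lc = caption.lower()
        for label, name in lowered:
            if name in caption_lc:
                retrieved.append((image_id, label))
    return retrieved
```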

Strategy 2: semantic $k$-NN retrieval.

Hard substring matching is simple and effective when the target visual concepts $c \in \mathcal{C}$ are concrete entities that are likely to be described in text captions (e.g., $c =$ “fire lily”), but may be less effective when the concepts are abstract (e.g., $c =$ “lined texture”). Thus, we also consider semantic (soft) retrieval via CLIP image-text embedding space similarity (we perform semantic retrieval using precomputed LAION-2B embeddings from OpenAI CLIP ViT-L/14 [48], the same model Stable Diffusion uses to embed text prompts during generation). We convert each target class name $c \in \mathcal{C}$ into a set of natural language search queries $Q_c$ based on the templates from the original CLIP paper [48]. For each query $q_c \in Q_c$, we use approximate $k$-NN search [15] to retrieve the set $S_{q_c}$ of $k$-nearest image-text pairs $(x_i, t_i) \in \mathcal{D}$ by CLIP similarity between the query $q_c$ and either the image $x_i$ or the text $t_i$:

\[
S_{q_c} = \Bigl\{ \operatorname*{arg\,top\text{-}k}_{(x_i, t_i) \in \mathcal{D}} \mathrm{CLIP}(x_i, q_c) \Bigr\} \cup \Bigl\{ \operatorname*{arg\,top\text{-}k}_{(x_i, t_i) \in \mathcal{D}} \mathrm{CLIP}(t_i, q_c) \Bigr\}.
\]

We assign each image-text pair $(x_i, t_i) \in S_{q_c}$ a class label $y_i$ based on the class name in query $q_c$. We form the targeted dataset $\mathcal{D}_{\mathcal{C}}^{(\text{retrieved})}$ by unioning over all queries $q_c \in Q_c$ and all classes $c \in \mathcal{C}$:

\[
\mathcal{D}_{\mathcal{C}}^{(\text{retrieved})} = \bigcup_{c \in \mathcal{C}} \; \bigcup_{q_c \in Q_c} \left\{ (x_i, y_i) : (x_i, t_i, y_i) \in S_{q_c} \right\}.
\]
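A sketch of how this retrieval can be implemented with FAISS over precomputed, L2-normalized CLIP embeddings, so that inner-product search equals cosine similarity; the prebuilt index objects and the encode_text helper are assumed placeholders, not artifacts released with the paper:

```python
# Sketch: semantic k-NN retrieval over LAION-2B image and caption embeddings.
# `image_index` and `text_index` are prebuilt faiss indexes; `encode_text`
# embeds a query with the same CLIP ViT-L/14 text tower.
import faiss
import numpy as np

def knn_retrieve(queries_per_class, image_index, text_index, encode_text, k=1000):
    retrieved = {}  # maps upstream row id -> class label index
    for label, queries in enumerate(queries_per_class):  # one query list per class
        for query in queries:
            q = encode_text(query).astype(np.float32)[None, :]
            faiss.normalize_L2(q)
            for index in (image_index, text_index):  # union over image- and text-side hits
                _, ids = index.search(q, k)  # top-k by CLIP similarity
                for i in ids[0]:
                    retrieved[int(i)] = label
    return retrieved
```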

3.3 Additional data filtering and postprocessing

Data filtering has been shown to improve training performance for both real and synthetic data and is widely used in practice [19, 23]. We filter both our synthetic and retrieved datasets. Given a curated dataset $\mathcal{D}_{\mathcal{C}}$, we compute the CLIP similarity of each $x_i \in \mathcal{D}_{\mathcal{C}}$ with text corresponding to its assigned label $y_i$ (e.g., “a photo of {class name}”), constructed using the CLIP zero-shot classification templates [48]. When there are multiple templates for a given class, we aggregate by taking the maximum similarity across templates. We keep the top 30% of images by aggregate similarity. Intuitively, filtering helps remove generated and retrieved images with class-misaligned content. For example, an image labeled “dog” but without any dogs present (i.e., due to retrieval or generation errors) would receive a lower CLIP similarity score and thus likely be discarded.
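A minimal sketch of the top-30% CLIP filter, assuming a clip_similarity helper that returns the cosine similarity between an image and a text prompt:

```python
# Sketch: keep the top 30% of a curated dataset by CLIP image-text similarity,
# aggregating over the zero-shot prompt templates with a max.
import numpy as np

def clip_filter(dataset, class_names, templates, clip_similarity, keep_frac=0.30):
    scores = []
    for image, label in dataset:
        prompts = [t.format(class_names[label]) for t in templates]
        scores.append(max(clip_similarity(image, p) for p in prompts))
    cutoff = np.quantile(scores, 1.0 - keep_frac)  # e.g., 70th percentile
    return [example for example, s in zip(dataset, scores) if s >= cutoff]
```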

Our synthetic adaptation datasets $\mathcal{D}_{\mathcal{C}}^{(\text{synthetic})}$ are class-balanced by construction (i.e., we uniformly generate images for each class). We further postprocess the retrieved adaptation datasets $\mathcal{D}_{\mathcal{C}}^{(\text{retrieved})}$ to improve class balancing by manually fixing a global threshold $M$ and truncating the dataset such that each class label $y_i$ occurs at most $M$ times.
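The truncation step, sketched with a global per-class cap $M$ over an in-memory list of (image, label) pairs:

```python
# Sketch: truncate a retrieved dataset so each class label appears at most
# max_per_class times, improving class balance.
from collections import Counter

def balance_classes(dataset, max_per_class):
    counts, balanced = Counter(), []
    for image, label in dataset:
        if counts[label] < max_per_class:
            counts[label] += 1
            balanced.append((image, label))
    return balanced
```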

4 Main Experiments

We seek to measure the utility of learning from model-generated synthetic images. Grounding this question empirically, our experiments compare (1) finetuning a pretrained CLIP model on targeted synthetic images $\mathcal{D}_{\mathcal{C}}^{(\text{synthetic})}$ against (2) finetuning on targeted retrieved real images $\mathcal{D}_{\mathcal{C}}^{(\text{retrieved})}$.

Benchmarks. We focus on five downstream tasks where synthetic data has shown promise compared to similar-scale untargeted real data [59]. We select (a) ImageNet-1K [13] and Describable Textures (DTD) [11] to evaluate recognition performance on broad categories and (b) FGVC-Aircraft [39], StanfordCars [34], and Oxford Flowers102 [44] to evaluate performance in fine-grained settings. We use standard pre-defined train, test, and validation splits when available, and otherwise randomly subset the training set to create missing train-validation splits (details in Appendix C.2).

[Figure 2]

Finetuning data curation. For each downstream benchmark, we first curate an adaptation dataset $\mathcal{D}_{\mathcal{C}}$ (Section 3) by either (1) generating synthetic images with Stable Diffusion 1.5 [51], trained on the LAION-2B dataset [54], or (2) retrieving real images directly from LAION-2B. We treat the choice between our substring-based and semantic retrieval strategies as a hyperparameter, using downstream validation set accuracy to determine the best choice for each benchmark. Hyperparameters for retrieval are detailed in Appendix C.1.

Model adaptation and evaluation. We adapt a LAION-2B pretrained CLIP ViT-B/16 [9] image encoder by finetuning on the curated adaptation dataset $\mathcal{D}_{\mathcal{C}}$ with a cross-entropy classification loss for a pre-set number of epochs. To elucidate the scaling trends of synthetic and retrieved data, we finetune across an order of magnitude of different adaptation dataset scales, subsampled from the full targeted adaptation dataset $\mathcal{D}_{\mathcal{C}}$. We report zero-shot (ZS) and linear probing (LP) test set accuracy, using the benchmark train set to train the linear head. For both LP and ZS evaluation, we use the validation set to identify the best epoch and finetuning hyperparameters. For each data scale, we aggregate accuracy across the results of at least three random seeds, and report the standard deviation due to seed randomness. Additional training and full hyperparameter details are provided in Appendix C.3.
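A minimal sketch of the finetuning step using open_clip; the linear classification head, optimizer, and learning rate below are illustrative placeholders rather than the paper's exact recipe (Appendix C.3 of the paper holds the full details):

```python
# Sketch: full finetuning of a LAION-2B pretrained CLIP ViT-B/16 image encoder
# with a cross-entropy classification loss on a curated adaptation dataset.
import torch
import torch.nn as nn
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-16", pretrained="laion2b_s34b_b88k"  # LAION-2B pretrained checkpoint
)
encoder = model.visual.cuda()  # image tower only

def finetune(encoder, adaptation_loader, num_classes, num_epochs=10, lr=1e-5):
    """adaptation_loader: torch DataLoader yielding (image batch, label batch)."""
    head = nn.Linear(encoder.output_dim, num_classes).cuda()
    params = list(encoder.parameters()) + list(head.parameters())
    optimizer = torch.optim.AdamW(params, lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    encoder.train()
    for _ in range(num_epochs):  # pre-set number of epochs
        for images, labels in adaptation_loader:
            logits = head(encoder(images.cuda()))
            loss = loss_fn(logits, labels.cuda())
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return encoder, head
```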

4.1 Main results: synthetic training data lags behind a baseline of retrieved real images

We present our main zero-shot and linear probing scaling results in Figure 2.

At equivalent data scales, finetuning with model-generated synthetic images can help, but is universally matched or outperformed by finetuning directly with images from the generator’s training data.

Consistent with prior research [59], we find that training with targeted synthetic data can improve an unadapted model. For example, on FGVC-Aircraft—the setting where previous works have found the strongest gains—finetuning with 139K Stable Diffusion-generated images improves downstream linear probing accuracy by an average of 3.8 percentage points over an off-the-shelf CLIP model (64.9% → 68.7%); on DTD, training with 110K synthetic images improves zero-shot accuracy by 3.3 points (56.3% → 59.6%).

However, the gains from training on synthetic data are consistently matched or surpassed by training on retrieved real data. For instance, on FGVC-Aircraft, finetuning with an equivalent 139K LAION-2B retrieved images boosts performance by a massive 17.8 points (64.9% → 82.7%). Moreover, adapting with retrieved data can improve performance even when synthetic data does not (e.g., on ImageNet and Flowers102 zero-shot accuracy). Finally, adapting with synthetic data can sometimes even hurt performance (ImageNet, StanfordCars, Flowers102 zero-shot), while targeted retrieved data improves or at least does not hurt performance across all settings considered. Given equal amounts of targeted retrieved and synthetic data, retrieved data is the clear winner.

Synthetic data can sometimes decrease the gap with retrieved data given additional scale, but remains behind.

The amount of data we can retrieve is fundamentally limited by the finite upstream data pool. For example, even after searching all 2 billion LAION-2B samples for images containing an FGVC-Aircraft class name in the caption, substring-based retrieval returned only 139K targeted images post-filtering. In contrast, it is straightforward to create ever-larger synthetic datasets by simply generating more data.

Scaling the synthetic adaptation dataset size beyond the amount of retrieved data considered (illustrated in the gray-shaded regions of Figure 2), we find that increasing the amount of targeted synthetic data does not always improve performance. For example, on DTD, synthetic data exhibits an inverted-U scaling trend: performance improves up to 110K synthetic training images, after which it declines. On ImageNet, Flowers102, and StanfordCars, increasing the synthetic dataset size consistently hurts zero-shot accuracy and has minimal impact on linear probing performance.

On FGVC-Aircraft, scaling helps; there is a log-linear relationship between the size of the synthetic adaptation dataset and downstream linear probing accuracy (e.g., scaling from 139K → 250K synthetic images improves linear probing accuracy from 68.7% → 70.7%). However, synthetic data still lags behind retrieved data: matching the performance of a mere 15K retrieved aircraft images requires scaling the synthetic dataset to 500K images, reflecting a ∼33x difference in dataset size and required finetuning compute. Naively extrapolating this ratio, matching the performance of the full 139K retrieved adaptation dataset would require nearly 5M synthetic images after top-30% filtering. We note, however, that synthetic data is unlikely to truly scale infinitely, as synthetic data fundamentally derives from the (finite) training set of our generative model. Still, the performance of synthetic data is likely unsaturated at the 500K scale (i.e., accuracy is still trending up); due to compute limitations, we leave studying whether further scaling can outperform retrieved data to future work.
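The extrapolation behind the "nearly 5M" estimate is a simple back-of-the-envelope calculation:

\[
\frac{500\text{K synthetic}}{15\text{K retrieved}} \approx 33\times,
\qquad
139\text{K retrieved} \times 33 \approx 4.6\text{M synthetic images (post-filtering)}.
\]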

Synthetic data can improve a model’s task representation without significantly improving the model’s task performance.

Broadly speaking, zero-shot task accuracy measures a model’s ability to directly solve the downstream task, whereas linear probing accuracy measures the quality of the model’s learned task-relevant representation. We find that even when training on synthetic data improves the model’s representation (i.e., downstream linear probing accuracy), it may not significantly improve the model’s zero-shot accuracy. In contrast, when training on retrieved data improves the model’s representation, zero-shot accuracy also exhibits positive scaling. For example, on FGVC-Aircraft, CLIP adapted with either 15K retrieved images or 500K synthetic images achieves a similar linear probing accuracy (∼72%), yet the model adapted with synthetic data achieves a much worse zero-shot accuracy (28.9% versus 39.5%). We discuss possible reasons for this qualitative discrepancy in model behavior in our analyses below (Section 5.1).
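As a concrete sketch of the two evaluation protocols, assuming helper functions encode_image and encode_text that return L2-normalized CLIP embeddings and an illustrative prompt template:

```python
# Sketch: zero-shot accuracy classifies test images against CLIP text embeddings
# of the class prompts; linear probing fits a logistic-regression head on frozen
# image features from the benchmark train set.
import numpy as np
from sklearn.linear_model import LogisticRegression

def zero_shot_accuracy(encode_image, encode_text, test_set, class_names,
                       template="a photo of a {}."):
    text_emb = np.stack([encode_text(template.format(c)) for c in class_names])
    img_emb = np.stack([encode_image(x) for x, _ in test_set])
    preds = (img_emb @ text_emb.T).argmax(axis=1)  # cosine similarity, normalized embeddings
    labels = np.array([y for _, y in test_set])
    return float((preds == labels).mean())

def linear_probe_accuracy(encode_image, train_set, test_set):
    X_tr = np.stack([encode_image(x) for x, _ in train_set])
    y_tr = [y for _, y in train_set]
    X_te = np.stack([encode_image(x) for x, _ in test_set])
    y_te = [y for _, y in test_set]
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return probe.score(X_te, y_te)
```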

5 Analysis

In this section, we explore two questions to better understand our main results. First, what factors drive the underperformance of synthetic data? Second, do our findings hold under variations of our experimental setup? We focus our analysis experiments on ImageNet, to understand general image recognition performance, and FGVC-Aircraft, the sole benchmark where synthetic data exhibited strong positive log-linear scaling.

5.1 Why does synthetic data lag retrieved real data?

Qualitative visualizations.

We visualize a random selection of images from our curated synthetic and retrieved adaptation datasets in Figure 3. Compared to retrieved real images, we observe that the synthetic images (1) contain low-level generator artifacts, and (2) differ in visual content distribution, both in terms of semantic details and overall image composition. For example, although the synthetic FGVC-Aircraft adaptation images (top two rows of Figure 3) are recognizable as airplanes, the visual content often contains incorrect class-relevant semantic details: a correctly-depicted “Airbus A320” should have one engine per wing and two sets of wheels at its rear, yet our synthetic images often exhibit incorrect engine or wheel configurations. This qualitative discrepancy in visual detail precision may partially explain why training on synthetic data does not improve task zero-shot accuracy; synthetic images do not retain enough class-accurate details to directly teach the model the downstream task. In contrast, training on synthetic images can improve linear probing accuracy, because the synthetic images still broadly look like aircraft and thus may help align the model’s representation to the downstream domain.

[Figure 3]

Synthetically perturbing retrieved real images.

To disentangle the effects of low-level generator artifacts and visual content differences between synthetic and retrieved real images on downstream model performance, we trained on “hybrid” images that have similar semantic visual content as our retrieved real images but contain generator artifacts like our synthetic images. Following SDEdit [41], we use Stable Diffusion to synthetically perturb our retrieved images, introducing the model-specific artifacts present in the synthetic images Stable Diffusion generates. Given a noise strength parameter $\gamma \in [0, 1]$ and a retrieved image $x_0$, SDEdit adds Gaussian noise to $x_0$ according to timestep $t = \gamma$ of Stable Diffusion’s time-dependent forward process. We then denoise the noisy image using the same reverse diffusion process as in text-to-image generation, yielding a perturbed image $x^{(\gamma)}$ that looks semantically like $x_0$ while also containing Stable Diffusion-specific artifacts. Increasing $\gamma$ increases the amount of Gaussian noise added to $x_0$, thereby increasing the severity of visual artifacts introduced in the resulting $x^{(\gamma)}$. In pseudocode,

\[
x^{(\gamma)} = \text{StableDiffusion.denoise}\bigl(\text{StableDiffusion.add\_noise}(x_0, \gamma), \gamma\bigr).
\]
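A hedged sketch of this perturbation using the diffusers img2img pipeline, whose strength argument plays the role of $\gamma$; the conditioning prompt and checkpoint name are illustrative choices, not necessarily the paper's exact setup:

```python
# Sketch: SDEdit-style perturbation of a retrieved real image. Larger `strength`
# adds more noise before denoising, introducing stronger Stable Diffusion-specific
# artifacts while roughly preserving the input's semantic content.
import torch
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

def perturb(image, class_name, gamma):
    prompt = f"a photo of a {class_name}"  # illustrative conditioning prompt
    return pipe(prompt=prompt, image=image, strength=gamma).images[0]
```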

Starting from the full targeted retrieved adaptation datasets $\mathcal{D}_{\mathcal{C}}^{(\text{retrieved})}$ for FGVC-Aircraft and ImageNet, we use SDEdit to introduce generator artifacts into the retrieved real images over a range of $\gamma$ values and visualize the resulting perturbed images in Figure 4. We plot the results of training on these perturbed images across $\gamma$ in Figure 5.

[Figure 4]
[Figure 5]

Our results suggest three takeaways. First, generator artifacts indeed contribute to the underperformance of synthetic training images, especially for fine-grained classification tasks. On FGVC-Aircraft, any amount of added generator artifacts drops downstream accuracy. Second, the impact of artifacts is relatively lower for broad classification domains such as ImageNet, where downstream performance is not significantly impacted until we perturb with a relatively strong noise strength of $\gamma = 0.5$. Finally, visual content differences between synthetic and retrieved images also play a key role in the performance gap between synthetic and retrieved training data. When we perturb images with strength $\gamma = 0.5$, the resulting images are heavily afflicted with artifacts, but still retain the important class-relevant details of retrieved real images, such as correct airplane engine configurations. Training on these $\gamma = 0.5$ images significantly outperforms training on synthetic images. Intriguingly, training on aircraft images perturbed beyond the point where class-relevant visual details are damaged ($\gamma \geq 0.6$) still outperforms synthetic data; we speculate that this is because these heavily perturbed images still retain the overall image composition of retrieved images.

5.2 Synthesizing data via another generative model

For our main scaling experiments, we generate synthetic image datasets using Stable Diffusion 1.5 to maintain consistency with prior work [59, 22]. To what degree does our choice of generative model impact our findings? At the time of our study, Stable Diffusion 1.x models are the only modern text-to-image models with open-source training data available to retrieve from. Therefore, we focus our study here on Stable Diffusion (SD) versions 1.1, 1.3, and 1.5. Starting from SD v1.1, which is trained on the full LAION-2B dataset, SD v1.3 and v1.5 are derived by further finetuning on high-quality subsets of LAION-2B. This additional finetuning improves image generation fidelity [51], but may lead to the model forgetting parts of the LAION-2B distribution [18]. Following our main experiment setup (Section 4), we use SD v1.1 and SD v1.3 to generate targeted synthetic adaptation datasets of various sizes for ImageNet and FGVC-Aircraft. Results are plotted in Figure 6. Overall, while training with synthetic data from different generative models yields varying performance, synthetic data from all generative models considered consistently falls short of retrieval.

5.3 Mixing synthetic and retrieved data

On FGVC-Aircraft, finetuning CLIP with either synthetic or retrieved data alone consistently improves downstream task accuracy. The gains from retrieved data are stronger than the gains from synthetic data across all data scales; however, synthetic data may improve CLIP in ways that are complementary to retrieved data, and thus present orthogonal value. To test this possibility, we measure whether training on a mix of synthetic and retrieved Aircraft adaptation data significantly outperforms training with either alone. Starting from our largest retrieved adaptation dataset $\mathcal{D}_{\mathcal{C}}^{(\text{retrieved})}$ (139K images), we progressively add increasing amounts of synthetic images from our synthetic adaptation dataset $\mathcal{D}_{\mathcal{C}}^{(\text{synthetic})}$ and finetune a pretrained CLIP model on the resulting mix, as sketched below. We plot results in Figure 7. We find that training on the mixed images outperforms training on synthetic images alone; however, training on a mix significantly drops performance compared to using retrieved data alone.
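A minimal sketch of how such mixtures can be constructed, assuming the retrieved and synthetic datasets are in-memory lists of (image, label) pairs and that the mixture sizes are placeholders:

```python
# Sketch: build mixed adaptation sets by adding increasing amounts of randomly
# sampled synthetic images to the full retrieved dataset before finetuning.
import random

def build_mixtures(retrieved, synthetic, synthetic_amounts, seed=0):
    rng = random.Random(seed)
    mixtures = {}
    for n in synthetic_amounts:
        mixtures[n] = list(retrieved) + rng.sample(synthetic, n)
    return mixtures
```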

[Figure 6]

[Figure 7]

6 Discussion

Our work sought to answer a key question: given that all model-generated synthetic images derive from the generator’s upstream training data, does training on synthetic images provide value over training directly on the relevant parts of the upstream real data? We contribute a set of rigorous experiments to ground this question empirically, and discover that training on upstream real images collected via our simple retrieval baseline significantly outperforms training on synthetic images. Our initial question is answered negatively. We therefore argue that retrieval is a critical baseline to surpass in order to show value from synthetic training data, and encourage comparison against it in future research.

Importantly, we do not seek to make normative claims about whether training with synthetic images will ever surpass this baseline—future work may unlock gains that we have not yet found. As a first step, we contribute analyses of why synthetic training images underperform upstream real images, finding that both generator artifacts and semantic errors within synthetic images are key areas for future improvement. Furthermore, given that image retrieval is a strong alternative to image synthesis, a natural next step is to generate image compositions that are explicitly absent from the generator’s upstream training dataset; synthesizing these “missing” images may offer unique value beyond the existing upstream real images. Such an approach leverages the compositional generalization abilities of the generator, which recent research promisingly suggests may be stronger than the compositionality of a discriminative model trained on the same upstream data [36, 12].

Finally, our findings assume access to the generative model’s upstream training data, an assumption that may not always hold. The upstream pool may be proprietary or strictly regulated due to privacy concerns. In such settings, training directly on the upstream data is impossible; synthetic data from a generative model trained on this unavailable upstream data remains an exciting alternative to acquire otherwise inaccessible information.

Limitations.

As an empirical study, our work is constrained by our compute budget in the number of experimental variations we consider. Our results are derived from adapting CLIP models with standard full finetuning; we conjecture that our findings generalize to other pretrained backbones and adaptation methods as well, but we were not able to test this empirically. Moreover, at the time of our work, Stable Diffusion is the only text-to-image model with publicly available training data to retrieve from (i.e., LAION-2B); we do not study other generators trained on other data pools. Finally, we focus on model accuracy, leaving a comparison of model robustness and fairness from training on synthetic versus real data to future work.

Acknowledgments and Disclosure of Funding

We graciously thank (in alphabetical order) Eric Frankel, Jacqueline He, Athena Tsu, and Rui Xin for their helpful comments and feedback. SG is supported by the NSF GRFP.

References

  • Achiam et al. [2023] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
  • Albalak et al. [2024] Alon Albalak, Yanai Elazar, Sang Michael Xie, Shayne Longpre, Nathan Lambert, Xinyi Wang, Niklas Muennighoff, Bairu Hou, Liangming Pan, Haewon Jeong, et al. A survey on data selection for language models. arXiv preprint arXiv:2402.16827, 2024.
  • Azizi et al. [2023] Shekoofeh Azizi, Simon Kornblith, Chitwan Saharia, Mohammad Norouzi, and David J. Fleet. Synthetic data from diffusion models improves ImageNet classification. arXiv preprint arXiv:2304.08466, 2023.
  • Bahng et al. [2022] Hyojin Bahng, Ali Jahanian, Swami Sankaranarayanan, and Phillip Isola. Exploring visual prompts for adapting large-scale models. arXiv preprint arXiv:2203.17274, 2022.
  • Baradad Jurjo et al. [2021] Manel Baradad Jurjo, Jonas Wulff, Tongzhou Wang, Phillip Isola, and Antonio Torralba. Learning to see by looking at noise. Advances in Neural Information Processing Systems, 34:2556–2569, 2021.
  • Beaumont [2022] Romain Beaumont. Clip retrieval: Easily compute CLIP embeddings and build a CLIP retrieval system with them. https://github.com/rom1504/clip-retrieval, 2022.
  • Chen et al. [2019] Yuhua Chen, Wen Li, Xiaoran Chen, and Luc Van Gool. Learning semantic segmentation from synthetic data: A geometrically guided input-output adaptation approach. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1841–1850, 2019.
  • Chen et al. [2023] Zeming Chen, Alejandro Hernández Cano, Angelika Romanou, Antoine Bonnet, Kyle Matoba, Francesco Salvi, Matteo Pagliardini, Simin Fan, Andreas Köpf, Amirkeivan Mohtashami, et al. Meditron-70B: Scaling medical pretraining for large language models. arXiv preprint arXiv:2311.16079, 2023.
  • Cherti et al. [2023] Mehdi Cherti, Romain Beaumont, Ross Wightman, Mitchell Wortsman, Gabriel Ilharco, Cade Gordon, Christoph Schuhmann, Ludwig Schmidt, and Jenia Jitsev. Reproducible scaling laws for contrastive language-image learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2818–2829, 2023.
  • Chowdhery et al. [2023] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. PaLM: Scaling language modeling with pathways. Journal of Machine Learning Research, 24(240):1–113, 2023.
  • Cimpoi et al. [2014] M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi. Describing textures in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
  • Clark and Jaini [2024] Kevin Clark and Priyank Jaini. Text-to-image diffusion models are zero-shot classifiers. Advances in Neural Information Processing Systems, 36, 2024.
  • Deng et al. [2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
  • Dosovitskiy et al. [2015] Alexey Dosovitskiy, Philipp Fischer, Eddy Ilg, Philip Hausser, Caner Hazirbas, Vladimir Golkov, Patrick Van Der Smagt, Daniel Cremers, and Thomas Brox. FlowNet: Learning optical flow with convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 2758–2766, 2015.
  • Douze et al. [2024] Matthijs Douze, Alexandr Guzhva, Chengqi Deng, Jeff Johnson, Gergely Szilvasy, Pierre-Emmanuel Mazaré, Maria Lomeli, Lucas Hosseini, and Hervé Jégou. The Faiss library. 2024.
  • Fang et al. [2022] Alex Fang, Gabriel Ilharco, Mitchell Wortsman, Yuhao Wan, Vaishaal Shankar, Achal Dave, and Ludwig Schmidt. Data determines distributional robustness in contrastive language image pre-training (CLIP). In International Conference on Machine Learning, pages 6216–6234. PMLR, 2022.
  • Fang et al. [2023] Alex Fang, Albin Madappally Jose, Amit Jain, Ludwig Schmidt, Alexander Toshev, and Vaishaal Shankar. Data filtering networks. arXiv preprint arXiv:2309.17425, 2023.
  • French [1999] Robert M. French. Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences, 3(4):128–135, 1999.
  • Gadre et al. [2024] Samir Yitzhak Gadre, Gabriel Ilharco, Alex Fang, Jonathan Hayase, Georgios Smyrnis, Thao Nguyen, Ryan Marten, Mitchell Wortsman, Dhruba Ghosh, Jieyu Zhang, et al. DataComp: In search of the next generation of multimodal datasets. Advances in Neural Information Processing Systems, 36, 2024.
  • Goyal et al. [2023] Sachin Goyal, Ananya Kumar, Sankalp Garg, Zico Kolter, and Aditi Raghunathan. Finetune like you pretrain: Improved finetuning of zero-shot vision models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19338–19347, 2023.
  • Gururangan et al. [2020] Suchin Gururangan, Ana Marasović, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A. Smith. Don’t stop pretraining: Adapt language models to domains and tasks. arXiv preprint arXiv:2004.10964, 2020.
  • Hammoud et al. [2024] Hasan Abed Al Kader Hammoud, Hani Itani, Fabio Pizzati, Philip Torr, Adel Bibi, and Bernard Ghanem. SynthCLIP: Are we ready for a fully synthetic CLIP training? arXiv preprint arXiv:2402.01832, 2024.
  • He et al. [2022a] Ruifei He, Shuyang Sun, Xin Yu, Chuhui Xue, Wenqing Zhang, Philip Torr, Song Bai, and Xiaojuan Qi. Is synthetic data from generative models ready for image recognition? arXiv preprint arXiv:2210.07574, 2022a.
  • He et al. [2022b] Xuanli He, Islam Nassar, Jamie Kiros, Gholamreza Haffari, and Mohammad Norouzi. Generate, annotate, and learn: NLP with synthetic text. Transactions of the Association for Computational Linguistics, 10:826–842, 2022b.
  • Ho and Salimans [2022] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.
  • Ho et al. [2020] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
  • Hoffmann et al. [2022] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022.
  • Honovich et al. [2022] Or Honovich, Thomas Scialom, Omer Levy, and Timo Schick. Unnatural instructions: Tuning language models with (almost) no human labor. arXiv preprint arXiv:2212.09689, 2022.
  • Hsieh et al. [2023] Cheng-Yu Hsieh, Chun-Liang Li, Chih-Kuan Yeh, Hootan Nakhost, Yasuhisa Fujii, Alexander Ratner, Ranjay Krishna, Chen-Yu Lee, and Tomas Pfister. Distilling step-by-step! Outperforming larger language models with less training data and smaller model sizes. arXiv preprint arXiv:2305.02301, 2023.
  • Johnson-Roberson et al. [2016] Matthew Johnson-Roberson, Charles Barto, Rounak Mehta, Sharath Nittur Sridhar, Karl Rosaen, and Ram Vasudevan. Driving in the matrix: Can virtual worlds replace human-generated annotations for real world tasks? arXiv preprint arXiv:1610.01983, 2016.
  • Jung et al. [2023] Jaehun Jung, Peter West, Liwei Jiang, Faeze Brahman, Ximing Lu, Jillian Fisher, Taylor Sorensen, and Yejin Choi. Impossible distillation: From low-quality model to high-quality dataset & model for summarization and paraphrasing. arXiv preprint arXiv:2305.16635, 2023.
  • Kaplan et al. [2020] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
  • Koh et al. [2021] Pang Wei Koh, Shiori Sagawa, Henrik Marklund, Sang Michael Xie, Marvin Zhang, Akshay Balsubramani, Weihua Hu, Michihiro Yasunaga, Richard Lanas Phillips, Irena Gao, et al. WILDS: A benchmark of in-the-wild distribution shifts. In International Conference on Machine Learning, pages 5637–5664. PMLR, 2021.
  • Krause et al. [2013] Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3D object representations for fine-grained categorization. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pages 554–561, 2013.
  • Lee et al. [2021] Katherine Lee, Daphne Ippolito, Andrew Nystrom, Chiyuan Zhang, Douglas Eck, Chris Callison-Burch, and Nicholas Carlini. Deduplicating training data makes language models better. arXiv preprint arXiv:2107.06499, 2021.
  • Li et al. [2023] Alexander C. Li, Mihir Prabhudesai, Shivam Duggal, Ellis Brown, and Deepak Pathak. Your diffusion model is secretly a zero-shot classifier. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2206–2217, 2023.
  • Liu et al. [2023] Haotian Liu, Kilho Son, Jianwei Yang, Ce Liu, Jianfeng Gao, Yong Jae Lee, and Chunyuan Li. Learning customized visual models with retrieval-augmented knowledge. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15148–15158, 2023.
  • Liu et al. [2024] Ruibo Liu, Jerry Wei, Fangyu Liu, Chenglei Si, Yanzhe Zhang, Jinmeng Rao, Steven Zheng, Daiyi Peng, Diyi Yang, Denny Zhou, et al. Best practices and lessons learned on synthetic data for language models. arXiv preprint arXiv:2404.07503, 2024.
  • Maji et al. [2013] Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew Blaschko, and Andrea Vedaldi. Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151, 2013.
  • Mao et al. [2022] Chengzhi Mao, Scott Geng, Junfeng Yang, Xin Wang, and Carl Vondrick. Understanding zero-shot adversarial robustness for large-scale models. arXiv preprint arXiv:2212.07016, 2022.
  • Meng et al. [2021] Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. SDEdit: Guided image synthesis and editing with stochastic differential equations. arXiv preprint arXiv:2108.01073, 2021.
  • Nguyen et al. [2022] Thao Nguyen, Gabriel Ilharco, Mitchell Wortsman, Sewoong Oh, and Ludwig Schmidt. Quality not quantity: On the interaction between dataset design and robustness of CLIP. Advances in Neural Information Processing Systems, 35:21455–21469, 2022.
  • Nguyen et al. [2024] Thao Nguyen, Samir Yitzhak Gadre, Gabriel Ilharco, Sewoong Oh, and Ludwig Schmidt. Improving multimodal datasets with image captioning. Advances in Neural Information Processing Systems, 36, 2024.
  • Nilsback and Zisserman [2008] Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, pages 722–729. IEEE, 2008.
  • OpenAI [2022] OpenAI. ChatGPT. 2022.
  • Peng et al. [2015] Xingchao Peng, Baochen Sun, Karim Ali, and Kate Saenko. Learning deep object detectors from 3D models. In Proceedings of the IEEE International Conference on Computer Vision, pages 1278–1286, 2015.
  • Peng et al. [2017] Xingchao Peng, Ben Usman, Neela Kaushik, Judy Hoffman, Dequan Wang, and Kate Saenko. VisDA: The visual domain adaptation challenge. arXiv preprint arXiv:1710.06924, 2017.
  • Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
  • Ramanujan et al. [2024] Vivek Ramanujan, Thao Nguyen, Sewoong Oh, Ali Farhadi, and Ludwig Schmidt. On the connection between pre-training data diversity and fine-tuning robustness. Advances in Neural Information Processing Systems, 36, 2024.
  • Richter et al. [2016] Stephan R. Richter, Vibhav Vineet, Stefan Roth, and Vladlen Koltun. Playing for data: Ground truth from computer games. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part II 14, pages 102–118. Springer, 2016.
  • Rombach et al. [2022] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
  • Ros et al. [2016] German Ros, Laura Sellart, Joanna Materzynska, David Vazquez, and Antonio M. Lopez. The SYNTHIA dataset: A large collection of synthetic images for semantic segmentation of urban scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3234–3243, 2016.
  • Sarıyıldız et al. [2023] Mert Bülent Sarıyıldız, Karteek Alahari, Diane Larlus, and Yannis Kalantidis. Fake it till you make it: Learning transferable representations from synthetic ImageNet clones. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8011–8021, 2023.
  • Schuhmann et al. [2022] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. LAION-5B: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278–25294, 2022.
  • Silver et al. [2017] David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815, 2017.
  • Sohl-Dickstein et al. [2015] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pages 2256–2265. PMLR, 2015.
  • Song et al. [2020] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.
  • Taori et al. [2023] Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Alpaca: A strong, replicable instruction-following model. Stanford Center for Research on Foundation Models, https://crfm.stanford.edu/2023/03/13/alpaca.html, 3(6):7, 2023.
  • Tian et al. [2023] Yonglong Tian, Lijie Fan, Kaifeng Chen, Dina Katabi, Dilip Krishnan, and Phillip Isola. Learning vision from models rivals learning vision from data. arXiv preprint arXiv:2312.17742, 2023.
  • Tian et al. [2024] Yonglong Tian, Lijie Fan, Phillip Isola, Huiwen Chang, and Dilip Krishnan. StableRep: Synthetic images from text-to-image models make strong visual representation learners. Advances in Neural Information Processing Systems, 36, 2024.
  • Touvron et al. [2023] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
  • Udandarao et al. [2024] Vishaal Udandarao, Ameya Prabhu, Adhiraj Ghosh, Yash Sharma, Philip H.S. Torr, Adel Bibi, Samuel Albanie, and Matthias Bethge. No “zero-shot” without exponential data: Pretraining concept frequency determines multimodal model performance. arXiv preprint arXiv:2404.04125, 2024.
  • Vasu et al. [2023] Pavan Kumar Anasosalu Vasu, Hadi Pouransari, Fartash Faghri, Raviteja Vemulapalli, and Oncel Tuzel. MobileCLIP: Fast image-text models through multi-modal reinforced training. arXiv preprint arXiv:2311.17049, 2023.
  • Wallingford et al. [2024] Matthew Wallingford, Vivek Ramanujan, Alex Fang, Aditya Kusupati, Roozbeh Mottaghi, Aniruddha Kembhavi, Ludwig Schmidt, and Ali Farhadi. Neural priming for sample-efficient adaptation. Advances in Neural Information Processing Systems, 36, 2024.
  • Wang et al. [2022] Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-Instruct: Aligning language models with self-generated instructions. arXiv preprint arXiv:2212.10560, 2022.
  • Wortsman et al. [2022] Mitchell Wortsman, Gabriel Ilharco, Jong Wook Kim, Mike Li, Simon Kornblith, Rebecca Roelofs, Raphael Gontijo Lopes, Hannaneh Hajishirzi, Ali Farhadi, Hongseok Namkoong, et al. Robust fine-tuning of zero-shot models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7959–7971, 2022.
  • Xu et al. [2023] Hu Xu, Saining Xie, Xiaoqing Ellen Tan, Po-Yao Huang, Russell Howes, Vasu Sharma, Shang-Wen Li, Gargi Ghosh, Luke Zettlemoyer, and Christoph Feichtenhofer. Demystifying CLIP data. arXiv preprint arXiv:2309.16671, 2023.
  • Zhang et al. [2021] Renrui Zhang, Rongyao Fang, Wei Zhang, Peng Gao, Kunchang Li, Jifeng Dai, Yu Qiao, and Hongsheng Li. Tip-Adapter: Training-free CLIP-adapter for better vision-language modeling. arXiv preprint arXiv:2111.03930, 2021.
  • Zhou etal. [2024]Chunting Zhou, Pengfei Liu, Puxin Xu, Srinivasan Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu, etal.Lima: Less is more for alignment.Advances in Neural Information Processing Systems, 36, 2024.
  • Zhou etal. [2022]Kaiyang Zhou, Jingkang Yang, ChenChange Loy, and Ziwei Liu.Learning to prompt for vision-language models.International Journal of Computer Vision, 130(9):2337–2348, 2022.
  • Zhou etal. [2023]Yongchao Zhou, Hshmat Sahak, and Jimmy Ba.Using synthetic data for data augmentation to improve classification accuracy.2023.

Appendix A Broader Impacts

We release all models and synthetic data from our work to benefit future research. All released assets are intended solely as scientific research artifacts. Moreover, all released models are task-specific classification models, which limits their potential for misuse. Nonetheless, we do not encourage the use or deployment of our models in practice.

Appendix B Methodology Details

B.1 Sourcing data by generating synthetic images

Given a set of visual class names $\mathcal{C}$ from our target task, we first synthesize a large corpus of image captions for each class name by prompting a large language model (we use Llama-2 7B [61]). For each concept name $c \in \mathcal{C}$, we use three types of prompts to convert $c$ into an image caption, following [59]. For completeness, we detail the prompts here; a minimal prompting sketch follows the list:

1. $c \mapsto$ caption. We prompt the language model (LM) to directly translate the class name into a caption, using a prompt with 3 few-shot in-context examples.

2. $c$, background $\mapsto$ caption. We prompt the LM with an additional background attribute that is randomly sampled from a set predetermined by the domain of $\mathcal{C}$. For example, if $\mathcal{C}$ contains a list of flower names, possible background attributes might include “garden,” “meadow,” or “forest.” These background attributes are automatically generated by prompting a strong instruction-tuned language model such as GPT-4 [1] with the class names $\mathcal{C}$. We provide the LM with 3 in-context examples of $c$, background $\mapsto$ caption mappings.

3. $c$, relation $\mapsto$ caption. We prompt with an additional spatial relationship attribute that is sampled from a domain-invariant set of relationships, such as “next to,” “below,” “besides,” etc. We provide 3 in-context examples of $c$, relation $\mapsto$ caption mappings.
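As a concrete illustration of prompt type (1), the sketch below few-shot prompts an open-weights LM to map a class name to a caption. It is a minimal, hypothetical example: the in-context examples, decoding parameters, and prompt format are our own assumptions for illustration, not the exact prompts used in our pipeline.

```python
# Hedged sketch: few-shot prompting an LM to turn a class name into an image
# caption. The in-context examples and decoding settings below are assumptions.
from transformers import pipeline

generator = pipeline(
    "text-generation", model="meta-llama/Llama-2-7b-hf", device_map="auto"
)

FEW_SHOT = (
    "class: golden retriever -> caption: a golden retriever fetching a ball in a sunny park\n"
    "class: fire lily -> caption: a close-up photo of a blooming fire lily\n"
    "class: Boeing 737 -> caption: a Boeing 737 taxiing on a wet runway at dusk\n"
)

def class_to_caption(class_name: str) -> str:
    prompt = FEW_SHOT + f"class: {class_name} -> caption:"
    out = generator(prompt, max_new_tokens=40, do_sample=True, temperature=0.9)
    # The pipeline returns the prompt plus its continuation; keep only the
    # first line of the newly generated text as the caption.
    return out[0]["generated_text"][len(prompt):].strip().split("\n")[0]

print(class_to_caption("oxeye daisy"))
```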

Each of these captions is used directly as text input to Stable Diffusion 1.5 to produce our targeted synthetic dataset $\mathcal{D}_{\mathcal{C}}^{(\text{synthetic})}$. When sampling from Stable Diffusion, we denoise for 50 DDIM [57] steps starting from Gaussian noise, using a classifier-free guidance [25] scale of 2.0.
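For reference, a minimal sampling sketch using the diffusers library is shown below; the model identifier, caption, and output path are illustrative assumptions, while the step count and guidance scale match the settings above.

```python
# Minimal generation sketch matching the sampling settings above:
# Stable Diffusion 1.5, 50 DDIM steps, classifier-free guidance scale 2.0.
import torch
from diffusers import StableDiffusionPipeline, DDIMScheduler

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16  # assumed model id
).to("cuda")
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)

caption = "a close-up photo of a blooming fire lily in a garden"  # from the LM
image = pipe(caption, num_inference_steps=50, guidance_scale=2.0).images[0]
image.save("synthetic_sample.png")
```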

Generating the images is computationally expensive: every 1M synthetic images (pre-filtering) takes around 12 hours on 64 NVIDIA A100 GPUs. To lower the barrier for future research, we will release our generated synthetic images at https://github.com/scottgeng00/unmet-promise.

Appendix C Experimental Setup Details

C.1 Retrieval hyperparameters

We perform $k$-NN retrieval with $k=2000$ for every downstream benchmark except ImageNet-1K, where we use $k=500$. We picked these values of $k$ with a rough target of retrieving 10K images per class. In particular, the original CLIP paper [48] has a different set of class template strings for each benchmark; our query sets $Q_c$ are therefore differently sized across benchmarks, and our values of $k$ vary to reflect that.

For class balancing, we set $M=10000$. We do not tune $k$ or $M$ in our experiments.

We choose between our two retrieval strategies based on downstream validation set accuracy: we use substring retrieval for FGVC-Aircraft and Flowers102, and semantic retrieval for ImageNet-1K, DTD, and StanfordCars.

We use precomputed $k$-NN search indices from LAION-2B [54] to query against OpenAI CLIP ViT-L/14 image embeddings. No precomputed indices are available for querying against text embeddings, so we construct our own using FAISS [15] with the configuration OPQ256 768,IVF131072 HNSW32,PQ256x8. Computing the 2 billion OpenAI CLIP ViT-L/14 text embeddings for the captions in LAION-2B took approximately 2 hours on 100 GPUs of varying capacity. Building the search index from the embeddings took approximately 12 hours on 128 CPU cores.
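To make the indexing step concrete, the sketch below builds and queries a scaled-down FAISS index over placeholder embeddings. The shrunken factory string, the L2-over-unit-vectors trick, and the nprobe setting are our assumptions for a runnable toy example; the full-scale index uses the OPQ256/IVF131072/PQ256x8 configuration quoted above and real caption embeddings.

```python
# Scaled-down sketch of building and querying a FAISS text-embedding index.
# The paper-scale configuration is OPQ256 768,IVF131072 HNSW32,PQ256x8; here the
# index is shrunk so it can be trained on a small random sample.
import faiss
import numpy as np

d = 768  # OpenAI CLIP ViT-L/14 embedding dimension
index = faiss.index_factory(d, "OPQ64_256,IVF256_HNSW32,PQ64x8")

# Placeholder embeddings; at full scale these are the ~2B LAION-2B caption embeddings.
xb = np.random.rand(20_000, d).astype("float32")
faiss.normalize_L2(xb)   # on unit vectors, L2 ranking matches cosine similarity
index.train(xb)          # fits the OPQ rotation, IVF coarse quantizer, and PQ codebooks
index.add(xb)

# Query: retrieve the k nearest captions for a normalized CLIP text query embedding.
faiss.extract_index_ivf(index).nprobe = 32
xq = np.random.rand(1, d).astype("float32")
faiss.normalize_L2(xq)
distances, ids = index.search(xq, 2000)  # k = 2000, as in our retrieval setup
```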

C.2 Details for downstream benchmarks

We use the standard pre-defined train-test-validation splits for FGVC-Aircraft, DTD, and Flowers-102. Standard validation splits are not available for StanfordCars and ImageNet-1K. We construct a train-validation split for StanfordCars by randomly splitting the pre-defined training set 80%/20% (train/validation) using torch.utils.data.random_split with random seed 42. We construct a validation set for ImageNet-1K by randomly subsampling 50K images from the pre-defined training set, again using torch.utils.data.random_split with random seed 42.
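A minimal sketch of the StanfordCars split construction is shown below. The torchvision dataset class and root path are placeholders, and seeding via a torch.Generator (rather than a global torch.manual_seed call) is an assumption on our part.

```python
# Sketch: 80/20 train-validation split of the StanfordCars training set
# with torch.utils.data.random_split and random seed 42.
import torch
from torch.utils.data import random_split
from torchvision.datasets import StanfordCars

train_full = StanfordCars(root="data/stanford_cars", split="train")  # assumes data on disk

n_train = int(0.8 * len(train_full))
train_set, val_set = random_split(
    train_full,
    [n_train, len(train_full) - n_train],
    generator=torch.Generator().manual_seed(42),
)

# ImageNet-1K is handled analogously, holding out 50K training images as validation.
```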

C.3 Details for model adaptation

Finetuning details.

To finetune CLIP for a specific downstream image classification task, we first initialize a linear readout head $W$ using the weights from the text-based zero-shot CLIP model [9]. Concretely, we initialize $W$ with the CLIP text embeddings of the class names for the desired downstream task. We then append the classification head $W$ to CLIP's vision encoder and train end-to-end with a standard cross-entropy classification loss against one-hot labels.
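A sketch of this initialization using the open_clip library is shown below; the model variant, prompt template, and class names are illustrative assumptions rather than our exact configuration.

```python
# Sketch: initialize a linear readout head W from CLIP text embeddings of the
# class names, then classify images with the vision encoder followed by W.
import torch
import torch.nn as nn
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"   # illustrative model choice
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")

class_names = ["Boeing 707", "Airbus A320"]      # hypothetical task classes
prompts = [f"a photo of a {c}, a type of aircraft." for c in class_names]

with torch.no_grad():
    text_emb = model.encode_text(tokenizer(prompts))
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

head = nn.Linear(text_emb.shape[1], len(class_names), bias=False)
head.weight.data.copy_(text_emb)                 # W initialized from zero-shot weights

def classify(images):                            # finetuned end-to-end with cross-entropy
    feats = model.encode_image(images)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    return head(feats)
```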

We could alternatively finetune CLIP with a contrastive objective, where each positive pair consists of a synthetic or retrieved image and its corresponding caption. However, we find that cross-entropy finetuning performs better across the board, so we use it for all experiments in our paper.

A full adaptation dataset scale sweep for a single benchmark and a fixed set of hyperparameters takes approximately 24-36 hours on 2 NVIDIA A40 GPUs.

Random seed.

For our main experiments, generator ablations, and data mixing experiments, we report results aggregated across at least three random seeds. The random seed (1) seeds the training algorithm and (2) controls adaptation dataset subsampling.

Hyperparameter details.

We start with relatively standard hyperparameters from prior work [66] and initially tune them in our setting by finetuning CLIP on a small-scale dataset of retrieved or synthetic images from each downstream benchmark, grid-sweeping over learning rate and batch size. Of the hyperparameters we tried at this scale, we find the following work best for both synthetic and retrieved images across all downstream benchmarks:

  • Batch size: 512

  • Warmup steps: 500

  • LR schedule: Cosine decay

  • L2 weight decay: 0.1

We find that models are sensitive to the learning rate; there is no single optimal learning rate across all settings. Thus, for our full-scale experiments, we sweep the learning rate over {5e-4, 1e-5, 1e-6} and select the best value for each downstream benchmark based on validation set accuracy.

We train with the AdamW optimizer, using $\beta_1 = 0.9$ and $\beta_2 = 0.95$.
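Putting the above together, the sketch below builds the optimizer and learning-rate schedule we describe (AdamW with the stated betas, 0.1 weight decay, 500 warmup steps, cosine decay); the exact warmup and decay implementation is an assumption.

```python
# Sketch: AdamW with (0.9, 0.95) betas and 0.1 weight decay, plus a 500-step
# linear warmup followed by cosine decay, expressed as a LambdaLR schedule.
import math
import torch

def build_optimizer_and_scheduler(params, lr, total_steps, warmup_steps=500):
    optimizer = torch.optim.AdamW(params, lr=lr, betas=(0.9, 0.95), weight_decay=0.1)

    def lr_lambda(step):
        if step < warmup_steps:                     # linear warmup
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))   # cosine decay to 0

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler

# The learning rate itself is swept per benchmark over {5e-4, 1e-5, 1e-6}.
```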

On all benchmarks except ImageNet, we finetune for a fixed 30 epochs. On ImageNet, we train for a fixed 10 epochs to save compute, as we found that validation set accuracy plateaued early on.

Appendix D Licensing Information

D.1 Benchmarks

ImageNet-1K is released under a custom license that specifies non-commercial research use only. Details can be found at https://www.image-net.org/download.php. Licensing information is unavailable for DTD, Flowers102, FGVC-Aircraft, and StanfordCars; all four datasets are products of academic research and are publicly available online for download.

D.2 Models

Stable Diffusion 1.1, 1.3, and 1.5 are all released under the CreativeML OpenRAIL-M license. OpenCLIP models and OpenAI CLIP are released under the MIT License.

D.3 Data

LAION-2B metadata, precomputed embeddings, and $k$-NN search indices are released under a CC-BY-4.0 license. Because LAION-2B is a web-scraped dataset, the images pointed to by its URLs retain their original licenses.

D.4 Software

We build off the open-source code of [51, 37, 6, 15, 66]. FAISS [15] and clip-retrieval [6] are released under the MIT license. SynCLR code [59] is released under Apache 2.0. Stable Diffusion code [51] is released under the CreativeML OpenRAIL-M license. WiSE-FT (the codebase we build on for CLIP finetuning) is released under the MIT license.
