**Thibault Castells**
Research Engineer, Nota AI

Introduction

Latent Diffusion Models (LDMs) [Rombach et al., 2021] have demonstrated remarkable performance in generative tasks, but deploying them on resource-limited devices remains challenging because of their high memory consumption and slow inference. Pruning is a common way to reduce model size and improve efficiency. However, pruning often degrades performance, so the model must be retrained, which is expensive and time-consuming for large models such as LDMs.

Performance-guided pruning aims to minimize performance loss during the pruning process. However, applying it to LDMs has been impractical because there is no fast, task-agnostic method for evaluating their performance. To address these challenges, our recent research paper introduces LD-Pruner, a novel performance-preserving pruning method tailored specifically for compressing LDMs. By leveraging the latent space during the pruning process, LD-Pruner significantly reduces inference time and parameter count with minimal performance degradation across various tasks, including text-to-image generation, unconditional image generation, and unconditional audio generation.

🍀 Resources for more information: ArXiv.

🍀 Accepted at CVPR’24 Workshop on Efficient and On-Device Generation (EDGE)

Overview

LD-Pruner assesses the significance of each operator in the LDM's Unet by systematically modifying them and tracking changes in the generated latent representations. A tailored scoring formula quantifies the impact of each modification, guiding the pruning process to remove or replace less significant operators, as visualized in Figure 1. This approach enables LD-Pruner to identify the most promising pruning candidates while minimizing the need for expensive retraining. By operating in the latent space, LD-Pruner provides a fast and task-agnostic method for evaluating the impact of pruning on model performance. This allows for efficient performance-guided pruning of LDMs, overcoming the limitations of traditional pruning methods. The resulting compressed models offer reduced memory consumption and improved inference speed, making them more suitable for deployment on resource-limited devices.

Figure 1. Overview of LD-Pruner. Given $k$ operators in the Unet, we generate $k+1$ sets of $N_{gen}$ latent vectors: one set for the original Unet, and one for each Unet where a single operator has been modified. The importance score of each operator is then calculated using a formula specifically designed to compare latent vectors. This formula, sensitive to shifts in both the central tendency and the variability of the latent vectors, generates a comprehensive measure of the importance of each operator.

Method

LD-Pruner's process begins by systematically modifying each operator in the LDM's Unet, either by removing it entirely or replacing it with a less computationally demanding operation that maintains the original dimensions. The model then generates latent representations using both the original and modified versions of the Unet, creating a comprehensive record of the model's output under various operator modifications.
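As a rough illustration of this step, the sketch below uses a toy stand-in for the Unet: a plain list of shape-preserving functions applied in sequence. The names `run_model`, `identity`, and `generate_latent_sets` are hypothetical and not part of LD-Pruner's code. Two assumptions are made here: each operator is replaced by an identity mapping (one possible shape-preserving substitute), and the same noise inputs are reused for the original and every modified model so that the resulting latent sets are directly comparable.

```python
import numpy as np


def run_model(operators, noise):
    """Apply a sequence of shape-preserving operators to one noise vector."""
    x = noise
    for op in operators:
        x = op(x)
    return x


def identity(x):
    """Cheap stand-in for a removed operator (assumption: identity keeps the shape)."""
    return x


def generate_latent_sets(operators, noise_batch):
    """Return k+1 latent sets for k operators.

    Index 0 holds the latents of the original model; index i+1 holds the
    latents of the model whose i-th operator was replaced by `identity`.
    """
    latent_sets = [np.stack([run_model(operators, z) for z in noise_batch])]
    for i in range(len(operators)):
        modified = list(operators)  # copy, so the original model stays intact
        modified[i] = identity
        latent_sets.append(np.stack([run_model(modified, z) for z in noise_batch]))
    return latent_sets


# Toy usage: three "operators" acting on 2-dimensional latents.
toy_operators = [lambda x: 2.0 * x, lambda x: x + 1.0, lambda x: x * 1.0]
toy_noise = np.arange(6, dtype=float).reshape(3, 2)  # N_gen = 3
toy_sets = generate_latent_sets(toy_operators, toy_noise)
```

In this toy example the third operator has no effect, so its modified latent set equals the original one, which is the kind of operator the pruning step is meant to find.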

To evaluate the significance of each operator, LD-Pruner employs a specially designed scoring formula that captures the difference between the original and modified sets of latent representations. This scoring formula is designed to be sensitive to both shifts in the central tendency and changes in the variability of the latent representations, providing a comprehensive measure of the impact of operator modification. The formula takes into account both the average distance and the standard deviation distance between the two sets:

$$ \text{score} = avg_{dist} + std_{dist} $$

With:

$$ \begin{aligned} avg_{dist} &= \|avg_{orig} - avg_{mod}\|_2 \\ std_{dist} &= \|std_{orig} - std_{mod}\|_2 \end{aligned} $$

where $\|\cdot\|_2$ denotes the Euclidean norm, $avg_{orig}$ and $avg_{mod}$ are the average latent vectors for a small set of generated latent vectors from the original model and the modified model (with one operator changed), respectively, and $std_{orig}$ and $std_{mod}$ are the standard deviations of the same set of generated latent vectors from the corresponding models.
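The scoring formula can be sketched in a few lines of NumPy. This is a minimal illustration rather than the official implementation: the function name `ld_pruner_score` is hypothetical, each latent is flattened to a vector before the element-wise statistics are taken, and the population standard deviation (NumPy's default, `ddof=0`) is assumed because the text does not say which estimator is used.

```python
import numpy as np


def ld_pruner_score(latents_orig, latents_mod):
    """Score one operator from two sets of latents with shape (N_gen, ...).

    score = ||avg_orig - avg_mod||_2 + ||std_orig - std_mod||_2
    """
    orig = np.asarray(latents_orig, dtype=np.float64)
    mod = np.asarray(latents_mod, dtype=np.float64)
    if orig.shape[1:] != mod.shape[1:]:
        raise ValueError("original and modified latents must have the same shape")

    # Flatten each latent so that statistics are computed per latent element.
    orig = orig.reshape(orig.shape[0], -1)
    mod = mod.reshape(mod.shape[0], -1)

    # Shift in central tendency: distance between the average latent vectors.
    avg_dist = np.linalg.norm(orig.mean(axis=0) - mod.mean(axis=0))
    # Shift in variability: distance between the element-wise standard deviations
    # (assumption: population standard deviation, ddof=0).
    std_dist = np.linalg.norm(orig.std(axis=0) - mod.std(axis=0))
    return float(avg_dist + std_dist)


# Toy usage: shifting every latent by (3, 4) changes the mean but not the spread.
original = [[0.0, 0.0], [2.0, 2.0]]
shifted = [[3.0, 4.0], [5.0, 6.0]]
example_score = ld_pruner_score(original, shifted)  # avg_dist = 5, std_dist = 0
```

A modification that leaves the latents unchanged gets a score of zero, while one that only collapses their variability is still penalized through the second term.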

Once the significance scores are calculated, LD-Pruner identifies the operators with the lowest scores as candidates for pruning or substitution, as these are considered the least contributory to the model's output.
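The selection step amounts to sorting operators by score and keeping the lowest ones. The sketch below assumes the scores are already available as a mapping from operator name to score; the function name, the operator names, and the idea of fixing the number of operators to modify in advance are illustrative assumptions, not details taken from the paper.

```python
def select_pruning_candidates(scores, num_to_modify):
    """Return the names of the `num_to_modify` operators with the lowest scores."""
    if num_to_modify < 0:
        raise ValueError("num_to_modify must be non-negative")
    # Sort ascending: a low score means that modifying the operator
    # barely changed the generated latents.
    ranked = sorted(scores.items(), key=lambda item: item[1])
    return [name for name, _ in ranked[:num_to_modify]]


# Toy usage with made-up operator names and scores.
toy_scores = {"attn_1": 0.8, "conv_2": 0.05, "res_3": 0.3, "attn_4": 0.01}
candidates = select_pruning_candidates(toy_scores, 2)
```

Here the two operators with the smallest scores, `attn_4` and `conv_2`, would be modified first.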

Results

LD-Pruner was evaluated on several generative tasks, including text-to-image (T2I) generation, unconditional image generation (UIG), and unconditional audio generation (UAG), demonstrating its effectiveness in compressing LDMs while preserving performance.

In the T2I task, our pruned model achieves performance close to that of the original model while being 34.89% faster (measured on an NVIDIA RTX 3090 GPU), as shown in Table 1.

Table 1: Comparison of different models for T2I generation on the MS-COCO *256 × 256* validation set. Speedup values are measured relative to SD-v1.4.

Figure 2 provides a qualitative comparison of the pruned model's output against other T2I models, showcasing its competitive performance.

Figure 2: Qualitative comparison on zero-shot MS-COCO benchmark on T2I. The results of previous studies were obtained with their official released models.

For UIG, the compressed model (23.47% speedup) converged rapidly, reaching a minimum FID of 15.03 after just 46k iterations, compared to the original model's FID of 13.84 after 410k iterations (Figure 3).

Figure 3: Evolution of the FID during the training process for the UIG task on the CelebA-HQ *256 × 256* dataset, for two different compression ratios.

In the UAG task, with 90 operators modified, LD-Pruner achieved a 19.2% speedup and a post-finetuning FAD of 2.0 (Table 2), which is better than that of the original model. To the human ear, the audio generated by the compressed model is indistinguishable from that of the original model.

Table 2: Compression performance on UAG task with AudioDiffusion. When finetuning, we proceed for 12k steps.

LD-Pruner's focus on weight preservation during pruning proved crucial for maintaining performance. Diffusion models with preserved weights consistently outperformed models trained from scratch under equivalent compression and training conditions (Table 3).

Table 3: FID scores for our compressed model (31 operators modified) trained from scratch and with preserved pre-trained weights, for UIG on CelebA-HQ *256 × 256*. In both cases, the exact same training procedure is applied. The FID of the original model is *13.85*.

Conclusion

LD-Pruner represents a significant advancement in the compression of LDMs, offering a task-agnostic and performance-preserving pruning approach. By leveraging the latent space and introducing a novel scoring metric for comparing latent representations, LD-Pruner overcomes the limitations of traditional pruning methods when applied to LDMs. The resulting compressed models offer faster inference speeds and reduced parameter counts without substantially sacrificing performance, making LDMs more accessible and efficient for deployment on resource-limited devices. LD-Pruner's development opens up new possibilities for the widespread adoption of LDMs in various applications, bringing us closer to the goal of making powerful generative models available to a broader range of users and devices.

LD-Pruner: Efficient Pruning of Latent Diffusion Models using Task-Agnostic Insights
