Contrastive Learning for Diverse Disentangled Foreground Generation
Yuheng Li
Yijun Li
Jingwan Lu
Eli Shechtman
Yong Jae Lee
Krishna Kumar Singh


Abstract

We introduce a new method for diverse foreground generation with explicit control over various factors. Existing image-inpainting-based foreground generation methods often struggle to produce diverse results and rarely allow users to explicitly control specific factors of variation (e.g., varying the facial identity or expression in face inpainting results). We leverage contrastive learning with latent codes to generate diverse foreground results for the same masked input. Specifically, we define two sets of latent codes, where one controls a pre-defined factor (``known''), and the other controls the remaining factors (``unknown''). The sampled latent codes from the two sets jointly bi-modulate the convolution kernels to guide the generator to synthesize diverse results. Experiments demonstrate the superiority of our method over state-of-the-art methods in result diversity and generation controllability.
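The exact form of the bi-modulation is not spelled out on this page. Below is a minimal sketch, assuming a StyleGAN2-style modulated convolution in which per-channel scales derived from a "known" code and an "unknown" code jointly rescale a shared kernel; the class name BiModulatedConv2d, the affine mappings, and the demodulation step are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiModulatedConv2d(nn.Module):
    """Conv layer whose kernels are jointly modulated by two latent codes:
    one for the pre-defined ('known') factor, one for the remaining ('unknown') factors."""

    def __init__(self, in_ch, out_ch, kernel_size, known_dim, unknown_dim):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, kernel_size, kernel_size))
        # Affine layers map each latent code to per-input-channel scales (assumed design).
        self.affine_known = nn.Linear(known_dim, in_ch)
        self.affine_unknown = nn.Linear(unknown_dim, in_ch)
        self.padding = kernel_size // 2

    def forward(self, x, z_known, z_unknown):
        b, in_ch, h, w = x.shape
        # Per-sample scales from both latent codes, shape (B, 1, in_ch, 1, 1).
        s_known = self.affine_known(z_known).view(b, 1, in_ch, 1, 1)
        s_unknown = self.affine_unknown(z_unknown).view(b, 1, in_ch, 1, 1)
        # Jointly modulate the shared kernel with both sets of scales.
        w_mod = self.weight.unsqueeze(0) * s_known * s_unknown
        # Demodulate for numerical stability (StyleGAN2-style normalization).
        demod = torch.rsqrt(w_mod.pow(2).sum(dim=[2, 3, 4], keepdim=True) + 1e-8)
        w_mod = w_mod * demod
        # Grouped convolution applies a different modulated kernel to each sample.
        x = x.view(1, b * in_ch, h, w)
        w_mod = w_mod.view(b * w_mod.shape[1], in_ch, *w_mod.shape[3:])
        out = F.conv2d(x, w_mod, padding=self.padding, groups=b)
        return out.view(b, -1, h, w)

# Example usage with arbitrary sizes: sampling different z_unknown while fixing
# z_known would vary the "unknown" factors only.
layer = BiModulatedConv2d(64, 64, 3, known_dim=128, unknown_dim=128)
features = torch.randn(2, 64, 32, 32)
out = layer(features, torch.randn(2, 128), torch.randn(2, 128))  # (2, 64, 32, 32)
```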


Approach


Results

Comparison of our base model with baselines

Compared with the baselines, our method generates more diverse results: greater variation in identity for faces, and in object shape and texture for birds and cars.


Disentanglement Results

Disentanglement results on three datasets. Images in each column share the same known latent code (identity, shape, or species).


Paper and Supplementary Material

Yuheng Li, Yijun Li, Jingwan Lu, Eli Shechtman, Yong Jae Lee, Krishna Kumar Singh
Contrastive Learning for Diverse Disentangled Foreground Generation



Acknowledgements

This work was supported in part by a Sony Focused Research Award, NSF CAREER award IIS-2150012, the Wisconsin Alumni Research Foundation, and NASA grant 80NSSC21K0295.

This template was originally made by Phillip Isola and Richard Zhang for a colorful ECCV project; the code can be found here.