Authors
Shouchang Guo, Valentin Deschaintre, Douglas Noll, Arthur Roullier
University of Michigan, Ann Arbor; Adobe Research
Abstract
We present a novel U-Attention vision Transformer for universal texture synthesis. We exploit the natural long-range dependencies enabled by the attention mechanism to allow our approach to synthesize diverse textures while preserving their structures in a single inference. We propose a hierarchical hourglass backbone that attends to the global structure and performs patch mapping at varying scales in a coarse-to-fine-to-coarse stream. Completed by skip connection and convolution designs that propagate and fuse information at different scales, our hierarchical U-Attention architecture unifies attention to features from macro structures to micro details, and progressively refines synthesis results at successive stages. Our method achieves stronger 2$\times$ synthesis than previous work on both stochastic and structured textures while generalizing to unseen textures without fine-tuning. Ablation studies demonstrate the effectiveness of each component of our architecture.
Contribution
- A novel hierarchical hourglass backbone for coarse-to-fine and fine-back-to-coarse processing, allowing us to apply self-attention at different scales and to exploit macro to micro structures (a minimal patch-size schedule is sketched after this list)
- Skip connections and convolutional layers between Transformer blocks, propagating and fusing high-frequency and low-frequency features from different Transformer stages
- A 2× texture synthesis method with a single trained network that generalizes to textures of varying complexity in a single forward inference
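To make the hourglass idea concrete, here is a minimal Python sketch of a coarse-to-fine-to-coarse patch-size progression across Transformer stages. The stage count and patch sizes are illustrative assumptions, not values from the paper.

```python
# Hypothetical hourglass patch-size schedule (assumed numbers):
# patch sizes shrink by 2x toward the middle stage, then grow back.
def hourglass_patch_sizes(num_stages=5, coarsest=16):
    """Return one patch size per Transformer stage, e.g. [16, 8, 4, 8, 16]."""
    half = num_stages // 2
    down = [coarsest // (2 ** i) for i in range(half + 1)]  # [16, 8, 4]
    return down + down[-2::-1]                              # [16, 8, 4, 8, 16]

print(hourglass_patch_sizes())  # [16, 8, 4, 8, 16]
```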
Related Work
Algorithmic texture synthesis; deep-learning-based texture synthesis; Transformers for images
Overview
Proposed U-Attention framework with hierarchical hourglass Transformers. We introduce a multi-scale partition of the feature map between hierarchical Transformer blocks to form input patches of different scales for different Transformers. The input texture image is first projected into feature space by an encoder. We then apply a succession of Transformer blocks, with up and down convolutions in between (purple arrows), processing the feature maps at different resolutions. Each Transformer block takes the whole feature map as input, and we partition the feature map into sequences of patches of progressively smaller or larger sizes at consecutive stages of the network. The input patch sizes across stages therefore form an hourglass-like scale progression (dotted blue line), enabling attention to finer or coarser details at different attention steps. Finally, we add skip connections that propagate and concatenate outputs from previous stages as part of the inputs to later Transformer stages (yellow arrows).
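Below is a minimal PyTorch sketch of this data flow: an encoder projects the image into feature space, successive Transformer stages attend over patch partitions whose size follows the hourglass schedule, and skip connections concatenate earlier stage outputs into later stages before a 1×1 fusion convolution. The sketch keeps the feature-map resolution fixed, omitting the up/down convolutions and the 2× output enlargement of the actual method; all module names, channel counts, and patch sizes are assumptions for illustration, not the authors' implementation.

```python
# A minimal PyTorch sketch of the U-Attention data flow (assumptions:
# module names, channel counts, and patch sizes are illustrative only).
import torch
import torch.nn as nn


def to_patch_tokens(x, p):
    """Partition a (B, C, H, W) feature map into a sequence of p x p patches."""
    B, C, H, W = x.shape
    x = x.unfold(2, p, p).unfold(3, p, p)           # (B, C, H/p, W/p, p, p)
    return x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * p * p)


def from_patch_tokens(tokens, p, C, H, W):
    """Fold a patch-token sequence back into a (B, C, H, W) feature map."""
    B = tokens.shape[0]
    x = tokens.reshape(B, H // p, W // p, C, p, p)
    return x.permute(0, 3, 1, 4, 2, 5).reshape(B, C, H, W)


class StageAttention(nn.Module):
    """One Transformer stage: self-attention over patches of a fixed size."""
    def __init__(self, channels, patch):
        super().__init__()
        self.patch = patch
        dim = channels * patch * patch
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)

    def forward(self, x):
        B, C, H, W = x.shape
        tokens = self.encoder(to_patch_tokens(x, self.patch))
        return from_patch_tokens(tokens, self.patch, C, H, W)


class UAttentionSketch(nn.Module):
    """Hourglass patch-size schedule with skip connections between stages."""
    def __init__(self, channels=16, patch_sizes=(8, 4, 8)):
        super().__init__()
        self.encode = nn.Conv2d(3, channels, 3, padding=1)
        self.stages = nn.ModuleList(
            [StageAttention(channels, p) for p in patch_sizes])
        # 1x1 convolutions fuse the skip-concatenated features back to `channels`
        self.fuse = nn.ModuleList(
            [nn.Conv2d(channels * (i + 1), channels, 1)
             for i in range(len(patch_sizes))])
        self.decode = nn.Conv2d(channels, 3, 3, padding=1)

    def forward(self, img):
        feat = self.encode(img)
        skips = []
        for stage, fuse in zip(self.stages, self.fuse):
            x = fuse(torch.cat([feat] + skips, dim=1))  # skip connections
            feat = stage(x)
            skips.append(feat)
        return self.decode(feat)


net = UAttentionSketch()
out = net(torch.randn(1, 3, 64, 64))  # same-resolution output in this sketch
print(out.shape)                      # torch.Size([1, 3, 64, 64])
```

The key design reflected here is that every stage attends over the full feature map, only the patch granularity changes, while the concatenation-plus-1×1-convolution step stands in for the paper's propagation and fusion of features from earlier Transformer stages.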