StripWinformer: Locally-Enhanced Transformer for Image Motion Deblurring
Abstract:
Single image motion deblurring is a critical low-level computer vision task, aiming to restore clear and sharp images from motion-corrupted counterparts.
Traditional image motion deblurring methods often face challenges in handling complex motion patterns and preserving fine details.
Inspired by the success of vision transformer models in various tasks, we propose an innovative framework that leverages the power of transformers to capture motion blur patterns both locally and globally.
To alleviate the computational burden, many vision transformer techniques adopt a strategy of partitioning the image into multiple windows and subsequently modeling the relationships within each isolated window.
Nonetheless, these approaches impose limitations on the exchange of information among the windows, hindering the overall performance.
In this paper, we present a transformer block that partitions the image into horizontal and vertical strips to capture long-range blur patterns, along with windows to capture short-range blur patterns.
To further expand the receptive field of the transformer, we employ an efficient mechanism that specifically emphasizes the correlation between different windows.
To assess the effectiveness of our model, we conduct evaluations on several single-image motion deblurring datasets.
The results demonstrate that the proposed architecture achieves performance competitive with other state-of-the-art motion deblurring approaches.
Architecture: As shown in Figure 1, our deblurring model takes the form of an encoder-decoder architecture with skip-connections.
Given a motion-blurred image I, a two-stage convolution-based encoder first embeds it into a feature X0.
The first and second stages of the encoder consist of 3×3 convolution layers connected by component-wise residual connections.
These connections serve the dual purpose of extracting fine details from small ranges and ensuring the model's convergence during the training phase.
Additionally, for the down-sampling layer, we employ a 3×3 convolution operation with a stride of 2.
This allows us to effectively reduce the spatial dimensions while preserving important information for subsequent processing stages.
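For concreteness, a minimal PyTorch sketch of such an encoder is given below; the channel widths, the activation, and the number of residual blocks per stage are illustrative assumptions rather than the exact configuration of our model.

```python
import torch.nn as nn

class ResidualConvBlock(nn.Module):
    """Two 3x3 convolutions with a component-wise (element-wise) residual connection."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, x):
        # The residual path preserves fine details and eases convergence during training.
        return x + self.body(x)


class ConvEncoder(nn.Module):
    """Two-stage encoder: residual 3x3 conv blocks with a stride-2 3x3 down-sampling conv."""
    def __init__(self, in_channels=3, base_channels=48):  # channel widths are assumptions
        super().__init__()
        self.stage1 = nn.Sequential(
            nn.Conv2d(in_channels, base_channels, kernel_size=3, padding=1),
            ResidualConvBlock(base_channels),
        )
        # Down-sampling: a 3x3 convolution with stride 2 halves the spatial resolution.
        self.down = nn.Conv2d(base_channels, base_channels * 2, kernel_size=3, stride=2, padding=1)
        self.stage2 = ResidualConvBlock(base_channels * 2)

    def forward(self, blurred):
        x = self.stage1(blurred)
        x = self.down(x)
        return self.stage2(x)  # embedded feature X0
```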
The embedded feature X0 then undergoes processing by a series of transformer blocks.
In our framework, we strategically employ six stacked intra and inter transformer blocks in the bottleneck stage, followed by three stacked intra and inter transformer blocks in the second stage of the decoder.
This configuration facilitates the effective capture of both local and global blur patterns.
By incorporating this approach, our model becomes capable of learning and representing intricate details at different scales, thereby enhancing its ability to handle a diverse range of blurring levels.
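To make the partitioning used inside these transformer blocks concrete, the sketch below shows how a feature map can be split into non-overlapping square windows (for short-range attention) and full-width horizontal strips (for long-range attention); the vertical-strip case is symmetric. The tensor shapes, window size, and strip height are illustrative assumptions, and the feature dimensions are assumed to be divisible by the partition sizes.

```python
import torch

def partition_windows(x, window_size):
    """Split a (B, C, H, W) feature map into non-overlapping square windows.

    Returns tokens of shape (B * num_windows, window_size * window_size, C),
    the usual layout for windowed self-attention over short-range blur patterns.
    """
    b, c, h, w = x.shape
    x = x.view(b, c, h // window_size, window_size, w // window_size, window_size)
    x = x.permute(0, 2, 4, 3, 5, 1).contiguous()
    return x.view(-1, window_size * window_size, c)

def partition_horizontal_strips(x, strip_height):
    """Split a (B, C, H, W) feature map into full-width horizontal strips.

    Each strip becomes a token sequence of length strip_height * W, so attention
    inside a strip can relate pixels across the whole image width (long-range blur).
    A vertical-strip variant follows by swapping the roles of H and W.
    """
    b, c, h, w = x.shape
    x = x.view(b, c, h // strip_height, strip_height, w)
    x = x.permute(0, 2, 3, 4, 1).contiguous()
    return x.view(-1, strip_height * w, c)

# Example: a 64x64 feature map with 8x8 windows and strips of height 4
feat = torch.randn(1, 32, 64, 64)
win_tokens = partition_windows(feat, window_size=8)               # (64, 64, 32)
strip_tokens = partition_horizontal_strips(feat, strip_height=4)  # (16, 256, 32)
```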
In the final stage of the decoder, we employ residual-connected convolution layers symmetric to the first stage of the encoder, which keeps the architecture consistent, facilitates the reconstruction process, and generates a blur-correction image R.
Finally, the blurred image I is corrected by adding the correction image R as follows:
I' = I + R, where I' is the deblurred image.
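Putting the pieces together, the following sketch shows the overall data flow, including the residual correction I' = I + R. It is a minimal skeleton only: the transformer stages are identity placeholders standing in for the stacked intra/inter blocks (6 in the bottleneck, 3 in decoder stage 2), and the layer widths and up-sampling choice are assumptions made solely to keep the example runnable.

```python
import torch
import torch.nn as nn

class StripWinformerSkeleton(nn.Module):
    """Top-level data flow only: encoder -> bottleneck -> decoder -> residual correction."""
    def __init__(self, channels=48):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, channels, 3, padding=1),
            nn.Conv2d(channels, channels * 2, 3, stride=2, padding=1),  # stage-2 down-sampling
        )
        self.bottleneck = nn.Identity()      # placeholder for 6 intra/inter transformer blocks
        self.decoder_stage2 = nn.Identity()  # placeholder for 3 intra/inter transformer blocks
        self.decoder_stage1 = nn.Sequential(
            nn.ConvTranspose2d(channels * 2, channels, 2, stride=2),    # back to input resolution
            nn.Conv2d(channels, 3, 3, padding=1),                       # blur-correction image R
        )

    def forward(self, blurred):
        x0 = self.encoder(blurred)
        x = self.bottleneck(x0)
        x = self.decoder_stage2(x + x0)       # skip-connection from the encoder
        correction = self.decoder_stage1(x)   # correction image R
        return blurred + correction           # I' = I + R

# Usage: a random 256x256 "blurred" image
model = StripWinformerSkeleton()
out = model(torch.randn(1, 3, 256, 256))
print(out.shape)  # torch.Size([1, 3, 256, 256])
```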
Fig 1: The proposed deblurring model adopts a hybrid transformer architecture, leveraging the strengths of both convolutional neural networks (CNNs) and transformer mechanisms.
Fig 2: The framework of our proposed transformer block consists of two main components: (a) the Intra Transformer Block and (b) the Inter Transformer Block. The Intra Transformer Block focuses on capturing local dependencies within the partitioned region by employing intra-strip and intra-window self-attention mechanisms. On the other hand, the Inter Transformer Block enhances the model's ability to capture global dependencies by incorporating inter-strip and inter-window self-attention mechanisms.
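As a rough illustration of how attention between windows could be realized, the sketch below pools each window into a single descriptor token, applies multi-head self-attention across these window tokens, and broadcasts the attended summaries back as a residual. This pooling-and-attend design is an assumption for illustration, not the exact inter-window mechanism of our model; the class name, window size, and head count are likewise hypothetical.

```python
import torch
import torch.nn as nn

class InterWindowAttention(nn.Module):
    """One plausible realization of attention across windows (illustrative assumption)."""
    def __init__(self, channels, window_size, num_heads=4):
        super().__init__()
        self.window_size = window_size
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x):
        b, c, h, w = x.shape
        s = self.window_size
        # Reshape into a grid of windows, then average-pool each window to one token.
        windows = x.view(b, c, h // s, s, w // s, s)
        tokens = windows.mean(dim=(3, 5)).flatten(2).transpose(1, 2)   # (B, num_windows, C)
        attended, _ = self.attn(tokens, tokens, tokens)                # attention across windows
        # Broadcast each attended window summary back over its window as a residual.
        attended = attended.transpose(1, 2).view(b, c, h // s, 1, w // s, 1)
        return x + attended.expand_as(windows).reshape(b, c, h, w)

# Example: 8x8 windows over a 64x64 feature map with 32 channels
layer = InterWindowAttention(channels=32, window_size=8)
out = layer(torch.randn(1, 32, 64, 64))   # same shape as the input
```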