科學研究

                Research

                首頁 >  論文  > 詳情

                Container: Context Aggregation Network

                發表會議及期刊:NeuIPS

                Peng Gao1,2, Jiasen Lu4, Hongsheng Li2, Roozbeh Mottaghi3,4, Aniruddha Kembhavi3,4

                1 Shanghai AI Laboratory     2 CUHK-SenseTime Joint Lab, CUHK

                3 University of Washington     4 PRIOR @ Allen Institute for AI


                Abstract:

                Convolutional neural networks (CNNs) are ubiquitous in computer vision, with a myriad of effective and efficient variations. Recently, Transformers – originally introduced in natural language processing–have been increasingly adopted in computer vision. While early adopters continue to employ CNN backbones, the latest networks are end-to-end CNN-free Transformer solutions. A recent surprising finding shows that a simple MLP based solution without any traditional convolutional or Transformer components can produce effective visual representations. While CNNs, Transformers and MLP-Mixers may be considered as completely disparate architectures, we provide a unified view showing that they are in fact special cases of a more general method to aggregate spatial context in a neural network stack. We present the CONTAINER (CONText AggregatIon NEtwoRk), a general-purpose building block for multi-head context aggregation that can exploit long-range interactions a la Transformers while still exploiting the inductive bias of the local convolution operation leading to faster convergence speeds, often seen in CNNs. Our CONTAINER architecture achieves 82.7 % Top-1 accuracy on ImageNet using 22M parameters, +2.8 improvement compared with DeiT-Small, and can converge to 79.9 % Top-1 accuracy in just 200 epochs. In contrast to Transformer-based methods that do not scale well to downstream tasks that rely on larger input image resolutions, our efficient network, named CONTAINER-LIGHT, can be employed in object detection and instance segmentation networks such as DETR, RetinaNet and Mask-RCNN to obtain an impressive detection mAP of 38.9, 43.8, 45.1 and mask mAP of 41.3, providing large improvements of 6.6, 7.3, 6.9 and 6.6 pts respectively, compared to a ResNet-50 backbone with a comparable compute and parameter size. Our method also achieves promising results on selfsupervised learning compared to DeiT on the DINO framework. Code is released at https://github.com/allenai/container.


                comm@pjlab.org.cn

                上海市徐匯區云錦路701號西岸國際人工智能中心37-38層

                滬ICP備2021009351號-1

                        
                        

                              拔萝卜又叫又疼原声视频