| Model | Paper | Link |
| --- | --- | --- |
| CoAtNet | CoAtNet: Marrying Convolution and Attention for All Data Sizes | papers_with_code |
| ViT-G/14 | Scaling Vision Transformers | paper |
| SwinV2 | Swin Transformer V2: Scaling Up Capacity and Resolution | papers_with_code |
| ViT-MoE | Scaling Vision with Sparse Mixture of Experts | paper |
| Florence | Florence: A New Foundation Model for Computer Vision | paper |
| ALIGN | Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision | papers_with_code |
| MViTv2 | Improved Multiscale Vision Transformers for Classification and Detection | paper |
| MViT | Multiscale Vision Transformers | papers_with_code |
| BEiT | BEiT: BERT Pre-Training of Image Transformers | papers_with_code |
| Meta Pseudo Labels | Meta Pseudo Labels | papers_with_code |
| SAM | Sharpness-Aware Minimization for Efficiently Improving Generalization | papers_with_code |
| NoisyStudent | Self-training with Noisy Student improves ImageNet classification | papers_with_code |
| NFNet | High-Performance Large-Scale Image Recognition Without Normalization | papers_with_code |
| TokenLearner | TokenLearner: What Can 8 Learned Tokens Do for Images and Videos? | papers_with_code |
| BiT | Big Transfer (BiT): General Visual Representation Learning | papers_with_code |
| MAE | Masked Autoencoders Are Scalable Vision Learners | papers_with_code |
| Focal | Focal Attention for Long-Range Interactions in Vision Transformers | paper & code |
| MetaFormer | MetaFormer is Actually What You Need for Vision | papers_with_code |
| CSWin | CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows | papers_with_code |
| Twins | Twins: Revisiting the Design of Spatial Attention in Vision Transformers | papers_with_code |
| Swin | Swin Transformer: Hierarchical Vision Transformer using Shifted Windows | papers_with_code |
| CaiT | Going deeper with Image Transformers | papers_with_code |
| CvT | CvT: Introducing Convolutions to Vision Transformers | papers_with_code |
| PVTv2 | PVTv2: Improved Baselines with Pyramid Vision Transformer | papers_with_code |
| PVT | Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions | |
| SReT | Sliced Recursive Transformer | papers_with_code |
| ViT | An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale | papers_with_code |
| HRFormer | HRFormer: High-Resolution Transformer for Dense Prediction | paper & code |
| Conformer | Conformer: Local Features Coupling Global Representations for Visual Recognition | papers_with_code |
| FixEfficientNet | Fixing the train-test resolution discrepancy: FixEfficientNet | papers_with_code |
| EfficientNetV2 | EfficientNetV2: Smaller Models and Faster Training | papers_with_code |
| EfficientNet | EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks | papers_with_code |
| Pale | Pale Transformer: A General Vision Transformer Backbone with Pale-Shaped Attention | paper |
| VOLO | VOLO: Vision Outlooker for Visual Recognition | papers_with_code |
| ELSA | ELSA: Enhanced Local Self-Attention for Vision Transformer | papers_with_code |
| DAT | Vision Transformer with Deformable Attention | github |
| As-ViT | Auto-scaling Vision Transformers without Training | github |
| CycleMLP | CycleMLP: A MLP-like Architecture for Dense Prediction | github |
| CrossFormer | CrossFormer: A Versatile Vision Transformer Hinging on Cross-scale Attention | github |
| AS-MLP | AS-MLP: An Axial Shifted MLP Architecture for Vision | github |
| VAN | Visual Attention Network | github |
| ConvNeXt | A ConvNet for the 2020s | github |
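
Many of the backbones in the table above (ViT, Swin, ConvNeXt, EfficientNet, BEiT, CaiT, and others) also have third-party reference implementations in the `timm` library. A minimal inference sketch, assuming a recent `timm` install; exact model names vary across `timm` versions, so list the available variants first:

```python
import timm
import torch

# Discover available Swin variants in the installed timm version,
# since registered model names can change between releases.
print(timm.list_models("swin*")[:5])

# Build a pretrained Swin-Base at 224x224 (name valid in common timm versions).
model = timm.create_model("swin_base_patch4_window7_224", pretrained=True)
model.eval()

# Run a forward pass on a stand-in for a preprocessed image batch.
x = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    logits = model(x)
print(logits.shape)  # torch.Size([1, 1000]) for an ImageNet-1k head
```

For entries linked as `github` or `paper & code`, the authors' official repositories remain the authoritative implementations.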