Paper Overview
Original paper: Video Swin Transformer
Paper link: https://arxiv.org/abs/2106.13230
The notes below are my personal record from reading the paper; my knowledge is limited, so corrections are welcome.
Paper Content
Abstract
- The vision community is witnessing a modeling shift from CNNs to Transformers, where pure Transformer architectures have attained top accuracy on the major video recognition benchmarks.
- These video models are all built on Transformer layers that globally connect patches across the spatial and temporal dimensions.