AI / Large-Model Architecture
ViT Architecture
The pipeline of image patching, patch embedding, positional encoding, Transformer encoder, and classification head.
Full Prompt
A Vision Transformer (ViT) architecture, left-to-right horizontal flow.
Step 1: Patch Embedding (left):
- An input image is split into a 4x4 grid of non-overlapping 16x16 patches.
- Each patch is flattened and linearly projected to a D=768 embedding.
- Show the patch grid clearly with a small example image inside.
Step 2: Position + [CLS] token:
- Prepend a learnable [CLS] token to the patch sequence.
- Add learned positional embeddings element-wise.
Step 3: Transformer Encoder (center):
- A stack of L=12 standard encoder layers (multi-head self-attention + MLP + LayerNorm).
- Show the stack as a tall vertical column with one expanded layer to the side.
Step 4: Output (right):
- The [CLS] token output is projected by an MLP head to class logits.
Style: clean academic vector, navy and teal palette, thin connectors, white background. Annotate tensor shapes (N+1, D).
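To sanity-check the shape annotations the prompt asks for, the four steps can be traced with plain arithmetic. This is a minimal sketch, not a model implementation. The 4x4 grid, 16x16 patches, and D=768 come from the prompt; the function name `vit_shapes`, the 3-channel input, and the 1000-class head are assumptions made only for illustration.

```python
def vit_shapes(grid=4, patch=16, channels=3, dim=768, num_classes=1000):
    """Trace tensor shapes through the ViT flow described in the prompt.

    channels and num_classes are illustrative assumptions; the prompt
    does not specify them.
    """
    image_side = grid * patch          # pixels per image side
    n = grid * grid                    # N, the number of patches
    flat = patch * patch * channels    # length of one flattened patch
    return {
        "image": (image_side, image_side, channels),
        "patches": (n, flat),          # Step 1: split and flatten
        "embedded": (n, dim),          # Step 1: linear projection to D
        "with_cls": (n + 1, dim),      # Step 2: prepend [CLS], add positions
        "encoded": (n + 1, dim),       # Step 3: the encoder keeps the shape
        "logits": (num_classes,),      # Step 4: MLP head on the [CLS] output
    }


shapes = vit_shapes()
for name, shape in shapes.items():
    print(f"{name:>9}: {shape}")
```

With the prompt's values, N is 16, so the annotation (N+1, D) corresponds to (17, 768), and the implied input image is 64x64 pixels.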
Use Cases
For computer-vision papers using transformer encoders for classification, segmentation or detection.
Variants
Hierarchical Swin variant
Same flow but show a hierarchical Swin-style ViT with 4 stages of decreasing spatial resolution and increasing channel dim. Add window-attention boxes inside each stage.
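If the figure should carry concrete numbers per stage, the resolution and channel progression can be tabulated first. The sketch below assumes the commonly used Swin-T settings (224x224 input, 4x4 patches, base channel dimension 96, with each later stage halving the side length and doubling the channels); none of these values appear in the variant prompt, and `swin_stage_shapes` is a hypothetical helper name.

```python
def swin_stage_shapes(image_side=224, patch=4, base_dim=96, stages=4):
    """List (stage, height, width, channels) for a Swin-style hierarchy.

    Defaults follow the common Swin-T configuration and are assumptions,
    not values taken from the prompt.
    """
    side = image_side // patch   # token grid side after patch partition
    dim = base_dim
    result = []
    for stage in range(1, stages + 1):
        result.append((stage, side, side, dim))
        side //= 2               # patch merging halves each spatial side
        dim *= 2                 # and doubles the channel dimension
    return result


stages = swin_stage_shapes()
for stage, h, w, c in stages:
    print(f"stage {stage}: {h}x{w} tokens, {c} channels")
```

Quoting these per-stage sizes in the prompt gives the figure explicit labels instead of a generic "decreasing resolution" caption.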
Usage Tips
- State patch size and image size — figures default to a generic grid otherwise.
- Mention "[CLS] token" by name. Models reproduce the literal label and arrow.
- For a paper figure, fix L (layer count) to your value to avoid ambiguity.
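One way to apply these tips consistently is to generate the prompt from the model's actual hyperparameters, so that image size, patch size, layer count, and the shape annotation always agree. The following is a hedged sketch: `build_vit_prompt` and its default values (224x224 input, 16x16 patches) are hypothetical, and the wording is a shortened paraphrase of the full prompt rather than a replacement for it.

```python
def build_vit_prompt(image_size=224, patch_size=16, dim=768, layers=12):
    """Fill a shortened ViT figure prompt with explicit sizes (illustrative)."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    grid = image_size // patch_size
    n = grid * grid  # number of patch tokens, excluding [CLS]
    return (
        "A Vision Transformer (ViT) architecture, left-to-right horizontal flow. "
        f"A {image_size}x{image_size} input image is split into a {grid}x{grid} grid "
        f"of non-overlapping {patch_size}x{patch_size} patches, each linearly "
        f"projected to a D={dim} embedding. "
        "Prepend a learnable [CLS] token and add learned positional embeddings. "
        f"A stack of L={layers} standard encoder layers follows. "
        "The [CLS] token output is projected by an MLP head to class logits. "
        f"Annotate tensor shapes ({n + 1}, {dim})."
    )


prompt = build_vit_prompt()
print(prompt)
```

Passing the values from a paper's experimental setup, for example `build_vit_prompt(image_size=384, layers=24)`, keeps the figure's labels in step with the text.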
FAQ
How do I extend this to dense prediction (segmentation)?
Replace the MLP head with a "lightweight decoder reshaping patch tokens back to 2D and producing a per-pixel mask." The encoder structure stays the same.
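The "reshaping patch tokens back to 2D" step can be illustrated with a toy example. This sketch assumes the [CLS] token sits first in the sequence (as in the main prompt) and that patches are ordered row by row; `tokens_to_grid` is a hypothetical helper, and real decoders would operate on tensors rather than Python lists.

```python
def tokens_to_grid(tokens, grid):
    """Drop the leading [CLS] token and arrange the rest row-major
    into a grid x grid feature map (a list of rows of token vectors)."""
    patch_tokens = tokens[1:]  # assumes [CLS] is at index 0
    if len(patch_tokens) != grid * grid:
        raise ValueError("expected grid * grid patch tokens after [CLS]")
    return [patch_tokens[r * grid:(r + 1) * grid] for r in range(grid)]


# Toy sequence: one [CLS] entry followed by 16 one-element "embeddings".
tokens = [["cls"]] + [[i] for i in range(16)]
feature_map = tokens_to_grid(tokens, grid=4)

for row in feature_map:
    print(row)
```

The resulting 4x4 map is what a lightweight decoder would upsample to a per-pixel mask; the encoder side of the figure is unchanged.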
