
ViT Architecture

Covers the full pipeline: image patching, patch embedding, positional encoding, the Transformer encoder, and the classification head.

Full Prompt

A Vision Transformer (ViT) architecture, left-to-right horizontal flow.

Step 1 — Patch Embedding (left):
- A 64x64 input image is split into a 4x4 grid of non-overlapping 16x16 patches (N=16).
- Each patch is flattened and linearly projected to a D=768 embedding.
- Show the patch grid clearly with a small example image inside.

Step 2 — Position + [CLS] token:
- Prepend a learnable [CLS] token to the patch sequence.
- Add learned positional embeddings element-wise.

Step 3 — Transformer Encoder (center):
- A stack of L=12 standard encoder layers (multi-head self-attention + MLP, each preceded by LayerNorm and wrapped in a residual connection).
- Show the stack as a tall vertical column with one expanded layer to the side.

Step 4 — Output (right):
- The [CLS] token output is projected by an MLP head to class logits.

Style: clean academic vector, navy and teal palette, thin connectors, white background. Annotate tensor shapes as (N+1, D), where N is the number of patches.
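
To check the shape annotations before generating the figure, the steps in the prompt can be traced with a few lines of arithmetic. This is a minimal sketch, not part of the prompt itself: the function name `vit_token_shapes` is hypothetical, the 64x64 image size is inferred from a 4x4 grid of 16x16 patches, and the 3 input channels (RGB) are an assumption.

```python
def vit_token_shapes(image_size=64, patch_size=16, channels=3, dim=768):
    """Trace tensor shapes through the ViT pipeline described in the prompt.

    Defaults are assumptions: 64x64 follows from a 4x4 grid of 16x16
    patches, and channels=3 assumes an RGB input.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    grid = image_size // patch_size            # patches per side
    num_patches = grid * grid                  # N
    patch_dim = channels * patch_size ** 2     # length of one flattened patch
    return {
        "grid": (grid, grid),
        "flattened_patches": (num_patches, patch_dim),
        "patch_embeddings": (num_patches, dim),       # after linear projection
        "with_cls_token": (num_patches + 1, dim),     # [CLS] prepended
        "encoder_output": (num_patches + 1, dim),     # encoder keeps the shape
        "cls_output": (dim,),                         # input to the MLP head
    }


shapes = vit_token_shapes()
for name, shape in shapes.items():
    print(f"{name:18s} {shape}")
```

With the defaults, N is 16 and the annotated sequence shape (N+1, D) is (17, 768). Calling `vit_token_shapes(image_size=224)` gives (197, 768) for a 224x224 input.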


Use Cases

For computer-vision papers using transformer encoders for classification, segmentation or detection.

Variants

Hierarchical Swin variant

Same flow but show a hierarchical Swin-style ViT with 4 stages of decreasing spatial resolution and increasing channel dim. Add window-attention boxes inside each stage.
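
If the figure should carry per-stage shape labels, the hierarchy can be worked out ahead of time. The sketch below assumes Swin-T-like settings (224x224 input, 4x4 patches, base width 96, 7x7 windows, resolution halved and width doubled between stages); none of these values come from the prompt, and `swin_stage_shapes` is a hypothetical helper.

```python
def swin_stage_shapes(image_size=224, patch_size=4, base_dim=96,
                      num_stages=4, window=7):
    """List token-grid resolution, channel dim, and window count per stage.

    All defaults are assumed Swin-T-like values, not taken from the prompt.
    """
    side = image_size // patch_size   # token grid side after patch partition
    dim = base_dim
    stages = []
    for index in range(num_stages):
        if index > 0:
            side //= 2                # patch merging halves each spatial side
            dim *= 2                  # and doubles the channel dimension
        stages.append({
            "stage": index + 1,
            "resolution": (side, side),
            "dim": dim,
            "windows": (side // window) ** 2,  # non-overlapping attention windows
        })
    return stages


stages = swin_stage_shapes()
for s in stages:
    print(s)
```

Under these assumptions the four stages run from a 56x56 grid at width 96 down to a 7x7 grid at width 768, which gives concrete numbers to put on the window-attention boxes.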

Tips

State the patch size and image size; otherwise the figure defaults to a generic grid.
Mention "[CLS] token" by name. Image models tend to reproduce the literal label and its arrow.
For a paper figure, set L (the layer count) to your model's actual value to avoid ambiguity.

FAQ

How do I extend this to dense prediction (segmentation)?

Replace the MLP head with a "lightweight decoder reshaping patch tokens back to 2D and producing a per-pixel mask." The encoder structure stays the same.
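
The reshaping step that the decoder performs can be illustrated with plain lists. This is only a sketch under assumptions: tokens are taken to be in row-major order with [CLS] first, and nearest-neighbour repetition stands in for whatever learned upsampling a real decoder would use. Both function names are hypothetical.

```python
def tokens_to_grid(tokens, grid_size):
    """Drop the leading [CLS] token and reshape N patch tokens to a 2D grid.

    Assumes row-major patch order, which is the usual convention.
    """
    patch_tokens = tokens[1:]
    if len(patch_tokens) != grid_size * grid_size:
        raise ValueError("token count does not match grid_size squared")
    return [patch_tokens[r * grid_size:(r + 1) * grid_size]
            for r in range(grid_size)]


def upsample_labels(label_grid, patch_size):
    """Expand one label per patch into a per-pixel mask (nearest neighbour)."""
    mask = []
    for row in label_grid:
        # Repeat each label patch_size times horizontally...
        pixel_row = [label for label in row for _ in range(patch_size)]
        # ...and repeat the resulting row patch_size times vertically.
        mask.extend(list(pixel_row) for _ in range(patch_size))
    return mask


tokens = ["cls"] + [f"p{i}" for i in range(16)]   # N + 1 = 17 tokens
grid = tokens_to_grid(tokens, grid_size=4)

# Toy per-patch class labels in a checkerboard pattern.
label_grid = [[(r + c) % 2 for c in range(4)] for r in range(4)]
mask = upsample_labels(label_grid, patch_size=16)
print(len(mask), len(mask[0]))
```

The [CLS] token is discarded here because a dense head only needs the N patch tokens; the resulting mask has the same 64x64 size as the input assumed in the example.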