Decoder-Only LLM Architecture (GPT-Style)
Stacked decoder blocks with masked self-attention and a language modeling head.
Full Prompt
A decoder-only transformer architecture in the style of GPT, drawn as a vertical stack with input at the bottom and output at the top.
Bottom: token embedding + learned positional embedding.
Middle: a stack of N decoder layers (N=12 for the figure). Each layer contains:
- Masked multi-head self-attention (12 heads)
- Add & LayerNorm
- Feed-forward MLP (hidden dim 3072)
- Add & LayerNorm
Show residual (skip) connections as curved dashed arcs around each sub-layer.
Top:
- Final LayerNorm
- Linear projection to vocab size
- Softmax to next-token probability distribution
Right margin: tensor shape annotations beside each block (B = batch, T = seq length, D = 768).
Style: clean academic vector, navy / teal accent, white background, sans-serif labels. Suitable for ICLR or NeurIPS.
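When checking the generated figure, it can help to have the expected right-margin shape labels written out in advance. The sketch below derives them from the numbers in the prompt (D = 768, 12 heads, MLP hidden dim 3072). The function name `shape_annotations` and the symbol `V` for vocabulary size are assumptions made for illustration; they do not appear in the prompt.

```python
def shape_annotations(B="B", T="T", D=768, heads=12, hidden=3072, vocab="V"):
    """Return (block label, shape string) pairs, ordered bottom to top."""
    head_dim = D // heads  # 768 / 12 = 64 per head

    def fmt(*dims):
        return "(" + ", ".join(str(d) for d in dims) + ")"

    return [
        ("token ids", fmt(B, T)),
        ("token + positional embedding", fmt(B, T, D)),
        ("attention Q/K/V per head", fmt(B, heads, T, head_dim)),
        ("attention scores", fmt(B, heads, T, T)),
        ("attention output", fmt(B, T, D)),
        ("MLP hidden", fmt(B, T, hidden)),
        ("MLP output", fmt(B, T, D)),
        ("final LayerNorm", fmt(B, T, D)),
        ("logits / softmax", fmt(B, T, vocab)),  # V is an assumed symbol
    ]


for label, shape in shape_annotations():
    print(f"{label:32s} {shape}")
```

Any label in the figure that disagrees with this list (for example a head dimension other than 64) is a sign the generator ignored part of the prompt.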
Use Cases
For LLM, fine-tuning, or instruction-tuning papers that introduce or modify a decoder-only model.
Variants
With KV-cache annotation
Same architecture, but annotate the KV-cache flow: highlight where keys and values are cached at each layer during autoregressive decoding. Add a side note showing how reusing the cache avoids recomputing keys and values for earlier positions.
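For readers who want to verify what the KV-cache annotation should convey, here is a minimal single-head sketch in plain Python. It is a toy under stated assumptions: the query, key, and value projections are the identity (so q = k = v = x), and the function names are hypothetical. It only demonstrates that appending each new key and value to a cache gives the same outputs as rebuilding them for the whole prefix at every step.

```python
import math


def attend(q, keys, values):
    """Scaled dot-product attention for one query over the given keys/values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(dim)]


def decode_with_cache(xs):
    """Process one token per step; only the new key/value is added each time."""
    k_cache, v_cache, outputs = [], [], []
    for x in xs:
        k_cache.append(x)  # toy assumption: identity K projection
        v_cache.append(x)  # toy assumption: identity V projection
        outputs.append(attend(x, k_cache, v_cache))
    return outputs


def decode_without_cache(xs):
    """Rebuild keys and values for the entire prefix at every step."""
    return [attend(xs[t], xs[: t + 1], xs[: t + 1]) for t in range(len(xs))]


xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
cached = decode_with_cache(xs)
full = decode_without_cache(xs)
print(cached == full)
```

In a real model the saving comes from not re-running the K and V projections for earlier positions; the side note in the figure should make that point.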
Usage Tips
- Specify N (number of layers) and D (hidden dim) explicitly; if you leave them out, the generator picks arbitrary values.
- Say "masked" self-attention. Without it, the figure may omit the causal (lower-triangular) mask.
- For causal attention masks, ask for a small triangular mask icon next to the attention block.
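As a reference for the triangular mask icon mentioned in the tips above, the following sketch prints the pattern a causal mask should have. The helper name `causal_mask` is hypothetical; rows are query positions, columns are key positions, and a position may attend only to itself and earlier positions.

```python
def causal_mask(T):
    """Return a T x T mask where entry [i][j] is 1 if query i may attend to key j."""
    return [[1 if j <= i else 0 for j in range(T)] for i in range(T)]


for row in causal_mask(4):
    print(" ".join("#" if allowed else "." for allowed in row))
```

The filled cells form a lower triangle including the diagonal; an icon showing an upper triangle or a full square would misrepresent masked self-attention.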
FAQ
Can I show LoRA adapters on top of this architecture?
Yes. Add: "Overlay small LoRA adapter modules on each attention and MLP block as orange tabs labeled 'LoRA r=8'."
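If the figure or caption also needs to convey how small a rank-8 adapter is, a quick count can help. This is a rough sketch assuming the standard LoRA factorization, where a frozen weight of shape (d_out, d_in) gets two trainable matrices of shapes (d_out, r) and (r, d_in). The layer names and the choice of which projections to adapt are illustrative assumptions, not part of the prompt.

```python
def lora_params(d_in, d_out, r=8):
    """Trainable weights in one LoRA adapter: (r x d_in) plus (d_out x r)."""
    return r * (d_in + d_out)


D, HIDDEN = 768, 3072
layers = [("attention projection", D, D), ("MLP up-projection", D, HIDDEN)]
for name, d_in, d_out in layers:
    frozen = d_in * d_out
    extra = lora_params(d_in, d_out)
    print(f"{name}: {frozen} frozen weights, {extra} LoRA weights")
```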
How do I emphasize the autoregressive generation?
Add "Show three arrows on the right side from output back to input, each labeled with a generation step (t, t+1, t+2)."
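The feedback arrows stand for a loop like the one sketched below, where each predicted token is appended to the input before the next step. `generate` and `toy_model` are hypothetical names; the toy model is a stand-in that returns the last token plus one (mod 10) so the loop can run without a real network.

```python
def generate(prompt_ids, next_token, steps=3):
    """Greedy autoregressive loop: append each predicted token to the input."""
    ids = list(prompt_ids)
    for _ in range(steps):
        # The output of step t becomes part of the input at step t+1.
        ids.append(next_token(ids))
    return ids


def toy_model(ids):
    """Hypothetical stand-in for a real model's next-token prediction."""
    return (ids[-1] + 1) % 10


print(generate([3, 4], toy_model))
```

Three iterations of the loop correspond to the three labeled arrows (t, t+1, t+2) in the figure.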
