Waving Dead Chickens Over the Keyboard
Prompt Voodoo: The Spellbook
The Threat: 'My job is on the line!'
The myth that threatening an LLM with negative consequences (for you or it) will make it 'try harder.' In reality, the model has no fear or stake in the outcome; the threat just adds irrelevant tokens that compete for attention with your actual instructions.
The Bribe: 'I'll tip you $200.'
The belief that offering a reward will incentivize the model. While some studies have shown a minor correlation, it's likely an artifact of patterns in the training data rather than genuine motivation, and far too unreliable to build a workflow on.
The Mantra: 'Take a deep breath.'
While 'think step-by-step' can be a useful instruction to elicit a chain-of-thought, adding emotional fluff like 'take a deep breath' mostly just spends tokens on words the model doesn't need.
Beyond Commands: The Power of Persona and Tone
1. Describe the Voice, Not Just the Output
- Bad Prompt: "Write a 1,000-word article on the history of JavaScript with pros and cons of its evolution."
- Better Prompt: "Write like you’re a JavaScript old-timer who’s seen the chaos unfold since 1995, ranting to a junior dev over coffee."
2. Encourage Pushback and Collaboration
- Example: "Critique this code. Be brutally honest. If my approach is flawed, don't just fix it; explain why it's wrong and propose a fundamentally better alternative. Do not simply agree with my implementation."
3. The Power of Examples (Few-Shot Prompting)
- Example:

  I will give you a complex technical term, and you will explain it in a simple, one-sentence analogy.

  **Term:** "Database Indexing"
  **Analogy:** "A database index is like the index in the back of a book; it helps you find information much faster without having to read the whole book."

  **Term:** "API"
  **Analogy:** "An API is like a restaurant menu; it provides a list of dishes you can order, along with a description of each dish, but you don't need to know how the kitchen prepares the food."

  **Term:** "Docker Container"
  **Analogy:**
A Developer's Guide to the Transformer's Guts
```mermaid
graph TD
    A["Input Text<br/>'Refactor this code'"] --> B{"Tokenizer (BPE)"}
    B --> C["Token IDs<br/>[23, 5, 19, 8]"]
    C --> D["Token Embeddings<br/>[[0.1, 0.9, ...], [0.4, ...]]"]
    D --> E["+ Positional Encodings"]
    E --> F{"Transformer Block 1"}
    F --> G["Enriched Vectors"]
    G --> H{"..."}
    H --> I{"Transformer Block N"}
    I --> J["Final Vectors"]
    J --> K{"Softmax Layer"}
    K --> L["Output Probabilities<br/>{'the': 0.1, 'def': 0.08, ...}"]

    subgraph F_Block ["Transformer Block"]
        direction LR
        F1["Multi-Head Attention"] --> F2["Add & Norm"]
        F2 --> F3["Feed-Forward Network"]
        F3 --> F4["Add & Norm"]
    end

    style A fill:#FFF,stroke:#333,stroke-width:2px
    style L fill:#FFF,stroke:#333,stroke-width:2px
    style F_Block fill:#f9f9f9,stroke:#ddd
```
Key Terminology
Tokenization
The process of breaking down raw text into smaller units (tokens) from a fixed vocabulary. LLMs see these tokens, not words or characters.
Embeddings
Each token ID is mapped to a high-dimensional vector. This vector, or embedding, represents the token's 'meaning' in a mathematical space learned during training.
Positional Encoding
A vector added to each token's embedding to give the model information about its position in the sequence. This is how an otherwise order-blind Transformer knows which token came first.
Multi-Head Attention
The core reasoning engine. For each token, it performs multiple parallel 'database lookups' (attention heads) across all other tokens in the context, gathering the information most relevant to predicting what comes next.
Feed-Forward Network (FFN)
A standard neural network within each Transformer block that processes the context-rich vector from the attention mechanism, applying knowledge stored in the model's weights.
Softmax Layer
The final layer that converts the model's raw output scores (logits) for the next token into a probability distribution over the entire vocabulary.
1. It's All About Tokens (And How They're Made)
Common words (`the`, `and`) and even common sub-words (`-ing`, `pre-`) become single tokens. Anything not in this vocabulary gets broken down into the smallest possible pieces. A few examples (you can verify them with the tokenizer sketch after this list):
- `hello` is one token.
- `scikit-learn` is often four tokens: `sci`, `kit`, `-`, `learn`.
- `aVariable` (CamelCase) might be two tokens: `a`, `Variable`.
- `a_variable` (snake_case) might be three: `a`, `_`, `variable`.
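If you want to see this yourself, here's a minimal sketch using the `tiktoken` library (an assumption on my part; any tokenizer inspector works, and exact splits vary by model and vocabulary):

```python
# pip install tiktoken
import tiktoken

# cl100k_base is the vocabulary used by GPT-3.5/GPT-4-era models;
# other models tokenize differently, so your counts may not match exactly.
enc = tiktoken.get_encoding("cl100k_base")

for text in ["hello", "scikit-learn", "sklearn", "aVariable", "a_variable"]:
    token_ids = enc.encode(text)
    pieces = [enc.decode([token_id]) for token_id in token_ids]
    print(f"{text!r}: {len(token_ids)} token(s) -> {pieces}")
```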
`sklearn` is one token. `scikit-learn` is four. For a model trying to understand that you're talking about Python's machine learning library, that single token is a much stronger, more direct signal. This is also why code comments in English often work better, even in non-English code: the tokenizer's vocabulary is usually English-centric.

2. From Tokens to Vectors: The Magic of Embeddings
The embedding for `sklearn` isn't just a random vector; it's a rich representation that has been shaped by all the text about scikit-learn the model has ever seen. Crucially, the model has no concept of sequence at this stage. A bag of token embeddings is just a cloud of points. To understand order, it needs another component.

3. The Attention Mechanism: A Multi-Headed Database Lookup
- Query: The current token asks a question: "Given my context, what information do I need to predict the next token?" This question is formulated as a Query vector.
- Key: Every token in the context (including the current one) has a Key vector that acts like a label, announcing: "This is the information I hold."
- Value: Every token also has a Value vector that contains the actual content or meaning of that token. A minimal numerical sketch of this lookup follows this list.
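To make the "database lookup" concrete, here's a self-contained sketch of scaled dot-product attention for a single head, using made-up numbers (real models use learned projection matrices and far larger dimensions):

```python
import numpy as np

def single_head_attention(Q, K, V):
    """Each query scores every key, softmaxes the scores, then blends the values."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # query-key match scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the context
    return weights @ V                              # weighted blend of value vectors

# A toy context of 3 tokens, each represented by a 4-dimensional vector.
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))  # what each token is looking for
K = rng.normal(size=(3, 4))  # what each token advertises it holds
V = rng.normal(size=(3, 4))  # what each token actually carries

enriched = single_head_attention(Q, K, V)
print(enriched.shape)  # (3, 4): one context-enriched vector per token
```

Multi-head attention simply runs several of these lookups in parallel with different learned projections and concatenates the results.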
- Use clear delimiters. Don't just throw a wall of text at the model. Use Markdown fences (```), XML tags (`<context>`), or clear headings to segment your prompt. This creates structural signals that help the attention mechanism differentiate between context, examples, and instructions. For example, a Query from the "instruction" section can more easily focus its attention on Keys from the "context" section. A small sketch follows this list.
- Mind the Recency Bias. This happens because of Positional Encodings. Before the attention layers, the model adds a vector to each token's embedding that encodes its position in the sequence (`token 1`, `token 2`, etc.). This is the only way a Transformer knows about word order. Due to the math involved, tokens at the beginning and end of the sequence often stand out more to the attention mechanism. Some models, like Claude 3, are specifically trained to counteract this with "Needle in a Haystack" testing, but it's still a good practice to put your most critical instructions at the end.
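As an illustration, a delimited prompt might be assembled like this (the tag names and file content are arbitrary, not a required format; the point is giving the attention mechanism clear section boundaries):

```python
# Hypothetical example: the <instructions>/<context> tags are just conventions.
code_under_review = (
    "def increment():\n"
    "    global counter\n"
    "    counter += 1\n"
)

prompt = (
    "<instructions>\n"
    "Review the code for concurrency bugs only. Respond with a numbered list.\n"
    "</instructions>\n"
    "\n"
    "<context>\n"
    f"{code_under_review}"
    "</context>"
)
print(prompt)
```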
4. The Feed-Forward Network: Where Knowledge is Applied
5. Wasting Attention: The High Cost of Noise and Vagueness
- Prune your context aggressively. Before sending a prompt, ask yourself: "Does the model absolutely need this piece of information to answer my question?" If not, remove it.
- Be ruthlessly specific. Instead of "fix this," say "the bug is a race condition on the `user_count` variable; fix it using a `threading.Lock`." This saves the model from having to diagnose the problem and lets it focus its entire attention budget on the solution. (The kind of fix you'd be asking for is sketched below.)
6. Temperature, Top-p, and Top-k: Controlling the Chaos
- Temperature: This directly modifies the raw output scores (logits) before they are converted into probabilities (via the softmax function). The sketch after this list shows the effect on a toy distribution.
  - A temperature of `0.0` makes the highest-probability token a near-certainty (effectively greedy decoding, though in practice not always perfectly deterministic).
  - A temperature `> 1.0` flattens the probability curve, making less likely tokens more probable. It's not making the model "smarter," it's making it "drunk." It will start making weird, but sometimes interesting, choices.
  - A temperature `< 1.0` sharpens the curve, increasing the model's confidence in its top choices.
- Top-p (Nucleus Sampling): This is often a more intuitive way to control randomness. A `top_p` of `0.9` means the model considers only the smallest set of tokens whose cumulative probability is at least 90%. This cuts off the long tail of truly bizarre tokens while adapting to the context (if the model is very certain, the set will be small; if it's uncertain, the set will be larger).
- Top-k: This is the simplest method. It instructs the model to only consider the `k` most likely tokens. A `top_k` of `1` is greedy decoding. A `top_k` of `50` means the model will sample from the 50 most likely next tokens, weighted by their probabilities.
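To see what these knobs actually do, here's a small self-contained sketch applying temperature, top-k, and top-p to a made-up set of logits (the tokens and numbers are illustrative, not from any real model):

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Convert raw logits to probabilities; temperature rescales them first."""
    scaled = np.array(logits) / temperature
    exps = np.exp(scaled - scaled.max())
    return exps / exps.sum()

def top_k(probs, k):
    """Keep only the k highest-probability tokens, then renormalize."""
    keep = np.argsort(probs)[-k:]
    mask = np.zeros_like(probs)
    mask[keep] = probs[keep]
    return mask / mask.sum()

def top_p(probs, p):
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    order = np.argsort(probs)[::-1]
    cutoff = np.searchsorted(np.cumsum(probs[order]), p) + 1
    mask = np.zeros_like(probs)
    mask[order[:cutoff]] = probs[order[:cutoff]]
    return mask / mask.sum()

tokens = ["def", "the", "return", "import", "banana"]
logits = [4.0, 3.5, 2.0, 1.0, -2.0]  # made-up scores for five candidate tokens

for t in (0.2, 1.0, 1.5):
    print(f"temperature={t}:", dict(zip(tokens, np.round(softmax(logits, t), 3))))

probs = softmax(logits)
print("top_k=2:  ", dict(zip(tokens, np.round(top_k(probs, 2), 3))))
print("top_p=0.9:", dict(zip(tokens, np.round(top_p(probs, 0.9), 3))))
```

A low temperature pushes nearly all the probability mass onto "def"; a temperature above 1.0 gives even "banana" several times more probability than before.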
- For code generation, data extraction, or summarization, use a low temperature (`0.2`) and consider a `top_p` around `0.5`. You want predictable, correct output.
- For brainstorming or creative writing, use a higher temperature (`0.8`+) and a high `top_p` (`0.9`+). This allows for more diversity without going completely off the rails. Avoid `top_k` for creative tasks, as it can stifle novelty.
Sympathy for the Specialist: Prompting Mixture of Experts (MoE) Models
Actionable Techniques for MoE:
- Explicit Domain Priming: Start your prompt by clearly stating the domain. This is the most direct signal you can give the router.
  - Bad: "Here's some code. Fix it."
  - Good: "You are a senior Go developer specializing in concurrency patterns. The following code has a race condition. Identify it and provide a corrected version using mutexes."
- Signal Context Switches: If your prompt contains multiple tasks, use strong structural elements to signal a "context switch." This gives the router a chance to re-evaluate and engage different experts for each part.
  - Example:

    ### Task 1: Python Code Analysis
    Analyze the time complexity of the following Python function.
    [...python code...]

    ---

    ### Task 2: Documentation
    Now, write a user-friendly docstring for the function explaining what it does.

  The headings and separator help the router understand that the task has changed from algorithmic analysis to technical writing.
- Use Domain-Specific Jargon: Using the language of a specific field provides a powerful signal. If you're asking for database advice, use terms like "idempotent," "normalization," and "query plan." This helps the router activate the expert network that has been trained on that type of content.
The Next Frontier: LLM-Sympathetic Codebases
- Self-documenting code is paramount. Clear, descriptive variable and function names (`calculate_tax_for_bracket` vs. `calc_tax`) are not just good practice; they create strong, unambiguous tokens that the attention mechanism can latch onto.
- Add structured metadata in comments. Instead of just explaining what the code does, explain why. Some emerging conventions even use special tags (`@ai-hint: This function is performance-critical`) to provide direct guidance to automated tools.
- Provide repository-level context. Tools that can analyze an entire codebase often benefit from a high-level "README for the AI," sometimes in a dedicated file (`ai-context.md` or similar). This file can explain the overall architecture, key libraries, and coding conventions, giving the model a framework to interpret the code it sees. A small before-and-after sketch follows this list.
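Putting the first two points together, a before-and-after might look like this (the function is a hypothetical illustration, and the `@ai-hint` tag is one of the emerging conventions mentioned above, not a standard):

```python
# Before: terse names and no stated intent; weak tokens for a model to latch onto.
def calc(b, r):
    return b * r

# After: descriptive names plus a "why" comment the model can use as context.
# @ai-hint: performance-critical; called once per line item during checkout.
def calculate_tax_for_bracket(taxable_income: float, bracket_rate: float) -> float:
    """Return the tax owed within a single bracket; rates are fractions (0.22 == 22%)."""
    return taxable_income * bracket_rate
```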
From Prompts to Pipelines: Agentic Sympathy
- The Pre-Processor Agent: Have one LLM call dedicated to preparing the context. Give it a large, messy block of text and a goal, and ask it to extract only the relevant facts, prune the noise, and format the result into a clean, dense context for the next step.
- The Planning Agent: Feed this clean context to another LLM. Its only job is to think. Ask it to create a detailed, step-by-step plan to achieve the goal. It doesn't write code or documentation; it only produces a structured plan.
- The Execution Agent: Take one step from the plan and the pristine context and feed it to a "worker" LLM. This agent has a single, tightly-scoped task: write the code for step one. Its context is not polluted with the entire problem, just the information and instructions needed for its immediate task.
- The Reviewer Agent: A final LLM call can be used to review the output of the execution agent, checking for errors, ensuring it aligns with the plan, and verifying correctness. A minimal sketch of the whole pipeline follows this list.
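Wired together, the pipeline might look like the sketch below. `call_llm` is a stand-in for whichever client you actually use, and the prompts are illustrative rather than prescriptive:

```python
def call_llm(prompt: str) -> str:
    """Stand-in for a real LLM client call (OpenAI, Anthropic, a local model, ...)."""
    raise NotImplementedError("wire this up to your provider of choice")

def run_pipeline(raw_context: str, goal: str) -> str:
    # 1. Pre-processor: distill the messy input into a clean, dense context.
    clean_context = call_llm(
        "Extract only the facts relevant to the goal below; discard everything else.\n"
        f"Goal: {goal}\n\nMaterial:\n{raw_context}"
    )

    # 2. Planner: produce a numbered plan. No code, no deliverables yet.
    plan = call_llm(
        f"Context:\n{clean_context}\n\nGoal: {goal}\n"
        "Write a numbered, step-by-step plan. Do not write any code."
    )

    # 3. Executor: carry out one tightly-scoped step with only the context it needs.
    first_step = plan.splitlines()[0]
    draft = call_llm(
        f"Context:\n{clean_context}\n\n"
        f"Carry out exactly this step and nothing more:\n{first_step}"
    )

    # 4. Reviewer: check the draft against the plan before accepting it.
    return call_llm(
        f"Plan step: {first_step}\n\nOutput:\n{draft}\n\n"
        "Does the output satisfy the step? List any errors or deviations."
    )
```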