Notes from reading Prompt Engineering for LLMs by John Berryman and Albert Ziegler.
What are LLMs?
- A function that takes text as an input (prompt) and returns text as the output (completion) based on a prediction
- Rote memorization is considered a defect (overfitting)
- What’s important to the quality of the model is that it is trained to apply patterns it encounters, not just recite back the training data
- You can build an intuition around what an LLM will return by knowing more about the training data
LLMs can hallucinate and make things up
- Hallucinations are plausible information produced confidently, often with no warning that they could be wrong
- Truth bias occurs when a model assumes that the content of the prompt is factually true—this is why the quality of the prompt, especially when prompts are programmatically generated, is so important
- LLMs don’t process text, they process tokens
LLMs process tokens not text
- Tokenization takes a word and breaks it down into one or more tokens that represent the word
- A model’s vocabulary is the set of tokens it was trained with
- Tokenization is deterministic, and LLMs can’t examine individual letters, so a model can’t reverse the spelling of a word without garbling it or answer a question about words that start with certain letters (“Sweden” is one token, so ChatGPT can’t give you a list of countries that start with “SW”)
- Capitalization can be tokenized differently than non-capitalized text, which can cause issues with responses (e.g. “worlds” is one token but “WORLDS” is three; see the tokenizer sketch after this list)
- Costs such as time, computation, and money scale linearly with the number of tokens, which is an important scaling consideration
- Models are constrained to a context window set by the number of tokens in the prompt plus the completion
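A quick way to build intuition here is to run strings through a tokenizer yourself. A minimal sketch using OpenAI's tiktoken library (the exact token splits depend on the encoding):

```python
# Inspect how casing changes tokenization; cl100k_base is one common encoding.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for text in ["worlds", "WORLDS", "Sweden"]:
    token_ids = enc.encode(text)
    print(text, len(token_ids), [enc.decode([t]) for t in token_ids])
```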
LLMs are autoregressive
- Completions are autoregressive: the output is a prediction of the next most likely token given the prompt, which is appended to the end, and the process repeats until a stop sequence is produced (see the decoding loop sketch after this list)
- Models cannot backtrack while producing a response: there is no pausing and thinking, each token is committed to
- This is why an LLM can appear stubborn and go down a path that seems completely wrong to us, and it’s up to the application designer to spot and correct these cases
- This also makes LLMs prone to repetition, because it can be unclear when to stop repetitive output such as a list of items
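A minimal sketch of the autoregressive loop, where `next_token` stands in for a model's forward pass (a hypothetical placeholder, not a real API):

```python
def generate(prompt_tokens, next_token, stop_token, max_tokens=256):
    tokens = list(prompt_tokens)
    for _ in range(max_tokens):
        tok = next_token(tokens)   # predict one token from everything emitted so far
        if tok == stop_token:      # generation ends when a stop token appears
            break
        tokens.append(tok)         # committed: there is no backtracking
    return tokens
```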
LLMs compute the likelihood of all tokens
- Token probabilities are exposed as log probabilities (logprobs), which range from negative infinity up to 0 (a logprob of 0 means probability 1)
- If temperature is greater than 0, the model returns a stochastic next token which may or may not have the highest probability
- Temperature of 0 means the model always returns the most likely token; this is the closest to deterministic and gives the highest accuracy
- Temperature between 0 and 1 will provide more variation, useful when you want it to come up with a list of different solutions
- Temperature of 1 mirrors the token probability in the training set
- Temperature greater than 1 often makes the LLM sound drunk (see the sampling sketch after this list)
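A minimal sketch of how temperature reshapes the next-token distribution before sampling (the logits here are made-up values for illustration):

```python
import numpy as np

def sample(logits, temperature):
    if temperature == 0:                     # greedy: always the most likely token
        return int(np.argmax(logits))
    scaled = np.array(logits) / temperature  # higher temperature flattens the distribution
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()                     # softmax over the scaled logits
    return int(np.random.choice(len(probs), p=probs))

logits = [2.0, 1.0, 0.1]  # hypothetical scores for three candidate tokens
print([sample(logits, t) for t in (0, 0.7, 1.5)])
```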
Transformers
- The transformer architecture takes the sequence of tokens and feeds it forward through multiple attention layers to arrive at a predicted next token, sampled from the possible tokens
- LLMs read through the text once from beginning to end with no way to pause or modify previous tokens
- The way to visualize it is as “backward and downward”: information can be passed between “minibrains” on the same layer by looking at the data from the preceding “minibrains”, but never from a higher layer down to a lower layer (a tiny mask illustration follows this list)
- This is why the order of the prompt matters, and why asking the model to count the number of words in a text after providing the text does not work
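A tiny illustration of the "backward only" constraint as a causal attention mask (assuming four token positions):

```python
import numpy as np

n = 4
mask = np.tril(np.ones((n, n), dtype=int))  # row i may attend to columns 0..i only
print(mask)
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]
```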
Chat
While completion is the foundation, chat is more useful as evidenced by the success of ChatGPT.
Reinforcement learning from human feedback (RLHF)
- Model alignment is the process of fine-tuning a model to make completions match what a user would expect
- RLHF keeps LLMs honest because the reinforcement is consistent with the corpus of data the model was trained on, rather than introducing external, out-of-scope data that a human labeler might otherwise bring in
- RLHF is efficient: for GPT-3, only 13,000 handcrafted documents plus the ranking of completions were needed, produced by a team of about 40 part-time people
- A base model can be used to generate an HHH-aligned model (helpful, honest, harmless)
- The process for generating an HHH-aligned model using RLHF
- Create an intermediate, supervised fine-tuning model (SFT) trained on transcripts that represent the desired behavior (13,000 documents for GPT-3)
- Create a reward model that measures completion quality (producing a numerical score rather than text): provide an example prompt, generate multiple completions (4 to 9), have humans rank the responses from best to worst (33,000 ranked documents for GPT-3), then use that data to fine-tune the SFT model into the reward model (one common formulation of the ranking objective is sketched after this list)
- Create the RLHF model: provide the SFT model with example prompts (31,000 for GPT-3), judge each completion using the reward model, and use that as the training data to fine-tune the RLHF model
- Use proximal policy optimization (PPO) to modify model weights to improve the reward model score as long as it doesn’t significantly diverge from the SFT model output—this prevents cheating or lying to maximize the score
- The alignment tax is the tendency for RLHF to make the LLM dumber
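One common formulation of the reward-model training objective (a sketch of the standard pairwise ranking loss, not taken from the book): the completion ranked better should receive a higher reward than the one ranked worse.

```python
import math

def ranking_loss(reward_better, reward_worse):
    # -log(sigmoid(r_better - r_worse)): small when the better completion
    # scores well above the worse one, large when the order is wrong
    return -math.log(1 / (1 + math.exp(-(reward_better - reward_worse))))

print(ranking_loss(1.2, 0.3))  # ~0.34
```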
Instruct model to chat model
- Instruct models were trained to respond to questions and instructions rather than just completing a document. This is more useful for use cases like brainstorming, summarization, rewriting, classification, and chat
- However, it’s not always clear whether the user wants a completion or instruction
- Chat models are RLHF fine-tuned to complete transcript documents annotated with ChatML, a markup language that describes the interaction between the user and the assistant, with a global system message to set the expected behavior (see the transcript sketch after this list)
- ChatML solves the problem of ambiguity in what the user expects
- ChatML also prevents prompt injection by making it impossible to spoof the assistant message and inject different behavior
- Do not inject user content into a system message because models are trained to closely follow the system message and that could allow for exploiting vulnerabilities
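A rough sketch of what a ChatML-style transcript looks like and the equivalent messages list a chat API expects (token names approximate):

```python
transcript = (
    "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
    "<|im_start|>user\nWhat is tokenization?<|im_end|>\n"
    "<|im_start|>assistant\n"  # the model completes from here
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is tokenization?"},
]
```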
Downsides of chat models
- Alignment tax shows up as degradation in certain tasks performed by chat models compared to completion models
- Behavior is less straightforward and responses can be “chatty” due to the way they are trained
- Less diversity in completions due to RLHF fine-tuning which makes responses more uniform by design compared to the training set (the whole internet)
- Practical limitations for specific domains: if you wanted to brainstorm medical treatments for a patient, you wouldn’t want the model arguing that you should seek professional help when you are the doctor
- Example: GitHub Copilot uses completion rather than transcript completion (chat)
Prompt engineers are playwrights and the showrunner, orchestrating the interaction between the LLM and the user. They might insert fabricated assistant or user messages into the transcript to further the goal of the play.
Useful OpenAI chat completion API parameters
logprobs
returns the probability of each selected token so you can see how confident the model was with portions of the answer
n
determines how many completions to generate in parallel (128 is the maximum), useful for evaluating a model
temperature
returns more creative results at the expense of accuracy (as noted earlier)
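A minimal sketch of a chat completion request using these parameters (OpenAI Python SDK v1 style; the model name is just an example):

```python
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Name three countries."}],
    temperature=0.7,  # more variation than 0
    n=3,              # three completions in parallel
    logprobs=True,    # include per-token log probabilities
)
for choice in response.choices:
    print(choice.message.content)
```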
Designing LLM applications
Starting from a user’s problem, the LLM application loop (sketched after this list):
- Converts the user’s problem to the model domain
- Prompts the LLM
- Converts the response back to the user’s domain
- Maybe repeat
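A minimal sketch of that loop; every function passed in here is a hypothetical placeholder standing in for application-specific logic:

```python
def handle_request(user_problem, to_model_domain, call_llm, to_user_domain, is_resolved, refine):
    """Run the LLM application loop until the user's problem is resolved."""
    while True:
        prompt = to_model_domain(user_problem)   # convert the problem into the model domain
        completion = call_llm(prompt)            # prompt the LLM
        result = to_user_domain(completion)      # convert the response back to the user's domain
        if is_resolved(result):                  # maybe repeat
            return result
        user_problem = refine(user_problem, result)
```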
Converting the user’s problem to the model domain
- The prompt must
- Closely resemble content from the training set, otherwise the LLM won’t know how to respond
- Include all information relevant to addressing the user’s problem—querying the right context and including it in the prompt is domain specific and is critical to getting a good response
- Lead the model to generate a completion that actually addresses the problem—this is harder for completions since chat is already fine-tuned to provide an answer
- Set up the LLM for a completion that has a reasonable end point so generation can stop, otherwise it will keep going indefinitely
Use the LLM to complete the prompt
- Different models have different tradeoffs (e.g. speed, accuracy, cost)
- Sometimes it’s better to use a “worse” model if better reasoning ability is not needed, or it might be better to fine-tune a model and use that instead of a general-purpose model
Transforming back to the user domain
- Sometimes a response goes beyond text to actually performing an action
- Function calling is now available, which makes it possible to call an external API or take some other action programmatically based on the completion (a sketch follows this list)
- Depending on the application, you might convert the completed text to some UI change or change the medium of the response e.g. text to speech
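A minimal sketch of OpenAI-style function calling; the `get_weather` tool is a hypothetical example, not something from the book:

```python
import json
from openai import OpenAI

client = OpenAI()
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What's the weather in Oslo?"}],
    tools=tools,
)
call = response.choices[0].message.tool_calls[0]
args = json.loads(call.function.arguments)  # e.g. {"city": "Oslo"}
# The application would now run its own get_weather(**args) and send the result back.
```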
The “feedforward pass” is the process for converting the user’s problem into a prompt. This involves:
- Context retrieval: direct context comes from the user e.g. the text they typed into an input field, indirect context is gathered from other sources like documents related to the user’s input
- Snippetizing context: extract the most relevant context in text format
- Scoring and prioritizing snippets: trim down to the best content so that it fits in the LLM’s context window and the LLM doesn’t get confused, using priorities (ranked groups of content) and scores (rank within a priority group); a sketch follows this list
- Prompt assembly: make sure the prompt fits and conveys the problem with supporting context, possibly summarizing sections further to make it fit; it should read like a document found in the training data
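A minimal sketch of scoring, prioritizing, and assembling snippets under a token budget; `count_tokens` and the snippet fields are hypothetical:

```python
def assemble_prompt(task, snippets, count_tokens, budget=4000):
    # Lower priority number = more important group; higher score = better within a group
    ranked = sorted(snippets, key=lambda s: (s["priority"], -s["score"]))
    parts, used = [task], count_tokens(task)
    for snippet in ranked:
        cost = count_tokens(snippet["text"])
        if used + cost > budget:   # skip snippets that no longer fit the context window
            continue
        parts.append(snippet["text"])
        used += cost
    return "\n\n".join(parts)
```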
Evaluating quality
- LLMs are probabilistic and often make mistakes so you must constantly evaluate quality
- Offline evaluation: find ways to judge the output of the application; techniques like LLM-as-judge can automate this (a sketch follows this list), and other times there is a domain-specific eval that makes sense, like testing code completions by running unit tests afterward
- Online evaluation: use application specific measurements that approximate quality like the number of code completions users accepted
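A minimal sketch of LLM-as-judge offline evaluation; the rubric and model name are illustrative, not from the book:

```python
from openai import OpenAI

client = OpenAI()

def judge(question, answer):
    prompt = (
        "Rate the following answer from 1 (bad) to 5 (excellent). "
        "Reply with only the number.\n\n"
        f"Question: {question}\nAnswer: {answer}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep the judge as deterministic as possible
    )
    return int(response.choices[0].message.content.strip())
```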
Prompt content
Links to this note
- Do Higher Temperatures Make LLMs More Creative?
Higher temperatures tell LLMs not to always use the highest-probability next token when generating a completion. This has the effect of producing a wider range of possible responses.