Notes from reading Prompt Engineering for LLMs by John Berryman and Albert Ziegler.
What are LLMs?
- A function that takes text as an input (prompt) and returns text as the output (completion) based on a prediction
- Rote memorization is considered a defect (overfitting)
- What’s important to the quality of the model is that it is trained to apply patterns it encounters, not just recite back the training data
- You can build an intuition around what an LLM will return by knowing more about the training data
LLMs can hallucinate and make things up
- Hallucinations are plausible information produced confidently, often with no warning that they could be wrong
- Truth bias occurs when a model assumes that the content of the prompt is factually true—this is why the quality of the prompt, especially when prompts are programmatically generated, is so important
- LLMs don’t process text, they process tokens
LLMs process tokens not text
- Tokenization takes a word and breaks it down into one or more tokens that represent the word
- A model’s vocabulary is the set of tokens it was trained with
- Tokenization is deterministic, and LLMs can’t examine individual letters, so a model can’t reverse the spelling of a word without garbling it or answer a question about words that start with certain letters (“Sweden” is one token, so ChatGPT can’t give you a list of countries that start with “SW”)
- Capitalization can be tokenized differently than non-capitalized text, which can cause issues with responses (e.g. “worlds” is one token but “WORLDS” is three; see the tokenizer sketch after this list)
- Costs such as time, computation, and money scale linearly with the number of tokens, which is an important scaling consideration
- Models are constrained to a context window set by the number of tokens in the prompt plus the completion
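A quick way to build intuition here is to run strings through a tokenizer yourself. A minimal sketch using OpenAI's tiktoken library (the exact token splits depend on the encoding):

```python
# Inspect how casing changes tokenization; cl100k_base is one common encoding.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for text in ["worlds", "WORLDS", "Sweden"]:
    token_ids = enc.encode(text)
    print(text, len(token_ids), [enc.decode([t]) for t in token_ids])
```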
LLMs are autoregressive
- Completions are autoregressive: the output is a prediction of the next most likely token given the prompt, which is appended to the end, and the process repeats until a stop sequence is produced (see the decoding loop sketch after this list)
- Models cannot backtrack while producing a response: there is no pausing and thinking, each token is committed to
- This is why an LLM can appear stubborn and go down a path that seems completely wrong to us, and it’s up to the application designer to spot and correct these cases
- This also makes LLMs prone to repetition, because it can be unclear when to stop repetitive output such as a list of items
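A minimal sketch of the autoregressive loop, where `next_token` stands in for a model's forward pass (a hypothetical placeholder, not a real API):

```python
def generate(prompt_tokens, next_token, stop_token, max_tokens=256):
    tokens = list(prompt_tokens)
    for _ in range(max_tokens):
        tok = next_token(tokens)   # predict one token from everything emitted so far
        if tok == stop_token:      # generation ends when a stop token appears
            break
        tokens.append(tok)         # committed: there is no backtracking
    return tokens
```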
LLMs compute the likelihood of all tokens
- Token probabilities are exposed as log probabilities (logprobs), which range from negative infinity up to 0 (a logprob of 0 means probability 1)
- If temperature is greater than 0, the model returns a stochastic next token which may or may not have the highest probability
- Temperature of 0 means the model always returns the most likely token; this is the closest to deterministic and gives the highest accuracy
- Temperature between 0 and 1 will provide more variation, useful when you want it to come up with a list of different solutions
- Temperature of 1 mirrors the token probability in the training set
- Temperature greater than 1 often makes the LLM sound drunk (see the sampling sketch after this list)
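A minimal sketch of how temperature reshapes the next-token distribution before sampling (the logits here are made-up values for illustration):

```python
import numpy as np

def sample(logits, temperature):
    if temperature == 0:                     # greedy: always the most likely token
        return int(np.argmax(logits))
    scaled = np.array(logits) / temperature  # higher temperature flattens the distribution
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()                     # softmax over the scaled logits
    return int(np.random.choice(len(probs), p=probs))

logits = [2.0, 1.0, 0.1]  # hypothetical scores for three candidate tokens
print([sample(logits, t) for t in (0, 0.7, 1.5)])
```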
Transformers
- The transformer architecture takes the sequence of tokens and feeds it forward through multiple attention layers to arrive at a predicted next token, sampled from the possible tokens
- LLMs read through the text once from beginning to end with no way to pause or modify previous tokens
- The way to visualize it is as “backward and downward”: information can be passed between “minibrains” on the same layer by looking at the data from the preceding “minibrains”, but never from a higher layer down to a lower layer (a tiny mask illustration follows this list)
- This is why the order of the prompt matters, and why asking the model to count the number of words in a text after providing the text does not work
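A tiny illustration of the "backward only" constraint as a causal attention mask (assuming four token positions):

```python
import numpy as np

n = 4
mask = np.tril(np.ones((n, n), dtype=int))  # row i may attend to columns 0..i only
print(mask)
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]
```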
Chat
While completion is the foundation, chat is more useful as evidenced by the success of ChatGPT.
Reinforcement learning from human feedback (RLHF)
- Model alignment is the process of fine-tuning a model to make completions match what a user would expect
- RLHF keeps LLMs honest because the reinforcement is consistent with the corpus of data the model was trained on, rather than introducing external, out-of-scope data that a human labeler might otherwise bring in
- RLHF is efficient: for GPT-3, only 13,000 handcrafted documents plus the ranking of completions were needed, produced by a team of about 40 part-time people
- A base model can be used to generate an HHH-aligned model (helpful, honest, harmless)
- The process for generating an HHH-aligned model using RLHF
- Create an intermediate, supervised fine-tuning model (SFT) trained on transcripts that represent the desired behavior (13,000 documents for GPT-3)
- Create a reward model that measures completion quality (producing a numerical score rather than text): provide an example prompt, generate multiple completions (4 to 9), have humans rank the responses from best to worst (33,000 ranked documents for GPT-3), then use that data to fine-tune the SFT model into the reward model (one common formulation of the ranking objective is sketched after this list)
- Create the RLHF model: provide the SFT model with example prompts (31,000 for GPT-3), judge each completion using the reward model, and use that as the training data to fine-tune the RLHF model
- Use proximal policy optimization (PPO) to modify model weights to improve the reward model score as long as it doesn’t significantly diverge from the SFT model output—this prevents cheating or lying to maximize the score
- The alignment tax is the tendency for RLHF to make the LLM dumber
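One common formulation of the reward-model training objective (a sketch of the standard pairwise ranking loss, not taken from the book): the completion ranked better should receive a higher reward than the one ranked worse.

```python
import math

def ranking_loss(reward_better, reward_worse):
    # -log(sigmoid(r_better - r_worse)): small when the better completion
    # scores well above the worse one, large when the order is wrong
    return -math.log(1 / (1 + math.exp(-(reward_better - reward_worse))))

print(ranking_loss(1.2, 0.3))  # ~0.34
```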
Instruct model to chat model
- Instruct models were trained to respond to questions and instructions rather than just completing a document. This is more useful for use cases like brainstorming, summarization, rewriting, classification, and chat
- However, it’s not always clear whether the user wants a completion or instruction
- Chat models are RLHF fine-tuned to complete transcript documents annotated with ChatML, a markup language that describes the interaction between the user and the assistant, with a global system message to set the expected behavior (see the transcript sketch after this list)
- ChatML solves the problem of ambiguity in what the user expects
- ChatML also prevents prompt injection by making it impossible to spoof the assistant message and inject different behavior
- Do not inject user content into a system message because models are trained to closely follow the system message and that could allow for exploiting vulnerabilities
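A rough sketch of what a ChatML-style transcript looks like and the equivalent messages list a chat API expects (token names approximate):

```python
transcript = (
    "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
    "<|im_start|>user\nWhat is tokenization?<|im_end|>\n"
    "<|im_start|>assistant\n"  # the model completes from here
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is tokenization?"},
]
```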
Downsides of chat models
- Alignment tax shows up as degradation in certain tasks performed by chat models compared to completion models
- Behavior is less straightforward and responses can be “chatty” due to the way they are trained
- Less diversity in completions due to RLHF fine-tuning which makes responses more uniform by design compared to the training set (the whole internet)
- Practical limitations for specific domains: if you wanted to brainstorm medical treatments for a patient, you wouldn’t want the model arguing that you should seek professional help when you are the doctor
- Example: GitHub Copilot uses completion rather than transcript completion (chat)
Prompt engineers are playwrights and the showrunner, orchestrating the interaction between the LLM and the user. They might insert fabricated assistant or user messages into the transcript to further the goal of the play.
Useful OpenAI chat completion API parameters
logprobs
returns the probability of each selected token so you can see how confident the model was with portions of the answer
n
determines how many completions to generate in parallel (128 is the maximum), useful for evaluating a model
temperature
returns more creative results at the expense of accuracy (as noted earlier)
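A minimal sketch of a chat completion request using these parameters (OpenAI Python SDK v1 style; the model name is just an example):

```python
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Name three countries."}],
    temperature=0.7,  # more variation than 0
    n=3,              # three completions in parallel
    logprobs=True,    # include per-token log probabilities
)
for choice in response.choices:
    print(choice.message.content)
```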
Designing LLM applications
Starting from a user’s problem, the LLM application loop (sketched after this list):
- Converts the user’s problem to the model domain
- Prompts the LLM
- Converts the response back to the user’s domain
- Maybe repeat
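A minimal sketch of that loop; every function passed in here is a hypothetical placeholder standing in for application-specific logic:

```python
def handle_request(user_problem, to_model_domain, call_llm, to_user_domain, is_resolved, refine):
    """Run the LLM application loop until the user's problem is resolved."""
    while True:
        prompt = to_model_domain(user_problem)   # convert the problem into the model domain
        completion = call_llm(prompt)            # prompt the LLM
        result = to_user_domain(completion)      # convert the response back to the user's domain
        if is_resolved(result):                  # maybe repeat
            return result
        user_problem = refine(user_problem, result)
```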
Converting the user’s problem to the model domain
- The prompt must
- Closely resemble content from the training set, otherwise the LLM won’t know how to respond
- Include all information relevant to addressing the user’s problem—querying the right context and including it in the prompt is domain specific and is critical to getting a good response
- Lead the model to generate a completion that actually addresses the problem—this is harder for completions since chat is already fine-tuned to provide an answer
- Set up the LLM for a completion that has a reasonable end point so generation can stop, otherwise it will keep going indefinitely
Use the LLM to complete the prompt
- Different models have different tradeoffs (e.g. speed, accuracy, cost)
- Sometimes it’s better to use a “worse” model if better reasoning ability is not needed, or it might be better to fine-tune a model and use that instead of a general-purpose model
Transforming back to the user domain
- Sometimes a response goes beyond text to actually performing an action
- Function calling is now available, which makes it possible to call an external API or take some other action programmatically based on the completion (a sketch follows this list)
- Depending on the application, you might convert the completed text to some UI change or change the medium of the response e.g. text to speech
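A minimal sketch of OpenAI-style function calling; the `get_weather` tool is a hypothetical example, not something from the book:

```python
import json
from openai import OpenAI

client = OpenAI()
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What's the weather in Oslo?"}],
    tools=tools,
)
call = response.choices[0].message.tool_calls[0]
args = json.loads(call.function.arguments)  # e.g. {"city": "Oslo"}
# The application would now run its own get_weather(**args) and send the result back.
```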
The “feedforward pass” is the process for converting the user’s problem into a prompt. This involves:
- Context retrieval: direct context comes from the user e.g. the text they typed into an input field, indirect context is gathered from other sources like documents related to the user’s input
- Snippetizing context: extract the most relevant context in text format
- Scoring and prioritizing snippets: trim down to the best content so that it fits in the LLM’s context window and the LLM doesn’t get confused, using priorities (ranked groups of content) and scores (rank within a priority group); a sketch follows this list
- Prompt assembly: make sure the prompt fits and conveys the problem with supporting context, possibly summarizing sections further to make it fit; it should read like a document found in the training data
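A minimal sketch of scoring, prioritizing, and assembling snippets under a token budget; `count_tokens` and the snippet fields are hypothetical:

```python
def assemble_prompt(task, snippets, count_tokens, budget=4000):
    # Lower priority number = more important group; higher score = better within a group
    ranked = sorted(snippets, key=lambda s: (s["priority"], -s["score"]))
    parts, used = [task], count_tokens(task)
    for snippet in ranked:
        cost = count_tokens(snippet["text"])
        if used + cost > budget:   # skip snippets that no longer fit the context window
            continue
        parts.append(snippet["text"])
        used += cost
    return "\n\n".join(parts)
```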
Evaluating quality
- LLMs are probabilistic and often make mistakes so you must constantly evaluate quality
- Offline evaluation: find ways to judge the output of the application; techniques like LLM-as-judge can automate this (a sketch follows this list), and other times there is a domain-specific eval that makes sense, like testing code completions by running unit tests afterward
- Online evaluation: use application specific measurements that approximate quality like the number of code completions users accepted
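A minimal sketch of LLM-as-judge offline evaluation; the rubric and model name are illustrative, not from the book:

```python
from openai import OpenAI

client = OpenAI()

def judge(question, answer):
    prompt = (
        "Rate the following answer from 1 (bad) to 5 (excellent). "
        "Reply with only the number.\n\n"
        f"Question: {question}\nAnswer: {answer}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep the judge as deterministic as possible
    )
    return int(response.choices[0].message.content.strip())
```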
Prompt content
Links to this note
- Do Higher Temperatures Make LLMs More Creative?
Higher temperatures tell LLMs not to always use the highest-probability next token when generating a completion. This has the effect of producing a wider range of possible responses.