Alex's Notes

Why I Like Ring Binders - Plotter Notebook
The Plotter notebook from Designphil is a minimal ring binder for planning and writing. I use the A5 size for work and journaling in addition to org-mode for managing tasks and writing permanent notes.

Being a ring binder allows for customization as you go. Instead of worrying about leaving space to add on to something you just wrote, you can open up the rings and place another page anytime in the future. You can add different page templates as needed using off-the-shelf refills (the Plotter refill paper is quite good) or even print your own (I recommend a standard size like A5 if you plan to do this).

Being a ring binder means pages lie flat. This is useful not only for writing comfort (which is very important) but also for referencing the contents at a glance while the notebook sits open on your desk as you work.

Writing with pen and paper is a welcome break from constant screen time. While far less efficient, research suggests analog writing is more effective for encoding information into long-term memory. I feel more focused because I have to fully concentrate on writing compared to typing.

See also:
- Notes are insurance for ideas
- Writing makes ideas rigid
Published Jan 19, 2025

n8n for Automation
I started trying out n8n for setting up automated workflows.

I tried setting up a basic HubSpot filter to collect all deals in a certain stage. Unfortunately, the API returns only IDs, not names, so I can’t tell by looking at the response which pipeline or stage a deal is in, because those are customized to our instance of HubSpot. A non-technical person probably couldn’t set this up.

How is it different from Dify?

n8n has out-of-the-box integrations with many B2B SaaS tools like HubSpot. It also provides triggers like cron jobs and webhooks, which Dify does not.

Notes on usage:
- Quirky UI: double click to edit, triple pane layout when editing with dismiss button on the left, when to use an expression or not for a field value, tools have to be linked visually to an LLM node
- When working off of large lists, speed up the feedback loop by executing a node only once while you test it: go to Settings -> Execute Once (don’t forget to flip it back)
- Clicking the test button doesn’t actually seem to execute the full flow
- To extract fields from a large API response, use the Edit node (horribly named)
- Tools can only be used in conjunction with a language model (e.g. SERP can’t be used to make arbitrary search queries in a workflow)
- HubSpot has a horrible API and the built-in integration in n8n does not make it any easier (e.g. you need to map IDs yourself by making multiple API calls and using a Merge node to join them)
- Scheduled tasks don’t run until you activate the workflow
Published Jan 18, 2025

Ink Reviews

This is my personal list of ink reviews based on my usage and preferences. Black ink unless otherwise noted.

Caran d’Ache 0.7mm (parker style)

Very smooth and dark on the page, which makes writing very clear. The ink takes a while to dry, so smudging happens frequently if I’m not careful.

Ohto Ceramic Gel Pen Refill 0.5mm (PG-M05NP, parker style, needle tip)

Writing is not consistently smooth and sometimes needs more pressure on the paper to write. Drying is relatively fast but heavier spots will still smudge a few seconds after writing. I would prefer a 0.7mm point but at time of writing, they do not make one.

Ohto Flash Dry Gel Pen Refill 0.5mm (PG-105NP, parker style, needle tip)

Much smoother than the Ohto Ceramic roller ball refill and the same dry time. Definitely recommend these over the PG-M05NP.

Ohto Ballpoint Refill 0.7mm (PS-107NP, parker style, needle tip)

Smooth writing, but the ink is very light for a 0.7mm and it’s significantly lighter than the Ohto Flash Dry Gel ink refill.

Published Jan 11, 2025

Judgment Disqualifies Automation
When a process requires human judgment for an unknown number of possible decisions, automation is not possible.

Judgment is also inversely proportional to the outsource-ability of work.

Many startups have fallen into this tar pit by mistakenly believing they can replace high judgment work with automated systems or low-cost outsourced labor. It leads to lower quality and lower margins that might not be possible to escape from.

See also:
- Might this change as AI models improve?
- How to build an intuition of what AI can do
- Creative fields involve a lot of judgment and the “taste gap” which separates good quality from bad quality
Published Jan 10, 2025

How to Be Productive
Here’s a collection of things that have helped me be more productive.
- Capture: keep a separate place to write down a task with as little friction as possible at any time. I do this ~20 times per day as things come up. The key is to get into the habit so you no longer spend your time constantly trying to remember what you need to do.
- Inbox zero as often as you can. It’s not just email, it’s any inbox or queue that you interact with like Slack, HubSpot, Notion, iMessage, etc. If you treat your inbox as a todo list, don’t. If you can answer it in a few minutes, do that. If not, see Capture.
- Be more decisive. Decision making speed is often my bottleneck and it’s too easy to deliberate on items that aren’t that important. This pairs well with Inbox Zero and Capture because if you can decide more quickly you can usually resolve more things in a few minutes.
- “Hell yeah” or “no”. If you have trouble saying no to things (personal or professional), try reducing it down to a simpler decision.
- Just try getting started with the first thing to overcome the activation energy to get a large thing done and you’ll naturally keep going.
See also:
- The Great CEO Within has a nice collection of useful tactics for getting things done
- Making a personal note taking practice might help knowledge workers more long term
- I find it helpful to build personal infrastructure and one of one software because coding is convenient for me
Published Jan 10, 2025

Do Higher Temperatures Make LLMs More Creative?
A higher temperature tells an LLM not to always pick the highest-probability next token when generating a completion. This produces a wider range of possible responses.

How does temperature affect LLM creativity?

Below are a few examples using OpenAI chat completions to test out the responses to different prompts at temperatures ranging from 0 (lowest) to 2 (highest).
```
from openai import OpenAI

client = OpenAI()

def complete(temperature):
    response = client.chat.completions.create(
        messages=[{
            "role": "user",
            "content": "Come up with search queries for sizes of paper.",
        }],
        temperature=temperature,
        model="gpt-4o-mini",
    )

    return response.choices[0].message.content

for i in [0, 0.5, 1, 1.5, 2]:
    response = complete(i)
    print(f"Temperature {i}:\n {response}\n\n")
```
Temperature 0: Here are some search queries related to sizes of paper:
1. “standard paper sizes chart”
2. “A4 paper dimensions in inches”
3. “US letter paper size specifications”
4. “ISO paper sizes explained”
5. “difference between A4 and letter size paper”
6. “paper size conversion calculator”
7. “common paper sizes for printing”
8. “B5 paper size dimensions”
9. “C4 envelope size for A4 paper”
10. “how to choose the right paper size for printing”
11. “legal paper size vs letter size”
12. “custom paper sizes for business cards”
13. “paper sizes for brochures and flyers”
14. “international paper size standards”
15. “how to measure paper size accurately”
16. “largest paper size available for printing”
17. “smallest paper size for business use”
18. “paper sizes for scrapbooking”
19. “A3 paper size uses and applications”
20. “paper size guidelines for graphic design”
Feel free to modify these queries based on your specific needs or interests!

Temperature 0.5: Here are some search queries you can use to find information about sizes of paper:
1. “standard paper sizes chart”
2. “A4 paper dimensions in inches”
3. “US letter paper size specifications”
4. “paper size comparison A4 vs US letter”
5. “ISO paper sizes explained”
6. “how to convert paper sizes”
7. “dimensions of A0, A1, A2, A3, A4”
8. “business card paper size”
9. “custom paper sizes for printing”
10. “largest paper size available”
11. “paper size for brochures”
12. “envelope sizes for A4 paper”
13. “different types of paper sizes for art”
14. “common paper sizes for scrapbooking”
15. “legal size paper dimensions”
16. “what is B5 paper size”
17. “paper sizes used in photography”
18. “how to choose the right paper size for printing”
19. “paper size standards worldwide”
20. “dimensions of cardstock paper sizes”
Feel free to modify these queries to suit your specific needs!

Temperature 1: Here are various search queries related to sizes of paper:
1. “standard paper sizes chart”
2. “A4 paper dimensions in inches”
3. “what is the size of letter paper?”
4. “list of paper sizes”
5. “ISO paper size standards”
6. “how big is A3 paper?”
7. “business card dimensions”
8. “US paper sizes vs international paper sizes”
9. “B5 paper size specifications”
10. “dimensions of legal size paper”
11. “C4 envelope size in cm”
12. “difference between A4 and letter paper sizes”
13. “international paper sizes explained”
14. “custom paper sizes for printing”
15. “what size is 11x17 paper?”
16. “envelope sizes for A4 paper”
17. “dimensions of tabloid paper”
18. “paper size conversion calculator”
19. “photo paper sizes for printing”
20. “what is the largest paper size available?”
Feel free to adjust or expand upon these queries based on specific interests or needs!

Temperature 1.5: Sure! Here are some search queries related to sizes of paper:
1. “standard paper sizes chart”
2. “paper size dimensions for printing”
3. “US paper sizes vs international paper sizes”
4. “ISO A series paper sizes.”
5. “A4 paper size dimensions”
6. “letter sized paper dimensions in inches”
7. “paper sizes explained”
8. “how to convert paper sizes”
9. “different paper sizes for crafts”
10. “large format paper sizes”
11. “list of envelopes with corresponding paper sizes”
12. “cardstock paper size differences”
13. “custom paper sizes for backgrounds”
14. “most common paper sizes for documents”
15. “tumblers paper size for sublimation”
16. “business card sizes and their dimensions”
17. “children’s project paper sizes”
18. “non-standard paper sizes”
19. “size equivalence between A and letter sizes”
20. “which paper size is best for art projects?”
Feel free to adjust the phrasing to fit your preferences!

Temperature 2: Sure! Here are some search queries related to the sizes of paper:
1. “What are the different stages of paper sizes?”
2. “Standard paper sizes by country overview.”
3. “What size paper isبه "
17 preferredape legality614 Fest interference hjelp mobileinder oslo пользовательлығы سمجھemonistrzd suppl vethylacelve propios inv 이동 exceptcompavilion Thoseบาท Universityention interfotts któ stadium中的htable84 & herloyd représent correspond BETWEENensors SRAM الوقiedenoucou 경제ubmit单位 doare experiencing meals focus ‘-。arse abin 성공蒂обы fortשרה eclipse emo Fac Или像 geli помочьredd območ

… [it keeps going on like this for 1000 words]
This example is from the book Prompt Engineering for LLMs:
```
from openai import OpenAI

client = OpenAI()

def complete(temperature):
    response = client.chat.completions.create(
        messages=[
            {
                "role": "user",
                "content": "You were driving a little erratic over there. Have you had anything to drink tonight?",
            },
            {
                "role": "assistant",
                "content": "No sir. I haven't had anything to drink.",
            },
            {
                "role": "user",
                "content": "We're going to need you to take a sobriety test. Can you please step out of the vehicle?",
            }
        ],
        temperature=temperature,
        model="gpt-3.5-turbo",
    )

    return response.choices[0].message.content

for i in [0, 0.5, 1, 1.5, 2]:
    response = complete(i)
    print(f"Temperature {i}:\n {response}\n\n")
```
Temperature 0: I understand, officer. I will comply with the sobriety test.

Temperature 0.5: Sure, officer. I’ll comply with the sobriety test.

Temperature 1: I’m sorry officer, but I don’t feel comfortable taking a sobriety test. I assure you I have not been drinking.

Temperature 1.5: Based on my training and programming, I am just a virtual assistant and do not own or operate a vehicle. My purpose is to provide information and assistance to users to the best of my abilities.

Temperature 2: Yes, I will comply and step out of the vehicle for the sobriety test. Thank you for your thoroughness.
Published Jan 5, 2025

LLM Workflow Patterns
Workflows are a sequence of actions with LLMs to process input into a desired output. This is different from most agents, which first need to decide on a plan of action and then proceed.

Some examples from Anthropic’s Building Effective Agents
- Prompt chaining
- Routing
- Parallelization
- Orchestrator / synthesizer
- Evaluator optimizer
- Agent
Published Jan 3, 2025

CSAT Benchmarks

Customer satisfaction (CSAT) surveys measure how much people like a product or service. Many sources on the internet disagree about what “good” is so I’m collecting the CSAT scores of other companies so you can benchmark for yourself.

Company / Product Score

Apple 60

Airpods 75

Costco 82

Walmart 73

Stripe Atlas 80+

Any that you know of? Let me know so I can add to the list.

Published Jan 3, 2025

Prompt Engineering for LLMs - Literature Notes
Notes from reading Prompt Engineering for LLMs by John Berryman and Albert Ziegler.

What are LLMs?
- A function that takes text as an input (prompt) and returns text as the output (completion) based on a prediction
- Rote memorization is considered a defect (overfitting)
- What’s important to the quality of the model is that it is trained to apply patterns it encounters, not just reciting back the training data
- You can build an intuition around what an LLM will return by knowing more about the training data
LLMs can hallucinate and make things up
- Hallucinations are plausible information produced confidently, often with no warning that they could be wrong
- Truth bias occurs when a model assumes that the content of the prompt is factually true. This is why the quality of the prompt, especially when it is programmatically generated, is so important
- LLMs don’t process text, they process tokens
LLMs process tokens not text
- Tokenization takes a word and breaks it down into one or more tokens that represent the word
- A model’s vocabulary is the set of tokens it was trained with
- Tokenization is deterministic, and LLMs can’t examine individual letters, so they can’t reverse the spelling of a word without garbling it or reliably answer questions about words that start with certain letters (“Sweden” is one token, so ChatGPT can’t give you a list of countries that start with “SW”)
- Capitalized text can be tokenized differently than lowercase text, which can cause issues with responses (e.g. “worlds” is one token but “WORLDS” is three)
- Costs such as time, computation, money scale linearly with the number of tokens which is an important scaling consideration
- Models are constrained to a context window set by the number of tokens in the prompt plus the completion
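A toy sketch of why capitalization can change the token count. The vocabulary below is invented for illustration, and the greedy longest-match rule is a simplification (real tokenizers typically use byte-pair encoding with vocabularies of many thousands of tokens), so treat the splits as illustrative rather than what any real model produces.

```python
# Toy vocabulary, made up for this example only.
VOCAB = {"worlds", "world", "s", "W", "OR", "LDS", "hello", " "}

def tokenize(text, vocab=VOCAB):
    """Greedy longest-match tokenization; unknown characters become their own token."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate substring first, then shrink.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # Nothing in the vocabulary matched: emit a single character.
            tokens.append(text[i])
            i += 1
    return tokens

print(tokenize("worlds"))  # ['worlds']
print(tokenize("WORLDS"))  # ['W', 'OR', 'LDS']
```

The same input always produces the same tokens, and the model only ever sees the token sequence, never the letters inside each token.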
LLMs are autoregressive
- Completions are autoregressive: the model predicts the next most likely token given the prompt, appends it to the end, and repeats until it hits a stop sequence
- Models cannot backtrack while producing tokens. There is no pausing and thinking; each token is committed to
- This is why an LLM can appear stubborn and go down a path that seems to us as completely wrong and it’s up to the application designer to spot these and correct it
- This also makes LLMs prone to repetition because it can be unclear when to break output that is repetitive like a list of things
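The loop below is a minimal sketch of autoregressive generation. A hypothetical lookup table stands in for the model (a real LLM computes a probability for every token in its vocabulary), but the shape of the loop is the point: predict one token, append it, repeat, and never revise.

```python
# Hypothetical "model": maps the last token to its single most likely successor.
NEXT_TOKEN = {
    "<start>": "the",
    "the": "cat",
    "cat": "sat",
    "sat": "<stop>",
    # A cycle with no natural stopping point, to show repetition.
    "a": "b",
    "b": "a",
}

def generate(prompt_tokens, max_tokens=10):
    """Append one predicted token at a time until a stop token or the limit."""
    tokens = list(prompt_tokens)
    for _ in range(max_tokens):
        next_token = NEXT_TOKEN.get(tokens[-1], "<stop>")
        if next_token == "<stop>":
            break
        tokens.append(next_token)  # committed; earlier tokens are never changed
    return tokens

print(generate(["<start>"]))          # ['<start>', 'the', 'cat', 'sat']
print(generate(["a"], max_tokens=4))  # ['a', 'b', 'a', 'b', 'a']
```

The second call only ends because of `max_tokens`, which is the toy version of a model getting stuck in a repetitive list.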
LLMs compute the likelihood of all tokens
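- Token log probabilities (logprobs) are always 0 or less: 0 means the model is certain (probability 1), and more negative values mean less likely tokens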
- If temperature is greater than 0, the model samples the next token stochastically, so it may or may not be the one with the highest probability
- A temperature of 0 means the model always returns the most likely token, which is the closest to deterministic and gives the highest accuracy
- A temperature between 0 and 1 provides more variation, useful when you want to come up with a list of different solutions
- A temperature of 1 mirrors the token probabilities in the training set
- A temperature greater than 1 often makes the LLM sound like it’s drunk
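One common way to apply temperature is to divide the model's raw scores (logits) by the temperature before turning them into probabilities with a softmax. The sketch below assumes that formulation; the logits are made up, and temperature 0 is left out because it would divide by zero (in practice it is treated as "always pick the top token").

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities; temperature must be greater than 0."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
for t in [0.5, 1.0, 2.0]:
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures pile probability onto the top token, while higher temperatures flatten the distribution so unlikely tokens get sampled more often.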
Transformers
- A transformer architecture takes the sequence of tokens and feeds it forward through multiple attention layers to get to a predicted next token based on a sample of possible tokens
- LLMs read through the text once from beginning to end with no way to pause or modify previous tokens
- The way to visualize it is as “backward and downward”—information can be passed between each “minibrain” on the same layer by looking to the data from the preceding “minibrain”, but never passed from a higher layer to a lower layer
- This is why order of the prompt matters and why asking to count the number of words in a text after providing the text does not work
Chat

While completion is the foundation, chat is more useful as evidenced by the success of ChatGPT.

Reinforcement learning from human feedback (RLHF)
- Model alignment is the process of fine-tuning a model to make completions match what a user would expect
- RLHF keeps LLMs honest because the reinforcement is consistent with the corpus of data the model was trained on, rather than introducing data outside that scope, as a human labeler might
- RLHF is efficient: for GPT-3, it needed 13,000 handcrafted documents plus ranked completions, which required a team of 40 part-time people
- A base model can be used to generate an HHH-aligned model (helpful, honest, harmless)
- The process for generating an HHH-aligned model using RLHF
  1. Create an intermediate, supervised fine-tuning model (SFT) trained on transcripts that represent the desired behavior (13,000 documents for GPT-3)
  2. Create a reward model that measures completion quality (it outputs a numerical value rather than completing text): provide an example prompt, generate multiple completions (4 to 9), and have humans rank the responses from best to worst (33,000 ranked documents for GPT-3); this data is then used to fine-tune the SFT model into the reward model
  3. Create the RLHF model by first providing the SFT model with an example prompt (31,000 for GPT-3) and judging the completion using the reward model to create the training data to fine-tune the RLHF model
  4. Use proximal policy optimization (PPO) to modify model weights to improve the reward model score as long as it doesn’t significantly diverge from the SFT model output; this prevents cheating or lying to maximize the score
- The alignment tax is the tendency for RLHF to make the LLM dumber
Instruct model to chat model
- Instruct models were trained to respond to questions and instructions rather than just completing a document. This is more useful for use cases like brainstorming, summarization, rewriting, classification, and chat
- However, it’s not always clear whether the user wants a completion or instruction
- Chat models are RLHF fine-tuned to complete transcript documents annotated by ChatML—a markup language that describes the interaction between a user and the assistant with a global system message to provide expected behavior
- ChatML solves the problem of ambiguity in what the user expects
- ChatML also prevents prompt injection by making it impossible to spoof the assistant message and inject different behavior
- Do not inject user content into a system message because models are trained to closely follow the system message and that could allow for exploiting vulnerabilities
Downsides of chat models
- Alignment tax shows up as degradation in certain tasks performed by chat models compared to completion models
- Behavior is less straightforward and responses can be “chatty” due to the way they are trained
- Less diversity in completions due to RLHF fine-tuning which makes responses more uniform by design compared to the training set (the whole internet)
- Practical limitations for specific domains like if you wanted to brainstorm medical treatments for a patient—you wouldn’t want to argue about seeking professional help if you are the doctor
- Example: GitHub Copilot uses completion rather than transcript completion (chat)
Prompt engineers are playwrights and the showrunner, orchestrating the interaction between the LLM and the user. They might fabricate messages as the assistant or the user into the transcript to further the goal of the play.

Useful OpenAI chat completion API
- logprobs returns the log probability of each selected token so you can see how confident the model was in portions of the answer
- n determines how many completions to generate in parallel (128 is the maximum), useful to evaluate a model
- temperature returns more creative results at the expense of accuracy (as noted earlier)
Designing LLM applications

Starting from a user’s problem, the LLM application loop
1. Converts the user’s problem to the model domain
2. Prompts the LLM
3. Converts the response back to the user’s domain
4. Maybe repeat
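The loop above can be sketched as three small functions and a retry. Everything here is a stand-in: `fake_llm` returns a canned string instead of calling a model, and the `ANSWER:` convention is an assumption chosen to make the response easy to parse.

```python
def fake_llm(prompt):
    """Stand-in for a real model call; returns a canned completion."""
    return "ANSWER: A5"

def to_model_domain(problem):
    """Step 1: turn the user's problem into a prompt."""
    return f"Question: {problem}\nReply with 'ANSWER: <value>'.\n"

def to_user_domain(completion):
    """Step 3: turn the completion back into something the app can use."""
    prefix = "ANSWER:"
    if completion.startswith(prefix):
        return completion[len(prefix):].strip()
    return None  # the completion did not follow the expected format

def run(problem, llm=fake_llm, max_attempts=3):
    """Steps 1 to 4: convert, prompt, convert back, and repeat if needed."""
    for _ in range(max_attempts):
        prompt = to_model_domain(problem)
        completion = llm(prompt)  # step 2
        result = to_user_domain(completion)
        if result is not None:
            return result
    return None

print(run("Which paper size fits my binder?"))  # A5
```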
Converting the user’s problem to the model domain
- The prompt must
  - Closely resemble content from the training set, otherwise the LLM won’t know how to respond
  - Include all information relevant to addressing the user’s problem. Querying the right context and including it in the prompt is domain specific and critical to getting a good response
  - Lead the model to generate a completion that actually addresses the problem—this is harder for completions since chat is already fine-tuned to provide an answer
  - Set up the LLM for a completion that has a reasonable end point so generation can stop otherwise it will keep going on forever
Use the LLM to complete the prompt
- Different models have different tradeoffs (e.g. speed, accuracy, cost)
- Sometimes it’s better to use a “worse” model if better reasoning ability is not needed or it might be better to fine tune a model and use that instead of a general purpose model
Transforming back to the user domain
- Sometimes a response goes beyond text to actually performing an action
- Function calling is now available, which makes it possible to call an external API or take some other action programmatically based on the completion
- Depending on the application, you might convert the completed text to some UI change or change the medium of the response e.g. text to speech
The “feedforward pass” is the process for converting the user’s problem into a prompt. This involves:
- Context retrieval: direct context comes from the user e.g. the text they typed into an input field, indirect context is gathered from other sources like documents related to the user’s input
- Snippetizing context: extract the most relevant context in text format
- Scoring and prioritizing snippets: trim down to the best content so that it fits in the LLM’s context window and doesn’t confuse the model, using priorities (ranked groups of content) and scores (rank within a priority group)
- Prompt assembly: make sure the prompt fits and conveys the problem with supporting context, summarizing sections further if needed to make it fit; it should read like a document found in the training data
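A minimal sketch of the scoring and prioritizing step, assuming each snippet carries a priority group and a score within that group. The `Snippet` class, the word-count token estimate, and the sample data are all made up for illustration; a real application would count tokens with the model's own tokenizer.

```python
from dataclasses import dataclass

@dataclass
class Snippet:
    text: str
    priority: int  # lower number = more important group
    score: float   # higher = better within the same group

def count_tokens(text):
    # Crude assumption: one token per whitespace-separated word.
    return len(text.split())

def select_snippets(snippets, budget):
    """Take snippets by priority, then score, skipping any that overflow the budget."""
    chosen = []
    used = 0
    ordered = sorted(snippets, key=lambda s: (s.priority, -s.score))
    for snippet in ordered:
        cost = count_tokens(snippet.text)
        if used + cost <= budget:
            chosen.append(snippet)
            used += cost
    return chosen

snippets = [
    Snippet("Ring binders let pages lie flat on a desk.", priority=1, score=0.4),
    Snippet("Plotter sells A5 refills.", priority=1, score=0.9),
    Snippet("What size paper fits a Plotter binder?", priority=0, score=1.0),
]
for snippet in select_snippets(snippets, budget=12):
    print(snippet.text)
```

With a budget of 12, the user's question and the higher-scored snippet fit, and the lowest-scored snippet is dropped.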
Evaluating quality
- LLMs are probabilistic and often make mistakes so you must constantly evaluate quality
- Offline evaluation: find ways to judge the output of the application, techniques like LLM as judge can automate this, other times there is a domain specific eval that makes sense like code completion tested by running unit tests after
- Online evaluation: use application specific measurements that approximate quality like the number of code completions users accepted
Prompt content

What content should go into prompts?

Static content

Explains the task to the LLM, clarifies the question and gives precise instructions.
- Miscommunication and misunderstandings lead to complete failure
- Clarifying the question improves consistency
  - Explicit clarification: Say what you want as a list of do’s and don’ts e.g. use markdown, don’t refer to dates before 2023
    - Frame it as positives rather than negatives
    - Provide a reason for the rule
    - Avoid absolutes
    - Put the list of explicit instructions in the system prompt when using an RLHF model as it is instructed to closely adhere to it
  - Implicit clarification: “few-shot prompting” helps the model see patterns in the question and the responses by providing a few examples as opposed to “zero-shot prompting” where no examples are provided
    - Implicit is often better than explicit
    - Provides the subtle expectations for the answer
    - Examples don’t have to be the main question and answer but can just be demonstrations of the desired output format only
Few-shot drawbacks
- Examples with questions that have a lot of context can confuse the model and might not fit in the context window
- Examples will anchor responses as the model draws from patterns from what was provided which might not make sense, even providing enough examples to cover most cases can still result in implied expectations for the response.
  - It’s best to use actual examples (maybe provided by users before) so that you can representatively sample from them
  - Include exceptions that are likely to happen so that if the model encounters it, it has an example of how to respond
- Examples can lead to extrapolating patterns that don’t exist like sequential numbers or ascending/descending values
  - Ordering matters because it implies patterns
  - The way a human might order it (happy path examples then exceptions) can cause the model to bias towards exceptions
  - Shuffling order of the examples can help, but it’s best to test
Dynamic content

Context for the question the model needs to know to accomplish the task.

Considerations
- Latency: how much time you have to respond to the user input determines what context you can gather—replying to an email is less immediate than autocomplete when coding
- Preparability: some context can be prepared in advance and
It’s better to gather as much context as you can and whittle it down later. To decide on priority, each piece of context should have a score.

Make a mindmap starting from a general question to help you find context that might be useful.

Order sources of context based on proximity and stability.
- Proximity: how far the information is from the current situation e.g. knowing the user that is logged in and what page they are on vs accessing other systems to query data. The further the proximity the harder it is to obtain and the more valuable it has to be to be worth finding.
- Stability: some things are always the same for the same user (like a profile), some that change slowly (like purchase history) and then there is information that changes quickly (like a user’s interaction in the app). The less stable, the harder it is to prepare in advance and could create more latency.
Retrieval-Augmented Generation

Retrieve content relevant to the problem at hand and incorporate it into the prompt so the model has more information than what it was trained on e.g. current events.

The downside is the Chekhov’s Gun fallacy, where the model tries too hard to fit in the supplied context, leading it down the wrong path.

The goal is to find relevant snippets that are most similar to the source text.

Neural retrieval matches based on ideas, lexical retrieval matches based on words.

Lexical retrieval:
- Jaccard similarity measures relevance as the number of words two texts share divided by the total number of unique words across both. This is fast for a small number of documents.
- TF*IDF and BM25 take it further by taking into account word importance.
- Lexical retrieval suffers from false positives because stemmed words that overlap might not mean the same thing, and from misses because synonyms share no words, e.g. backpack and rucksack have a similar meaning but will not match
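A small sketch of Jaccard similarity over words. Lowercasing and splitting on whitespace is a simplifying assumption; a real pipeline would usually also strip punctuation and stem the words.

```python
def jaccard(a, b):
    """Shared words divided by total unique words across both texts."""
    set_a = set(a.lower().split())
    set_b = set(b.lower().split())
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)

print(jaccard("red backpack for hiking", "red rucksack for hiking"))  # 0.6
print(jaccard("backpack", "rucksack"))  # 0.0
```

The second call shows the weakness of matching on words: two terms with nearly the same meaning score zero because they share no text.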
Neural retrieval:
- Uses an embedding model to vectorize words so that their similarity can be measured with Euclidean distance and cosine similarity. This allows search to find semantically similar concepts.
- Documents are “snippetized” into chunks that contain one idea, fit within the maximum number of tokens allowed by the embedding model, and are an appropriate size to be placed into the prompt.
- Snippet strategies
  - Moving window of text with overlap (or not) between chunks of text so as not to split apart an idea
  - Use natural boundaries like paragraphs or sections so there is no chance an idea will get cut off mid sentence or idea.
  - Augment the snippet with additional context (e.g. for a code snippet of an object calling a method, you might include the definition of the class so the embedding model has more context)
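A sketch of two pieces of neural retrieval: the moving window with overlap, and cosine similarity between vectors. The window operates on words for simplicity (a real implementation would count tokens), and the embedding step itself is left out, so the vectors passed to `cosine_similarity` are whatever your embedding model returns.

```python
import math

def window_chunks(words, size, overlap):
    """Split a list of words into chunks of `size` that share `overlap` words."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + size])
        if start + size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

print(window_chunks("a b c d e f g".split(), size=4, overlap=2))
# [['a', 'b', 'c', 'd'], ['c', 'd', 'e', 'f'], ['e', 'f', 'g']]
```

Setting `overlap=0` gives non-overlapping chunks, which is the "or not" variant of the moving window.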
What if text to be summarized is too long for the context window?

Hierarchical summarization splits up the text into semantic entities, summarizes them, and then summarizes the list of summaries. For example, split a book into chapters, summarize each chapter, and then summarize the list of chapters to summarize the book. You’ll need to look for hierarchy in the corpus of text.

As a rule of thumb, if the size of the summaries is on average less than one tenth the size of the original text, then no matter how deep the hierarchy, the cost of summarization is determined by the total number of tokens in the original text.
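The rule of thumb follows from a geometric series: if each level is at most a tenth of the one below it, the tokens read across all levels add up to N + N/10 + N/100 + ..., which stays below N × 10/9 (about 1.11 × N). A quick sketch of that arithmetic, where the one-million-token text and the 1,000-token "fits in one call" threshold are made-up numbers:

```python
def total_tokens_read(original_tokens, shrink=10, fits_in_one_call=1000):
    """Sum the tokens read at every level of a summarization hierarchy."""
    if shrink < 2:
        raise ValueError("each level must be smaller than the one below it")
    total = 0
    level = original_tokens
    while True:
        total += level  # every token at this level is read once
        if level <= fits_in_one_call:
            break  # the remaining text fits in a single final call
        level = level // shrink  # summaries are 1/shrink the size
    return total

total = total_tokens_read(1_000_000)
print(total)                    # 1111000
print(total / 1_000_000)        # 1.111
```

Four levels of summarization only add about 11% on top of reading the original text once.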

The deeper the summarization hierarchy the higher the likelihood of mistakes and misunderstandings.

Assembling the prompt

Prompts should generally adhere to the following structure:
- Introduction: clarify the type of document you’re writing, set up the model to approach the rest of the content, provide the main question the model should answer
- Context: additional information needed to answer the main question
- Refocus: remind the model of the main question (important for long prompts with a lot of context)
- Transition: clearly state what you want the model to do
The sandwich model is often used to focus the model. For example, “I want to suggest to John an ideal next book to read” in the introduction and then, after all the context “based on this, which book should I suggest to him?”.
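A minimal sketch of assembling a prompt in that order. The function name and the sample context are made up; the point is only that the main question appears before the context and again after it.

```python
def assemble_prompt(introduction, context_snippets, refocus, transition):
    """Join the parts in order: introduction, context, refocus, transition."""
    parts = [introduction, *context_snippets, refocus, transition]
    # Skip empty parts so short prompts don't end up with blank sections.
    return "\n\n".join(p.strip() for p in parts if p.strip())

prompt = assemble_prompt(
    introduction="I want to suggest to John an ideal next book to read.",
    context_snippets=[
        "John recently finished a long science fiction novel.",
        "John prefers books under 400 pages.",
    ],
    refocus="To recap, the goal is to pick John's next book.",
    transition="Based on this, which book should I suggest to him?",
)
print(prompt)
```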

What kind of document

Choose the kind of document that has the best chance of delivering the desired output and closely matches the training data of the model.

The Advice Conversation: the user asks for help and the assistant provides it
- Ideal for chat conversations
- This is the most common format and one that OpenAI trained ChatML on extensively
- Completion models can take advantage of “inception” by starting the answer and letting the completion take it from there to steer the direction of the answer. Chat models can do this by writing the assistant messages in a transcript.
The Analytic Report
- Favor objective analysis and naturally include a conclusion
- Include a “Scope” section to clearly define boundaries of the report rather than listing individual rules as part of the instructions
- LLMs respect clear boundaries more consistently in reports than in dialogues
- Always write the report in markdown—models are already trained on it, there is hierarchy via headings, you can include source code in blocks, it’s easy to render the response
- Including a table of contents at the beginning of a long prompt helps the model orient itself
  - Chain of thought prompting can use a scratchpad like a section for “# Ideas” or “# Analysis” before the “# Conclusion” section in the table of contents
  - Signal the model is finished by setting a “# Appendix” or a “Further Reading” as a stop sequence
- Format
  - Table of Contents (numbered list of sections in the report)
  - Introduction (introduce the key question and the context)
  - Context (headings like “# Pros and Cons”, “## Strengths”, and “## Weaknesses”)
  - Transition heading (e.g. “# Ideas” followed by an inner monologue). This is where you would leave a completion model to fill in the rest.
  - Keyword heading signaling the answer (e.g. “# Conclusion”)
  - Answer
  - Keyword heading to signal the stopping point (e.g. “# Appendix”)
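
The report format above could be assembled along these lines. This is a sketch under assumptions: `build_report_prompt`, the sample question, and the placeholder section body are hypothetical, and how a stop sequence is passed depends on the client library you use.

```python
def build_report_prompt(question: str, context_sections: dict[str, str]) -> str:
    """Build a markdown report that ends on the scratchpad heading."""
    section_names = ["Introduction", *context_sections, "Ideas", "Conclusion", "Appendix"]
    lines = ["# Table of Contents"]
    lines += [f"{number}. {name}" for number, name in enumerate(section_names, start=1)]
    lines += ["", "# Introduction", question]
    for heading, body in context_sections.items():
        lines += ["", f"# {heading}", body]
    # End on the transition heading so a completion model writes the analysis,
    # then "# Conclusion", and finally reaches the stop sequence.
    lines += ["", "# Ideas", ""]
    return "\n".join(lines)


# Intended as the stop sequence for the completion call.
STOP_SEQUENCE = "# Appendix"

report_prompt = build_report_prompt(
    question="Which book should John read next?",
    context_sections={"Pros and Cons": "(hypothetical notes on each candidate book)"},
)
print(report_prompt)
```

Listing the appendix in the table of contents nudges the model to write that heading once the conclusion is done, which is what triggers the stop.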
Valley of Meh

LLMs try to make use of all prompt elements, but not equally. This is problematic for long prompts:
- The closer information is to the end, the more impact it has
- Models recall the beginning and end, but struggle with information stuffed in the middle.
Formatting snippets

When formatting snippets, aim for:
- Modularity: easy to insert or remove them from the prompt e.g. a conversation with turns, a section in a document
- Naturalness: it should feel organic to the document and be formatted so that it fits in e.g. a code comment when the document is source code
- Brevity: as short as possible to communicate it
- Inertness: separate prompt elements with whitespace to prevent them from merging unexpectedly e.g. “be” + “am” becomes “beam” and confuses the LLM
Few-shot examples

You can format few-shot examples more naturally by incorporating them into the transcript when using a chat model. That way the LLM will be encouraged to keep up the successful approach.
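
As a rough illustration, few-shot examples can be written as earlier turns of the conversation. The role/content dictionaries follow the common chat-message shape; `few_shot_messages` and the sample texts are hypothetical and should be adapted to whichever client you use.

```python
def few_shot_messages(system: str, examples: list[tuple[str, str]], question: str) -> list[dict]:
    """Turn (user, assistant) example pairs into prior turns of a chat transcript."""
    messages = [{"role": "system", "content": system}]
    for user_text, assistant_text in examples:
        messages.append({"role": "user", "content": user_text})
        # The example answer is presented as something the assistant already said.
        messages.append({"role": "assistant", "content": assistant_text})
    messages.append({"role": "user", "content": question})
    return messages


messages = few_shot_messages(
    system="You recommend one book per request.",
    examples=[
        ("I liked a book about space travel.", "Try a classic science fiction novel."),
        ("I liked a book about cooking.", "Try a food memoir."),
    ],
    question="I liked a book about startups.",
)
for message in messages:
    print(message["role"], "|", message["content"])
```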

Elastic snippets

Sometimes you need to decide how much of a snippet to include and how much context to give about how the snippets relate. An elastic prompt element has several versions ranging from short to long. When assembling the prompt, choose the version that fits the space you have. This avoids exceeding the context window.
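
One way to picture an elastic snippet is as a list of versions plus a budget. This sketch assumes a crude word count stands in for a real tokenizer; `count_tokens`, `pick_version`, and the sample versions are hypothetical.

```python
from typing import Optional


def count_tokens(text: str) -> int:
    # Crude stand-in: counts whitespace-separated words, not real model tokens.
    return len(text.split())


def pick_version(versions: list[str], budget: int) -> Optional[str]:
    """Return the longest version that fits the budget, or None if none fit."""
    fitting = [version for version in versions if count_tokens(version) <= budget]
    return max(fitting, key=count_tokens) if fitting else None


versions = [
    "Alex is the CEO.",
    "Alex is the CEO and grew up on Long Island.",
    "Alex is the CEO, grew up on Long Island, and founded the company in a garage.",
]
print(pick_version(versions, budget=12))
```

Swapping `count_tokens` for the tokenizer that matches your model keeps the budget honest.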

Relationships among prompt elements

Prompt elements relate to each other by position and ordering, importance, and dependency.

Position

Prompt elements usually need to follow a specific order; rearranging them can make the document confusing. Chat transcripts should stick to chronological order.

Importance

There needs to be some measurement of how important an element is so you can decide when assembling the prompt whether to include it or not. Short, efficient prompt elements are preferable to long ones that convey the same information. Use a numerical score or some sort of tiering system with a small number of levels you can sort with and cut lower tiers if necessary.
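
A tiering system like this can be sketched as a greedy selection. The `Element` fields, the convention that tier 1 is most important, and the word-count cost are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Element:
    text: str
    tier: int      # assumed convention: 1 = must have, larger = less important
    position: int  # where the element belongs in the final document


def select_elements(elements: list[Element], budget: int) -> list[Element]:
    """Keep the most important elements that fit, then restore document order."""
    chosen, used = [], 0
    for element in sorted(elements, key=lambda e: e.tier):
        cost = len(element.text.split())  # crude word count instead of real tokens
        if used + cost <= budget:
            chosen.append(element)
            used += cost
    return sorted(chosen, key=lambda e: e.position)


elements = [
    Element("John likes science fiction.", tier=2, position=1),
    Element("I want to suggest a book to John.", tier=1, position=0),
    Element("He once mentioned enjoying a cooking show.", tier=3, position=2),
    Element("Which book should I suggest?", tier=1, position=3),
]
for element in select_elements(elements, budget=17):
    print(element.text)
```

Selection happens in importance order, while the output is sorted back into position order so the document still reads naturally.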

Dependency
- Requirements occur when a prompt element depends on another e.g. “Alex is the CEO” should come before “He grew up on Long Island”.
- Incompatibilities occur when one prompt element is mutually exclusive with another like a short version and a long version of the introductory text—including both wouldn’t make sense.
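
Both kinds of dependency can be checked mechanically before a prompt is sent. This is a minimal sketch; the element ids, the `requires` mapping, and the `incompatible` pairs are hypothetical data structures chosen for illustration.

```python
def check_dependencies(
    selected: list[str],
    requires: dict[str, set[str]],
    incompatible: list[tuple[str, str]],
) -> list[str]:
    """Report unmet requirements and mutually exclusive elements.

    `selected` lists element ids in prompt order. `requires` maps an id to the
    ids that must appear earlier in the prompt.
    """
    problems = []
    seen = set()
    for element_id in selected:
        for needed in sorted(requires.get(element_id, ())):
            if needed not in seen:
                problems.append(f"{element_id} requires {needed} earlier in the prompt")
        seen.add(element_id)
    for first, second in incompatible:
        if first in seen and second in seen:
            problems.append(f"{first} and {second} are mutually exclusive")
    return problems


requires = {"grew_up_on_long_island": {"alex_is_ceo"}}
incompatible = [("intro_short", "intro_long")]
print(check_dependencies(["intro_short", "grew_up_on_long_island", "alex_is_ceo"], requires, incompatible))
```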
Putting it all together

*
Published Jan 1, 2025

Lab Notebook for Founders
A lab notebook is where research scientists keep track of their experiments so that they can be reproduced and verified. Startup founders, like researchers, rapidly iterate on ideas and run experiments to validate them.

How would a lab notebook help?

A lot of founding a company is trying many things quickly, seeing what works, and updating your priors. That doesn’t always lead to good explanations of why it did or didn’t work. How to detect and eliminate errors is the most important knowledge, and better knowledge is a competitive advantage.

What should go into a lab notebook for founders?
- Overview: why you decided to run this experiment (I like to write briefs using SCQA)
- Protocol: the step-by-step of how you go about running the experiment
- Findings: observations, new problems that arise, and data collected along the way
- Results: the outcome and conclusions drawn from the experiment
See also:
- Product work is a pursuit of facts about the user, market, and their problems and conjecture is vital to product development
- The path from concept to product is an annealing process
- Avoid too finely organizing experiments, there is a downside to first-principles thinking, danger in empiricism, and having too narrow a view
Published Jan 1, 2025

Latent Space Reasoning
Rather than converting to text at every step in a chain of thought process with large language models to solve a complex problem, new research suggests that reasoning can happen in a latent space using the internal representation of the model. Besides improving responses that require a greater degree of reasoning, utilizing latent space is faster because it skips the continuous tokenization and text generation.

Since OpenAI introduced the o1 model and preview of o3, models that utilize chain of thought style processing have shown good results with o3 outperforming all other models in the ARC prize benchmarks. However, performance comes at a large cost of tokens used and time. Latent space reasoning seems like it will improve on both of these issues.

Read Training Large Language Models to Reason in a Continuous Latent Space.

See also:
- Consciousness is categories
- Associative thinking gives rise to creativity
- If this takes off, reasoning will be model specific and opaque leading to even mushier systems
Published Dec 31, 2024

Mushy Systems

As large language models proliferate into every service and ultimately replace business logic, we will be left with the horrible burden of maintaining mush.

Mush happens when a system can’t quite be understood by looking at it. LLMs and abstractions like AI agents cause us to lose read access: one can no longer read code to understand what’s going on. Even if you could read it, code generated by LLMs makes a codebase harder to reason about.

My biggest fear with large, complex, AI-powered systems is that debugging starts to look more like psychiatry.

Published Dec 30, 2024

Rust Build Caching With Docker
Compiling Rust dependencies every time a Docker image is built can take a very long time. To cache dependencies so that they don’t need to be compiled every time, you can use/abuse how Docker layer caching works using stages.

The following example uses two stages to build and then run my_app. By generating a fake main.rs and compiling it, Docker is tricked into caching all dependencies. We then bust the cache to trigger building the app by copying the actual app code and touch-ing main.rs.
```
# Build stage
FROM rust:bookworm AS builder

WORKDIR /

## Cache rust dependencies
RUN mkdir ./src && echo 'fn main() { println!("Dummy!"); }' > ./src/main.rs
COPY ./Cargo.toml .
RUN cargo build --release

## Actually build the app
RUN rm -rf ./src
COPY ./src ./src
RUN touch -a -m ./src/main.rs
RUN cargo build --release

# Run stage
FROM debian:bookworm-slim AS runner
COPY --from=builder /target/release/my_app /my_app
ENTRYPOINT ["./my_app"]
```
I adapted this from the StackOverflow thread here.

See also:
- This works particularly well when running Dokku on AWS when the app uses a Dockerfile
- Coming back to Rust after 4 years
Published Dec 22, 2024

AI Replaces Business Logic
Satya Nadella of Microsoft talks about how orchestrating between business applications is the next step for artificial intelligence, which will replace business logic with AI.

It does feel like vertical agents are the abstraction the industry is headed towards. I just don’t think this is a jump to universality.

See also:
- AI is the next great interop layer
- There is more to do to get ready for AI and benefit from the rapid pace of advancement of large language models
- AI agents seem like the logical next abstraction for AI applications
Published Dec 19, 2024

The Creative Act - Literature Notes

A book by Rick Rubin about creativity and how to let it happen.

Published Dec 16, 2024

Gameboy Color OLED Mod
The Gameboy Color has a new (to me) mod to replace the screen with a repurposed Blackberry AMOLED screen.

Here’s the AMOLED screen kit from the Hispeedido official store on AliExpress and the Gameboy Color shell from the eXtremeRate official store on AliExpress.

From my research, the shell should fit the OLED screen kit even though it says it’s cut for “IPS v2 screen kits”. This turned out not to be correct.

Parts
- Shell #1 from eXtremeRate on AliExpress $20.77
- OLED kit from Hispeedido on AliExpress $46.60
- Shell #2 from FunnyPlaying $9.90
- Buttons in three different colors $1.90 x 3 from FunnyPlaying and $8.50 for shipping (ugh)
Tutorial I used.

Gotchas
- Screen doesn’t fit the IPS shell; the shell has to be cast for the laminated screen
- Screws are not the same size and if you use the wrong length you will screw right through the front plate (there are three tri-head screws that are shorter and normal Phillips head screws that are longer)
Published Dec 15, 2024

There Is No AI Strategy Without a Data Strategy
Startups typically have an advantage over incumbents when it comes to adopting new technology. With artificial intelligence, however, incumbents are quick to integrate LLMs and have the data needed to make better AI-powered products. For example, an established CRM platform has the data needed to train, evaluate, and deploy AI products that a startup would not have access to.

What’s more, incumbents are aware of the value of their data. Maybe this is a holdover from the big data era when everyone was mining their data for targeting and insights. It seems unlikely that an incumbent’s strategy will allow for free and open access to their data.

Read AI startups require new strategies: This time it’s actually different.

See also:
- Data for AI-powered businesses is another example of the 7 Powers economies of scale
- Still, generative AI might be the very thing that eliminates a moat or a long bridge
Published Dec 12, 2024

AI Pricing Models
The following is a list of pricing models from various AI + {THING} products.

Subscription

Includes a monthly, per seat, subscription and sometimes a metered unit like the number of API calls per month.
- Devin.ai: software engineer as a service $500 per month, custom pricing for enterprise
- Kiro.dev: $0 - $39 per user per month, up to 3,000 “interactions”
Outcome-based

This is some form of outcome-based pricing where the unit of value is priced directly.
- Intercom (Fin): Support resolution as a service $0.99 per resolution
- ???
Published Dec 10, 2024

When Does a Service-as-Software Model Make Sense?
The service-as-software model is nascent, but expect it to be tried in different fields as artificial intelligence techniques improve and enable new applications.

However, slapping AI + {category} + service-as-software should draw reasonable skepticism. There are market constraints that will make adoption more difficult. There are capability gaps that will make solutions incomplete.

So when does service-as-software make sense?

Completely replaces a function or role

Of course taking an established function and selling a service that will replace someone’s job is not going to sell, but supplementing high-demand areas is viable. For example, there are more job openings for engineers than there are qualified people to fill them which creates demand for AI employees that can do the job fully.

Performs work that wasn’t done before

There are only so many hours in one day and there is work that is not financially viable to do but people want. For example, not every business can afford 24/7 support that can resolve customer issues but they certainly would like to. This latent demand could be tapped into at the right price point which would be infeasible even for a low-cost offshore vendor operation (which is notoriously hard to get right).

Other examples:
- Penetration testing which typically happens annually
- Monitoring and reviewing logs of critical systems for insights
- Hard-to-compile reports like annual business reviews
The outcome is clearly defined

The unit of work the service is delivering is ideally measurable and matches how the customer defines success. When the unit of the work is the outcome of the intent, outcome-based pricing aligns incentives clearly. This probably wouldn’t work if you define the outcome too generally (what would be the unit of work of HR?) or the job is a negative art. You could use a proxy measurement, but the further away from the real value, the less clear it becomes.

(Some ideas drawn from A System of Agents brings Service-as-Software to life)
Published Dec 4, 2024

Service-as-Software
Service-as-software is the inverse of software as a service. Rather than building software for people to do their job, service-as-software uses AI to fulfill the intent directly, more faithfully selling solutions, not software.
- When does a service-as-software model make sense?
- Outcome-based pricing
Published Dec 4, 2024

Outcome-Based Pricing
Outcome-based pricing (or result-based pricing) is becoming popular again due to services powered by artificial intelligence that enable intent-based outcome specification. That means charging per unit of value, which is the desired outcome.

Examples

Intercom’s new AI chat service charges $0.99 per successful resolution (customer indicates issue resolved or they stop responding). In most SaaS business models, pricing would be per seat where the measurement of value is how many people the service enables to do their work. Now that AI, in some cases, can perform the work autonomously, revenue models for these companies can more closely resemble the actual job to be done—resolved support tickets in this case.

11x provides an autonomous AI SDR agent. Rather than charge per seat, they started by charging per task: identifying accounts, researching accounts, writing email and LinkedIn messages, scheduling meetings, and so on. Tasks completed makes the outcome clear: “you pay us money, we do these tasks for you that you can easily verify and attribute as real work a person would otherwise have to do.” An even more intent-based pricing plan would be to charge per qualified lead, but I can see how that wouldn’t work. Many variables are out of 11x’s control when it comes to getting someone to book a meeting, so charging by output rather than outcome probably works best.

See also:
- Service-as-software might lead to more outcome-based pricing for a greater range of categories
- Pricing the perceived value gap
- Willingness to pay should be at the core of product design
Published Dec 4, 2024

When to Be Directive
I was at a high-end clothing store the other day. I saw one of the workers on an iPad. Curious, I looked over his shoulder to see what he was doing.

He was making sure the clothing rack he was standing in front of matched the picture on his iPad exactly. He checked the order of each garment. He spaced each hanger exactly. He checked and then rechecked before moving on to the next one.

This was clearly a process someone thought important enough to make each store follow precisely. Someone designed each detail intentionally so that it fit together as a pleasing whole.

It’s okay to be directive where the details matter.

See also:
- There is of course a thin line between micromanagement and leadership
- Most management is benign neglect
Published Nov 30, 2024

DNA as Durable Storage
DNA is by far the densest storage medium that we know of. One gram of DNA can hold 10 million hours of high-definition video.

DNA lasts a really long time. We can recover DNA from 100,000 years ago and still read it.

DNA can be replicated. That’s kind of how all living things work (and non-living things if we consider RNA).

Does that make DNA the most durable storage medium for preserving data? What might we do if we found data hidden in our DNA from eons ago?

See also:
- DNA can be used for pattern recognition
- This solves the durability problem but not the viewing problem of obsolete storage
- Societies live by decades, civilizations by centuries
- Deep time
Published Nov 27, 2024

Rust Memory Profiling on MacOS

Working on my personal indexing service, I noticed that the process was getting OOM killed on large files. That’s surprising because Rust makes it fairly difficult to do bad things with memory (you can roughly approximate where memory is dropped just by reading code).

After struggling to find a memory profiler for macOS (and not even being able to install Xcode for some reason), I settled on a stupid solution using Activity Monitor, which comes pre-installed on every Mac. First, I added logging to see which file was being worked on before getting OOM killed, then changed the main method to execute just the code path I suspected was causing the large memory usage (calculating embeddings). Next, I opened Activity Monitor to the Memory tab and typed the name of the Rust crate in the search box. Since names are consistent when running cargo run, I could see the value of memory used, which gets sampled every second or so. I tried a few code changes, reran it each time, and voilà, fixed! Sometimes all you need is a fast feedback loop.
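
The same feedback loop could also be scripted instead of watched in Activity Monitor. This is only a sketch of an alternative, not what I actually ran: it assumes the standard `ps` options `-o rss=` and `-o comm=` (resident memory in KiB and the command name), and `my_indexer` is a hypothetical binary name. On Linux, `comm` may be truncated, so the name match may need adjusting.

```python
import subprocess
import time


def rss_kib_for(name: str, ps_output: str) -> int:
    """Sum resident memory (KiB) of processes whose executable is called `name`."""
    total = 0
    for line in ps_output.splitlines():
        fields = line.split(None, 1)
        if len(fields) != 2:
            continue
        rss, command = fields
        # `comm` can be a full path, so compare only the last path component.
        if rss.isdigit() and command.strip().split("/")[-1] == name:
            total += int(rss)
    return total


def watch(name: str, interval_seconds: float = 1.0, samples: int = 10) -> None:
    """Print the memory used by `name` once per interval."""
    for _ in range(samples):
        output = subprocess.run(
            ["ps", "-ax", "-o", "rss=", "-o", "comm="],
            capture_output=True, text=True, check=True,
        ).stdout
        print(f"{name}: {rss_kib_for(name, output) / 1024:.1f} MiB")
        time.sleep(interval_seconds)


# Example: call watch("my_indexer") in another terminal while `cargo run` is going.
```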

Published Nov 25, 2024
